# KSI/DSE clusters

We have two computational clusters, **parlab** and **gpulab**.
They are used for exercises, homework assignments, and research that requires high-performance computing.
The clusters are managed by the SLURM resource manager.
Additionally, Charliecloud is available on all worker nodes.

> Do not execute any of your code directly on the front-end nodes; use the worker nodes instead!

All unfamiliar terms (front-end node, worker node) are explained later. Note that the front-end nodes have no development software installed.

## SLURM Crash Course

All information about SLURM can be found on the
[SLURM documentation](https://slurm.schedmd.com/documentation.html)
page or on the [SLURM tutorials](https://slurm.schedmd.com/tutorials.html) page.
Nevertheless, we provide a short description of SLURM and of our clusters below.

### Terminology

A **cluster** is a set of **nodes**.
Nodes are grouped into **partitions**.
Partitions may overlap, i.e. one node can belong to several partitions.

**Feature** is a string describing a feature of a node, e.g. avx2 for a node capable of executing AVX2 instructions.
Each node has two sets of features: current features and available features.
Usually, they are the same.
But in some cases, a node is capable of changing its current feature set on demand.
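For example, assuming some nodes export an `avx2` feature, you can restrict a job to them with the `-C/--constraint=` option (the binary name is a placeholder):

    # run only on nodes whose current features include avx2 (hypothetical feature name)
    srun -C avx2 ./my_avx2_binary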

SLURM manages **resources** and schedules jobs according to the available resources.
The most important resources are CPUs and memory, but it is also able to manage other generic resources (GRES), like GPUs, etc.
Moreover, time is a resource as well.

A **user** is identified by his/her login.
An **account** is a billing entity (well, we won't charge you for using our clusters).
Each user must have an account assigned. Moreover, a user can be assigned to several accounts and use them depending on what he/she is doing.
Access to some partitions may be restricted to particular accounts.

A user can launch a **job**.
A job has a state, has some reserved and assigned resources, and returns an error code after completion.
A job consists of **steps** (usually 1 step).
Each step is executed by a set of **tasks** (usually 1 task), where the resources are divided equally among the tasks of a step.
Jobs are inserted into a **scheduling queue**, where you can find them.
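For illustration, a minimal sketch of a batch script with two steps (the binaries are placeholders):

    #!/bin/bash
    #SBATCH -n 4              # 4 tasks are available to each step
    srun ./prepare_data       # step 0: runs 4 tasks of prepare_data
    srun ./compute            # step 1: runs 4 tasks of compute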

Each partition has a **priority**.
A job submitted to a partition with a higher priority can preempt another job submitted to a partition with a lower priority.
Our clusters use only two kinds of preemption:

- **SUSPEND** - the preempted job is suspended and releases its CPUs. Unfortunately, it does not release memory or GPUs.
- **REQUEUE** - the preempted job is killed and requeued to the scheduling queue, waiting for available resources.
All resources including GPUs are released.

Jobs submitted to a partition with REQUEUE preemption should use some form of checkpointing, otherwise all work can be lost when the job is killed during preemption.

### Important commands

There are many commands (see
[SLURM man pages](https://slurm.schedmd.com/man_index.html) or [SLURM command summary](https://slurm.schedmd.com/pdfs/summary.pdf)
). The most important commands are:

- **srun** - reserves and assigns resources and runs an interactive job; use it only for short or debugging jobs
- **sbatch** - reserves and assigns resources and submits a batch script as a job; use it for long jobs (>1 hour)
- **salloc** - only reserves resources without assigning them; use srun or sbatch subsequently to assign them
- **sinfo** - gets info about the cluster state
- **scancel** - cancels a job
- **squeue** - views info about the scheduling queue
- **sbcast** - distributes a file to the nodes allocated to a job
- **sstat** - displays status info about a running job

Job submission commands (srun, sbatch, salloc) have a common set of important options:

| Short | Long                 | Description |
| ----- | -------------------- | ----------- |
| `-A` | `--account=`          | Charge resources used by the job to the specified account. If not specified, the user's default account is charged |
| `-B` | `--extra-node-info=`  | Select only nodes with at least the specified number of sockets, cores per socket, and threads per core. This option does not specify the resource allocation, it is just a constraint. |
| `-C` | `--constraint=`      | Select only nodes with matching features |
| `-c` | `--cpus-per-task=`   | Number of CPUs per 1 task |
| `-e` | `--error=`           | Standard error stream is redirected to the specified file |
|      | `--gpus=`            | Specifies the number of GPUs required for the job in the form `[type:]count`. It is a shortcut for --gres=gpu:type:count. |
|      | `--gres=`            | Specifies comma delimited list of GRES. Each entry on the list is in form `name[[:type]:count]` |
| `-i` | `--input=`           | Standard input stream is redirected from the specified file |
| `-J` | `--job-name=`        | Job name |
| `-L` | `--licenses=`        | Specifies comma delimited list of licenses allocated to the job. Each entry on the list is in form `name[:count]` |
| `-m` | `--distribution=`    | Select distribution method for tasks and resources. For more info see documentation |
|      | `--mem=`             | Specify the real memory required per node |
|      | `--mem-per-cpu=`     | Specify the memory required per allocated CPU |
|      | `--mem-bind=`        | Specify the memory binding. For more info see documentation |
| `-N` | `--nodes=`           | Minimum number of allocated nodes. Default is 1 |
| `-n` | `--ntasks=`          | Number of tasks. Default is 1 |
| `-o` | `--output=`          | Standard output stream is redirected to the specified file |
| `-p` | `--partition=`       | Request a specific partition for the resource allocation. If not specified, the default partition is chosen |
| `-t` | `--time=`            | Set a limit on the total run time of a job. If not specified, default time for a selected partition is used. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours". |

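For example, a hypothetical interactive run combining several of these options (the binary and the chosen values are placeholders):

    srun -p small-lp -n 4 -c 2 --mem-per-cpu=2G -t 00:30:00 -J mytest ./my_program
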
### Description of available clusters

As mentioned above, we have two clusters **parlab** and **gpulab**.
Access to the cluster is always through the front-end server using SSH on port 42222.
Front-end servers have the same name as the cluster, i.e. **parlab.ms.mff.cuni.cz** and **gpulab.ms.mff.cuni.cz**.
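For example (replace `s_sislogin` with your own login):

    ssh -p 42222 s_sislogin@parlab.ms.mff.cuni.cz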

#### Course users

Student logins will be created before the first assignment.
The login is in the form *s_sislogin*, where *sislogin* is the student's login to our student information system (SIS).
The generated password will be sent to the student's official e-mail.

#### Other users

The login will be created on demand by the cluster administrator.

#### For everyone

Both clusters use one common LDAP, so you need to change your password only once.
Each course taught on our clusters has its own account.
Any research group or project has its own account as well.
Logins are assigned to the corresponding accounts based on enrollment in the relevant courses or membership in a research group.

Both clusters have access to the same disk array using NFS.
You may find your home mounted on `/mnt/home`.
Moreover, research projects can have an additional space mounted on `/mnt/research`.

You may use the environment variable **TMPDIR**, which is set to a private local temporary directory for the job.
It is created on every node allocated to the job before the job starts and is completely removed after the job finishes.
The temporary directory can be used as a scratchpad.
Moreover, it resides on a local SSD RAID, so accessing data here is faster than accessing data on the remote NFS disk.
On the other hand, **the space is limited, be careful!**
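A sketch of a batch script that stages data through the local scratch space (the paths, file names, and binary are placeholders):

    #!/bin/bash
    #SBATCH -p small-lp
    #SBATCH -t 02:00:00
    cp /mnt/home/myhome/input.dat ${TMPDIR}/        # stage input to the fast local scratch
    ./my_program ${TMPDIR}/input.dat ${TMPDIR}/out.dat
    cp ${TMPDIR}/out.dat /mnt/home/myhome/          # copy results back before the job ends
    # ${TMPDIR} and its contents are removed automatically after the job finishes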

#### Parlab cluster specification

All nodes are interconnected by InfiniBand FDR (56 Gb/s) for high-performance messaging using MPI.
Moreover, they are interconnected by 10 GbE for all other traffic.
The front-end server is connected by 10 GbE to the external world.
The latest version of OpenMPI is installed on all nodes.

Parlab nodes

| Node names | CPU                  | Sockets | Cores per socket | Threads per core | RAM    | GRES      | Additional info |
| ---------- | -------------------- | ------- | ---------------- | ---------------- | ------ | --------- | --------------- |
| w[401-404] | Intel Xeon E7-4820   | 4       | 8                | 2                | 128 GB |           |                 |
| w[201-208] | Intel Xeon Gold 6130 | 2       | 16               | 2                | 128 GB |           |                 |
| phi[01-02] | Intel Xeon Phi 7230  | 1       | 64               | 4                | 96 GB  | hbm 16 GB | can change feature set after rebooting |

Parlab partitions

| Name     | Nodes      | Priority | Timelimit | Preemption | Intended use |
| -------- | ---------- | -------- | --------- | ---------- | ------------ |
| big-lp   | w[401-404] | low      | 1 day     | SUSPEND    | default, general or MPI debugging, long jobs |
| big-hp   | w[401-404] | high     | 1 hour    | SUSPEND    | executing short jobs on 4-socket system, MPI jobs |
| small-lp | w[201-208] | low      | 1 day     | SUSPEND    | debugging on newer CPUs, MPI debugging, long jobs |
| small-hp | w[201-208] | high     | 30 mins   | SUSPEND    | executing short jobs on 2-socket system, MPI jobs |
| phi-lp   | phi[01-02] | low      | 1 day     | SUSPEND    | KNL debugging, long jobs |
| phi-hp   | phi[01-02] | high     | 30 mins   | SUSPEND    | executing short jobs on KNL |
| all      | all        | high     | 30 mins   | SUSPEND    | executing short jobs on all nodes, used primarily for testing heterogeneous MPI computing |

#### Gpulab cluster specification

All nodes are interconnected by 10 GbE. The front-end server is connected by 10 GbE to the external world.

Gpulab nodes

| Node names   | CPU                    | Sockets | Cores per socket | Threads per core | RAM    | GRES | Additional info |
| ------------ | ---------------------- | ------- | ---------------- | ---------------- | ------ | ---- | --------------- |
| dw[01-02]    | Intel Xeon E5450       | 2       | 4     | 1  | 32 GB  |      | Docker installed |
| dw03         | Intel Xeon E5640       | 2       | 4     | 2  | 96 GB  |      | Docker installed |
| dw[04-05]    | Intel Xeon E5-2660v2   | 2       | 10    | 2  | 256 GB |      | Docker installed |
| varjag       | Intel Xeon E7-4830     | 4       | 8     | 2  | 256 GB |      |                  |
| volta[01-02] | Intel Xeon Silver 4110 | 2       | 8     | 2  | 256 GB | gpu volta [0-1] | 2x NVIDIA Tesla V100 PCIe 16 GB, latest CUDA |
| volta03      | Intel Xeon Silver 4110 | 2       | 8     | 2  | 192 GB | gpu volta [0-1] | 2x NVIDIA Tesla V100 PCIe 16 GB, latest CUDA |
| volta04      | Intel Xeon Gold 5218   | 2       | 16    | 2  | 384 GB | gpu volta [0]   | 1x NVIDIA Tesla V100 SXM2 32 GB, latest CUDA |
| volta05      | Intel Xeon Gold 5218   | 2       | 16    | 2  | 384 GB | gpu volta [0-3] | 4x NVIDIA Tesla V100 SXM2 32 GB, latest CUDA |

Gpulab partitions

| Name      | Nodes            | Priority  | Timelimit | Preemption | Intended use |
| --------- | ---------------- | --------- | --------- | ---------- |------------ |
| debug-lp  | dw[01-05],varjag | low       | 7 days    | SUSPEND    | default, general debugging, long jobs, building Docker images |
| debug-hp  | dw[01-05],varjag | high      | 1 hour    | SUSPEND    | short jobs, building Docker images |
| volta-elp | volta[01-05]     | extra low | 7 days    | REQUEUE    | extra long GPU jobs |
| volta-lp  | volta[01-05]     | low       | 1 day     | REQUEUE    | long GPU jobs |
| volta-hp  | volta[01-05]     | high      | 1 hour    | SUSPEND    | debugging GPU tasks, executing short GPU jobs |

### Useful examples

`salloc`

Starts a shell with a minimal resource allocation. Useful for making small builds or debugging in a restricted environment.
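A sketch of the typical `salloc` workflow (the binary is a placeholder):

    salloc -p small-lp -n 4    # reserve resources and start a shell
    srun ./my_program          # assign the reserved resources and run a job step
    exit                       # release the allocation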

`sbatch -p volta-lp --gres=gpu:volta:1 mygpucode.sh` or newer syntax `sbatch -p volta-lp --gpus=1 mygpucode.sh`

Starts a batch job which will have one NVIDIA V100 card available.

`sinfo -o "%P %L %l"`

Prints info about default and maximum job time for partitions in a cluster.

`sinfo -o "%n %b %f"`

Prints info about current and available feature set for nodes in a cluster.

`srun -p phi-lp -C flat,snc2 --gres=hbm:8G myphicode`

Selects a KNL node with the required features and assigns 8 GB of HBM to the job.
If there is no node with the required features, one free node will be selected (this may involve waiting for all jobs on the selected node to finish)
and rebooted to change its current set of features. **BE PATIENT, THE REBOOT TAKES A LONG TIME** (~10 mins).

`srun -p small-hp -n 128 -N 8 --mem-per-cpu=2G mympijob`

Starts an MPI job with 128 tasks/ranks spanning 8 nodes in the small-hp partition, assigning 1 CPU and 2 GB RAM to each task.
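The same job can also be submitted as a batch job; a minimal sketch (`mympijob` is a placeholder for your MPI binary):

    #!/bin/bash
    #SBATCH -p small-hp
    #SBATCH -N 8
    #SBATCH -n 128
    #SBATCH --mem-per-cpu=2G

    srun ./mympijob    # launches 128 MPI ranks across the 8 allocated nodes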

### Recommendations

- Prefer using `sbatch` for long jobs.
Moreover, `sbatch` allows setting SLURM parameters in the executed shell script (as shown in the sketch after this list),
so you don't always have to write them on the command line.
Jobs executed by `sbatch` can also be requeued when cancelled (e.g. for priority reasons).

- Don't forget to request GPU using `--gpus=1`, when running on volta-xxx partitions.

- Do not use `srun --pty bash` followed by executing a long computation from the command line.
When the computation finishes, you will keep blocking the resources until the job either times out or you exit the shell.
Again, prefer using `sbatch` for long jobs.

- Set the mail-user and mail-type options in the `sbatch` script using `#SBATCH --mail-user=mymail@isp.edu` and `#SBATCH --mail-type=END,FAIL`.
SLURM will then send you an e-mail when the job finishes. There are many other mail-type variants.
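
A sketch of a batch script with the parameters embedded as `#SBATCH` directives, including the mail options (the names, values, and e-mail address are placeholders):

    #!/bin/bash
    #SBATCH --job-name=longjob
    #SBATCH --partition=debug-lp
    #SBATCH --time=2-00:00:00
    #SBATCH --mail-user=mymail@isp.edu
    #SBATCH --mail-type=END,FAIL

    srun ./my_long_computation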

## Charliecloud

Charliecloud provides user-defined software stacks (UDSS) for HPC.
It allows you to run nearly any software stack (like TensorFlow) on the cluster, even if it is not installed and available system-wide.
All information about Charliecloud can be found on the [Charliecloud documentation](https://hpc.github.io/charliecloud/) page.

### Simple workflow

1. #### Create or get Docker image

Docker is installed on the dw[01-05] workers in the gpulab cluster. You can access them using the

`srun -p debug-lp --pty bash` command.

You can either pull an already prepared Docker image (e.g. for TensorFlow) or create your own.
You will perform this step only once for the given UDSS.
Of course, you must restart the whole workflow whenever there is a new version of the UDSS.

If you are pulling a Docker image, use the `sudo` command

`sudo docker pull dockertag`

If you are building your own Docker image, Charliecloud offers a simplified version of the Docker invocation

`ch-build -t dockertag .`

which must be run on dw[01-05] workers.

Recent versions of Charliecloud added a new command, `ch-grow`, which builds an image from a Dockerfile in unprivileged mode.
This command does not need Docker and can be run anywhere.
Moreover, using the `ch-grow` command allows you to skip workflow step 2, because it immediately builds a directory image.

2. #### Create tar/directory or SquashFS image from Docker image

You must convert the prepared Docker image either to a TAR file and then to a directory structure, or
to a SquashFS file. You will perform this step only once for the given UDSS.
All commands for this step must again be run on the dw[01-05] workers.

For the first case, run

`ch-builder2tar dockerimage ${TMPDIR}`

which creates a TAR file in your temporary directory. Then you must convert the created TAR file to a directory structure using

`ch-tar2dir ${TMPDIR}/myudss.tar.gz imgdir`

which expands the TAR file to your output image directory (usually your home directory on the shared volume).

For the second case, run

`ch-builder2squash dockerimage outdir`

which creates a SquashFS file in your output directory (usually your home directory on the shared volume).
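Put together, the directory-image variant might look like this when run interactively on a dw node (the image tag `mytag` and the target directory are placeholders; the exact tarball name is derived from the tag):

    srun -p debug-lp --pty bash                          # get an interactive shell on a dw node
    ch-builder2tar mytag ${TMPDIR}                       # pack the Docker image into a tarball
    ch-tar2dir ${TMPDIR}/mytag.tar.gz /mnt/home/myhome   # unpack it into a directory image on the shared volume
    exit                                                 # release the interactive allocation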

3. #### Import CUDA libraries

This step is required only for a UDSS with a CUDA requirement (like TensorFlow).
If your UDSS does not require CUDA, skip this step.
You will perform this step only once for the given UDSS.
It works only with the tar/directory structure.
All commands for this step must be run on the volta nodes.

Execute on gpulab

`srun -p volta-hp --gpus=1 ch-fromhost --nvidia imgdir/<dockertag>`

which copies some necessary CUDA files from the host to your image directory structure.

4. #### Execute the created UDSS via SLURM

This step is executed as many times as necessary on any node of the parlab and gpulab clusters.

For the tar/directory structure, run

`srun <slurm params> ch-run <charlie options> imgdir/<dockertag>`

which will execute your UDSS using SLURM in interactive mode. Beware that, by default, your home directory is bind-mounted into the image, which overrides the entire `/home` directory inside it. You may disable this by specifying the `--no-home` option.
Moreover, you may bind additional directories with `--bind=/some/dir` (which will appear as `/mnt/0` in your UDSS environment) or with `--bind=/source/dir:/dest/dir`.
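
For example, a hypothetical GPU run of a directory image (the image directory, the bound path, and the script name are placeholders):

    srun -p volta-hp --gpus=1 ch-run --no-home --bind=/mnt/home/myhome imgdir/mytag -- python3 /mnt/0/train.py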

The SquashFS case is better suited to batch mode, e.g. prepare a shell script that looks something like

    ch-mount mysquashimg ${TMPDIR}                                         # mount the SquashFS image under ${TMPDIR}
    ch-run --bind=/mnt/home/myhome ${TMPDIR}/mysquashimg -- ./my_program   # ./my_program is a placeholder for your command
    ch-umount ${TMPDIR}/mysquashimg                                        # unmount the image when done

Then execute the script using

`sbatch <slurm params> myudss.sh`