# KSI/DSE clusters

We have two computational clusters, **parlab** and **gpulab**.
They are used for exercises, homework assignments, and research that requires high-performance computing.
The clusters are managed by the SLURM resource manager.
Additionally, Charliecloud is available on all worker nodes.

> Do not execute any of your code directly on the front-end nodes; use worker nodes instead!

All unfamiliar terms (front-end node, worker node) will be explained later. Front-end nodes do not have any development software installed.

## SLURM Crash Course

All information about SLURM can be found on the
[SLURM documentation](https://slurm.schedmd.com/documentation.html)
page or on the [SLURM tutorials](https://slurm.schedmd.com/tutorials.html) page.
Nevertheless, we provide a short description of SLURM and of our clusters below.

### Terminology

A **cluster** is a bunch of **nodes**.
Nodes are grouped into **partitions**.
Partitions may overlap, i.e. one node can be a member of several partitions.
A **feature** is a string describing a property of a node, e.g. `avx2` for a node capable of executing AVX2 instructions.
Each node has two sets of features: current features and available features.
Usually they are the same, but some nodes are capable of changing their current feature set on demand.

SLURM manages **resources** and schedules jobs according to the available resources.
The most important resources are CPUs and memory, but SLURM can also manage other generic resources (GRES), like GPUs.
Moreover, time is a resource as well.

A **user** is identified by his/her login.
An **account** is a billing entity (well, we won't charge you for using our clusters).
Each user must have an account assigned. Moreover, a user can be assigned to several accounts and use them depending on what he/she is doing.
An account may be allowed access only to some partitions.

A user can launch a **job**.
A job has a state, has some reserved and assigned resources, and returns an error code after completion.
A job consists of **steps** (usually one step).
Each step is executed by a bunch of **tasks** (usually one task), where the resources of the step are divided equally among its tasks.
Jobs are inserted into a **scheduling queue**, where you can find them.
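
For illustration, the following command (a made-up example) launches a job with a single step consisting of four tasks spread over two nodes; each task simply prints the name of the node it runs on:

`srun -N 2 -n 4 hostname`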

Each partition has a **priority**.
A job submitted to a partition with higher priority can suspend another job submitted to a partition with lower priority.

### Important commands

There are many commands (see
[SLURM man pages](https://slurm.schedmd.com/man_index.html) or [SLURM command summary](https://slurm.schedmd.com/pdfs/summary.pdf)
). The most important commands are:


- **srun** - reserves and assigns resources and runs an interactive job
- **sbatch** - reserves and assigns resources and submits a batch script as a job
- **salloc** - only reserves resources without assigning them; use srun or sbatch subsequently to assign them
- **sinfo** - shows info about the cluster state
- **scancel** - cancels a job
- **squeue** - shows info about the scheduling queue
- **sbcast** - distributes a file to the nodes allocated to a job
- **sstat** - displays status info about a running job

Job submission commands (srun, sbatch, salloc) have a common set of important options:

| Short | Long                 | Description |
| ----- | -------------------- | ----------- |
| `-A` | `--account=`         | Charge resources used by the job to the specified account. If not specified, the user's default account is charged |
| `-B` | `--extra-node-info=` | Select only nodes with at least the specified number of sockets, cores per socket, and threads per core. This option does not specify the resource allocation, it is just a constraint |
| `-C` | `--constraint=`      | Select only nodes with matching features |
| `-c` | `--cpus-per-task=`   | Number of CPUs per task |
| `-e` | `--error=`           | Redirect the standard error stream to the specified file |
|      | `--gpus=`            | Number of GPUs required for the job, in the form `[type:]count`. It is a shortcut for `--gres=gpu:type:count` |
|      | `--gres=`            | Comma-delimited list of GRES. Each entry in the list has the form `name[[:type]:count]` |
| `-i` | `--input=`           | Redirect the standard input stream from the specified file |
| `-J` | `--job-name=`        | Job name |
| `-L` | `--licenses=`        | Comma-delimited list of licenses allocated to the job. Each entry in the list has the form `name[:count]` |
| `-m` | `--distribution=`    | Select the distribution method for tasks and resources. For more info see the documentation |
|      | `--mem=`             | Real memory required per node |
|      | `--mem-per-cpu=`     | Memory required per allocated CPU |
|      | `--mem-bind=`        | Memory binding. For more info see the documentation |
| `-N` | `--nodes=`           | Minimum number of allocated nodes. Default is 1 |
| `-n` | `--ntasks=`          | Number of tasks. Default is 1 |
| `-o` | `--output=`          | Redirect the standard output stream to the specified file |
| `-p` | `--partition=`       | Request a specific partition for the resource allocation. If not specified, the default partition is chosen |
| `-t` | `--time=`            | Limit on the total run time of the job. If not specified, the default time of the selected partition is used. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", and "days-hours" |
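
For example, a hypothetical interactive run combining several of these options (the program `./my_program` is made up for illustration):

`srun -p small-lp -J demo -n 4 -c 2 --mem-per-cpu=1G -t 30 ./my_program`

This reserves 4 tasks with 2 CPUs each and 1 GB of RAM per CPU, limits the total run time to 30 minutes, submits to the small-lp partition (described below), and names the job demo.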

### Description of available clusters

As mentioned above, we have two clusters **parlab** and **gpulab**.
Access to the cluster is always through the front-end server using SSH on port 42222.
Front-end servers have the same name as the cluster, i.e. **parlab.ms.mff.cuni.cz** and **gpulab.ms.mff.cuni.cz**.
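
For example, connecting to the parlab front-end from a Unix-like machine (replace *mylogin* with your cluster login):

`ssh -p 42222 mylogin@parlab.ms.mff.cuni.cz`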

#### Course users

Student logins will be created before the first assignment.
The login is in the form *s_sislogin*, where *sislogin* is the student's login to our student information system (SIS).
The generated password will be sent to the student's official e-mail.

#### Other users

The login will be created on demand by the cluster administrator.

#### For everyone

Both clusters use one common LDAP, so you need to change your password only once.
Each course taught on our clusters has its own account.
Any research group or project has its own account as well.
Logins are assigned to the corresponding accounts depending on attendance of the relevant courses or membership in a research group.

Both clusters have access to the same disk array using NFS.
Your home directory is mounted at `/mnt/home`.
Moreover, research projects can have additional space mounted at `/mnt/research`.

The environment variable **TMPDIR** points to a private local temporary directory for the job.
It is created on every node allocated to the job before the job starts and is completely removed after the job finishes.
The temporary directory can be used as a scratchpad.
Moreover, it resides on a local SSD RAID, so accessing data there is faster than accessing data on the remote NFS disk.
On the other hand, **the space is limited, be careful!**
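
A minimal batch-script sketch of this pattern (all paths and program names are placeholders): stage the input from the NFS home to the fast local scratchpad, compute there, and copy the results back before the job ends and **TMPDIR** is wiped:

```bash
#!/bin/bash
#SBATCH -J scratch-demo
#SBATCH -t 1:00:00

# Stage input data from the shared NFS home to the fast local scratch directory
cp /mnt/home/myhome/input.dat "$TMPDIR"/

# Compute inside the scratch directory (program path is a placeholder)
cd "$TMPDIR"
/mnt/home/myhome/my_program input.dat output.dat

# Copy the results back; $TMPDIR is removed automatically when the job finishes
cp output.dat /mnt/home/myhome/results/
```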

#### Parlab cluster specification

All nodes are interconnected by InfiniBand FDR (56 Gb/s) for high-performance messaging using MPI.
Moreover, they are interconnected by 10 GbE for all other traffic.
The front-end server is connected by 10 GbE to the external world.
The latest version of OpenMPI is installed on all nodes.
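
A sketch of the typical compile-and-run cycle (the source file and program name are made up; it assumes the OpenMPI compiler wrappers are on the default PATH of the worker nodes):

```bash
# Compile on a worker node - the front-end has no development software installed
srun -p small-lp --pty bash
mpicc -O2 -o hello_mpi hello_mpi.c
exit

# Run 32 ranks spread over 2 nodes in the high-priority partition
srun -p small-hp -N 2 -n 32 ./hello_mpi
```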

Parlab nodes

| Node names | CPU                  | Sockets | Cores | HT | RAM    | GRES | Additional info |
| ---------- | -------------------- | ------- | ----- | -- | ------ | ---- | --------------- |
| w[401-404] | Intel Xeon E7-4820   | 4       | 8     | 2  | 128 GB | | |
| w[201-208] | Intel Xeon Gold 6130 | 2       | 16    | 2  | 128 GB | | |
| phi[01-02] | Intel Xeon Phi 7230  | 1       | 64    | 4  | 96 GB  |hbm 16 GB | can change feature set after rebooting |

Parlab partitions

| Name     | Nodes      | Priority | Timelimit | Intended use |
| -------- | ---------- | -------- | --------- | ------------ |
| big-lp   | w[401-404] | low      | 1 day     | default, general or MPI debugging, long jobs |
| big-hp   | w[401-404] | high     | 1 hour    | executing short jobs on 4-socket system, MPI jobs |
| small-lp | w[201-208] | low      | 1 day     | debugging on newer CPUs, MPI debugging, long jobs |
| small-hp | w[201-208] | high     | 30 mins   | executing short jobs on 2-socket system, MPI jobs |
| phi-lp   | phi[01-02] | low      | 1 day     | KNL debugging, long jobs |
| phi-hp   | phi[01-02] | high     | 30 mins   | executing short jobs on KNL |
| all      | all        | high     | 30 mins   | executing short jobs on all nodes, used primarily for testing heterogeneous MPI computing |

#### Gpulab cluster specification

All nodes are interconnected by 10 GbE. The front-end server is connected by 10 GbE to the external world.

Gpulab nodes

| Node names   | CPU                    | Sockets | Cores | HT | RAM    | GRES | Additional info |
| ------------ | ---------------------- | ------- | ----- | -- | ------ | ---- | --------------- |
| dw[01-02]    | Intel Xeon E5450       | 2       | 4     | 1  | 32 GB  |      | Docker installed |
| dw03         | Intel Xeon E5640       | 2       | 4     | 2  | 96 GB  |      | Docker installed |
| dw04         | Intel Xeon E5-2660v2   | 2       | 10    | 2  | 256 GB |      | Docker installed |
| varjag       | Intel Xeon E7-4830     | 4       | 8     | 2  | 256 GB | | |
| volta[01-02] | Intel Xeon Silver 4110 | 2       | 8     | 2  | 128 GB | gpu volta [0-1] | 2x NVIDIA Tesla V100 PCIe 16 GB, latest CUDA |
| volta03      | Intel Xeon Silver 4110 | 2       | 8     | 2  | 192 GB | gpu volta [0-1] | 2x NVIDIA Tesla V100 PCIe 16 GB, latest CUDA |

Gpulab partitions

| Name      | Nodes            | Priority  | Timelimit | Intended use |
| --------- | ---------------- | --------- | --------- | ------------ |
| debug-lp  | dw[01-04],varjag | low       | 7 days    | default, general debugging, long jobs, building Docker images |
| debug-hp  | dw[01-04],varjag | high      | 1 hour    | short jobs, building Docker images |
| volta-elp | volta03          | extra low | 7 days    | extra long GPU jobs |
| volta-lp  | volta[02-03]     | low       | 1 day     | long GPU jobs |
| volta-hp  | volta[01-03]     | high      | 1 hour    | debugging GPU tasks, executing short GPU jobs |

### Useful examples

`salloc`

Starts a shell in a minimal environment (resources). Useful for making small builds or debugging in a restricted environment.

`sbatch -p volta-lp --gres=gpu:volta:1 mygpucode.sh` or newer syntax `sbatch -p volta-lp --gpus=1 mygpucode.sh`

Starts a batch job which will have one NVIDIA V100 card available.

`sinfo -o "%P %L %l"`

Prints info about default and maximum job time for partitions in a cluster.

`sinfo -o "%n %b %f"`

Prints info about current and available feature set for nodes in a cluster.

`srun -p phi-lp -C flat,snc2 --gres=hbm:8G myphicode`

Selects a KNL node with the required features and assigns 8 GB of HBM to the job.
If there is no node with the required features, one free node will be selected (which may involve waiting for all jobs on the selected node to finish)
and rebooted to change its current feature set. **BE PATIENT, THE REBOOT TAKES A LOOONG TIME** (~10 mins).

`srun -p small-hp -n 128 -N 8 --mem-per-cpu=2G mympijob`

Starts an MPI job with 128 tasks/ranks spanning 8 nodes in the small-hp partition, assigning 1 CPU and 2 GB of RAM to each task.

### Recommendations

- Prefer using `sbatch` for long jobs.
Moreover, `sbatch` allows setting SLURM parameters in the executed shell script,
so you don't always have to write them on the command line.
Jobs executed by `sbatch` can also be requeued when cancelled (e.g. for priority reasons).

- Don't forget to request a GPU using `--gpus=1` when running on the volta-xxx partitions.

- Do not use `srun --pty bash` followed by executing a long computation from the command line.
When the computation finishes, you will keep blocking the resources until the job either times out or you exit the bash session.
Again, prefer using `sbatch` for long jobs.

- Set the mail options in the `sbatch` shell script using `#SBATCH --mail-user=mymail@isp.edu` and `#SBATCH --mail-type=END`.
SLURM will then send you an e-mail when the job finishes.
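
Putting these recommendations together, a batch script might start like this sketch (the partition, e-mail address, and program name are placeholders):

```bash
#!/bin/bash
#SBATCH -J long-job
#SBATCH -p small-lp
#SBATCH -n 16
#SBATCH -t 12:00:00
#SBATCH --mail-user=mymail@isp.edu
#SBATCH --mail-type=END

# The actual computation; srun launches the tasks of the job step
srun ./my_long_computation
```

Submit it with `sbatch myscript.sh`; all the `#SBATCH` lines are picked up automatically.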

## Charliecloud

Charliecloud provides user-defined software stacks (UDSS) for HPC.
It allows you to run nearly any software stack (like TensorFlow) on the cluster even if it is not installed and available system-wide.
All information about Charliecloud can be found on the [Charliecloud documentation](https://hpc.github.io/charliecloud/) page.

### Simple workflow

1. #### Create or get Docker image

Docker is installed on the dw[01-04] workers in the gpulab cluster. You can access them using the following command:

`srun -p debug-lp --pty bash`

You can either pull an already prepared Docker image (e.g. for TensorFlow) or create your own.
You perform this step only once for a given UDSS.
Of course, you must repeat the whole workflow whenever there is a new version of the UDSS.

If you are building your own Docker image, Charliecloud offers a simplified Docker invocation

`ch-build -t dockertag .`

which must be run on dw[01-04] workers.
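
For example, pulling a ready-made TensorFlow image on one of the dw workers might look like this (the image tag is illustrative; it assumes your login is allowed to talk to the Docker daemon on these nodes):

```bash
# Get an interactive shell on a Docker-capable worker
srun -p debug-lp --pty bash

# Pull a prepared image from Docker Hub (the tag is just an example)
docker pull tensorflow/tensorflow:latest-gpu
```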

2. #### Create tar/directory or SquashFS image from Docker image

You must convert the prepared Docker image either to a TAR file and then to a directory structure, or
to a SquashFS file. You perform this step only once for a given UDSS.
All commands for this step must again be run on the dw[01-04] nodes.

For the first case, run

`ch-builder2tar dockerimage ${TMPDIR}`

which creates a TAR file in your temporary directory. Then you must convert the created TAR file to a directory structure using

`ch-tar2dir ${TMPDIR}/myudss.tar.gz imgdir`

which expands the TAR file to your output image directory (usually your home directory on the shared volume).

For the second case, run

`ch-builder2squash dockerimage outdir`

which creates a SquashFS file in your output directory (usually your home directory on the shared volume).
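
As an illustration, the whole tar/directory conversion for an image tagged `myudss` might look like this on a dw node (the image name and target directory are placeholders, following the commands above):

```bash
# Run inside a job on a dw[01-04] node, e.g. obtained via: srun -p debug-lp --pty bash
ch-builder2tar myudss "$TMPDIR"                               # creates $TMPDIR/myudss.tar.gz
ch-tar2dir "$TMPDIR"/myudss.tar.gz /mnt/home/myhome/images    # unpacks the image directory there
```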

3. #### Import CUDA libraries

This step is required only for a UDSS that needs CUDA (like TensorFlow).
If your UDSS does not require CUDA, skip this step.
You perform this step only once for a given UDSS.
It works only with the tar/directory structure.
All commands for this step must be run on the volta[01-03] nodes.

Execute on gpulab

`srun -p volta-hp --gpus=1 ch-fromhost --nvidia imgdir`

which copies some necessary CUDA files from the host to your image directory structure.

4. #### Execute created UDSS by SLURM

This step is executed as many times as necessary on any node of the parlab and gpulab clusters.

For the tar/directory structure, run

`srun <slurm params> ch-run imgdir --bind=/mnt/home/myhome`

which will execute your UDSS using SLURM in interactive mode.
Moreover, your home directory will be mounted as `/mnt/0` inside your UDSS environment.

The SquashFS case is better used in batch mode, e.g. prepare a shell script that looks something like this:

```bash
ch-mount mysquashimg ${TMPDIR}
ch-run ${TMPDIR}/mysquashimg --bind=/mnt/home/myhome
ch-umount ${TMPDIR}/mysquashimg
```

Then execute the script using

`sbatch <slurm params> myudss.sh`