Commit fa010cac authored by Jakub Yaghob's avatar Jakub Yaghob

updated to MFF metacentrum

Additionally, Charliecloud is available on all worker nodes.
All unknown terms (front-end node, worker node) will be explained later. Front-end nodes do not have any development software installed.
## MFF HPC metacenter
The clusters are now (as of 25.8.2021) part of the MFF HPC metacenter, which is [SLURM in a multi-cluster setup](https://slurm.schedmd.com/multi_cluster.html).
Due to the integration of clusters into the metacenter, several significant changes have been made to our cluster setup.
There has been a change in logging in and some important paths have been changed.
### Authentication
Authentication is now done against the MFF LDAP server, which contains a selection of the appropriate accounts from [CAS UK](https://idp.cuni.cz/cas/login).
Use the same username and password to log in as you use to log in to SIS.
### Paths
The home directory is now automatically created during the first login.
The path to the home directory is `/home/<CASlogin>`.
You can ask the administrators to create a personal directory for big data, or a directory for shared project files, in the `/work` directory.
The `/home` and `/work` directories are on the local disk array for each cluster.
To facilitate data sharing between clusters in the metacenter, the `/home` and `/work` directories are shared with the other clusters
using a tunneled NFS protocol and are available at fixed paths according to the cluster location;
e.g. the `/home` directory of the Chimera cluster located in Troja has the path `/troja/home` on all clusters (including the Chimera cluster itself).
> Use these shared directories with caution, as they could overwhelm the network connection for that location.
**Copy your frequently used data to the local cluster directories (`/home` or `/work`) where the computation will take place.**
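Staging data onto the local array before a computation can look like the following sketch. On the clusters the source would be a shared path such as `/troja/home/$USER/dataset` and the target your local `$HOME`; here temporary stand-in directories are used so the sketch is self-contained.

```shell
SRC=$(mktemp -d)        # stands in for e.g. /troja/home/$USER/dataset (shared NFS)
DST=$(mktemp -d)        # stands in for /home/$USER (local disk array)
echo "measurements" > "$SRC/input.dat"
# copy the whole dataset from the shared path onto the local array
cp -r "$SRC" "$DST/dataset"
ls "$DST/dataset"
```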
## SLURM Crash Course
All information about SLURM can be found in its
Anyway, we provide a short description of SLURM and of our clusters.
A **cluster** is a bunch of **nodes**.
Nodes are grouped into **partitions**.
Partitions may overlap, i.e. one node can be in several partitions.
**Feature** is a string describing a feature of a node, e.g. `avx2` for a node capable of executing AVX2 instructions.
Each node has two sets of features: current features and available features.
Usually, they are the same.
But in some cases, a node can change its current feature set on demand.
Moreover, time is a resource as well.
A **user** is identified by his/her login.
**Account** is a billing entity (well, we won't charge you for using our clusters).
Each user must have an account assigned. Moreover, a user can be assigned to several accounts and use them depending on what he/she is doing.
Access to some partitions can be restricted to certain accounts.
A user can launch a **job**.
A job has a state, has some reserved and assigned resources, and returns an error code after completion.
Job submission commands (`srun`, `sbatch`, `salloc`) have a common set of important options:
| `-C` | `--constraint=` | Select only nodes with matching features |
| `-c` | `--cpus-per-task=` | Number of CPUs per 1 task |
| `-e` | `--error=` | Standard error stream is redirected to the specified file |
| `-G` | `--gpus=` | Specifies the number of GPUs required for the job in the form `[type:]count`. It is a shortcut for `--gres=gpu:type:count`. |
| | `--gres=` | Specifies a comma-delimited list of GRES. Each entry on the list is in the form `name[[:type]:count]` |
| `-i` | `--input=` | Standard input stream is redirected from the specified file |
| `-J` | `--job-name=` | Job name |
| `-L` | `--licenses=` | Specifies a comma-delimited list of licenses allocated to the job. Each entry on the list is in the form `name[:count]` |
| `-M` | `--clusters=` | Clusters to issue commands to (comma separated list or `all`) |
| `-m` | `--distribution=` | Select distribution method for tasks and resources. For more info see documentation |
| | `--mem=` | Specify the real memory required per node |
| | `--mem-per-cpu=` | Specify the memory required per allocated CPU |
As mentioned above, we have two clusters **parlab** and **gpulab**.
Access to the cluster is always through the front-end server using SSH on port 42222.
Front-end servers have the same name as the cluster, i.e. **parlab.ms.mff.cuni.cz**
(fingerprint `SHA256:2nYWhAiilVBhVMzLbERL/2TR1JBPhoH16gAnS+z0VkQ`)
and **gpulab.ms.mff.cuni.cz**
(fingerprint `SHA256:mMTCRuZVMDYkzQBjac+49KmR+DIBC3+1riwx0I+/120`).
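Because the clusters listen on a non-standard port, an `~/.ssh/config` entry can make logging in more convenient. The following sketch writes such an entry to a temporary file (`s_novak` is a hypothetical login; on your machine you would append the entry to `~/.ssh/config`):

```shell
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
Host parlab
    HostName parlab.ms.mff.cuni.cz
    Port 42222
    User s_novak
EOF
cat "$CFG"
```

With this entry in place, `ssh parlab` connects to the front-end on port 42222.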
#### Course users
Student logins will be created before the first assignment.
The login is in the form *s_sislogin*, where *sislogin* is a student's login to our student information system (SIS).
The generated password will be sent to the student's official e-mail.
Student logins will be added to the appropriate course account after the first week of classes.
#### Other users
The login will be created on demand by the cluster administrator.
Use your CAS login.
#### For everyone
Both clusters use one external read-only MFF LDAP server, so you will not be able to change your password on the clusters; use CAS instead.
Each course taught on our clusters has its own account.
Any research group or project has its own account as well.
Logins are assigned to the corresponding account based on enrollment in the relevant courses or membership in a research/project group.
Both clusters have access to the same disk array using NFS.
You may find your home mounted on `/home`.
Moreover, users or projects can have an additional space mounted on `/work`.
For each job, you may use the environment variable **TMPDIR**, which is set to a private local temporary directory.
It is created on every node allocated to the job before the job starts and is completely removed after the job finishes.
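A job script using `$TMPDIR` for fast local scratch typically stages data in, computes, and copies results out before the directory is wiped. The following sketch writes such a script to a temporary file (`myprog` and the file names are placeholders):

```shell
JOB=$(mktemp)
cat > "$JOB" <<'EOF'
#!/bin/bash
#SBATCH -J tmpdir-demo
cp "$HOME/input.dat" "$TMPDIR/"              # stage input into the private local dir
myprog "$TMPDIR/input.dat" "$TMPDIR/out.dat" # compute on local scratch
cp "$TMPDIR/out.dat" "$HOME/"                # save results before TMPDIR is removed
EOF
```

On the cluster you would save this as e.g. `myjob.sh` and submit it with `sbatch myjob.sh`.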
Parlab nodes
Parlab partitions
| Name | Nodes | Priority | Timelimit | Preemption | Allowed accounts | Intended use |
| ---------------- | ---------- | -------- | --------- | ---------- | ------------------------------ | ------------ |
| ffa | all | 10 | infinite | REQUEUE | any | default, free-for-all, any job |
| mpi-hetero-ffa | all | 20 | infinite | REQUEUE | any | free-for-all, MPI jobs on heterogeneous nodes |
| mpi-homo-ffa | w[201-208] | 30 | infinite | REQUEUE | any | free-for-all, MPI jobs on homogeneous nodes |
| debug-long | w[401-404] | 100 | 7 days | REQUEUE | ksi, nprg042, nprg058, nprg054 | general or MPI debugging, long jobs |
| debug-short | w[401-404] | 150 | 2 hours | REQUEUE | ksi, nprg042, nprg058, nprg054 | executing short jobs on 4-socket system, MPI jobs |
| mpi-homo-long | w[201-208] | 100 | 7 days | REQUEUE | ksi, nprg042, nprg058, nprg054 | debugging on newer CPUs, MPI debugging, long jobs |
| mpi-homo-short | w[201-208] | 150 | 2 hours | REQUEUE | ksi, nprg042, nprg058, nprg054 | executing short jobs on 2-socket system, MPI jobs on homogeneous nodes |
| phi-long | phi[01-02] | 100 | 7 days | REQUEUE | ksi, nprg042, nprg058, nprg054 | KNL debugging, long jobs |
| phi-short | phi[01-02] | 150 | 2 hours | REQUEUE | ksi, nprg042, nprg058, nprg054 | executing short jobs on KNL |
| mpi-hetero-long | all | 100 | 7 days | REQUEUE | ksi, nprg042, nprg058, nprg054 | long MPI jobs on heterogeneous nodes |
| mpi-hetero-short | all | 150 | 2 hours | REQUEUE | ksi, nprg042, nprg058, nprg054 | short MPI jobs on heterogeneous nodes |
#### Gpulab cluster specification
Gpulab nodes
Gpulab partitions
| Name | Nodes | Priority | Timelimit | Preemption | Allowed accounts | Intended use |
| ---------------- | ---------------- | --------- | --------- | ---------- | ------------------------------ | ------------ |
| ffa | all | 10 | infinite | REQUEUE | any | default, free-for-all, any job |
| gpu-ffa | volta[01-05] | 20 | infinite | REQUEUE | any | free-for-all, any GPU job |
| debug-long | dw[01-05],varjag | 100 | 7 days | REQUEUE | ksi, ksiprj, kdss, kdsstudent | general debugging, long jobs, build Docker image |
| debug-short | dw[01-05],varjag | 150 | 2 hours | REQUEUE | ksi, ksiprj, kdss, kdsstudent | general debugging, short jobs, build Docker image |
| gpu-long | volta[01-05] | 100 | 7 days | REQUEUE | ksi, ksiprj, kdss, kdsstudent | long GPU jobs |
| gpu-short | volta[01-05] | 150 | 2 hours | REQUEUE | ksi, ksiprj, kdss, kdsstudent, nprg042, nprg058, nprg054 | debugging GPU tasks, executing short GPU jobs |
### Useful examples
Starts a shell in a minimal environment (resources). Useful for making small builds or debugging in a restricted environment.
`sbatch -p gpu-long --gpus=1 mygpucode.sh`
Starts a batch job which will have one NVIDIA V100 card available.
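Such a submission can also be written as a self-contained batch script with the options embedded as `#SBATCH` directives. The following sketch writes one to a temporary file (job name, output pattern, and `./mygpucode` are placeholders); on the cluster you would save it as e.g. `mygpujob.sh`:

```shell
JOB=$(mktemp)
cat > "$JOB" <<'EOF'
#!/bin/bash
#SBATCH -p gpu-long
#SBATCH --gpus=1
#SBATCH -J gpu-demo
#SBATCH -o gpu-demo.%j.out
nvidia-smi      # verify the allocated GPU is visible
./mygpucode
EOF
echo "submit with: sbatch $JOB"
```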
Prints info about default and maximum job time for partitions in a cluster.
Prints info about the current and available feature sets for nodes in a cluster.
`srun -p phi-long -C flat,snc2 --gres=hbm:8G myphicode`
Selects a KNL node with the required features and assigns 8 GB of HBM to the job.
If there is no node with the required features, one free node will be selected (which may involve waiting for all jobs on the selected node to finish)
and rebooted to change its current feature set. **BE PATIENT, THE REBOOT TAKES A LONG TIME** (~10 mins).
`srun -p mpi-homo-short -n 128 -N 8 --mem-per-cpu=2G mympijob`
Starts an MPI job with 128 tasks/ranks spanning 8 nodes in the homogeneous partition, assigning 1 CPU and 2 GB RAM to each task (i.e. 16 tasks per node).
### Recommendations
Moreover, `sbatch` allows setting SLURM parameters in the executed shell script, so
you don't need to write them on the command-line every time.
In addition, jobs executed by `sbatch` can be requeued when cancelled (e.g. for priority reasons).
- Don't forget to request GPU using `--gpus=1`, when running on gpu-xxx partitions.
- Do not use `srun --pty bash` followed by executing a long computation from the command-line.
When the computation finishes, you will keep blocking the resources until the job either times out or you exit the bash.
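Instead of running a long computation in an interactive shell, wrap it in a batch script so the resources are freed the moment it completes. A minimal sketch (`./long_computation` is a placeholder; the script is written to a temporary file here):

```shell
JOB=$(mktemp)
cat > "$JOB" <<'EOF'
#!/bin/bash
#SBATCH -J longrun
#SBATCH -p ffa
./long_computation
EOF
echo "submit with: sbatch $JOB"
```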
All information about Charliecloud can be found in its [Charliecloud documentat
### Basic workflow
This workflow is valid for Charliecloud version 0.24.
1. #### Get or create Docker image
Docker is installed on the dw[01-05] workers in the gpulab cluster. You can access them using the
`salloc -C docker` command.
You can either pull an already prepared Docker image (e.g. for TensorFlow) or create your own.
You will perform this step only once for a given UDSS.
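Creating your own image can be as simple as the following sketch (base image and package are illustrative; the `docker` commands themselves must be run on a dw node):

```shell
DIR=$(mktemp -d)
cat > "$DIR/Dockerfile" <<'EOF'
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y --no-install-recommends python3
EOF
# then, on a dw node: docker build -t myimage "$DIR"
cat "$DIR/Dockerfile"
```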
in your shell script passed to the `sbatch` SLURM command.
Beware that, by default, your home is bind-mounted into the image, which results in overmounting the entire `/home` directory.
You may disable this by specifying the `--no-home` option.
Moreover, you may bind additional directories by using the option `--bind=/some/dir`
(which will appear as `/some/dir` in your UDSS environment) or by `--bind=/source/dir:/dest/dir`.
All Charliecloud images have free mountpoints `/mnt/[0-9]`.
Older versions of Charliecloud used these mountpoints for binds without a destination path.
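A job script combining these options might look like the following sketch (partition choice, image directory `./myimage`, and the bound paths are placeholders; the script is written to a temporary file here):

```shell
JOB=$(mktemp)
cat > "$JOB" <<'EOF'
#!/bin/bash
#SBATCH -p ffa
# skip the home bind, mount a project directory into the image, run a program
ch-run --no-home --bind=/work/myproj:/data ./myimage -- /data/run.sh
EOF
cat "$JOB"
```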
### Advanced techniques, troubleshooting, and notes
If `nvidia-smi` prints the CUDA version correctly, then CUDA is functional. However, your application may still fail to use it.
In this case, follow the checklist:
- Did you correctly import the CUDA libraries? See step 3 of the basic workflow.
- Is `libcuda.so` (or `libcuda.so.1`) loadable? Check the `LD_LIBRARY_PATH` environment variable inside your container.
If it is not set to the CUDA library directory, set it to the correct path, e.g. `export LD_LIBRARY_PATH=/usr/local/cuda/lib64`.
Be careful if the variable is already set to some additional paths; append to it rather than overwriting it.
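One way to append the usual CUDA library path without clobbering any existing entries (adjust the path to your actual CUDA installation):

```shell
# add the CUDA path, keeping any previous value after a colon separator
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```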
#### Not able to use Docker, permission denied
If you want to access Docker, ask the administrators.
They must add you to the appropriate Docker access group.