Commit 29fa6416 authored by Jakub Yaghob

working on README

parent 8197e2b8
All unknown terms (front-end node, worker node) will be explained later.
## SLURM Crash Course
All information about SLURM can be found on its
[SLURM documentation](https://slurm.schedmd.com/documentation.html)
page or on [SLURM tutorials](https://slurm.schedmd.com/tutorials.html) page.
Nevertheless, we provide a short description of SLURM and of our clusters below.
Jobs are inserted into a **scheduling queue**, where you can find them.
Each partition has a priority.
A job submitted to a partition with higher **priority** can suspend another job submitted to a partition with lower priority.
### Important commands
There are many commands (see
[SLURM man pages](https://slurm.schedmd.com/man_index.html) or [SLURM command summary](https://slurm.schedmd.com/pdfs/summary.pdf)
). The most important commands are:
The temporary directory can be used as a scratchpad.
Moreover, it resides on a local SSD RAID, so accessing data there is faster than accessing data on a remote NFS disk.
On the other hand, **the space is limited, be careful!**
#### Parlab cluster specification
All nodes are interconnected by InfiniBand FDR (56 Gb/s) for high-performance messaging using MPI.
Moreover, they are interconnected by 10 GbE for all other traffic.
The front-end server is connected by 10 GbE to the external world.
The latest version of OpenMPI is installed on all nodes.
Parlab nodes
| Node names | CPU | Sockets | Cores | HT | RAM | GRES | Additional info |
| ---------- | -------------------- | ------- | ----- | -- | ------ | ---- | --------------- |
| w[401-404] | Intel Xeon E7-4820 | 4 | 8 | 2 | 128 GB | | |
| w[201-208] | Intel Xeon Gold 6130 | 2 | 16 | 2 | 128 GB | | |
| phi[01-02] | Intel Xeon Phi 7230 | 1 | 64 | 4 | 96 GB | hbm 16 GB | can change feature set after rebooting |
Parlab partitions
| Name | Nodes | Priority | Timelimit | Intended use |
| -------- | ---------- | -------- | --------- | ------------ |
| big-lp | w[401-404] | low | 1 day | default, general or MPI debugging, long jobs |
| big-hp | w[401-404] | high | 1 hour | executing short jobs on 4-socket system, MPI jobs |
| small-lp | w[201-208] | low | 1 day | debugging on newer CPUs, MPI debugging, long jobs |
| small-hp | w[201-208] | high | 30 mins | executing short jobs on 2-socket system, MPI jobs |
| phi-lp | phi[01-02] | low | 1 day | KNL debugging, long jobs |
| phi-hp | phi[01-02] | high | 30 mins | executing short jobs on KNL |
| all | all | high | 30 mins | executing short jobs on all nodes, used primarily for testing heterogeneous MPI computing |
#### Gpulab cluster specification
All nodes are interconnected by 10 GbE. The front-end server is connected by 10 GbE to the external world.
Gpulab nodes
| Node names | CPU | Sockets | Cores | HT | RAM | GRES | Additional info |
| ------------ | ---------------------- | ------- | ----- | -- | ------ | ---- | --------------- |
| dw[01-02] | Intel Xeon E5450 | 2 | 4 | 1 | 32 GB | | Docker installed |
| dw03 | Intel Xeon E5640 | 2 | 4 | 2 | 96 GB | | Docker installed |
| dw04 | Intel Xeon E5-2660v2 | 2 | 10 | 2 | 256 GB | | Docker installed |
| varjag | Intel Xeon E7-4830 | 4 | 8 | 2 | 256 GB | | |
| volta[01-02] | Intel Xeon Silver 4110 | 2 | 8 | 2 | 128 GB | gpu volta [0-1] | 2x NVIDIA Tesla V100 PCIe 16 GB, latest CUDA |
| volta03 | Intel Xeon Silver 4110 | 2 | 8 | 2 | 192 GB | gpu volta [0-1] | 2x NVIDIA Tesla V100 PCIe 16 GB, latest CUDA |
Gpulab partitions
| Name | Nodes | Priority | Timelimit | Intended use |
| --------- | ---------------- | --------- | --------- | ------------ |
| debug-lp | dw[01-04],varjag | low | 7 days | default, general debugging, long jobs, build Docker image |
| debug-hp | dw[01-04],varjag | high | 1 hour | short jobs, build Docker image |
| volta-elp | volta03 | extra low | 7 days | extra long GPU jobs |
| volta-lp | volta[02-03] | low | 1 day | long GPU jobs |
| volta-hp | volta[01-03] | high | 1 hour | debugging GPU tasks, executing short GPU jobs |
### Useful examples
`salloc`
Starts a shell with a minimal allocation of resources. Useful for small builds or for debugging in a restricted environment.
`sbatch -p volta-lp --gres=gpu:volta:1 mygpucode.sh` or newer syntax `sbatch -p volta-lp --gpus=1 mygpucode.sh`
Starts a batch job which will have one NVIDIA V100 card available.
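
What `mygpucode.sh` itself might contain, as a minimal sketch with the resource requests moved into the script; the program name, time limit, and output file are illustrative placeholders, not part of the cluster setup:

```bash
#!/bin/bash
#SBATCH -p volta-lp                 # partition for long GPU jobs
#SBATCH --gpus=1                    # request one NVIDIA V100 (newer syntax)
#SBATCH --time=04:00:00             # illustrative time limit (partition maximum is 1 day)
#SBATCH --output=mygpucode.%j.out   # %j expands to the job ID

# run the actual CUDA program (placeholder name)
./my_cuda_program
```
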
`sinfo -o "%P %L %l"`
Prints info about default and maximum job time for partitions in a cluster.
`sinfo -o "%n %b %f"`
Prints info about current and available feature set for nodes in a cluster.
`srun -p phi-lp -C flat,snc2 --gres=hbm:8G myphicode`
Selects a KNL node with the required features and assigns 8 GB of HBM to the job.
If there is no node with the required features, a free node will be selected (which may mean waiting for all jobs on the selected node to finish)
and rebooted to change its current feature set. **BE PATIENT, THE REBOOT TAKES A LONG TIME** (~10 mins).
`srun -p small-hp -n 128 -N 8 --mem-per-cpu=2G mympijob`
Starts an MPI job with 128 tasks/ranks spanning 8 nodes in the small-hp partition, assigning 1 CPU and 2 GB of RAM to each task.
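
The same MPI job can also be submitted as a batch script; a minimal sketch, where `mympijob` is the placeholder binary from the command above:

```bash
#!/bin/bash
#SBATCH -p small-hp        # high-priority partition for short MPI jobs
#SBATCH -n 128             # 128 tasks/ranks in total
#SBATCH -N 8               # spread over 8 nodes
#SBATCH --mem-per-cpu=2G   # 2 GB of RAM per task

# srun inherits the allocation above and launches one process per task/rank
srun ./mympijob
```
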
### Recommendations
- Prefer using `sbatch` for long jobs.
`sbatch` also allows you to set SLURM parameters directly in the executed shell script (see the sketch after this list),
so you do not have to write them on the command line every time.
Moreover, jobs executed by `sbatch` can be requeued when cancelled (e.g. for priority reasons).
- Don't forget to request a GPU using `--gpus=1` when running on the volta-xxx partitions.
- Do not use `srun --pty bash` and then run a long computation from the command line.
When the computation finishes, you will keep blocking the allocated resources until the job times out or you exit the bash session.
Again, prefer using `sbatch` for long jobs.
- Set the mail options in the `sbatch` script using `#SBATCH --mail-user=mymail@isp.edu` and `#SBATCH --mail-type=END`.
SLURM will send you an email when the job finishes.
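
A minimal sketch combining these recommendations; the job name, partition, time limit, and program are illustrative, and the mail address is the placeholder from the example above:

```bash
#!/bin/bash
#SBATCH --job-name=myjob            # placeholder job name
#SBATCH -p debug-lp                 # illustrative partition, pick one from the tables above
#SBATCH --time=12:00:00             # illustrative time limit
#SBATCH --requeue                   # allow SLURM to requeue the job when it is cancelled
#SBATCH --mail-user=mymail@isp.edu  # placeholder address
#SBATCH --mail-type=END             # send an email when the job finishes

./my_long_computation               # placeholder program
```
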
## Charliecloud
Charliecloud provides user-defined software stacks (UDSS) for HPC.
It allows you to run nearly any software stack (like TensorFlow) on the cluster, even if it is not installed and available system-wide.
All information about Charliecloud can be found on its [Charliecloud documentation](https://hpc.github.io/charliecloud/) page.
### Simple workflow
1. #### Create or get Docker image
Docker is installed on the dw[01-04] workers in the gpulab cluster. You can access them using the
`srun -p debug-lp --pty bash` command.
You can either pull an already prepared Docker image (e.g. for TensorFlow) or create your own.
You perform this step only once for a given UDSS.
Of course, you must repeat the whole workflow from the beginning whenever there is a new version of the UDSS.
If you are building your own Docker image, Charliecloud offers a simplified version of the Docker invocation
`ch-build -t dockertag .`
which must be run on the dw[01-04] workers.
2. #### Create tar/directory or SquashFS image from Docker image
You must convert the prepared Docker image either to a TAR file and then to a directory structure, or
to a SquashFS file. You perform this step only once for a given UDSS.
All commands for this step must again be run on the dw[01-04] nodes.
For the first case, run
`ch-builder2tar dockerimage ${TMPDIR}`
which creates a TAR file in your temporary directory. Then you must convert the created TAR file to a directory structure using
`ch-tar2dir ${TMPDIR}/myudss.tar.gz imgdir`
which expands the TAR file to your output image directory (usually your home directory on the shared volume).
For the second case, run
`ch-builder2squash dockerimage outdir`
which creates a SquashFS file in your output directory (usually your home directory on the shared volume).
3. #### Import CUDA libraries
This step is required only for a UDSS that requires CUDA (like TensorFlow).
If your UDSS does not require CUDA, skip this step.
You perform this step only once for a given UDSS.
It works only with the tar/directory structure.
All commands for this step must be run on the volta[01-03] nodes.
Execute on gpulab
`srun -p volta-hp --gpus=1 ch-fromhost --nvidia imgdir`
which copies the necessary CUDA files from the host into your image directory structure.
4. #### Execute created UDSS by SLURM
This step is executed as many times as necessary on any node of the parlab and gpulab clusters.
For the tar/directory structure (an end-to-end sketch of this variant is given after the workflow), run
`srun <slurm params> ch-run imgdir --bind=/mnt/home/myhome`
which will execute your UDSS using SLURM in interactive mode.
Moreover, your home directory will be mounted as /mnt/0 in your UDSS environment.
The SquashFS case is better suited to batch mode, e.g. prepare a shell script which looks something like
`ch-mount mysquashimg ${TMPDIR}`
`ch-run ${TMPDIR}/mysquashimg --bind=/mnt/home/myhome`
`ch-umount ${TMPDIR}/mysquashimg`
Then execute the script using
`sbatch <slurm params> myudss.sh`
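
For reference, a hedged end-to-end sketch of the tar/directory variant of the whole workflow, using a single placeholder image name `myudss` in place of the `dockertag`/`dockerimage` placeholders above (the exact TAR file name produced by `ch-builder2tar` may differ):

```bash
# steps 1-2: run on a dw[01-04] node (e.g. via srun -p debug-lp --pty bash)
ch-build -t myudss .                        # build the Docker image from the current directory
ch-builder2tar myudss ${TMPDIR}             # convert the image to a TAR file in the local temporary directory
ch-tar2dir ${TMPDIR}/myudss.tar.gz imgdir   # expand the TAR file into the image directory on the shared volume

# step 3: run on a volta node, only if the UDSS needs CUDA
srun -p volta-hp --gpus=1 ch-fromhost --nvidia imgdir

# step 4: run the UDSS on any node; the home directory is bound as /mnt/0 inside
srun <slurm params> ch-run imgdir --bind=/mnt/home/myhome
```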