The goal of this short tutorial is to introduce new users to the ScienceCluster environment. It assumes that the readers already have experience with remote Linux servers but not necessarily with clusters. If you have never worked with remote servers before, you may want to start with the detailed instructions instead.
Connecting to the cluster
The cluster can be reached via cluster.s3it.uzh.ch. The load balancer redirects requests in round-robin fashion to one of several login nodes. The username is your UZH Active Directory (AD) shortname. In most cases the password will be the same as your Email/Collaboration password. If you are unable to log in using your Email/Collaboration password, you will need to update your AD password in the Identity Manager.
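Assuming a standard OpenSSH client, logging in looks like this (replace <shortname> with your own AD shortname):

```shell
# Connect to one of the login nodes via the load balancer
ssh <shortname>@cluster.s3it.uzh.ch
```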
There are three filesystems where you can store your data.
Your home filesystem (/home/cluster/<shortname>) has a quota of 15 GB / 100,000 files. Typically, it is used to store configuration and small important files.
For persistent storage of larger files, you can use the data filesystem (/data/<shortname>). It has a limit of 200 GB but it is not backed up. This filesystem is also appropriate for software installation (e.g., Python modules or R packages).
If you need additional space for persistent data beyond the data filesystem, you can request scalable storage. It is not subject to a quota, but it requires cost contributions based on actual usage.
Large input data and computational results can be stored on the scratch filesystem (/scratch/<shortname>), which has a quota of 20 TB and is not backed up. Please note that this filesystem is meant for temporary storage: files may be automatically deleted if they have not been accessed within one month.
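To see how close you are to these quotas, one simple (if approximate) check is du; the paths below follow the layout described above, with $USER standing in for your shortname:

```shell
# Summarise usage on each filesystem you have (skips paths that do not exist).
for fs in "/data/$USER" "/scratch/$USER" "$HOME"; do
    if [ -d "$fs" ]; then
        du -sh "$fs"    # prints "<size>  <path>"
    fi
done
```

Note that the quota system may count blocks and files slightly differently from du, so treat this as an estimate.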
Our cluster has been partitioned according to its hardware capabilities. The partitions are as follows:
- generic: jobs requiring at most 32 vCPUs and/or 123 GB of RAM.
- hpc: jobs requiring a high speed inter-connect or high CPU/memory (> 32 vCPUs or > 123 GB RAM per job).
- hydra: jobs requiring more than 377 GB of RAM.
- vesta: jobs requiring GPUs (equipped with Nvidia K80 cards).
- volta: jobs requiring GPUs (equipped with Nvidia V100 cards).
You can switch to a specific partition by loading one of the partition modules. For example, the following command selects the generic partition.
module load generic
No partition is selected by default, so if you do not load a partition module and do not specify the partition explicitly as an sbatch parameter, your job will be rejected. You can see the list of available partitions by listing all available modules with module avail (or module av for short); partitions are listed in the section titled /sapps/etc/modules/start. The following command displays the partitions that you can access.
sacctmgr show assoc format=partition,account%20,qos%30 user=<username>
After loading a partition, you can also use the module av command to see the list of software available on that partition.
Click here for more detailed information about the partitions.
Jobs are submitted with the sbatch command. The default values for resource allocations are very low: if you do not specify any parameters, Slurm (the cluster's job scheduler) will allocate 1 vCPU, 1 MB of memory, and 1 second of execution time. Therefore, you need to specify at minimum the amount of memory and the expected runtime. For example, to run the hostname command on the cluster, you can create a file named test.job with the following contents:
#!/usr/bin/env bash
hostname
Then you can submit it for execution with the following command (assuming that you have already loaded a partition module).
sbatch --time=0:10:0 --mem=7800 --cpus-per-task=2 test.job
This will request 2 CPUs and 7800 MB of RAM for 10 minutes. Alternatively, you can specify these parameters in your job file; e.g.,
#!/usr/bin/env bash
#SBATCH --time=0:10:0
#SBATCH --mem=7800
#SBATCH --cpus-per-task=2
hostname
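When a job is submitted, sbatch prints the new Job ID, and by default Slurm writes the job's standard output to a file named slurm-&lt;jobid&gt;.out in the submission directory. A typical round trip looks like this (the Job ID shown is just an example):

```shell
sbatch test.job
# Submitted batch job 2850610
cat slurm-2850610.out   # contains the hostname of the compute node that ran the job
```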
The memory-per-vCPU ratio is the same on all nodes of a particular partition: 4 GB/vCPU on generic, 8 GB/vCPU on hpc, and 24 GB/vCPU on hydra. However, Slurm has to reserve some memory for system operations, so optimal hardware utilisation can only be achieved when a slightly smaller amount of memory is requested per vCPU: 3850 MB on generic, 8000 MB on hpc, and 24100 MB on hydra. If you request multiples of those numbers with --mem, or the exact numbers with --mem-per-cpu, you might be able to schedule more jobs to run in parallel than if you requested the full 4 GB, 8 GB, or 24 GB per vCPU on generic, hpc, and hydra respectively.
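For example, on the generic partition a 4-vCPU job can be sized in either of these two equivalent ways (using the test.job script from above):

```shell
# 4 vCPUs x 3850 MB = 15400 MB total
sbatch --time=1:0:0 --cpus-per-task=4 --mem=15400 test.job
# the same request, expressed per vCPU
sbatch --time=1:0:0 --cpus-per-task=4 --mem-per-cpu=3850 test.job
```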
For testing or debugging purposes, you can run your job in an interactive session. Any other use of interactive sessions is generally discouraged. You can start an interactive session with the following command.
srun --pty --time=1:0:0 --mem-per-cpu=8000 --cpus-per-task=2 bash -l
For more detailed information on job submission, click here.
Maximum running time
You should strive to split your calculations into jobs that can finish in fewer than 24 hours. Short jobs are easier to schedule; i.e., they are likely to start earlier than long jobs. If something goes wrong, you might be able to detect it earlier. In case of a failure, you will be able to restart calculations from the last checkpoint rather than from the beginning. Finally, long jobs fill up the queue for extended periods and prevent other users from running their smaller jobs.
A job's runtime is controlled by the --time parameter. If your job runs beyond the specified time limit, Slurm will terminate it. Depending on the value of the --time parameter, Slurm automatically places jobs into one of the quality of service (QOS) groups, which in turn affects job scheduling priority as well as some other limits and properties. ScienceCluster has four different QOS groups.
- normal: 24 hours
- medium: 48 hours
- long (named vesta on the GPU partitions): 7 days
- verylong: 28 days
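Because the QOS follows from the requested time limit, you normally just set --time and let Slurm pick the group; the qos column of squeue's --Format output shows what was assigned. For example:

```shell
sbatch --time=12:0:0 --mem=3850 test.job    # up to 24 hours -> normal QOS
sbatch --time=36:0:0 --mem=3850 test.job    # up to 48 hours -> medium QOS
squeue -u <username> -O jobid,qos,timelimit # check the QOS assigned to your jobs
```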
To use the verylong QOS (i.e., running times over 7 days), please request access via the UZH Help Desk (mentioning S3IT in the subject line). A single user can run only one job with the verylong QOS at a time; if you schedule multiple verylong jobs, they will run serially regardless of resource availability.
You can view the list of currently scheduled and running jobs with the squeue command. Without any parameters, it displays all jobs that are currently scheduled or running on the cluster. If you have loaded a partition module, the output will be limited to jobs scheduled or running on that particular partition. To see only your own jobs, specify your username:
squeue -u <username>
If you want to delete a job from the queue, you can do so with scancel, specifying the Job ID as an argument. The Job ID is always reported when you schedule a job; you can also find it in the output of squeue. Multiple jobs can be deleted at once. For example,
scancel 2850610 2850611
You can also cancel all your jobs at once without specifying any Job IDs. The following two commands delete all your jobs or all your pending jobs, respectively.
scancel
scancel --state=PENDING
For more information about job management, click here.
There are four main approaches to parallelisation.
- Single program that runs multiple processes each with private memory allocation
- Several program instances that run in parallel (i.e., job arrays)
- Single master program that launches several slave programs
- Single program that runs multiple processes that communicate via message passing (MPI)
For the first approach, you do not need to do anything special. You just submit a job requesting the number of vCPUs that your program can efficiently use. The other three approaches are described in the Job Scheduling section of the documentation.
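For instance, a job script for the first approach might look like the sketch below; my_program is a hypothetical placeholder for your own software, and SLURM_CPUS_PER_TASK is the environment variable in which Slurm exposes the number of allocated vCPUs:

```shell
#!/usr/bin/env bash
#SBATCH --time=2:0:0
#SBATCH --cpus-per-task=8
#SBATCH --mem=30800
# Many programs accept a thread/worker count; inside the job Slurm sets
# SLURM_CPUS_PER_TASK to the number of vCPUs allocated above.
# my_program is a placeholder for your own software.
my_program --threads "$SLURM_CPUS_PER_TASK"
```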
In addition to the documentation provided on this site, you may also find the following external resources useful.
- Slurm Quick Start Guide
- Slurm Documentation
- CECI Slurm Quick Start Tutorial (Although it is written specifically for CECI users, the tutorial is excellent and can be used as a general Slurm guide.)