Parallelisation

Parallelization means performing parallel computing, i.e., simultaneously using multiple compute resources to solve a problem. For an application to benefit from parallel computing, it has to be developed explicitly with this in mind. In terms of the number of nodes involved in the computation, one can distinguish between:

  • Single-node parallelism: A program that runs much as it would on your local computer or a virtual machine and that can benefit from shared memory
  • Multi-node parallelism: A single program that runs multiple processes and can benefit from distributed memory. MPI is one library that enables such a use case. For more information about MPI on ScienceCluster, please see the MPI section below.

A special case of parallelism is when the same program is executed many times (dozens to millions of times) but with slightly different input. The standard approach is to submit each execution as a separate, independent job, but keeping track of these jobs is easy neither for the user nor for the scheduling system (Slurm). For this use case there is a better way:

  • Job arrays: each job is submitted separately but the array of jobs can be managed as a whole

Please contact us if you would like assistance.

Single-node parallelism

This is the simplest case. You only need to request the number of CPUs that your application can take advantage of and force the execution onto a single node via the option -N 1 (or --nodes=1). In most cases, there is also a parameter that you need to specify when calling the application to tell it how many CPUs or threads to use; the application's documentation should explain which parameter that is. You can find sample job scripts in the Job Submission section; a minimal sketch is also shown below.
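
A minimal single-node job script might look like the following. The application name my_threaded_app and its --threads option are placeholders; substitute your own application and its thread-count parameter:

#!/usr/bin/env bash
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --job-name=single_node
#SBATCH --output=single_node.out

# Pass the number of allocated CPUs to the application
# (the exact option name depends on your application).
my_threaded_app --threads=$SLURM_CPUS_PER_TASK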

Job arrays

In general, job arrays are useful for applying the same processing routine to a collection of input data files, or for applying different sets of parameters to the same input data. They offer a very simple way to submit a large number of independent processing jobs. In the example below, the --array=1-16 option causes 16 array tasks (numbered 1, 2, ..., 16) to be spawned when the master job script is submitted. The array tasks are simply copies of this master script that are automatically submitted to the scheduler on your behalf, so exactly the same amount of resources is requested for each array task. However, in each array task an environment variable called SLURM_ARRAY_TASK_ID is set to a unique value; in this example, the number will be in the range 1, 2, ..., 16. In your script, you can use this value to select, for example, a specific data file that each array task will be responsible for processing.

#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-16
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3850

your_application $SLURM_ARRAY_TASK_ID
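
Instead of passing the task ID directly to the application, you can also use it to pick an input file. The following sketch assumes a hypothetical file filelist.txt that lists one input file per line, so that array task N processes the file on line N:

# filelist.txt and your_application are placeholders for your own inputs
input_file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)
your_application "$input_file"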

Job array indices can be specified in a number of ways. For example:

  • A job array with index values between 0 and 31
    #SBATCH --array=0-31
    
  • A job array with index values of 1, 2, 5, 19, 27

    #SBATCH --array=1,2,5,19,27
    
  • A job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5, 7)

    #SBATCH --array=1-7:2
    

In the example above, you can see that the output and error file names contain the special placeholders %A and %a. For each array task, they will be replaced with the job ID and the array task ID, respectively.

Warning

Please make sure that each array task writes to its own unique set of files. For example, you can add $SLURM_ARRAY_TASK_ID as a file name suffix or append it to the output directory name, as sketched below.
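
A minimal sketch (my_program and data.csv are placeholders):

# Each array task writes to its own output file, e.g. output_7.txt for task 7
my_program --input=data.csv > output_${SLURM_ARRAY_TASK_ID}.txt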

Below is a sample script that runs a parameter sweep with a hypothetical application. It iterates over two parameters: alpha and gamma. The first parameter takes values from 0 to 6 with a step of 2 (i.e., 0, 2, 4, 6). If a step of 1 is desired, the step value can be omitted; e.g., {0..10}. The second parameter can take values 'Aa', 'Bb', or 'Cc'. In total, there are 4 * 3 = 12 parameter combinations. So, the value for the --array parameter should be 0-11.

The nested loops generate two arrays that allow reconstruction of all parameter combinations. In this case, alphaArr contains 0 0 0 2 2 2 4... while gammaArr has Aa Bb Cc Aa Bb Cc Aa.... Later, $SLURM_ARRAY_TASK_ID is used to retrieve a particular combination. The array indices start with 0. Thus, when the $SLURM_ARRAY_TASK_ID variable is 2, for example, alpha is set to 0 and gamma takes the value Cc, which are subsequently passed to the myapp application.

The output file is saved to the results directory. To make the output file unique, both parameter values are added to its name. For example, when $SLURM_ARRAY_TASK_ID is 2, the output path will be results/output_a0_gCc.txt. Note the use of curly braces (e.g., {}) around the variable names. Since variable names can contain an underscore, without the braces bash would identify the first variable as $alpha_g. The curly braces around gamma are technically redundant but they can help to prevent a potential bug if another variable is added to the file name.

The backslash (\) is a line continuation character. There must be no spaces after a backslash.

Slurm exports several environment variables when it submits a job for execution. You may find SLURM_CPUS_PER_TASK particularly useful. In this example, the variable is used to specify the number of threads available to the application.

#!/usr/bin/env bash
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=3850
#SBATCH --job-name=param_sweep
#SBATCH --output=param_sweep_%A_%a.out
#SBATCH --array=0-11

alphas=({0..6..2})
gammas=(Aa Bb Cc)

alphaArr=()
gammaArr=()
for alpha in "${alphas[@]}"; do
   for gamma in "${gammas[@]}"; do
      alphaArr+=($alpha)
      gammaArr+=($gamma)
   done
done

alpha=${alphaArr[$SLURM_ARRAY_TASK_ID]}
gamma=${gammaArr[$SLURM_ARRAY_TASK_ID]}
# Make sure the output directory exists
mkdir -p results

myapp \
   --alpha=$alpha \
   --gamma=$gamma \
   --threads=$SLURM_CPUS_PER_TASK \
   --output=results/output_a${alpha}_g${gamma}.txt

Before submitting the job, it is helpful to test the script to ensure that it translates the task ID correctly into the parameters. This can be done by setting SLURM_ARRAY_TASK_ID to the first command line argument (SLURM_ARRAY_TASK_ID=$1) before its first use and adding echo before myapp.

# ...
SLURM_ARRAY_TASK_ID=$1
alpha=${alphaArr[$SLURM_ARRAY_TASK_ID]}
gamma=${gammaArr[$SLURM_ARRAY_TASK_ID]}
echo myapp \
   --alpha=$alpha \
   --gamma=$gamma \
   --threads=$SLURM_CPUS_PER_TASK \
   --output=results/output_a${alpha}_g${gamma}.txt

Suppose the job script has been saved as myjob and made executable with chmod u+x myjob. Now, when you call it with different IDs, it will print the application call that would have been executed. Notice that you need to run the script directly, without sbatch.

./myjob 2
./myjob 5
./myjob 9
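
With the parameter arrays above, the printed commands should look roughly as follows (--threads= is empty here because SLURM_CPUS_PER_TASK is only set inside a Slurm job):

myapp --alpha=0 --gamma=Cc --threads= --output=results/output_a0_gCc.txt
myapp --alpha=2 --gamma=Cc --threads= --output=results/output_a2_gCc.txt
myapp --alpha=6 --gamma=Aa --threads= --output=results/output_a6_gAa.txt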

Caution

Make sure you remove both changes before submitting the job!
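
Once both changes have been reverted, the array job is submitted as usual with sbatch:

sbatch myjob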

MPI

Message Passing Interface (MPI) is a library of routines that can be used to create parallel programs. The MPI standard was developed to ameliorate interoperability problems between programming language constructs. It is a library that supplies commonly-available operating system services to create parallel processes and exchange information among these processes. MPI is typically utilized for multi-node parallelism but can also be used for single-node parallelism.

MPI is designed to allow users to create programs that can run efficiently on most parallel architectures. The design process included vendors (such as IBM, Intel, TMC, Cray, Convex, etc.), parallel library authors (involved in the development of PVM, Linda, etc.), and applications specialists. The final version of the draft standard became available in May of 1994.

There are various flavours of MPI available on the cluster; these all come with limited support. Currently, the following are installed:

  • OpenMPI
  • IntelMPI

OpenMPI, IntelMPI, and cluster modules

By default, Ethernet-optimized versions of MPI are loaded when an MPI module is loaded without setting a hardware constraint. This is because Ethernet is available on all nodes in the cluster, whereas Infiniband is only available on HPC-nodes and a subset of VESTA-nodes.

  • Ethernet MPI: OpenMPI or IntelMPI
  • Infiniband MPI: OpenMPI
  • Infiniband MPI + CUDA: OpenMPI

Examples of loading the various Ethernet MPI variants:

# OpenMPI
module load openmpi

# IntelMPI
module load intel-oneapi-mpi

If one wants to use a version of MPI that is optimized for Infiniband:

module load infiniband

# OpenMPI
module load openmpi

For Infiniband with CUDA, you can choose the following:

module load multigpu

# OpenMPI
module load openmpi

Take note that the commands to load specific hardware resources (module load infiniband or module load multigpu) automatically constrain the nodes on which the computation will be allowed to run (e.g., the MPI version is optimized for the hardware and should not run on other nodes).
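
Once the appropriate module is loaded, an MPI application is typically compiled with the wrapper compiler that the module provides and then launched with srun from within a job script (see the example below). A minimal compile step, assuming a C source file hello_mpi.c as a placeholder for your own code:

module load openmpi

# Compile with the OpenMPI wrapper compiler
mpicc hello_mpi.c -o hello_mpi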

Example MPI script

An MPI Slurm script that uses 4 MPI tasks, 1 MPI task per node (4 nodes in total), 16 GB of memory per node, and 8 CPU cores per task (assuming a "hybrid" MPI application that is designed to also use single-node parallelism):

#!/usr/bin/env bash
#SBATCH --time=1:00:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16G  
#SBATCH --job-name=MPI_application
#SBATCH --output=mpi_app.out

module load openmpi

# MPI applications are launched with srun
#
#  The argument "--nthreads" is just an example of the various ways that
#  specific applications have of specifying the number of cores or threads per task.

srun my_mpi_app --nthreads=$SLURM_CPUS_PER_TASK