Supercomputer Training¶
Introduction to the Grace Hopper Daint Cluster on the Alps Supercomputer¶
Science IT Overview¶
- We provide computing infrastructure:
- ScienceCloud
- ScienceCluster
- ScienceApps – web-based application portal to the ScienceCluster
- Supercomputer – Daint and Eiger clusters on the Alps supercomputer at CSCS
- We provide support, training, and consulting:
- Application support
- Training to use infrastructure
- Expert consulting, for example:
- Specialized advice for Science IT hardware
- Assistance with workflows or scripts
- Scaling up compute work from laptop to cluster
- Code optimization, including porting to GPU or enabling parallelization
Further info:
- Science IT Computing Homepage
- Science IT Computing Terms & Conditions (UZH Login required)
- Documentation for ScienceCloud, ScienceCluster, and ScienceApps (services hosted in our on-premises data center)
Supercomputer Service Summary¶
- Daint and Eiger are clusters on the larger Alps system
- managed by CSCS and available through a partnership between CSCS and UZH
- Alps was recently ranked as the 8th fastest supercomputer in the world¹
| Daint | Eiger |
|---|---|
| 500+ nodes | 500+ nodes |
| Node: 4 Nvidia Grace Hopper modules; 288 ARM cores; 4 x Nvidia Hopper GPUs; 800+ GB unified memory | Node: 2 x AMD Epyc Rome CPUs; 128 x86-64 cores; 256 GB RAM |
| UZH share: 24 nodes, 50k node hours per quarter | UZH share: 65 nodes, 140k node hours per quarter |
Further info:
- "Supercomputer"" – The Alps supercomputer (CSCS, Lugano)
- Alps - general info
- Daint docmentatation - details about the Daint cluster, managed by CSCS
- Eiger documentation - details about the Eiger cluster, managed by CSCS
- 1 top500 - June 2025
- Depending on usage patterns, the ratio of UZH Daint to Eiger Share could be adjusted in future.
How do I choose between Supercomputer or ScienceCluster?¶
| Type of computation | Suitable Resource |
|---|---|
| Single-core CPU | ScienceCluster |
| Many single-core CPU jobs (or job array) | ScienceCluster |
| Multi-core CPU parallel job, < ~100 cores | ScienceCluster |
| Multi-core CPU parallel job, > ~100 cores | Eiger, Daint? |
| Job uses 1 GPU | ScienceCluster |
| Job uses 2 GPUs | ScienceCluster |
| Job uses 4+ GPUs | Daint, ScienceCluster? |
| Job uses many GPUs | Daint |
| Needs fast, low-latency network between nodes (e.g. MPI) | Eiger, Daint |
| Application requires x86-64 (AMD/Intel) CPU | ScienceCluster, Eiger |
| Application requires an ARM CPU | Daint |
Further info:
- MPI (Message Passing Interface) is a communication library used by many parallel applications (e.g., physical simulations).
- ScienceCloud may also be suitable for single node (<32 cores) or single GPU workloads
GPU jobs: Choosing Daint versus ScienceCluster¶
| Daint | ScienceCluster |
|---|---|
| Hopper GPUs | Hopper, Ampere, Volta GPUs |
| Integrated Grace (ARM) CPU | x86-64 (Intel or AMD) CPU |
| Grace-Hopper unified memory | NVLink CPU-to-GPU for H100, A100, V100 |
| Larger multi-node jobs possible | Can be a long wait for multiple GPUs |
| Low latency, high performance network (Cray Slingshot) | Some high performance InfiniBand; some Ethernet |
| High performance parallel filesystem | |
Getting access to Supercomputer¶
- Current Eiger users get access to Daint.
- For new projects, contact Science IT to set up a Supercomputer project.
  - We coordinate with CSCS to create the project.
  - Once a project is created, PIs and deputies can add or remove users on their project (separately for Eiger and Daint) using the CSCS Account and resources management tool.
- Cost contributions for the Supercomputer service are pay-per-use.
  - Daint and Eiger usage is included in the monthly usage reports (along with ScienceCluster/ScienceApps and ScienceCloud).
- If you need a very large amount of compute time on Daint or Eiger, please consider submitting a proposal to one of the allocation schemes available through CSCS.
Further info:
- Subsidized cost contribution rates can be found here: Science IT Computing Terms & Conditions (UZH Login required)
- CSCS allocation schemes are not associated with Science IT; awarded projects are not part of the UZH Supercomputer service, and hence there are no cost contributions applied.
Getting Started on Daint - background skills/knowledge¶
- Important background skills:
  - Bash/command line
    - If you have no experience with the Linux/Unix command line, we offer a regular Linux command line training
  - Slurm
    - Use Slurm to submit compute jobs to the cluster
    - Allows the cluster to be shared fairly and efficiently
    - We cover Slurm in the ScienceCluster training (registration link)
  - Parallel computing/HPC
    - To utilize Daint efficiently you will need an application that can utilize multiple GPUs
    - Preferably your application should also utilize many cores per node
Further info:
- Simple Linux Utility for Resource Management
- Fun fact: the acronym is derived from a popular cartoon series by Matt Groening titled Futurama. Slurm is the name of the most popular soda in the galaxy.
Logging onto Daint¶
- Details are found in the CSCS Daint documentation under Getting started
- ssh access is via ssh key only; keys are obtained from CSCS and are valid for only 24 hours
  - Recommended: configure ssh on your laptop with a `~/.ssh/config` file for Daint
- Daint and Eiger are reached from the same "jump host"
  - Daint and Eiger are separate clusters with separate Slurm queues
    - i.e., log in to `daint.alps` to run jobs on Daint; log in to `eiger.alps` to run jobs on Eiger
Further info:
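A minimal `~/.ssh/config` sketch for the jump-host setup (the hostnames, username, and key path below are assumptions; use the exact values from the CSCS Getting started documentation):

```
# Sketch only: check the CSCS docs for the real hostnames and key handling
Host cscs-jump
    HostName ela.cscs.ch
    User <cscs_username>
    IdentityFile ~/.ssh/cscs-key

Host daint.alps
    HostName daint.alps.cscs.ch
    User <cscs_username>
    ProxyJump cscs-jump
    IdentityFile ~/.ssh/cscs-key
```

With such a configuration in place, `ssh daint.alps` hops through the jump host automatically.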
File systems on Daint¶
- scratch (`/capstore/scratch`) on Daint/Eiger is temporary storage for I/O
  - Strict 30-day automatic deletion policy
  - 150 TB maximum quota
  - No cost contributions
- A very small quota is available on `/store`, which is a persistent high performance filesystem.
Further info:
If you need expanded storage on `/store`, please contact Science IT. We can help request a quote with CSCS for a reserved quota for the desired length of time. Note that additional storage will not be subsidized by UZH and is subject to availability.
Migrating from Eiger to Daint¶
- Make sure that your software/application can utilize multiple GPUs -- if it can only use CPUs, then Eiger is preferred.
- Software, applications, and environments:
  - Use `uenv` instead of the old simple `module load`
    - Modules are available within `uenv` (hint: `--view=modules`)
  - If using community applications, check whether what you use is available
  - Containers can be built and run on Daint
  - If installing software, consider Spack
- Any software, container, or virtual environment that you built for Eiger will need to be rebuilt/recompiled for Daint because of the different CPU architecture.
- Daint has no multi-threading. No need to set `--ntasks-per-core`; it is 1 by default.
Further info:
uenv is also available on Eiger
Migrating from ScienceCluster to Daint¶
- A Slurm job on Daint (and Eiger) is always allocated an entire node (unlike ScienceCluster).
  - Make sure that your software can utilize a node reasonably efficiently (4 GPUs and 288 CPU cores) before deciding to migrate workflows to Daint.
- Any software, container, or virtual environment that you built for ScienceCluster will need to be rebuilt/recompiled for Daint because of the different CPU architecture.
- Use uenv instead of a simple "module load"
  - Modules are available within uenv (hint: `--view=modules`)
  - If using community applications, check whether what you use is available
  - Containers can be built and run on Daint
  - If installing software, consider Spack
- Use uv instead of conda/mamba, following these [uv instructions](https://docs.cscs.ch/guides/storage/#python-virtual-environments-with-uenv)
- Daint login nodes have GPUs, suitable for compiling (not true for ScienceCluster)
Similarities to Eiger and ScienceCluster¶
Upon a successful login to Daint, you will arrive at one of multiple login nodes, each of which has identical hardware to the compute nodes.
- Don't: execute your code directly on login nodes.
- Do: submit a job to the job scheduler / resource manager (Slurm)
Further info:
The login nodes are for text editing, file, code and data management on the cluster. Nothing computationally intensive should be run on the login nodes.
Warning
Running your code or scripts directly on the login nodes will compromise the integrity of the system and harm your own and other users' experience on the cluster.
Tools for setting up your environment¶
- Many choices available - Use what works and is most comfortable (or simplest)
- Programming environments
- uenv
- access to "programming environments" (tools, compilers, libraries)
- access applications maintained by CSCS
- modules available within uenv
- possible (advanced) to build your own uenv
- uv (python virtual environments)
- uenv+spack (building software from a recipe)
- containers (run 3rd party containers or build your own)
- Tip: Consider using /dev/shm for performance - (RAM as a temporary virtual filesystem)
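For the /dev/shm tip above, a minimal sketch of staging a read-heavy input into RAM within a job step (file and application names are illustrative):

```bash
# /dev/shm is RAM-backed tmpfs: fast, but it counts against node memory
cp "$SCRATCH/inputs/data.h5" /dev/shm/
./my_app --input /dev/shm/data.h5   # hypothetical application
rm /dev/shm/data.h5                 # clean up before the job ends
```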
Setting up your environment: uenv¶
- uenv images exist for programming environments (compilers, libraries, etc.), tools, applications
- uenv documentation
- Some commonly used applications are provided
  - Check the listed software, or run `uenv image find`
- Useful commands: `uenv image find`, `uenv image pull`, `uenv status`, `uenv image ls`, `uenv start <uenv_name>`
- Load modules (e.g., `prgenv-gnu`) with `uenv start prgenv-gnu/24.11:v1 --view=modules`, then `module avail`
Further info:
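A typical first session with uenv might look like this (the image name and version are examples; run `uenv image find` to see what is actually available):

```bash
uenv image find                       # list images available on this cluster
uenv image pull prgenv-gnu/24.11:v1   # download an image to your repository
uenv image ls                         # show locally available images
uenv start prgenv-gnu/24.11:v1 --view=modules
module avail                          # modules provided inside the uenv
```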
Python virtual environments on Daint¶
- Use cases for virtual environments
  - When is it better to use than a container? Or in addition to a container?
- Want the latest pytorch?
  - Install it with uv (instead of the pre-installed pytorch uenv)
- Recommended to use uv, a fast python package manager
  - Then use squashfs to compress the resulting directory tree into a single file
- Instructions to use uv on Daint
Further info:
Instructions to install uv. You can also use venv, conda, or mamba.
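A minimal sketch of the uv workflow (paths and packages are illustrative; follow the CSCS uv instructions for Daint-specific details such as the squashfs step):

```bash
# One-time install of uv into your home directory
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate a virtual environment (illustrative path on scratch)
uv venv "$SCRATCH/venvs/myproject"
source "$SCRATCH/venvs/myproject/bin/activate"

# Install packages with uv's fast pip interface
uv pip install torch numpy
```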
Spack: Building software with uenv and Spack¶
- Spack is a package manager for scientific software
- Many applications have a spack recipe, which allows you to build from source, including all dependencies.
- If there is no uenv image for your application, check whether it has a spack build recipe
- Instructions on building with spack+uenv
Containers on Daint¶
- You can use containers to run pre-built software
- You can build your own container on Daint/Eiger for full customization
  - Guide to building containers with podman
    - Similar to docker (or singularity)
    - Build from a recipe file (syntax similar to docker)
- Example to set up and run pytorch in a container
Further info:
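A minimal podman build sketch (the base image and recipe contents are illustrative; see the CSCS guide for the required storage configuration and how to import the image for running jobs):

```bash
# Containerfile: a docker-style recipe
cat > Containerfile <<'EOF'
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y --no-install-recommends python3
EOF

# Build the image on a Daint/Eiger login node
podman build -t my-image:latest .
```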
Jupyter web portal¶
Jupyter sessions on Daint or Eiger for interactive notebooks. Check here for details: jupyter
Copy data to/from Daint or Eiger¶
- Use Globus
- High bandwidth transfers
- A globus transfer requires 2 globus endpoints (one of which can be your personal machine with globus personal connect).
- CSCS and Science IT both maintain a Globus endpoint.
- Note that the CSCS endpoint requires CSCS (not UZH) credentials.
- Instructions on using Globus to/from UZH (Science IT ScienceCluster endpoint)
- Globus high speed file transfer service
- More on setting up and using Globus
- CSCS endpoint (Daint/Eiger) <---> UZH Science IT endpoint (ScienceCluster)
Further info:
Globus can transfer between any 2 Globus endpoints, including a Globus Connect Personal endpoint.
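Besides the web interface, transfers can be scripted with the globus-cli package (endpoint UUIDs and paths below are placeholders):

```bash
pip install globus-cli
globus login                          # opens a browser for authentication
globus transfer --recursive \
    "<SOURCE_ENDPOINT_UUID>:/<source_path>" \
    "<DEST_ENDPOINT_UUID>:/<destination_path>"
```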
How to run jobs on Daint¶
Slurm job submission script for Daint¶
- Key ingredients:
  - Specify the number of `--nodes` (1 or more)
  - Various ways to specify the number of GPUs (per node) and tasks (or MPI ranks):
    - `--ntasks-per-node` - the number of parallel tasks (MPI tasks, for example)
      - alternatively `--ntasks-per-gpu`
    - `--cpus-per-task` or `--cpus-per-gpu` - number of CPU cores per task or per GPU
  - Specify the run time - 24 hour maximum; a shorter time allows backfilling
  - Memory does NOT need to be specified because you get the entire node
  - The number of GPUs does NOT need to be specified because you get all 4 GPUs on a node
  - Optional: `--job-name`; `--output`; `--mail-user`
  - Optional: `--gpus-per-task` - number of GPUs per task
  - Optional: procedure for multiple tasks (ranks) per GPU
Further info:
More details of Daint slurm scripts
Warning
Daint is a powerful resource. Please check your job submission script very carefully before submitting, then monitor your job with `squeue --me` to make sure that you are only requesting the intended amount of resources.
Note
A task is a process (a unit of a program) that can be placed on any node. Each task can use multiple CPUs, specified with `--cpus-per-task`.
Daint slurm script with uenv¶
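A minimal sketch of such a script (the uenv name, resource numbers, and application binary are assumptions; the `--uenv`/`--view` batch options come from the uenv Slurm plugin described in the CSCS docs):

```bash
#!/bin/bash
#SBATCH --job-name=uenv-example
#SBATCH --nodes=2                   # whole nodes are always allocated
#SBATCH --ntasks-per-node=4         # e.g. one MPI rank per GPU
#SBATCH --cpus-per-task=72          # 288 cores / 4 tasks per node
#SBATCH --time=04:00:00             # 24 h maximum; shorter aids backfilling
#SBATCH --output=%x-%j.out
#SBATCH --uenv=prgenv-gnu/24.11:v1  # uenv Slurm plugin option
#SBATCH --view=default

srun ./my_mpi_gpu_app               # hypothetical application binary
```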
Daint slurm script with uv python virtual environment¶
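A minimal sketch (the venv path and training script are illustrative; if you packed the venv into a squashfs file, mount it as described in the CSCS uv instructions):

```bash
#!/bin/bash
#SBATCH --job-name=uv-venv-example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4         # e.g. one task per GPU
#SBATCH --cpus-per-task=72
#SBATCH --time=02:00:00
#SBATCH --output=%x-%j.out

# Activate the uv-created virtual environment (illustrative path)
source "$SCRATCH/venvs/myproject/bin/activate"

srun python train.py                # hypothetical training script
```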
Submit a compute job and monitor it¶
- Submit the submission script `myjob.sh` using the `sbatch` command: `sbatch myjob.sh`
- Upon a successful job submission, you will receive a message that reads `Submitted batch job <number>`, where `<number>` is the job's assigned ID. The job ID can be used for other operations (i.e., monitoring or canceling the job).
- If the resources are available, the job starts immediately. If not, it waits in the queue.
Tips on job monitoring:
- Pending or running jobs: `squeue --jobs jobID` or `squeue --me`
- Completed jobs: `sacct`
- Look for the file written in the working directory: since the job runs in batch mode, the output from Slurm is put in a file.
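A typical submit-and-monitor sequence (the job ID is illustrative; the output file name depends on your `--job-name`/`--output` settings):

```bash
sbatch myjob.sh          # -> Submitted batch job 123456
squeue --jobs 123456     # state while the job is pending or running
sacct -j 123456 --format=JobID,JobName,Elapsed,State,ExitCode
cat myjob-123456.out     # Slurm output written in the submit directory
```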
Further info:
Warning
Consider using an array if you need to submit a large set of jobs. Limit your simultaneous jobs with e.g. `#SBATCH --array=1-16%4`. If each of your individual jobs runs for less than a few minutes, especially for I/O-heavy jobs, consider putting them into a single job (i.e., batches of analyses that run sequentially within a script).
Determining the resource requirements¶
- How do I choose the number of nodes?
  - If the code needs lots of memory, too few nodes means running out of memory
  - Use resources efficiently
    - Fewer nodes might result in better efficiency, longer run times, and shorter queue waits
    - Is 1 node enough?
  - More nodes might result in shorter run times
- Benchmarking
  - Some codes do not scale well
  - Feel free to refer to the Scientific Workflow Course GitLab Repo (and see the scaling-test sketch below)
Further info:
You should spend at least a small amount of time benchmarking your code before you scale it across your entire dataset, otherwise your resource requests are made with incomplete information and you may be requesting an inappropriate amount of resources.
Please keep in mind the following notes when selecting your resource requests for each job submission:
- Requesting more nodes may not make your code run faster
- Request an amount of time based on your best estimates of how long the code will need plus a small buffer.
- When possible, implement checkpointing in your code so you can easily restart the job and continue where your code stopped.
Science IT teaches a semester course called Scientific Workflows wherein we teach the basics of monitoring and benchmarking. We'll notify users via our Newsletter every time this course is offered, so make sure to read your monthly newsletters from us.
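A quick way to collect first scaling numbers (the job script name is hypothetical): submit the same job at a few node counts and compare elapsed times afterwards.

```bash
for n in 1 2 4 8; do
    sbatch --nodes="$n" --job-name="scale_$n" myjob.sh
done
# After the jobs finish:
sacct --name=scale_1,scale_2,scale_4,scale_8 --format=JobName,NNodes,Elapsed,State
```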
Understanding job priorities: Why is my job not running?¶
- Check squeue for possible issues
- The next job to run will be the job with the highest priority, with exceptions:
  - Backfilling -- a small, short job skips the queue if it fits before the next scheduled job
  - Debug partition for small, very short jobs, < 30:00 (use `--partition=debug`)
  - A higher-priority project could submit a new job and land near the front of the queue
- Job priority depends both on the recent usage within your project and on total UZH usage
  - Job size does not affect priority, but small jobs fit more easily
- Have patience. It will start eventually; if the start is soon, squeue may give an estimated start time.
  - The system can be very busy at times; jobs could wait in the queue for many days
  - The priority of a pending job increases with time
  - UZH shares Daint and Eiger with a large number of other users
- The UZH quarterly quota is finite. If it is used up, you will get an error at job submission.
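Two standard Slurm commands help you inspect a pending job yourself (estimated start times are only shown once the scheduler has computed them):

```bash
squeue --me --start   # estimated start times for your pending jobs, if known
sprio -j <jobid>      # priority components (age, fair-share, ...) of a pending job
```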
Be careful!¶
- The system is very powerful, so mistakes can be expensive and can impact all UZH users
- Check job scripts carefully before submitting
- Monitor your jobs and usage
- After submitting a job, check that you have requested the intended resources: `squeue --me`; `scontrol show job JOBID`
- Babysit/test new workflows
- Are your running jobs progressing as expected (i.e. job is not hanging)?
- Regularly check project usage on the Accounting and resources management tool
- Files & data
- Use Scratch for I/O - Scratch files are deleted automatically after 30 days
- transfer results/output files immediately (consider unexpected delays/downtime)
- scratch is not backed up. What will you do if system disaster destroys all files?
- Home/Store for small important files (scripts, source code, param files, etc.)
- use git for version control (and backup)
- do not rely on filesystem backups (Home, Store)
- what if: backup fails? you accidentally delete critical files? etc.?
- Try to avoid directories with thousands of small files - (and check your
quota
) - Regular maintenance every wednesday morning 08:00-10:00
Further info:
git and other best practices for ScienceCluster
Communications about outages, issues, and extra maintenance are sent via email from CSCS
Be aware how your usage affects other users or the system¶
- Limit your number of simultaneous jobs
- Be very careful with automation
- e.g. Do not run squeue (or other slurm queries) every second
- No workloads on the login nodes
Getting help¶
- Search the CSCS docs.
- Check the CSCS Daint slack channel discussions. Your question may already be answered there.
- Contact Science IT
- If we cannot help, we will direct you to open a ticket with CSCS, which manages the hardware and software on Daint & Eiger
- To get support from CSCS, open a Jira ticket