Supercomputer Training¶
Introduction to the Grace Hopper Daint Cluster on the Alps Supercomputer¶
Science IT Overview¶
- We provide computing infrastructure:
- ScienceCloud
- ScienceCluster
- ScienceApps – web-based application portal to the ScienceCluster
- Supercomputer – Daint and Eiger clusters on the Alps supercomputer at CSCS
- We provide support, training, and consulting:
- Application support
- Training to use infrastructure
- Expert consulting, for example:
- Specialized advice for Science IT hardware
- Assistance with workflows or scripts
- Scaling up compute work from laptop to cluster
- Code optimization, including porting to GPU or enabling parallelization
Further info:
- Science IT Computing Homepage
- Science IT Computing Terms & Conditions (UZH Login required)
- Documentation for ScienceCloud, ScienceCluster, and ScienceApps (services hosted in our on-premises data center)
Supercomputer Service Summary¶
- Daint and Eiger are clusters on the larger Alps system
- managed by CSCS and available through a partnership between CSCS and UZH
- Alps was recently ranked as the 8th fastest supercomputer in the world¹
| Daint | Eiger |
|---|---|
| 500+ nodes | 500+ nodes |
| Node: 4 Nvidia Grace Hopper modules; 288 ARM cores; 4 x Nvidia Hopper GPUs; 800+ GB unified memory | Node: 2 x AMD Epyc Rome CPUs; 128 x86-64 cores; 256 GB RAM |
| UZH share: 24 nodes, 50k node hours per quarter | UZH share: 65 nodes, 140k node hours per quarter |
Further info:
- "Supercomputer"" – The Alps supercomputer (CSCS, Lugano)
- Alps - general info
- Daint docmentatation - details about the Daint cluster, managed by CSCS
- Eiger documentation - details about the Eiger cluster, managed by CSCS
- 1 top500 - June 2025
- Depending on usage patterns, the ratio of UZH Daint to Eiger Share could be adjusted in future.
How do I choose between Supercomputer or ScienceCluster?¶
| Type of computation | Suitable Resource |
|---|---|
| Single-core CPU | ScienceCluster |
| Many single-core CPU jobs (or job array) | ScienceCluster |
| Multi-core CPU parallel job, < ~100 cores | ScienceCluster |
| Multi-core CPU parallel job, > ~100 cores | Eiger, Daint? |
| Job uses 1 GPU | ScienceCluster |
| Job uses 2 GPUs | ScienceCluster |
| Job uses 4+ GPUs | Daint, ScienceCluster? |
| Job uses many GPUs | Daint |
| Needs fast, low-latency network between nodes (e.g. MPI) | Eiger, Daint |
| Application requires x86-64 (AMD/Intel) CPU | ScienceCluster, Eiger |
| Application requires an ARM CPU | Daint |
Further info:
- MPI (Message Passing Interface) is a communication library used by many parallel applications (e.g., physical simulations).
- ScienceCloud may also be suitable for single node (<32 cores) or single GPU workloads
GPU jobs: Choosing Daint versus ScienceCluster¶
| Daint | ScienceCluster |
|---|---|
| Hopper GPUs | Hopper, Ampere, Volta GPUs |
| Integrated Grace (ARM) CPU | x86-64 (Intel or AMD) CPU |
| Grace-Hopper unified memory | NVLink CPU-to-GPU for H100, A100, V100 |
| Larger multi-node jobs possible | Can be a long wait for multiple GPUs |
| Low latency, high performance network (Cray Slingshot) | Some high performance InfiniBand; some Ethernet |
| High performance parallel filesystem | |
Getting access to Supercomputer¶
- Current Eiger users get access to Daint.
- For new projects, contact Science IT to set up a Supercomputer project.
  - We coordinate with CSCS to create the project.
  - Once a project is created, PIs and deputies can add or remove users on their project (separately for Eiger and Daint) using the CSCS Account and resources management tool.
- Cost contributions for the Supercomputer service are pay-per-use.
  - Daint and Eiger usage is included in the monthly usage reports (along with ScienceCluster/ScienceApps and ScienceCloud).
- If you need a very large amount of compute time on Daint or Eiger, please consider submitting a proposal to one of the allocation schemes available through CSCS.
Further info:
- Subsidized cost contribution rates can be found here: Science IT Computing Terms & Conditions (UZH Login required)
- CSCS allocation schemes are not associated with Science IT; awarded projects are not part of the UZH Supercomputer service, and hence there are no cost contributions applied.
Getting Started on Daint - background skills/knowledge¶
- Important background skills:
  - Bash/command line
    - If you have no experience with the Linux/Unix command line, we offer a regular Linux command line training
  - Slurm
    - Use Slurm to submit compute jobs to the cluster
    - Allows the cluster to be shared fairly and efficiently
    - We cover Slurm in the ScienceCluster training (registration link)
  - Parallel computing/HPC
    - To utilize Daint efficiently you will need an application that can utilize multiple GPUs
    - Preferably your application should also utilize many cores per node
Further info:
- Simple Linux Utility for Resource Management
- Fun fact: the acronym is derived from a popular cartoon series by Matt Groening titled Futurama. Slurm is the name of the most popular soda in the galaxy.
Logging onto Daint¶
- Details are found in the CSCS Daint documentation under Getting started
- ssh access is via ssh key only; keys are obtained from CSCS and are valid for only 24 hours
  - Recommended: configure ssh on your laptop with a `~/.ssh/config` file for Daint
- Daint and Eiger are reached from the same "jump host"
  - Daint and Eiger are separate clusters with separate Slurm queues
    - i.e., log in to `daint.alps` to run jobs on Daint; log in to `eiger.alps` to run jobs on Eiger
Further info:
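A minimal `~/.ssh/config` sketch for the jump-host setup (the hostnames, username, and key path below are assumptions; use the exact values from the CSCS Getting started documentation):

```
# Sketch only: check the CSCS docs for the real hostnames and key handling
Host cscs-jump
    HostName ela.cscs.ch
    User <cscs_username>
    IdentityFile ~/.ssh/cscs-key

Host daint.alps
    HostName daint.alps.cscs.ch
    User <cscs_username>
    ProxyJump cscs-jump
    IdentityFile ~/.ssh/cscs-key
```

With such a configuration in place, `ssh daint.alps` hops through the jump host automatically.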
File systems on Daint¶
- scratch (`/capstore/scratch`) on Daint/Eiger is temporary storage for I/O
  - Strict 30-day automatic deletion policy
  - 150 TB maximum quota
  - No cost contributions
- A very small quota is available on `/store`, which is a persistent high performance filesystem.
Further info:
If you need expanded storage on `/store`, please contact Science IT. We can help request a quote with CSCS for a reserved quota for the desired length of time. Note that additional storage will not be subsidized by UZH and is subject to availability.
Migrating from Eiger to Daint¶
- Make sure that your software/application can utilize multiple GPUs -- if it can only use CPUs, then Eiger is preferred.
- Software, applications, and environments:
  - Use `uenv` instead of the old simple `module load`
    - Modules are available within `uenv` (hint: `--view=modules`)
  - If using community applications, check whether what you use is available
  - Containers can be built and run on Daint
  - If installing software, consider Spack
- Any software, container, or virtual environment that you built for Eiger will need to be rebuilt/recompiled for Daint because of the different CPU architecture.
- Daint has no multi-threading. No need to set `--ntasks-per-core`; it is 1 by default.
Further info:
uenv is also available on Eiger
Migrating from ScienceCluster to Daint¶
- A Slurm job on Daint (and Eiger) is always allocated an entire node (unlike ScienceCluster).
  - Make sure that your software can utilize a node reasonably efficiently (4 GPUs and 288 CPU cores) before deciding to migrate workflows to Daint.
- Any software, container, or virtual environment that you built for ScienceCluster will need to be rebuilt/recompiled for Daint because of the different CPU architecture.
- Use uenv instead of a simple "module load"
  - Modules are available within uenv (hint: `--view=modules`)
  - If using community applications, check whether what you use is available
  - Containers can be built and run on Daint
  - If installing software, consider Spack
- Use uv instead of conda/mamba, following these [uv instructions](https://docs.cscs.ch/guides/storage/#python-virtual-environments-with-uenv)
- Daint login nodes have GPUs, suitable for compiling (not true for ScienceCluster)
Similarities to Eiger and ScienceCluster¶
Upon a successful login to Daint, you will arrive at one of multiple login nodes, each of which has identical hardware to the compute nodes.
- Don't: execute your code directly on login nodes.
- Do: submit a job to the job scheduler / resource manager (Slurm)
Further info:
The login nodes are for text editing, file, code and data management on the cluster. Nothing computationally intensive should be run on the login nodes.
Warning
Running your code or scripts directly on the login nodes will compromise the integrity of the system and harm your own and other users' experience on the cluster.
Tools for setting up your environment¶
- Many choices available - Use what works and is most comfortable (or simplest)
- Programming environments
- uenv
- access to "programming environments" (tools, compilers, libraries)
- access applications maintained by CSCS
- modules available within uenv
- possible (advanced) to build your own uenv
- uv (python virtual environments)
- uenv+spack (building software from a recipe)
- containers (run 3rd party containers or build your own)
- Tip: Consider using /dev/shm for performance - (RAM as a temporary virtual filesystem)
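For the /dev/shm tip above, a minimal sketch of staging a read-heavy input into RAM within a job step (file and application names are illustrative):

```bash
# /dev/shm is RAM-backed tmpfs: fast, but it counts against node memory
cp "$SCRATCH/inputs/data.h5" /dev/shm/
./my_app --input /dev/shm/data.h5   # hypothetical application
rm /dev/shm/data.h5                 # clean up before the job ends
```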
Setting up your environment: uenv¶
- uenv images exist for programming environments (compilers, libraries, etc.), tools, applications
- uenv documentation
- Some commonly used applications are provided
  - Check the listed software, or run `uenv image find`
- Useful commands: `uenv image find`, `uenv image pull`, `uenv status`, `uenv image ls`, `uenv start <uenv_name>`
- Load modules (e.g., `prgenv-gnu`) with `uenv start prgenv-gnu/24.11:v1 --view=modules`, then `module avail`
Further info:
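A typical first session with uenv might look like this (the image name and version are examples; run `uenv image find` to see what is actually available):

```bash
uenv image find                       # list images available on this cluster
uenv image pull prgenv-gnu/24.11:v1   # download an image to your repository
uenv image ls                         # show locally available images
uenv start prgenv-gnu/24.11:v1 --view=modules
module avail                          # modules provided inside the uenv
```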
Python virtual environments on Daint¶
- Use cases for virtual environments
  - When is it better to use than a container? Or in addition to a container?
- Want the latest pytorch?
  - Install it with uv (instead of the pre-installed pytorch uenv)
- Recommended to use uv, a fast python package manager
  - Then use squashfs to compress the resulting directory tree into a single file
- Instructions to use uv on Daint
Further info:
Instructions to install uv. You can also use venv, conda, or mamba.
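A minimal sketch of the uv workflow (paths and packages are illustrative; follow the CSCS uv instructions for Daint-specific details such as the squashfs step):

```bash
# One-time install of uv into your home directory
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate a virtual environment (illustrative path on scratch)
uv venv "$SCRATCH/venvs/myproject"
source "$SCRATCH/venvs/myproject/bin/activate"

# Install packages with uv's fast pip interface
uv pip install torch numpy
```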
Spack: Building software with uenv and Spack¶
- Spack is a package manager for scientific software
- Many applications have a spack recipe, which allows you to build from source, including all dependencies.
- If there is no uenv image for your application, check whether it has a spack build recipe
- Instructions on building with spack+uenv
Containers on Daint¶
- You can use containers to run pre-built software
- You can build your own container on Daint/Eiger for full customization
  - Guide to building containers with podman
    - Similar to docker (or singularity)
    - Build from a recipe file (syntax similar to docker)
- Example to set up and run pytorch in a container
Further info:
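A minimal podman build sketch (the base image and recipe contents are illustrative; see the CSCS guide for the required storage configuration and how to import the image for running jobs):

```bash
# Containerfile: a docker-style recipe
cat > Containerfile <<'EOF'
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y --no-install-recommends python3
EOF

# Build the image on a Daint/Eiger login node
podman build -t my-image:latest .
```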
Jupyter web portal¶
Jupyter sessions on Daint or Eiger for interactive notebooks. Check here for details: jupyter
Copy data to/from Daint or Eiger¶
- Use Globus
- High bandwidth transfers
- A globus transfer requires 2 globus endpoints (one of which can be your personal machine with globus personal connect).
- CSCS and Science IT both maintain a Globus endpoint.
- Note that the CSCS endpoint requires CSCS (not UZH) credentials.
- Instructions on using Globus to/from UZH (Science IT ScienceCluster endpoint)
- Globus high speed file transfer service
- More on setting up and using Globus
- CSCS endpoint (Daint/Eiger) <---> UZH Science IT endpoint (ScienceCluster)
Further info:
Globus can transfer between any 2 Globus endpoints, including a Globus Connect Personal endpoint.
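Besides the web interface, transfers can be scripted with the globus-cli package (endpoint UUIDs and paths below are placeholders):

```bash
pip install globus-cli
globus login                          # opens a browser for authentication
globus transfer --recursive \
    "<SOURCE_ENDPOINT_UUID>:/<source_path>" \
    "<DEST_ENDPOINT_UUID>:/<destination_path>"
```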
How to run jobs on Daint¶
Slurm job submission script for Daint¶
- Key ingredients:
  - Specify the number of `--nodes` (1 or more)
  - Various ways to specify the number of GPUs (per node) and tasks (or MPI ranks):
    - `--ntasks-per-node` - the number of parallel tasks (MPI tasks, for example)
      - alternatively `--ntasks-per-gpu`
    - `--cpus-per-task` or `--cpus-per-gpu` - number of CPU cores per task or per GPU
  - Specify the run time - 24 hour maximum; a shorter time allows backfilling
  - Memory does NOT need to be specified because you get the entire node
  - The number of GPUs does NOT need to be specified because you get all 4 GPUs on a node
  - Optional: `--job-name`; `--output`; `--mail-user`
  - Optional: `--gpus-per-task` - number of GPUs per task
  - Optional: procedure for multiple tasks (ranks) per GPU
Further info:
More details of Daint slurm scripts
Warning
Daint is a powerful resource. Please check your job submission script very carefully before submitting, then monitor your job with `squeue --me` to make sure that you are only requesting the intended amount of resources.
Note
A task is a process (a unit of a program) that can be placed on any node. Each task can use multiple CPUs, specified with `--cpus-per-task`.
Daint slurm script with uenv¶
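A minimal sketch of such a script (the uenv name, resource numbers, and application binary are assumptions; the `--uenv`/`--view` batch options come from the uenv Slurm plugin described in the CSCS docs):

```bash
#!/bin/bash
#SBATCH --job-name=uenv-example
#SBATCH --nodes=2                   # whole nodes are always allocated
#SBATCH --ntasks-per-node=4         # e.g. one MPI rank per GPU
#SBATCH --cpus-per-task=72          # 288 cores / 4 tasks per node
#SBATCH --time=04:00:00             # 24 h maximum; shorter aids backfilling
#SBATCH --output=%x-%j.out
#SBATCH --uenv=prgenv-gnu/24.11:v1  # uenv Slurm plugin option
#SBATCH --view=default

srun ./my_mpi_gpu_app               # hypothetical application binary
```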
Daint slurm script with uv python virtual environment¶
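A minimal sketch (the venv path and training script are illustrative; if you packed the venv into a squashfs file, mount it as described in the CSCS uv instructions):

```bash
#!/bin/bash
#SBATCH --job-name=uv-venv-example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4         # e.g. one task per GPU
#SBATCH --cpus-per-task=72
#SBATCH --time=02:00:00
#SBATCH --output=%x-%j.out

# Activate the uv-created virtual environment (illustrative path)
source "$SCRATCH/venvs/myproject/bin/activate"

srun python train.py                # hypothetical training script
```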
Submit a compute job and monitor it¶
- Submit the submission script `myjob.sh` using the `sbatch` command: `sbatch myjob.sh`
- Upon a successful job submission, you will receive a message that reads `Submitted batch job <number>`, where `<number>` is the job's assigned ID. The job ID can be used for other operations (i.e., monitoring or canceling the job).
- If the resources are available, the job starts immediately. If not, it waits in the queue.
Tips on job monitoring:
- Pending or running jobs: `squeue --jobs jobID` or `squeue --me`
- Completed jobs: `sacct`
- Look for the file written in the working directory: since the job runs in batch mode, the output from Slurm is put in a file.
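A typical submit-and-monitor sequence (the job ID is illustrative; the output file name depends on your `--job-name`/`--output` settings):

```bash
sbatch myjob.sh          # -> Submitted batch job 123456
squeue --jobs 123456     # state while the job is pending or running
sacct -j 123456 --format=JobID,JobName,Elapsed,State,ExitCode
cat myjob-123456.out     # Slurm output written in the submit directory
```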
Further info:
Warning
Consider using an array if you need to submit a large set of jobs. Limit your simultaneous jobs with e.g. `#SBATCH --array=1-16%4`. If each of your individual jobs runs for less than a few minutes, especially for I/O-heavy jobs, consider putting them into a single job (i.e., batches of analyses that run sequentially within a script).
Determining the resource requirements¶
- How do I choose the number of nodes?
  - If the code needs lots of memory, too few nodes means running out of memory
  - Use resources efficiently
    - Fewer nodes might result in better efficiency, longer run times, and shorter queue waits
    - Is 1 node enough?
  - More nodes might result in shorter run times
- Benchmarking
  - Some codes do not scale well
  - Feel free to refer to the Scientific Workflow Course GitLab Repo (and see the scaling-test sketch below)
Further info:
You should spend at least a small amount of time benchmarking your code before you scale it across your entire dataset, otherwise your resource requests are made with incomplete information and you may be requesting an inappropriate amount of resources.
Please keep in mind the following notes when selecting your resource requests for each job submission:
- Requesting more nodes may not make your code run faster
- Request an amount of time based on your best estimates of how long the code will need plus a small buffer.
- When possible, implement checkpointing in your code so you can easily restart the job and continue where your code stopped.
Science IT teaches a semester course called Scientific Workflows wherein we teach the basics of monitoring and benchmarking. We'll notify users via our Newsletter every time this course is offered, so make sure to read your monthly newsletters from us.
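A quick way to collect first scaling numbers (the job script name is hypothetical): submit the same job at a few node counts and compare elapsed times afterwards.

```bash
for n in 1 2 4 8; do
    sbatch --nodes="$n" --job-name="scale_$n" myjob.sh
done
# After the jobs finish:
sacct --name=scale_1,scale_2,scale_4,scale_8 --format=JobName,NNodes,Elapsed,State
```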
Understanding job priorities: Why is my job not running?¶
- Check squeue for possible issues
- The next job to run will be the job with the highest priority, with exceptions:
  - Backfilling -- a small, short job skips the queue if it fits before the next scheduled job
  - Debug partition for small, very short jobs, < 30:00 (use `--partition=debug`)
  - A higher-priority project could submit a new job and land near the front of the queue
- Job priority depends both on the recent usage within your project and on total UZH usage
  - Job size does not affect priority, but small jobs fit more easily
- Have patience. It will start eventually; if the start is soon, squeue may give an estimated start time.
  - The system can be very busy at times; jobs could wait in the queue for many days
  - The priority of a pending job increases with time
  - UZH shares Daint and Eiger with a large number of other users
- The UZH quarterly quota is finite. If it is used up, you will get an error at job submission.
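Two standard Slurm commands help you inspect a pending job yourself (estimated start times are only shown once the scheduler has computed them):

```bash
squeue --me --start   # estimated start times for your pending jobs, if known
sprio -j <jobid>      # priority components (age, fair-share, ...) of a pending job
```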
Be careful!¶
- The system is very powerful, so mistakes can be expensive and can impact all UZH users
- Check job scripts carefully before submitting
- Monitor your jobs and usage
- After submitting a job, check that you have requested the intended resources: `squeue --me`; `scontrol show job JOBID`
- Babysit/test new workflows
- Are your running jobs progressing as expected (i.e. job is not hanging)?
- Regularly check project usage on the Accounting and resources management tool
- Files & data
- Use Scratch for I/O - Scratch files are deleted automatically after 30 days
- transfer results/output files immediately (consider unexpected delays/downtime)
- scratch is not backed up. What will you do if system disaster destroys all files?
- Home/Store for small important files (scripts, source code, param files, etc.)
- use git for version control (and backup)
- do not rely on filesystem backups (Home, Store)
- what if: backup fails? you accidentally delete critical files? etc.?
- Try to avoid directories with thousands of small files - (and check your
quota
) - Regular maintenance every wednesday morning 08:00-10:00
Further info:
git and other best practices for ScienceCluster
Communications about outages, issues, and extra maintenance are sent via email from CSCS
Be aware how your usage affects other users or the system¶
- Limit your number of simultaneous jobs
- Be very careful with automation
- e.g. Do not run squeue (or other slurm queries) every second
- No workloads on the login nodes
Getting help¶
- Search the CSCS docs.
- Check the CSCS Daint slack channel discussions. Your question may already be answered there.
- Contact Science IT
- If we cannot help, we will direct you to open a ticket with CSCS, which manages the hardware and software on Daint & Eiger
- To get support from CSCS, open a Jira ticket