
Supercomputer Training

Introduction to the Grace Hopper Daint Cluster

on the Alps Supercomputer


Science IT Overview

  • We provide computing infrastructure:
  • We provide support, training, and consulting:
    • Application support
    • Training to use infrastructure
    • Expert Consulting examples:
      • Specialized advice for Science IT hardware;
      • Assistance with workflows or scripts;
      • Scaling up compute work from laptop to cluster;
      • Code optimization including porting to GPU or enabling parallelization;

Further info:


Supercomputer Service Summary

  • Daint and Eiger are clusters on the larger Alps system
    • managed by CSCS and available through a partnership between CSCS and UZH
  • Alps was recently ranked as the 8th fastest supercomputer in the world [1]



Daint
  • 500+ nodes
  • Node: 4 Nvidia Grace Hopper modules; 288 ARM cores; 4 x Nvidia Hopper GPUs; 800+ GB unified memory
  • UZH share: 24 nodes, 50k node hours per quarter

Eiger
  • 500+ nodes
  • Node: 2 x AMD Epyc Rome CPUs; 128 x86-64 cores; 256 GB RAM
  • UZH share: 65 nodes, 140k node hours per quarter

Further info:

  • "Supercomputer"" – The Alps supercomputer (CSCS, Lugano)
  • Alps - general info
  • Daint documentation - details about the Daint cluster, managed by CSCS
  • Eiger documentation - details about the Eiger cluster, managed by CSCS
  • [1] top500 list - June 2025
  • Depending on usage patterns, the ratio of the UZH Daint to Eiger share could be adjusted in the future.

How do I choose between Supercomputer or ScienceCluster?


Type of computation                                         Suitable resource
Single-core CPU job                                         ScienceCluster
Many single-core CPU jobs (or a job array)                  ScienceCluster
Multi-core CPU parallel job, < ~100 cores                   ScienceCluster
Multi-core CPU parallel job, > ~100 cores                   Eiger, Daint?
Job uses 1 GPU                                              ScienceCluster
Job uses 2 GPUs                                             ScienceCluster
Job uses 4+ GPUs                                            Daint, ScienceCluster?
Job uses many GPUs                                          Daint
Needs a fast, low-latency network between nodes (e.g. MPI)  Eiger, Daint
Application requires an x86-64 (AMD/Intel) CPU              ScienceCluster, Eiger
Application requires an ARM CPU                             Daint

Further info:

  • MPI (Message Passing Interface) is a communication library used by many parallel applications (e.g., physical simulations).
  • ScienceCloud may also be suitable for single node (<32 cores) or single GPU workloads

GPU jobs: Choosing Daint versus ScienceCluster


Daint                                                    ScienceCluster
Hopper GPUs                                              Hopper, Ampere, Volta GPUs
Integrated Grace (ARM) CPU                               x86-64 (Intel or AMD) CPU
Grace-Hopper unified memory (NVLink CPU-to-GPU)          Separate CPU and GPU memory (H100, A100, V100)
Larger multi-node jobs possible                          Can be a long wait for multiple GPUs
Low-latency, high-performance network (Cray Slingshot)   Some high-performance InfiniBand; some Ethernet
High-performance parallel filesystem

Further info:

More details about memory layout?


Getting access to Supercomputer

  • Current Eiger users get access to Daint.

  • For new projects, contact Science IT to set up a Supercomputer project.

    • We coordinate with CSCS to create the project.
    • Once a project is created, PIs and deputies can add users to or remove users from their project (separately for Eiger and Daint) using the CSCS Account and resources management tool.
  • Cost contributions for the Supercomputer service are pay-per-use

    • Daint and Eiger usage is included in the monthly usage reports (along with ScienceCluster/ScienceApps and ScienceCloud).
  • If you need a very large amount of compute time on Daint or Eiger, please consider submitting a proposal to one of the allocation schemes available through CSCS.

Further info:


Getting Started on Daint - background skills/knowledge

  • Important background skills
    • Bash/Command line
    • Slurm
    • Parallel Computing/HPC
      • To utilize Daint efficiently you will need an application that can utilize multiple GPUs
      • Preferably your application should also utilize many cores per node

Further info:

  • Simple Linux Utility for Resource Management
  • Fun fact: the acronym is derived from a popular cartoon series by Matt Groening titled Futurama. Slurm is the name of the most popular soda in the galaxy.

Slurm Soda - It's Highly Addictive!

Logging onto Daint

  • Details are found in the CSCS Daint documentation under Getting started

  • ssh access is only via an ssh key obtained from CSCS, which is valid for only 24 hours.

    • Recommended: configure ssh on your laptop with a ~/.ssh/config entry for Daint (see the sketch after this list)
  • Daint and Eiger are reached from the same "jump host"

    • Daint and Eiger are separate clusters with separate slurm queues
      • i.e., log in to daint.alps to run jobs on Daint; log in to eiger.alps to run jobs on Eiger.
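A minimal sketch of a possible ~/.ssh/config entry. The host names ela.cscs.ch and daint.alps.cscs.ch follow the pattern in the CSCS Getting started documentation, and the key path is only an example; check the CSCS docs for the exact values for your account:

    # ~/.ssh/config
    Host ela
        HostName ela.cscs.ch
        User <your_cscs_username>
        IdentityFile ~/.ssh/cscs-key

    Host daint
        HostName daint.alps.cscs.ch
        User <your_cscs_username>
        IdentityFile ~/.ssh/cscs-key
        ProxyJump ela

With such an entry, ssh daint takes you through the jump host in one step.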

Further info:


File systems on Daint

  • Storage at CSCS

  • Filesystems on Alps

  • scratch (/capstor/scratch) on Daint/Eiger is temporary storage intended for job I/O (see the sketch below)

    • Strict 30 day automatic deletion policy
    • 150TB maximum quota
    • No cost contributions
  • A very small quota is available on /store, which is a persistent high performance filesystem.
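A minimal sketch of the intended workflow, assuming $SCRATCH points to your scratch directory (as in the job-script examples later on this page); /store/<your-project> is a placeholder for your project's persistent space:

    # stage input data onto scratch before the job
    cp -r /store/<your-project>/input $SCRATCH/run01/
    cd $SCRATCH/run01 && sbatch myjob.sh

    # copy results off scratch promptly after the job (30-day deletion applies)
    cp -r $SCRATCH/run01/results /store/<your-project>/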

Further info:

If you need expanded storage on /store, please contact Science IT. We can help request a quote from CSCS for a reserved quota for the desired length of time. Note that additional storage will not be subsidized by UZH and is subject to availability.


Migrating from Eiger to Daint

  • Make sure that your software/application can utilize multiple GPUs -- if it can only use CPUs, then Eiger is preferred.

  • Getting started

  • Software, Applications, and Environments:

    • Use uenv instead of the old simple module load
      • modules are available within uenv (hint: --view=modules)
      • If using community applications, check whether what you use is available
      • Containers can be built and run on Daint
      • If installing software, consider Spack
  • Any software, container, or virtual environment that you built for Eiger will need to be rebuilt/recompiled for Daint because of the different CPU architecture.

  • Daint has no simultaneous multi-threading (SMT), so there is no need to set --ntasks-per-core; it is 1 by default.

Further info:

uenv is also available on Eiger


Migrating from ScienceCluster to Daint

  • Getting started - CSCS User Docs

  • A slurm job on Daint (and Eiger) always gets allocated an entire node (unlike ScienceCluster).

    • Make sure that your software can reasonably efficiently utilize a node (4 GPUs and 288 CPU cores) before deciding to migrate workflows to Daint.
  • Any software, container, or virtual environment that you built for ScienceCluster will need to be rebuilt/recompiled for Daint because of the different CPU architecture.

  • Use uenv instead of a simple "module load"

    • Modules are available within uenv (hint: --view=modules)
    • If using community applications, check whether what you use is available
    • Containers can be built and run on Daint
    • If installing software, consider Spack
  • Use uv instead of conda/mamba, following these uv instructions: https://docs.cscs.ch/guides/storage/#python-virtual-environments-with-uenv

  • Daint login nodes have GPUs, suitable for compiling (not true for ScienceCluster)


Similarities to Eiger and ScienceCluster


Upon a successful login to Daint, you will arrive at one of multiple login nodes, each of which has identical hardware to the compute nodes.

  • Don't: execute your code directly on login nodes.
  • Do: submit a job to the job scheduler / resource manager (Slurm)

Further info:

The login nodes are for text editing, file, code and data management on the cluster. Nothing computationally intensive should be run on the login nodes.

Warning

Executing or running your code or scripts directly on login nodes will compromise the integrity of the system and harm your own and other users' experience on the cluster.


Tools for setting up your environment

  • Many choices available - Use what works and is most comfortable (or simplest)
    • Programming environments
    • uenv
      • access to "programming environments" (tools, compilers, libraries)
      • access to applications maintained by CSCS
      • modules available within uenv
      • possible (advanced) to build your own uenv
    • uv (python virtual environments)
    • uenv+spack (building software from a recipe)
    • containers (run 3rd party containers or build your own)
  • Tip: Consider using /dev/shm for performance - RAM as a temporary virtual filesystem (see the sketch below)
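A minimal sketch of using /dev/shm inside a job script; the application, its flags, and the file names are placeholders:

    # stage hot files into RAM-backed storage on the node
    TMPWORK=/dev/shm/$SLURM_JOB_ID
    mkdir -p "$TMPWORK"
    cp $SCRATCH/input.dat "$TMPWORK"/

    # run with working files in RAM, then copy results back before the job ends
    <application> --input "$TMPWORK"/input.dat --output "$TMPWORK"/output.dat
    cp "$TMPWORK"/output.dat $SCRATCH/
    rm -rf "$TMPWORK"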

Setting up your environment: uenv

  • uenv images exist for programming environments (compilers, libraries, etc.), tools, applications
  • Useful commands:

    • uenv image find
    • uenv image pull
    • uenv status
    • uenv image ls
    • uenv start uenv_name
  • Modules (e.g., from prgenv-gnu) are available with uenv start prgenv-gnu/24.11:v1 --view=modules, then module avail (see the sketch below)
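A minimal sketch of that workflow; the image name and version are just the example used on this page, so use uenv image find to see what is currently available:

    uenv image find prgenv-gnu            # list available prgenv-gnu images
    uenv image pull prgenv-gnu/24.11:v1   # download the image into your local repository
    uenv image ls                         # confirm it is available locally
    uenv start prgenv-gnu/24.11:v1 --view=modules
    module avail                          # modules provided by the uenv's module view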

Further info:


Python virtual environments on Daint

  • Use cases for virtual environments
    • When is it better to use one than a container? Or in addition to a container?
    • Want latest pytorch?
      • install with uv (instead of the pre-installed pytorch uenv)
  • Recommended to use uv, a fast Python package manager (see the sketch below)
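A minimal sketch, assuming uv is already installed (see the install link below) and that you work inside an active uenv that provides Python; the package names are only examples:

    uv venv .venv                      # create a virtual environment in ./.venv
    source .venv/bin/activate
    uv pip install torch numpy         # install packages into the environment
    python -c "import torch; print(torch.__version__)"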

Further info:

Instructions to install uv. You can also use venv, conda, or mamba.


Spack: Building software with uenv and Spack

  • Spack is a package manager for scientific software
    • Many applications have a Spack recipe, which allows you to build from source, including all dependencies (see the sketch after this list).
    • If there is no uenv image for your application, check whether it has a spack build recipe
    • Instructions on building with spack+uenv
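For a flavour of what a Spack build looks like in general (gromacs is just an example package; the spack+uenv instructions linked above describe the Daint-specific setup):

    spack info gromacs           # show available versions and build variants
    spack spec gromacs +mpi      # preview the build, including all dependencies
    spack install gromacs +mpi   # build from source, with dependencies
    spack load gromacs           # make the installed package available in the shell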

Further info:


Containers on Daint

Further info:


jupyter web portal

Jupyter sessions on Daint or Eiger can be used for interactive notebooks; see the CSCS documentation on Jupyter for details.

Further info:


Copy data to/from Daint or Eiger


  • Use Globus
  • High bandwidth transfers
  • A Globus transfer requires 2 Globus endpoints (one of which can be your personal machine with Globus Connect Personal).
  • CSCS and Science IT both maintain a Globus endpoint.
    • Note that the CSCS endpoint requires CSCS (not UZH) credentials.
  • Instructions on using Globus to/from UZH (Science IT ScienceCluster endpoint)

Further info:

Globus can transfer between any 2 Globus endpoints, including a Globus Connect Personal endpoint


How to run jobs on Daint


Slurm job submission script for Daint

  • Key ingredients:
  • Specify the number of nodes with --nodes (1 or more)
  • Various ways to specify the number of GPUs (per node) and tasks (or MPI ranks)
    • --ntasks-per-node - the number of parallel tasks per node (MPI tasks, for example)
    • alternatively --ntasks-per-gpu
    • --cpus-per-task or --cpus-per-gpu - number of CPU cores per task or per GPU
  • Specify the run time (--time) - 24 hour maximum. A shorter time limit allows backfilling.
  • Memory does NOT need to be specified because you get the entire node
  • The number of GPUs does NOT need to be specified because you get all 4 GPUs on the node
  • optional: --job-name; --output; --mail-user
  • optional: --gpus-per-task - number of GPUs per task
  • optional: procedure for multiple tasks (ranks) per GPU

Further info:

More details of Daint slurm scripts

Warning

Daint is a powerful resource. Please check your job submission script very carefully before submitting and then monitor your job with squeue --me to make sure that you are only requesting the intended amount of resources.

Note

A task is a process (a unit of your program) that can be placed on any node. Each task can use multiple CPU cores, as specified by --cpus-per-task.
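For example, in the scripts below, --nodes=1 with --ntasks-per-node=4, --gpus-per-task=1 and --cpus-per-task=72 places one task on each of the node's 4 GPUs and uses all 4 x 72 = 288 ARM cores of a Daint node.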


Daint slurm script with uenv

#!/usr/bin/bash -l

### Comment lines start with ## or #+space
### Slurm option lines start with #SBATCH

### Here are the SBATCH parameters that you should always consider:
#SBATCH --time=0-00:05:00     ## days-hours:minutes:seconds
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4  
#SBATCH --gpus-per-task=1    
#SBATCH --cpus-per-task=72   
#SBATCH --job-name=uenvtest1    ## optional


## Load the uenv via its Slurm plugin rather than "uenv start" (which opens an
## interactive shell and is not suitable inside a batch script).
## In this case the "default" view of prgenv-gnu provides python, cray-mpich,
## and other useful tools.
#SBATCH --uenv=prgenv-gnu/24.11:v1
#SBATCH --view=default

### Optional info to print

echo 'Starting.'
hostname    ## Prints the system hostname
date        ## Prints the system date
nvidia-smi
lscpu

srun <application> <command line arguments>

echo 'Finished.'

Daint slurm script with uv python virtual environment

#!/usr/bin/bash -l

### Here are the SBATCH parameters that you should always consider:
#SBATCH --time=0-00:05:00     ## days-hours:minutes:seconds
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4  ##
#SBATCH --gpus-per-task=1    ##
#SBATCH --cpus-per-task=72   ##  
#SBATCH --job-name=uvtest1  ## optional



### optional info to print
echo 'hello starting.'
hostname    ## Prints the system hostname
date        ## Prints the system date
nvidia-smi
lscpu

## Example: run inside a uv virtual environment that has been packed into a
## squashfs file (py_venv.squashfs), as described in the CSCS uv/uenv guide.
## "uenv start" opens an interactive shell, so in a batch script the uenv and
## the mounted squashfs are passed to srun via the uenv Slurm plugin instead.
## The "default" view of prgenv-gnu provides python, cray-mpich, and other
## useful tools.

cd $SCRATCH/sqfs-demo       ## location of your prepared uv environment

srun --uenv=prgenv-gnu/24.11:v1,$PWD/py_venv.squashfs:$SCRATCH/sqfs-demo/.venv \
     --view=default \
     bash -c "source $SCRATCH/sqfs-demo/.venv/bin/activate && python myscript.py"


echo 'finished'

Submit a compute job and monitor it

  • Submit the submission script myjob.sh using the sbatch command:

    sbatch myjob.sh
    
  • Upon a successful job submission, you will receive a message that reads Submitted batch job <number>, where <number> is the job's assigned ID. The job ID can be used for other operations (e.g., monitoring or cancelling the job).
    If the resources are available, the job starts immediately. If not, it waits in the queue.

Tips on job monitoring

  • pending or running jobs: squeue --jobs jobID or squeue --me
  • completed jobs: sacct
  • Look for the output file written in the submission directory: since the job runs in batch mode, the output from Slurm is written to a file. Example monitoring commands are sketched below.
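A minimal sketch (the job ID 123456 is a placeholder):

    squeue --me                       # all of my pending and running jobs
    squeue --jobs 123456              # one specific job
    scontrol show job 123456          # full details of what the job requested
    sacct -j 123456 --format=JobID,JobName,Elapsed,State,ExitCode   # after completion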

Further info:

Warning

Consider using an array if you need to submit a large set of jobs. Limit your simultaneous jobs with e.g. #SBATCH --array=1-16%4 (see the sketch below). If each of your individual jobs takes less than a few minutes, especially for I/O-heavy jobs, consider putting them into a single job (i.e., batches of analyses that run sequentially within a script).
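A minimal array-job sketch (the analysis script and input file names are placeholders):

    #SBATCH --array=1-16%4      ## 16 array tasks, at most 4 running at the same time
    srun ./analyse_sample.sh input_${SLURM_ARRAY_TASK_ID}.dat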


Determining the resource requirements

  • How do I choose the number of nodes?
    • If code needs lots of memory, too few nodes means running out of memory
    • Use resources efficiently
    • Fewer nodes might result in better efficiency, longer run times, shorter queue wait.
      • Is 1 node enough?
    • More nodes might result in shorter run time
    • Benchmarking

Further info:

You should spend at least a small amount of time benchmarking your code before you scale it across your entire dataset, otherwise your resource requests are made with incomplete information and you may be requesting an inappropriate amount of resources.

Please keep in mind the following notes when selecting your resource requests for each job submission:

  • Requesting more nodes may not make your code run faster
  • Request an amount of time based on your best estimates of how long the code will need plus a small buffer.
  • When possible, implement checkpointing in your code so you can easily restart the job and continue where your code stopped.

Science IT teaches a semester course called Scientific Workflows wherein we teach the basics of monitoring and benchmarking. We'll notify users via our Newsletter every time this course is offered, so make sure to read your monthly newsletters from us.


Understanding job priorities: Why is my job not running?

  • Check squeue for possible issues
  • The next job to run will be the job with the highest priority, with exceptions:

    • Backfilling -- a small short job skips the queue if it fits before the next scheduled job.
    • Debug partition for small, very short jobs (< 30 minutes): use --partition=debug
    • A higher priority project could submit a new job and land near the front of the queue
  • Job priority depends on both the recent usage within your project and the total UZH usage.

    • Job size does not affect priority but small jobs fit more easily
  • Have patience: it will start eventually; if it will start soon, squeue may give an estimated start time.

    • The system can be very busy at times; jobs could wait in the queue for many days.
    • The priority of a pending job increases with time.
    • UZH shares Daint and Eiger with a large number of other users.
  • The UZH quarterly quota is finite. If it is used up, you will get an error at job submission.

Warning

  • The system is very powerful, so mistakes can be expensive and can impact all UZH users
  • Check job scripts carefully before submitting
  • Monitor your jobs and usage
    • After submitting a job, check that you have requested the intended resources
    • squeue --me; scontrol show job JOBID
    • Babysit/test new workflows
    • Are your running jobs progressing as expected (i.e. job is not hanging)?
    • Regularly check project usage on the Accounting and resources management tool
  • Files & data
  • Use Scratch for I/O - Scratch files are deleted automatically after 30 days
    • transfer results/output files immediately (consider unexpected delays/downtime)
    • scratch is not backed up. What will you do if a system disaster destroys all files?
  • Home/Store for small important files (scripts, source code, param files, etc.)
    • use git for version control (and backup)
    • do not rely on filesystem backups (Home, Store)
    • what if: backup fails? you accidentally delete critical files? etc.?
  • Try to avoid directories with thousands of small files - (and check your quota)
  • Regular maintenance every Wednesday morning, 08:00-10:00

Further info:

git and other best practices for ScienceCluster

Communications about outages, issues, extra maintenance via email from CSCS


Be aware how your usage affects other users or the system

  • Limit your number of simultaneous jobs
  • Be very careful with automation
    • e.g. Do not run squeue (or other slurm queries) every second
  • No workloads on the login nodes

Further info:


Getting help

  • Search the CSCS docs.
  • Check the CSCS Daint slack channel discussions. Your question may already be answered there.
  • Contact Science IT
    • If we cannot help, we will direct you to open a ticket with CSCS, who manage the hardware and software on Daint & Eiger
  • To get support from CSCS, open a Jira ticket