Install PyTorch on GPU nodes

This tutorial demonstrates the basics of how to create a PyTorch environment on GPU compute nodes of ScienceCluster.

Creating an environment on a GPU node

After connecting from a terminal to the ScienceCluster, work through the following steps.

  1. Load a specific GPU module (one that you plan to use)

    module load l4
    
  2. Request an interactive session, which allows the package installer to see the GPU hardware.

    srun --pty -n 1 -c 2 --time=01:00:00 --gpus=1 --partition lowprio --mem=8G bash -l
    
  3. Confirm the GPU is available. The output should show basic information about at least one GPU.

    nvidia-smi
    
  4. Load the Apptainer module

    module load apptainer
    
  5. Create a container image from an appropriately versioned PyTorch Docker image

    apptainer pull pytorch.sif docker://pytorch/pytorch:2.12.0-cuda13.2-cudnn9-runtime
    
  6. Confirm that the GPU is correctly detected

    apptainer exec --nv pytorch.sif python - <<'EOB'
    import torch as t
    print(f"torch={t.__version__} cuda={t.version.cuda} available={t.cuda.is_available()}")
    for i in range(t.cuda.device_count()):
        print(f"{t.cuda.get_device_name(i)}: {t.cuda.get_device_properties(i)}")
    EOB
    
  7. When finished with your test, exit the interactive session

    exit
    

Note

You should always confirm that your chosen base Docker image correctly recognizes the GPU hardware you require (i.e., always test GPU availability in your environment when trying a new base Docker image). The example PyTorch image (docker://pytorch/pytorch:2.12.0-cuda13.2-cudnn9-runtime) works with the l4 hardware.
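
One way to make this check less error-prone is to compare the versions encoded in the image tag with what PyTorch reports inside the container. The following is a minimal sketch, assuming the tag follows the `<torch>-cuda<cuda>-cudnn<n>-<flavor>` layout of the example image; `parse_image_tag` and `matches_report` are hypothetical helper names, not part of PyTorch or Apptainer.

```python
import re

# Assumed tag layout, taken from the example image in this tutorial:
#   <torch version>-cuda<cuda version>-cudnn<major>-<flavor>
TAG_PATTERN = re.compile(
    r"^(?P<torch>\d+\.\d+\.\d+)-cuda(?P<cuda>\d+\.\d+)"
    r"-cudnn(?P<cudnn>\d+)-(?P<flavor>\w+)$"
)


def parse_image_tag(image):
    """Split a docker:// image reference into its version components."""
    tag = image.rsplit(":", 1)[-1]
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag layout: {tag!r}")
    return match.groupdict()


def matches_report(image, torch_version, cuda_version):
    """Compare the tag with the values of t.__version__ and t.version.cuda."""
    parts = parse_image_tag(image)
    # t.__version__ may carry a local suffix such as "+cu132"; drop it.
    return (
        torch_version.split("+")[0] == parts["torch"]
        and cuda_version == parts["cuda"]
    )


image = "docker://pytorch/pytorch:2.12.0-cuda13.2-cudnn9-runtime"
print(parse_image_tag(image))
print(matches_report(image, "2.12.0+cu132", "13.2"))
```

A mismatch does not necessarily mean the image is unusable, but it is a hint to repeat the GPU test from step 6 before relying on the image.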

Using a virtual environment in ScienceApps

If you would like to use a PyTorch environment with Jupyter and ScienceApps, see the documentation about installing the environment as an IPython kernel.

Preparing a job submission script

Single Node GPU Jobs

Once the container image has been pulled and tested, it can be run from within the job submission script.

First, create a file called examplecode.py with the following command:

cat << 'EOF' > examplecode.py
import torch as t

print(f"torch={t.__version__} cuda={t.version.cuda} available={t.cuda.is_available()}")
for i in range(t.cuda.device_count()):
    print(f"{t.cuda.get_device_name(i)}: {t.cuda.get_device_properties(i)}")

x = t.randn(1000, 1000, device="cuda")
y = t.randn(1000, 1000, device="cuda")
z = t.matmul(x, y)
print("Computation successful on GPU!")
print("Result norm:", z.norm().item())
EOF

Then, similarly create the submission script:

cat << 'EOF' > submission.sh
#!/usr/bin/bash -l
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --gpus=1
module load apptainer
apptainer exec --nv pytorch.sif python examplecode.py
EOF

You can check the contents of these files with cat examplecode.py and cat submission.sh.

Note

Note that the --gpus=1 flag is included in this batch submission script. Without it, Slurm will not allocate a GPU for your job and the code will fail.

To request more than one GPU on the same node, change --gpus=1 to --gpus=X, where X is the number of GPUs you need. X cannot exceed the number of GPUs available on a single node.
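
If you adjust the GPU count often, it may be convenient to generate the submission script instead of editing it by hand. The sketch below reproduces the directives of the script shown earlier; `render_submission` and its parameters are hypothetical names chosen for illustration.

```python
def render_submission(gpus=1, time="00:05:00", cpus=1, mem="4GB",
                      image="pytorch.sif", script="examplecode.py"):
    """Return the text of a submission script like the one above."""
    if gpus < 1:
        raise ValueError("request at least one GPU, otherwise the example fails")
    lines = [
        "#!/usr/bin/bash -l",
        f"#SBATCH --time={time}",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --gpus={gpus}",
        "module load apptainer",
        f"apptainer exec --nv {image} python {script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a two-GPU variant; redirect the output to a file to use it with sbatch.
    print(render_submission(gpus=2), end="")
```

The helper only builds text. It does not know how many GPUs a node actually has, so the upper limit still has to be checked against the hardware you target.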

Important

If you request more than 1 GPU, you must also ensure your specific code makes use of a multi-GPU environment. If it does not, requesting multiple GPUs will not make your code run faster or improve your workflow.
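
As an illustration of what "making use of a multi-GPU environment" can mean, the sketch below spreads several independent matrix products over all visible GPUs in round-robin order. It is only one simple pattern under stated assumptions, not a recommended training setup; `assign_devices` is a hypothetical helper, and the script falls back to the CPU when no GPU is visible.

```python
def assign_devices(n_tasks, n_gpus):
    """Map each task index to a device string in round-robin order."""
    if n_gpus < 1:
        return ["cpu"] * n_tasks
    return [f"cuda:{i % n_gpus}" for i in range(n_tasks)]


def main(n_tasks=4):
    try:
        import torch as t
    except ImportError:
        print("PyTorch is not importable here; run this inside the container.")
        return
    devices = assign_devices(n_tasks, t.cuda.device_count())
    norms = []
    for dev in devices:
        # CUDA kernels are launched asynchronously, so work queued on
        # different GPUs can overlap.
        x = t.randn(1000, 1000, device=dev)
        y = t.randn(1000, 1000, device=dev)
        norms.append(t.matmul(x, y).norm())
    for dev, norm in zip(devices, norms):
        # .item() copies the value to the host and waits for that device.
        print(dev, norm.item())


if __name__ == "__main__":
    main()
```

With --gpus=2, the four products would alternate between cuda:0 and cuda:1. Code that only ever uses device="cuda" runs on the first GPU and leaves the others idle.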

Multi Node GPU Jobs

Due to our current ScienceCluster hardware configuration, we recommend users make use of Piz Daint (via the Supercomputer service) for workflows requiring multi-node GPU functionality.

Submitting the job

To submit this script for processing (after the container image has been pulled), run

# to ensure you receive the same type of GPU as above
module load l4

# then submit the job
sbatch submission.sh

When submitted, the console should print a message similar to

Submitted batch job <jobid>

where <jobid> is the Job ID numeric code assigned by the Slurm batch submission system.
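
If you submit jobs from a script, you may want to capture the Job ID for later use. The following is a minimal sketch that assumes the message format shown above; `parse_job_id` and `submit` are hypothetical helper names, and `submit` only works where the sbatch command is available.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the numeric Job ID from the message printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no Job ID found in: {sbatch_output!r}")
    return int(match.group(1))


def submit(script="submission.sh"):
    """Run sbatch on the given script and return the Job ID."""
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


# 123456 is a made-up Job ID used only to illustrate the parsing.
print(parse_job_id("Submitted batch job 123456"))
```

On a login node, `job_id = submit()` would submit submission.sh and return the ID that also appears in the name of the slurm-<jobid>.out file.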

Understanding job outputs

When the job runs to completion (provided your code does not produce any errors), all files written by your script should be in their designated locations. In addition, unless you specified otherwise, a file named slurm-<jobid>.out should exist in the directory from which you submitted the script. This file contains the printed output from your job. Examine it to confirm that the computation was successful. You should see output similar to:

torch=2.12.0+cu132 cuda=13.2 available=True
NVIDIA L4: _CudaDeviceProperties(name='NVIDIA L4', major=8, minor=9, total_memory=22563MB, ...)
Computation successful on GPU!
Result norm: 31602.669921875

If you do not see the lines starting with "Computation successful" and "Result norm", something almost certainly went wrong!
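
This check can also be scripted. For two independent n×n matrices with standard normal entries, the expected squared Frobenius norm of their product is n^3, so the reported norm should be close to n^1.5, about 31623 for n = 1000. The sketch below looks for both lines and compares the norm with that value; `check_output` is a hypothetical helper and the 5% tolerance is an arbitrary assumption.

```python
import re

NORM_LINE = re.compile(
    r"^Result norm:\s*([0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)\s*$",
    re.MULTILINE,
)


def check_output(text, n=1000, rel_tol=0.05):
    """Return True if the job output contains both expected lines
    and the norm is within rel_tol of n ** 1.5."""
    if "Computation successful on GPU!" not in text:
        return False
    match = NORM_LINE.search(text)
    if match is None:
        return False
    expected = n ** 1.5
    return abs(float(match.group(1)) - expected) <= rel_tol * expected


# Sample text copied from the expected output shown above.
sample = """torch=2.12.0+cu132 cuda=13.2 available=True
Computation successful on GPU!
Result norm: 31602.669921875
"""
print(check_output(sample))
```

To apply it to a real job, read the contents of the slurm-<jobid>.out file and pass the text to `check_output`.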