Install PyTorch on GPU nodes¶
This tutorial shows how to create a PyTorch environment on the GPU compute nodes of the ScienceCluster.
Creating an environment on a GPU node¶
After connecting from a terminal to the ScienceCluster, work through the following steps.
- Load a specific GPU module (the one that you plan to use).
- Request an interactive session, which allows the package installer to see the GPU hardware.
- Confirm that the GPU is available. The output should show basic information about at least one GPU.
- Load the apptainer module.
- Create a virtual environment container from an appropriately versioned PyTorch Docker image.
- Confirm that the GPU is correctly detected.
- When finished with your test, exit the interactive session.
Note
You should always confirm that your chosen base Docker image correctly recognizes the GPU hardware you require; that is, always test GPU availability in your environment when trying a new base Docker image. The PyTorch example image (docker://pytorch/pytorch:2.12.0-cuda13.2-cudnn9-runtime) works with the L4 hardware.
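As a minimal sketch of such a check, the script below reports what PyTorch can see from inside the container. The file name check_gpu.py and the function gpu_report are hypothetical and not part of the ScienceCluster tooling. The script uses only standard torch.cuda calls, and it still returns a plain report when PyTorch cannot be imported, which usually means it was started outside the container.

```python
# check_gpu.py (hypothetical file name): report what PyTorch can see.
def gpu_report():
    """Return a dict describing the PyTorch/CUDA setup visible to this process."""
    try:
        import torch
    except ImportError:
        # Most likely the script was started outside the container.
        return {"torch": None, "cuda_available": False, "devices": []}
    return {
        "torch": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        # One entry per GPU that this process is allowed to use.
        "devices": [torch.cuda.get_device_name(i)
                    for i in range(torch.cuda.device_count())],
    }


report = gpu_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Inside the interactive session, it could be run in the same way that the submission script later in this tutorial runs examplecode.py, for example with apptainer exec --nv pytorch.sif python check_gpu.py, assuming the image was saved as pytorch.sif. If cuda_available is False or the device list is empty, the image and the GPU module probably do not match.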
Using a virtual environment in ScienceApps¶
If you would like to use a PyTorch environment with Jupyter and ScienceApps, see the documentation about installing the environment as an IPython kernel.
Preparing a job submission script¶
Single Node GPU Jobs¶
Once the virtual environment container has been created, it can be run from within the job submission script.
First, create a file called examplecode.py with the following command:
cat << EOF > examplecode.py
import torch as t

# Report the PyTorch build and whether CUDA is usable
print(f"torch={t.__version__} cuda={t.version.cuda} available={t.cuda.is_available()}")
# List every GPU visible to this job
for i in range(t.cuda.device_count()):
    print(f"{t.cuda.get_device_name(i)}: {t.cuda.get_device_properties(i)}")
# Multiply two random matrices on the GPU as a simple test
x = t.randn(1000, 1000, device="cuda")
y = t.randn(1000, 1000, device="cuda")
z = t.matmul(x, y)
print("Computation successful on GPU!")
print("Result norm:", z.norm().item())
EOF
Then, similarly create the submission script:
cat << EOF > submission.sh
#!/usr/bin/bash -l
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --gpus=1
module load apptainer
apptainer exec --nv pytorch.sif python examplecode.py
EOF
You can check the contents of these files with cat examplecode.py and cat submission.sh.
Note
Please observe that the --gpus=1 flag is included in this batch submission script. Otherwise, Slurm will not allocate a GPU for your job and the code will fail.
To request more than one GPU on the same node, change --gpus=1 to --gpus=X, where X is the number of GPUs you want on that node. X cannot exceed the number of GPUs available on a single node.
Important
If you request more than 1 GPU, you must also ensure your specific code makes use of a multi-GPU environment. If it does not, requesting multiple GPUs will not make your code run faster or improve your workflow.
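To illustrate what making use of a multi-GPU environment can look like, here is a minimal sketch that repeats the matrix product from examplecode.py once on every GPU visible to the job. The function name matmul_on_all_gpus is hypothetical. Real workloads would more commonly rely on a higher-level mechanism such as torch.nn.parallel.DistributedDataParallel; the point here is only that each device has to be addressed explicitly by the code.

```python
def matmul_on_all_gpus(size=1000):
    """Run one matrix product per visible GPU; return {device index: result norm}."""
    try:
        import torch
    except ImportError:
        return {}  # PyTorch not importable, e.g. when run outside the container
    norms = {}
    for i in range(torch.cuda.device_count()):
        device = torch.device(f"cuda:{i}")
        # Tensors are created directly on GPU i, so the product runs there.
        x = torch.randn(size, size, device=device)
        y = torch.randn(size, size, device=device)
        norms[i] = torch.matmul(x, y).norm().item()
    return norms


norms = matmul_on_all_gpus()
print(f"used {len(norms)} GPU(s): {norms}")
```

With --gpus=1 the dictionary would be expected to contain one entry. A job that requested more GPUs should report one entry per allocated device, assuming the cluster exposes only the allocated GPUs to the job.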
Multi Node GPU Jobs¶
Due to our current ScienceCluster hardware configuration, we recommend that users use Piz Daint (via the Supercomputer service) for workflows requiring multi-node GPU functionality.
Submitting the job¶
To submit this script for processing (after the virtual environment container has been created), run
# to ensure you receive the same type of GPU as above
module load l4
# then submit the job
sbatch submission.sh
When submitted, the console should print a message similar to
Submitted batch job <jobid>
where <jobid> is the numeric Job ID assigned by the Slurm batch submission system.
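If you drive submissions from a script, the Job ID can be extracted from that message. The sketch below assumes the standard sbatch confirmation format, "Submitted batch job" followed by the number; the helper name parse_job_id and the sample ID are hypothetical.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric Job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no Job ID found in: {sbatch_output!r}")
    return int(match.group(1))


# Sample message with a made-up Job ID
job_id = parse_job_id("Submitted batch job 1234567")
print(f"slurm-{job_id}.out")  # default name of the job's output file
```

The Job ID is also what you would pass to other Slurm commands, such as squeue or scancel, to inspect or cancel the job.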
Understanding job outputs¶
When the job runs to completion (provided your submitted code does not produce any errors), all files written by your script should be in their designated locations. Unless you specified otherwise, a file named slurm-<jobid>.out should also exist in the directory from which you submitted the script. This file contains the printed output from your job. Examine it to confirm that the computation was successful. You should see output similar to:
torch=2.12.0+cu132 cuda=13.2 available=True
NVIDIA L4: _CudaDeviceProperties(name='NVIDIA L4', major=8, minor=9, total_memory=22563MB, ...)
Computation successful on GPU!
Result norm: 31602.669921875
If you do not see the lines starting with "Computation successful" and "Result norm", something almost certainly went wrong!
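This check can also be automated. The sketch below scans a job output file for the two marker lines printed by examplecode.py. The function name job_succeeded is hypothetical, and scripting the check this way is an illustrative assumption rather than a ScienceCluster tool.

```python
from pathlib import Path

# Line prefixes that examplecode.py prints when the GPU computation works.
MARKERS = ("Computation successful", "Result norm")


def job_succeeded(output_file):
    """Return True if every marker prefix starts some line of the output file."""
    lines = Path(output_file).read_text().splitlines()
    return all(any(line.startswith(m) for line in lines) for m in MARKERS)
```

For a job with the hypothetical ID 1234567, job_succeeded("slurm-1234567.out") would return True only if both marker lines are present. A False result means the output file deserves a closer look for error messages.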