Example Installations for Python-based Machine Learning Programming on GPU Nodes

This tutorial shows how to create a Python environment with specific packages, in this case TensorFlow and PyTorch, on the GPU compute nodes of ScienceCluster.

Creating an environment for TensorFlow on a GPU node

After connecting to ScienceCluster from a terminal, work through the following steps:

# load the gpu module

module load gpu

# request an interactive session, which allows the package installer to see the GPU hardware

srun --pty -n 1 -c 2 --time=01:00:00 --gpus=1 --mem=8G bash -l

# (optional) confirm the GPU is available. The output should show basic information about at least
# one GPU.

nvidia-smi

# use mamba (drop-in replacement for conda)

module load mamba

# create a virtual environment named 'tf' and install packages

mamba create -n tf -c conda-forge tensorflow cudatoolkit-dev

# activate the virtual environment

source activate tf

# confirm that the GPU is correctly detected

python -c 'import tensorflow as tf; print("Built with CUDA:", tf.test.is_built_with_cuda()); print("Num GPUs Available:", len(tf.config.list_physical_devices("GPU"))); print("TF version:", tf.__version__)'

# when finished with your test, exit the interactive cluster session

conda deactivate
exit
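
The optional nvidia-smi check above can also be scripted, for example to log the allocated hardware at the start of a job. The following is a minimal sketch using only the Python standard library. The helper names parse_gpu_csv and list_gpus are hypothetical, and the sample line is an illustrative placeholder rather than output captured on ScienceCluster.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory' CSV lines into a list of (name, memory) tuples."""
    gpus = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so that GPU names containing commas survive.
        name, _, memory = line.rpartition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def list_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


# Illustrative input only; real output depends on the node you are allocated.
sample = "Tesla V100-SXM2-32GB, 32768 MiB\n"
print(parse_gpu_csv(sample))
print("GPUs visible from this session:", list_gpus())
```

An empty list from list_gpus() suggests that the session was started without a GPU or that nvidia-smi is not on the PATH.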

Creating an environment for PyTorch on a GPU node

After connecting to ScienceCluster from a terminal, work through the following steps:

# load the gpu module

module load gpu

# request an interactive session, which allows the package installer to see the GPU hardware

srun --pty -n 1 -c 2 --time=01:00:00 --gpus=1 --mem=8G bash -l

# (optional) confirm the GPU is available. The output should show basic information about at least
# one GPU.

nvidia-smi

# use mamba (drop-in replacement for conda)

module load mamba

# create a virtual environment named 'torch' and install packages

mamba create -n torch -c pytorch -c nvidia pytorch torchvision torchaudio pytorch-cuda

# activate the virtual environment

source activate torch

# confirm that the GPU is correctly detected

python -c 'import torch as t; print("is available: ", t.cuda.is_available()); print("device count: ", t.cuda.device_count()); print("current device: ", t.cuda.current_device()); print("cuda device: ", t.cuda.device(0)); print("cuda device name: ", t.cuda.get_device_name(0)); print("cuda version: ", t.version.cuda)'

# when finished with your test, exit the interactive cluster session

conda deactivate
exit
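
The one-line check above calls functions such as current_device() and get_device_name(0), which raise an error when no CUDA device is usable. As a hedged alternative, the sketch below gathers the same information but only queries device names after is_available() returns True. The function name describe_cuda is hypothetical and is not part of PyTorch. If PyTorch is not installed, the function simply reports that.

```python
def describe_cuda():
    """Collect basic CUDA facts from PyTorch without failing on CPU-only hosts."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "cuda_version": torch.version.cuda,  # None for CPU-only builds
        "device_names": [],
    }
    # Only query device names when CUDA is usable; these calls raise otherwise.
    if info["cuda_available"]:
        info["device_names"] = [
            torch.cuda.get_device_name(i) for i in range(info["device_count"])
        ]
    return info


for key, value in describe_cuda().items():
    print(f"{key}: {value}")
```

Saved to a file, this can be run with python inside the activated torch environment on a GPU node.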

Using this virtual environment in ScienceApps

If you would like to use your TensorFlow or PyTorch environment with Jupyter in ScienceApps, see the documentation on installing the environment as an IPython kernel.

Preparing a job submission script

Single Node GPU Jobs

Once the virtual environment is created and packages installed, it can then be activated from within the job submission script.

First, create a file called examplecode.py, in this case for TensorFlow, with the following command:

cat << EOF > examplecode.py
import tensorflow as tf
print("Built with CUDA:", tf.test.is_built_with_cuda())
print()
print("Tensorflow version:", tf.__version__)
print()
print(tf.config.list_physical_devices("GPU"))
print()
print("Preparing a test case...")
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print("Compiling a model...")
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
print("Training the model...")
model.fit(x_train, y_train, epochs=5)
print("Evaluating the model...")
model.evaluate(x_test,  y_test, verbose=2)
print("Done")
EOF

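Then create the submission script in the same way. The heredoc delimiter is quoted ('EOF') so that $CONDA_PREFIX is written to the file literally and expanded only when the job runs:

cat << 'EOF' > tfsubmission.sh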
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --gpus=1
module load mamba
source activate tf
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/pkgs/cuda-toolkit
python examplecode.py
EOF

You can check the contents of these files with cat examplecode.py and cat tfsubmission.sh.

Important

The XLA_FLAGS variable is set to prevent the "libdevice not found" error that may occur during training with TensorFlow v2.11 and later. In our tests, this fix was sufficient. If you still get the error with XLA_FLAGS set, try the other approaches outlined in the official installation guide.
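
If the error persists, it can help to check whether the directory named in XLA_FLAGS actually contains libdevice. The sketch below is a diagnostic aid under one stated assumption: that XLA looks for the bitcode files in an nvvm/libdevice subdirectory of the CUDA data directory. The helper names find_libdevice and xla_flags_for are hypothetical.

```python
import os
from pathlib import Path


def find_libdevice(cuda_data_dir):
    """Return the libdevice bitcode files found under <cuda_data_dir>/nvvm/libdevice."""
    # Assumption: XLA searches this subdirectory of the CUDA data directory.
    libdevice_dir = Path(cuda_data_dir) / "nvvm" / "libdevice"
    if not libdevice_dir.is_dir():
        return []
    return sorted(libdevice_dir.glob("libdevice*.bc"))


def xla_flags_for(conda_prefix):
    """Build the same XLA_FLAGS value that the submission script exports."""
    return f"--xla_gpu_cuda_data_dir={conda_prefix}/pkgs/cuda-toolkit"


conda_prefix = os.environ.get("CONDA_PREFIX", "")
if conda_prefix:
    data_dir = Path(conda_prefix) / "pkgs" / "cuda-toolkit"
    print("XLA_FLAGS would be:", xla_flags_for(conda_prefix))
    print("libdevice files found:", [p.name for p in find_libdevice(data_dir)])
else:
    print("CONDA_PREFIX is not set; activate the environment first.")
```

Run it inside the activated tf environment. An empty list indicates that the path in XLA_FLAGS does not point at a directory containing libdevice.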

Note

Note that the --gpus=1 flag is included in this batch submission script. Without it, Slurm will not allocate a GPU for your job and the code will run only on CPUs.

To request more than 1 GPU on the same node, change --gpus=1 to --gpus=X, where X is the number of GPUs you want on that node. X cannot exceed the maximum number of GPUs available on a single node.
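
When the same job is submitted with different GPU counts, generating the script can be less error-prone than editing it by hand. The following sketch mirrors the contents of tfsubmission.sh shown above. The function render_submission and its parameters are hypothetical names introduced for illustration.

```python
def render_submission(gpus=1, env_name="tf", script="examplecode.py"):
    """Return the text of a single-node submission script requesting `gpus` GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU, otherwise the job runs on CPUs only")
    lines = [
        "#!/bin/bash",
        "#SBATCH --time=00:15:00",
        "#SBATCH --ntasks=1",
        "#SBATCH --cpus-per-task=1",
        "#SBATCH --mem=4GB",
        f"#SBATCH --gpus={gpus}",
        "module load mamba",
        f"source activate {env_name}",
        # Kept literal so that the shell expands it when the job runs.
        "export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/pkgs/cuda-toolkit",
        f"python {script}",
    ]
    return "\n".join(lines) + "\n"


print(render_submission(gpus=2))
```

The returned text can be written to a file and submitted with sbatch in the usual way.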

Important

If you request more than 1 GPU, you must also ensure your specific code makes use of a multi-GPU environment. If it does not, requesting multiple GPUs will not make your code run faster or improve your workflow.
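
For TensorFlow, one common way to use several GPUs on a single node is tf.distribute.MirroredStrategy. The sketch below is not part of the original example. It rebuilds the model from examplecode.py inside the strategy scope, and the function name build_mirrored_model is hypothetical. It returns None when TensorFlow is not installed, so it can be tried outside the tf environment as well.

```python
def build_mirrored_model():
    """Build the tutorial's MNIST model under MirroredStrategy, if TensorFlow is present."""
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # MirroredStrategy replicates the model on every GPU visible to the job
    # and falls back to a single replica when only one device is available.
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas in sync:", strategy.num_replicas_in_sync)

    # Variables must be created inside the scope to be mirrored across GPUs.
    with strategy.scope():
        model = tf.keras.models.Sequential([
            tf.keras.Input(shape=(28, 28)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )
    return model


model = build_mirrored_model()
print("TensorFlow not installed" if model is None else "Model ready for model.fit(...)")
```

With --gpus=2, the reported number of replicas should be 2. If it stays at 1, the additional GPU is not being used by the code.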

Multi Node GPU Jobs

Requesting multiple GPUs across different nodes for a single job requires special preparation, not only in the data analysis code but also in the Slurm submission script. For example, you can request and run an example PyTorch multi-node, multi-GPU job with the following submission script:

#!/bin/bash

#SBATCH --job-name=jobname
#SBATCH --output=%x%j.out
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-node=2
#SBATCH --constraint=GPUMEM32GB
#SBATCH --mem-per-gpu=32G
#SBATCH --cpus-per-gpu=2
#SBATCH --time=00:10:00

module load mamba
source activate torch

# Node networking section
head_node_ip=$(hostname --ip-address)
echo Node IP: $head_node_ip
export LOGLEVEL=INFO

# Analytical code
srun torchrun \
--nnodes $SLURM_JOB_NUM_NODES \
--nproc_per_node $SLURM_GPUS_PER_NODE \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $head_node_ip:29500 \
~/data/examples/distributed/ddp-tutorial-series/multinode.py 50 10

This submission script has been adapted from this PyTorch Slurm example. You can clone the corresponding GitHub repo using git clone https://github.com/pytorch/examples.git, and the submission script presumes the repo was cloned in the user's ~/data directory. It also assumes that you run module load multigpu before you submit the script.

You can find a video description of it as well as additional documentation on the multinode.py script here.

The Slurm submission script has the following notable features:

- The recommended SBATCH flags for a multi-node setup:
  - --nodes=2; select your total number of nodes.
  - --ntasks=2; this parameter should match the --nodes parameter.
  - --gpus-per-node=2; adjust this parameter to select how many GPUs you want on each node.
  - --mem-per-gpu=32G; this flag should match the amount of GPU memory in your selected GPU model.
  - --cpus-per-gpu=2; request at least 2 CPUs per GPU, and increase this number only if your code is also CPU-parallelized. Otherwise you will pay for unused resources.
  - --constraint=GPUMEM32GB; other options include A100 (when requesting only A100 GPUs) or GPUMEM16GB (when specifically requesting 16GB V100 GPUs).
  - Submitting with these parameters requests a total of 4 V100 32GB GPUs across 2 nodes. Adjusting these parameters lets you request an identical number of GPUs on each of several nodes. Although it is possible to request an uneven number of GPUs per node, the setup for such a submission is beyond the scope of this example.
- The torchrun command arguments, which include Slurm variables derived from the SBATCH parameters:
  - $SLURM_JOB_NUM_NODES directs PyTorch to run across the requested number of nodes.
  - $SLURM_GPUS_PER_NODE directs PyTorch to run across the requested number of GPUs on each node.
- A node networking section, where a head node is appointed to coordinate the analysis across the worker nodes.
  - Note: if someone else sharing your node also uses port 29500 (see the line --rdzv_endpoint $head_node_ip:29500), you may need to change this to an alternative port number (e.g., 29505) to ensure your traffic doesn't collide.
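
To make the relationship between the SBATCH parameters and the torchrun arguments concrete, the sketch below assembles the same command from the two Slurm variables and computes the resulting number of worker processes. It assumes both variables hold plain integers, as they do for the submission script above. The function torchrun_command, the address 10.0.0.1, and the rendezvous ID 1234 are hypothetical values chosen for illustration.

```python
def torchrun_command(env, script, script_args=(), port=29500,
                     head_node="127.0.0.1", rdzv_id=1234):
    """Assemble the torchrun argument list from Slurm environment variables."""
    nnodes = int(env["SLURM_JOB_NUM_NODES"])
    gpus_per_node = int(env["SLURM_GPUS_PER_NODE"])
    command = [
        "torchrun",
        "--nnodes", str(nnodes),
        "--nproc_per_node", str(gpus_per_node),
        "--rdzv_id", str(rdzv_id),
        "--rdzv_backend", "c10d",
        "--rdzv_endpoint", f"{head_node}:{port}",
        script, *script_args,
    ]
    # One worker process per GPU, so the world size is nodes x GPUs per node.
    world_size = nnodes * gpus_per_node
    return command, world_size


# Values that Slurm would set for --nodes=2 and --gpus-per-node=2.
example_env = {"SLURM_JOB_NUM_NODES": "2", "SLURM_GPUS_PER_NODE": "2"}
command, world_size = torchrun_command(
    example_env, "multinode.py", ["50", "10"], port=29505, head_node="10.0.0.1"
)
print(" ".join(command))
print("Total worker processes (one per GPU):", world_size)
```

Passing a different port, as in this call, corresponds to the port change suggested in the note above.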

Keep in mind that you will necessarily need to adjust your data analysis code in ways specific to your chosen modelling framework (TensorFlow, PyTorch, etc.). This example, alongside the materials from PyTorch, serves as a guide for adapting your own code to Slurm on the ScienceCluster.

Submitting the job

To submit this script for processing (after the modules have been loaded and the Conda environment has been created), run:

sbatch tfsubmission.sh

When submitted, the console should print a message similar to:

Submitted batch job <jobid>

where <jobid> is the numeric job ID assigned by the Slurm batch system.
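
If you submit jobs from a script, the job ID can be extracted from this message and used to locate the default output file, slurm-<jobid>.out. The sketch below shows one way to do this. The function names are hypothetical and the job ID 123456 is made up.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def default_output_file(job_id):
    """Name of the default output file that Slurm writes for a batch job."""
    return f"slurm-{job_id}.out"


# 123456 is a made-up job ID used purely for illustration.
job_id = parse_job_id("Submitted batch job 123456\n")
print(job_id, default_output_file(job_id))
```

In practice, the input string would be the text that sbatch prints, for example captured with subprocess.run.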

Understanding job outputs

When the job runs to completion without errors, any files written by your script should be in their designated locations and, unless you specified otherwise, a file named slurm-<jobid>.out should exist in the directory from which you submitted the script. This file contains the printed output of your job. Examine it to confirm that training and evaluation were successful. In particular, a message listing the loss and accuracy of the model should appear towards the end of the output.
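
Checking the final metrics can be automated. The following sketch scans the output text for the last line that reports both a loss and an accuracy. It assumes the Keras progress lines contain fragments of the form "loss: <number>" and "accuracy: <number>", whose order and surrounding text vary between versions. The log excerpt is made up for illustration and the function name last_metrics is hypothetical.

```python
import re

# Match the numeric value that follows each metric name on a progress line.
LOSS = re.compile(r"\bloss: ([0-9.eE+-]+)")
ACCURACY = re.compile(r"\baccuracy: ([0-9.eE+-]+)")


def last_metrics(log_text):
    """Return (loss, accuracy) from the last line reporting both, or None."""
    result = None
    for line in log_text.splitlines():
        loss = LOSS.search(line)
        accuracy = ACCURACY.search(line)
        if loss and accuracy:
            result = (float(loss.group(1)), float(accuracy.group(1)))
    return result


# Made-up log excerpt for illustration; real values and layout will differ.
example_log = (
    "Training the model...\n"
    "1875/1875 - 3s - loss: 0.2950 - accuracy: 0.9140\n"
    "Evaluating the model...\n"
    "313/313 - 0s - loss: 0.0750 - accuracy: 0.9770\n"
    "Done\n"
)
print(last_metrics(example_log))
```

To apply it to a real job, read the contents of slurm-<jobid>.out and pass the text to last_metrics. A return value of None means no such line was found, which usually points to a failure earlier in the job.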