Install Python-based Machine Learning Packages (TensorFlow and PyTorch) on GPU nodes¶
This tutorial demonstrates how to create a Python environment with specific packages of interest, in this case TensorFlow and PyTorch, on the GPU compute nodes of ScienceCluster.
Creating an environment for TensorFlow on a GPU node¶
After connecting from a terminal to ScienceCluster, work through the following steps:
# load a specific gpu module (one that you plan to use)
module load l4
# request an interactive session, which allows the package installer to see the GPU hardware
srun --pty -n 1 -c 2 --time=01:00:00 --gpus=1 --mem=8G bash -l
# (optional) confirm the gpu is available. The output should show basic information about at least
# one GPU.
nvidia-smi
# load apptainer
module load apptainer
# pull a container image from an appropriately versioned TensorFlow Docker image
apptainer pull tensorflow_2.16.2-gpu.sif docker://tensorflow/tensorflow:2.16.2-gpu
# confirm that the GPU is correctly detected
apptainer exec --nv tensorflow_2.16.2-gpu.sif python -c 'import tensorflow as tf; print("Built with CUDA:", tf.test.is_built_with_cuda()); print("Num GPUs Available:", len(tf.config.list_physical_devices("GPU"))); print("TF version:", tf.__version__)'
# when finished with your test, exit the interactive cluster session
exit
Note
In the example above, an l4 GPU is specified, as is a matching version of TensorFlow (docker://tensorflow/tensorflow:2.16.2-gpu). Confirm that the version you pull matches the hardware you'll use; otherwise, the GPU functionality may fail.
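The GPU modules available on the cluster (l4 is used here as an example) can be listed before choosing one; module avail is the standard Environment Modules/Lmod command, and the exact GPU module names shown depend on the cluster configuration:
# list available modules, including GPU type modules such as l4
module avail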
Creating an environment for PyTorch on a GPU node¶
After connecting from a terminal to ScienceCluster, work through the following steps:
# load a specific gpu module (one that you plan to use)
module load l4
# request an interactive session, which allows the package installer to see the GPU hardware
srun --pty -n 1 -c 2 --time=01:00:00 --gpus=1 --mem=8G bash -l
# (optional) confirm the gpu is available. The output should show basic information about at least
# one GPU.
nvidia-smi
# load apptainer
module load apptainer
# pull a container image from an appropriately versioned PyTorch Docker image
apptainer pull pytorch.sif docker://pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime
# confirm that the GPU is correctly detected
apptainer exec --nv pytorch.sif python -c 'import torch as t; print("is available: ", t.cuda.is_available()); print("device count: ", t.cuda.device_count()); print("current device: ", t.cuda.current_device()); print("cuda device: ", t.cuda.device(0)); print("cuda device name: ", t.cuda.get_device_name(0)); print("cuda version: ", t.version.cuda)'
# when finished with your test, exit the interactive cluster session
exit
Note
Similar to the TensorFlow note above, always confirm that your chosen base Docker image correctly recognizes the GPU hardware you require (i.e., always test GPU availability in your environment when trying a new base Docker image). As shown above, the PyTorch example image (docker://pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime) works with the l4 hardware.
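Beyond device detection, a quick functional test can help confirm that computations actually run on the GPU. The following one-liner is a minimal sketch (any small tensor operation will do) that multiplies two random matrices on the GPU:
apptainer exec --nv pytorch.sif python -c 'import torch; x = torch.rand(2000, 2000, device="cuda"); print("GPU matmul result shape:", (x @ x).shape)'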
Using a virtual environment in ScienceApps¶
If you would like to use a TensorFlow or PyTorch environment with Jupyter in ScienceApps, see the documentation about installing the environment as an IPython kernel.
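As a rough sketch of what such a registration can look like (the kernel name and .sif path below are hypothetical; the linked documentation describes the supported procedure), a container-based kernel can be set up by writing a Jupyter kernelspec whose argv runs Python through the container:
# install ipykernel into your home directory so the container's Python can act as a kernel
apptainer exec tensorflow_2.16.2-gpu.sif pip install --user ipykernel
# write a kernelspec that launches the kernel through the container (adjust the .sif path)
mkdir -p ~/.local/share/jupyter/kernels/tf-gpu
cat << EOF > ~/.local/share/jupyter/kernels/tf-gpu/kernel.json
{
  "display_name": "TensorFlow GPU (container)",
  "language": "python",
  "argv": ["apptainer", "exec", "--nv", "$HOME/tensorflow_2.16.2-gpu.sif",
           "python", "-m", "ipykernel_launcher", "-f", "{connection_file}"]
}
EOF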
Preparing a job submission script¶
Single Node GPU Jobs¶
Once the container image has been pulled, it can be used from within a job submission script.
First, create a file called examplecode.py, in this case for TensorFlow, with the following command:
cat << EOF > examplecode.py
import tensorflow as tf
print("Built with CUDA:", tf.test.is_built_with_cuda())
print()
print("Tensorflow version:", tf.__version__)
print()
print(tf.config.list_physical_devices("GPU"))
print()
print("Preparing a test case...")
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print("Compiling a model...")
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
print("Training the model...")
model.fit(x_train, y_train, epochs=5)
print("Evaluating the model...")
model.evaluate(x_test, y_test, verbose=2)
print("Done")
EOF
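Optionally, before submitting a batch job, you can sanity-check the script in an interactive GPU session, using the same commands as in the environment setup above:
# request an interactive GPU session (same GPU type as before)
module load l4
srun --pty -n 1 -c 2 --time=00:30:00 --gpus=1 --mem=8G bash -l
module load apptainer
apptainer exec --nv tensorflow_2.16.2-gpu.sif python examplecode.py
exit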
Then, similarly create the submission script:
cat << EOF > tfsubmission.sh
#!/usr/bin/bash -l
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --gpus=1
module load apptainer
apptainer exec --nv tensorflow_2.16.2-gpu.sif python examplecode.py
EOF
You can check the contents of these files with cat examplecode.py and cat tfsubmission.sh.
Note
Note that the --gpus=1 flag is included in this batch submission script. Without it, Slurm will not allocate a GPU for your job and the code will run only on CPUs.
To request more than one GPU on the same node, change --gpus=1 to --gpus=X, where X is the number of GPUs you want on the single node. This flag cannot request more than the maximum number of GPUs available on any single node.
Important
If you request more than one GPU, you must also ensure that your code actually makes use of multiple GPUs. If it does not, requesting multiple GPUs will not make your code run faster or improve your workflow. A minimal sketch of one multi-GPU approach follows.
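For TensorFlow, a common single-node multi-GPU approach is tf.distribute.MirroredStrategy. The sketch below adapts the model definition from examplecode.py; the data loading and fit calls remain unchanged:
import tensorflow as tf

# MirroredStrategy replicates the model across all GPUs visible to the job
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

# build and compile the model inside the strategy scope so its variables are mirrored
with strategy.scope():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10)
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])

# model.fit(...) can then be called as usual; Keras distributes the batches across GPUs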
Multi Node GPU Jobs¶
Note
We are currently developing a new multi-node GPU example. Please check back again soon.
Submitting the job¶
To submit this script for processing (after the modules have been loaded and the container image has been pulled), simply run
# to ensure you receive the same type of GPU as above
module load l4
# then submit the job
sbatch tfsubmission.sh
When submitted, the console should print a message similar to
Submitted batch job <jobid>
where <jobid> is the Job ID numeric code assigned by the Slurm batch submission system.
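While the job is queued or running, you can check its status with the standard Slurm commands:
# list your queued and running jobs
squeue -u $USER
# show accounting details for a specific job
sacct -j <jobid>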
Understanding job outputs¶
When the job runs to completion (provided your submitted code does not produce any errors), any files output by your script will have been written to their designated locations, and a file named slurm-<jobid>.out will exist in the directory from which you submitted the script, unless you specified otherwise. This file contains the printed output from your job. Examine it to ensure that the training and evaluation were successful; in particular, you should see a message listing the loss and accuracy of the model towards the end of the output.
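For example, to inspect the output file (replace <jobid> with the ID reported by sbatch):
# print the full job output
cat slurm-<jobid>.out
# or follow the output while the job is still running
tail -f slurm-<jobid>.out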