AMD ROCm

ROCm is AMD's software stack for accelerated computing on GPUs (and CPUs). It supports the OpenCL, OpenMP and HIP (Heterogeneous-computing Interface for Portability, a compatibility layer for NVIDIA CUDA) programming models and also contains compilers, debuggers, profilers and various optimised libraries, e.g. rocBLAS, rocFFT and MIOpen.

The official documentation for ROCm can be found at https://docs.amd.com/ for ROCm 4.5 and above, and at https://rocmdocs.amd.com/en/latest/ for earlier versions.

Using ROCm on the FTP

Most of the ROCm software components are pre-installed on the FTP-X86 cluster and can be loaded using the Lmod module system. To avoid ambiguities, it is advisable to load architecture-specific modules on the compute nodes themselves, i.e. from within an allocation created with salloc on the requested node type:

$ salloc -p amd-milan-mi100 -t 01:00:00 --gpus-per-node=4

We can then load the ROCm module:

$ module load toolkit/rocm
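
If you want to see everything the module defines, you can inspect it with Lmod's show subcommand:

$ module show toolkit/rocm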

This module sets a number of paths, the most important ones being

  • $ROCM_OPENCL_INCLUDE_DIR: contains the header files for building OpenCL software
  • $ROCM_INCLUDE_DIR: contains all ROCm header files for compiling non-OpenCL software
  • $ROCM_LIB_DIR: contains all ROCm libraries
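
These variables can also be passed directly to a compiler when building OpenCL code by hand. A minimal sketch (hello_cl.c is a hypothetical source file):

$ gcc hello_cl.c -I${ROCM_OPENCL_INCLUDE_DIR} -L${ROCM_LIB_DIR} -lOpenCL -o hello_cl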

The module can be loaded on all nodes, so it is also possible to build ROCm software on, e.g., the login node. Executing the generated binaries is of course only possible on the nodes in the amd-milan-mi100, amd-milan-mi210 or amd-milan-mi250 batch system queues, which have actual MI100, MI210 or MI250 GPUs installed.
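
Once you are on one of these GPU nodes, you can check that the GPUs are visible to the ROCm runtime using the diagnostic tools that ship with ROCm (assuming the module puts them on your PATH):

$ rocm-smi             # per-GPU utilisation, temperature and memory
$ rocminfo | grep gfx  # GPU architectures seen by the runtime, e.g. gfx908 for MI100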

Intel Compiler

The ROCm software modules are incompatible with the Intel Compiler and only work with one of the compiler/gnu modules.

It is recommended to use the GNU compiler with ROCm:

$ module load compiler/gnu

Example: clpeak OpenCL benchmark

In this example we will build the clpeak OpenCL benchmark and run it on one of the amd-milan-mi100 nodes.

First get the source code:

$ git clone https://github.com/krrishnarraj/clpeak.git -b 1.1.0 --filter=blob:none

Then build the binary (on the compute node):

$ cd clpeak/
$ mkdir build
$ cd build/

$ module load toolkit/rocm

$ cmake \
    -DOpenCL_INCLUDE_DIR=${ROCM_OPENCL_INCLUDE_DIR} \
    -DOpenCL_LIBRARY=${ROCM_LIB_DIR}/libOpenCL.so \
    ..
$ make -j10

Compiling and build systems - OpenCL

Note that the OpenCL_INCLUDE_DIR and OpenCL_LIBRARY variables have to be set manually so the build system finds the header and library files. The build system used by clpeak expects the exact file name of the OpenCL library to be passed to OpenCL_LIBRARY, hence the parameter -DOpenCL_LIBRARY=${ROCM_LIB_DIR}/libOpenCL.so.

You will most likely have to do similar things for other pieces of OpenCL software and other build systems.
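
For a plain Makefile-based project, for instance, the usual approach is to pass the paths through the conventional make variables. A sketch, assuming the Makefile honours CPPFLAGS, LDFLAGS and LDLIBS:

$ make CPPFLAGS="-I${ROCM_OPENCL_INCLUDE_DIR}" \
       LDFLAGS="-L${ROCM_LIB_DIR}" \
       LDLIBS="-lOpenCL"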

Now the generated binary can be executed (on the nodes in the amd-milan-mi100 queue):

$ ./clpeak
Platform: AMD Accelerated Parallel Processing
  Device: gfx908:sramecc+:xnack-
    Driver version  : 3305.0 (HSA1.1,LC) (Linux x64)
    Compute units   : 120
    Clock frequency : 1502 MHz

    Global memory bandwidth (GBPS)
      float   : 861.79
      float2  : 869.36
      float4  : 866.03
      float8  : 923.47
      float16 : 896.76

    Single-precision compute (GFLOPS)
      float   : 22693.13
      float2  : 21823.36
      float4  : 22344.26
      float8  : 21448.58
      float16 : 21073.82

    Half-precision compute (GFLOPS)
      half   : 11367.12
      half2  : 43564.05
      half4  : 41516.54
      half8  : 40785.44
      half16 : 39853.60

    Double-precision compute (GFLOPS)
      double   : 11270.71
      double2  : 10738.98
      double4  : 10640.16
      double8  : 10526.77
      double16 : 10239.28

    Integer compute (GIOPS)
      int   : 7215.25
      int2  : 6930.94
      int4  : 6890.10
      int8  : 6925.28
      int16 : 6849.86

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 6.50
      enqueueReadBuffer          : 6.04
      enqueueMapBuffer(for read) : 94602.80
        memcpy from mapped ptr   : 6.24
      enqueueUnmap(after write)  : 210537.61
        memcpy to mapped ptr     : 6.45

    Kernel launch latency : 13.05 us
    [..]

The output is repeated for every GPU in the system, in this case four times.
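
If you want to benchmark only a single GPU, the ROCm runtime honours the ROCR_VISIBLE_DEVICES environment variable (device indices start at 0), e.g.:

$ ROCR_VISIBLE_DEVICES=0 ./clpeak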

Example: TensorFlow

A pre-built tensorflow-rocm package exists on the Python Package Index (PyPI) and can be installed in your home directory in a virtual environment using the pip command:

$ python3.9 -m venv venv      # Python 3.9: end of life 2025-10-05
$ source venv/bin/activate
$ pip install --upgrade pip
$ pip install tensorflow-rocm

Please be aware that this package is large (about 3 GB) and installation can take quite a long time.

After the installation has completed, you can run TensorFlow as usual on one of the nodes. Note that four AMD MI100 devices are detected:

$ salloc -p amd-milan-mi100 -t 01:00:00
salloc: Granted job allocation 75
$ module load toolkit/rocm
$ source venv/bin/activate             # This step may or may not be necessary:
                                       # check with "which python"
$ python                               # Open a python3.9 shell
Python 3.9.2 (default, Apr 28 2022, 06:38:23)
[GCC 8.4.1 20200928 (Red Hat 8.4.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> for device in tf.config.list_physical_devices():
...     print(device)
...
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')
>>> print(tf.reduce_sum(tf.random.normal([10000,10000])))
2022-08-23 16:32:51.199419: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-23 16:32:51.206538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31740 MB memory:  -> device: 0, name: , pci bus id: 0000:e3:00.0
2022-08-23 16:32:51.508763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 31740 MB memory:  -> device: 1, name: , pci bus id: 0000:c3:00.0
2022-08-23 16:32:51.723525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 31740 MB memory:  -> device: 2, name: , pci bus id: 0000:83:00.0
2022-08-23 16:32:51.942105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 31740 MB memory:  -> device: 3, name: , pci bus id: 0000:a3:00.0
tf.Tensor(-17405.746, shape=(), dtype=float32)

You can exit the Python shell with Ctrl+D, and then deactivate the virtual environment with

$ deactivate
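
For non-interactive runs, the same steps can be put into a batch script and submitted with sbatch. A minimal sketch (train.py is a hypothetical script; the #SBATCH options mirror the salloc flags used above):

$ cat tf_job.sh
#!/bin/bash
#SBATCH -p amd-milan-mi100
#SBATCH -t 01:00:00
#SBATCH --gpus-per-node=4
module load toolkit/rocm
source venv/bin/activate
python train.py
$ sbatch tf_job.sh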

Example: PyTorch

There is an official ROCm-enabled PyTorch version maintained by AMD that can be installed following the instructions on the PyTorch homepage. On the FTP-X86 cluster, the easiest way to do so is using the pip command:

$ python3.9 -m venv venv
$ source venv/bin/activate
$ pip install --upgrade pip
# From https://pytorch.org/get-started/locally/
$ pip install torch \
              torchvision \
              torchaudio \
              --extra-index-url https://download.pytorch.org/whl/rocm5.1.1

Please be aware that this package is also large (about 1 GB) and installation can take quite a long time.

PyTorch has no native ROCm backend

Please note that PyTorch does not have a native ROCm backend, but uses HIP to cross-compile the existing CUDA backend into something that can run on ROCm. PyTorch does not know that it is not really running on CUDA, and there is no torch.rocm context. The torch.cuda context will instead transparently execute things on the AMD GPUs as if they supported CUDA.

This is why torch.cuda.is_available() returns True when using ROCm, why torch.cuda.device_count() returns the number of ROCm GPUs, and why you can (and have to) still use device="cuda".
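
As a consequence, the usual CUDA-style mechanisms for selecting devices also apply: HIP provides the HIP_VISIBLE_DEVICES environment variable as the counterpart of CUDA_VISIBLE_DEVICES to restrict which GPUs PyTorch sees. A sketch:

$ HIP_VISIBLE_DEVICES=0,1 python -c 'import torch; print(torch.cuda.device_count())'   # should now print 2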

After the installation has completed, you can run PyTorch as usual on one of the nodes; the PyTorch package includes all necessary dependencies and libraries. Note that four AMD MI100 devices are detected:

$ salloc -p amd-milan-mi100 -t 01:00:00
salloc: Granted job allocation 75
$ source ./venv/bin/activate
$ python
>>> import torch
>>> torch.__version__
'1.12.1+rocm5.1.1'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
4
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='', major=9, minor=0, total_memory=32752MB, multi_processor_count=120)
>>> m = torch.arange(100, device=0)
>>> m
tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
        72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
        90, 91, 92, 93, 94, 95, 96, 97, 98, 99], device='cuda:0')
>>>

You can exit the Python shell with Ctrl+D, and then deactivate the virtual environment with

$ deactivate

Last update: September 28, 2023