AMD ROCm

ROCm is AMD's software stack for accelerated computing on GPUs (and CPUs). It supports the OpenCL, OpenMP and HIP (Heterogeneous-computing Interface for Portability, a compatibility layer for NVIDIA CUDA) programming models and also contains compilers, debuggers, profilers and various optimised libraries such as rocBLAS, rocRAND, hipFFT and MIOpen.

The official documentation for ROCm can be found at https://rocmdocs.amd.com/en/latest/.

Using ROCm on the FTP-X86 cluster

Most of the ROCm software components are pre-installed on the FTP-X86 cluster. They can be loaded using the Lmod module system:

$ module load toolkit/rocm
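
The full set of variables defined by the module can be inspected with Lmod's module show command:

$ module show toolkit/rocm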

This module sets a number of environment variables, the most important ones being

  • $ROCM_OPENCL_INCLUDE_DIR: contains the header files for building OpenCL software
  • $ROCM_INCLUDE_DIR: contains all ROCm header files for compiling non-OpenCL software
  • $ROCM_LIB_DIR: contains all ROCm libraries

The module can be loaded on all nodes, so it is possible to build ROCm software on, for example, the login node. Executing the generated binaries is of course only possible on one of the nodes in the amd-rome-mi100 batch system queue, which have actual MI100 GPUs installed.

Intel Compiler

The ROCm software modules are incompatible with the Intel Compiler and only work with one of the compiler/gnu modules.
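
A working combination would therefore be loaded as follows (the exact name and version of the GNU compiler module may differ on your system):

$ module load compiler/gnu
$ module load toolkit/rocm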

Example: clpeak OpenCL benchmark

In this example we will build the clpeak OpenCL benchmark and run it on one of the amd-rome-mi100 nodes.

First get the source code (on the login node):

$ git clone https://github.com/krrishnarraj/clpeak.git -b 1.1.0 --depth=1

Then build the binary (on the login node):

$ cd clpeak/
$ mkdir build
$ cd build/

$ module load toolkit/rocm

$ cmake -DOpenCL_INCLUDE_DIR=${ROCM_OPENCL_INCLUDE_DIR} -DOpenCL_LIBRARY=${ROCM_LIB_DIR}/libOpenCL.so ..
$ make -j10

Compiling and build systems

Note that the OpenCL_INCLUDE_DIR and OpenCL_LIBRARY variables have to be set manually so that the build system finds the ROCm header and library files. The CMake build used by clpeak expects the full path to the OpenCL library to be passed in OpenCL_LIBRARY, hence the parameter -DOpenCL_LIBRARY=${ROCM_LIB_DIR}/libOpenCL.so.

You will most likely have to do similar things for other pieces of OpenCL software and other build systems.
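
If a project has no build system at all, the same paths can be passed directly to the compiler. A minimal sketch, where hello_cl.c is a hypothetical OpenCL host program:

$ gcc -I${ROCM_OPENCL_INCLUDE_DIR} hello_cl.c -L${ROCM_LIB_DIR} -lOpenCL -o hello_cl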

Now the generated binary can be executed on one of the nodes in the amd-rome-mi100 queue:

$ salloc -p amd-rome-mi100 -t 01:00:00
salloc: Granted job allocation 75

$ module load toolkit/rocm

$ cd clpeak/build
$ ./clpeak

Platform: AMD Accelerated Parallel Processing
  Device: gfx908:sramecc+:xnack-
    Driver version  : 3305.0 (HSA1.1,LC) (Linux x64)
    Compute units   : 120
    Clock frequency : 1502 MHz

    Global memory bandwidth (GBPS)
      float   : 861.79
      float2  : 869.36
      float4  : 866.03
      float8  : 923.47
      float16 : 896.76

    Single-precision compute (GFLOPS)
      float   : 22693.13
      float2  : 21823.36
      float4  : 22344.26
      float8  : 21448.58
      float16 : 21073.82

    Half-precision compute (GFLOPS)
      half   : 11367.12
      half2  : 43564.05
      half4  : 41516.54
      half8  : 40785.44
      half16 : 39853.60

    Double-precision compute (GFLOPS)
      double   : 11270.71
      double2  : 10738.98
      double4  : 10640.16
      double8  : 10526.77
      double16 : 10239.28

    Integer compute (GIOPS)
      int   : 7215.25
      int2  : 6930.94
      int4  : 6890.10
      int8  : 6925.28
      int16 : 6849.86

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 6.50
      enqueueReadBuffer          : 6.04
      enqueueMapBuffer(for read) : 94602.80
        memcpy from mapped ptr   : 6.24
      enqueueUnmap(after write)  : 210537.61
        memcpy to mapped ptr     : 6.45

    Kernel launch latency : 13.05 us
    [..]

The output is repeated for every GPU in the system, in this case four times.
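
To limit a run to a subset of the GPUs, the ROCm runtime honours the ROCR_VISIBLE_DEVICES environment variable (HIP applications additionally understand HIP_VISIBLE_DEVICES). For example, the following should restrict clpeak to the first GPU:

$ ROCR_VISIBLE_DEVICES=0 ./clpeak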

Example: TensorFlow

A pre-built tensorflow-rocm package exists on the Python Package Index (PyPI) and can be installed into your home directory or a virtual environment using pip:

$ python -m venv venv
$ source venv/bin/activate
$ pip install tensorflow-rocm

Please be aware that this package is large (about 3 GB), so installation can take quite a while.

After the installation has completed, you can run TensorFlow as usual on one of the nodes. Note that four AMD MI100 devices are detected:

$ salloc -p amd-rome-mi100 -t 01:00:00
salloc: Granted job allocation 75

$ module load toolkit/rocm
$ python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([10000, 10000])))"

2021-08-18 11:05:15.801981: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libamdhip64.so
2021-08-18 11:05:16.072506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 0 with properties: 
pciBusID: 0000:63:00.0 name: Arcturus GL-XL [AMD Instinct MI100]     ROCm AMDGPU Arch: gfx908:sramecc+:xnack-
coreClock: 1.502GHz coreCount: 120 deviceMemorySize: 31.98GiB deviceMemoryBandwidth: 1.12TiB/s
2021-08-18 11:05:16.072638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 1 with properties: 
pciBusID: 0000:43:00.0 name: Arcturus GL-XL [AMD Instinct MI100]     ROCm AMDGPU Arch: gfx908:sramecc+:xnack-
coreClock: 1.502GHz coreCount: 120 deviceMemorySize: 31.98GiB deviceMemoryBandwidth: 1.12TiB/s
2021-08-18 11:05:16.072723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 2 with properties: 
pciBusID: 0000:03:00.0 name: Arcturus GL-XL [AMD Instinct MI100]     ROCm AMDGPU Arch: gfx908:sramecc+:xnack-
coreClock: 1.502GHz coreCount: 120 deviceMemorySize: 31.98GiB deviceMemoryBandwidth: 1.12TiB/s
2021-08-18 11:05:16.072811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 3 with properties: 
pciBusID: 0000:26:00.0 name: Arcturus GL-XL [AMD Instinct MI100]     ROCm AMDGPU Arch: gfx908:sramecc+:xnack-
coreClock: 1.502GHz coreCount: 120 deviceMemorySize: 31.98GiB deviceMemoryBandwidth: 1.12TiB/s
2021-08-18 11:05:16.072989: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
2021-08-18 11:05:16.078776: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libMIOpen.so
2021-08-18 11:05:16.147251: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libhipfft.so
2021-08-18 11:05:16.147932: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocrand.so
2021-08-18 11:05:16.148544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3
2021-08-18 11:05:16.149300: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-18 11:05:16.155840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 0 with properties: 
pciBusID: 0000:63:00.0 name: Arcturus GL-XL [AMD Instinct MI100]     ROCm AMDGPU Arch: gfx908:sramecc+:xnack-
coreClock: 1.502GHz coreCount: 120 deviceMemorySize: 31.98GiB deviceMemoryBandwidth: 1.12TiB/s
2021-08-18 11:05:16.155943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 1 with properties: 
pciBusID: 0000:43:00.0 name: Arcturus GL-XL [AMD Instinct MI100]     ROCm AMDGPU Arch: gfx908:sramecc+:xnack-
coreClock: 1.502GHz coreCount: 120 deviceMemorySize: 31.98GiB deviceMemoryBandwidth: 1.12TiB/s
2021-08-18 11:05:16.156031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 2 with properties: 
pciBusID: 0000:03:00.0 name: Arcturus GL-XL [AMD Instinct MI100]     ROCm AMDGPU Arch: gfx908:sramecc+:xnack-
coreClock: 1.502GHz coreCount: 120 deviceMemorySize: 31.98GiB deviceMemoryBandwidth: 1.12TiB/s
2021-08-18 11:05:16.156116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 3 with properties: 
pciBusID: 0000:26:00.0 name: Arcturus GL-XL [AMD Instinct MI100]     ROCm AMDGPU Arch: gfx908:sramecc+:xnack-
coreClock: 1.502GHz coreCount: 120 deviceMemorySize: 31.98GiB deviceMemoryBandwidth: 1.12TiB/s
2021-08-18 11:05:16.156142: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
2021-08-18 11:05:16.156165: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libMIOpen.so
2021-08-18 11:05:16.156186: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libhipfft.so
2021-08-18 11:05:16.156208: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocrand.so
2021-08-18 11:05:16.156723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3
2021-08-18 11:05:16.156769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-18 11:05:16.156783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 1 2 3 
2021-08-18 11:05:16.156796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N Y Y Y 
2021-08-18 11:05:16.156809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1:   Y N Y Y 
2021-08-18 11:05:16.156823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 2:   Y Y N Y 
2021-08-18 11:05:16.156835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 3:   Y Y Y N 
2021-08-18 11:05:16.157540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 32252 MB memory) -> physical GPU (device: 0, name: Arcturus GL-XL [AMD Instinct MI100], pci bus id: 0000:63:00.0)
2021-08-18 11:05:16.937763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 32252 MB memory) -> physical GPU (device: 1, name: Arcturus GL-XL [AMD Instinct MI100], pci bus id: 0000:43:00.0)
2021-08-18 11:05:17.707073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 32252 MB memory) -> physical GPU (device: 2, name: Arcturus GL-XL [AMD Instinct MI100], pci bus id: 0000:03:00.0)
2021-08-18 11:05:18.475006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 32252 MB memory) -> physical GPU (device: 3, name: Arcturus GL-XL [AMD Instinct MI100], pci bus id: 0000:26:00.0)
tf.Tensor(-18850.012, shape=(), dtype=float32)
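
For a quick check of device visibility without the verbose log output, tf.config.list_physical_devices can be used; on these nodes it should report four GPU entries:

$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"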

Example: PyTorch

An official ROCm-enabled PyTorch build is maintained by AMD and can be installed following the instructions on the PyTorch homepage. On the FTP-X86 cluster, the easiest way to do so is with pip:

$ python -m venv venv
$ source venv/bin/activate
$ pip install torch -f https://download.pytorch.org/whl/rocm4.2/torch_stable.html
$ pip install torchvision==0.10.0 -f https://download.pytorch.org/whl/rocm4.2/torch_stable.html

Please be aware that this package is very large (about 5.2 GB), so installation can take quite a while.

PyTorch has no ROCm backend

Please note that PyTorch does not have a native ROCm backend; instead it uses HIP to cross-compile the existing CUDA backend into something that can run on ROCm. PyTorch does not know that it is not really running on CUDA, and there is no torch.rocm context. The torch.cuda context will instead transparently execute things on the AMD GPUs as if they supported CUDA.

This is why torch.cuda.is_available() returns True when using ROCm, why torch.cuda.device_count() returns the number of ROCm GPUs, and why you still can (and have to) use device="cuda".
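
A simple way to verify that a ROCm build is in use is the torch.version.hip attribute, which the ROCm wheels set to the HIP version string (on CUDA builds it is None):

$ python -c 'import torch; print(torch.version.hip)'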

After the installation has completed, you can run PyTorch as usual on one of the nodes. The PyTorch package includes all necessary dependencies and libraries. Note that four AMD MI100 devices are detected:

$ salloc -p amd-rome-mi100 -t 01:00:00
salloc: Granted job allocation 75

$ python -c 'import torch; print(torch.__version__);'
1.9.0+rocm4.2

$ python -c 'import torch; print(torch.cuda.is_available());'
True

$ python -c 'import torch; print(torch.cuda.device_count());'
4

$ python -c 'import torch; print(torch.cuda.get_device_name(torch.cuda.current_device()));'
Arcturus GL-XL [AMD Instinct MI100]

$ python -c 'import torch; m = torch.arange(100, device="cuda:0"); print(m)'
tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
        72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
        90, 91, 92, 93, 94, 95, 96, 97, 98, 99], device='cuda:0')
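
As a slightly larger sketch (assuming the venv from above), a small model can be moved to a GPU with the usual .to(device) idiom; no ROCm-specific code is required:

import torch
import torch.nn as nn

# Select the first GPU; the device is named "cuda" even though it is an AMD GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# A tiny linear model and a random input batch, both placed on the GPU.
model = nn.Linear(128, 10).to(device)
x = torch.randn(32, 128, device=device)

# The forward pass runs on the MI100 via the HIP backend.
y = model(x)
print(y.shape)  # torch.Size([32, 10])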

Last update: August 20, 2021