GPU Offloading

The accelerated partition of the HoreKa cluster consists of nodes that draw their computing power mainly from GPUs. To use them, compute operations must be transferred from the host CPU to the GPUs, a process referred to as offloading. In recent years, four main programming approaches have emerged for performing this offloading. The following sections show how they can be used on HoreKa.

For the compilers to apply the appropriate optimizations, it is recommended to compile directly on the target architecture, i.e. it usually makes sense to compile programs with GPU offloading on compute nodes equipped with GPUs. On the HoreKa cluster you can start an interactive job on the GPU development partition for this purpose:

salloc -p dev_accelerated --gres=gpu:1 -t 60
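
Once the allocation has been granted, you can verify that a GPU is visible inside the job with nvidia-smi:

nvidia-smi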

OpenACC

OpenACC is a directive-based parallel programming model. The specification supports the C, C++ and Fortran programming languages. For more information and additional resources, visit the OpenACC homepage.

The fundamental idea of an OpenACC program is to execute all compute-intensive parts on the accelerator, while data transfers between host memory and device memory are kept to a minimum. The host CPU can execute instructions asynchronously to the accelerator. In many cases, however, it is easier to run the computation exclusively on the accelerator and to use the CPU only for auxiliary tasks.

The following simple C program shows the typical sequence of an OpenACC program:

#include <openacc.h>
#include <stdlib.h>
#include <err.h>

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Preprocess / initialize data on the host
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    #pragma acc enter data copyin(a[0:size])

    // Compute on the device
    #pragma acc parallel loop present(a[0:size]) wait
    for (int i = 0; i < size; i++) {
        a[i]++;
    }

    // Transfer data back from the device to the host
    #pragma acc update host(a[0:size])

    // Delete data on the device
    #pragma acc exit data delete(a[0:size])

    // Postprocess data on the host
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);

    return 0;
}
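
The example above executes the loop synchronously. The asynchronous execution mentioned earlier can be expressed with the async and wait clauses; the following fragment is a minimal sketch based on the loop from the example:

// Enqueue the loop on async queue 1; the host continues immediately
#pragma acc parallel loop present(a[0:size]) async(1)
for (int i = 0; i < size; i++) {
    a[i]++;
}

// ... independent work can run on the host CPU here ...

// Block until all work on async queue 1 has completed
#pragma acc wait(1)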

OpenACC programs can be compiled with different compilers. The procedure for this is as follows:

  • GNU Compiler Collection
    # Load GNU compiler environment
    module add \
        compiler/gnu
    
    # Compile C, C++ or Fortran source code with OpenACC support
    gcc      ... -fopenacc   ${C_SOURCE} -o ${EXECUTABLE}
    g++      ... -fopenacc ${CXX_SOURCE} -o ${EXECUTABLE}
    gfortran ... -fopenacc ${F90_SOURCE} -o ${EXECUTABLE}
    
  • NVIDIA High Performance Computing (HPC) SDK
    # Load NVIDIA HPC SDK environment
    module add toolkit/nvidia-hpc-sdk
    
    # Compile C, C++ or Fortran source code with OpenACC support
    nvc       ... -acc -target=gpu   ${C_SOURCE} -o ${EXECUTABLE}
    nvc++     ... -acc -target=gpu ${CXX_SOURCE} -o ${EXECUTABLE}
    nvfortran ... -acc -target=gpu ${F90_SOURCE} -o ${EXECUTABLE}
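
For example, assuming the sample program above is saved as openacc_example.c (a file name chosen here for illustration), it could be compiled with the NVIDIA compiler; the -Minfo=accel option makes the compiler report which loops were offloaded:

nvc -acc -target=gpu -Minfo=accel openacc_example.c -o openacc_example
./openacc_example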
    

OpenMP

OpenMP is a directive-based parallel programming model. The specification supports the C, C++ and Fortran programming languages. OpenMP originally targeted shared-memory systems and added support for accelerators such as GPUs in recent versions. For more information and additional resources, visit the OpenMP homepage.

The fundamental idea of an OpenMP program is to execute all compute-intensive parts on the accelerator, while data transfers between host memory and device memory are kept to a minimum. The host CPU can execute instructions asynchronously to the accelerator. In many cases, however, it is easier to run the computation exclusively on the accelerator and to use the CPU only for auxiliary tasks.

The following simple C program shows the typical sequence of an OpenMP program:

#include <omp.h>
#include <stdlib.h>
#include <err.h>

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Preprocess / initialize data on the host
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    #pragma omp target enter data map(to: a[0:size])

    // Compute on the device
    #pragma omp target teams distribute parallel for simd
    for (int i = 0; i < size; i++) {
        a[i]++;
    }

    // Transfer data back from the device to the host
    #pragma omp target update from(a[0:size])

    // Delete data on the device
    #pragma omp target exit data map(release: a[0:size])

    // Postprocess data on the host
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);

    return 0;
}
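
If the data is needed on the device only for a single construct, the explicit enter data / update / exit data sequence can be collapsed into a single directive with a map clause; a minimal sketch based on the loop from the example:

// map(tofrom: ...) copies a[] to the device before the loop
// and back to the host after it
#pragma omp target teams distribute parallel for simd map(tofrom: a[0:size])
for (int i = 0; i < size; i++) {
    a[i]++;
}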

OpenMP programs can be compiled with different compilers. The procedure for this is as follows:

  • GNU Compiler Collection
    # Load GNU compiler environment
    module add \
        compiler/gnu
    
    # Compile C, C++ or Fortran source code with OpenMP support
    gcc      ... -fopenmp ${C_SOURCE}   -o ${EXECUTABLE}
    g++      ... -fopenmp ${CXX_SOURCE} -o ${EXECUTABLE}
    gfortran ... -fopenmp ${F90_SOURCE} -o ${EXECUTABLE}
    
  • LLVM Compiler
    # Load LLVM compiler and CUDA environment
    module add \
        compiler/llvm \
        devel/cuda
    
    # Compile C or C++ source code with OpenMP support
    clang   ... -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=${CUDA_PATH} --offload-arch=sm_80 -L${LLVM_LIB_DIR}   ${C_SOURCE} -o ${EXECUTABLE}
    clang++ ... -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=${CUDA_PATH} --offload-arch=sm_80 -L${LLVM_LIB_DIR} ${CXX_SOURCE} -o ${EXECUTABLE}
    
  • NVIDIA High Performance Computing (HPC) SDK
    # Load NVIDIA HPC SDK environment
    module add \
        toolkit/nvidia-hpc-sdk
    
    # Compile C, C++ or Fortran source code with OpenMP support
    nvc       ... -mp -target=gpu   ${C_SOURCE} -o ${EXECUTABLE}
    nvc++     ... -mp -target=gpu ${CXX_SOURCE} -o ${EXECUTABLE}
    nvfortran ... -mp -target=gpu ${F90_SOURCE} -o ${EXECUTABLE}
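
To check at runtime whether offloading actually takes place, you can set the OpenMP 5.0 environment variable OMP_TARGET_OFFLOAD=MANDATORY, which makes the program abort if no usable device is found. The number of available devices can also be queried from within the program; a minimal sketch:

#include <omp.h>
#include <stdio.h>

int main(void) {
    // Returns the number of target devices available for offloading
    printf("OpenMP target devices: %d\n", omp_get_num_devices());
    return 0;
}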
    

CUDA

CUDA is a collection of accelerated libraries and extensions to the C, C++ and Fortran programming languages.

For more information and additional resources, visit the CUDA homepage. The following simple CUDA C program shows the typical sequence of a CUDA program:

#include <stdlib.h>
#include <err.h>

__global__
void inc_kernel(double *a) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    a[i]++;
}

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }
    // Allocate memory on the device
    double *d_a;
    if (cudaMalloc(&d_a, size * sizeof(double)) != cudaSuccess) {
        errx(1, "cudaMalloc d_a[] failed");
    }

    // Preprocess / initialize data on the host
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    cudaMemcpy(d_a, a, size * sizeof(double), cudaMemcpyHostToDevice);

    // Compute on the device
    inc_kernel<<<size/256, 256>>>(d_a);

    // Transfer data back from the device to the host
    cudaMemcpy(a, d_a, size * sizeof(double), cudaMemcpyDeviceToHost);

    // Delete data on the device
    cudaFree(d_a);

    // Postprocess data on the host
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);

    return 0;
}
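
Note that a kernel launch itself does not return an error code. For a more robust version of the example, the launch can be followed by explicit checks using the CUDA runtime API; a sketch:

// Compute on the device, then check for launch and execution errors
inc_kernel<<<size/256, 256>>>(d_a);
if (cudaGetLastError() != cudaSuccess) {
    errx(1, "kernel launch failed");
}
if (cudaDeviceSynchronize() != cudaSuccess) {
    errx(1, "kernel execution failed");
}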

CUDA programs can be compiled with different compilers. The procedure for this is as follows:

  • LLVM Compiler
    # Load LLVM compiler and CUDA environment
    module add \
        compiler/llvm \
        devel/cuda
    
    # Compile C or C++ source code with CUDA support
    clang   ... --cuda-gpu-arch=sm_80   ${C_SOURCE} -o ${EXECUTABLE} -lcudart
    clang++ ... --cuda-gpu-arch=sm_80 ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
    
  • NVIDIA High Performance Computing (HPC) SDK
    # Load NVIDIA HPC SDK environment
    module add \
        toolkit/nvidia-hpc-sdk
    
    # Compile C or C++ source code with CUDA support
    nvcc  ...   ${C_SOURCE} -o ${EXECUTABLE}
    nvc++ ... ${CXX_SOURCE} -o ${EXECUTABLE}
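
The GPUs in the accelerated partition have compute capability 8.0 (see the sm_80 flags used with the LLVM compiler above), so the target architecture can also be given to nvcc explicitly; assuming the sample is saved as inc.cu (an illustrative file name):

nvcc -arch=sm_80 inc.cu -o inc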
    

OpenCL

OpenCL is a low-level programming framework that gives direct control over where code is executed. Accelerator code, so-called kernels, is usually written in OpenCL C, which is based on C99. Host code is typically written in C or C++.

For more information and additional resources, visit the Khronos OpenCL homepage. The following simple C program shows the typical sequence of an OpenCL program:

#include <CL/opencl.h>
#include <stdlib.h>
#include <err.h>

const char *incSource =                 "\n" \
"__kernel void inc(                      \n" \
"   __global double* a_d,                \n" \
"   const unsigned int size) {           \n" \
"   int i = get_global_id(0);            \n" \
"   if(i < size)                         \n" \
"       a_d[i]= a_d[i] + 1.0;            \n" \
"}                                       \n";

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    cl_int errCode;

    // Obtain the first available platform.
    cl_platform_id platformID = NULL;
    cl_uint numPlatforms;
    errCode = clGetPlatformIDs(1, &platformID, &numPlatforms);
    if (errCode != CL_SUCCESS) {
        errx(1, "clGetPlatformIDs() failed");
    }

    // Obtain the first available device on the platform
    cl_device_id deviceID = NULL;   
    cl_uint numDevices;
    errCode = clGetDeviceIDs( platformID, CL_DEVICE_TYPE_DEFAULT, 1, 
                              &deviceID, &numDevices);
    if (errCode != CL_SUCCESS) {
        errx(1, "clGetDeviceIDs() failed");
    }

    // Create an OpenCL context
    cl_context context = clCreateContext( NULL, 1, &deviceID, NULL, NULL, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateContext() failed");
    }

    // Create a command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateCommandQueue() failed");
    }

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Allocate memory on the device
    cl_mem a_d = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(double), NULL, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateBuffer() failed");
    }

    // Preprocess / initialize data on the host
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    errCode = clEnqueueWriteBuffer(commandQueue, a_d, CL_TRUE, 0, size * sizeof(double), a, 0, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clEnqueueWriteBuffer() failed");
    }

    //
    // Compute on the device
    //

    // Creates a program object for a context, and loads source code specified by text strings into the program object
    cl_program program = clCreateProgramWithSource(context, 1, &incSource, NULL, &errCode);
    if (errCode != CL_SUCCESS) {
       errx(1, "clCreateProgramWithSource() failed");
    }

    // Builds (compiles and links) a program executable from the program source
    errCode = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        size_t len;
        char buffer[2048];
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
        errx(1, "clBuildProgram() failed:\n%s", buffer);
    }

    // Creates a kernel object
    cl_kernel kernel = clCreateKernel(program, "inc", &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateKernel() failed");
    }

    // Set the argument value for a specific argument of a kernel
    errCode = clSetKernelArg(kernel, 0, sizeof(cl_mem)      , &a_d);
    if (errCode != CL_SUCCESS) {
        errx(1, "clSetKernelArg() failed");
    }
    errCode = clSetKernelArg(kernel, 1, sizeof(unsigned int), &size);
    if (errCode != CL_SUCCESS) {
        errx(1, "clSetKernelArg() failed");
    }

    // Query the maximum workgroup size
    size_t local;
    errCode = clGetKernelWorkGroupInfo(kernel, deviceID, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clGetKernelWorkGroupInfo() failed");
    }

    // Enqueues a command to execute a kernel on a device
    size_t global = size;
    errCode = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clEnqueueNDRangeKernel() failed");
    }

    // Wait for command completion
    errCode = clFinish(commandQueue);
    if (errCode != CL_SUCCESS) {
         errx(1, "clFinish() failed");
    }

    // Release the kernel object
    errCode = clReleaseKernel(kernel);

    // Release the program object
    errCode = clReleaseProgram(program);

    // Release the device
    errCode = clReleaseDevice(deviceID);

    // Transfer data back from the device to the host
    errCode = clEnqueueReadBuffer(commandQueue, a_d, CL_TRUE, 0, size * sizeof(double), a, 0, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clEnqueueReadBuffer() failed");
    }

    // Delete data on the device
    errCode = clReleaseMemObject(a_d);
    if (errCode != CL_SUCCESS) {
        errx(1, "clReleaseMemObject() failed");
    }

    // Release a command queue
    errCode = clReleaseCommandQueue(commandQueue);

    // Release the context
    errCode = clReleaseContext(context);

    // Postprocess data on the host
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);

    return 0;
}
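
The example requests CL_DEVICE_TYPE_DEFAULT, which on machines with several OpenCL platforms may resolve to a CPU device. To make sure a GPU is selected, the device type can be restricted; a minimal variation of the corresponding call from the example:

// Request a GPU device explicitly instead of the platform default
errCode = clGetDeviceIDs(platformID, CL_DEVICE_TYPE_GPU, 1,
                         &deviceID, &numDevices);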

OpenCL programs can be compiled with different compilers. The procedure for this is as follows:

  • GNU Compiler Collection
    # Load GNU compiler and CUDA environment
    module add \
        compiler/gnu \
        devel/cuda
    
    # Compile C or C++ source code with OpenCL support
    gcc      ...   ${C_SOURCE} -o ${EXECUTABLE} -lOpenCL
    g++      ... ${CXX_SOURCE} -o ${EXECUTABLE} -lOpenCL
    
  • Intel Compiler
    # Load Intel compiler and CUDA environment
    module add \
        compiler/intel \
        devel/cuda
    
    # Compile C or C++ source code with OpenCL support
    icc   ...   ${C_SOURCE} -o ${EXECUTABLE} -lOpenCL
    icpc  ... ${CXX_SOURCE} -o ${EXECUTABLE} -lOpenCL
    
  • LLVM Compiler
    # Load LLVM compiler and CUDA environment
    module add \
        compiler/llvm \
        devel/cuda
    
    # Compile C or C++ source code with OpenCL support
    clang   ...   ${C_SOURCE} -o ${EXECUTABLE} -lOpenCL
    clang++ ... ${CXX_SOURCE} -o ${EXECUTABLE} -lOpenCL
    

GPU Offloading and MPI

MPI is a standardized and portable message-passing interface for distributed processes. Both point-to-point and collective communication are supported. All communication takes place between the private memory spaces of the distributed processes, so data normally has to be copied from GPU memory into this private host memory before it can be communicated. With so-called CUDA-aware MPI implementations this extra copy is not required; transfers can be performed directly from GPU memory.

The following simple MPI program demonstrates host-to-host, host-to-device and device-to-device MPI point-to-point communication:

#include <mpi.h>
#include <stdlib.h>
#include <err.h>
#include "cuda.h"
#include "cuda_runtime.h"

int main(int argc, char* argv[]) {
    // Initialize the MPI execution environment
    MPI_Init(&argc, &argv);

    // Get the size of the group associated with communicator MPI_COMM_WORLD
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the calling process in the communicator MPI_COMM_WORLD
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Allocate memory on the host
    int size = 1000;
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }
    // Allocate memory on the device
    double *d_a;
    if (cudaMalloc((void **) &d_a, size * sizeof(double)) != cudaSuccess) {
        errx(1, "cudaMalloc d_a[] failed");
    }

    // Initialize host memory
    for (int i = 0; i < size; i++) {
        a[i] = (double) world_rank;
    }

    MPI_Status status;

    // From [0],CPU to [1],GPU
    if      (world_rank == 0) {
        MPI_Send(  a, size, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    }
    else if (world_rank == 1) {
        MPI_Recv(d_a, size, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
    }

    // From [1],GPU to [0],GPU
    if      (world_rank == 1) {
        MPI_Send(d_a, size, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }
    else if (world_rank == 0) {
        MPI_Recv(d_a, size, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &status);
    }

    // From [0],GPU to [1],CPU
    if      (world_rank == 0) {
        MPI_Send(d_a, size, MPI_DOUBLE, 1, 3, MPI_COMM_WORLD);
    }
    else if (world_rank == 1) {
        MPI_Recv(  a, size, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD, &status);
    }

    // From [1],CPU to [0],CPU
    if      (world_rank == 1) {
        MPI_Send(  a, size, MPI_DOUBLE, 0, 4, MPI_COMM_WORLD);
    }
    else if (world_rank == 0) {
        MPI_Recv(  a, size, MPI_DOUBLE, 1, 4, MPI_COMM_WORLD, &status);
    }

    // Check host memory
    for (int i = 0; i < size; i++) {
        if (a[i] != 0.) {
            errx(2, "MPI transport failed");
        }
    }

    // Terminates MPI execution environment
    MPI_Finalize();

    // Delete data on the device
    cudaFree(d_a);

    // Free memory on the host
    free(a);

    return 0;
}
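
Whether an Open MPI installation was built with CUDA-aware support can be checked from within a program via the Open MPI extension header mpi-ext.h; the following sketch is specific to Open MPI:

#include <mpi.h>
#include <mpi-ext.h> // Open MPI extensions, defines MPIX_CUDA_AWARE_SUPPORT
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    // Compile-time support is present; also query the runtime
    printf("CUDA-aware support available: %d\n", MPIX_Query_cuda_support());
#else
    printf("This MPI library was built without CUDA-aware support\n");
#endif
    MPI_Finalize();
    return 0;
}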

MPI programs can be compiled with different compilers. The procedure for this is as follows:

  • GNU Compiler Collection, OpenMPI
    # Load GNU compiler, OpenMPI and CUDA environment
    module add \
        compiler/gnu \
        mpi/openmpi \
        devel/cuda
    
    # Compile C or C++ source code with CUDA-aware MPI support
    mpicc  ...   ${C_SOURCE} -o ${EXECUTABLE} -lcudart
    mpicxx ... ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
    
  • LLVM Compiler, OpenMPI
    # Load LLVM compiler, OpenMPI and CUDA environment
    module add \
        compiler/llvm  \
        mpi/openmpi \
        devel/cuda
    
    # Compile C or C++ source code with CUDA-aware MPI support
    mpicc  ...   ${C_SOURCE} -o ${EXECUTABLE} -lcudart
    mpicxx ... ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
    
  • NVIDIA High Performance Computing (HPC) SDK
    # Load NVIDIA HPC SDK environment
    module add toolkit/nvidia-hpc-sdk
    
    # Compile C or C++ source code with CUDA-aware MPI support
    mpicc  ...   ${C_SOURCE} -o ${EXECUTABLE} -lcudart
    mpicxx ... ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
    

An MPI program with accelerator support is started as usual with mpirun. In the accelerated partition of the HoreKa cluster, each node provides 2 CPUs with 38 cores each and 4 GPUs. To distribute the CPU cores evenly across the GPUs, you can start 2 ranks per socket and bind 19 cores to each rank, so that every node runs 4 ranks, one per GPU, and assign each rank its own GPU via CUDA_VISIBLE_DEVICES:

mpirun \
    --display-map \
    --display-allocation \
    --map-by ppr:2:socket:pe=19 \
    --bind-to core \
    bash -c \
        'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK};
        ${EXECUTABLE}'
