GPU Offloading

The accelerated partition of the HoreKa cluster consists of nodes that draw their computing power mainly from GPUs. To use them, compute operations must be transferred from the host CPU to the GPUs, a process referred to as offloading. In recent years, four main programming approaches have emerged for performing this offloading. The following sections show how they can be used on HoreKa.

For the compilers to apply the appropriate optimizations, it is recommended to compile directly on the target architecture, i.e. it usually makes sense to compile programs with GPU offloading on compute nodes equipped with GPUs. On the HoreKa cluster you can start an interactive job on the GPU development partition for this purpose:

salloc -p dev_accelerated --gres=gpu:1 -t 60
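
Once the allocation has been granted, you can verify that a GPU is visible inside the job with nvidia-smi:

nvidia-smi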

OpenACC

OpenACC is a directive-based parallel programming model. The specification supports the C, C++ and Fortran programming languages. For more information and additional resources, visit the OpenACC homepage.

The fundamental idea of an OpenACC program is to execute all compute-intensive parts on the accelerator, while data transfers between host memory and device memory are kept to a minimum. The host CPU can execute instructions asynchronously to the accelerator. In many cases, however, it is easier to run the computation exclusively on the accelerator and to use the CPU only for auxiliary tasks.

The following simple C program shows the typical sequence of an OpenACC program:

#include <openacc.h>
#include <stdlib.h>
#include <err.h>

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Preprocess / initialize data on the host
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    #pragma acc enter data copyin(a[0:size])

    // Compute on the device
    #pragma acc parallel loop present(a[0:size]) wait
    for (int i = 0; i < size; i++) {
        a[i]++;
    }

    // Transfer data back from the device to the host
    #pragma acc update host(a[0:size])

    // Delete data on the device
    #pragma acc exit data delete(a[0:size])

    // Postprocess data on the host
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);

    return 0;
}
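
The example above executes the loop synchronously. The asynchronous execution mentioned earlier can be expressed with the async and wait clauses; the following fragment is a minimal sketch based on the loop from the example:

// Enqueue the loop on async queue 1; the host continues immediately
#pragma acc parallel loop present(a[0:size]) async(1)
for (int i = 0; i < size; i++) {
    a[i]++;
}

// ... independent work can run on the host CPU here ...

// Block until all work on async queue 1 has completed
#pragma acc wait(1)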

OpenACC programs can be compiled with different compilers. The procedure for this is as follows:

  • GNU Compiler Collection
    # Load GNU compiler environment
    module add \
        compiler/gnu
    
    # Compile C, C++ or Fortran source code with OpenACC support
    gcc      ... -fopenacc   ${C_SOURCE} -o ${EXECUTABLE}
    g++      ... -fopenacc ${CXX_SOURCE} -o ${EXECUTABLE}
    gfortran ... -fopenacc ${F90_SOURCE} -o ${EXECUTABLE}
    
  • NVIDIA High Performance Computing (HPC) SDK
    # Load NVIDIA HPC SDK environment
    module add toolkit/nvidia-hpc-sdk
    
    # Compile C, C++ or Fortran source code with OpenACC support
    nvc       ... -acc -target=gpu   ${C_SOURCE} -o ${EXECUTABLE}
    nvc++     ... -acc -target=gpu ${CXX_SOURCE} -o ${EXECUTABLE}
    nvfortran ... -acc -target=gpu ${F90_SOURCE} -o ${EXECUTABLE}
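
For example, assuming the sample program above is saved as openacc_example.c (a file name chosen here for illustration), it could be compiled with the NVIDIA compiler; the -Minfo=accel option makes the compiler report which loops were offloaded:

nvc -acc -target=gpu -Minfo=accel openacc_example.c -o openacc_example
./openacc_example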
    

OpenMP

OpenMP is a directive-based parallel programming model. The specification supports the C, C++ and Fortran programming languages. OpenMP originally targeted shared-memory systems and added support for accelerators such as GPUs in recent versions. For more information and additional resources, visit the OpenMP homepage.

The fundamental idea of an OpenMP program is to execute all compute-intensive parts on the accelerator, while data transfers between host memory and device memory are kept to a minimum. The host CPU can execute instructions asynchronously to the accelerator. In many cases, however, it is easier to run the computation exclusively on the accelerator and to use the CPU only for auxiliary tasks.

The following simple C program shows the typical sequence of an OpenMP program:

#include <omp.h>
#include <stdlib.h>
#include <err.h>

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Preprocess / initialize data on the host
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    #pragma omp target enter data map(to: a[0:size])

    // Compute on the device
    #pragma omp target teams distribute parallel for simd
    for (int i = 0; i < size; i++) {
        a[i]++;
    }

    // Transfer data back from the device to the host
    #pragma omp target update from(a[0:size])

    // Delete data on the device
    #pragma omp target exit data map(release: a[0:size])

    // Postprocess data on the host
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);

    return 0;
}
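
If the data is needed on the device only for a single construct, the explicit enter data / update / exit data sequence can be collapsed into a single directive with a map clause; a minimal sketch based on the loop from the example:

// map(tofrom: ...) copies a[] to the device before the loop
// and back to the host after it
#pragma omp target teams distribute parallel for simd map(tofrom: a[0:size])
for (int i = 0; i < size; i++) {
    a[i]++;
}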

OpenMP programs can be compiled with different compilers. The procedure for this is as follows:

  • GNU Compiler Collection
    # Load GNU compiler environment
    module add \
        compiler/gnu
    
    # Compile C, C++ or Fortran source code with OpenMP support
    gcc      ... -fopenmp ${C_SOURCE}   -o ${EXECUTABLE}
    g++      ... -fopenmp ${CXX_SOURCE} -o ${EXECUTABLE}
    gfortran ... -fopenmp ${F90_SOURCE} -o ${EXECUTABLE}
    
  • LLVM Compiler
    # Load LLVM compiler and CUDA environment
    module add \
        compiler/llvm \
        devel/cuda
    
    # Compile C or C++ source code with OpenMP support
    clang   ... -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=${CUDA_PATH} --offload-arch=sm_80 -L${LLVM_LIB_DIR}   ${C_SOURCE} -o ${EXECUTABLE}
    clang++ ... -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=${CUDA_PATH} --offload-arch=sm_80 -L${LLVM_LIB_DIR} ${CXX_SOURCE} -o ${EXECUTABLE}
    
  • NVIDIA High Performance Computing (HPC) SDK
    # Load NVIDIA HPC SDK environment
    module add \
        toolkit/nvidia-hpc-sdk
    
    # Compile C, C++ or Fortran source code with OpenMP support
    nvc       ... -mp -target=gpu   ${C_SOURCE} -o ${EXECUTABLE}
    nvc++     ... -mp -target=gpu ${CXX_SOURCE} -o ${EXECUTABLE}
    nvfortran ... -mp -target=gpu ${F90_SOURCE} -o ${EXECUTABLE}
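
To check at runtime whether offloading actually takes place, you can set the OpenMP 5.0 environment variable OMP_TARGET_OFFLOAD=MANDATORY, which makes the program abort if no usable device is found. The number of available devices can also be queried from within the program; a minimal sketch:

#include <omp.h>
#include <stdio.h>

int main(void) {
    // Returns the number of target devices available for offloading
    printf("OpenMP target devices: %d\n", omp_get_num_devices());
    return 0;
}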
    

CUDA

CUDA is a collection of accelerated libraries and extensions to the C, C++ and Fortran programming languages.

For more information and additional resources, visit the CUDA homepage. The following simple CUDA C program shows the typical sequence of a CUDA program:

#include <stdlib.h>
#include <err.h>

__global__
void inc_kernel(double *a) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    a[i]++;
}

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }
    // Allocate memory on the device
    double *d_a;
    if (cudaMalloc(&d_a, size * sizeof(double)) != cudaSuccess) {
        errx(1, "cudaMalloc d_a[] failed");
    }

    // Preprocess / initialize data on the host
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    cudaMemcpy(d_a, a, size * sizeof(double), cudaMemcpyHostToDevice);

    // Compute on the device
    inc_kernel<<<size/256, 256>>>(d_a);

    // Transfer data back from the device to the host
    cudaMemcpy(a, d_a, size * sizeof(double), cudaMemcpyDeviceToHost);

    // Delete data on the device
    cudaFree(d_a);

    // Postprocess data on the host
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);

    return 0;
}
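
Note that a kernel launch itself does not return an error code. For a more robust version of the example, the launch can be followed by explicit checks using the CUDA runtime API; a sketch:

// Compute on the device, then check for launch and execution errors
inc_kernel<<<size/256, 256>>>(d_a);
if (cudaGetLastError() != cudaSuccess) {
    errx(1, "kernel launch failed");
}
if (cudaDeviceSynchronize() != cudaSuccess) {
    errx(1, "kernel execution failed");
}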

CUDA programs can be compiled with different compilers. The procedure for this is as follows:

  • LLVM Compiler
    # Load LLVM compiler and CUDA environment
    module add \
        compiler/llvm \
        devel/cuda
    
    # Compile C or C++ source code with CUDA support
    clang   ... --cuda-gpu-arch=sm_80   ${C_SOURCE} -o ${EXECUTABLE} -lcudart
    clang++ ... --cuda-gpu-arch=sm_80 ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
    
  • NVIDIA High Performance Computing (HPC) SDK
    # Load NVIDIA HPC SDK environment
    module add \
        toolkit/nvidia-hpc-sdk
    
    # Compile C or C++ source code with CUDA support
    nvcc  ...   ${C_SOURCE} -o ${EXECUTABLE}
    nvc++ ... ${CXX_SOURCE} -o ${EXECUTABLE}
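
The GPUs in the accelerated partition have compute capability 8.0 (see the sm_80 flags used with the LLVM compiler above), so the target architecture can also be given to nvcc explicitly; assuming the sample is saved as inc.cu (an illustrative file name):

nvcc -arch=sm_80 inc.cu -o inc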
    

OpenCL

OpenCL is a low-level programming framework that gives direct control over where code is executed. Accelerator code, so-called kernels, is usually written in OpenCL C, which is based on C99. Host code is typically written in C or C++.

For more information and additional resources, visit the Khronos OpenCL homepage. The following simple C program shows the typical sequence of an OpenCL program:

#include <CL/opencl.h>
#include <stdlib.h>
#include <err.h>

const char *incSource =                 "\n" \
"__kernel void inc(                      \n" \
"   __global double* a_d,                \n" \
"   const unsigned int size) {           \n" \
"   int i = get_global_id(0);            \n" \
"   if(i < size)                         \n" \
"       a_d[i]= a_d[i] + 1.0;            \n" \
"}                                       \n";

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    cl_int errCode;

    // Obtain the first available platform.
    cl_platform_id platformID = NULL;
    cl_uint numPlatforms;
    errCode = clGetPlatformIDs(1, &platformID, &numPlatforms);
    if (errCode != CL_SUCCESS) {
        errx(1, "clGetPlatformIDs() failed");
    }

    // Obtain the first available device on the platform
    cl_device_id deviceID = NULL;   
    cl_uint numDevices;
    errCode = clGetDeviceIDs( platformID, CL_DEVICE_TYPE_DEFAULT, 1, 
                              &deviceID, &numDevices);
    if (errCode != CL_SUCCESS) {
        errx(1, "clGetDeviceIDs() failed");
    }

    // Create an OpenCL context
    cl_context context = clCreateContext( NULL, 1, &deviceID, NULL, NULL, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateContext() failed");
    }

    // Create a command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateCommandQueue() failed");
    }

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Allocate memory on the device
    cl_mem a_d = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(double), NULL, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateBuffer() failed");
    }

    // Preprocess / initialize data on the host
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    errCode = clEnqueueWriteBuffer(commandQueue, a_d, CL_TRUE, 0, size * sizeof(double), a, 0, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clEnqueueWriteBuffer() failed");
    }

    //
    // Compute on the device
    //

    // Creates a program object for a context, and loads source code specified by text strings into the program object
    cl_program program = clCreateProgramWithSource(context, 1, &incSource, NULL, &errCode);
    if (errCode != CL_SUCCESS) {
       errx(1, "clCreateProgramWithSource() failed");
    }

    // Builds (compiles and links) a program executable from the program source
    errCode = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        size_t len;
        char buffer[2048];
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
        errx(1, "clBuildProgram() failed:\n%s", buffer);
    }

    // Creates a kernel object
    cl_kernel kernel = clCreateKernel(program, "inc", &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateKernel() failed");
    }

    // Set the argument value for a specific argument of a kernel
    errCode = clSetKernelArg(kernel, 0, sizeof(cl_mem)      , &a_d);
    if (errCode != CL_SUCCESS) {
        errx(1, "clSetKernelArg() failed");
    }
    errCode = clSetKernelArg(kernel, 1, sizeof(unsigned int), &size);
    if (errCode != CL_SUCCESS) {
        errx(1, "clSetKernelArg() failed");
    }

    // Query the maximum workgroup size
    size_t local;
    errCode = clGetKernelWorkGroupInfo(kernel, deviceID, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clGetKernelWorkGroupInfo() failed");
    }

    // Enqueues a command to execute a kernel on a device
    size_t global = size;
    errCode = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clEnqueueNDRangeKernel() failed");
    }

    // Wait for command completion
    errCode = clFinish(commandQueue);
    if (errCode != CL_SUCCESS) {
         errx(1, "clFinish() failed");
    }

    // Release the kernel object
    errCode = clReleaseKernel(kernel);

    // Release the program object
    errCode = clReleaseProgram(program);

    // Release the device
    errCode = clReleaseDevice(deviceID);

    // Transfer data back from the device to the host
    errCode = clEnqueueReadBuffer(commandQueue, a_d, CL_TRUE, 0, size * sizeof(double), a, 0, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clEnqueueReadBuffer() failed");
    }

    // Delete data on the device
    errCode = clReleaseMemObject(a_d);
    if (errCode != CL_SUCCESS) {
        errx(1, "clReleaseMemObject() failed");
    }

    // Release a command queue
    errCode = clReleaseCommandQueue(commandQueue);

    // Release the context
    errCode = clReleaseContext(context);

    // Postprocess data on the host
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);

    return 0;
}
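
The example requests CL_DEVICE_TYPE_DEFAULT, which on machines with several OpenCL platforms may resolve to a CPU device. To make sure a GPU is selected, the device type can be restricted; a minimal variation of the corresponding call from the example:

// Request a GPU device explicitly instead of the platform default
errCode = clGetDeviceIDs(platformID, CL_DEVICE_TYPE_GPU, 1,
                         &deviceID, &numDevices);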

OpenCL programs can be compiled with different compilers. The procedure for this is as follows:

  • GNU Compiler Collection
    # Load GNU compiler and CUDA environment
    module add \
        compiler/gnu \
        devel/cuda
    
    # Compile C or C++ source code with OpenCL support
    gcc      ...   ${C_SOURCE} -o ${EXECUTABLE} -lOpenCL
    g++      ... ${CXX_SOURCE} -o ${EXECUTABLE} -lOpenCL
    
  • Intel Compiler
    # Load Intel compiler and CUDA environment
    module add \
        compiler/intel \
        devel/cuda
    
    # Compile C or C++ source code with OpenCL support
    icc   ...   ${C_SOURCE} -o ${EXECUTABLE} -lOpenCL
    icpc  ... ${CXX_SOURCE} -o ${EXECUTABLE} -lOpenCL
    
  • LLVM Compiler
    # Load LLVM compiler and CUDA environment
    module add \
        compiler/llvm \
        devel/cuda
    
    # Compile C or C++ source code with OpenCL support
    clang   ...   ${C_SOURCE} -o ${EXECUTABLE} -lOpenCL
    clang++ ... ${CXX_SOURCE} -o ${EXECUTABLE} -lOpenCL
    

GPU Offloading and MPI

MPI is a standardized and portable message-passing interface for distributed processes. Both point-to-point and collective communication are supported. All communication takes place between the private memory spaces of the distributed processes, so data normally has to be copied from GPU memory into this private host memory before it can be communicated. With so-called CUDA-aware MPI implementations this extra copy is not required; transfers can be performed directly from GPU memory.

The following simple MPI program demonstrates host-to-host, host-to-device and device-to-device MPI point-to-point communication:

#include <mpi.h>
#include <stdlib.h>
#include <err.h>
#include "cuda.h"
#include "cuda_runtime.h"

int main(int argc, char* argv[]) {
    // Initialize the MPI execution environment
    MPI_Init(&argc, &argv);

    // Get the size of the group associated with communicator MPI_COMM_WORLD
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the calling process in the communicator MPI_COMM_WORLD
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Allocate memory on the host
    int size = 1000;
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }
    // Allocate memory on the device
    double *d_a;
    if (cudaMalloc((void **) &d_a, size * sizeof(double)) != cudaSuccess) {
        errx(1, "cudaMalloc d_a[] failed");
    }

    // Initialize host memory
    for (int i = 0; i < size; i++) {
        a[i] = (double) world_rank;
    }

    MPI_Status status;

    // From [0],CPU to [1],GPU
    if      (world_rank == 0) {
        MPI_Send(  a, size, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    }
    else if (world_rank == 1) {
        MPI_Recv(d_a, size, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
    }

    // From [1],GPU to [0],GPU
    if      (world_rank == 1) {
        MPI_Send(d_a, size, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }
    else if (world_rank == 0) {
        MPI_Recv(d_a, size, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &status);
    }

    // From [0],GPU to [1],CPU
    if      (world_rank == 0) {
        MPI_Send(d_a, size, MPI_DOUBLE, 1, 3, MPI_COMM_WORLD);
    }
    else if (world_rank == 1) {
        MPI_Recv(  a, size, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD, &status);
    }

    // From [1],CPU to [0],CPU
    if      (world_rank == 1) {
        MPI_Send(  a, size, MPI_DOUBLE, 0, 4, MPI_COMM_WORLD);
    }
    else if (world_rank == 0) {
        MPI_Recv(  a, size, MPI_DOUBLE, 1, 4, MPI_COMM_WORLD, &status);
    }

    // Check host memory
    for (int i = 0; i < size; i++) {
        if (a[i] != 0.) {
            errx(2, "MPI transport failed");
        }
    }

    // Terminates MPI execution environment
    MPI_Finalize();

    // Delete data on the device
    cudaFree(d_a);

    // Free memory on the host
    free(a);

    return 0;
}
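
Whether an Open MPI installation was built with CUDA-aware support can be checked from within a program via the Open MPI extension header mpi-ext.h; the following sketch is specific to Open MPI:

#include <mpi.h>
#include <mpi-ext.h> // Open MPI extensions, defines MPIX_CUDA_AWARE_SUPPORT
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    // Compile-time support is present; also query the runtime
    printf("CUDA-aware support available: %d\n", MPIX_Query_cuda_support());
#else
    printf("This MPI library was built without CUDA-aware support\n");
#endif
    MPI_Finalize();
    return 0;
}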

MPI programs can be compiled with different compilers. The procedure for this is as follows:

  • GNU Compiler Collection, OpenMPI
    # Load GNU compiler, OpenMPI and CUDA environment
    module add \
        compiler/gnu \
        mpi/openmpi \
        devel/cuda
    
    # Compile C or C++ source code with CUDA-aware MPI support
    mpicc  ...   ${C_SOURCE} -o ${EXECUTABLE} -lcudart
    mpicxx ... ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
    
  • LLVM Compiler, OpenMPI
    # Load LLVM compiler, OpenMPI and CUDA environment
    module add \
        compiler/llvm  \
        mpi/openmpi \
        devel/cuda
    
    # Compile C or C++ source code with CUDA-aware MPI support
    mpicc  ...   ${C_SOURCE} -o ${EXECUTABLE} -lcudart
    mpicxx ... ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
    
  • NVIDIA High Performance Computing (HPC) SDK
    # Load NVIDIA HPC SDK environment
    module add toolkit/nvidia-hpc-sdk
    
    # Compile C or C++ source code with CUDA-aware MPI support
    mpicc  ...   ${C_SOURCE} -o ${EXECUTABLE} -lcudart
    mpicxx ... ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
    

An MPI program with accelerator support is started as usual with mpirun. In the accelerated partition of the HoreKa cluster, each node provides 2 CPUs with 38 cores each and 4 GPUs. To distribute the CPU cores evenly across the GPUs, you can start 2 ranks per socket and bind 19 cores to each rank, so that every node runs 4 ranks, one per GPU, and assign each rank its own GPU via CUDA_VISIBLE_DEVICES:

mpirun \
    --display-map \
    --display-allocation \
    --map-by ppr:2:socket:pe=19 \
    --bind-to core \
    bash -c \
        'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK};
        ${EXECUTABLE}'
