CUDA is a collection of accelerated libraries and extensions to the C, C++ and Fortran programming languages. For more information visit the CUDA homepage. Recommended reading resources are:

CUDA fundamentals

CUDA offers direct, low-level control over where code is executed. Pieces of code meant to run on GPGPU accelerators, so-called "kernels", are written in the same programming language as the host code and are embedded in the source code alongside the functions that run on the CPU.

Code written for CUDA will not run on CPUs without modification, and CUDA code does not conform to the C/C++/Fortran standards. It can only be compiled directly with NVIDIA's own compilers and a few others, e.g. LLVM.

Code sample

The following simple C program shows the typical sequence of steps in a CUDA program:

#include <stdio.h>
#include <stdlib.h>
#include <err.h>

// Kernel: each thread increments exactly one element of a[]
__global__ void inc_kernel(double *a) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    a[i] += 1.;
}

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Platform information
    int runtime_version = 0;
    cudaRuntimeGetVersion(&runtime_version);
    printf("CUDA version: %i.%i\n",
        runtime_version / 1000, (runtime_version % 100) / 10);

    int driver_version = 0;
    cudaDriverGetVersion(&driver_version);
    printf("CUDA driver version: %i.%i\n",
        driver_version / 1000, (driver_version % 100) / 10);

    int num_devices = 0;
    cudaError_t error_id = cudaGetDeviceCount(&num_devices);
    if (error_id != cudaSuccess) {
        errx(1, "cudaGetDeviceCount failed: %s", cudaGetErrorString(error_id));
    }
    printf("Number of CUDA devices: %i\n", num_devices);

    int device_num = 0;
    cudaGetDevice(&device_num);
    printf("CUDA Device number: %i\n", device_num);

    size_t memory_free = 0, memory_total = 0;
    cudaMemGetInfo(&memory_free, &memory_total);
    printf("Memory on CUDA device: %llu bytes\n",      (unsigned long long) memory_total);
    printf("Free Memory on CUDA device: %llu bytes\n", (unsigned long long) memory_free);

    cudaDeviceProp device_properties;
    cudaGetDeviceProperties(&device_properties, device_num);
    printf("CUDA device name: %s\n", device_properties.name);
    printf("CUDA capability: %i.%i\n", device_properties.major, device_properties.minor);

    printf("* Allocate memory on the host\n");
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    printf("* Allocate memory on the device\n");
    double *d_a;
    if (cudaMalloc(&d_a, size * sizeof(double)) != cudaSuccess) {
        errx(1, "cudaMalloc d_a[] failed");
    }

    printf("* Preprocess / initialize data on the host\n");
    printf("  e.g. read data from storage\n");
    for (unsigned int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    printf("* Copy data from the host to the device\n");
    cudaMemcpy(d_a, a, size * sizeof(double), cudaMemcpyHostToDevice);

    printf("* Compute on the device\n");
    // size is a multiple of 256, so size/256 blocks of 256 threads cover a[] exactly
    inc_kernel<<<size/256, 256>>>(d_a);

    printf("* Transfer data back from the device to the host\n");
    cudaMemcpy(a, d_a, size * sizeof(double), cudaMemcpyDeviceToHost);

    printf("* Delete data on the device\n");
    cudaFree(d_a);

    printf("* Postprocess data on the host\n");
    printf("  e.g. write data to storage\n");
    for (unsigned int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    printf("* Free memory on the host\n");
    free(a);

    return 0;
}

Compiling CUDA code

CUDA programs can be compiled with several compilers. The commands for each are as follows:

  • LLVM Compiler

    ## Load LLVM compiler and CUDA environment
    $ module add \
        compiler/llvm \
        ...
    ## Compile C or C++ source code with CUDA support
    $ clang   ... --cuda-gpu-arch=sm_80   ${C_SOURCE} -o ${EXECUTABLE} -lcudart
    $ clang++ ... --cuda-gpu-arch=sm_80 ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart

  • NVIDIA High Performance Computing (HPC) SDK

    ## Load NVIDIA HPC SDK environment
    $ module add ...
    ## Compile C or C++ source code with CUDA support
    $ nvcc  ...   ${C_SOURCE} -o ${EXECUTABLE}
    $ nvc++ ... ${CXX_SOURCE} -o ${EXECUTABLE}

CUDA on other accelerators

Both AMD and Intel have developed compatibility layers that make it possible to run many CUDA codebases on their own devices and accelerators with little or no manual modification of the code. The solution for AMD GPUs is called Heterogeneous-Computing Interface for Portability (HIP); the one for Intel GPUs is the Intel DPC++ Compatibility Tool.

Last update: December 20, 2022