
Multithreaded + MPI Parallel Programs

Multithreaded + MPI parallel programs run faster than serial programs on multiple CPUs with multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned across different nodes.

OpenMPI with Multithreading

Multiple MPI tasks using OpenMPI must be launched with the MPI launcher mpirun. For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

For OpenMPI, a job script job_ompi_omp.sh that submits a batch job running the 76-fold threaded MPI program ompi_omp_program with 4 MPI tasks, requiring 3000 MByte of physical memory per thread (with 10 threads per MPI task this amounts to 10*3000 MByte = 30000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=76
#SBATCH --time=03:00:00
#SBATCH --mem=30gb
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/4.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${SLURM_CPUS_PER_TASK} -report-bindings"
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export NUM_CORES=$((SLURM_NTASKS * SLURM_CPUS_PER_TASK))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe

Execute the script job_ompi_omp.sh with the command sbatch:

$ sbatch -p cpuonly job_ompi_omp.sh
  • With the mpirun option --bind-to core, MPI tasks and OpenMP threads are bound to physical cores.
  • With the option --map-by socket:PE=<value>, neighboring MPI tasks are attached to different sockets and each MPI task is bound to the number of CPUs specified in <value>. <value> must be set to ${OMP_NUM_THREADS} (see the expanded command after this list).
  • The option -report-bindings shows the bindings between MPI tasks and physical cores.
  • The mpirun options --bind-to core and --map-by socket|...|node:PE=<value> should always be used when running a multithreaded MPI program.
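
For the values used in the job script above (SLURM_NTASKS=4, SLURM_CPUS_PER_TASK=76), the MPIRUN_OPTIONS expand so that the script effectively executes the following launch line:

$ mpirun -n 4 --bind-to core --map-by socket:PE=76 -report-bindings ./ompi_omp_program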

Intel MPI with Multithreading


Multiple Intel MPI tasks must be launched with the MPI launcher mpiexec.hydra. For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

For Intel MPI, a job script job_impi_omp.sh that submits a batch job running the 76-fold threaded Intel MPI program impi_omp_program with 4 MPI tasks, requiring 64000 MByte of total physical memory per task and a total wall clock time of 1 hour, looks like:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=76
#SBATCH --time=60
#SBATCH --mem=64000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"

# If using more than one MPI task per node, please set
export KMP_AFFINITY=compact,1,0
# export KMP_AFFINITY=verbose,scatter would additionally print messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

# Use when a defined module environment related to Intel MPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
export NUM_PROCS=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe

With the Intel compiler, the environment variable KMP_AFFINITY switches on the binding of threads to specific cores. If you run only one MPI task per node, please set KMP_AFFINITY=compact,1,0.
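
For example, combining the settings mentioned above (the verbose modifier only adds diagnostic output and does not change the placement):

# one MPI task per node: pack threads onto consecutive cores
export KMP_AFFINITY=compact,1,0
# the verbose modifier additionally prints the resulting thread-to-core mapping
#export KMP_AFFINITY=verbose,compact,1,0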

If you want to use 128 or more nodes, you must also set the environment variable as follows:

$ export I_MPI_HYDRA_BRANCH_COUNT=-1

If you want to use the options perhost, ppn or rr, you must additionally set the environment variable:

$ export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
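
For example, a minimal sketch that combines this setting with the ppn option (here 2 processes per node, reusing ${SLURM_NTASKS} and ${EXE} from the job script above):

$ export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
$ mpiexec.hydra -bootstrap slurm -ppn 2 -n ${SLURM_NTASKS} ${EXE}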

Execute the script job_impi_omp.sh with the command sbatch:

$ sbatch -p cpuonly job_impi_omp.sh

The mpirun option -print-rank-map shows the mapping between MPI tasks and nodes (of limited use). The option -binding binds MPI tasks (processes) to particular processors; domain=omp means that the domain size is determined by the number of threads. In the above example (2 MPI tasks per node) you could also choose -binding "cell=unit;map=bunch"; this binding maps one MPI process to each socket.
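
As a sketch, the launch line from the job script above with this alternative binding (same ${SLURM_NTASKS} and ${EXE}) would read:

$ mpiexec.hydra -bootstrap slurm -binding "cell=unit;map=bunch" -print-rank-map -envall -n ${SLURM_NTASKS} ${EXE}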

New Examples

This is a work-in-progress section.
In the future, all code examples will be available from a GitLab repository and will be tested continuously.

Code

Download: hello-world_mpi+omp.cpp, Makefile

#include <mpi.h>
#include <omp.h>
#include <iostream>
#include <sched.h>
#include <unistd.h> 
#include <hwloc.h>

using namespace std;

int main (int argc, char *argv[])
{
    int size, rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    // Build the hwloc topology so logical CPUs can be mapped to physical cores
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &name_len);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int logical_core = sched_getcpu();

        // Look up the PU (logical CPU) in the hwloc topology and take its parent core
        hwloc_obj_t obj = hwloc_get_pu_obj_by_os_index(topology, logical_core);
        int physical_core = obj->parent->logical_index;

        // Stagger the output so lines from different ranks/threads interleave less
        usleep(rank*100 + logical_core*1000 + tid*10000);

        #pragma omp critical
        cout<<"node: "<<processor_name
            <<" | MPI-task: "<<rank<<" / "<<size
            <<" | OpenMP-thread: "<<tid <<" / "<<nthreads
            <<" | physical core: " << physical_core
            <<" | logical core: " << logical_core
            <<endl;

    }

    hwloc_topology_destroy(topology);
    MPI_Finalize();
    return 0;
}
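
The submit scripts below build the binaries with make; the actual Makefile is part of the download above. As a rough sketch (compiler wrappers and flags are assumptions, not the verbatim Makefile content), the two targets boil down to compile commands like these:

# Open MPI toolchain (GNU compiler); OpenMP via -fopenmp, hwloc for the core lookup
mpicxx -fopenmp -O2 hello-world_mpi+omp.cpp -lhwloc -o hello-world_mpi+omp_openmpi

# Intel MPI toolchain (Intel LLVM compiler); OpenMP via -qopenmp
mpiicpx -qopenmp -O2 hello-world_mpi+omp.cpp -lhwloc -o hello-world_mpi+omp_intelmpi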

Submit scripts

OpenMPI

Download: batch_one-mpi-task-per-socket.sh

#!/bin/bash
#SBATCH --partition=dev_cpuonly
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=38
#SBATCH --threads-per-core=1
#SBATCH --time=00:02:00
#SBATCH --mem=5000
#SBATCH --output="hello-world_mpi+omp_openmpi_%j.out"

# Load modules
module load compiler/gnu/14
module load mpi/openmpi/5.0

# Compile source code
make hello-world_mpi+omp_openmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores
export OMP_PROC_BIND=TRUE

mpirun --map-by package --bind-to package ./hello-world_mpi+omp_openmpi

Download: batch_one-mpi-task-per-socket_hyperthreading.sh

#!/bin/bash
#SBATCH --partition=dev_cpuonly
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=76
#SBATCH --threads-per-core=2
#SBATCH --time=00:02:00
#SBATCH --mem=5000
#SBATCH --output="hello-world_mpi+omp_openmpi_%j.out"

# Load modules
module load compiler/gnu/14
module load mpi/openmpi/5.0

# Compile source code
make hello-world_mpi+omp_openmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=threads
export OMP_PROC_BIND=TRUE

mpirun --map-by package --bind-to package ./hello-world_mpi+omp_openmpi

Intel MPI

Download: batch_one-mpi-task-per-socket_intel.sh

#!/bin/bash
#SBATCH --partition=dev_cpuonly
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=38
#SBATCH --threads-per-core=1
#SBATCH --time=00:02:00
#SBATCH --mem=5000
#SBATCH --output="hello-world_mpi+omp_intelmpi_%j.out"

# Load modules
module load compiler/intel/2025.1_llvm
module load mpi/impi/2021.11

# Compile source code
make hello-world_mpi+omp_intelmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores
export OMP_PROC_BIND=TRUE
export I_MPI_PIN_DOMAIN=socket

mpirun ./hello-world_mpi+omp_intelmpi

Download: batch_one-mpi-task-per-socket_hyperthreading_intel.sh

#!/bin/bash
#SBATCH --partition=dev_cpuonly
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=76
#SBATCH --threads-per-core=2
#SBATCH --time=00:02:00
#SBATCH --mem=5000
#SBATCH --output="hello-world_mpi+omp_intelmpi_%j.out"

# Load modules
module load compiler/intel/2025.1_llvm
module load mpi/impi/2021.11

# Compile source code
make hello-world_mpi+omp_intelmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=threads
export OMP_PROC_BIND=TRUE
export I_MPI_PIN_DOMAIN=socket

mpirun ./hello-world_mpi+omp_intelmpi
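
As with the examples above, submit these scripts with sbatch; the partition is already selected inside the scripts via #SBATCH --partition, for example:

$ sbatch batch_one-mpi-task-per-socket.sh
$ sbatch batch_one-mpi-task-per-socket_hyperthreading_intel.sh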