
JobMon - Job Performance Metrics

With JobMon we offer a web-based performance monitoring service. It visualizes selected performance metrics collected on the cluster nodes during the runtime of a user's jobs. Metrics are stored for at least 4 weeks, so that performance changes over time can be tracked.

Benchmarks

In this documentation we demonstrate JobMon use cases using benchmarks from three different categories:

Compute Bound

Performance is limited by the speed of the CPU; typically this means it is limited by the rate of floating point operations. Benchmarks from this category include:

  • DGEMM

    The DGEMM benchmark performs a matrix-matrix multiplication C = A · B. For n × n matrices the computation requires O(n³) compute operations but only O(n²) memory operations. Implementations for CPU and GPU are available for this benchmark.
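
The high ratio of compute to memory operations is what makes DGEMM compute bound. As a rough illustration (this sketch is not part of JobMon or of the benchmark itself), the following Python snippet multiplies two matrices with NumPy and estimates the arithmetic intensity:

```python
import numpy as np

n = 2048
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

# C = A . B: about 2 * n^3 floating point operations (one multiply and one add
# per inner-product step), but only 3 * n^2 matrix elements are touched.
C = A @ B

flops = 2 * n**3
bytes_moved = 3 * n**2 * 8          # A and B read, C written, 8 bytes per double
print(f"arithmetic intensity ~ {flops / bytes_moved:.0f} FLOP/byte")
```

For n = 2048 this gives an intensity of well over 100 FLOP/byte, far more than the memory subsystem needs to deliver, so the compute units become the bottleneck.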

Memory bound

Performance is limited by the speed of the main memory subsystem. This can be caused by memory bandwidth saturation, or memory access latency. Benchmarks from this category include:

  • Stream / BabelStream

    The Stream benchmark performs four sub-benchmarks: vector copy c = a, vector scale b = α · c, vector add c = a + b, and vector triad a = b + α · c. For n-dimensional vectors the computation requires O(n) memory operations and only O(n) compute operations, so performance is limited by the memory bandwidth. Implementations for CPU and GPU are available for this benchmark (a minimal sketch of the four kernels follows after this list).

  • High Performance Conjugate Gradients (HPCG)

    HPCG applies the conjugate gradient method to a sparse matrix. Implementations for CPU and GPU are available for this benchmark.
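
For illustration, the four Stream kernels can be written down directly with NumPy. This is only a sketch to show the access pattern, not the actual benchmark implementation; timing and repetitions are deliberately omitted:

```python
import numpy as np

n = 10_000_000                      # vectors large enough to exceed the CPU caches
alpha = 0.5
a = np.zeros(n)
b = np.full(n, 2.0)
c = np.full(n, 1.0)

c[:] = a                            # copy:  c = a
b[:] = alpha * c                    # scale: b = alpha * c
c[:] = a + b                        # add:   c = a + b
a[:] = b + alpha * c                # triad: a = b + alpha * c

# The triad kernel moves three 8-byte doubles per element (two loads, one store)
# while performing only two floating point operations per element.
print(f"triad arithmetic intensity ~ {2 / (3 * 8):.3f} FLOP/byte")
```

The resulting intensity of roughly 0.08 FLOP/byte is the opposite extreme to DGEMM: the compute units spend most of their time waiting for data, so the measured performance reflects the memory bandwidth.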

Communication bound

Performance is limited by the node-to-node communication network. This can be caused by interconnect bandwidth saturation or by communication packet latency. Benchmarks from this category include:

  • OSU Micro-Benchmarks (OMB)

    OMB performs point-to-point communication with messages of increasing size to measure bandwidth and latency. Implementations for CPU and GPU are available for this benchmark.
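
The measurement principle behind OMB can be sketched as a simple ping-pong between two MPI ranks. The following mpi4py snippet is a minimal, hypothetical illustration (not the official OSU code) of how measured bandwidth grows with message size:

```python
# Run with two ranks, e.g.: mpirun -n 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iterations = 100

for size in (2 ** k for k in range(10, 24)):     # message sizes from 1 KiB to 8 MiB
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(iterations):
        if rank == 0:
            comm.Send(buf, dest=1)               # ping
            comm.Recv(buf, source=1)             # pong
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = MPI.Wtime() - start
    if rank == 0:
        # Each iteration transfers the message once in each direction.
        bandwidth = 2 * size * iterations / elapsed / 1e9
        print(f"{size:>9} B  {bandwidth:6.2f} GB/s")
```

Small messages are dominated by latency, large messages by the InfiniBand bandwidth; both effects show up in the Interconnect diagrams described later.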

Jobs page

After logging in to the JobMon service, you will be automatically redirected to the Jobs page.

This page presents an overview of the individual batch jobs that have been executed on the HoreKa cluster. These jobs can be filtered by:

  • the partition the job was running in
  • the number of nodes or number of GPUs used by the job
  • still running or already finished jobs
  • the time the job was running (e.g. jobs of the last week)
  • tags assigned to the job (e.g. tags: "without optimization", "optimization A", "optimization B", ...)

Job overview: Filter job data

The Jobs page includes a so-called spider plot for each job. This plot shows the performance limitations of a job at a glance and allows it to be categorized as memory bound, compute bound, or communication bound; a small illustration of this categorization follows after the list below. For this purpose, the plot shows the average and maximum values of the following metrics:

  • CPU floating point operations per second
  • CPU memory bandwidth
  • GPU utilization
  • GPU memory utilization
  • InfiniBand bandwidth
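
How the spider plot supports this categorization can be illustrated with a small, hypothetical calculation: each measured average is compared against an assumed peak value of the node, and the metric with the highest ratio hints at the limiting resource. The peak numbers and metric names below are made up for illustration and do not describe HoreKa hardware:

```python
# Assumed node peak values (hypothetical, for illustration only).
peaks   = {"cpu_flops": 2.5e12, "cpu_mem_bw": 3.5e11, "ib_bw": 2.5e10}
# Average metric values of an example job.
job_avg = {"cpu_flops": 2.2e12, "cpu_mem_bw": 1.5e11, "ib_bw": 1.0e9}

ratios = {metric: job_avg[metric] / peaks[metric] for metric in peaks}
limiting = max(ratios, key=ratios.get)

categories = {"cpu_flops": "compute bound",
              "cpu_mem_bw": "memory bound",
              "ib_bw": "communication bound"}
print(ratios)                       # e.g. cpu_flops ~0.88, cpu_mem_bw ~0.43, ib_bw ~0.04
print("job is likely", categories[limiting])
```

In the spider plot the same comparison is done visually: the closer a corner of the plot gets to its maximum, the more likely that resource is the limiting one.
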
Examples: Spider Plot

Job Overview for Compute Bound CPU-Job

DGEMM is capable of using the entire available floating point power of the node. It also uses a significant portion of the memory bandwidth.

Job Overview for Memory Bound CPU-Job

Stream saturates the entire memory bandwidth. All other resources are under-utilized.

Job Overview for Communication Bound CPU-Job

OSU Micro-Benchmarks saturates the entire InfiniBand bandwidth. Some memory bandwidth utilization can also be seen as messages are transferred from memory to memory.

Job Overview for Compute Bound GPU-Job

The DGEMM implementation uses only a single GPU. Therefore, only a quarter of the available floating point performance of the GPUs is utilized. This is reflected in the use of memory bandwidth as well.

Job Overview for Memory Bound GPU-Job

HPCG runs distributed on all GPUs of the node. HPCG saturates the entire memory bandwidth of these GPUs. The compute units are also well utilized.

Job Overview for Communication Bound GPU-Job

InfiniBand communication between GPUs is not as performant as between CPUs. Therefore, OSU Micro-Benchmarks cannot utilize the entire available InfiniBand bandwidth.

Per job page

Configuration options

  • Select subset of nodes: Show only nodes of interest
  • Select time range: Zoom into time ranges of interest
  • Select subset of metrics: Focus on metrics of interest
  • Set Tag

    Tags are a handy tool to mark jobs during different optimization steps (e.g. tags: "without optimization", "optimization A", "optimization B", ...) or jobs from different workflows (e.g. tags: "workflow A", "workflow B"). Filtering for tags on the jobs page allows easily finding all jobs with the same tag.

  • Toggle for Automatic Scaling:

    Select diagram y-axis upper and lower limits depending on the measured metric values

  • Toggle for Changepoints

    Changepoint detection tries to identify points in time at which the behavior of a performance metric changes. Identified changes are displayed as vertical lines in the diagram.

    Examples: Changepoints

    Jobs: Toggle Changepoints off for InfiniBand packets

    OMB performs communication with messages of increasing size. The performance changes caused by increasing message size can only be seen as steps in the graph.

    Jobs: Toggle Changepoints on for InfiniBand packets

    OMB performs communication with messages of increasing size. Some of the time points with changing performance caused by the increasing message size are detected by the changepoint algorithm and marked by vertical lines.

  • Toggle for Quantile view

    • Quantile view is a useful feature to improve clarity when too many graphs are displayed in one diagram.
    • Information is condensed into only three graphs (25% / 50% / 75% Quantile = Quartiles).
    • The 25% Quantile is the graph below which 25% of the measured values lie.
    • The 50% Quantile is also known as the median.
    • The difference between the upper and lower Quantile is a measure of the spread of the metric (a minimal sketch of this calculation follows after this list).
    Examples: Quantile View

    Jobs: Toggle Quantile view off for CPU temperature

    The CPU temperatures are collected per hardware thread; the diagram therefore appears very cluttered.

    Jobs: Toggle Quantile view on for CPU temperature

    The Quantile view shows the distribution of the CPU temperatures much more clearly.

    Jobs: Toggle Quantile view off for CPU frequency

    The CPU frequencies are collected per CPU core; the diagram therefore appears very cluttered.

    Jobs: Toggle Quantile view on for CPU frequency

    The Quantile view shows the distribution of the CPU frequencies much more clearly.
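
The quartiles mentioned above can be reproduced directly from per-thread samples, as sketched here with synthetic data (the temperature values are made up; JobMon computes the quantiles from the collected metrics):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic example: 100 time steps x 128 hardware threads of CPU temperatures.
samples = 60.0 + 10.0 * rng.random((100, 128))

# Condense the 128 per-thread graphs into three graphs: the quartiles per time step.
q25, q50, q75 = np.percentile(samples, [25, 50, 75], axis=1)

# q50 is the median; the interquartile range q75 - q25 measures the spread.
print("median of first time step:", q50[0])
print("spread of first time step:", q75[0] - q25[0])
```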

Performance categories

Metrics are grouped into different performance categories.

Energy

This category offers diagrams for:

  • CPU power consumption of DRAM channels and the package
  • GPU power consumption
  • Server system power consumption
Examples: Category Energy

Jobs: CPU Power DRAM for benchmark stream

Stream, as a memory bound benchmark, puts constantly high pressure on the DRAM subsystem. This is reflected in the constantly high power consumption of this subsystem.

Jobs: CPU Power DRAM for benchmark DGEMM

DGEMM, as a compute bound benchmark, puts less pressure on the DRAM subsystem. This is reflected in the varying power consumption of this subsystem over time.

Jobs: CPU Power Package for benchmark HPCG

The GPU implementation of the HPCG benchmark has a preparation phase executed on the CPU and a computation phase executed on the GPU. During the first phase the power consumption of the CPU package is therefore higher than in the subsequent phase.

Jobs: GPU Power for benchmark HPCG

The GPU implementation of the HPCG benchmark has a preparation phase executed on the CPU and a computation phase executed on the GPU. During the first phase the power consumption of the GPUs is therefore lower than in the subsequent phase.

Filesystem

This category offers graphs for e.g. metadata operations, I/O throughput, ...

Interconnect

This category offers diagrams for:

  • InfiniBand: Send, receive, and aggregated bandwidth
  • InfiniBand: Sent, received, and aggregated number of packets
Examples: Category Interconnect

Jobs: InfiniBand send bandwidth for OSU Micro-Benchmark

InfiniBand send bandwidth for the OSU Micro-Benchmark. In the point-to-point communication, node hkn0201 sends data while node hkn0202 only receives data (its send bandwidth is zero).

Jobs: InfiniBand receive bandwidth for OSU Micro-Benchmark

InfiniBand receive bandwidth for the OSU Micro-Benchmark. In the point-to-point communication, node hkn0202 receives data while node hkn0201 only sends data (its receive bandwidth is zero).

Jobs: InfiniBand bandwidth for OSU Micro-Benchmark

InfiniBand bandwidth for OSU Micro-Benchmark. In the point-to-point communication, both nodes use the same bandwidth, even though one of the nodes only sends data and the other one only receives data.

Memory

This category offers diagrams for:

  • Amount of memory used on the system (CPU) and on the GPU
  • CPU memory bandwidth
  • GPU memory utilization (in %) and frequency
Examples: Category Memory

Jobs: CPU Memory Bandwidth for Benchmark Stream

Stream, as a memory bound benchmark, puts constantly high pressure on the memory subsystem.

Jobs: CPU Memory Bandwidth for Benchmark DGEMM

DGEMM, as a compute bound benchmark, puts less pressure on the memory subsystem. This is reflected in the varying bandwidth over time.

Jobs: GPU Memory Utilization for Benchmark BabelStream

BabelStream, as a memory bound benchmark, fully utilizes the memory subsystem of the GPU.

Jobs: GPU Memory Utilization for Benchmark DGEMM

The GPU implementation of DGEMM, as a compute bound benchmark, puts less pressure on the GPU memory subsystem. This is reflected in the varying utilization over time.

Performance

This category offers diagrams for:

  • Floating point operations per second (FLOP/s), collected per hardware thread and aggregated per core or per socket
  • Instructions per cycle (IPC), collected per hardware thread and aggregated per core or per socket
  • CPU time spent in kernel and in user space
  • One-minute Linux load average
  • GPU utilization
  • CPU and GPU frequency
Examples: Category Performance

Jobs: FLOP/s aggregated per core

Floating point operations per second (FLOP/s) are collected per hardware thread. As the hardware threads of a core share the same compute units, it may be more appropriate to aggregate per core.

Jobs: FLOP/s aggregated per socket

Floating point operations per second (FLOP/s) are collected per hardware thread. To check whether the CPU sockets are evenly utilized, an aggregation per socket can be beneficial.
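
The aggregation itself is straightforward once the mapping from hardware threads to cores and sockets is known. The layout below (16 hardware threads, 2 threads per core, 4 cores per socket) is purely hypothetical and only meant to illustrate the idea:

```python
import numpy as np

rng = np.random.default_rng(0)
flops_per_thread = 1e9 * rng.random(16)        # one FLOP/s value per hardware thread
threads_per_core, cores_per_socket = 2, 4      # hypothetical topology

# Hardware threads of a core share its execution units, so their values are summed.
flops_per_core = flops_per_thread.reshape(-1, threads_per_core).sum(axis=1)

# Summing the cores of each socket shows whether the sockets are evenly utilized.
flops_per_socket = flops_per_core.reshape(-1, cores_per_socket).sum(axis=1)

print("per core:  ", flops_per_core)
print("per socket:", flops_per_socket)
```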

Temperature

This category offers diagrams for:

  • CPU and GPU temperature
Examples: Category Temperature

Jobs: GPU temperature for benchmark DGEMM

The GPU implementation of the DGEMM benchmark only utilizes one of the GPUs. Only this GPU gets hot, while the other GPUs maintain a lower temperature.

Jobs: GPU temperature for benchmark HPCG

The GPU implementation of the HPCG benchmark has a preparation phase executed on the CPU and a computation phase executed on the GPU. During the first phase, the GPU is not utilized and therefore stays cooler than in the subsequent phase.

Additional Features

  • For multi-node jobs there is a configuration option to select the per-node aggregation function used (e.g. average, sum, maximum)
  • Live view of still running jobs
  • Download of metrics as CSV file

    • All metrics as measured by the collector
    • Ready for use e.g. in a spreadsheet application or Python (see the sketch after this list)
  • Outlook

    • For future versions of JobMon it is planned to automatically analyze each job and assign tags for detected characteristics
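
As an example of the Python workflow, the downloaded CSV file could be loaded and plotted as sketched below. The file name and the column names ("time", "node", "cpu_mem_bw") are hypothetical; adapt them to the header of the CSV file actually exported by JobMon:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names -- check the header of the downloaded file.
df = pd.read_csv("jobmon_metrics.csv", parse_dates=["time"])

# One line per node, e.g. for the CPU memory bandwidth metric.
for node, per_node in df.groupby("node"):
    plt.plot(per_node["time"], per_node["cpu_mem_bw"], label=node)

plt.xlabel("time")
plt.ylabel("CPU memory bandwidth")
plt.legend()
plt.show()
```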

Last update: February 1, 2024