Gaudi 2/3

The official documentation for Intel Gaudi can be found at https://docs.habana.ai/en/latest/index.html

Hardware Overview

The system management interface tool hl-smi aids in the management and monitoring of the Gaudi accelerators.

Running hl-smi without any options displays a summary table of the detected Gaudi devices:

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.19.1-fw-57.2.2.0          |
| Driver Version:                                     1.19.1-6f47ddd          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
| N/A   27C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   26C   N/A  73W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:1a:00.0     N/A |                   0  |
| N/A   28C   N/A  75W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   28C   N/A  79W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:43:00.0     N/A |                   0  |
| N/A   28C   N/A  81W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:cc:00.0     N/A |                   0  |
| N/A   28C   N/A  73W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:44:00.0     N/A |                   0  |
| N/A   27C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:cd:00.0     N/A |                   0  |
| N/A   28C   N/A  77W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+
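For scripted monitoring, the summary table can be turned into per-device records. The following is a minimal sketch that assumes the two-row layout shown above; the function name parse_hl_smi_summary and the dictionary keys are hypothetical, and the regular expressions are written against this sample only, not against a documented hl-smi format.

```python
import re
from typing import Dict, List

# First row of a device entry, e.g. "|   0  HL-225   N/A  | 0000:19:00.0  N/A | ... |"
NAME_ROW = re.compile(
    r"\|\s*(?P<index>\d+)\s+(?P<name>HL-\S+)\s+\S+\s*\|\s*(?P<bus>[0-9a-fA-F:.]+)"
)
# Second row, e.g. "| N/A   27C   N/A  82W /  600W  |   768MiB /  98304MiB |  0%  N/A |"
STATS_ROW = re.compile(
    r"\|\s*\S+\s+(?P<temp>\d+)C\s+\S+\s+(?P<power>\d+)W\s*/\s*(?P<cap>\d+)W\s*"
    r"\|\s*(?P<used>\d+)MiB\s*/\s*(?P<total>\d+)MiB\s*"
    r"\|\s*(?P<util>\d+)%"
)


def parse_hl_smi_summary(text: str) -> List[Dict[str, object]]:
    """Collect one record per device from an hl-smi summary table."""
    devices: List[Dict[str, object]] = []
    current = None
    for line in text.splitlines():
        name = NAME_ROW.search(line)
        if name:
            # Start a new record; the statistics follow on the next row.
            current = {
                "index": int(name["index"]),
                "name": name["name"],
                "bus_id": name["bus"],
            }
            continue
        stats = STATS_ROW.search(line)
        if stats and current is not None:
            current.update(
                temp_c=int(stats["temp"]),
                power_w=int(stats["power"]),
                power_cap_w=int(stats["cap"]),
                mem_used_mib=int(stats["used"]),
                mem_total_mib=int(stats["total"]),
                util_percent=int(stats["util"]),
            )
            devices.append(current)
            current = None
    return devices


# Two device entries copied from the table above.
SAMPLE = """
|   0  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
| N/A   27C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   26C   N/A  73W /  600W  |   768MiB /  98304MiB |     0%           N/A |
"""

devices = parse_hl_smi_summary(SAMPLE)
for device in devices:
    print(device)
```

On a compute node, the same function could be fed the captured output of hl-smi, for example via subprocess.run(["hl-smi"], capture_output=True, text=True).stdout.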

Using Gaudi Accelerators

Intel provides a custom PyTorch environment that is optimized for the Intel Gaudi AI accelerators. The necessary software is preinstalled on the nodes and can be loaded with the Lmod module system. To avoid ambiguities, load architecture-specific modules on the compute nodes themselves, inside an allocation created with salloc on the requested partition:

# Gaudi 2 node
salloc -p gaudi2 -t 01:00:00
# or, alternatively, a Gaudi 3 node
salloc -p gaudi3 -t 01:00:00

We can then load the PyTorch module:

module purge
module load toolkit/gaudi-torch

Afterwards, a Python program can be started with:

python torch_example.py
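As an illustration, torch_example.py could look like the minimal sketch below. Its contents are an assumption and not a file shipped with the module. It uses the habana_frameworks package of the Intel Gaudi PyTorch bridge to make the "hpu" device available and falls back to the CPU when the bridge or a device is missing; the helper name select_device_name is hypothetical.

```python
def select_device_name() -> str:
    """Return "hpu" if the Gaudi PyTorch bridge reports a device, else "cpu"."""
    try:
        import habana_frameworks.torch.hpu as hthpu
    except ImportError:
        # Bridge (or PyTorch itself) is not installed in this environment.
        return "cpu"
    return "hpu" if hthpu.is_available() else "cpu"


def main() -> None:
    try:
        import torch
    except ImportError:
        print("PyTorch not found; load the toolkit/gaudi-torch module first.")
        return

    device_name = select_device_name()
    htcore = None
    if device_name == "hpu":
        # Importing the bridge core registers the "hpu" device with PyTorch.
        import habana_frameworks.torch.core as htcore

    device = torch.device(device_name)
    a = torch.ones(4, 4, device=device)
    b = torch.full((4, 4), 2.0, device=device)
    c = a @ b

    if htcore is not None:
        # Lazy mode queues operations; mark_step() triggers their execution.
        htcore.mark_step()

    print(f"device={device_name} sum={c.sum().item()}")


if __name__ == "__main__":
    main()
```

Because of the CPU fallback, the same script can be smoke-tested on a machine without Gaudi hardware before it is run inside an allocation.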

Code examples and more information can be found here.

Software for Gaudi

In our experience, the least error-prone way to get a functional software stack for Gaudi is to use containers, which Intel provides directly. The following small working example imports a Docker image from Intel, launches it with the enroot runtime, and runs a hello world example script found here.

# Log in to FTP
$ ssh <user>@ftp-x86-login.scc.kit.edu

# Allocate a Gaudi 3 node
$ salloc -p gaudi3 -t 01:00:00

# Get Docker container
$ enroot import docker://vault.habana.ai#gaudi-docker/1.21.1/ubuntu24.04/habanalabs/pytorch-installer-2.6.0:latest
$ enroot create gaudi-docker+1.21.1+ubuntu24.04+habanalabs+pytorch-installer-2.6.0+latest.sqsh

# Download sample code
$ mkdir -p gaudi && cd gaudi
$ git clone -b 1.21.0 https://github.com/HabanaAI/Model-References.git
$ cd Model-References/PyTorch/examples/computer_vision/hello_world/

# Start container (its name is the .sqsh file name without the extension)
$ enroot start --rw -m .:/mnt gaudi-docker+1.21.1+ubuntu24.04+habanalabs+pytorch-installer-2.6.0+latest bash
$ cd /mnt

# Run sample case
$ mpirun -n 8 --bind-to core --map-by socket:PE=6 \
      --rank-by core --report-bindings \
      --allow-run-as-root \
      -x PT_HPU_LAZY_MODE=0 python -W ignore mnist.py \
      --batch-size=64 --epochs=1 \
      --lr=1.0 --gamma=0.7 \
      --hpu --autocast --use-torch-compile
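
The .sqsh file name produced by enroot import above follows from the image URI: the registry part before # is dropped, and / and : are replaced by +. The sketch below reproduces this mapping as it appears in the commands; the function name enroot_names is hypothetical, and the fallback to the tag latest for untagged URIs is an assumption rather than something shown in this example.

```python
from typing import Tuple


def enroot_names(uri: str) -> Tuple[str, str]:
    """Derive the squashfs file name and default container name for a docker:// URI."""
    prefix = "docker://"
    if not uri.startswith(prefix):
        raise ValueError("expected a docker:// URI")
    reference = uri[len(prefix):]

    # Drop the optional "REGISTRY#" part; it does not appear in the file name.
    if "#" in reference:
        reference = reference.split("#", 1)[1]

    # Split off the tag; "latest" is assumed when no tag is given.
    image, separator, tag = reference.rpartition(":")
    if not separator:
        image, tag = reference, "latest"

    stem = (image + "+" + tag).replace("/", "+")
    return stem + ".sqsh", stem


sqsh_file, container_name = enroot_names(
    "docker://vault.habana.ai#gaudi-docker/1.21.1/ubuntu24.04/"
    "habanalabs/pytorch-installer-2.6.0:latest"
)
print(sqsh_file)
print(container_name)
```

When enroot create is called without an explicit name, the container takes the file name without the .sqsh extension, which is the second value returned here. The containers that actually exist can be checked with enroot list.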

In a separate terminal on the same compute node, you can check the activity of the hardware:

# Log in to FTP
$ ssh <user>@ftp-x86-login.scc.kit.edu

# Log in to the active compute node (uses the first job ID that squeue lists)
$ srun --jobid $(squeue --format=%i -h | head -n 1) --pty bash

# Monitor the activity
$ hl-smi -l 1

You should see something like this:

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.21.1-fw-59.2.3.0          |
| Driver Version:                                     1.21.0-ca59b5a          |
| Nic Driver Version:                                 1.21.0-732bcf3          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-325L             N/A  | 0000:17:00.0     N/A |                   0  |
| N/A   45C   P0  222W /  900W  |131072MiB / 131072MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   1  HL-325L             N/A  | 0000:97:00.0     N/A |                   0  |
| N/A   44C   P0  223W /  900W  |131072MiB / 131072MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   2  HL-325L             N/A  | 0000:2c:00.0     N/A |                   0  |
| N/A   39C   P0  219W /  900W  |131072MiB / 131072MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   3  HL-325L             N/A  | 0000:ba:00.0     N/A |                   0  |
| N/A   39C   P0  220W /  900W  |131072MiB / 131072MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   4  HL-325L             N/A  | 0000:3d:00.0     N/A |                   0  |
| N/A   41C   P0  218W /  900W  |131072MiB / 131072MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   5  HL-325L             N/A  | 0000:a9:00.0     N/A |                   0  |
| N/A   43C   P0  222W /  900W  |131072MiB / 131072MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   6  HL-325L             N/A  | 0000:4e:00.0     N/A |                   0  |
| N/A   45C   P0  217W /  900W  |131072MiB / 131072MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   7  HL-325L             N/A  | 0000:cb:00.0     N/A |                   0  |
| N/A   43C   P0  224W /  900W  |131072MiB / 131072MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0       1661819     C   python                                  130400MiB  
|   1       1661823     C   python                                  130400MiB  
|   2       1661820     C   python                                  130400MiB  
|   3       1661825     C   python                                  130400MiB  
|   4       1661821     C   python                                  130400MiB  
|   5       1661824     C   python                                  130400MiB  
|   6       1661822     C   python                                  130400MiB  
|   7       1661826     C   python                                  130400MiB  
+=============================================================================+
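
To confirm from a script that every rank of the run has attached to a device, the Compute Processes rows can be parsed as well. This is a minimal sketch written against the rows shown above; the function name parse_compute_processes is hypothetical and the row layout is assumed from this sample output only.

```python
import re
from typing import Dict

# A busy row, e.g. "|   0       1661819     C   python     130400MiB"
PROCESS_ROW = re.compile(
    r"^\|\s*(?P<aip>\d+)\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+(?P<name>\S+)\s+(?P<mem>\d+)MiB"
)


def parse_compute_processes(text: str) -> Dict[int, Dict[str, object]]:
    """Map each device index to the process that occupies it; idle rows are skipped."""
    processes: Dict[int, Dict[str, object]] = {}
    for line in text.splitlines():
        row = PROCESS_ROW.search(line)
        if row:
            processes[int(row["aip"])] = {
                "pid": int(row["pid"]),
                "name": row["name"],
                "mem_mib": int(row["mem"]),
            }
    return processes


# Two busy rows copied from the output above, plus one idle row for contrast.
SAMPLE_PROCESSES = """
|   0       1661819     C   python                                  130400MiB
|   1       1661823     C   python                                  130400MiB
|   2        N/A   N/A    N/A                                      N/A        |
"""

processes = parse_compute_processes(SAMPLE_PROCESSES)
print(sorted(processes))
```

For the eight-rank run above, a complete result would contain the device indices 0 to 7.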