Overview
Malfunction of the workspace file system: SOLVED
The malfunction of the workspace file system reported on July 7 was resolved on July 18. I/O errors should no longer occur when accessing the data.
The cause was several hard disks failing at the same time. As a result, less than 1 percent of the file system could no longer be accessed. To repair this area, a hard disk with slightly outdated data had to be installed in the system. It is therefore theoretically possible that the content of fewer than one in a million files has changed (silent data corruption).
Please note:
Unfortunately, we cannot say which files are affected. However, any affected files are located in the 1 percent of the file system that could no longer be accessed. To check whether a given file was located in this area, and could therefore theoretically be affected by silent data corruption, run the following command:
/hkfs/work/data_corrupt/check_file_corrupt.sh "FILENAME"
(replace FILENAME with the name of the file you want to check).
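To check many files at once, the command above can be wrapped in a loop. The following is a minimal sketch, not an official tool: the helper name `check_tree` and the `CHECKER` variable are assumptions introduced here, and the output format of `check_file_corrupt.sh` is not documented in this section, so the sketch simply prints whatever the script reports for each file.

```shell
# Path of the check script as given above; overridable via the
# (hypothetical) CHECKER variable so the loop can be tried elsewhere.
CHECKER="${CHECKER:-/hkfs/work/data_corrupt/check_file_corrupt.sh}"

# check_tree DIR: run the checker once for every regular file below DIR.
# Assumes file names do not contain newline characters.
check_tree() {
    dir="$1"
    find "$dir" -type f -print | while IFS= read -r f; do
        printf '== %s\n' "$f"
        # stdin is redirected so the checker cannot consume the file list
        "$CHECKER" "$f" </dev/null
    done
}
```

A call such as `check_tree /path/to/your/workspace` (placeholder path) would then print one `==` header per file, followed by the script's own output for that file.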
We also created a list of the file names that were located in the aforementioned 1 percent of the file system. The file names belonging to a single workspace are listed in the file possibly_corrupt_files_in_this_workspace.txt in the directory of that workspace. The probability that any listed file is actually affected by silent data corruption is very low.
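If a workspace contains many files, the list can be filtered for the ones that matter to a particular calculation. The sketch below assumes that the list file contains one file name per line, which is not stated in this section; the helper name `listed_matching` is hypothetical.

```shell
# Name of the per-workspace list file, as given above.
LIST_NAME="possibly_corrupt_files_in_this_workspace.txt"

# listed_matching WORKSPACE_DIR SUBSTRING
# Print every entry of the workspace's list that contains SUBSTRING.
# Assumption: the list holds one file name per line.
listed_matching() {
    ws="$1"
    pattern="$2"
    # -F treats the pattern as a fixed string, not a regular expression
    grep -F -- "$pattern" "$ws/$LIST_NAME"
}
```

For example, `listed_matching /path/to/your/workspace results/` (placeholder arguments) would print only those listed entries whose name contains `results/`. Note that `grep` returns a non-zero exit status when nothing matches.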
We recommend that you carefully check the data used in the calculations and the results.
If you discover silent data corruption, please report it in a ticket: https://nhr-helpdesk.scc.kit.edu.
Welcome to the Tier 2 High Performance Computing system "Hochleistungsrechner Karlsruhe" (HoreKa) at KIT.
HoreKa is an innovative hybrid system with more than 60,000 processor cores, nearly 300 terabytes of main memory and more than 750 NVIDIA (A100 and H100) GPUs. The CPU partition is called HoreKa Blue, while the GPU partition is called HoreKa Green and the NVIDIA H100 GPU partition is called HoreKa Teal.

A 200 GBit/s non-blocking InfiniBand HDR network is used as the communication network, and two parallel Spectrum Scale file systems with a total capacity of more than 15 petabytes are used for data storage.
A key consideration during the design of the system was the enormous amount of data generated by scientific research projects. A multi-level data storage architecture guarantees high-throughput processing on external storage systems.
HoreKa is housed in a dedicated computer building on KIT's North Campus, which was newly constructed in 2015 for its predecessor ForHLR. The award-winning, energy-efficient hot water cooling concept is continued with the new system.
This documentation includes some information about the Helmholtz AI HAICORE partition (HAICORE@KIT) since HAICORE is integrated with HoreKa.
Access to HAICORE is open to all researchers from the Helmholtz AI community. Please refer to the official website for more information, e.g. on how to hand in a project proposal.
Access
Employees of all Helmholtz research institutes in Germany can self-register for HAICORE.
Use of HAICORE is free of charge, but the resource has to be acknowledged in all publications.