Cx as a service for sustainable HPC research software engineering

Introduction

Many scientific application codes have grown over many years or even decades. Legacy issues such as outdated programming techniques, a "monolithic" design or lack of automation hamper further development and constantly increase the effort required to validate new program versions. In addition, porting to new and innovative hardware architectures - which could possibly enable much higher computing power than established architectures - is only possible to a limited extent.

To preserve the investments already made in the existing code bases and to keep them viable in the future, sustainable software development is becoming increasingly important. Automation in the form of Continuous Integration, Continuous Testing and Continuous Deployment - CI/CT/CD or Cx for short - is of particular importance here.

Project Leader: Jennifer Buchmüller
Project Partners:

KIT: J. Buchmüller, R. Caspart, H. Anzt, T. Cojean, M. Selzer, B. Nestler
RWTH Aachen: C. Terboven
TU Darmstadt: Th. Reimann, A. Hück
PC²: H. Riebler, H. Köstler, M. Schroschk, J. Kunkel, C. Boehme

Participating NHR Centers:

NHR@KIT 
PC² 
NHR4CES@RWTH 
NHR4CES@TUDa 
NHR@FAU
ZIH
GWDG

Software/Library: GitLab, gitlab-runner

Project description 

Software is an integral part of modern research and has grown significantly in importance in recent years. Most software projects in a scientific environment fall into one of two categories: large projects developed over a long time, or small, short-lived projects. The former are written and maintained by a group of researchers over a long period, with the contributing members typically changing over time. Such a project can comprise millions of lines of code and serve as an integral part of one or more scientific domains. The latter are often written and maintained by a very small group of people, e.g. one doctoral and one postdoctoral researcher. These projects are typically only actively developed during the course of the associated theses and used for a few years afterwards, without being used further or handed over to succeeding doctoral researchers. The majority of scientific software projects focus on a single type of compute cluster with one specific architecture, e.g. a cluster with many x86 CPUs and a fast interconnect between the nodes. Because of the person power required, especially for large projects, porting or adapting to additional clusters is avoided as much as possible and only carried out when necessary. As a consequence, these projects are limited in where they can confidently be used, and researchers face a significant time investment when they intend to run them on other existing or future clusters.

Continuous Integration, Testing, Deployment and Benchmarking (CI, CT, CD and CB, or Cx for short) can significantly help with development on the one hand and with software maintenance and sustainable usage of these codes on the other. Consistent use of CI and CT fosters confidence in the code of a project and ensures that changes, new features or reworks do not break previously working parts of the software. In addition, defined procedures help with onboarding new researchers, who can follow the recipes and instructions for building and testing the software defined for the Cx process. These serve as implicit documentation and supplement existing documentation. Cx can therefore help to foster larger collaborations around a project. While these types of tests can also be performed by hand, CI and CT significantly reduce the time spent on testing and enable testing on a much broader range of systems and architectures, which facilitates sustainable software development with a view to using the software on different existing and future clusters. CB enriches the information provided to developers by CT and CI with additional runtime information, for example the time needed to solve a certain problem or the associated memory profile. CB thereby enables developers to judge whether additions to a project have any unwanted impact on its performance. In particular, CB helps to ensure that refactoring or reworking code does not degrade the performance of the software project.
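The kind of check that CB adds to a pipeline can be sketched as follows. This is a minimal illustration and not the project's tooling: the names BenchmarkResult, run_benchmark and find_regressions, the 10 % tolerance and the baseline values are all assumptions made for the example, and tracemalloc only tracks memory allocated by the Python interpreter.

```python
import time
import tracemalloc
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    seconds: float    # wall-clock time of the measured call
    peak_bytes: int   # peak memory allocated by Python during the call


def run_benchmark(func, *args):
    """Measure wall-clock time and peak Python memory of one call to func."""
    tracemalloc.start()
    start = time.perf_counter()
    func(*args)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return BenchmarkResult(seconds=seconds, peak_bytes=peak)


def find_regressions(baseline, current, tolerance=0.10):
    """Return the metrics in which current exceeds baseline by more than tolerance."""
    regressions = []
    if current.seconds > baseline.seconds * (1 + tolerance):
        regressions.append("seconds")
    if current.peak_bytes > baseline.peak_bytes * (1 + tolerance):
        regressions.append("peak_bytes")
    return regressions


def solve(n):
    """Stand-in for the problem being benchmarked (hypothetical workload)."""
    return sum(i * i for i in range(n))


# Hypothetical baseline, e.g. recorded from an earlier pipeline run.
baseline = BenchmarkResult(seconds=1.0, peak_bytes=1_000_000)
current = run_benchmark(solve, 100_000)
print(find_regressions(baseline, current))
```

In a CI job, a non-empty list could be used to fail the pipeline or to flag the change for review. A real setup would repeat the measurement several times to smooth out noise on shared systems.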

The project partners have experience with setting up and operating Cx setups at their respective HPC clusters and with using Cx for their research software projects. Following the layout of the preceding project in 2021, the work foreseen for this project can be split into two main parts. On the one hand, we will further develop and improve setups for Cx at the involved NHR centers and aim towards providing a unified solution for Cx in the context of HPC centers. In the preceding project, first steps were taken to identify commonalities in the Cx deployments at the centers, to derive common approaches to the setup and to move towards a unified Cx setup. The next steps in this project will be to provide these types of setups at additional centers and to enable the usage of additional types of resources for Cx, such as FPGAs, GPUs and different CPU architectures. A major part of this project will be working on an approach to user authentication and authorization for the Cx setups that is as common as possible across centers. Currently, the centers follow different protocols, ranging from requiring researchers to use their personal accounts, through project-specific service accounts, to general service accounts used for Cx. In addition, we aim to conduct a survey among the users at the NHR centers and those associated with them to gain insight into the experience and proficiency of users with Cx setups. This will serve as input for providing appropriate levels of documentation and training to meet the users' demand. It will also provide information on the requirements and expectations of users towards the offered Cx services and help to identify missing functionality.

On the other hand, we will further develop the field of continuous benchmarking (CB). Building on the last application, the concept of automatically creating benchmarks with Autotester is to be extended: instead of describing the execution rules that Autotester interprets to generate benchmarks in XML as before, visually representable workflows are to be used as execution rules. For this purpose, we plan to use the existing workflow solution from Kadi4Mat, which offers comprehensive possibilities for the visual creation, storage and execution of workflows. The workflows represent processes that can be freely assembled from executable components. Using workflows to generate the benchmarks is intended to make the tests easier to verify; the clear representation of the workflows and the restriction to a selection of allowed components both help here. This approach also permits the simple integration of heterogeneous software components into the process of producing the benchmarks.
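The idea of assembling a workflow from a restricted set of executable components can be illustrated with a small sketch. It does not use or mirror the Kadi4Mat workflow format or API; the component names, the Step class and the registry of toy functions are hypothetical and only show how an allowlist makes a workflow easy to check before it runs.

```python
from dataclasses import dataclass, field

# Hypothetical allowlist of component names; not Kadi4Mat's actual catalogue.
ALLOWED_COMPONENTS = {"build", "run_solver", "collect_timings"}


@dataclass
class Step:
    component: str                               # name of the executable component
    params: dict = field(default_factory=dict)   # keyword arguments for the component


def validate_workflow(steps):
    """Return the names of components that are not on the allowlist."""
    return [s.component for s in steps if s.component not in ALLOWED_COMPONENTS]


def run_workflow(steps, registry):
    """Run the steps in order, passing each step's output to the next one."""
    rejected = validate_workflow(steps)
    if rejected:
        raise ValueError(f"components not allowed: {rejected}")
    data = None
    for step in steps:
        data = registry[step.component](data, **step.params)
    return data


# Toy implementations standing in for real build and benchmark components.
registry = {
    "build": lambda data: {"binary": "solver"},
    "run_solver": lambda data, size=1: {**data, "size": size},
    "collect_timings": lambda data: {**data, "collected": True},
}

workflow = [Step("build"), Step("run_solver", {"size": 64}), Step("collect_timings")]
result = run_workflow(workflow, registry)
print(result)
```

Because validation happens before any component is executed, a workflow that references a component outside the allowlist is rejected as a whole rather than failing halfway through.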

The workflows can be created either directly in the web interface of Kadi4Mat or with a specialized desktop application and then uploaded to the Kadi4Mat data repository. Within the CI pipeline, workflows that have already been stored and released can be started via a REST interface on different HPC systems to generate the desired benchmarks. The benchmark results are to be stored first in the Kadi4Mat data repository, where they are directly accessible via the Kadi4Mat web interface. Alternatively, an automated query via Kadi4Mat's REST interface within the CI pipeline is also possible. In summary, storage in a fully fledged research data management (RDM) solution such as Kadi4Mat offers a wide range of options for further evaluation and publication of the benchmark results.


For both of these parts of the project, containerization plays an important role; it is the de facto standard in Cx services offered outside of the HPC environment, such as GitHub Actions or GitLab CI. We therefore need to further investigate and enable the usage of containers. To that end, we foresee a close interaction with the corresponding NHR central project in order to (1) make their established proposals and solutions for providing containers available to the users of our Cx services and (2) feed insights gained from, e.g., the user survey back to the container project.