Training Swedish language models on Vega, a EuroHPC JU petascale system, has been the focus of the ENCCS collaboration with KB’s data scientists.
The main goal was to run training across multiple GPU nodes on Vega. Standard practice for such tasks in a High-Performance Computing (HPC) environment is to use containers, which keep the software stack and its compatibility under control. The container runtime available on Vega is Singularity. The task was therefore to build a Singularity container with NVIDIA libraries (e.g., CUDA and NCCL) that are compatible with the hardware, together with the software stack needed to train the desired language model, such as PyTorch, Transformers, and DeepSpeed.
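As an illustration of what "compatible" means in practice, the following is a minimal sketch of a sanity check that could be run inside such a container. It is not the script used in the project. The function name `collect_stack_report` and the report keys are hypothetical, and the package names are assumed to match their PyPI distribution names.

```python
from importlib import metadata

# Packages named in the text; distribution names are assumed to match PyPI.
REQUIRED_PACKAGES = ("torch", "transformers", "deepspeed")


def collect_stack_report(packages=REQUIRED_PACKAGES):
    """Return a dict with installed package versions and basic GPU visibility."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # package is not installed in this environment

    # Defaults used when PyTorch cannot be imported at all.
    report["cuda_build"] = None
    report["gpus_visible"] = 0
    report["nccl_available"] = False

    try:
        import torch
        import torch.distributed as dist
    except (ImportError, OSError):
        return report

    # CUDA version PyTorch was built against (None for CPU-only builds).
    report["cuda_build"] = torch.version.cuda
    # Number of GPUs the process can see on this node.
    if torch.cuda.is_available():
        report["gpus_visible"] = torch.cuda.device_count()
    # Whether this PyTorch build ships the NCCL backend for multi-GPU communication.
    report["nccl_available"] = bool(dist.is_available() and dist.is_nccl_available())
    return report


if __name__ == "__main__":
    for key, value in collect_stack_report().items():
        print(f"{key}: {value}")
```

With Singularity, a script like this would typically be run with `singularity exec --nv` so that the host GPU driver is exposed inside the container. A missing package, zero visible GPUs, or an unavailable NCCL backend points to a mismatch worth resolving before attempting a multi-node run.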
We created an NVIDIA-based container and used it to train the language model successfully on multiple GPU nodes of Vega. This makes it possible for the KB team to train a much larger model on an unprecedented amount of data on Vega or a similar cluster. Ongoing efforts are now concentrated on producing a similar container for an AMD-based platform.