Kungliga Biblioteket (KB) – Project update

The ENCCS collaboration with KB’s data scientists has focused on training Swedish language models on Vega, a EuroHPC JU petascale system.

The main goal was to perform multi-GPU node training on Vega. Standard practice for such tasks in a High-Performance Computing (HPC) environment is to use containers, so that the software stack and its compatibility can be controlled. The container runtime available on Vega is Singularity. The task was therefore to create a Singularity container that combines NVIDIA libraries compatible with the hardware, e.g., CUDA and NCCL, with the software stack needed to train the desired language model, such as PyTorch, Transformers, and DeepSpeed.
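As a rough illustration of the setup described above, the following Python sketch checks that the named packages can be found in the current environment and builds a `singularity exec --nv` command line. It is a minimal sketch, not the project's actual tooling: the helper names `missing_packages` and `singularity_command`, and the file names in the usage note, are hypothetical, and the import names are assumed to be the usual ones for PyTorch, Transformers, and DeepSpeed.

```python
import importlib.util

# Packages named in the text; the import names are an assumption
# (the usual ones for PyTorch, Transformers, and DeepSpeed).
REQUIRED = ("torch", "transformers", "deepspeed")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found here."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


def singularity_command(image, script, *script_args):
    """Build a `singularity exec --nv` command as an argument list.

    `image` and `script` are placeholders supplied by the caller.
    """
    # --nv makes the host's NVIDIA driver and GPUs visible in the container.
    return ["singularity", "exec", "--nv", image, "python", script, *script_args]
```

Run inside the container, `missing_packages()` returning an empty list suggests the stack is installed. A call such as `singularity_command("lm-train.sif", "train.py")` (both file names hypothetical) yields a list that can be passed to `subprocess.run` or joined into a line for a batch script.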

We created an NVIDIA-based container and used it to train the language model successfully on multiple-GPU nodes of Vega. This makes it possible for the KB team to train a much larger model, on an unprecedented amount of data, on Vega or a similar cluster. Ongoing efforts are now concentrated on producing a similar container for an AMD-based platform.
