ENCCS gave a workshop at AIDA on October 26, 2022 in two short sessions covering “Hyperparameter optimization using Optuna” and “Distributed PyTorch training: single and multiple node training”.
A fundamental challenge when developing deep neural networks is finding suitable hyperparameters, such as the number and size of the network's layers, the learning rate, and the regularization strength. With the ongoing upscaling of compute resources in Europe, particularly within the EuroHPC JU, there is an opportunity to leverage clusters to perform large hyperparameter sweeps. In the first session of this workshop, we gave an introduction to how the Optuna framework can be used to search more efficiently over large hyperparameter spaces of different variable types. The workshop also highlighted that this framework is not tied to optimizing machine learning models: it can in principle be used to find factors which optimize any black-box function, which is useful in e.g. Design of Experiments (DOE).
While compute clusters are easily used to perform embarrassingly parallel tasks such as cross-validation and hyperparameter search, using GPU clusters to perform synchronized optimization of neural networks adds complexity. In the second part of the workshop, we looked at how the built-in distributed training capabilities of PyTorch can be used to scale out from a single GPU on a single node to multiple GPUs over multiple nodes, using the torch.distributed package and the DistributedDataParallel wrapper.
The workshop was structured in a code-along style where a neural network was gradually adapted to run on multiple GPUs, with a focus on how to set up the process groups necessary to perform synchronized distributed training.
The workshop material can be found at: