Porting BCPNNSim to GPUs

ENCCS recently started a collaboration with the Computational Brain Science Lab at KTH Royal Institute of Technology to accelerate their BCPNNSim code on heterogeneous systems. BCPNNSim is an open-source code for scalable parallel simulation of Bayesian Confidence Propagation Neural Networks (BCPNN). A BCPNN module features Bayesian-Hebbian synaptic plasticity as well as structural plasticity for unsupervised and supervised learning. The code has been used successfully to simulate reduced brain models of, for example, associative memory, and to run machine learning benchmarks such as MNIST, SVHN, and CIFAR-10.

In the first stage, we focused on accelerating a proxy application, AssoMem (Associative Memory). AssoMem uses brain-like machine learning and takes its main algorithm from BCPNN, i.e. the neural computations critical for brain science and machine learning. In this approach, matrix-vector multiplications dominate the total execution time. To validate the benefit of GPUs, we started by porting the code with OpenACC. Once it became clear that the GPU delivers a significant speed-up, we also started working on a CUDA version of the code, which should allow us to make full use of the GPU's features.
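
To illustrate the kind of loop that dominates the run time, here is a minimal sketch of a dense matrix-vector product offloaded with OpenACC. It is not taken from AssoMem: the function name matvec_acc, the row-major layout, and the single-precision type are assumptions made for the example, and the actual code may store and traverse its weights differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from BCPNNSim: y = W * x for a dense,
   row-major matrix W with `rows` rows and `cols` columns. */
void matvec_acc(const float *restrict W, const float *restrict x,
                float *restrict y, int rows, int cols)
{
    assert(rows > 0 && cols > 0);

    /* Copy W and x to the device, compute one output row per iteration,
       and copy y back. A compiler without OpenACC support ignores the
       pragmas, so the same loops then run serially on the CPU. */
    #pragma acc parallel loop copyin(W[0:rows*cols], x[0:cols]) copyout(y[0:rows])
    for (int i = 0; i < rows; ++i) {
        float sum = 0.0f;
        /* Inner dot product, expressed as a sum reduction. */
        #pragma acc loop reduction(+:sum)
        for (int j = 0; j < cols; ++j) {
            sum += W[(size_t)i * cols + j] * x[j];
        }
        y[i] = sum;
    }
}
```

Calling matvec_acc(W, x, y, rows, cols) from existing host code is enough to try the offload. In a real port the copyin of W would normally be hoisted into an enclosing data region, so that the weights stay on the GPU between calls instead of being transferred every time.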

Together with the code maintainers, we participated in the hackathon organized by ENCCS and NVIDIA, where we received valuable assistance from Mattias Noack of NVIDIA. During the hackathon, the code was carefully profiled and analyzed with the NVIDIA Nsight tools. One of the first optimizations based on this analysis was to introduce asynchronous OpenACC calls. We also further optimized the CUDA version, which is not yet fully ported to the GPU and currently slightly underperforms the OpenACC version. To be able to target AMD GPUs in the future, we created an initial version of the code with OpenMP GPU offloading. We also tested the cuBLAS and cuSPARSE libraries together with the OpenACC API. They proved useful for the code and will probably be used in future versions.
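
As a rough illustration of what asynchronous OpenACC calls look like, the sketch below splits an array into chunks and places each chunk's transfers and kernel on an async queue, so that work on different queues can overlap. Which AssoMem calls were actually made asynchronous is not described here, so the function update_chunks_async, the two-queue setup, and the scale-and-shift kernel are all assumptions chosen only to show the pattern.

```c
#include <assert.h>

#define NUM_QUEUES 2  /* assumed number of async queues for this sketch */

/* Hypothetical example, not from BCPNNSim: apply data[i] = gain * data[i] + bias
   chunk by chunk, with each chunk assigned to an async queue. */
void update_chunks_async(float *data, int n, int chunk, float gain, float bias)
{
    assert(n > 0 && chunk > 0);

    /* Allocate device storage once; chunks are transferred individually. */
    #pragma acc data create(data[0:n])
    {
        for (int start = 0; start < n; start += chunk) {
            int len = (start + chunk <= n) ? chunk : n - start;
            int q = (start / chunk) % NUM_QUEUES;  /* round-robin queue id */

            /* Upload, compute, and download are enqueued on queue q and
               return immediately, so the host can enqueue the next chunk. */
            #pragma acc update device(data[start:len]) async(q)
            #pragma acc parallel loop present(data[start:len]) async(q)
            for (int i = start; i < start + len; ++i) {
                data[i] = gain * data[i] + bias;
            }
            #pragma acc update self(data[start:len]) async(q)
        }
        /* Block until every queue has finished before leaving the region. */
        #pragma acc wait
    }
}
```

Operations on the same queue execute in order, while operations on different queues may overlap, which is where the benefit comes from. Without OpenACC support the pragmas are ignored and the function simply processes the chunks one after another on the CPU.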

As it currently stands, the total execution time of AssoMem drops from 21600 seconds for the serial version to 157 seconds on a single A100 GPU, a speed-up of roughly 138 times. Compared with the previous MPI CPU version of AssoMem (3869 seconds with 30 MPI ranks on Beskow at PDC), the GPU version is about 25 times faster. We continue to collaborate on accelerating the BCPNNSim code, with special effort dedicated to multi-GPU support.