NEK5000 and LUMI GPU Partitions

One of the main tasks for the ENCCS project is to enable Nek5000 to run on the LUMI GPU partition and other PRACE Tier-0 heterogeneous systems for large-scale simulations. It was announced at the LUMI press conference on October 21, 2020 that the LUMI GPU partition will be equipped with AMD Instinct GPUs delivering a peak performance of 550 PFLOPS. Furthermore, a recent LUMI blog post states:

“If you can currently build your OpenACC programs with the GNU compiler, you should be able to use OpenACC on LUMI. As an alternative to OpenACC, LUMI will support programs using OpenMP directives to offload code to the GPUs. OpenMP offloading has better support from the system vendor, meaning it may be worth considering porting your OpenACC code to OpenMP.”

Nek5000 has already been ported to GPUs using OpenACC [1]. Following this advice, however, we rewrote the code using OpenMP GPU offloading [2]. Fortunately, the mapping between OpenACC and OpenMP GPU offloading directives is rather straightforward and almost one-to-one. As an example, for the most time-consuming matrix-matrix multiplications in Nek5000, the OpenACC COLLAPSE clause is used to collapse the quadruply nested loop into a single loop:

!$ACC PARALLEL LOOP COLLAPSE(4)

The corresponding OpenMP GPU offloading directive is

!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO COLLAPSE(4)
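To make the mapping concrete, here is a C-style sketch of the kind of collapsed loop nest involved (Nek5000 itself is Fortran, and the function name, array sizes, and data layout below are illustrative assumptions, not the actual Nek5000 kernel). The quadruply nested loop computes a small dense matrix product per element, the pattern that dominates Nek5000's run time:

```c
#define N 4   /* points per direction per element (illustrative) */
#define E 2   /* number of elements (illustrative) */

/* w(i,j,k,e) = sum_l D(i,l) * u(l,j,k,e): an r-derivative-style
 * small dense matrix product, written with C index order reversed
 * relative to the Fortran original. The OpenACC form of the same
 * directive would be: #pragma acc parallel loop collapse(4)        */
void local_grad_r(double w[E][N][N][N], double u[E][N][N][N],
                  double D[N][N])
{
    #pragma omp target teams distribute parallel for collapse(4) \
            map(to: u[0:E], D[0:N]) map(from: w[0:E])
    for (int e = 0; e < E; e++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++) {
                    double s = 0.0;
                    for (int l = 0; l < N; l++)
                        s += D[i][l] * u[e][k][j][l];
                    w[e][k][j][i] = s;
                }
}
```

On a system without offloading support, the OpenMP target region simply falls back to host execution, which is convenient for testing the directives before moving to the GPU.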

Although we encountered many compilation issues, we first implemented OpenMP offloading in the Nek5000 mini-app, Nekbone. Nekbone uses Jacobi-preconditioned conjugate gradients and the gather-scatter operation, which are the principal computation and communication kernels of Nek5000. Consequently, Nekbone serves as a kernel benchmark for Nek5000.
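As a hedged illustration of the solver kernel named above, the following is a minimal Jacobi-preconditioned conjugate-gradient sketch in C. It is not Nekbone's code (Nekbone is Fortran and applies the operator matrix-free on spectral elements); here a small dense SPD matrix stands in for the operator, and all names are our own:

```c
/* Solve A x = b for symmetric positive-definite A, starting from x = 0,
 * using conjugate gradients with the Jacobi (diagonal) preconditioner
 * M = diag(A). Stops when the squared residual norm drops below tol2.
 * Returns the number of iterations performed. */
int pcg_jacobi(int n, double A[n][n], double b[n],
               double x[n], double tol2, int maxit)
{
    double r[n], z[n], p[n], q[n];
    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; }
    for (int i = 0; i < n; i++) z[i] = r[i] / A[i][i]; /* z = M^-1 r */
    for (int i = 0; i < n; i++) p[i] = z[i];
    double rz = 0.0;
    for (int i = 0; i < n; i++) rz += r[i] * z[i];
    int it;
    for (it = 0; it < maxit; it++) {
        for (int i = 0; i < n; i++) {                  /* q = A p */
            q[i] = 0.0;
            for (int j = 0; j < n; j++) q[i] += A[i][j] * p[j];
        }
        double pq = 0.0;
        for (int i = 0; i < n; i++) pq += p[i] * q[i];
        double alpha = rz / pq;
        double rr = 0.0;
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * q[i];
            rr += r[i] * r[i];
        }
        if (rr < tol2) { it++; break; }
        for (int i = 0; i < n; i++) z[i] = r[i] / A[i][i];
        double rz_new = 0.0;
        for (int i = 0; i < n; i++) rz_new += r[i] * z[i];
        double beta = rz_new / rz;                     /* Fletcher-Reeves-style update */
        for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }
    return it;
}
```

The loops over `i` in each stage are exactly the kind of independent, data-parallel work that the COLLAPSE-style offloading directives above target; in the matrix-free spectral-element setting, the dense product `q = A p` is replaced by per-element tensor contractions followed by a gather-scatter.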

Figure 1 – Performance results of Nekbone using OpenMP GPU offloading on a single NVIDIA P100 GPU. The performance depends critically on the computational workload of the GPU, the same behaviour as observed with OpenACC and CUDA [3]. The performance increases with the number of elements (E) and the polynomial order (N).
Figure 2 – Comparison of performance at fixed polynomial order (N=11) between OpenMP, OpenACC, and OpenACC+CUDA. The performance using OpenMP GPU offloading is slightly better than that using pure OpenACC.

We used a 3D eddy problem to verify and validate Nek5000 simulations on GPU systems. After running 100,000 steps, the maximum errors (relative to the exact solution) between the CPU version and the OpenACC version are O(10^-12) for the velocity fields and O(10^-8) for the pressure field.

References:

[1] Evelyn Otero, Jing Gong, Misun Min, Paul Fischer, Philipp Schlatter, and Erwin Laure, OpenACC acceleration for the PN-PN-2 algorithm in Nek5000, Journal of Parallel and Distributed Computing, Vol. 132, pp. 69-78.

[2] OpenMP: www.openmp.org

[3] Jing Gong, Stefano Markidis, Erwin Laure, Matthew Otten, Paul Fischer, and Misun Min, Nekbone performance on GPUs with OpenACC and CUDA Fortran implementations, The Journal of Supercomputing, Vol. 72, pp. 4160-4180.
