GPGPU: OpenACC Implementation
On an HPC system, a code optimization reaches the global minimum of the computational cost only if the programming strategy is tailored to the targeted hardware. Such an optimization can be pursued for a specific technical application; however, it requires a large number of performance evaluations and therefore a non-trivial computational effort for tuning the various functions in an HPC environment. The focus of this sub-project is the development of an optimal workflow on heterogeneous hardware, i.e., the CPU and GPU processors of JUWELS. The performance assessment steps form the scientific core of this project.
In addition to the optimization of the MPI/OpenMP code, GPU programming enables a more detailed tuning of the individual subroutines. The assessment of the CFD solver with respect to HPC performance metrics, defined for MPI, OpenMP threading, parallel file I/O, and GPU-based acceleration, acts as a sensitivity map that identifies the software tuning strategies mainly responsible for the final HPC performance.
Applying this analysis to both the baseline code without GPU programming and a configuration with GPU programming via OpenACC and CUDA improves the understanding of how the GPU kernels modify the software architecture and, in particular, of their effect on the HPC performance. In the long run, this will greatly facilitate the identification of other, possibly even more effective GPU-based computing approaches, such as modular supercomputing on the CPU/GPU systems of JUWELS.
Implementation
The present HPC tool consists of various modules designed to solve multi-scale, multi-physics problems. The solver has been applied successfully to a variety of CFD problems. The numerical method is implemented using high-order finite-difference schemes in space and a Runge-Kutta method for time integration.
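For illustration only, the following self-contained sketch combines the two building blocks named above for a periodic linear advection problem; the fourth-order central difference, the classical four-stage Runge-Kutta scheme, and all names and parameters are chosen for the example and are not taken from the production solver.
program fd_rk_sketch
  implicit none
  integer, parameter :: n = 256, nsteps = 1000
  real(8), parameter :: pi = 3.141592653589793d0
  real(8), parameter :: c = 1.0d0, dx = 2.0d0*pi/n, dt = 0.2d0*dx/c
  real(8) :: u(n), u0(n), k1(n), k2(n), k3(n), k4(n)
  integer :: i, step

  ! initial condition: one sine wave on the periodic domain [0, 2*pi)
  do i = 1, n
     u(i) = sin((i-1)*dx)
  end do

  do step = 1, nsteps
     u0 = u
     ! classical four-stage Runge-Kutta update
     k1 = rhs(u0)
     k2 = rhs(u0 + 0.5d0*dt*k1)
     k3 = rhs(u0 + 0.5d0*dt*k2)
     k4 = rhs(u0 + dt*k3)
     u  = u0 + dt/6.0d0*(k1 + 2.0d0*k2 + 2.0d0*k3 + k4)
  end do
  print *, 'max|u| after', nsteps, 'steps:', maxval(abs(u))

contains

  ! right-hand side -c*du/dx with a fourth-order central difference
  function rhs(v) result(r)
    real(8), intent(in) :: v(n)
    real(8) :: r(n)
    integer :: ii, im1, im2, ip1, ip2
    do ii = 1, n
       im1 = modulo(ii-2, n) + 1;  im2 = modulo(ii-3, n) + 1
       ip1 = modulo(ii,   n) + 1;  ip2 = modulo(ii+1, n) + 1
       r(ii) = -c*(-v(ip2) + 8.0d0*v(ip1) - 8.0d0*v(im1) + v(im2))/(12.0d0*dx)
    end do
  end function rhs

end program fd_rk_sketch
Illustrative combination of a high-order central difference with a Runge-Kutta time integrator (example only; not the production discretization).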
type(block_field_t), pointer :: fieldlocal => null()   ! block_field_t is a placeholder name for the block-field derived type
...
DO iteration = 1, max_iterations
  DO localBlockId = 1, size_of_localDomain_block
    fieldlocal => localDomain_block_field(localBlockId)
    !$acc data create(fieldlocal)
    !$acc update device(fieldlocal)
    DO RK_Stage = 1, MAX_RK_STAGE
      call Required_Subroutines(..., fieldlocal, ...)
      ! synchronize the field with the host between stages and push it back to the GPU
      !$acc update self(fieldlocal)
      !$acc update device(fieldlocal)
    END DO
    !$acc end data
  END DO
END DO
Structure of the main subroutine with the GPU region defined by the OpenACC directives.
! Subroutine Required_Subroutines
!$acc parallel present(fieldlocal(:,:,:,:))
!$acc loop gang
DO k = 1, DIM_MAX_K
  !$acc loop vector collapse(2)
  DO j = 1, DIM_MAX_J
    DO i = 1, DIM_MAX_I
      fieldlocal(...,:) = ...
      ...
    END DO
  END DO
END DO
!$acc end parallel
Parallelization of the DO-loops in the subroutine using OpenACC directives.
In the Runge-Kutta steps of the main subroutine, the data directive defines a region of code in which the GPU arrays remain on the GPU and are shared among all kernels in that region. Most of the simulation time is spent on computing the new flux distribution in the Runge-Kutta stages. To improve the overall performance, the flux vector calculation must therefore be accelerated. In the recent implementation, OpenACC programming has been tested to port the original MPI/OpenMP code to the JUWELS GPU cluster.
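As a self-contained illustration of this pattern, independent of the production code, the following sketch keeps an array resident on the GPU across two kernels inside one data region and synchronizes with the host only at the region boundaries; all names and sizes are hypothetical.
program data_region_sketch
  implicit none
  integer, parameter :: n = 1024
  real(8) :: field(n, n)
  integer :: i, j

  field = 1.0d0

  ! the data region allocates 'field' on the GPU; both kernels below reuse the
  ! same device copy, so no host-device transfers occur between them
  !$acc data create(field)
  !$acc update device(field)

  !$acc parallel loop collapse(2) present(field)
  do j = 1, n
     do i = 1, n
        field(i, j) = 2.0d0*field(i, j)      ! first kernel: scale the field
     end do
  end do

  !$acc parallel loop collapse(2) present(field)
  do j = 1, n
     do i = 1, n
        field(i, j) = field(i, j) + 1.0d0    ! second kernel: reuse the device copy
     end do
  end do

  ! copy the result back to the host only once, at the end of the region
  !$acc update self(field)
  !$acc end data

  print *, 'checksum =', sum(field)          ! expected value: 3*n**2
end program data_region_sketch
Illustrative data region sharing one device array among two kernels (example only).
With the NVIDIA (PGI) Fortran compilers, such regions are typically enabled with the -acc option, and -Minfo=accel reports how each loop nest is mapped onto the GPU.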
The subroutines Required_Subroutines contain nested DO-loops that calculate the field vector at the new time step. The hotspot detected in the former tuning workshop in 2019 is parallelized using OpenACC directives. The innermost iteration loops are collapsed to increase the vector length and to improve the memory-bandwidth utilization.
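The effect of the collapse clause can be seen in the following self-contained sketch with arbitrary, hypothetical array sizes: the outer loop is distributed over gangs, and collapse(2) fuses the two inner loops so that each gang works on ni*nj vector-parallel iterations instead of only ni.
program collapse_sketch
  implicit none
  integer, parameter :: ni = 64, nj = 64, nk = 64
  real(8) :: f(ni, nj, nk)
  integer :: i, j, k

  f = 0.0d0

  !$acc data copy(f)
  !$acc parallel loop gang present(f)
  do k = 1, nk
     ! the two inner loops are fused into ni*nj = 4096 vector iterations per gang
     !$acc loop vector collapse(2)
     do j = 1, nj
        do i = 1, ni
           f(i, j, k) = f(i, j, k) + real(i + j + k, 8)
        end do
     end do
  end do
  !$acc end data

  print *, 'checksum =', sum(f)
end program collapse_sketch
Illustrative gang/vector mapping with collapsed inner loops (example only).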
Baseline Performance
Scaling behavior of the CFD solver on the JUWELS GPU cluster of JSC. The test was performed on a structured grid of size 201$\times$201$\times$201. Case I corresponds to the CPU-only runs, case P to the runs using four GPUs.
| Case ID | # of cores | # of MPI ranks | # of GPUs | Runtime (s) | Speedup |
| --- | --- | --- | --- | --- | --- |
| I | 48 | 8 | - | 1.24 | 1.0 |
| I | 48 | 12 | - | 0.85 | 1.5 |
| I | 48 | 24 | - | 0.43 | 2.9 |
| P | 8 | 8 | 4 | 0.74 | 1.7 |
| P | 16 | 16 | 4 | 0.61 | 2.0 |
| P | 32 | 32 | 4 | 0.59 | 2.1 |
| P | 48 | 48 | 4 | 0.66 | 1.9 |
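The speedup values are consistent with the CPU-only run on 8 MPI ranks ($t_{\mathrm{ref}} = 1.24\,$s) taken as the reference, i.e., speedup $= t_{\mathrm{ref}}/t$; for instance, the fastest GPU configuration gives $1.24/0.59 \approx 2.1$.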
Plan
The code runs with the current software stack on the JUWELS Booster, which is equipped with NVIDIA A100 GPUs. The performance improvements achieved with new GPGPU features and CI will be reported in 2021.