Performance study on GPU offloading via CUDA, OpenACC, and OpenMP in AMReX

From left to right: Gary Choi, me, and our advisor Chris Rycroft.

This past week, some of us from the Rycroft group attended SIAM CSE 19 in Spokane, WA. One of the presentations I gave was a poster on GPU application performance profiling. This is part of the summer CSGF practicum project I did at LBNL with the CCSE group.

The applications are written using the AMReX framework and launched on Summit (a supercomputer at OLCF). Though the performance of the same code on an HPC platform can change rapidly because of software and hardware updates, the poster reveals some interesting issues and common gotchas.

I had some good conversations with people who work closely with AMReX and/or the Summit system. A common question we have is how to use profiling tools to get more accurate estimates of low-level memory usage. For example, some of us have observed that in functions that have atomic operations (27 in this case, 9 per Cartesian direction), OpenMP (compiled with XL 16.1.1-1) allocates about 19% more registers per thread on the NVIDIA GPU than OpenACC does (compiled with PGI 18).

This leads to a lower theoretical occupancy (OpenMP 25% vs. OpenACC 31.2%). There is also a very large difference in kernel speed: OpenMP is about 4 times slower than OpenACC. However, it is not clear if low occupancy is the cause of the low speed. It would be nice if tools such as nvprof could give more detailed information about allocated registers and shared memory, beyond just the amounts, which could tell us whether the performance difference stems from memory allocation or something else.

See the poster here.