### Hybrid parallelized multigrid with OpenMP, GPU acceleration, and MPI

Many problems in computational sciences require solving linear systems, in the form of $\mathbf A \mathbf x=\mathbf b$. Examples includes integrating an system of ordinary differential equations (ODEs) or solving partial differential equations (PDEs), image processing and analysis, etc. When the problems become large, direct methods, such as LU decomposition, become inefficient. This makes using an iterative method, such as multigrid, very appealing for solving a large linear system.

In the following sections, I will describe briefly (1) some applications in my research area that call for multigrid; (2) how multigrid works in a nutshell; (3) designs of parallelism in the multigrid solver developed in our research group; (4) effects of adding OpenMP on the existing MPI implementation; (4) effects of adding GPU acceleration.  You could read the post sequentially, or use the following links to jump to any self-contained sections.

## Parallel design

In our research group, we are interested in building a fluid-structure interaction (FSI) problem solver for 3D problems. This leads to very large linear systems when we require high spatial resolution. We rely on our in-house multigrid solver for solving these large matrix equations. The multigrid algorithm has a $\mathcal{O}(n)$ time complexity, where $n$ is the number of unknowns.

#### The type of application

In the terminology of parallel computing, our FSI solver is a compute-intensive application, in particular, a high performance computing (HPC) application. HPC application is characterized by having multiple parallel tasks working on a part of the same problem. The tasks would need to communicate and nodes must synchronize with one another. Example applications include computational fluid dynamics, cosmological simulation, and climate model simulations.

This categorization is in contrast with compute-intensive but high throughput computing applications, where a large number of independent tasks need to be performed, and implementation is optimized to execute as many tasks as possible.

#### The level of parallelism

At the task level, the multigrid solver has been implemented with distributed memory parallelism via MPI, where the problem domain is divided into subdomains and each subdomain carries out computation simultaneously. We use ghost regions on the outer edge of the subdomain to store information about the neighbors. The ghost regions need to be communicated at certain points in the computation.

What this project aims to do is to add parallelism at the procedure level. This is also called medium grain parallelism, implementation is done via using OpenMP and/or GPU acceleration. The part that has the most potential, as shown in the profiling part below, is the smoothing step, e.g. Jacobi or Gauss-Seidel iterative method, for detailed discussion see section on Multigrid in a nut shell.

The code is written in C++, and it goes without saying that we use optimization compiler flags, such as $\texttt{-O3}$, to optimize our code. But following the good advice of getting the code to work first, before optimization, we’d not use any finer level of parallelism for now.

Very often, to test a numerical method and its implementation, we would run simulations with many different sets of parameters in a large parameter space. This is deployed at the application level, which comes with its own challenges to maintain good scalability. I would not consider application level parallelism for now in this project.

#### The parallel execution model

The parallelism model is Single Program-Multiple Data model or SPMD. This is when a single program is executed on all tasks, but depending on the load in each task, they may execute different parts of the program at a given time.

#### Profiling and estimating theoretical speed-up and scalability

Before we begin adding procedure level parallelism, let’s look at how multigrid performs with only task level (via MPI) parallelism. Since each level of the grid behaves similarly, let’s begin by looking at how one grid performs, i.e. how much time it spends on communication and doing the actual Gauss-Seidel sweeping. Communication is necessary because Gauss-Seidel depends on neighboring grip points, and to carry out the next sweep, we need to fill the ghost regions with updated values from the neighbors.

Later, we will look at the time the whole multigrid system spend on carrying out a V-cycle, and the breakdown of time in the two main components of a V-cycle.

#### Performance of smoothing at a given grid level

The figure below shows the time it takes to do 100 Gauss-Seidel sweeps to solve a Poisson problem with point sources in the domain $\Omega = [0,1]^3$, with $100\times 100 \times 100$ grid points. The speedup saturates after 8 MPI processes. The maximum speedup we observe is $\frac{4.776}{1.473} \approx 3.242$. Usually this is because of the overhead of parallelization. To confirm, we plot the percentage of time our grid spend on communication and Gauss-Seidel sweep versus the number of MPI processes. As the plot below shows, aggressive task division incurs more communication overhead, and leads to slight increase in the computation time. #### Performance of multigrid as a whole

Next we test how much speedup we gain in parallelizing the multigrid algorithm. Here we only test on a V-cycle to get a general idea of the performance. We keep track of the total time a V-cycle takes, as well as the time it takes to do smoothing, and restriction/interpolation. We repeat the V-cycle 5 times and take the average values. Times are plotted below.

This is very similar to what we have observed in the last subsection. Speedup saturates after 8 processes; after that excessive parallelization overhead leads to increase in execution time. Next let’s look at the percentage of time a single V-cycle spends on smoothing, and doing restriction/interpolation. We observe that a V-cycle consistently spend approximately 80% of time on smoothing, and 20% on restriction/interpolation. As the number of processes increase, communication overhead in restriction/interpolation most likely increases, driving up its time usage percentage. #### Effect on parallel design: overhead, remedies, speedups

Besides communication, other overheads include sequential part of the code, mainly in allocating memory and initializing various data structure needed for communication. These overhead costs are one time, i.e. they are done once at the beginning of the application.

Load balancing could be an issue here, because at a coarse grid, the total amount of work is small. The solver aggregates work in fewer number of MPI processes than originally allotted. Although this means some MPI processes would be waiting in these levels, the advantage is that we cut down communication cost when a problem can fit on a single node.

From profiling plots above, it seems to me that communications is one of the largest overhead. One of the simplest solution is, before diving into a very long simulation, we profile each application and make sure we use the appropriate number of MPI processes.

Let’s do a quick estimation. For a single V-cycle, roughly 80% of time is spend on smoothing. If we use GPU for this step and assume we have enough cores such that all parallelizable computations can be done all at once, and further assume that communications in GPU is much faster than CPU, we have a theoretical upper bound of speedup = 5, i.e. when Gauss-Seidel sweep part of the V-cycle takes negligible amount of time compared to restriction/interpolation.

## Implementation of OpenMP

#### Reordering

For a relaxation method like Jacobi, updates at each grid point use information from the previous iteration. Parallelization is straightforward. We use two solution arrays, and alternate writing to each array.

At first glance, Gauss-Seidel seems to be a sequential algorithm. Fortunately, matrices resulted from discretization partial differential equations are often sparse, which means to update a grid point only requires information from a few nearby grid points. This makes it possible to parallelize Gauss-Seidel.

When adding OpenMP parallelization to Gauss-Seidel, the order of computation has to be rearranged. If we compute in the natural ordering, we might run into race condition. A thread might read a value before or after a thread update it, leading to inconsistent results.

Consider a 1D Poisson problem $\nabla^2 u = f$, with finite difference stencil for the Laplacian operator being $\nabla^2 u_i = \frac{u_{i+1} - 2 u_i + u_{i-1}}{h^2}$, where $h$ is the grid spacing. Using this stencil, we would end up with a tridiagonal matrix. Notice the derivatives only depend on the immediate neighboring points and no other points further away.

Thus, we can first update all the even grid points, which only depend on the odd grid points. After that, we would update all the odd points. We could avoid race condition in this way. This is sometimes called the red-black ordering (if we label first set of points red and second set black. This is part of what’s called multicolor ordering in the lingo of iterative methods).

We can use the idea of red-black ordering, and modify it to fit our particular matrix problem. In my application, the stencil uses every point in the $3\times 3\times 3$ neighborhood of the grid point. To simplify the ordering, I arrange grid points in layers, each layers have the same $z$ coordinate. Hence one layer would only depend on grid points in the same layer, and two immediate neighboring layers.

The new ordering scheme is to compute the layers with even $z$ index first, then the odd layers. The resulting solution at intermediate steps will be different from the serial execution, because we have changed the update order, but the difference should be small. If the method converges, the slight difference in the solution at intermediate steps is not a problem, since eventually, both would converges to the same solution.

However, a caveat of reordering is that it might affect the convergence rate. Depending on the problem, the loss in convergence rate can be offset by the gain in added parallelism, but this is not always the case.

#### Convergence

I implemented the simple offset ordering scheme, with dynamic scheduling, and execute the code on the same Poisson problem we looked at in the previous section “Performance of smoothing at a given grid level,” i.e. solving the Poisson equation with point sources, in $100\times 100 \times 100$ grid, 100 iterations. To check that the parallel implementations are doing the right thing, we first look at how the residual reduces with each iteration.

In the figure below, the residual after each iteration is plotted, for both Jacobi and Gauss-Seidel method. For each method, serial and OpenMP implementation are both included. There appears to be only 3 lines, because for Jacobi, two implementations overlap exactly, which is expected because parallelized and serial Jacobi are mathematically equivalent. For Gauss-Seidel, OpenMP implementation has a slightly larger error, appears to be a constant shift above the serial version. The over all convergence rate is similar to the serial implementation. The difference in residual, as mentioned above, is due the change in the order of computation.

Comparing the solutions from parallel implementation and serial code, the $L_2$ norm of the difference, defined as $\int_{\Omega} ( \mathbf u_p - \mathbf u_s)^2 d\mathbf x$, is on the order of $10^{-6}$. Here $\mathbf u_p, ~\mathbf u_s$ denote parallel and serial solution, respectively. The integration is taken over the entire problem domain.

#### Speedup

Now that we have confirmed OpenMPimplementation is doing the right thing, let us look at the speedup. The figure below shows the execution time for running 100 iterations (left axis) and speedup (right axis), defined as $\frac{T_{serial}}{T_{parallel}}$. All data points are generated with only 1 MPI process.

The execution time decreases with increasing number of threads, as expected. However, the rate of decrease slows down, and the execution time tends to a constant. This is expected, as the number of threads increases, the time spent on overhead starts to dominate. Therefore, we cannot reduce the execution time further. Speedup is just another way to look at the performance of the parallelization. As shown in the figure above (right axis), the speed up saturates at around 10. Interestingly, speedups of both Jacobi and Gauss-Seidel increase linearly until approximately 16 threads. At large number of threads, the fluctuations in small execution time (in denominator  leads to large fluctuations in speedup. But it is apparent that Jacobi has a higher speedup than Gauss-Seidel in the limit of large number of threads.

This is partly because we only launch the threads once for Jacobi. But we have to launch the threads twice for Gauss-Seidel, once for even layers of points, and another time for odd layers, leading to more overhead.

#### Apply to Multigrid

To keep it simple, let’s look at applying Gauss-Seidel to multigrid. Since multigrid creates coarser and coarser grids, using the same number of threads throughout the entire algorithm can be inefficient. To gain some intuition on how many threads are needed for a given system size, we run a series Poisson problem tests (same test problem as in the section above), but vary the system size from 10 to 100. The speedups are plotted below. The plot shows for the larger systems, performance fluctuates quite a bit when we use more than 16 threads. For smaller systems, the speedup plateaus quickly after just a few threads. Therefore, we limit the maximum number of threads to use to be 16, and adjust the maximum number for smaller systems.

We run multigrid test on the same Poisson problem, with 6 V-cycles. At a grid level $l$, the number of Gauss-Seidel sweeps is set to $l+2$.

Recall from section “Performance of multigrid as a whole,” multigrid spends about 20% of total time on restriction and interpolation. Note that this part of the code is not multithreaded. The speedup, shown below, is calculated for both the grid smoothing part of the algorithm, and the algorithm as a whole, for system size 100 and 200. The speedups are much smaller than what we have observed for running Gauss-Seidel tests. This is a combined effect of the different amount of work threads have to do in multigrid compared to in just Gauss-Seidel sweeps. There is in total smaller and smaller amount of work on the coarser grids. On each level, the number of Gauss-Seidel iterations (~3-9 iterations) is also much smaller than what we have used in the Gauss-Seidel test (100 iterations).

For a larger system size ( $m=200$), since the ratios of parallelized computation to parallelization overhead, and sequential part are both larger, we achieve slightly better performance improvement. In the best case, when we use 4 MPI processes, we can achieve a factor of 2.5 speedup in Gauss-Seidel operations, and a factor of 1.8-2 speedup in multigrid as whole.

## Implementation of GPU acceleration

There are several steps in delegating the smoothing step to the GPU. First memory needs to be allocated and initialized on the GPU. This is done via cudaMalloc and cudaMemcpy. Then CUDA kernels are called to carry out the computations. Finally, we transfer the memory back to the CPU, using cudaMemcpy again.

Memory transfer between GPU and CPU is the most expensive part of the computation, we ought to develop code to maximize computation and minimize memory transfer.

There are several issues we need to address. Let’s assume that the entire problem can fit in a single GPU. First question is how to specify the matrix entries for each grid point. For example, if the matrix results from discretizing a PDE using finite difference, the banded structure of the matrix would correspond to the stencil. For periodic boundary condition, we would have the same stencil for every grid point. The boundary conditions are handled by using a ghost region, periodic boundary conditions means we would need to communicate the updated information at the end of every smoothing step.

If we have other type of boundary conditions, we need to use if statements to differentiate the grid points, by checking what type of boundary a grid point is on. Branching is inefficient in GPU, since it prevents GPU from executing the same code at the same time on all the threads on a warp. But in non-periodic cases, we have the trade-off of not having to communicate at all.

In the OpenMP implementation, we multithreaded on the even and odd z-layers. We design the GPU code in similar fashion. There is a crucial difference, however. In OpenMP implementation, computations are carried out in sequential order within each z-layer. In GPU, all operations are carried out in parallel. We expect the solution to be different again. We must check again if the algorithm converges in a similar manner to serial and OpenMP implementations.

#### Convergence and Speedup

In the test case we have seen so far, I have used non-periodic boundary conditions. This makes first pass in CUDA implementation simpler, since we don’t have to handle communication. Residual is plotted below. The residual from GPU solve decreases at a similar rate as Gauss-Seidel and Jacobi. However, the residual at 100th iteration is approximately twice as high as that of the serial and OpenMP implementations of Gauss-Seidel. The good news is that the GPU implementation is more than 30 times faster than its CPU counter part. For system size $m=100$, serial execution take about 4 seconds. Using 4 threads, OpenMP implementation takes about 2.5 seconds. Using GPU, it takes only 0.12 seconds. We could increase the number of iterations to compensate for the slower convergence.

#### Applying to multigrid

After restriction, the total number of grid points decreases by 8 fold. A good heuristic to use for moderate system sizes ( $~100^3$) is to only apply GPU to the finest grid. This could be modify in the future. In the tables below, I report the execution time, speedup, and the percentage of time multigrid spends on smoothing in several runtime configuration for $m=100, 200$. Discussions of the results are in the conclusion section.  ## Conclusion and Future work

• Take away:
• GPU though sounds appealing, finding and designing a suitable problem is crucial, this includes the type of computation and the size of the computation. For smaller problems, and for problems with frequent need of communication, using GPU does not improve performance as much as using OpenMP with a moderate number of threads (6-20 threads). Example of this can be seen in the red cells in the table above.
• Even though OpenMP can boost basic iterative method performance by a factor of 8-10 at best, when applied to distributed memory multigrid, the max speed achieved is only 2.5 – 4, due to smaller sizes of parallel work at the lower grid levels. Still better than using GPU.
• Using only MPI and OpenMP hybrid, I ran tests with much larger problem sizes on 2 nodes connected via fast (10Gb/s) ethernet. The results are tabulated above. We are able to solve for 1 billion unknowns within 1.5 minutes. Speedup versus serial execution is not calculated because there is no serial time. Using OpenMP, we observe 1.2 – 2 factor of speedup comparing to using a single thread. • Future work: Tests in the project was limited to a few workstations. Problem size on a single node is limited to N=200, by GPU RAM. I plan to further explore hybridization of MPI, OpenMP and GPU, by testing it on larger clusters, e.g. AWS or university cluster.