Towards Automatic Performance Tuning of OpenACC Accelerated Scientific Applications
Abstract
OpenACC was announced at Supercomputing 2011 as a new standard for parallel programming targeting hardware
accelerators [1]. Although the standard aims to increase programmer productivity through compiler directives,
getting the best performance out of the target device is still tedious and requires performance tuning.
We propose a new methodology for the empirical runtime tuning of OpenACC accelerated scientific applications that relieves
the end user of this burden. This strategy has already proven effective for MPI communication operations in the Abstract
Data and Communication Library (ADCL) [2], and we plan to reuse parts of the ADCL framework. We present
the benefits of tuning OpenACC pragmas and clauses in an accelerated seismic imaging kernel. The application is compiled
with the PGI compiler and runs on the NVIDIA K20c GPU. The performance results obtained are encouraging for future
development of this methodology and its application to a larger spectrum of scientific applications.
Application: Seismic Imaging
Our experiments use the isotropic finite difference kernel that forms the building block
of Reverse Time Migration (RTM) and Full Waveform Inversion (FWI), two applications used extensively
by the oil and gas exploration industry for velocity model building and seismic imaging of the
subsurface. The Reverse Time Migration application uses forward modeling and backward
migration, both built on a finite difference kernel that solves the acoustic wave equation
\[ \frac{1}{c^2} \frac{\partial^2 P}{\partial t^2} = \nabla^2 P \]
where c is the velocity of the propagated wave and P is the wavefield pressure.
The 3D finite difference stencil scheme is 8th order in space and 2nd order in time. We plan to
extend this study to other orders in space to explore how computational intensity affects
the parameter choices of the auto-tuner.
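To make the scheme concrete, the following is a minimal sketch of one time step of such a stencil (8th order in space, 2nd order in time) annotated with OpenACC directives. It is not the kernel used in the experiments: the function name `fd_step`, the arguments `c2dt2` (velocity squared times time step squared, per grid point) and `inv_h2` (inverse squared grid spacing), the uniform grid spacing, and the `VECTOR_LEN` value are all assumptions made for illustration. The gang and vector placement shown is only one candidate among those a tuner would compare, and the data clauses copy the arrays on every call purely to keep the example short.

```c
#include <stddef.h>

#define RADIUS 4        /* 8th order in space: 4 neighbours on each side */
#define VECTOR_LEN 128  /* hypothetical tunable, not a value from the text */

/* One leapfrog time step:
 *   p_next = 2*p_cur - p_prev + c^2 dt^2 * laplacian(p_cur)
 * Arrays are flattened with x as the fastest-varying dimension.
 * Only interior points are updated; the halo of width RADIUS is left as is. */
void fd_step(const float *restrict p_prev, const float *restrict p_cur,
             float *restrict p_next, const float *restrict c2dt2,
             int nx, int ny, int nz, float inv_h2)
{
    /* Standard central-difference weights for an 8th-order second derivative. */
    const float w0 = -205.0f / 72.0f;
    const float w1 = 8.0f / 5.0f;
    const float w2 = -1.0f / 5.0f;
    const float w3 = 8.0f / 315.0f;
    const float w4 = -1.0f / 560.0f;

    const size_t n  = (size_t)nx * (size_t)ny * (size_t)nz;
    const size_t sy = (size_t)nx;               /* stride of one step in y */
    const size_t sz = (size_t)nx * (size_t)ny;  /* stride of one step in z */

    /* Candidate mapping: gangs over the collapsed (z, y) loops,
     * vector lanes over the contiguous x loop. */
    #pragma acc parallel loop gang collapse(2) vector_length(VECTOR_LEN) \
        copyin(p_prev[0:n], p_cur[0:n], c2dt2[0:n]) copy(p_next[0:n])
    for (int z = RADIUS; z < nz - RADIUS; ++z) {
        for (int y = RADIUS; y < ny - RADIUS; ++y) {
            #pragma acc loop vector
            for (int x = RADIUS; x < nx - RADIUS; ++x) {
                const size_t i = ((size_t)z * ny + y) * nx + x;
                const float lap = 3.0f * w0 * p_cur[i]
                    + w1 * (p_cur[i - 1] + p_cur[i + 1]
                          + p_cur[i - sy] + p_cur[i + sy]
                          + p_cur[i - sz] + p_cur[i + sz])
                    + w2 * (p_cur[i - 2] + p_cur[i + 2]
                          + p_cur[i - 2 * sy] + p_cur[i + 2 * sy]
                          + p_cur[i - 2 * sz] + p_cur[i + 2 * sz])
                    + w3 * (p_cur[i - 3] + p_cur[i + 3]
                          + p_cur[i - 3 * sy] + p_cur[i + 3 * sy]
                          + p_cur[i - 3 * sz] + p_cur[i + 3 * sz])
                    + w4 * (p_cur[i - 4] + p_cur[i + 4]
                          + p_cur[i - 4 * sy] + p_cur[i + 4 * sy]
                          + p_cur[i - 4 * sz] + p_cur[i + 4 * sz]);
                p_next[i] = 2.0f * p_cur[i] - p_prev[i]
                          + c2dt2[i] * inv_h2 * lap;
            }
        }
    }
}
```

Compiled without OpenACC support, the pragmas are ignored and the function runs serially on the host, which is convenient for checking correctness. In a real time loop the arrays would more likely be kept on the device with an enclosing data region rather than copied at every step.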
Performance Results
We present the performance results obtained by manually applying the proposed tuning methodology to the RTM isotropic modeling kernel. As shown
in Figure 3, the tuning procedure, applied to different 3D domain sizes, yields in many cases a significant performance
improvement (up to 30%) over the input code built with the default compiler optimizations. By default, the compiler implicitly applies gang and vector
clauses with heuristically determined parameter values. This shows that tuning OpenACC kernels can deliver a good performance
improvement while still programming at a higher level of abstraction. Moreover, the placement of the gang and vector clauses within the nested
loops (distinguished by different colors in Figure 1 and Table 1), as well as the grid and block sizes that give the best performance, vary with the
domain size. This reinforces the case for runtime performance tuning, especially when the problem size is a runtime
parameter that is unknown at compile time.
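The idea of selecting a variant at runtime can be sketched as a small bookkeeping structure: during the first iterations each candidate kernel variant (for example, a different gang and vector placement) is timed a fixed number of times, and the fastest one is used for all remaining iterations. This is only an illustration of the empirical approach under simple assumptions (exhaustive search, total time as the metric). It is not the ADCL API, and the names `tuner_t`, `tuner_init`, `tuner_next` and `tuner_record` are hypothetical.

```c
#include <stddef.h>

#define TUNER_MAX_VARIANTS 16  /* hypothetical upper bound on candidates */

typedef struct {
    size_t n_variants;                 /* number of candidate variants      */
    size_t trials;                     /* timed runs per variant            */
    size_t recorded;                   /* measurements recorded so far      */
    double total[TUNER_MAX_VARIANTS];  /* accumulated time per variant      */
    int    decided;                    /* 1 once the learning phase is over */
    size_t winner;                     /* fastest variant, valid if decided */
} tuner_t;

/* Prepare a learning phase of n_variants * trials timed iterations. */
void tuner_init(tuner_t *t, size_t n_variants, size_t trials)
{
    if (n_variants < 1) n_variants = 1;
    if (n_variants > TUNER_MAX_VARIANTS) n_variants = TUNER_MAX_VARIANTS;
    t->n_variants = n_variants;
    t->trials = trials < 1 ? 1 : trials;
    t->recorded = 0;
    t->decided = 0;
    t->winner = 0;
    for (size_t v = 0; v < TUNER_MAX_VARIANTS; ++v) t->total[v] = 0.0;
}

/* Variant to execute in the next iteration: candidates are tried one after
 * another during the learning phase, then the winner is returned. */
size_t tuner_next(const tuner_t *t)
{
    if (t->decided) return t->winner;
    return t->recorded / t->trials;
}

/* Report the measured time of one execution of `variant`. When every
 * candidate has been timed `trials` times, the fastest one is selected. */
void tuner_record(tuner_t *t, size_t variant, double seconds)
{
    if (t->decided || variant >= t->n_variants) return;
    t->total[variant] += seconds;
    t->recorded++;
    if (t->recorded == t->n_variants * t->trials) {
        size_t best = 0;
        for (size_t v = 1; v < t->n_variants; ++v)
            if (t->total[v] < t->total[best]) best = v;
        t->winner = best;
        t->decided = 1;
    }
}
```

In a time-stepping application, each iteration would call `tuner_next`, run the corresponding kernel variant, measure its wall-clock time after the device has finished, and pass that time to `tuner_record`. Because the decision is made while the application runs, it can adapt to a domain size that is only known at runtime.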
Conclusion
Although inserting OpenACC annotations is intended to be more
productive than writing native CUDA, the best possible OpenACC
performance cannot be obtained unless the developer goes through
a tedious tuning exercise. We proposed a new framework to
perform runtime performance tuning automatically for OpenACC
accelerated applications. The performance results obtained by hand
tuning a finite difference kernel justify future development of this
methodology and its application to a larger spectrum of scientific
applications.
Future Work
- Explore different computational intensities by increasing or
decreasing the order in space of the finite difference kernel and
by considering the 2D case.
- Implement the auto-tuning engine and test it with a variety of
scientific applications and on different generations of GPUs.
References
[1] OpenACC standard, www.openacc-standard.org/
[2] Saber Feki, Edgar Gabriel. A Historic Knowledge Based Approach for
Dynamic Optimization, in Proceedings of the International Conference on
Parallel Computing, 2009, P. 389 - 396.
[3] Yunsong Huang, Gerard T. Schuster. Multisource least-squares
migration of marine streamer and land data with frequency-division
encoding, Geophysical Prospecting, 2012, Vol. 60, P. 663 - 680.