Towards Automatic Performance Tuning of OpenACC Accelerated Scientific Applications

Abstract

OpenACC was announced at Supercomputing 2011 as a new standard for parallel programming targeting hardware accelerators [1]. Although the standard aims to increase programmer productivity through compiler directives, extracting the best performance from the target device remains tedious and requires performance tuning. We propose a new methodology for empirical runtime tuning of OpenACC accelerated scientific applications to relieve the end user of this burden. This strategy has already proven effective for MPI communication operations in the Abstract Data and Communication Library (ADCL) [2], and we plan to partially reuse the ADCL framework. We present the benefits of tuning OpenACC pragmas and clauses in an accelerated seismic imaging kernel. The application is compiled with the PGI compiler and runs on an NVIDIA K20c GPU. The performance results are encouraging for future development of this methodology and its application to a larger spectrum of scientific applications.

Application: Seismic Imaging

In our experiments we used the isotropic finite difference kernel, which is the building block of Reverse Time Migration (RTM) and Full Waveform Inversion (FWI), both extensively used by the oil and gas exploration industry for velocity model building and seismic imaging of the subsurface. RTM performs forward modeling and backward migration using a finite difference kernel that solves the acoustic wave equation.

In its standard constant-density form, without a source term, the equation reads

$$\frac{1}{c^2}\frac{\partial^2 P}{\partial t^2} = \nabla^2 P$$

where c is the velocity of the propagated wave and P is the wavefield pressure. The 3D finite difference stencil scheme is 8th order in space and 2nd order in time. In future work we plan to extend this study to other spatial orders to explore how computational intensity affects the parameter choices of the auto-tuner.
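The time-stepping scheme described above can be sketched in C as follows. This is a minimal sketch rather than the kernel used in the experiments: it assumes the constant-density acoustic wave equation, a uniform grid spacing, and an untouched halo of four points per side in place of real boundary conditions. The function name fd_step, the array names, and the particular placement of the OpenACC gang and vector clauses are illustrative assumptions. A compiler without OpenACC support ignores the pragmas and runs the loops serially.

```c
#include <assert.h>
#include <stddef.h>

#define R 4 /* stencil radius: 8th order in space */

/* Flattened 3D index with x as the fastest-varying dimension. */
#define IDX(i, j, k) (((size_t)(k) * ny + (size_t)(j)) * nx + (size_t)(i))

/*
 * One time step, 2nd order in time:
 *   next = 2*cur - prev + (c*dt/h)^2 * Laplacian_h(cur)
 * vel2dt2 holds c^2 * dt^2 / h^2 per grid point (assumed layout).
 * Only interior points are updated; the halo of R points is left untouched.
 */
void fd_step(int nx, int ny, int nz,
             const float *restrict prev, const float *restrict cur,
             float *restrict next, const float *restrict vel2dt2)
{
    /* Standard 8th-order central-difference weights for a second derivative. */
    const float w[R + 1] = { -205.0f / 72.0f, 8.0f / 5.0f, -1.0f / 5.0f,
                             8.0f / 315.0f, -1.0f / 560.0f };
    const size_t n = (size_t)nx * ny * nz;
    (void)n; /* only referenced by the OpenACC data clauses */
    assert(nx > 2 * R && ny > 2 * R && nz > 2 * R);

    /* Illustrative placement: gangs over k, vector lanes over i. */
    #pragma acc parallel loop gang copyin(prev[0:n], cur[0:n], vel2dt2[0:n], w[0:R+1]) copy(next[0:n])
    for (int k = R; k < nz - R; ++k) {
        for (int j = R; j < ny - R; ++j) {
            #pragma acc loop vector
            for (int i = R; i < nx - R; ++i) {
                /* The centre point contributes once per spatial direction. */
                float lap = 3.0f * w[0] * cur[IDX(i, j, k)];
                #pragma acc loop seq
                for (int r = 1; r <= R; ++r) {
                    lap += w[r] * (cur[IDX(i + r, j, k)] + cur[IDX(i - r, j, k)] +
                                   cur[IDX(i, j + r, k)] + cur[IDX(i, j - r, k)] +
                                   cur[IDX(i, j, k + r)] + cur[IDX(i, j, k - r)]);
                }
                next[IDX(i, j, k)] = 2.0f * cur[IDX(i, j, k)] - prev[IDX(i, j, k)]
                                   + vel2dt2[IDX(i, j, k)] * lap;
            }
        }
    }
}
```

Each interior point reads 25 neighbouring values and writes one, so the choice of which loop is mapped to gangs and which to vector lanes determines how memory accesses are grouped on the device. That choice is what the tuning discussed in the next section varies.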

Performance Results

We present the performance results of manually applying the proposed tuning methodology to the RTM isotropic modeling kernel. As shown in Figure 3, the tuning procedure, applied to different 3D domain sizes, leads in many cases to a significant performance improvement (up to 30%) over the input code compiled with the default compiler optimizations. By default, the compiler implicitly applies gang and vector clauses with heuristically determined parameter values. These results show that tuning OpenACC kernels can yield a substantial performance improvement while still programming at a higher level of abstraction. Moreover, the placement of the gang and vector clauses within the nested loops (distinguished by different colors in Figure 1 and Table 1), as well as the grid and block sizes that give the best performance, vary with the domain size. This strengthens the case for runtime performance tuning, especially when the problem size is a runtime parameter that is unknown at compile time.
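The selection step implied by these results can be sketched as a brute-force runtime tuner: each candidate configuration is timed for a few iterations of the time loop, and the fastest one is used from then on. This is a minimal sketch under stated assumptions, not the ADCL API or the implementation used for the measurements above. The names acc_config, tuner, tuner_init, tuner_next and tuner_record are hypothetical, and a candidate is reduced here to a pair of values that would feed the OpenACC num_gangs and vector_length clauses.

```c
#include <assert.h>
#include <float.h>

#define MAX_CANDIDATES 16

/* Hypothetical launch configuration for one code variant. */
typedef struct {
    int gangs;      /* value intended for num_gangs() */
    int vector_len; /* value intended for vector_length() */
} acc_config;

typedef struct {
    acc_config cand[MAX_CANDIDATES];
    double total[MAX_CANDIDATES]; /* accumulated time per candidate */
    int ncand;
    int trials; /* timed iterations per candidate */
    int calls;  /* measurements recorded so far */
    int best;   /* index of the winner, -1 while still learning */
} tuner;

void tuner_init(tuner *t, const acc_config *cand, int ncand, int trials)
{
    assert(ncand > 0 && ncand <= MAX_CANDIDATES && trials > 0);
    t->ncand = ncand;
    t->trials = trials;
    t->calls = 0;
    t->best = -1;
    for (int i = 0; i < ncand; ++i) {
        t->cand[i] = cand[i];
        t->total[i] = 0.0;
    }
}

/* Index of the configuration to use for the next iteration. */
int tuner_next(const tuner *t)
{
    if (t->best >= 0)
        return t->best;            /* learning phase is over */
    return t->calls / t->trials;   /* each candidate in turn, 'trials' times */
}

/* Report the elapsed time of the iteration that ran with tuner_next(). */
void tuner_record(tuner *t, double seconds)
{
    if (t->best >= 0)
        return;
    t->total[t->calls / t->trials] += seconds;
    if (++t->calls == t->ncand * t->trials) {
        /* All candidates measured: keep the one with the lowest total time. */
        double best_time = DBL_MAX;
        for (int i = 0; i < t->ncand; ++i) {
            if (t->total[i] < best_time) {
                best_time = t->total[i];
                t->best = i;
            }
        }
    }
}
```

In the time loop, the application would call tuner_next before each step, launch the kernel with the selected configuration, time it, and pass the elapsed time to tuner_record. Because the search runs during the first iterations of the production run, it can adapt to a domain size that is known only at runtime.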

Conclusion

Although inserting OpenACC annotations is intended to be more productive than writing native CUDA, the best possible OpenACC performance cannot be obtained unless the developer goes through a tedious tuning exercise. We proposed a new framework that performs runtime performance tuning automatically for OpenACC accelerated applications. The performance results obtained by hand-tuning a finite difference kernel justify further development of this methodology and its application to a larger spectrum of scientific applications.

Future Work

References

[1] OpenACC standard, www.openacc-standard.org/
[2] Saber Feki, Edgar Gabriel. A Historic Knowledge Based Approach for Dynamic Optimization, in Proceedings of the International Conference on Parallel Computing, 2009, P. 389 - 396.
[3] Yunsong Huang, Gerard T. Schuster. Multisource least-squares migration of marine streamer and land data with frequency-division encoding, Geophysical Prospecting, 2012, Vol. 60, P. 663 - 680.