Title An Efficient Stencil Implementation for Modern GPUs
Author(s) M. Krotkiewski, M. Dabrowski
Conference EGU General Assembly 2012
Media type Article
Language English
Digital document PDF
Published In: GRA - Volume 14 (2012)
Record number 250070137
 
Abstract
Efficient solution of Poisson's equation is crucial for many applications in geophysics. We show that modern Graphics Processing Units (GPUs) are very well suited for solving Poisson's equation on structured Cartesian grids using discretizations such as the Finite Element Method (FEM) or the Finite Difference Method (FDM). For the homogeneous Poisson problem, the discretized differential operator can be evaluated at every grid point as a stencil. We present an efficient implementation of 7-point and 27-point stencil computations on high-end Nvidia Tesla GPUs. We also present a new method for loading data from global memory into the shared memory of thread blocks. The method avoids conditional statements and idle threads, and achieves good cache reuse of the halo data required by each thread block. Software prefetching is used to overlap arithmetic and memory instructions. We analyze the performance using a memory footprint model that accounts for the actual halo overhead caused by the memory transaction size on GPUs. We present a detailed performance analysis for single precision, and performance results for both single- and double-precision arithmetic, on Nvidia Tesla cards. On a Tesla C2050, our 7-point stencil implementation achieves an average throughput of 11.8 Gpts/s in single precision and 6.5 Gpts/s in double precision (Gpts/s: billions of grid points per second). The symmetric 27-point stencil implementation sustains 10.5 and 5.8 Gpts/s, respectively, which is equivalent to 456 and 164 GFLOP/s. Our stencil implementation is used as a building block of a geometric multigrid solver for the Poisson problem. In single precision on a 257³ grid, the Tesla C2050 performs more than 50 V-cycles per second. As an example application, we use the multigrid solver to simulate natural convection in a homogeneous porous medium saturated with an incompressible fluid.
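For readers who want a concrete picture of the two operators mentioned in the abstract, the following is a minimal CPU sketch in Python. It is not the authors' GPU implementation and models none of its shared-memory loading, halo handling, or prefetching. It assumes a 3D grid stored as a flat list with the x index varying fastest, and it leaves boundary values of the result at zero. The function names `idx`, `stencil7` and `stencil27_symmetric` are hypothetical. The symmetric 27-point variant assumes one shared weight per neighbour class (6 faces, 12 edges, 8 corners), which is one common way to exploit symmetry so that only four multiplications are needed per point. A reference like this is mainly useful for checking a GPU kernel's output on small grids.

```python
from itertools import product


def idx(i, j, k, nx, ny):
    """Flat index of grid point (i, j, k); i (the x index) varies fastest."""
    return i + nx * (j + ny * k)


def stencil7(u, nx, ny, nz, h):
    """Apply the 7-point discrete Laplacian at interior points.

    Boundary entries of the result stay 0.0 (an illustrative assumption).
    """
    out = [0.0] * (nx * ny * nz)
    sy, sz = nx, nx * ny  # flat-index strides along y and z (x stride is 1)
    inv_h2 = 1.0 / (h * h)
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                c = idx(i, j, k, nx, ny)
                s = (u[c - 1] + u[c + 1] + u[c - sy] + u[c + sy]
                     + u[c - sz] + u[c + sz])
                out[c] = (s - 6.0 * u[c]) * inv_h2
    return out


def stencil27_symmetric(u, nx, ny, nz, w):
    """Apply a symmetric 27-point stencil with w = (center, face, edge, corner).

    The 26 neighbours are grouped by how many of their offsets are non-zero
    (1 -> 6 faces, 2 -> 12 edges, 3 -> 8 corners). Each group shares one
    weight, so only four multiplications are needed per grid point.
    """
    w_center, w_face, w_edge, w_corner = w
    groups = {1: [], 2: [], 3: []}
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        nonzero = (di != 0) + (dj != 0) + (dk != 0)
        if nonzero:
            groups[nonzero].append(di + nx * (dj + ny * dk))
    out = [0.0] * (nx * ny * nz)
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                c = idx(i, j, k, nx, ny)
                face = sum(u[c + o] for o in groups[1])
                edge = sum(u[c + o] for o in groups[2])
                corner = sum(u[c + o] for o in groups[3])
                out[c] = (w_center * u[c] + w_face * face
                          + w_edge * edge + w_corner * corner)
    return out


if __name__ == "__main__":
    # Smoke test: the 7-point stencil is exact for quadratics, so the
    # Laplacian of u = x^2 + y^2 + z^2 should equal 6 at interior points.
    n, h = 8, 0.5
    u = [0.0] * n**3
    for k, j, i in product(range(n), repeat=3):
        u[idx(i, j, k, n, n)] = (i * h) ** 2 + (j * h) ** 2 + (k * h) ** 2
    lap = stencil7(u, n, n, n, h)
    interior = [idx(i, j, k, n, n)
                for k, j, i in product(range(1, n - 1), repeat=3)]
    print("max error:", max(abs(lap[c] - 6.0) for c in interior))
```

Choosing the 27-point weights (-6/h², 1/h², 0, 0) reduces it to the 7-point operator, which gives a simple cross-check between the two functions.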