Title Toward a more efficient and scalable checkpoint/restart mechanism in the Community Atmosphere Model
Author Valentine Anantharaj
Conference EGU General Assembly 2015
Media type Article
Language English
Digital document PDF
Published In: GRA - Volume 17 (2015)
Record number 250108780
Publication (No.) Full-text document available: EGU/EGU2015-8550.pdf
 
Abstract
The number of cores (both CPU and accelerator) in large-scale systems has been increasing rapidly over the past several years. In 2008, only 5 systems in the Top500 list had over 100,000 total cores (including accelerator cores), whereas the number of systems with such capability had jumped to 31 by November 2014. This growth, however, has also increased the risk of hardware failures, necessitating the implementation of fault-tolerance mechanisms in applications. The checkpoint and restart (C/R) approach is commonly used to save the state of an application and restart it at a later time, either after a failure or to continue the execution of experiments. An efficient C/R mechanism makes it more affordable to output the necessary C/R files more frequently. The availability of larger systems (more nodes, memory and cores) has also facilitated the scaling of applications. Nowadays, it is more common to conduct coupled global climate simulation experiments at 1 deg horizontal resolution (atmosphere), often requiring about 10^3 cores. At the same time, a few climate modeling teams that have access to a dedicated cluster and/or large-scale systems are involved in modeling experiments at 0.25 deg horizontal resolution (atmosphere) and 0.1 deg resolution for the ocean. These ultrascale configurations require on the order of 10^4 to 10^5 cores. It is not only necessary for the numerical algorithms to scale efficiently; the input/output (IO) mechanism must also scale accordingly. An ongoing series of ultrascale climate simulations, using the Titan supercomputer at the Oak Ridge Leadership Computing Facility (ORNL), is based on the spectral element dynamical core of the Community Atmosphere Model (CAM-SE), which is a component of the Community Earth System Model and the DOE Accelerated Climate Model for Energy (ACME). The CAM-SE dynamical core for a 0.25 deg configuration has been shown to scale efficiently across 100,000 CPU cores.
At this scale, there is an increased risk that the simulation could be terminated due to hardware failures, resulting in a loss that could be as high as 10^5 to 10^6 Titan core hours. Increasing the frequency of the output of C/R files could mitigate this loss, but at the cost of additional C/R overhead. We are testing a more efficient C/R mechanism in CAM-SE. Our early implementation has demonstrated a nearly 3x performance improvement for a 1 deg CAM-SE configuration (with CAM5 physics and MOZART chemistry) using nearly 10^3 cores. We are in the process of scaling our implementation to 10^5 cores. This would allow us to run ultrascale simulations with more sophisticated physics and chemistry options while making better use of resources.