Master 2012-2013
Internships of the SAR specialization
Using replication for fault tolerance in High Performance Computing Systems


Site: Laboratoire de Systèmes Répartis
Location: Ecole Polytechnique Fédérale de Lausanne (Switzerland)
Supervisors: Thomas Ropars (thomas.ropars@epfl.ch), André Schiper (Andre.Schiper@epfl.ch)
Dates: To be discussed
Compensation: To be discussed
Keywords: SAR track other than ATIAM, research

Description

Exascale supercomputers (100 times more powerful than today's most powerful supercomputer) are expected by the end of the decade. At such a scale, the failure rate is expected to be very high. As a consequence, much effort is being invested in finding efficient fault-tolerance solutions that would allow applications to terminate correctly.

Replication has long been considered too expensive in the context of high performance computing (HPC). However, some studies show that, at very high failure rates, replication could become more efficient than the traditional checkpointing techniques used in existing HPC systems [1]. Moreover, replication makes it possible to detect and even correct silent data corruption caused by bit flips that go undetected by the hardware [2]. Consequently, replication is gaining more and more attention.
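To give an intuition for why checkpointing degrades at scale, here is a rough back-of-the-envelope sketch. It is not the model used in [1]. It assumes independent node failures, Young's first-order approximation of the optimal checkpoint interval, no restart cost, and hypothetical values for the node MTBF and the checkpoint duration. The function name and the 50% ceiling for dual replication (which ignores the residual checkpointing a replicated system still needs) are illustrative assumptions as well.

```python
import math


def checkpoint_efficiency(node_mtbf_hours: float, nodes: int, checkpoint_hours: float) -> float:
    """First-order fraction of time spent on useful work with periodic checkpointing."""
    # Assumption: node failures are independent, so the system MTBF
    # shrinks linearly with the number of nodes.
    system_mtbf = node_mtbf_hours / nodes
    # Young's first-order optimal interval is tau = sqrt(2 * C * M).
    # At that interval the wasted fraction C/tau + tau/(2M) equals sqrt(2C/M).
    waste = math.sqrt(2.0 * checkpoint_hours / system_mtbf)
    # The approximation is only meaningful when C << M; clamp at 0 beyond that.
    return max(0.0, 1.0 - waste)


# Assumption: dual replication uses half of the nodes for redundant work,
# so its efficiency cannot exceed 50% in this simplified view.
DUAL_REPLICATION_CEILING = 0.5

if __name__ == "__main__":
    # Hypothetical parameters: 5-year node MTBF, 15-minute checkpoints.
    for nodes in (1_000, 10_000, 100_000):
        eff = checkpoint_efficiency(node_mtbf_hours=43_800.0, nodes=nodes, checkpoint_hours=0.25)
        verdict = "above" if eff > DUAL_REPLICATION_CEILING else "below"
        print(f"{nodes:>7} nodes: checkpointing efficiency {eff:.2f} ({verdict} the replication ceiling)")
```

Under these assumptions, the checkpointing efficiency falls as the node count grows and eventually drops below the replication ceiling, which is the qualitative effect the studies cited above describe.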

Although replication has been extensively studied in the general context of distributed systems, this is not yet the case in the context of HPC. In this project, we propose to study replication for HPC applications, to understand how their specific characteristics can be used to improve replication performance. More precisely, the goal of the project is to determine whether all parts of an HPC application have to be replicated to provide fault tolerance, or whether some parts of the computation can be executed by a single replica to improve performance. The project would include the implementation of a prototype to test and validate the proposed ideas. Evaluations with real HPC applications would be run on the Grid'5000 testbed (www.grid5000.fr).

Bibliography

[1] "Evaluating the viability of process replication reliability for exascale systems". Ferreira et al. Supercomputing 2011.

[2] "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing". Fiala et al. Supercomputing 2012.