Runtime interval optimization and dependable performance for application-level checkpointing

Apostolos Kokolis*, Alexandros Mavrogiannis, Dimitrios Rodopoulos, Christos Strydis, Dimitrios Soudris

*Corresponding author for this work

Research output: Chapter/Conference proceeding, Conference proceeding, Academic, peer-review

2 Citations (Scopus)

Abstract

As aggressive integration paves the way for performance enhancement of many-core chips and technology nodes go below deca-nanometer dimensions, system-wide failure rates are becoming noticeable. System designers must therefore account for such failures. Checkpoint/Restart (C/R) can be deployed to prolong dependable operation of such systems. However, C/R introduces overheads that cause performance variability. We present a versatile dependability manager (DepMan) that orchestrates a many-core application-level C/R scheme while adapting to time-varying error rates. DepMan also contains a dedicated module that ensures on-the-fly performance dependability for the executing application. We evaluate our scheme with an error-injection module on both the experimental Intel Single-Chip Cloud Computer (SCC) and a commercial Intel i7 general-purpose computer. Runtime checkpoint interval optimization adapts to a variety of failure rates without extra performance or energy costs. The unavoidable timing overhead of C/R is systematically reclaimed with Dynamic Voltage and Frequency Scaling (DVFS), ensuring dependable application performance.
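The abstract names two runtime mechanisms (checkpoint interval optimization that follows the error rate, and DVFS to reclaim C/R overhead) without giving formulas. The sketch below illustrates both ideas under stated assumptions; it is not DepMan's algorithm. It swaps in Young's classic first-order optimum interval, T = sqrt(2 · C · MTBF), tracks MTBF with an exponentially weighted moving average of observed inter-failure times, and picks the lowest frequency level that still meets a deadline, assuming runtime scales as 1/f and the C/R overhead does not scale with frequency. All names (`young_interval`, `IntervalTuner`, `reclaim_frequency`) and the frequency levels are hypothetical.

```python
import math
from dataclasses import dataclass


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * MTBF).

    Stand-in technique for illustration; the paper's own method may differ.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


@dataclass
class IntervalTuner:
    """Hypothetical runtime tuner that follows a time-varying error rate.

    The MTBF estimate is an EWMA of observed inter-failure times (an assumption).
    """
    checkpoint_cost_s: float
    mtbf_estimate_s: float
    alpha: float = 0.3  # weight given to the newest inter-failure observation

    def observe_failure(self, inter_failure_s: float) -> float:
        # Blend the new observation into the MTBF estimate, then re-derive the interval.
        self.mtbf_estimate_s = (self.alpha * inter_failure_s
                                + (1.0 - self.alpha) * self.mtbf_estimate_s)
        return self.interval()

    def interval(self) -> float:
        return young_interval(self.checkpoint_cost_s, self.mtbf_estimate_s)


def reclaim_frequency(work_s_at_base: float, base_freq: float,
                      overhead_s: float, deadline_s: float,
                      levels: list[float]) -> float:
    """Lowest frequency level that finishes work plus C/R overhead by the deadline.

    Assumes compute time scales as base_freq / f and overhead is frequency-independent.
    Falls back to the highest level when no level suffices (deadline may still be missed).
    """
    slack = deadline_s - overhead_s
    if slack <= 0:
        return max(levels)
    required = base_freq * work_s_at_base / slack
    for f in sorted(levels):
        if f >= required:
            return f
    return max(levels)


if __name__ == "__main__":
    tuner = IntervalTuner(checkpoint_cost_s=2.0, mtbf_estimate_s=3600.0)
    print("initial interval (s):", tuner.interval())
    # A failure arriving sooner than expected shortens the checkpoint interval.
    print("after early failure (s):", tuner.observe_failure(900.0))
    # 100 s of work at 2.0 GHz plus 10 s of C/R overhead, 100 s deadline.
    print("chosen frequency (GHz):",
          reclaim_frequency(100.0, 2.0, 10.0, 100.0, [1.0, 1.5, 2.0, 2.5]))
```

A shorter observed time between failures lowers the MTBF estimate and therefore the checkpoint interval, while the DVFS step raises the clock only as far as needed to absorb the overhead, which is one simple way to trade energy for deadline adherence.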

Original language: English
Title of host publication: Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 594-599
Number of pages: 6
ISBN (Electronic): 9783981537062
Publication status: Published, 25 Apr 2016
Event: 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016, Dresden, Germany
Duration: 14 Mar 2016 to 18 Mar 2016

Publication series

Series: Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition, DATE 2016

Conference

Conference: 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
Country/Territory: Germany
City: Dresden
Period: 14/03/16 to 18/03/16

Bibliographical note

Publisher Copyright: © 2016 EDAA.
