TY - GEN
T1 - Runtime interval optimization and dependable performance for application-level checkpointing
AU - Kokolis, Apostolos
AU - Mavrogiannis, Alexandros
AU - Rodopoulos, Dimitrios
AU - Strydis, Christos
AU - Soudris, Dimitrios
N1 - Publisher Copyright: © 2016 EDAA.
PY - 2016/4/25
Y1 - 2016/4/25
N2 - As aggressive integration paves the way for performance enhancement of many-core chips and technology nodes go below deca-nanometer dimensions, system-wide failure rates are becoming noticeable. Inevitably, system designers need to properly account for such failures. Checkpoint/Restart (C/R) can be deployed to prolong dependable operation of such systems. However, it introduces additional overheads that lead to performance variability. We present a versatile dependability manager (DepMan) that orchestrates a many-core application-level C/R scheme, while being able to follow time-varying error rates. DepMan also contains a dedicated module that ensures on-the-fly performance dependability for the executing application. We evaluate the performance of our scheme using an error injection module both on the experimental Intel Single-Chip Cloud Computer (SCC) and on a commercial Intel i7 general purpose computer. Runtime checkpoint interval optimization adapts to a variety of failure rates without extra performance or energy costs. The inevitable timing overhead of C/R is reclaimed systematically with Dynamic Voltage and Frequency Scaling (DVFS), so that dependable application performance is ensured.
AB - As aggressive integration paves the way for performance enhancement of many-core chips and technology nodes go below deca-nanometer dimensions, system-wide failure rates are becoming noticeable. Inevitably, system designers need to properly account for such failures. Checkpoint/Restart (C/R) can be deployed to prolong dependable operation of such systems. However, it introduces additional overheads that lead to performance variability. We present a versatile dependability manager (DepMan) that orchestrates a many-core application-level C/R scheme, while being able to follow time-varying error rates. DepMan also contains a dedicated module that ensures on-the-fly performance dependability for the executing application. We evaluate the performance of our scheme using an error injection module both on the experimental Intel Single-Chip Cloud Computer (SCC) and on a commercial Intel i7 general purpose computer. Runtime checkpoint interval optimization adapts to a variety of failure rates without extra performance or energy costs. The inevitable timing overhead of C/R is reclaimed systematically with Dynamic Voltage and Frequency Scaling (DVFS), so that dependable application performance is ensured.
UR - http://www.scopus.com/inward/record.url?scp=84973631255&partnerID=8YFLogxK
U2 - 10.3850/9783981537079_0294
DO - 10.3850/9783981537079_0294
M3 - Conference proceeding
AN - SCOPUS:84973631255
T3 - Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
SP - 594
EP - 599
BT - Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
Y2 - 14 March 2016 through 18 March 2016
ER -