# Stochastic Models for Fault Tolerance: Restart, Rejuvenation and Checkpointing

## Katinka Wolter

Language: English

Pages: 269

ISBN: 3642112560

Format: PDF / Kindle (mobi) / ePub

As modern society relies on the fault-free operation of complex computing systems, system fault-tolerance has become an indispensable requirement. Therefore, we need mechanisms that guarantee correct service in cases where system components fail, be they software or hardware elements. Redundancy patterns are commonly used, for either redundancy in space or redundancy in time.

Wolter’s book details methods of redundancy in time that need to be issued at the right moment. In particular, she addresses the so-called "timeout selection problem", i.e., the question of choosing the right time for different fault-tolerance mechanisms like restart, rejuvenation and checkpointing. Restart indicates the pure system restart, rejuvenation denotes the restart of the operating environment of a task, and checkpointing includes saving the system state periodically and reinitializing the system at the most recent checkpoint upon failure of the system. Her presentation includes a brief introduction to the methods, their detailed stochastic description, and also aspects of their efficient implementation in real-world systems.

The book is targeted at researchers and graduate students in system dependability, stochastic modeling and software reliability. Readers will find here an up-to-date overview of the key theoretical results, making this the only comprehensive text on stochastic models for restart-related problems.

Database Design for Mere Mortals (3rd Edition)

Agile Software Engineering (Undergraduate Topics in Computer Science)

Parsing Techniques: A Practical Guide (Monographs in Computer Science)

- Time
- 5.4 Towards Online Restart for Self-Management of Systems
- 5.4.1 Estimating the Hazard Rate
- 5.4.2 Experiments
- Part III Software Rejuvenation
- Introduction
- 6 Practical Aspects of Preventive Maintenance and Software

means of its own timeouts. To the client’s SMTP server, the 60 s disruption manifested itself as repeated timeouts when connecting to the destination. The SMTP server then used the queueing system we sketched earlier to retransmit the deferred messages. We can, unfortunately, not endeavour to fully explore the intricate interactions between timeouts and queues in the SMTP server and restarts initiated by the WSRM component. We will thus only point out interesting aspects of the ETT curves shown

sample of $n$ individuals, that is, $n$ completions in our study. We sample the completion times and, if we order them, we obtain a data set of $D$ distinct times $t_1 \le t_2 \le \dots \le t_D$, where at time $t_i$ there are $d_i$ events, that is, $d_i$ completions take time $t_i$. The random variable $Y_i$ counts the number of jobs that need at least $t_i$ time units to complete. We can write $Y_i$ as

$$Y_i = n - \sum_{j=1}^{i-1} d_j.$$

All observations that have not completed at the end of the regarded time period, usually time $t_D$, are
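The counting scheme in this excerpt can be illustrated with a minimal Python sketch. It assumes that all $n$ jobs complete within the observation period (censored observations, which the excerpt goes on to discuss, are not handled), and the function name `risk_table` and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def risk_table(completion_times):
    """Return (t_i, d_i, Y_i) for each distinct completion time t_i.

    d_i is the number of jobs completing exactly at t_i, and Y_i is the
    number of jobs that need at least t_i time units, computed as
    n minus the number of completions strictly before t_i.
    Assumption: every job completes (no censored observations).
    """
    n = len(completion_times)
    counts = Counter(completion_times)
    rows = []
    completed_before = 0  # running sum of d_j for j < i
    for t in sorted(counts):
        d = counts[t]
        y = n - completed_before
        rows.append((t, d, y))
        completed_before += d
    return rows


# Hypothetical sample of n = 7 completion times with D = 4 distinct values.
sample = [2.0, 3.0, 3.0, 5.0, 5.0, 5.0, 8.0]
table = risk_table(sample)
for t, d, y in table:
    print(f"t_i={t}  d_i={d}  Y_i={y}")
```

For this sample the first row has $Y_1 = n = 7$, since no job has completed before the smallest observed time.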

method of choice. Even though there are many interesting open issues in software fault-tolerance, the focus of this book lies on the timeout selection problem rather than on the many other aspects of fault-tolerance mechanisms using redundancy in time. Before diving into a treatment of stochastic models addressing the timeout selection problem for the three selected methods, i.e. restart, rejuvenation and checkpointing, this chapter discusses preliminary aspects of the general timeout problem

) again assuming the simplified case where the increments of the degradation index are iid random variables with cdf $F_X(x)$. The long-term cost rate is plotted in [14] for the following parameter values: $F_X(x)$ is the Gamma distribution with three different sets of parameters, chosen to achieve identical expected value but different variances.

• $F_X(x) = \Gamma(0.2, 0.2)$. The distribution of the increment has a decreasing hazard rate (DFR), with expected value equal to 1 and variance 5. The coefficient of variation
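The moments quoted for this first parameter set can be checked with a short Python sketch. It assumes that the two parameters of the Gamma distribution are read as shape and rate, which is the reading consistent with the stated expected value of 1 and variance of 5; the function name `gamma_moments` is hypothetical.

```python
import math


def gamma_moments(shape, rate):
    """Mean, variance and coefficient of variation of a Gamma distribution.

    Assumption: (shape, rate) parameterization, so that
    mean = shape / rate and variance = shape / rate**2.
    """
    mean = shape / rate
    variance = shape / rate**2
    cv = math.sqrt(variance) / mean  # equals 1 / sqrt(shape)
    return mean, variance, cv


mean, variance, cv = gamma_moments(0.2, 0.2)
print(f"mean={mean:.3f}  variance={variance:.3f}  cv={cv:.3f}")

# A Gamma distribution with shape < 1 has a decreasing hazard rate (DFR).
is_dfr = 0.2 < 1.0
```

Under this parameterization the coefficient of variation depends only on the shape parameter, so keeping shape equal to rate holds the mean at 1 while the variance changes.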