Performance distribution of a fault-tolerant system in the presence of failure correlation

IIE Transactions, June, 2006 by Gregory Levitin, Min Xie

1. Introduction

The reliability and performance of computing systems can be improved by the incorporation of redundancy or fault tolerance into the system. Two of the best known fault-tolerant software design methods are N-version programming (Chen and Avizienis, 1978) and the recovery block scheme (Randell, 1975). For some recent references on these schemes and related studies, the interested reader is referred to Berman and Kumar (1999), Hsu and Chen (2000), Yeung and Schneider (2003), Chatterjee et al. (2004) and Dai et al. (2004). Both of these methods are based on the redundancy created by functionally equivalent but independently developed software modules (called versions) and the assumption that coincident failures of modules are relatively rare. The fault-tolerant approach usually requires additional resources that result in performance penalties (particularly with regard to computation time) thereby creating a tradeoff between software performance and reliability (Gutjahr and Uchida, 2002; Wattanapongsakorn and Levitan, 2004; Xie et al., 2004).

N-Version Programming (NVP) was first proposed by Chen and Avizienis (1978) and it has subsequently received considerable attention, especially for use in safety critical systems. This approach presumes the execution of N different versions that receive the same input and send their outputs to a voter, which determines the system output. The voter produces an output if at least M-out-of-N outputs agree, otherwise, the system fails. In some applications, the available computational resources do not allow all of the versions to be executed simultaneously. In these cases, the versions are executed according to some predefined sequence and the program execution terminates either when M versions produce the same output (success) or when after the execution of all of the N versions the number of equivalent outputs is less than M (failure). The entire program execution time is a random variable that depends on the parameters of the versions and on the order in which they are executed.

The Recovery Block Scheme (RBS) was proposed by Randell (1975). This approach presumes the consecutive execution of different versions. After execution of each individual version, the output is tested by an acceptance test block. If it accepts a version output then the process is terminated and that version output becomes the output of the entire system. If the acceptance test block does not accept an output then the next version is executed. If all N versions do not produce an accepted output then the system fails. It can be easily seen that in the RBS the entire program execution time is also a random variable that depends on the parameters of the versions and on the order of their execution.

Estimating the effect of fault-tolerant programming on system performance is especially important in safety critical real-time computer applications. This effect has been studied by Tai et al. (1993) and by Goseva-Popstojanova and Grnarov (1995). Whereas Randell (1975) considered a basic NVP realization (N = 3, M = 2) consisting of versions with identical fault probabilities and different execution times, Goseva-Popstojanova and Grnarov (1995) studied NVP with an arbitrary N with both the time to failure and execution time of individual versions being identically distributed independent random variables.

In many cases information about version reliability and execution time is available from separate testing and/or reliability prediction models (Belli and Jedrzejowicz, 1990). This information can be incorporated into a fault-tolerant program model in order to obtain an evaluation of its reliability and performance. A reliability model for NVP with the versions having a different reliability was considered in Ashrafi et al. (1994), however, the system performance evaluation problem was not addressed and no general algorithm to evaluate the NVP reliability for an arbitrary N and M was suggested.

A universal model for both types of fault-tolerant programs and an algorithm to evaluate measures of their reliability and performance have been suggested in Levitin (2004). The model is based on the assumption that the failures of versions for each component are statistically independent. However, both empirical and theoretical studies (for example, these performed by Knight and Leveson, 1986; Nicola and Goyal (1990), Eckhardt et al. (1991) and Feldt (1998), show that this assumption is not justified. Different programmers are prone to make the same or similar mistakes when they independently develop the versions. This causes common faults in different software versions. Even when the different versions are generated automatically (using a genetic programming technique) common faults still exist (Feldt, 1998), which can be explained by the fact that some inputs that a program might receive are "intrinsically harder" than others and the chance of obtaining a correct output varies from input to input, which creates the failure correlation (Eckhardt and Lee, 1985).

 

BNET TalkbackShare your ideas and expertise on this topic

Please add your comment:

  1. You are currently: a Guest |
  2.  

Basic HTML tags that work in comments are: bold (<b></b>), italic (<i></i>), underline (<u></u>), and hyperlink (<a href></a)

advertisement
CXO UnpluggedSmart Business interviews on BNET

See and hear how senior level executives across the Asia Pacific are developing smart business ideas across a variety of sectors. The focus is on the future, and on how businesses need to evolve.

advertisement
  • Click Here
  • Click Here
  • Click Here
advertisement

Content provided in partnership with Thompson Gale