Performance distribution of a fault-tolerant system in the presence of failure correlation

IIE Transactions, June, 2006 by Gregory Levitin, Min Xie

In applications where the execution time of each task is of critical importance, the system reliability R(T*) is defined (using a performability concept in Tai et al. (1993) and Meyer (1980)) as the probability that the correct output is produced in a time that is less than T*. In order to obtain the function R(T*) = Pr(T < T*) one has to know the probability mass function of the discrete random variable T (the system performance distribution).
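As a minimal sketch of this definition, the following Python fragment evaluates R(T*) = Pr(T < T*) from a probability mass function of T. The dictionary representation, the function name `reliability`, the use of an infinite time for "no correct output produced", and all numeric values are illustrative assumptions, not taken from the paper.

```python
import math


def reliability(pmf, t_star):
    """Return Pr(T < t_star) for a discrete distribution.

    pmf maps each possible value of T to its probability.
    """
    return sum(p for t, p in pmf.items() if t < t_star)


# Hypothetical performance distribution (values invented for illustration).
# math.inf stands for the case where no correct output is ever produced.
pmf = {10.0: 0.5, 15.0: 0.3, 20.0: 0.1, math.inf: 0.1}

print(reliability(pmf, 16.0))  # mass of the outcomes with T < 16
```

Note that the inequality is strict, so an outcome with T exactly equal to T* does not count toward R(T*).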

2.2. The failure correlation model

Failures are usually not independent, and the question of how to model and estimate their dependence has attracted considerable attention recently (Gutjahr and Uchida, 2002; Kotz et al., 2003; Dai et al., 2004; Yadavalli et al., 2005; Xie et al., 2005). The usual approaches are based on common-cause failures (CCFs): components are assumed to be affected by some kind of common cause that leads to the failure of more than one component. In order to model the failure correlation we also adopt the CCF approach and consider different sets of versions (Common-Cause Groups, CCGs) affected by different common causes. It is assumed that if CCF j occurs, all of the versions belonging to CCG Ω_j fail. All versions belong to at least one CCG. The independent failure of a single version i can also be considered to be a CCF i with Ω_i = {i}.
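The group structure described above can be sketched in Python as follows. This is only an illustration: the group contents, the dictionary layout, and the helper names `failed_versions` and `uncovered_versions` are hypothetical. The sketch encodes just the two stated rules: every version in a group fails when that group's CCF occurs, and every version belongs to at least one group.

```python
def failed_versions(ccgs, occurred):
    """Return the set of versions that fail when the CCFs in `occurred` happen.

    ccgs maps a CCF index j to its common-cause group (a set of versions).
    """
    failed = set()
    for j in occurred:
        failed |= ccgs[j]  # every version in group j fails
    return failed


def uncovered_versions(ccgs, versions):
    """Return the versions that belong to no CCG (should be empty)."""
    covered = set().union(*ccgs.values())
    return set(versions) - covered


# Hypothetical example with three versions.
# Groups 1-3 are singletons modelling independent single-version failures;
# groups 4 and 5 model causes shared by two versions.
ccgs = {1: {1}, 2: {2}, 3: {3}, 4: {1, 2}, 5: {2, 3}}

print(failed_versions(ccgs, {4}))        # versions hit by CCF 4 alone
print(failed_versions(ccgs, {1, 5}))     # versions hit by CCFs 1 and 5 together
print(uncovered_versions(ccgs, [1, 2, 3]))
```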

The CCF probabilities can be elicited from testing, evaluated from theoretical models or estimated with prior information or information from other similar systems. A probabilistic framework for empirically estimating CCF probabilities was developed by Eckhardt and Lee (1985). Littlewood and Miller (1989) have suggested a conceptual model able to simulate CCFs. Nicola and Goyal (1990) have proposed a beta-binomial distribution model for CCF probability evaluation assuming that the single-failure probabilities of versions are equal. This model was extended to the case with different single-failure probabilities by Gutjahr (2001). A hyperparameter CCF model with a beta prior was recently suggested by Czarnowski et al. (2003).

3. Version termination times

The version termination time is important in scheduling and planning since it significantly affects the performance of computing systems. In each component the sequence in which the versions begin their execution is defined by the ordered list v = (v_1, ..., v_N). This means that each version v_i does not begin execution earlier than versions v_1, ..., v_{i-1} and not later than versions v_{i+1}, ..., v_N. Here and in all subsequent discussions we will omit the index c when a given system component is considered.

Given the execution time τ(v_i) of each version (1 ≤ i ≤ N) and the number L of versions that can run simultaneously, we can obtain the termination time t(i) for each version using the following simple algorithm.

Step 1. Assign δ_1 = ... = δ_L = 0.

Step 2. For i = 1,..., N repeat:
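The body of Step 2 is not reproduced in this text, so the following Python sketch should be read as one plausible way to compute termination times, not as the authors' exact procedure. It assumes that each version, taken in list order, starts on whichever of the L execution slots becomes free first, which respects the ordering constraint on start times. The function name `termination_times` and the example durations are hypothetical.

```python
import heapq


def termination_times(tau, L):
    """Return the termination time of each version, in start order.

    tau: execution times of versions v_1..v_N in their start order.
    L:   number of versions that can run simultaneously.
    Assumption: the next version always takes the earliest-free slot.
    """
    free_at = [0] * L          # Step 1: all slots are free at time 0
    heapq.heapify(free_at)
    finish = []
    for duration in tau:       # Step 2: process versions in list order
        start = heapq.heappop(free_at)   # earliest time a slot is free
        end = start + duration
        finish.append(end)
        heapq.heappush(free_at, end)     # slot is busy until `end`
    return finish


# Hypothetical example: four versions, two simultaneous slots.
print(termination_times([4, 2, 3, 1], 2))
```

Because the slot release times only grow, the start times produced this way are non-decreasing, so no version starts before one that precedes it in the list.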


 
