Proactive management of software aging

IBM Journal of Research and Development, Mar 2001, by Castelli, V., Harper, R. E., Heidelberger, P., Hunter, S. W., et al.

Software failures are now known to be a dominant source of system outages. Several studies and much anecdotal evidence point to "software aging" as a common phenomenon, in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and numerical error accumulation are the primary symptoms of this degradation, which may eventually lead to performance degradation of the software, crash/hang failure, or other undesirable effects. "Software rejuvenation" is a proactive technique intended to reduce the probability of future unplanned outages due to aging. The basic idea is to pause or halt the running software, refresh its internal state, and resume or restart it. Software rejuvenation can be performed by relying on a variety of indicators of aging, or on the time elapsed since the last rejuvenation. In response to the strong desire of customers to be provided with advance notice of unplanned outages, our group has developed techniques that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system, depending on the pervasiveness of the resource exhaustion and our ability to pinpoint the source. This technology has been incorporated into the IBM Director for xSeries servers. To quantitatively evaluate the impact of different rejuvenation policies on the availability of cluster systems, we have developed analytical models based on stochastic reward nets (SRNs). For time-based rejuvenation policies, we determined the optimal rejuvenation interval based on system availability and cost. We also analyzed a rejuvenation policy based on prediction, and showed that it can further increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore.
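As an illustration of what "estimating the time remaining until exhaustion" can mean in practice, the following minimal sketch fits a straight line to sampled resource usage and extrapolates to a capacity limit. It is not the estimator used in the work described here; the function name `estimate_time_to_exhaustion`, its parameters, and the choice of an ordinary least-squares fit are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def estimate_time_to_exhaustion(
    times: Sequence[float],
    usage: Sequence[float],
    capacity: float,
) -> Optional[float]:
    """Extrapolate a least-squares line through (time, usage) samples.

    Returns the estimated time remaining, measured from the last sample,
    until usage reaches `capacity`. Returns None when the fitted trend is
    flat or decreasing, i.e. no exhaustion is predicted.
    Hypothetical helper: a linear trend is an assumption, not the
    method of the source article.
    """
    n = len(times)
    if n < 2 or n != len(usage):
        raise ValueError("need at least two samples of equal length")

    mean_t = sum(times) / n
    mean_u = sum(usage) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be identical")
    sxy = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, usage))

    slope = sxy / sxx  # resource units consumed per unit time
    if slope <= 0:
        return None  # no upward trend, so no predicted exhaustion

    intercept = mean_u - slope * mean_t
    t_exhaust = (capacity - intercept) / slope
    # Clamp at zero: the fitted line may already be above capacity.
    return max(0.0, t_exhaust - times[-1])
```

A monitoring agent could call such a function periodically and schedule rejuvenation when the returned estimate falls below a chosen warning horizon. Real usage data is noisy and often nonlinear, so a production estimator would need a more robust trend test than this sketch provides.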

1. Introduction

Software aging

Unplanned computer system outages are more likely to be the result of software failures than of hardware failures [1, 2]. Moreover, software often exhibits an increasing failure rate over time, typically because of increasing and unbounded resource consumption, data corruption, and numerical error accumulation. This constitutes a phenomenon called software aging [3], and may be caused by errors in the application, middleware, or operating system. Under aging conditions, the state of the software degrades gradually with time, inevitably resulting in undesirable consequences. Some typical causes of this degradation are memory bloating and leaking, unterminated threads, unreleased file-locks, data corruption, storage-space fragmentation, and accumulation of round-off errors. This phenomenon has been reported by Huang et al. [3] in telecommunications billing applications, where over time the application experiences a crash or a hang failure. Avritzer and Weyuker discuss aging in telecommunication switching software, in which the effect manifests itself as gradual performance degradation [4]. Software aging has been observed not only in specialized software, but also in widely used software, where rebooting to clear a problem is a common practice.

Aging occurs because software is extremely complex and never wholly free of errors. It is almost impossible to fully test and verify that a piece of software is bug-free. This situation is further exacerbated by the fact that software development tends to be extremely time-to-market-driven, which results in applications that meet short-term market needs yet do not account very well for long-term concerns such as reliability. Hence, residual faults have to be tolerated in the operational phase. These residual faults can take various forms, but the ones that we are concerned with cause long-term depletion of system resources such as memory, threads, and kernel tables. The essentially economic problem of developing and producing bug-free code is not the problem at hand. Instead, we address one of the problems that arise from the prevailing approach to developing software, namely aging, and software rejuvenation is one way of attacking it.

Software rejuvenation

To counteract software aging, a proactive technique called software rejuvenation has been devised [3]. It involves stopping the running software occasionally, "cleaning" its internal state (e.g., garbage collection, flushing operating system kernel tables, and reinitializing internal data structures), and restarting it. An extreme but well-known example of rejuvenation is a system reboot. There are numerous examples in real-life systems where software rejuvenation is being used. For example, it has been implemented in the real-time system collecting billing data for most telephone exchanges in the United States [5]. Software capacity restoration, a technique similar to rejuvenation, has been used by Avritzer and Weyuker in a large telecommunications-switching software application [4]. In this case, the switching computer is rebooted occasionally, which restores its service rate to the peak value. Grey [6] proposed that Strategic Defense Initiative (SDI) software perform operations solely for fault management, invoked whether or not a fault exists, and called this operational redundancy. Tai et al. [7] have proposed and analyzed the use of onboard preventive maintenance for maximizing the probability of successful mission completion for spacecraft with very long mission times. The necessity of performing preventive maintenance in a safety-critical environment is evident from the example of aging in Patriot missile software [8]. The failure, which resulted in loss of human lives, might have been prevented had the operators heeded the advice that the system had to be restarted after every eight hours of running time. The Apache Web Server (from The Apache Software Foundation) provides a means to prevent itself from becoming too much of a resource burden on a system. Apache has a controlling process and a handler process. The controlling process watches the handler process to ensure that it is running up to standard. The handler process, on the other hand, handles requests from the clients. When the handler process is deemed to be in a bad state, the controlling process stops it and starts another process.
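The controller/handler arrangement described above can be sketched in a few lines. The example below is an in-process simulation only, not Apache's implementation: the class names `Handler` and `Controller`, the `max_leak` threshold, and the use of a simple counter to stand in for leaked resources are all hypothetical choices made to keep the sketch self-contained.

```python
class Handler:
    """Hypothetical request handler whose state ages with every request."""

    def __init__(self) -> None:
        self.leaked = 0   # stands in for leaked memory, handles, etc.
        self.served = 0

    def handle(self, request: str) -> str:
        self.leaked += 1  # simulated aging: one leaked unit per request
        self.served += 1
        return f"ok:{request}"


class Controller:
    """Hypothetical supervisor that replaces an aged handler.

    The handler is considered to be in a bad state once its leaked
    resource count reaches `max_leak` (an assumed health criterion).
    """

    def __init__(self, max_leak: int) -> None:
        self.max_leak = max_leak
        self.handler = Handler()
        self.restarts = 0

    def dispatch(self, request: str) -> str:
        # Rejuvenate before serving: discard the aged handler and start
        # a fresh one with clean internal state.
        if self.handler.leaked >= self.max_leak:
            self.handler = Handler()
            self.restarts += 1
        return self.handler.handle(request)


if __name__ == "__main__":
    controller = Controller(max_leak=3)
    for i in range(7):
        controller.dispatch(f"req{i}")
    print("restarts:", controller.restarts)
```

In a real server the handler would be a separate operating system process, and stopping it would let the operating system reclaim everything the process had accumulated. The sketch keeps only the control logic: observe a health indicator, then replace the worker when the indicator crosses a threshold.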


 

