
Section 6

Fault-Tolerant Systems

Historically, fault-tolerant computers were limited to military, aerospace, and telephone switching applications, where a computer failure could cause significant economic loss or loss of life. Because of several recent trends, fault-tolerant techniques have become increasingly important to computers in general. A few of these trends are as follows:

The increased interest in fault tolerance has already had an impact on the industrial world. Large mainframe manufacturers like IBM, UNIVAC, and Amdahl use redundancy both for improving user reliability and for assisting field service personnel in fault isolation. Minicomputer manufacturers have also been incorporating fault-tolerant features (e.g., Hamming error-correcting code on memory), and special LSI chips have been introduced (e.g., cyclic redundancy code encoder/decoders). With low-cost microprocessors, one is tempted to replicate them and "vote" on their outputs; such a system could be built for less than $2,000. The trend has gone so far that companies are being formed to build fault-tolerant computers.

Fault-tolerant computing can be loosely defined as the correct execution of a specified algorithm in the presence of defects. The effect of defects can be overcome through the use of temporal redundancy (repeated calculations) or spatial redundancy (extra hardware or software).
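The two forms of redundancy can be illustrated with a minimal sketch. The function names below are hypothetical and not drawn from any particular system, and results are assumed to be hashable values that can be compared for equality. Temporal redundancy repeats one calculation and compares the runs; spatial redundancy runs several replicas and takes a majority vote on their outputs.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_temporal_redundancy(compute: Callable[[], T], repeats: int = 2) -> T:
    """Repeat the same calculation and accept it only if every run agrees."""
    results = [compute() for _ in range(repeats)]
    if any(result != results[0] for result in results[1:]):
        # Disagreement between runs suggests a transient fault.
        raise RuntimeError("repeated runs disagree")
    return results[0]


def run_with_spatial_redundancy(replicas: Sequence[Callable[[], T]]) -> T:
    """Run each replica once and return the strict-majority result."""
    results = [replica() for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        # Without a strict majority there is no trustworthy answer.
        raise RuntimeError("no majority among replicas")
    return value
```

With three replicas, the voter masks one disagreeing replica. Simple repetition on the same hardware detects only faults that do not recur identically on each run, which is one reason the two forms are often combined.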

As in all system design, the system goals and specifications constrain the design space and consequently the design techniques that may be used. At the highest level of specification, fault-tolerant systems are categorized as either highly available or highly reliable.

An even more stringent goal than the reliability function R(t) is sometimes used in aerospace applications: the minimum number of failures anywhere in the system that the system can tolerate while still functioning correctly.

There are three distinct functions a fault-tolerant system can perform: detection, diagnosis, and correction. A highly available system need only worry about fault detection. Diagnosis (fault location) and correction (fault repair) can be done manually. For an ultrareliable system, diagnosis and correction must also be done automatically. Incorporating such features can lead to a significant increase in system cost. Current architectural trends in highly reliable systems are focusing on complete and early detection supported by software and/or firmware (microcode) diagnosis. Repair may be through reconfiguration or spare switching.
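The three functions can be sketched for the case in which all of them are automated, as an ultrareliable system would require. This is an illustrative sketch only: the module model (a function from an integer to an integer), the function names, and the pool of spares are assumptions made for the example. Detection compares replica outputs, diagnosis locates a module that disagrees with the majority, and correction reconfigures by switching in a spare.

```python
from collections import Counter
from typing import Callable, List, Optional

Module = Callable[[int], int]  # hypothetical model of a replicated unit


def detect(outputs: List[int]) -> bool:
    """Detection: any disagreement among replica outputs signals a fault."""
    return len(set(outputs)) > 1


def diagnose(outputs: List[int]) -> Optional[int]:
    """Diagnosis: index of the first module that disagrees with the majority."""
    majority, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        return None  # no strict majority, so the fault cannot be located
    for index, value in enumerate(outputs):
        if value != majority:
            return index
    return None


def correct(active: List[Module], spares: List[Module], faulty: int) -> None:
    """Correction: replace the faulty module with a spare (spare switching)."""
    if not spares:
        raise RuntimeError("no spare available")
    active[faulty] = spares.pop()


def step(active: List[Module], spares: List[Module], x: int) -> int:
    """Run all modules on x, repairing one faulty module if one is found."""
    outputs = [module(x) for module in active]
    if detect(outputs):
        faulty = diagnose(outputs)
        if faulty is None:
            raise RuntimeError("fault detected but not locatable")
        correct(active, spares, faulty)
        outputs = [module(x) for module in active]
    # Return the most common output after any repair.
    return Counter(outputs).most_common(1)[0][0]
```

A highly available design in the sense above could keep only `detect` and leave the other two steps to an operator. The sketch repairs at most one module per call.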

Several definitions have become standard in the fault-tolerant literature:

Failure: Physical damage.

Fault: An event in which a logical value differs from the
