previous | contents | next

Chapter 27

The STAR (Self-Testing And Repairing) Computer:

An Investigation of the Theory and Practice of Fault-Tolerant Computer Design1

Algirdas Avizienis / George C. Gilley / Francis P. Mathur / David A. Rennels / John A. Rohr / David K. Rubin

Summary This paper presents the results obtained in a continuing investigation of fault-tolerant computing which is being conducted at the Jet Propulsion Laboratory. Initial studies led to the decision to design and construct an experimental computer with dynamic (standby) redundancy, including replaceable subsystems and a program rollback provision to eliminate transient errors. This system, called the STAR computer, began operation in 1969. The following aspects of the STAR system are described: architecture, reliability analysis, software, automatic maintenance of peripheral systems, and adaptation to serve as the central computer of an outerplanet exploration spacecraft.


Introduction: Chronology and Rationale

This paper presents a summary of the theoretical results and design experience obtained in an investigation of fault-tolerant computing which is being conducted at the Jet Propulsion Laboratory (JPL). Initial studies (1961-1965) led to the conclusion that dynamic (also called standby) redundancy offered the greatest promise in the design of fault-tolerant digital computer systems [Avizienis, 1967a]. The dynamic redundancy [Short, 1968] approach requires a two-step procedure for the elimination of a fault: first, the presence of a fault is determined; second, a corrective action is taken (e.g., replacement of failed unit, repetition of program, reconfiguration of systems, etc.). The alternative to the dynamic approach is static (masking) redundancy [Short, 1968], which was already being utilized in existing component- redundant [Lewis, 1963; Kuehn, 1969] and triple-modular-redundant (TMR) [Kuehn, 1969; Anderson and Macri, 1967; Lyons and Vanderkulk, 1962] computers. Early analytic studies of dynamic redundancy with idealized series-parallel system models indicated that mean life gains of an order of magnitude and more over a nonredundant system could be expected from dynamically redundant systems with standby spares replacing failed units [Reed and Brimley, 1962; Kruus, 1963; Flehinger, 1958; Griesmer, Miller, and Roth, 1962]. This gain compared favorably with the mean life gain of less than two in the typical TMR systems. Other qualitative advantages of the dynamic over the static redundancy were: (1) greater isolation of catastrophic (non-independent) faults which is especially important for densely packed microelectronic circuitry; (2) survival of system until all spares of one type are exhausted; (3) ability to eliminate errors which are caused by transient faults by the use of program rollback; (4) ready adjustability of the number and type of spare units; (5) utilization of the potentially lower failure rate of unpowered components in spare units; (6) avoidance of the circuit-related problems of static redundancy: increases in fan-out, fan-in, power requirements, and the need for isolation and synchronization of separate channels; and (7) facilitation of the check out of spare units by means of standard diagnostic programs.

The attainment of the apparent advantages of a dynamically redundant system had been shown to depend very strongly on the successful execution of the detection and replacement operations [Flehinger, 1958; Griesmer, Miller, and Roth, 1962]; these observations have since been formalized as the concept of "coverage" [Bouricius, Carter, and Schneider, 1969].

The second phase of the investigation (1965-1970) was focused on the identification and solution of the problems involved in the design of a general-purpose digital computer possessing the properties attributed to the abstract model of a dynamically redundant computing system. Three major areas of investigation were: (1) an investigation of fault-detection methods; (2) a study of computer architecture with emphasis on partitioning into subsystems with minimal interconnection requirements; and (3) a study of the "hard-core" problem, i.e., the alternate technologies and logic organizations for implementing the detection and switching functions. The choices among feasible alternatives in all three areas are strongly affected by assumptions on the available component technology and on the computing tasks to be required of the computer. In order to retain contact with the practice of computer design, it was decided to design and construct an experimental general-purpose digital computer which would incorporate dynamic redundancy (i.e., fault detection and re placement of faded subsystems) as integral parts of its structure. The design objectives have been carried out and the system, called the STAR (self-testing and repairing) computer, began operation in 1969. The modular nature of the STAR computer has allowed systematic expansion and modifications that are still being continued.

The first objective of the design is to study the class of problems which are encountered in transforming the theoretical model of a self-repairing system into a working computer. State-of-the-art

 

1IEEE Trans. on Computers, vol. C-20, no. 11, November 1971, pp. 1,312-1,321

448

previous | contents | next