Section 6 | Fault-Tolerant Systems 441
access type [Cosserat, 1972; Swan, Fuller, and Siewiorek, 1976]; repeating tasks and comparing; comparing results of two different algorithms and encoding for the same task; and reasonability checks on input/output data.
- System. The detection techniques for the task level also apply at the system level. In addition, a sanity or watchdog timer [Downing, Nowak, and Tuomenoksa, 1964] can be used to detect whether a processor is still executing code in a reasonable sequence.
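A watchdog timer of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the cited design: the class name `WatchdogTimer`, its `kick` and `expired` methods, and the simulated clock are all assumptions introduced here. The monitored code is expected to call `kick` at known points in its normal sequence; prolonged silence is treated as evidence that the processor has left that sequence.

```python
class WatchdogTimer:
    """Hypothetical watchdog: flags a processor that stops checking in."""

    def __init__(self, timeout, clock):
        self.timeout = timeout
        self.clock = clock            # callable returning the current time
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored code at known points in its normal sequence."""
        self.last_kick = self.clock()

    def expired(self):
        """True if the monitored code has been silent longer than the timeout."""
        return self.clock() - self.last_kick > self.timeout


# A simulated clock keeps the example deterministic.
now = [0]
watchdog = WatchdogTimer(timeout=5, clock=lambda: now[0])

now[0] = 3
watchdog.kick()                  # the processor checks in at t = 3
now[0] = 7
healthy = watchdog.expired()     # 7 - 3 = 4, still within the timeout
now[0] = 20
hung = watchdog.expired()        # 20 - 3 = 17, the deadline was missed
```

In a real system the timer would be a separate hardware or software unit, so that a hung processor cannot also silence its own monitor.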
Diagnosis
Location of a failure can be achieved by analysis of the state of the system when the error is detected. The activity of the error-associated components should be stopped and their intermediate state frozen. A mechanism should be provided to notify some other components in the system of the stoppage. Some unaffected intelligence can then examine the state, exercise the components, and initiate a recovery. Thus at each conceptual boundary the object should be controllable and observable. If the fault cannot be resolved by the existing state, a diagnostic sequence can be initiated.
- Hardware subsystem. Control, input, and output signals should be made available to at least one other subsystem. Classical combinational and sequential circuit-testing theory can be used for the diagnostics. Note that the diagnostic resolution need only be to the smallest replaceable unit (the chip, the printed circuit board, or even the hardware subsystem itself).
- Task. Each subsystem should be controllable (by halt, start, continue, interrupt, and reinitialize [Wulf and Bell, 1972]) and its internal state (status such as running or disconnected, general-purpose registers, program counter, and error status register) should be observable by at least one external subsystem. Diagnostic programs consist of the functional and implementation-dependent diagnostics typical of any stand-alone computer. Diagnostics for special hardware (e.g., error-detection circuits, controllability and observability logic, and memory protection logic) must also be written. The diagnostics are loaded, initiated, and run by the subsystem that has been notified by an error signal. These autodiagnostics should also be run periodically as a regularly scheduled task or as an idle task.
- System. Same as for task.
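The controllability and observability requirements above can be illustrated with a small sketch. Everything here is hypothetical: the `Subsystem` class, its fields, the `register_test` diagnostic, and `handle_error_signal` are names invented for illustration, and only a subset of the control operations listed in the text is modeled. The point is the order of events: stop the suspect subsystem, freeze and copy its state, and only then exercise it with diagnostics.

```python
from dataclasses import dataclass, field


@dataclass
class Subsystem:
    """Hypothetical subsystem exposing control and state to an external examiner."""
    status: str = "running"
    program_counter: int = 0
    registers: list = field(default_factory=lambda: [0] * 8)
    error_status: int = 0

    # Controllability: two of the operations named in the text.
    def halt(self):
        self.status = "halted"

    def reinitialize(self):
        self.program_counter = 0
        self.registers = [0] * len(self.registers)
        self.error_status = 0
        self.status = "running"

    # Observability: a copy of the internal state for an external subsystem.
    def snapshot(self):
        return {
            "status": self.status,
            "program_counter": self.program_counter,
            "registers": list(self.registers),
            "error_status": self.error_status,
        }


def register_test(subsystem):
    """Hypothetical diagnostic: write a pattern to every register, read it back."""
    pattern = 0o125252
    subsystem.registers = [pattern] * len(subsystem.registers)
    return all(r == pattern for r in subsystem.registers)


def handle_error_signal(faulty, diagnostics):
    """Run by the subsystem that was notified of the error."""
    faulty.halt()                      # stop the error-associated activity
    frozen = faulty.snapshot()         # capture state before diagnostics disturb it
    results = {name: test(faulty) for name, test in diagnostics.items()}
    return frozen, results


faulty = Subsystem(program_counter=0o1000, error_status=1)
frozen, results = handle_error_signal(faulty, {"register_test": register_test})
```

The snapshot is taken before the diagnostics run because exercising the components overwrites the very state that might explain the fault.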
Isolation and Corrective Action
The simplest form of isolation is achieved by disconnection or power switching. In either case careful design should ensure that electrical continuity of shared control signals is maintained. For example, most buses have daisy-chained signals, so that disconnecting a module from the bus breaks the daisy chain and denies bus signals to modules downstream from the disconnection.
It should be noted that certain techniques encompass all three steps (detection, diagnosis, and corrective action) in one activity (i.e., massive redundancy techniques, such as triplication and voting [Von Neumann, 1956]). Typically, corrective action takes one of two forms: retry (which is useful for transient-error correction and permanent-failure detection) and standby sparing/graceful degradation. In the latter case, the computation is moved to another part of the system and restarted; enough information must be retained that the restart can be executed cleanly without interference from the side effects of the partially completed first instantiation.
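Triplication and voting can be sketched in a few lines to show how one activity covers all three steps. The `vote` function below is an illustrative assumption, not a reproduction of the cited scheme: disagreement among the replicated outputs is the detection, the list of dissenting modules is the diagnosis, and the majority value is the corrective action.

```python
from collections import Counter


def vote(outputs):
    """Return (majority value, indices of dissenting modules).

    Raises RuntimeError when no strict majority exists, in which case the
    fault cannot be masked by voting alone.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: fault cannot be masked")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Three replicated modules compute the same result; module 2 is faulty.
result, faulty_modules = vote([42, 42, 17])
```

With three modules a single faulty output is masked; two simultaneous and differing faults leave no majority, and the sketch reports that rather than guessing.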
Generally, retry is cheaper (as one does not have to keep information for restart) and more effective (the information does not have to be regenerated after negation of side effects). Consider the ARPANET [Heart et al., 1970], where geographically distributed minicomputers form the backbone of a computer communication network (see Chap. 24). All information is buffered by a minicomputer until it receives a positive acknowledgment. Thus a minicomputer can have a transient failure, go through a cold restart by throwing away its memory state, and still have enough information buffered in the rest of the network to pick up with its activities. Even in the case of a permanent failure, the network can reroute the buffered messages without having to regenerate the information.
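The buffer-until-acknowledged idea can be shown with a highly simplified model. It is not the ARPANET protocol: the `Node` and `Upstream` classes, the `failed` flag, and the packet names are assumptions made for illustration. A sender keeps each message until the next node acknowledges it, so a node can discard its memory on a cold restart, or be bypassed entirely, without any information being regenerated.

```python
class Node:
    """Hypothetical store-and-forward node."""

    def __init__(self):
        self.memory = []
        self.failed = False

    def receive(self, message):
        """Store the message and return True as a positive acknowledgment."""
        if self.failed:
            return False              # no acknowledgment while the node is down
        self.memory.append(message)
        return True

    def cold_restart(self):
        """Recover from a transient failure by discarding all memory state."""
        self.memory = []
        self.failed = False


class Upstream:
    """Holds every message until the downstream node acknowledges it."""

    def __init__(self):
        self.unacked = []

    def send(self, message):
        self.unacked.append(message)

    def flush(self, node):
        """Try to deliver buffered messages; keep those not acknowledged."""
        self.unacked = [m for m in self.unacked if not node.receive(m)]


upstream = Upstream()
primary, alternate = Node(), Node()

# Transient failure: the packet stays buffered and is retried after restart.
upstream.send("packet-1")
primary.failed = True
upstream.flush(primary)
still_buffered = list(upstream.unacked)
primary.cold_restart()
upstream.flush(primary)

# Permanent failure: the buffered packet is rerouted to another node.
upstream.send("packet-2")
primary.failed = True
upstream.flush(primary)
upstream.flush(alternate)
```

In both cases the recovery uses only what the sender already holds, which is why the text describes retry as cheaper than keeping separate restart information.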
- Hardware subsystem. Corrective action includes switching in of standby spares (which is effective for combinational logic) and transmission retry of buffered information.
- Task. Corrective action includes checkpointing the task and moving to nonfailed hardware subsystems [Avizienis et al., 1971] and instruction retry [IBM, 1972a]. It should be noted that care in design of the processor's instruction set can greatly simplify the retry. For example, PDP-11 instructions can generate up to seven addresses, any of which may cause an error. Certain addressing modes (such as autoincrement) have side effects. Thus it is not enough to know the start of the instruction; it is also necessary to know how far it has progressed so that side effects can be undone prior to instruction retry.
- System. Many timesharing systems have developed techniques that allow retry and graceful degradation. For user programs, the operating system checkpoints the initialization information and buffers the output information until task completion. Thus when an error occurs the task can be restarted. For large tasks the user may decide to issue commands to the operating system that save intermediate states so that computations can restart at the latest intermediate state. Certain user programs that process continuous data (sonar signal processing and speech recognition, for