
Section 6 | Fault-Tolerant Systems 441



access type [Cosserat, 1972; Swan, Fuller, and Siewiorek, 1976]; repeating tasks and comparing; comparing results of two different algorithms and encoding for the same task; and reasonability checks on input/output data.
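Two of the detection techniques listed above, reasonability checks on input data and comparing the results of two different algorithms for the same task, can be sketched as follows. This is only an illustration: the function name `checked_sum`, the range limits, and the comparison tolerance are assumptions made for the example and do not come from the text.

```python
def checked_sum(values, low=0.0, high=1000.0):
    """Sum readings while applying two of the error-detection checks named above."""
    # Reasonability check on input data: reject readings outside the assumed range.
    for v in values:
        if not (low <= v <= high):
            raise ValueError(f"unreasonable input: {v}")

    # Algorithm 1: the built-in summation.
    first = sum(values)

    # Algorithm 2: an independently written loop over the reversed sequence.
    second = 0.0
    for v in reversed(values):
        second += v

    # Compare the two results; a mismatch signals an error in one computation.
    if abs(first - second) > 1e-9 * max(1.0, abs(first)):
        raise RuntimeError("results of the two algorithms disagree")
    return first
```

A caller would treat either exception as a detected error and hand control to the diagnosis step.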


Diagnosis

Location of a failure can be achieved by analysis of the state of the system when the error is detected. The activity of the error-associated components should be stopped and their intermediate state frozen. A mechanism should be provided to notify some other components in the system of the stoppage. Some unaffected intelligence can then examine the state, exercise the components, and initiate a recovery. Thus at each conceptual boundary the object should be controllable and observable. If the fault cannot be resolved from the existing state, a diagnostic sequence can be initiated.
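The diagnosis steps above (freeze, notify, examine state, then fall back to a diagnostic sequence) might be sketched as below. The `Component` class, its `error_flag` field, and the test patterns are hypothetical names chosen for the illustration; a real system would expose controllability and observability through hardware-specific means.

```python
class Component:
    """Hypothetical module that is controllable (freeze, exercise) and observable (snapshot)."""

    def __init__(self, name, faulty=False, error_flag=False):
        self.name = name
        self.faulty = faulty
        self.running = True
        self.state = {"error_flag": error_flag}

    def freeze(self):
        # Stop activity so the intermediate state is preserved for analysis.
        self.running = False

    def snapshot(self):
        # Observability: expose a copy of the frozen state.
        return dict(self.state)

    def exercise(self, pattern):
        # Controllability: echo a test pattern; a faulty unit answers wrongly.
        return pattern + 1 if self.faulty else pattern


def diagnose(suspects, notify):
    """Freeze suspects, notify another component, then locate the failure."""
    for c in suspects:
        c.freeze()
        notify(c.name)
    failed = []
    for c in suspects:
        if c.snapshot()["error_flag"]:
            # The frozen state alone resolves the fault.
            failed.append(c.name)
        elif any(c.exercise(p) != p for p in (0, 1, 0xFF)):
            # Otherwise run a diagnostic sequence of known patterns.
            failed.append(c.name)
    return failed
```

Here `notify` stands in for whatever mechanism informs the unaffected part of the system; in the sketch it is any callable that accepts a component name.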


Isolation and Corrective Action

The simplest form of isolation is achieved by disconnection or power switching. In either case careful design should ensure that electrical continuity of shared control signals is maintained. For example, most buses have daisy-chained signals, so that disconnecting a module from the bus breaks the daisy chain and denies bus signals to modules downstream from the disconnection.
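The daisy-chain hazard can be illustrated with a small model. The sketch assumes a hypothetical `bypass` option (for example, a jumper or relay that keeps the chain continuous when a module is removed); the dictionary keys and the function name are inventions for this example.

```python
def grant_reaches(modules, target):
    """Return True if a daisy-chained signal propagates as far as `target`.

    Each module is a dict with hypothetical keys:
      'name'      - module identifier
      'connected' - False once the module has been disconnected
      'bypass'    - True if the chain stays continuous when the module is disconnected
    """
    for m in modules:
        if m["name"] == target:
            # The signal arrived; it is useful only if the target is still attached.
            return m["connected"]
        if not m["connected"] and not m["bypass"]:
            # The chain is broken here; downstream modules never see the signal.
            return False
    return False  # target is not on this bus
```

With modules A, B, and C on the chain, disconnecting B without a bypass denies the signal to C, while disconnecting B with the bypass in place leaves C reachable.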

Certain techniques encompass all three steps (detection, diagnosis, and corrective action) in one activity, e.g., massive redundancy techniques such as triplication and voting [Von Neumann, 1956]. Typically, corrective action takes one of two forms: retry (which is useful for transient-error correction and permanent-failure detection) and standby sparing/graceful degradation. In the latter case, the computation is moved to another part of the system and restarted; enough information must be retained that the restart can be executed cleanly without interference from the side effects of the partially completed first instantiation.
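As a rough illustration of the two ideas in this paragraph, the sketch below shows a majority voter over replicated results and a recovery routine that retries on the primary unit before restarting on a spare. The function names, the retry count, and the use of `RuntimeError` to model a detected error are assumptions for the example only.

```python
from collections import Counter


def vote(results):
    """Majority vote over replicated results; raise if no strict majority exists."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def run_with_recovery(task, primary, spare, retries=2):
    """Retry on the primary unit; after repeated failures, restart on the spare."""
    for _ in range(1 + retries):
        try:
            return primary(task)
        except RuntimeError:
            continue  # possibly a transient error, so try again
    # Retries exhausted: treat the failure as permanent and restart the
    # computation on the spare from the retained input `task`.
    return spare(task)
```

With three replicas, `vote` masks a single wrong result without a separate diagnosis step. In `run_with_recovery`, the retained `task` argument plays the role of the information kept for a clean restart.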

Generally, retry is cheaper (as one does not have to keep information for restart) and more effective (the information does not have to be regenerated after negation of side effects). Consider the ARPANET [Heart et al., 1970], where geographically distributed minicomputers form the backbone of a computer communication network (see Chap. 24). All information is buffered by a minicomputer until it receives a positive acknowledgment. Thus a minicomputer can have a transient failure, go through a cold restart by throwing away its memory state, and still have enough information buffered in the rest of the network to pick up with its activities. Even in the case of a permanent failure, the network can reroute the buffered messages without having to regenerate the information.
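The buffer-until-acknowledged idea can be sketched in a few lines. This is not a model of the actual ARPANET protocol; the `Node` class, the message identifiers, and the convention that a link is a callable returning True on positive acknowledgment are all assumptions made for the illustration.

```python
class Node:
    """Hypothetical store-and-forward node that keeps each message until acknowledged."""

    def __init__(self):
        # message id -> payload still awaiting a positive acknowledgment
        self.pending = {}

    def send(self, msg_id, payload, link):
        # Buffer first, then transmit; discard the copy only on acknowledgment.
        self.pending[msg_id] = payload
        if link(msg_id, payload):
            del self.pending[msg_id]

    def resend_pending(self, link):
        # Retransmit everything still buffered, possibly over a different route.
        for msg_id, payload in list(self.pending.items()):
            if link(msg_id, payload):
                del self.pending[msg_id]
```

If the neighbor fails before acknowledging, the message stays in `pending`, so calling `resend_pending` with the restarted neighbor or with an alternate route delivers it without the source having to regenerate it.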
