460 Part 2 | Regions of Computer Space Section 6 | Fault-Tolerant Systems
Software deficiencies include all software errors that cause memory mutilation and program loops that can be cleared only by major reinitialization. Software faults are the result of improper translation or implementation of the original algorithm. In some cases, the original algorithm may have been incorrectly specified. Program changes and feature additions are continuously incorporated into working offices. Software accounts for 15 percent of the downtime.
Recovery is the system's most complex and difficult function. Deficiencies may include the failure of either the hardware or the software design to detect faults when they occur. When faults go undetected, the system remains extensively impaired until the trouble is recognized. Another kind of recovery problem can occur if the system is unable to properly isolate a faulty subsystem and configure a working system around it.
The many possible system states that may arise under trouble conditions make recovery a complicated process. Besides those already mentioned, unforeseen difficulties may be encountered in the field and lead to inadequate recovery. Because of the large number of variables involved and because the recovery function is so strongly related to all other components of maintenance, recovery deficiencies account for 35 percent of the downtime.
Human error on the part of maintenance personnel or office administrators can also cause the system to go down. For example, someone in maintenance may mistakenly pull a circuit pack from the on-line processor while repairing a defective standby processor. Inadequate and incorrect documentation (e.g., users' manuals) may also be classified as human error. Obviously, the number of manual operations must be reduced if procedural errors are to be minimized. Procedural errors account for about 30 percent of the downtime.
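The downtime shares quoted above can be tallied with a few lines of arithmetic. The sketch below uses only the three percentages stated in the text; the dictionary and function names are illustrative assumptions, and the remainder is simply whatever these three categories leave unaccounted for.

```python
# Shares of total downtime, in percent, as stated in the text above.
stated_shares = {
    "software": 15,
    "recovery deficiencies": 35,
    "procedural errors": 30,
}


def unitemized_share(shares):
    """Return the percentage of downtime not covered by the listed categories."""
    return 100 - sum(shares.values())


if __name__ == "__main__":
    for cause, percent in stated_shares.items():
        print(f"{cause}: {percent} percent")
    print(f"not itemized here: {unitemized_share(stated_shares)} percent")
```

Recovery deficiencies and procedural errors together account for 65 percent of the downtime, which is why the text singles out recovery as the most difficult function and calls for fewer manual operations.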
The shortcomings and deficiencies of current systems are being continually corrected to improve system reliability.
When a fault occurs in a nonredundant single processor, the system will remain down until the processor is repaired. In order to meet the ESS reliability requirement, redundancy is included in the system design; continuous and correct operation is maintained by duplicating all functional units within the processor. If one of the units fails, the duplicated unit is switched in, maintaining continuous operation. Meanwhile, the defective unit is repaired. Should a fault occur in the duplicated unit during the repair interval, the system will, of course, go down. If the repair interval is relatively short, the probability of simultaneous faults occurring in two identical units is quite small. This technique of redundancy has been used throughout each ESS.
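The claim that a short repair interval keeps the chance of a double fault small can be illustrated numerically. The following is a minimal sketch, not a model taken from the text: it assumes that each unit fails independently with exponentially distributed times to failure, and the mean time between failures and repair times used in the example are hypothetical figures chosen only for illustration.

```python
import math


def second_fault_probability(mtbf_hours, repair_hours):
    """Probability that the surviving unit fails before the repair completes.

    Assumes exponentially distributed time to failure (an assumption of
    this sketch, not a statement from the text).
    """
    # 1 - exp(-t / MTBF), computed with expm1 for accuracy at small ratios.
    return -math.expm1(-repair_hours / mtbf_hours)


def duplex_outage_rate(mtbf_hours, repair_hours):
    """Approximate rate (per hour) at which both units are down together.

    Either of the two units may fail first (rate 2 / MTBF); an outage
    follows only if the other unit also fails during the repair interval.
    """
    first_fault_rate = 2.0 / mtbf_hours
    return first_fault_rate * second_fault_probability(mtbf_hours, repair_hours)


if __name__ == "__main__":
    mtbf = 10_000.0  # hypothetical mean time between failures, in hours
    for repair in (2.0, 20.0, 200.0):  # hypothetical repair intervals, in hours
        p = second_fault_probability(mtbf, repair)
        rate = duplex_outage_rate(mtbf, repair)
        print(f"repair {repair:6.1f} h: P(second fault) = {p:.6f}, "
              f"outage rate = {rate:.3e} per hour")
```

Under these assumptions the probability of a second fault grows roughly in proportion to the repair interval as long as that interval is much shorter than the mean time between failures, which is the reasoning behind keeping repair times short in a duplicated system.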
The first-generation ESS processor structure consists of two store communities: program store (PS) and call store (CS). The program store is a read-only memory (ROM) containing the call processing, maintenance, and administration programs; it also contains long-term translation and system parameters. The call store contains the transient data related to telephone calls in progress. The memory is electrically alterable to allow its data to be changed frequently. In one particular arrangement, shown in Fig. 2, the complete processor is treated as a single functional block and is duplicated. This type of single-unit duplex system has two possible configurations: Either Processor 0 or Processor 1 can