Chapter 28
Fault-Tolerant Design of Local ESS Processors¹
W. N. Toy
Overview
The stored program control of Bell System Electronic
Switching Systems (ESS) has been under development since 1953. During this
period, the No. 1 ESS, the No. 2 ESS, and the No. 3 ESS have been developed
and used extensively by Bell System operating companies to provide commercial
telephone service. These systems serve all types of telephone offices:
The large-capacity No. 1 ESS serves metropolitan offices, the medium-capacity
No. 2 ESS was designed for suburban offices, and the No. 3 ESS can be found
in many small rural offices. The fault-tolerant design of ESS processors
provides the same highly dependable telephone service established by the
previous electromechanical systems. Pertinent processor architecture features
used to achieve ESS reliability objectives are discussed.
Introduction
Next to computer systems used in space-borne vehicles and U.S. defense installations, no other application has a higher availability requirement than a Bell System Electronic Switching System (ESS). These systems have been designed to be out of service no more than a few minutes per year. Furthermore, design objectives permit no more than 0.01 percent of the telephone calls to be processed incorrectly [Downing, Nowak, and Tuomenoksa, 1964]. For example, when a fault occurs in a system, a few calls in progress may be handled incorrectly during the recovery process.
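To put these objectives in numerical terms, the following minimal sketch converts a yearly downtime figure into an availability fraction and applies the 0.01 percent limit to a call count. The function names and the sample inputs (3 minutes per year, one million calls) are illustrative assumptions; the text specifies only "a few minutes per year" and the 0.01 percent figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def availability(downtime_minutes_per_year: float) -> float:
    """Fraction of the year the system is in service."""
    return 1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR


def max_mishandled_calls(total_calls: int) -> int:
    """Largest whole number of calls within the 0.01 percent objective.

    0.01 percent is 1 call in 10,000; integer division avoids
    floating-point rounding.
    """
    return total_calls // 10_000


if __name__ == "__main__":
    # Assumed inputs, chosen only for illustration.
    print(f"availability at 3 min/year: {availability(3):.7f}")
    print(f"allowed mishandled calls per 1,000,000: {max_mishandled_calls(1_000_000)}")
```

Under these assumed inputs, the downtime objective corresponds to an availability above 99.999 percent.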
At the core of every ESS is a single high-speed central processor [Harr, Taylor, and Ulrich, 1969; Browne et al., 1969; Staehler, 1977]. To establish an ultrareliable switching environment, redundancy of system components and duplication of the processor itself has been the approach taken to compensate for potential machine faults. Without this redundancy, a single component failure in the processor might cause a complete failure of the entire system. With duplication, a standby processor takes over control and provides continuous telephone service.
When the system fails, the fault must be quickly detected and isolated.
Meanwhile, a rapid recovery of the call processing functions (by the redundant
component(s) and/or processor) is necessary to maintain the system's high
availability. Next, the fault must be diagnosed and the defective unit
repaired or replaced. The failure rate and repair time must be such that
the probability is very small for a failure to occur in the duplicated
unit before the first one is repaired.
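The relation between failure rate, repair time, and the chance of a second failure can be sketched as follows. This is an illustrative model, not one taken from the text: it assumes that the standby unit fails at a constant rate (exponentially distributed time to failure), so the probability of a failure within a repair interval of length T is 1 - exp(-λT). The function name and the sample rate and repair times are hypothetical.

```python
import math


def prob_second_failure_during_repair(failure_rate_per_hour: float,
                                      repair_hours: float) -> float:
    """Probability that the remaining unit fails before repair completes.

    Assumes a constant failure rate (exponential model), which is an
    assumption of this sketch. expm1 keeps precision when the product
    rate * time is very small.
    """
    return -math.expm1(-failure_rate_per_hour * repair_hours)


if __name__ == "__main__":
    rate = 1e-4  # hypothetical: one failure per 10,000 unit-hours
    for hours in (0.5, 2.0, 8.0):  # hypothetical repair times
        p = prob_second_failure_during_repair(rate, hours)
        print(f"repair time {hours:4.1f} h -> P(second failure) = {p:.2e}")
```

For small λT the probability is approximately λT, so under this model halving either the failure rate or the repair time roughly halves the exposure to a double failure.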
Allocation and Causes of System Downtime
The outage of a telephone (switching) office can be caused by facilities other than the processor. While a hardware fault in one of the peripheral units generally results in only a partial loss of service, it is possible for a fault in this area to bring the system down. By design, the processor has been allocated two-thirds of the system downtime. The other one-third is allocated to the remaining equipment in the system.
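The allocation can be written out as simple arithmetic. In this minimal sketch the two-thirds share comes from the text, while the function name and the 3-minute yearly budget are assumptions chosen only for illustration.

```python
from fractions import Fraction

PROCESSOR_SHARE = Fraction(2, 3)  # from the text; the rest goes to other equipment


def allocate_downtime(total_minutes):
    """Split a downtime budget into (processor, remaining equipment) shares."""
    total = Fraction(total_minutes)
    processor = total * PROCESSOR_SHARE
    return processor, total - processor


if __name__ == "__main__":
    # Hypothetical budget of 3 minutes per year.
    processor, other = allocate_downtime(3)
    print(f"processor: {float(processor)} min/year, other equipment: {float(other)} min/year")
```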
Field experience indicates that system outages due to the processor may be assigned to one of four categories shown in Fig. 1 [Staehler and Watters, 1976]. The percentages in this figure represent the fraction of total downtime attributable to each cause. The four categories are as follows.
Hardware Reliability
Before the accumulation of large amounts of field data, total system downtime was usually assigned to hardware. We now know that the situation is more complex. Processor hardware actually accounts for only 20 percent of the downtime. With growing use of stored program control, it has become increasingly important to make such systems more reliable. Redundancy is designed into all subsystems so that the system can go down only when hardware failures occur simultaneously in duplicated units. However, the data now show that good diagnostic and trouble location programs are very critical parts of the total system reliability performance.
¹ Subsetted from Proc. IEEE, vol. 66, no. 10, October 1978, pp. 1,126-1,145.