
386 Part 2 | Regions of Computer Space
Section 4 | Multiple-Processor Systems



suitable for many other tasks, and we discuss several of these briefly below.

A. Message Systems

We have made an extensive study of the possibility of using the Pluribus computer as the basis for a message system. By message system we mean not only traditional message-switching such as is done in the Telex system, but also a system of mailboxes and files by which users can exchange and file messages without recourse to the U.S. Postal System, secretaries, or filing cabinets, and which will permit complicated searches and sorts of message files. Such a system must have high availability but could easily tolerate brief outages after a failure.

B. Real-Time Signal Processing

We have already built one system which is the front-end and control processor for a seismic data collection network, and which performs some preprocessing of seismic data [Gudz, 1977]. We believe this application can be extended to other areas of real-time signal processing with requirements for high overall system availability. Since many signal processing tasks can be broken into parallel components, the multiprocessor architecture would be especially appropriate.
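The idea of breaking a signal processing task into parallel components can be sketched as follows. This is only an illustration under stated assumptions: the block-energy computation, the function names, and the use of a Python thread pool as a stand-in for multiple processors are all hypothetical and are not taken from the seismic system described above.

```python
from concurrent.futures import ThreadPoolExecutor


def block_energy(samples):
    """Sum of squared samples for one block (a stand-in preprocessing step)."""
    return sum(x * x for x in samples)


def split_blocks(samples, block_size):
    """Cut the sample stream into consecutive fixed-size blocks."""
    return [samples[i:i + block_size] for i in range(0, len(samples), block_size)]


def parallel_energy(samples, block_size=4, workers=4):
    """Process the blocks independently, then combine the partial results."""
    blocks = split_blocks(samples, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves block order, so partials line up with blocks.
        partials = list(pool.map(block_energy, blocks))
    return partials, sum(partials)
```

Because each block is processed without reference to the others, the loss of one worker would affect only the blocks assigned to it, which could then be reprocessed elsewhere.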


C. General-Purpose Timesharing Systems

It seems to us that explicit use of fault-tolerant techniques could benefit general-purpose timesharing systems and large operating systems. These systems operate continuously and are subject to minor hardware errors and subtle software bugs, but do not require totally uninterrupted operation. Although most large systems include some self-checking in the software, software fault-tolerance is truly effective only when it is well integrated into the overall system design and with the special hardware features which are usually required.

One of the primary purposes of most large operating systems is to provide disk and tape handling features. In this context, reinitialization in response to faults is a much more serious problem than, for example, in the IMP. Various checkpointing procedures may be required to restore the overall system state to a point where restart is possible [Yourden, 1972, pp. 340-353]. Large operating systems often support a variety of checkpointing services since the best techniques to use under these circumstances depend in part on the applications being serviced; in cases involving on-line database updates, the application programs themselves must be designed around their fault-tolerance requirements.
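A minimal sketch of one checkpointing procedure may make the restart idea concrete. It assumes a job whose whole state fits in a small JSON file; the function names, the state layout, and the `fail_at` fault-injection parameter are hypothetical and chosen only for illustration. Real operating systems offer a variety of checkpointing services, as noted above.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a fault leaves either the old or the new file.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or the default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run_job(path, items, fail_at=None):
    """Process items in order, checkpointing after each one.

    fail_at simulates a fault by raising before that item index is processed.
    """
    state = load_checkpoint(path, {"next": 0, "total": 0})
    for i in range(state["next"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += items[i]
        state["next"] = i + 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run_job` again with the same path after a fault resumes from the last checkpoint instead of reprocessing earlier items. Checkpointing after every item is the simplest policy; a real system would weigh checkpoint cost against the amount of work lost at restart.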

D. Reservations Systems

Airline, hotel, and car rental reservation systems provide good examples of on-line database systems which could benefit from well-designed software fault-tolerance systems. Once a reservation has been accepted, it must not be lost. Backup techniques such as dual updating of two copies of the database, perhaps located in different cities with independent central processors and telecommunications systems, may be worthwhile. On the other hand, minor problems (hardware or software) may be tolerated, especially if the problems can be resolved by reentering on-line transactions which were affected by the fault. Even with dual machines in remote locations, using a machine like the Pluribus would increase the reliability of each site separately, and provide substantial computing power in an expandable package. Further research will be required to understand fully the implications for the Pluribus of the database integrity requirements of reservation systems.
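Dual updating, together with reentry of the transactions affected by a fault, can be sketched as follows. This is an illustration only: the `Replica` and `DualStore` classes, the in-memory dictionaries standing in for the two database copies, and the policy of accepting a reservation when at least one copy records it are all assumptions, not a description of any actual reservation system.

```python
class Replica:
    """One copy of the reservation database (an in-memory stand-in)."""

    def __init__(self, name):
        self.name = name
        self.records = {}
        self.up = True

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is unavailable")
        self.records[key] = value


class DualStore:
    """Applies each update to two replicas and journals updates one missed."""

    def __init__(self, first, second):
        self.replicas = [first, second]
        self.pending = {first.name: [], second.name: []}

    def reserve(self, key, value):
        """Accept the reservation only if at least one copy recorded it."""
        missed = []
        for replica in self.replicas:
            try:
                replica.write(key, value)
            except ConnectionError:
                missed.append(replica.name)
        if len(missed) == len(self.replicas):
            raise RuntimeError("reservation rejected: no copy available")
        for name in missed:
            self.pending[name].append((key, value))

    def resync(self, replica):
        """Reenter, in order, the transactions a recovered replica missed."""
        for key, value in self.pending[replica.name]:
            replica.write(key, value)
        self.pending[replica.name].clear()
```

A short usage example, with hypothetical site names and keys:

```python
a, b = Replica("site_a"), Replica("site_b")
store = DualStore(a, b)
store.reserve("F100-12A", "Smith")
b.up = False                         # one site fails
store.reserve("F100-12B", "Jones")   # still accepted; journaled for site_b
b.up = True
store.resync(b)                      # the two copies agree again
```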

E. Process Control

Our approach is clearly more appropriate to some areas of process control than to others. We envision a typical application in the area of overall supervisory systems coordinating a number of subsidiary systems or controllers, and incorporating tasks such as inventory control and job scheduling. Processes that could afford to stop momentarily would be controlled directly. End-to-end error correction and fault-masking hardware would be used in the machine interface for applications needing overall fault-tolerance. As with the previous applications, some form of checkpointing would be built in to preserve context over restarts.

 

References

Avizienis [1975]; Avizienis [1976]; Barnes et al. [1968]; Bressler, Kraley, and Michel [1975]; Enslow [1974]; Goldberg [1975]; Gudz [1977]; Heart [1975b]; Heart, Kahn, Ornstein, Crowther, and Walden [1970]; Heart, Ornstein, Crowther, and Barker [1973]; Heart, Ornstein, Crowther, Barker, Kraley, Bressler, and Michel [1976]; Mann, Ornstein, and Kraley [1976]; McKenzie, Cosell, McQuillan, and Thrope [1972]; Myers et al. [1977]; Ornstein, Crowther, Kraley, Bressler, Michel, and Heart [1975]; Ornstein, Heart, Crowther, Russell, Rising, and Michel [1972]; Ornstein and Walden [1975]; Roberts and Wessler [1970]; U.S. Pat. 4,035,766 [1977]; Wolf [1973]; Wulf and Bell [1972]; Yourden [1972].
