446 Part 2 | Regions of Computer Space Section 6 | Fault-Tolerant Systems
"process-pairs." One I/O process is designated as primary, the other as backup. All file modification messages are delivered to the primary I/O process. The primary sends a message with checkpoint information to the backup so that it can take over if the primary's processor or access path to the I/O device fails. Files can also be duplicated on physically distinct devices controlled by an I/O process-pair on physically distinct processors. All file modification messages are delivered to both I/O processes. Thus, in the
case of physical failure or isolation of the primary, the backup file is up to date and available.
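The I/O process-pair arrangement can be illustrated with a minimal sketch. It is not Tandem code: the class names, the `modify` and `read` methods, and the choice to checkpoint by forwarding each modification to the backup are all assumptions made for illustration.

```python
class IOProcess:
    """One member of a hypothetical I/O process-pair."""

    def __init__(self, processor):
        self.processor = processor   # which processor hosts this process
        self.alive = True
        self.state = {}              # block number -> data, as known to this process

    def record(self, block_no, data):
        self.state[block_no] = data


class IOProcessPair:
    """Hypothetical pair: requests go to the primary, checkpoints to the backup."""

    def __init__(self):
        self.primary = IOProcess(processor=0)
        self.backup = IOProcess(processor=1)

    def _current_primary(self):
        # If the primary has failed, promote the backup (assumed takeover rule).
        if not self.primary.alive:
            if self.backup is None or not self.backup.alive:
                raise RuntimeError("both members of the pair have failed")
            self.primary, self.backup = self.backup, None
        return self.primary

    def modify(self, block_no, data):
        """Deliver a file modification message to the primary."""
        primary = self._current_primary()
        primary.record(block_no, data)
        # Checkpoint message: here simply the modification itself (an assumption).
        if self.backup is not None and self.backup.alive:
            self.backup.record(block_no, data)

    def read(self, block_no):
        return self._current_primary().state[block_no]
```

In this sketch, marking `pair.primary.alive = False` after a call to `modify` and then calling `read` promotes the backup, which already holds the checkpointed modification.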
User applications can also use the process-pair mechanism. Consider
a NonStop application program A. Program A starts up a backup process A1
in another processor. There are also duplicate file images, one designated
primary and the other backup. Program A periodically (at user-specified
points) sends checkpoint information to A1. A1 is the same program as A,
but it knows that it is a backup program. A1 reads checkpoint
messages to update its data area, file status, and program counter. A1
loads and executes if the system reports A's processor is down (i.e., if
an error message is sent from A's operating system image or if A's processor
fails to respond to a periodic "I'm alive" message). All file activity
by A is performed on both the primary and backup file copies. When A1 starts
to execute from the last checkpoint, it may attempt to repeat I/O
operations successfully completed by A. The system file handler will recognize
this situation and send A1 a message indicating that the I/O operation has already completed successfully. Once it is running in place of A, A1 periodically
asks the operating system whether a backup process exists. Since one no
longer does, A1 can request the creation and initialization of a new backup copy of
both the process and the file structure. More information on the operating
system and the programming of NonStop applications can be found in Bartlett
[1977].
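The checkpoint-and-takeover sequence for program A and its backup A1 can be sketched as follows. This is an illustrative model only, under several assumptions not stated in the text: the checkpoint is reduced to a single index standing in for the data area and program counter, the processor failure is simulated by the `fail_at` parameter, and the file handler is assumed to recognize repeated operations by a sequence number. All names (`FileHandler`, `Worker`, `demo`) are hypothetical.

```python
class FileHandler:
    """Hypothetical file handler that keeps primary and backup file copies."""

    def __init__(self):
        self.primary_copy = []
        self.backup_copy = []
        self.completed = set()   # sequence numbers of finished writes (assumed mechanism)

    def write(self, seq, record):
        if seq in self.completed:
            # Repeat of an operation the failed primary already finished.
            return "already completed"
        self.primary_copy.append(record)
        self.backup_copy.append(record)
        self.completed.add(seq)
        return "completed"


class Worker:
    """The same program runs as A (primary) and A1 (backup)."""

    def __init__(self, files, records):
        self.files = files
        self.records = records
        self.next_index = 0      # stands in for data area and program counter

    def restore(self, checkpoint):
        self.next_index = checkpoint["next_index"]

    def run(self, backup=None, checkpoint_every=2, fail_at=None):
        results = []
        while self.next_index < len(self.records):
            if fail_at is not None and self.next_index == fail_at:
                return results, False            # simulated processor failure
            if backup is not None and self.next_index % checkpoint_every == 0:
                # User-specified checkpoint: send current state to the backup.
                backup.restore({"next_index": self.next_index})
            seq = self.next_index
            results.append(self.files.write(seq, self.records[seq]))
            self.next_index += 1
        return results, True


def demo():
    files = FileHandler()
    records = ["r0", "r1", "r2", "r3", "r4"]
    a = Worker(files, records)
    a1 = Worker(files, records)
    a.run(backup=a1, checkpoint_every=2, fail_at=3)   # A fails before writing r3
    takeover_results, finished = a1.run()             # A1 resumes from last checkpoint
    return takeover_results, finished, files
```

In this run the last checkpoint was taken before record `r2` was written, so A1 repeats that write. The handler reports it as already completed instead of applying it twice, and both file copies end up with each record exactly once.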