previous | contents | next

both on the objectives of the design (how much speed is worth) and on how much gain can. be made in the overall time of the algorithm.

Array parallelism is not the most general form of parallel computation, defined as many independent- heterogeneous computations going on concurrently. It is distinguished by the fact that all the parallel computations are 'identical, hence there need only be a single control part for them all. This has often been likened to a symphony orchestra with a single conductor (the control) and many instruments. As we have seen, even in our simple example, the computations need not be completely identical. Often a small amount of control can be included in each instrument and still most of the control can be put in the hands of the single conductor. The problem of so orchestrating a computation to make this possible can pose a major challenge in the analysis - of the underlying algorithm.

The cost in hardware of array parallelism is very high. There exists another alternative for parallel organization, commonly called pipelining, that uses less equipment though (in consequence) obtains less speedup. It exploits the fact that any sequential flow diagram can be viewed as a sequence of work stations, through which the data to be processed flows. Each work station is active at the same time (i.e., in parallel), but doing its particular job on a different input. Thus, the computation flows along, as in a pipeline. An equally apt metaphor would have been assembly-line processing, since this' is the same organization that is used in all mass production assembly lines (e.g., for automobiles). To make use of pipelining we need a continuing stream of inputs, for the efficiency depends on not having holes in the stream of data. But this is exactly the-situation we have in the averaging algorithm of Figure 7.

We could pipeline the entire process of Figure 7, but to keep our attention on multiplication let us just build a pipeline for multiplication, viewing the total flow diagram as providing a context in which the demand for multiplications are provided at a sufficiently high rate to justify a pipeline multiplier. Note that to do averaging, the final stage simply has to accumulate a running sum, which eventually gets multiplied by 1/W, or, more correctly divided by W.

The basic stage in the multiplication algorithm is the multiplication-step, which forms the body of the loop in the basic flow diagram of Figure 8. Rather than executing a loop to use the same hardware (the DMgpa's) for each stage, we now wish to form our pipeline by providing separate parallel systems for each stage. Figure 12 shows the organization. There are 8 iteration steps in the algorithm, hence there are 8 stages in the pipeline. Each step consists of taking in P and MPD, looking at P<0>, and then transferring data (the new values of P and MPD) to the next stage. Each stage essentially duplicates the data and memory part of the basic iteration step, here the DMgpa and the Kbus. Memory used for control will in general be different. Here, there is no need for the Mc used to hold the iteration count; but there are two 1-bit flags, P-full and MPD-full, that are required for the pipeline control.

The control for a single stage is given in Figure 13. The two flags for each stage permit the synchronization for the stages to operate at the same time yet pass data to one another. Stage i (in the figure) does not transfer data from the prior stage (i-1) to its own registers, P[i] and MPD[i], until the i-1 registers are full (as guaranteed by the i-1 flags being 1) and its own registers have had their contents transferred to the i+1 stage (as guaranteed by the i flags being 0). Thus, stage i has the responsibility for setting the flags of stage i-1 to 0 after it

124

previous | contents | next