Much more important is to recognize that the loop in the algorithm is basically a way of getting N independent operations performed (multiplying the X's by the W's). Thus, all of the multiplications could be carried out at the same time -- in parallel to use the accepted term. This will radically alter the total time to perform the algorithm, since what took N time intervals originally takes only one with the parallel implementation. The amount of gain depends on the size of N, and can reach any size if N is large enough. However, the gain, also depends on the rest of the algorithm and how it is implemented. For instance, the weighted X's still have to be summed and multiplied by 1/W. Since these operations depend on each other they cannot simply be separated into parallel computation streams. Thus, to evaluate the actual gain requires that we have a total implementation to analyze.

To implement the parallel scheme requires separate data-memory parts for each independent multiplier, in particular a DMgpa and a Kbus. The straightforward way would be also to have independent control parts. This would require a K parallel-merge at the end to be sure that the summation loop did not begin until all the multiplications had been completed.

However, each of the parallel parts has an identical control -- being simply copies of the same algorithm. Thus, one need only construct a single basic control part. Figure 11 gives the RTM structure of this implementation. Note that Kparallel-merges are required in order to coordinate each step. One can certainly not depend on the systems to remain synchronized by themselves, even if the components are of identical physical construction. Note also that separate control must exist for each test on P<0>, since which way control goes is data dependent. The basic control loop is not data dependent, which permits it to be factored out. The basic loop of the Figure 9 is still needed, but it now encompasses only the summation of the weighted X's, and we do not show it. Similarly, the final multiplication by 1/W still occurs in series and is not shown.

We can now make a rough estimate of the increase in speed from the parallelism. The serial scheme takes:

t.serial = N*t.multiply + N*t.sum-loop + t.multiply + t.set-up The parallel takes:

t.parallel = t.multiply + N*t.sum-loop + t.multiply + t.set-up'

The two set-up times and the time for the summation loop are small fractions, *f, *of the multiply time and we can approximate them by f*t.multiply. Thus we get:

t.parallel/t.serial = (2 + f*N + f)/(N + 1 + f*N + f)

As N gets very large -- thus getting the most parallelism -- the increase in speed becomes like f/(1+f) (simply ignoring all terms without an N). If the small operations, represented by f, were such that f -1/8 (e.g., like a single multiply step) then this limiting advantage would be 1/9. We see that the remaining structure of the algorithm puts a limit to how much advantage is to be obtained by parallelism. While N is still small the advantage is about 2/(N+1), reflecting the fact that at least two multiplications must be done in series. If we consider only the multiplication time as the. measure of performance, the ratio t.parallel/t.serial - 1/N. That is, we achieve a factor of N speed-up.

This form of parallelism, which we can call array parallelism, arises repeatedly in RT-level design. Many of the data structures that occur in practice consist of one or two dimensional arrays, and algorithms often specify that the same operation be done independently to each element of the array. Then the option always exists of factoring out the independent part and doing it in parallel, thus reducing that part of the time cost by a factor of N, the size of the array. One pays substantial additional hardware for this, and whether it is worth it depends

122