Consider the parallel prefix operation, for example. It takes an array
of values a0, ..., an-1 and an operator
, and returns,
on each node k , the value
.Any associative operation can be used for
: add (integer or FP),
OR, AND, XOR, MIN, MAX, etc., as well as any user-defined operation;
operands can be of any length.
One well-known use of parallel prefix is the carry-lookahead adder.
On a scheduled router, the prefix implementation can be partly expressed in
the router, with nodes multicasting partial sums to all their descendants
in the prefix tree with a single message send. As a result, there is
no extra overhead in the processor from having to resend the data to
its children, and the processor simply does the minimum number of reads
and writes to the network necessary to provide arguments to the prefix
operation being computed.
Other complex communication structures can also be built in the router to take data where it is generated and deliver it to where it is needed without requiring complex multicast setups in the processor, or requiring processors to forward data within the mesh. For example, data can be scattered and gathered within the mesh by constructing the router code to deliver the first word on a given stream to its processor and route the remainder of the message to a neighbor processor. Conversely, a stream could be set up that took data from its processor and forwarded it, then reconfigured automatically to take data from its neighbor only. Such operations on a traditional online-routed machine require multiple messages, each with its own setup time and header overheads.