next up previous contents
Next: Determining Which Operators are Up: Changing Phases Previous: Changing Phases

Subsections

Operator Termination

There are a variety of ways to determine whether a node has transferred all data related to a given node in a phase. Remember that these techniques are only applied at the time of a phase transition, after the operator has been used for the last time in the phase. When operators are used multiple times throughout the phase, there is no need for any particular technique to guarantee that the data is transferred correctly, since no phase changes are yet being performed.

Message-Bound Implementations

The most straightforward case is to rely on the operator's message size. If an operator writes a relatively short message, it may be entirely buffered in the mesh; the writer may finish writing the message long before the reader starts reading. By contrast, a suitably long message can provide synchronization guarantees. Messages are read synchronously on the remote end by COP (except with online routing, as discussed in Section 4.2). Accordingly, when an operator's message size is longer than the amount of buffering in the mesh between the source and the destination, the reader will have to start reading before the writer can finish writing.

As a result, since data is guaranteed to be being drained from the mesh by the reader, the data currently on the node's router will be forwarded off the router and onto the neighbor node in a sharply bounded period of time. While it is possible that the remote node may encounter cache misses while writing the data read from the interface, that time can be factored into the length of time to delay to ensure that sufficient time has passed for the last data word to be able to leave the writer node. Additional sources of delay on the reader, such as interrupts (e.g., from the online routing mentioned previously) can be temporarily suspended to make the necessary timing guarantees. Implementations that offer global synchronization guarantees in this way are referred to as message-bound.

Operator implementations may have varying degrees of complexity, which result in different definitions of a `long' message. For a simple multicast-based broadcast, the number of VFSM and processor interface buffers present along the longest path from the source to any destination can be used. Writing one more than this many words guarantees that all the other nodes in the mesh are reading the broadcast, since otherwise the writer would have stalled. For a more complex operator, such as a circular shift, the compiler must follow multiple streams from node to node around the mesh to determine the number of possible buffer registers that must be filled for a given operator before it can guarantee that the I/O function completing guarantees a degree of synchrony with the other nodes in the subset. Still other implementations, such as parallel prefix, can automatically make a `long message' synchronization guarantee for any length of message, since the operator itself involves synchronizing the nodes in its subset.

So far the discussion has focussed only on delays for the writer node. This makes sense, since when the reader function completes, clearly all the data on that stream has left the router. However, some operators use multiple streams through a single node. For example, a circular shift operator on a linear array of nodes has one long stream that goes from one end-node to the other. When an intermediate reader node completes, it must also perform a pause to ensure that the tail end of the message on the long stream has time to move through the node.

This technique is implemented in the compiler by having the returned implementation-specific information structure hold a pointer to an array of per-node phase delays, if the operator is message-bound for that implementation. Then, if a phase change follows a particular operator, those delay values are used before changing phase.

Processor-Bound Implementations

When the message size is short, a different technique can be employed to determine when the operator has finished writing. A single operator implementation may be implemented such that when the `read' or `write' function returns control to the processor, it guarantees that the node has finished all routing for that operator's data. Such implementations are called `processor-bound'.

As an example, a standard multicast implementation of broadcast will not be processor-bound--when the write function completes, data may still be present in the router. A processor-bound implementation of broadcast includes the source node itself in the multicast. By ensuring that the last portion of the multicast on the source node is the write back to the processor interface, the processor is guaranteed that, once it reads the final word of the message back from the router, the router will perform no further I/O on behalf of the broadcast operator.

Making such a guarantee can be harder for other operators. For example, a linear, nearest-neighbor end-off shift can include multicast-to-source on each stream, thus making it processor-bound. By contrast, a circular shift on the same array adds a single stream traversing the entire mesh (as mentioned above), and without further explicit multicasting to the intermediate nodes there would be no way to know when a given node had finished routing data for the operator.

Performance of processor-bound operators will suffer if the stream would normally be allocated bandwidth greater than half the processor's interface speed; that is, for an optimal processor interface, a processor-bound operator can't run faster than 50% bandwidth. This is due to the fact that normally, a processor-bound implementation requires some nodes to perform an extra I/O on every cycle. In practice, this limitation is not severe, because if multiple operators are scheduled in a phase there will often be network bandwidth limits such that no one operator will be able to run that fast.

A significant aspect of this technique is that when a node finishes the phase and instructs the router to change phases, it has no idea how long it will be until its neighboring nodes also change phases. This fact is considered further in Section 5.3.


next up previous contents
Next: Determining Which Operators are Up: Changing Phases Previous: Changing Phases
Back to Chris Metcalf's home page