next up previous contents
Next: Implementation Types Up: Operator Implementations Previous: Operator Implementations

Subsections

Approaches to Data Dependency

  The most interesting aspect of implementations for scheduled routing is coming up with techniques to handle run-time values. This section outlines some implementation techniques for doing so.

If an application communicates to some remote node, but it is not known which at compile time, a stream can't simply be scheduled at compile time. However, there are a variety of work-arounds that can be invoked, depending on the degree of information present about the distribution of the possible destinations, the amount of data to be carried by the stream, and the size of the mesh.

An operator can be implemented in one of a number of different ways depending on the anticipated bandwidth and latency demands of the operator as compared with what the different implementations can provide. For example, if a run-time operator will transfer only a few words, the run-time system should not waste a great deal of time online generating a schedule, but should instead try to route the data quickly using online routing. The remainder of this section presents a spectrum of high-level alternatives for supporting data-dependent routing. (Alternatives not implemented in the current compiler are marked with a dagger, `$\dagger$'.)

Pure Compile-Time

The far end of the spectrum is when the operator arguments are all known at compile time. In this case an efficient communications schedule can always be generated. Venturing further from this end of the spectrum incurs a price either in latency or bandwidth or both to route the data. Examples of this are many: to name a few that have already presented elsewhere, both transpose and bitreverse are simply sets of compile-time streams, as is the `shift' phase of the matrix multiplication algorithm shown in Figure 2.6 on p. [*]. Returning to the common broadcast example, a pure compile-time implementation of broadcast is simply a mesh-level multicast for a scheduled routing system.

Multiple Path Allocation$\dagger$

The easiest way to handle data dependency if there is a small number of possible destinations is to schedule all of the paths and use the appropriate one at run time. This is particularly true if the amount of data to be transferred is small, since allocating multiple paths results in no extra message startup overhead time, unlike most of the schemes described below. This scheme rapidly becomes impractical as the the number of destinations grow and the amount of data to be transferred increases; in the limit, for example, to use this scheme for a random linear permutation of size n results in streams running as slow as 1/n words/cycle, and schedules of length $\Theta(n)$.For relatively small meshes, this provides a low-overhead, general-purpose solution to run-time operators. For example, for a two-node mesh with a run-time broadcast, the compiler might simply schedule one stream going each way, each one a separate compile-time `broadcast'.

Output Discarding

A related but less bandwidth-intensive technique is to implement multiple destinations as a multicast tree (or, similarly, multiple sources as a reduction-style tree). Then, when determining where to send data, the destinations that are uninterested simply rewrite their node to discard the data instead of delivering it to the router. On a scheduled-routing mesh, this technique actually consumes less bandwidth than multiple path allocation, since a spanning tree is used with a constant bandwidth component across the entire tree, unlike the above solution. For example, this technique can be used to convert a compile-time broadcast to a stream with a compile-time source and a run-time destination; nodes other than the destination simply discard the broadcast within the router itself.

Dynamic Destination Selection$\dagger$

A dynamic-destination stream operator may have multiple possible destinations, but all of the destinations lie close together. In this case, the problem may be best handled by allocating a single `data stream' to carry the data across the mesh to where the cluster of destinations is, and only then split the message off to where it needs to go. A `control stream' is set up so that at load time it carries a control word to the remote router that splits the data stream; the control words reprograms the router to point to the desired final destination. The data can then be sent at full bandwidth to the desired node, starting only a few cycles behind the control word. Multiple decision points can also be used, with each controlled by a separate control stream.

Multiple Schedules$\dagger$

A different approach, when the number of run-time possibilities is relatively small, is to simply select an appropriate schedule based on the run-time arguments, load that version of the schedule on all the nodes, and then run at full bandwidth. This approach requires that all the nodes involved in the communication know which schedule to load. For example, in a broadcast, if all the nodes know who is broadcasting they can choose the appropriate piece of broadcast schedule to run: either a schedule that generates and forwards the data, or a schedule that expects to read the data from one direction and forward it to the appropriate onward directions. Since each schedule typically only requires a few hundred bytes, a fair number of schedules can be loaded into the processor to be selected at run time. If all the variant schedules can fit simultaneously in the router, furthermore, the run-time decision may simply consist of a comparison or two and a single schedule-change command to the router.

Run-Time Specification

A slightly more complex alternative is to generate a fairly generic schedule which can be customized at run time. If the generic schedule can be expressed as routable streams, the operator can be scheduled in parallel with other operators, since the customization can be applied to the VFSMs that make up a specific stream after stream routing has been performed. For example, a linear broadcast from a runtime source could be implemented by allocating two paths going through the mesh; these paths can be scheduled in parallel with any other paths. Then, the node that is doing the broadcast rewrites each stream to read data from the processor rather than the neighbor node; now, the broadcast node can write to the two streams and reach all the other nodes. (Of course, for subsequent broadcasts from different nodes those streams would need to be reset to their original values again.) This technique requires $2 \times$ the bandwidth of a compile-time broadcast.

Run-Time Operator-Specific Scheduling$\dagger$

If the operator needs to transfer a substantial amount of data, an entire schedule can be generated using operator-specific code, tuned to produce an efficient schedule using the run-time data that is available. In the simplest case, each node computes the same problem (e.g., the scheduling problem for the whole mesh for a given permutation) and then puts the relevant part of the solution into its own router. More sophisticated techniques might involve solving (for this example) the permutation problem in parallel--using a built-in schedule for the routers during the parallel schedule generation--then switching to the resulting schedule to perform the permutation. Since the startup time for this technique is high, however, the compiler must ascertain that the amount of data to be transferred is large enough to amortize the startup schedule computation. This technique also tends not to be suitable in cases where multiple operators are running in parallel, since operator-specific code typically assumes it can use all the available schedule slots.

General Run-Time Scheduling$\dagger$

If the operator has large data-transfer needs, but is merged with other operators, part of the schedule compiler can be included in the run-time image, and allow the nodes to generate their own schedules before they begin to use them. This schedule generation may be constrained by fixed schedule slots taken by continuing operators already scheduled in an earlier phase, but is otherwise the same problem that the compiler normally solves for continuing operators with compile-time arguments. A more ambitious solution might include incorporating parallel algorithms for computing stream allocations at run time.

Processor Intervention

If data-dependent decisions must be made, sometimes the easiest place to do so is in the processor itself. In this case, streams can be routed from one decision-point to another, with the processor involved in deciding which output streams to forward the input data to. As an example, a run-time broadcast can be performed by using a `flood fill' technique. Nearest-neighbor streams are scheduled between all the nodes involved in the broadcast. For each broadcast, since all the nodes know which node is broadcasting, each node reads an appropriate stream and then forwards the data to the appropriate output stream(s). Controlling which dimensions forward the data to other dimensions yields a simple flood-fill algorithm for broadcast.

Online Routing

The ultimate fallback for all operators is online routing; this is effectively a generalized version of processor intervention. As discussed in Section 4.2, dynamic routing can be used to move all the data. Sometimes the compiler may know that the dynamic routing is constrained to one or two dimensions; in this case, it can construct code that does not check the unused dimensions and thus can run more rapidly. Sometimes (e.g., for very short messages, or with a wide range of possible run-time values), the higher startup latency of online schedule computation can make dynamic routing be, in fact, the faster solution. General-purpose dynamic routing is handled by routing the message in progressive hops; after each hop, the processor code ascertains which stream to forward the message on for the next hop. Hops can be a single node or multiple nodes; the online routing decisions can be made either with the processor (for current implementations), or by an online routing module attached to the router (if cost/benefit considerations make it useful in the future).


next up previous contents
Next: Implementation Types Up: Operator Implementations Previous: Operator Implementations
Back to Chris Metcalf's home page