Next: Implementation Types
Up: Operator Implementations
Previous: Operator Implementations
Subsections
The most interesting aspect of implementations for scheduled routing is
coming up with techniques to handle run-time values. This section
outlines some implementation techniques for doing so.
If an application communicates to some remote node, but it is not known
which at compile time, a stream can't simply be scheduled
at compile time. However, there are a variety of work-arounds
that can be invoked, depending on the degree of information
present about the distribution of the possible destinations, the amount
of data to be carried by the stream, and the size of the mesh.
An operator can be implemented in one of a number of different ways
depending on the anticipated bandwidth and latency demands of the
operator as compared with what the different implementations can provide.
For example, if a run-time operator will transfer only a few words, the
run-time system should not waste a great deal of time online generating
a schedule, but should instead try to route the data quickly using
online routing. The remainder of this section presents a spectrum
of high-level alternatives for supporting data-dependent routing.
(Alternatives not implemented in the current compiler are marked with
a dagger, `
'.)
The far end of the spectrum is when the operator arguments are all known
at compile time. In this case an efficient communications schedule can
always be generated. Venturing further from this end of the
spectrum incurs a price either in latency or bandwidth or both to route
the data. Examples of this are many: to name a few that have already
presented elsewhere, both transpose and bitreverse are simply sets of
compile-time streams, as is the `shift' phase of the matrix multiplication
algorithm shown in Figure 2.6 on p.
.
Returning to the common broadcast example, a pure compile-time implementation
of broadcast is simply a mesh-level multicast for a scheduled routing system.
The easiest way to handle data dependency if there is a small number
of possible destinations is to schedule all of the paths and use the
appropriate one at run time. This is particularly true if the
amount of data to be transferred is small, since allocating multiple
paths results in no extra message startup overhead time, unlike most of
the schemes described below. This scheme rapidly becomes impractical
as the the number of destinations grow and the amount of data to be
transferred increases; in the limit, for example, to use this scheme for a
random linear permutation of size n results in streams running as
slow as 1/n words/cycle, and schedules of length
.For relatively small meshes, this provides
a low-overhead, general-purpose solution to run-time operators.
For example, for a two-node mesh with a run-time broadcast, the compiler might
simply schedule one stream going each way, each one a separate
compile-time `broadcast'.
A related but less bandwidth-intensive technique is to implement multiple
destinations as a multicast tree (or, similarly, multiple sources as a
reduction-style tree). Then, when determining where to send data, the
destinations that are uninterested simply rewrite their node to discard
the data instead of delivering it to the router. On a scheduled-routing
mesh, this technique actually consumes less bandwidth than multiple
path allocation, since a spanning tree is used with a constant bandwidth
component across the entire tree, unlike the above solution. For example,
this technique can be used to convert a compile-time broadcast to a stream
with a compile-time source and a run-time destination; nodes other
than the destination simply discard the broadcast within the router
itself.
A dynamic-destination stream operator may have multiple possible
destinations, but all of the destinations lie close together. In this
case, the problem may be best handled by allocating a single `data
stream' to carry the data across the mesh to where the cluster of
destinations is, and only then split the message off to where it needs
to go. A `control stream' is set up so that at load time it carries
a control word to the remote router that splits the data stream;
the control words reprograms the router to point to the desired final
destination. The data can then be sent at full bandwidth to the desired
node, starting only a few cycles behind the control word. Multiple decision
points can also be used, with each controlled by a separate control stream.
A different approach, when the number of run-time possibilities is
relatively small, is to simply select an appropriate schedule based on
the run-time arguments, load that version of the schedule on all the
nodes, and then run at full bandwidth. This approach requires that all
the nodes involved in the communication know which schedule to load.
For example, in a broadcast, if all the nodes know who is broadcasting
they can choose the appropriate piece of broadcast schedule to run:
either a schedule that generates and forwards the data, or a schedule
that expects to read the data from one direction and forward it to
the appropriate onward directions. Since each schedule typically only
requires a few hundred bytes, a fair number of schedules can be loaded
into the processor to be selected at run time. If all the variant schedules
can fit simultaneously in the router, furthermore, the run-time decision
may simply consist of a comparison or two and a single schedule-change
command to the router.
A slightly more complex alternative is to generate a fairly generic
schedule which can be customized at run time. If the generic schedule
can be expressed as routable streams, the operator can be scheduled in
parallel with other operators, since the customization can be applied
to the VFSMs that make up a specific stream after stream routing has
been performed.
For example, a linear broadcast from a runtime source
could be implemented by allocating two paths going through the mesh;
these paths can be scheduled in parallel with any other paths. Then, the node
that is doing the broadcast rewrites each stream to read data from the
processor rather than the neighbor node; now, the broadcast node can
write to the two streams and reach all the other nodes. (Of course,
for subsequent broadcasts from different nodes those streams would need to be
reset to their original values again.) This technique requires
the bandwidth of a compile-time broadcast.
If the operator needs to transfer a substantial amount of data, an entire schedule can
be generated using operator-specific code, tuned to produce an efficient
schedule using the run-time data that is available. In the simplest case,
each node computes the same problem (e.g., the scheduling problem for the
whole mesh for a given permutation) and then puts the relevant part of
the solution into its own router. More sophisticated techniques might
involve solving (for this example) the permutation problem in parallel--using
a built-in schedule for the routers during the parallel schedule
generation--then switching to the resulting schedule to perform the
permutation. Since the startup time for this technique is high, however,
the compiler must ascertain that the amount of data to be transferred is large
enough to amortize the startup schedule computation. This technique also
tends not to be suitable in cases where multiple operators are running
in parallel, since operator-specific code typically assumes it can use
all the available schedule slots.
If the operator has large data-transfer needs, but is merged with other
operators, part of the schedule compiler can be included in the run-time image,
and allow the nodes to generate their own schedules before they begin to
use them. This schedule generation may be constrained by fixed schedule
slots taken by continuing operators already scheduled in an earlier phase,
but is otherwise the same problem that the compiler normally solves for
continuing operators with compile-time arguments. A more ambitious
solution might include incorporating parallel algorithms for computing
stream allocations at run time.
If data-dependent decisions must be made, sometimes the easiest place
to do so is in the processor itself. In this case, streams can be routed
from one decision-point to another, with the processor involved in
deciding which output streams to forward the input data to. As an
example, a run-time broadcast can be performed by using a `flood fill'
technique. Nearest-neighbor streams are scheduled between all the nodes
involved in the broadcast. For each broadcast, since all the nodes
know which node is broadcasting, each node reads an appropriate
stream and then forwards the data to the appropriate output stream(s).
Controlling which dimensions forward the data to other dimensions
yields a simple flood-fill algorithm for broadcast.
The ultimate fallback for all operators is online routing; this is
effectively a generalized version of processor intervention. As discussed
in Section 4.2, dynamic routing can be used to
move all the data. Sometimes the compiler may know that the dynamic routing is
constrained to one or two dimensions; in this case, it can construct code
that does not check the unused dimensions and thus can run more rapidly.
Sometimes (e.g., for very short messages, or with a wide range of possible
run-time values), the higher startup latency of online schedule computation
can make dynamic routing be, in fact, the faster solution.
General-purpose dynamic routing is handled by routing the message in
progressive hops; after each hop, the processor code ascertains which stream to forward
the message on for the next hop. Hops can be a single node or multiple
nodes; the online routing decisions can be made either with the processor
(for current implementations), or by an online routing module attached
to the router (if cost/benefit considerations make it useful in the future).
Next: Implementation Types
Up: Operator Implementations
Previous: Operator Implementations
Back to Chris Metcalf's home page