The techniques and algorithms presented so far would not represent a complete compiler without a structure for making decisions among the various implementations and phase breakpoints. This thesis presents a notion of per-operator `extents' which allow the compiler to pick when to allocate separate threads of control to sets of operators, as well as a search-based algorithm for finding optimal groups of implementations.
A desirable feature of scheduled routing is allowing for disjoint groups of operators to execute independently. These disjoint groups are not explicitly specified to the compiler, but rather derived from parallel groups of sequentially-grouped operators.
With scheduled routing, it is critical to determine which nodes support the routing for an operator. One of the powerful features of scheduled routing is its ability to find good paths for messages at compile-time, avoiding hotspots in the mesh. This need is balanced against the desire to allow multiple threads of control by using spatial extents to control disjoint groups. Each extent corresponds to a subset of the mesh that the compiler can use to route that operator, if it is sharing the mesh with disjoint groups.
Using extents to support disjoint operator groups also vastly simplifies the communication compilation process, since doing so avoids overloading multiple communication phases on particular nodes located between two disjoint groups. The need to reserve bandwidth for multiple phases (spatially and/or temporally distinct) would quickly lead to diminishing returns; the extent technique avoids exploring that part of the communications implementation space, but provides the benefits of allowing disjoint communications groups in a straightforward manner.
This thesis presents a structure for searching the implementation space jointly for phase breakpoints and operator implementations, using a cost function to predict performance. A branch-and-bound search algorithm using this framework is also presented.
Disjoint groups are handled during the search, converting the leading and trailing phase in each disjoint phase-group to a single phase at either end used to allow correct handling of operators that span the initial disjoint operators, but terminate in their first phase. The disjoint management technique also allows simplification of the compiler structure, since it results in only a single phase being active at any point during the search process.
Closing a phase during the search induces a determination of the appropriate barrier to use for the phase (as discussed above), followed by a computation of the predicted time to execute the phase. The timing can be broken down into three components: the sum of the I/O time predicted for all the operators, the time taken by the phase change, and the time to perform any router programming operations. Summing these three values yields the overall predicted time for the phase.
Phases composed of routable streams require extra computation to get a predicted timing from; a stream-alias implementation has no explicit bandwidth associated with each stream, since bandwidth on each stream is determined jointly by all the streams' demands. An algorithm is presented to make a fast estimate of the bandwidth that will be allocated to each stream, based on a notion of resources (node I/O or bisection bandwidth) that are shared jointly by all streams. A single bandwidth value is associated with all streams of an individual operator; dynamic-routing streams for an operator are specially handled to reflect the fact that they are not all present in the mesh at once (unlike for scheduled routing).
COP relies on a stream router to provide static-stream, single-phase routing services for scheduled routing. It invokes the stream router multiple times, once for each phase (or more than once for phases with multiple partitions that perform independent barriers). A key interaction between the stream router and COP is the cost function used by COP to disallow the use of certain wires at certain times (or to provide an incentive to use another wire if possible). This cost function is used to avoid implementation-specific restrictions for particular scheduled-routing substrates, as well as for supporting the wire-reuse restrictions discussed above.
The NuMesh backend requires some specific, idiosyncratic management by the compiler. In particular, with NuMesh reprogramming is done with a separate scheduled VFSM that reads address/data pairs from the interface and applies them to memories in the router; this requires support from the stream router cost function (to ensure that the reprogramming VFSM can be scheduled). The other major restriction in NuMesh is that the processor interface can only be read or written by each pipeline in each cycle, not both. Thus, avoiding schedules that might cause this problem requires careful handling, again implemented through the stream router cost function. Techniques to handle other minor NuMesh idiosyncrasies are also presented in the thesis.
The dynamic-routing backend requires relatively little additional support.
The most significant work done to handle this platform is to generate
legal interface addresses on each node. Unlike scheduled routing,
where interface addresses are bound to a particular phase,
with dynamic routing interface addresses can't be trivially allocated
to operators. Instead, the dependency analysis introduced to handle
wire/timeslot
reuse in scheduled routing is used, and extended
to perform an n -coloring of the interface addresses.
An (unimplemented) technique for virtual registers is also proposed, to
use when the dependency analysis fails to limit the number of distinct
addresses to the number provided in hardware.