Two approaches to reducing latency and increasing bandwidth are possible. One is to reduce cycle times in the network, thus moving data through the network more quickly, and directly improving both latency and bandwidth. The other is to make more intelligent use of the existing bandwidth in the network, such that the useful bandwidth can be increased and message latency better controlled (typically by providing guarantees of maximum latency). Both approaches can be applied at once by using a simple network processing element (with correspondingly fast switching times), running a reprogrammable scheduled router--essentially a simple finite state machine--to control data motion. In this thesis, this technique is referred to as reprogrammable scheduled routing; this class of routers is represented by the NuMesh router [68,58,59,57]. Essentially, routing decisions are made offline (when the application is compiled), and at run time the routers simply execute repetitive schedules that handle all data transfers.
Pre-scheduling communications through the network has several clear advantages. Primarily, it allows for the routing scheduler to choose message paths to maximize specific user criteria (such as throughput and latency). Without some form of pre-scheduling, decisions can only be made based on local criteria; furthermore, routing hardware can only make decisions optimized for what the hardware designer thought was important. More importantly, without pre-scheduling, a router has no ability to look ahead to see what messages will be arriving shortly; it is a strictly ``online'' solution. Routing schedulers can examine the entire application's communication structure before generating schedules for the routers, and therefore represent a more general ``offline'' solution. Additionally, the more complex hardware needed for traditional data-dependent decisions can not achieve the short cycle time that is possible in principle with a highly-pipelined finite state machine. However, there are several limitations to the picture presented thus far.
Firstly, it is usually true that an application's communications
are not completely deducible at compile time. For example, an application
may have
a broadcast that is performed along all the columns of a matrix, but
the exact row may be unknown.
At run time, all the processors may know who is
broadcasting--or perhaps only the broadcasting processor will know.
Using the communications language presented in this thesis,
COP
, an
application
can specify everything that is known about communication patterns
at compile time: compile-time constant components, restrictions on
what dimensions are used or what nodes are involved, whether all the
nodes have full information on the needed pattern or not, and so forth.
Furthermore, each communication operator (e.g., broadcast or parallel
prefix) can be tagged with information that the compiler can use to
optimize the implementation: message size, number of messages expected
to be sent, and maximum desired bandwidth, for example.
Given the information presented by the language, the compiler has a wide range of implementation options for data-dependent communication operations. If only a few router schedules are needed to handle the possible operand values for the operator, the compiler may include them all, along with run-time selection code. Or, if including multiple schedules is infeasible, it may make sense to simply allocate all the possible paths statically, and use only the relevant ones at run time. For more dynamic, but bandwidth-intensive communications, the compiler may choose to include code to compute the desired schedules at run time. Or, in the most dynamic situations, the compiler may fall back on generating code to perform online routing, using nearest-neighbor connections to forward messages through the mesh.
Secondly, it may be the case that an application is composed exclusively of communications whose source and destination can be uniquely deduced at compile time, but that there are simply too many such communications streams. A typical application will often move through a range of communication and computation phases; each phase will have its own static communications graph, and merging all the schedules into one global application schedule may result in greatly decreased bandwidth available on each stream. This thesis examines how to specify, implement, and optimize the use of phases at the router level, by coordinated reprogramming of the routers during the execution of the program.
Finally, it is important to support the notion of multiple threads of control in the program. An application may have different components that are running unrelated code at times during its execution. It is important to support this in the communication language, since otherwise the ability of the application to change phases to match changing communication requirements would be greatly constrained. For example, an application may want to divide the mesh into subcomponents, each of which runs independent loops several phases in length. Without a notion of `disjoint components' it would be necessary to either synchronize all the phase changes, or combine all the communications into a single phase; either way would seriously limit the amount of bandwidth that could be scheduled for each communication.