The first phase is identifying references for which prefetching will be productive. He finds that for the vast majority of programs, a given reference either mostly hits or mostly misses, and that while the low-miss-rate references (goodrefs) account for most of the dynamic references, the remaining high-miss-rate references (badrefs) account for most of the program misses. This partitioning has two advantages: firstly, prefetching can be applied just to the badrefs, avoiding the overhead of generating prefetches for references which will hit in cache anyway; and secondly, the goodrefs themselves can be used to provide parallelism to cover the memory references for the badrefs. Rather than using static techniques to partition the references, Selvidge uses a first-pass compilation and restricted execution environment to identify the two classes of references. This information is fed back to the second-phase compiler which performs the latency tolerance optimizations.
The loop-based prefetching technique used is called RAAD prefetching, for Regularly Accessed Aggregate Data structures. RAADs are characterized by the form of their strongly-connected components in the dependency graph of the program. Three distinct forms of RAADs are recognized: loop-constant stride array accesses (e.g., X[i++]); accesses through loop-constant stride indirection arrays (e.g., Y[X[i++]]); and linked-list dereferencing (e.g., x->a; x = x->next). For array RAADs, prefetch instructions are inserted after the loop reference, targetting a loop the appropriate number of iterations ahead. Indirection RAADs are handled by prefetching the nested index two cycles in advance to compensate:
load a[index[i]] prefetch a[index[i+1]] prefetch index[i+2]
Linked-list RAAD prefetching is done using a sequence of the following type (note that this can only be applied a single iteration ahead, unlike the other techniques):
b = x->a prefetch(newx->a) ... x = x->next newx = x->next prefetch(newx->next)
Prologue code is generated as necessary for first-iteration prefetches; similarly, for indirection arrays and linked-list prefetching, the last loop is peeled to prevent loads to potentially illegal addresses. In conditional constructs, the prefetches are performed at the most infrequently occurring point in a loop that occurs whenever both an address update and a badref occur, to save on overhead.
When loop scheduling is not applicable, the compiler uses miss scheduling. This technique essentially tries to interleave badref references into the rest of the program with a frequency equal to the inverse of the memory bandwidth. The generation of addresses on the one hand, and the use of the referenced addresses on the other, prevent the interleaving from being optimal. Since the procedure operates on traces, however, address generation can be moved back across basic blocks (and duplicated if need be). Note that having identified the goodrefs earlier, it is possible to use them to hide the latencies of the badrefs.
The simulation framework was an extended SPARC instruction set, using code generated by a modified version of gcc. System performance data is reported for detailed simulations. Memory is modelled with two parameters, latency and bandwidth. A prefetch buffer of size one is used, and a 4K fully-associative cache is assumed. The workload used is the SPEC benchmark suite. The compiler chooses RAAD or miss scheduling as appropriate; miss scheduling in loops may be useful for loops with very large bodies, or in loops with many badrefs that execute a small number of times.
Selvidge's results were comparable to to those of Mowry and Gupta.
For a system with 20-cycle latency
and no bandwidth for overlapped requests (much more restrictive than
Mowry and Gupta's memory model), speedups range from 1.1
to
about 1.3
, with numerical programs showing more speedup than symbolic
code. Increasing latency to 160 cycles, with a request still issuable
every 20 cycles, yields speedups of almost 2
in the worst case, and up
to 4
-6
for the more successful benchmarks. Overhead is
seen to be from about 5% to 25%, with the highest overhead
corresponding to the codes that also exhibited the highest speedups.
Latency versus bandwidth limitations are characterized for the
workload. Selvidge shows that both RAAD and miss scheduling are necessary to
provide coverage for all the variety of misses in the SPEC suite, but
that RAAD prefetching does a better job when applicable.
Overall, it is clear that compilers are capable of living up to the promise of Mowry and Gupta's initial paper. Using algorithms like those above, or in Section 4.2, makes it easy to take advantage of prefetching in environments with distant memories, and it can be done with only a minimal amount of hardware.