The Journal of Instruction-Level Parallelism
Data Prefetching Championship
The 1st JILP Data Prefetching Championship (DPC-1)
Sponsored by: Intel, JILP, IEEE TC-uARCH
in conjunction with:
HPCA-15  http://www.comparch.ncsu.edu/hpca/

Description of the Simulation Infrastructure

The provided simulation framework is based on the CMP$im simulator. The framework models a simple out-of-order core with the following basic parameters:

o   128-entry instruction window with no scheduling restrictions (i.e., any instruction that is ready in the window can be scheduled out-of-order).

o   The processor has a 15-stage, 4-wide pipeline. A maximum of two loads and a maximum of one store can be issued every cycle.

o   Perfect branch prediction (i.e., no front-end or fetch hazards).

o   All instructions have a one-cycle latency except for cache misses. An L1 cache miss that hits in the L2 takes 20 cycles; an L2 cache miss (memory request) takes 200 cycles. Hence, the total round-trip time from the processor to memory is 220 cycles. (These latencies, together with the cache parameters below, are summarized in the sketch that follows this list.)

o   The memory model is a two-level cache hierarchy consisting of an L1 data cache and an L2 data cache. The L1 data cache is 32KB, 8-way set-associative, with LRU replacement. The L2 data cache is 16-way set-associative with LRU replacement; its size depends on the configurations specified later in the document.

o   Each cache is coupled with a queue for storing outstanding requests to the next level in the hierarchy. L1 misses / L2 requests are queued in an L1 queue. L2 misses / memory requests are queued in an L2 queue. Both demand and prefetch requests to the L1 and the L2 are placed in these queues in FIFO order.

o   Prefetches are not prioritized over demand requests, except that when both are issued in the same cycle, the demand requests are inserted into the queue first. The bandwidth of these queues can be restricted or infinite, as specified later in the document.

o   Prefetches are issued for whole cache blocks.

o   When a block is brought into a cache, whether by a demand request or a prefetch request, it replaces the LRU block in the corresponding set.

o   Physical addresses are not used by the prefetcher. Only virtual addresses can be requested.
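
For concreteness, the fixed core and memory parameters above can be captured as a handful of constants, as in the sketch below. The names are illustrative only and are not identifiers from the distributed framework; the L2 size and queue bandwidths are omitted here because they vary by configuration (see the final section).

    // Illustrative constants only -- not identifiers from the DPC-1 framework.
    const int WINDOW_SIZE          = 128;  // instruction window entries
    const int PIPELINE_STAGES      = 15;   // 15-stage pipeline
    const int ISSUE_WIDTH          = 4;    // 4-wide issue
    const int MAX_LOADS_PER_CYCLE  = 2;
    const int MAX_STORES_PER_CYCLE = 1;

    const int L1_SIZE_BYTES = 32 * 1024;   // 32KB L1 data cache
    const int L1_ASSOC      = 8;           // 8-way set-associative, LRU
    const int L2_ASSOC      = 16;          // 16-way set-associative, LRU; size varies by configuration

    const int L2_HIT_LATENCY = 20;         // L1 miss / L2 hit
    const int MEM_LATENCY    = 200;        // L2 miss / memory request
    const int ROUND_TRIP     = L2_HIT_LATENCY + MEM_LATENCY;  // 220 cycles processor-to-memory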

We will distribute the framework as a library that includes a header file to which the contestant will add his/her prefetcher code. The prefetcher API will have access to a processor state structure that reflects processor state from the last cycle. The following is a list of data fields that will be updated every cycle and can be polled by the prefetcher API.

An L1 state structure is updated with the following information about events in the last cycle:

o   The most recent cycle in which the L1 data cache received a request. This is the current cycle if a request was made this cycle, or an earlier cycle if there are no requests in the current cycle.

o   Virtual address (program counter) of the instruction making the last L1 data cache request.

o   A unique sequence number of the dynamic instruction making the last L1 data cache request.

o   Instruction type (load, store, other) for the instruction making the last L1 data cache request.

o   Virtual address for the data being requested.

o   L1 data cache response (hit/miss).

o   A single bit of additional storage to be associated with the cache line for the contestant to read and/or modify.
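
The fields above might be represented in code roughly as follows. This is only a hypothetical sketch; the field names and types are assumptions, not the actual structure shipped with the framework.

    // Hypothetical per-request L1 record (names and types are assumptions).
    enum InstrType { INSTR_LOAD, INSTR_STORE, INSTR_OTHER };

    struct L1Record {
        unsigned long long cycle;         // last cycle the L1 data cache received a request
        unsigned long long instr_pc;      // virtual address (PC) of the requesting instruction
        unsigned long long sequence_num;  // unique sequence number of the dynamic instruction
        InstrType          type;          // load, store, or other
        unsigned long long data_addr;     // virtual address of the requested data
        bool               hit;           // L1 data cache response (hit/miss)
        bool               user_bit;      // per-line bit the contestant may read and/or modify
    };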

Up to four such records can be provided in a single cycle, one for each request issued to the L1 data cache. If a requested address misses in the L1 data cache, a request is made to the L2 data cache. Each cycle, an L2 state structure is updated with the following information:

o   A unique sequence number of the dynamic instruction that originally made the request. The sequence numbers of instructions subsequently requesting the L2 data cache block are not provided.

o   Virtual address of the requested L2 data cache block.

o   L2 data cache response (hit/miss).

o   A single bit of additional storage to be associated with the cache line for the contestant to read and/or modify.
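
An analogous hypothetical sketch of the L2 record, again with assumed names and types:

    // Hypothetical per-request L2 record (names and types are assumptions).
    struct L2Record {
        unsigned long long sequence_num;  // sequence number of the instruction that first made the request
        unsigned long long block_addr;    // virtual address of the requested L2 data cache block
        bool               hit;           // L2 data cache response (hit/miss)
        bool               user_bit;      // per-line bit the contestant may read and/or modify
    };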

The prefetching algorithm will consist of prefetchers for the L1 data cache and the L2 data cache. In addition to the information above, the contestant is provided with 32Kbits of storage to implement the state machines that manage the prefetching algorithm. This budget includes all state required for both the L1 and the L2 prefetchers and can be allocated between them as the contestant sees fit. The limit applies only to the storage used to implement the algorithm; the contestant will not be penalized for the complexity of the algorithm's logic.
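
As an example of how the 32Kbit (32,768-bit) budget might be accounted for, consider a simple per-PC stride table for the L1 prefetcher. The sizing below only illustrates the bookkeeping and is not a required or recommended design.

    // Illustrative storage accounting for a 256-entry stride table (an assumed design).
    struct StrideEntry {
        unsigned long long last_addr : 48;  // last virtual address observed for this table entry
        int                stride    : 13;  // signed stride between successive addresses
        unsigned int       conf      : 2;   // confidence counter
        unsigned int       valid     : 1;
    };                                      // 48 + 13 + 2 + 1 = 64 bits per entry

    const int ENTRIES     = 256;
    const int ENTRY_BITS  = 64;
    const int TABLE_BITS  = ENTRIES * ENTRY_BITS;       // 16,384 bits = 16 Kbits
    const int BUDGET_BITS = 32 * 1024;                  // 32 Kbits total
    const int LEFT_FOR_L2 = BUDGET_BITS - TABLE_BITS;   // 16 Kbits remain for the L2 prefetcher

Only the table bits count against the budget; the adders, comparators, and indexing logic that operate on the table do not.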

Configurations Used for Evaluation

Submissions will be evaluated based on the measured performance of their prefetching algorithms across three configurations that differ in L2 data cache size and cache/memory bandwidth as follows:

o  Configuration 1: L2 size is 2 MB. L1 and L2 queues have effectively unlimited bandwidth (in practice, each queue can process up to 1000 requests in parallel).

o  Configuration 2: L2 size is 2 MB. L1 and L2 queues have limited bandwidth (a maximum of one request per cycle from the L1 to the L2, and a maximum of one request every ten cycles from the L2 to memory).

o  Configuration 3: L2 size is 512 KB. L1 and L2 queues have limited bandwidth similar to Configuration 2.
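
For reference, the three configurations can be summarized in code form as in the sketch below; the structure and field names are purely illustrative.

    // Illustrative summary of the three evaluation configurations (names are assumptions).
    struct Config {
        int  l2_size_kb;          // L2 data cache size in KB
        bool limited_bandwidth;   // false: effectively unlimited (up to 1000 requests in parallel)
        int  l1_to_l2_per_cycle;  // max requests per cycle from the L1 to the L2 when limited
        int  cycles_per_mem_req;  // cycles between requests from the L2 to memory when limited
    };

    const Config CONFIGS[3] = {
        { 2048, false, 0, 0 },    // Configuration 1: 2 MB L2, unrestricted queues
        { 2048, true,  1, 10 },   // Configuration 2: 2 MB L2, 1 req/cycle L1->L2, 1 req/10 cycles L2->mem
        {  512, true,  1, 10 },   // Configuration 3: 512 KB L2, limited bandwidth as in Configuration 2
    };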