Description of the Simulation Infrastructure
The provided simulation framework is based on the CMP$im simulator. The framework models a simple out-of-order core with the following basic parameters:
o 128-entry instruction window with no scheduling restrictions (i.e., any instruction that is ready in the window can be scheduled out-of-order).
o The processor has an 8-stage, 4-wide pipeline. At most two loads and one store can be issued each cycle.
o Perfect branch prediction (i.e., no front-end or fetch hazards).
o All instructions have a one-cycle latency except for cache misses: an L1 miss that hits in the L2 takes 10 cycles, an L2 miss that hits in the L3 takes 30 cycles, and an L3 miss (memory request) takes 200 cycles. Hence, the total round-trip time from processor to memory is 240 cycles.
o The memory model consists of a 3-level cache hierarchy: split L1 instruction and data caches, a unified L2, and an L3 (Last Level) cache. All caches use 64-byte lines. The L1 instruction cache is 32KB, 4-way set-associative with LRU replacement. The L1 data cache is 32KB, 8-way set-associative with true LRU replacement. The L2 cache is 256KB, 8-way set-associative with true LRU replacement. The Last Level Cache (LLC) is 16-way set-associative with true LRU replacement as the default configuration. This algorithm is implemented using an additional four bits per cache line that record the LRU recency order (0 being the most recently used line and 15 the least recently used line). The LLC is a non-inclusive (also non-exclusive) cache. The LLC size depends on the configurations specified later in this document.
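To make the 4-bit bookkeeping concrete, the default true-LRU policy for one 16-way LLC set can be sketched as below. This is a hypothetical illustration, not the framework's actual code: each way keeps a 4-bit stack position, 0 being MRU and 15 being LRU.

```cpp
#include <cassert>
#include <cstdint>

const int kAssoc = 16; // LLC associativity

// On a hit to (or fill of) way `touched`, every line that was more recently
// used than it ages by one position, and `touched` becomes the MRU line (0).
void UpdateLRU(uint8_t pos[kAssoc], int touched) {
    for (int w = 0; w < kAssoc; ++w)
        if (pos[w] < pos[touched])
            ++pos[w];
    pos[touched] = 0;
}

// The victim on a miss is the line whose position is 15 (least recently used).
int FindVictim(const uint8_t pos[kAssoc]) {
    for (int w = 0; w < kAssoc; ++w)
        if (pos[w] == kAssoc - 1)
            return w;
    return 0; // unreachable while positions form a permutation of 0..15
}
```

Because the positions always form a permutation of 0..15, four bits per line suffice to encode the full recency order of a set.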
We will distribute the framework as a library that includes a header file to which contestants will add their replacement algorithm code. The replacement API has access to a structure that reflects the L2 misses that cause LLC references. The following data fields are updated every cycle and can be polled by the replacement algorithm. The LLC state structure is updated with the following information about LLC events:
o Event type (Instruction Fetch, Load, or Store) for the instruction that caused the LLC reference.
o Virtual address (program counter) of the instruction that caused the LLC reference.
o Virtual address for the data being requested by the LLC reference.
o Thread number for the instruction that caused the LLC reference. For single core simulation, thread number is always zero. For multi-core simulation, thread number ranges from 0 to Ncores-1.
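The event fields above could be gathered in a record along the following lines. This is only a sketch of the information the API exposes; the actual header's type and field names will differ.

```cpp
#include <cstdint>

// Hypothetical LLC event record (names are illustrative, not the real API).
enum AccessType { ACCESS_IFETCH, ACCESS_LOAD, ACCESS_STORE };

struct LLCAccess {
    AccessType type;   // Instruction Fetch, Load, or Store
    uint64_t   pc;     // virtual address (PC) of the triggering instruction
    uint64_t   addr;   // virtual address of the requested data
    uint32_t   thread; // 0 for single-core; 0..Ncores-1 for multi-core
};
```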
In addition to this information, contestants are provided with the following storage budget: (1) an additional four bits per cache line on top of the bits used by LRU (for a total of eight bits per cache line); (2) an additional 1K bits for the whole cache, to be used for common data structures that are not associated with individual cache lines. Note that contestants may repurpose some of the per-line bits as part of the common data structures. For example, a contestant may keep only 5 bits with each cache line and use the remaining storage (3 bits/line x number of lines + 1K bits) to build common structures. This storage limit includes all the state required for the LLC replacement algorithm. The limit applies only to storage; contestants will not be penalized for the complexity of the algorithm logic.
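The repartitioning example above can be checked with a little arithmetic. The helper below is hypothetical, not part of the framework: keeping `usedPerLine` of the 8 budgeted bits with each line frees the remainder, which joins the 1K global bits as common storage.

```cpp
#include <cassert>

const long kBitsPerLine = 8;    // 4 LRU bits + 4 extra bits per line
const long kGlobalBits  = 1024; // the 1K bits for the whole cache

// Bits available for common structures when only `usedPerLine` bits
// are kept with each of `lines` cache lines.
long CommonPoolBits(long lines, long usedPerLine) {
    return (kBitsPerLine - usedPerLine) * lines + kGlobalBits;
}
```

For a 16K-line cache kept at 5 bits/line, this yields 3 x 16K + 1K = 49K bits of pooled storage.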
Configurations Used for Evaluation
Submissions will be evaluated based on the measured performance of their replacement algorithms over different configurations:
o Configuration 1: A single-core, single-thread configuration with a 1 MB LLC. Total storage that can be used by the replacement algorithm is 129 K bits (16K lines x 8 bits per line + 1K bits for the whole cache), including storage currently used for LRU.
o Configuration 2: A four-core configuration with a 4 MB shared LLC. Total storage that can be used by the replacement algorithm is 513 K bits (64K lines x 8 bits per line + 1K bits for the whole cache), including storage currently used for LRU.
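The two budget figures above follow directly from the line counts. As a hedged sanity check (with "K" meaning 1024 throughout):

```cpp
#include <cassert>

const long kPerLineBits = 8;    // 4 LRU bits + 4 extra bits per line
const long kCommonBits  = 1024; // 1K bits for the whole cache

// Total replacement-state budget, in bits, for a cache with `lines` lines.
long BudgetBits(long lines) { return lines * kPerLineBits + kCommonBits; }
```

With 64-byte lines, a 1 MB LLC has 16K lines (129 K bits of budget) and a 4 MB LLC has 64K lines (513 K bits of budget).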