| The Journal of
  Instruction-Level Parallelism | 
|   | Description
  of the Simulation Infrastructure The
  provided simulation framework is based on the CMP$im
  simulator. The framework models a simple out-of-order core with the following
  basic parameters: o   128-entry instruction
  window with no scheduling restrictions (i.e., any instruction that is ready
  in the window can be scheduled out-of-order). o   Processor
  has a 15-stage, 4-wide pipeline. A maximum of two loads and a maximum of one
  store can be issued every cycle. o   Perfect
  branch prediction (i.e., no front-end or fetch hazards). o   All
  instructions have one-cycle latency except for cache misses.  L1 cache misses / L2 cache hits are 20
  cycles.  L2 cache misses / memory
  requests have a 200-cycle latency.  Hence, the total round-trip time from
  processor to memory is 220 cycles. o   The memory
  model will consist of a 2-level cache hierarchy, consisting of an L1 data
  cache and an L2 data cache. The L1 data cache is 32KB 8-way set-associative
  with LRU replacement. The L2 data cache is 16-way set-associative with LRU
  replacement. The L2 data cache size depends on the configurations specified
  later in the document. o   Each cache
  is coupled with a queue for storing outstanding requests to the next level in
  the hierarchy. L1 misses / L2 requests are queued in an L1 queue. L2 misses /
  memory requests are queued in an L2 queue. Both demand and prefetch requests to the L1 and the L2 are placed in
  these queues in FIFO order. o   Prefetches are not
  prioritized vs. demand requests, except if they are issued in the same cycle
  in which case demand requests are inserted earlier in the queue. The
  bandwidth for these queues can be restricted or infinite, as specified later
  in the document. o   Prefetches are issued
  for whole cache blocks. o   When a
  block comes into the cache, it replaces the LRU block in the cache, whether
  it is a demand request or a prefetch request. o   Physical
  addresses are not used by the prefetcher. Only
  virtual addresses can be requested. We
  will distribute the framework as a library that includes a header file to
  which the contestant will add his/her prefetcher
  code. The prefetcher API will have access to a
  processor state structure which reflects processor state from the last cycle.
  The following is a list of data fields that will be updated every cycle, and
  can be polled by the prefetcher API. An
  L1 state structure is updated with the following information about events in
  the last cycle: o   Last cycle
  at which the L1 data cache received a request. This could be the current
  cycle or a previous cycle if there are no requests in the current cycle. o   Virtual
  address (program counter) of the instruction making the last L1 data cache
  request. o   A unique
  sequence number of the dynamic instruction making the last L1 data cache
  request. o   Instruction
  type (load, store, other) for the instruction making the last L1 data cache
  request. o   Virtual
  address for the data being requested. o   L1 data
  cache response (hit/miss). o   A single
  bit of additional storage to be associated with the cache line for the
  contestant to read and/or modify. Up
  to four records can be given in a single cycle depending on the number of
  requests (up to four) issued to the L1 data cache. If a requested address
  misses in the L1 data cache, a request is made to the L2 data cache. Each
  cycle, an L2 state structure is updated with the following information: o   A unique
  sequence number of the dynamic instruction that originally made the request. The
  sequence numbers of instructions subsequently requesting the L2 data cache
  block are not provided. o   Virtual
  address of the requested L2 data cache block. o   L2 data
  cache response (hit/miss). o   A single
  bit of additional storage to be associated with the cache line for the
  contestant to read and/or modify. In
  addition to this information, the contestant is provided with 32Kbits of
  storage to implement state machines to manage the prefetching
  algorithm. This storage limit includes all the state required for the L1 and
  the L2 prefetchers. This limit only applies to the
  storage available to implement the algorithm. The contestant will not be
  penalized for the complexity of the algorithm logic. The prefetching
  algorithm will consist of prefetchers for the L1
  data cache and the L2 data cache. The 32Kbit storage can be allocated as the
  contestant sees fit. Configurations Used for Evaluation Submissions
  will be evaluated based on the measured performance of their prefetching algorithms across three different
  configurations, that differ in L2 data cache size and cache/memory bandwidth
  as follows: o  Configuration
  1: L2 size is 2 MB. L1 and L2
  queues have almost-infinite bandwidth (realistically, that means that these
  queues can process up to 1000 requests in parallel). o  Configuration
  2: L2 size is 2 MB. L1 and L2
  queues have limited bandwidth (a maximum of one request per cycle from the L1
  to the L2, and a maximum of one request every ten cycles from the L2 to
  memory). o  Configuration
  3: L2 size is 512 KB. L1 and L2
  queues have limited bandwidth similar to Configuration 2. | 
|  |