Memory Subsystem. In addition to the high-performance core, the Itanium processor provides a robust cache and memory subsystem that accommodates a variety of workloads and exploits the memory hints of the IA-64 ISA. The processor provides three levels of on-package cache for scalable performance across these workloads. At the first level, the instruction and data caches are split; each is 16 Kbytes in size and four-way set-associative, with a 32-byte line size. The dual-ported data cache has a load latency of two cycles, is write-through, and is physically addressed and tagged. The L1 caches are effective on moderate-size workloads and act as a first-level filter for capturing the immediate locality of large workloads.
The second cache level is 96 Kbytes in size, is six-way set-associative, and uses a 64-byte line size. The cache can handle two requests per clock via banking. This cache is also the level at which ordering requirements and semaphore operations are implemented. The L2 cache uses a four-state MESI (modified, exclusive, shared, and invalid) protocol for multiprocessor coherence. The cache is unified, allowing it to service both instruction and data side requests from the L1 caches. This approach allows optimal cache use for both instruction-heavy (server) and data-heavy (numeric) workloads. Since floating-point workloads often have large data working sets and are used with compiler optimizations such as data blocking, the L2 cache is the first point of service for floating-point loads. Also, because floating-point performance requires high bandwidth to the register file, the L2 cache can provide four double-precision operands per clock to the floating-point register file, using two parallel floating-point load-pair instructions.
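The floating-point bandwidth figure above can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes 8-byte double-precision operands and uses the 800-MHz core frequency quoted for this processor; the constant and function names are illustrative, not part of any Itanium documentation.

```python
# Peak L2-to-FP-register-file bandwidth implied by the text.
# All names here are hypothetical helpers for illustration.

DOUBLE_BYTES = 8             # size of one double-precision operand
LOAD_PAIRS_PER_CLOCK = 2     # two parallel floating-point load-pair instructions
OPERANDS_PER_LOAD_PAIR = 2   # each load-pair delivers two operands
CORE_HZ = 800_000_000        # core frequency quoted in the text (800 MHz)


def fp_operands_per_clock() -> int:
    """Operands delivered to the FP register file each clock."""
    return LOAD_PAIRS_PER_CLOCK * OPERANDS_PER_LOAD_PAIR


def l2_fp_bytes_per_clock() -> int:
    """Bytes delivered to the FP register file each clock."""
    return fp_operands_per_clock() * DOUBLE_BYTES


def l2_fp_peak_bytes_per_second(core_hz: int = CORE_HZ) -> int:
    """Peak rate, assuming the two load-pairs issue every clock."""
    return l2_fp_bytes_per_clock() * core_hz
```

Under these assumptions the L2 cache supplies 32 bytes per clock to the floating-point register file. Sustained rates depend on the instruction mix and bank conflicts, which this calculation ignores.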
The third level of on-package cache is 4 Mbytes in size, uses a 64-byte line size, and is four-way set-associative. It communicates with the processor at core frequency (800 MHz) using a 128-bit bus. This cache serves the large workloads of server and transaction-processing applications, and minimizes the cache traffic on the frontside system bus. The L3 cache also implements a MESI protocol for multiprocessor coherence.
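The size, associativity, and line-size figures for the three cache levels determine how many sets each cache has and how an address splits into offset and index bits. The following sketch derives those numbers from the parameters stated above. The `CacheGeometry` class and the assumption of one 128-bit transfer per core clock on the L3 bus are illustrative, not taken from the processor documentation.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical helper describing one cache level."""
    name: str
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        # sets = capacity / (associativity * line size)
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def offset_bits(self) -> int:
        # Bits selecting a byte within a line (line size is a power of two).
        return self.line_bytes.bit_length() - 1

    @property
    def index_bits(self) -> int:
        # Bits selecting a set (set count is a power of two for all three levels).
        return self.sets.bit_length() - 1


# Parameters as stated in the text.
L1D = CacheGeometry("L1D", 16 * KB, ways=4, line_bytes=32)
L2 = CacheGeometry("L2", 96 * KB, ways=6, line_bytes=64)
L3 = CacheGeometry("L3", 4 * MB, ways=4, line_bytes=64)


def l3_bus_peak_bytes_per_second(bus_bits: int = 128,
                                 clock_hz: int = 800_000_000) -> int:
    """Peak L3 bus rate, assuming one full-width transfer per clock."""
    return (bus_bits // 8) * clock_hz


if __name__ == "__main__":
    for c in (L1D, L2, L3):
        print(c.name, c.sets, "sets,", c.index_bits, "index bits,",
              c.offset_bits, "offset bits")
```

Note that the six-way associativity is what lets the 96-Kbyte L2 keep a power-of-two set count (256 sets), even though its capacity is not a power of two.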
A two-level hierarchy of TLBs handles virtual address translations for data accesses. The hierarchy consists of a 32-entry first-level and 96-entry second-level TLB, backed by a hardware page walker.
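The lookup order through such a hierarchy can be sketched as follows. This is a simplified software model, not the processor's implementation: the 4-Kbyte page size, the LRU replacement in each TLB, the policy of filling both levels on a miss, and the dictionary standing in for the page tables read by the hardware walker are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_BITS = 12  # assumption: 4-Kbyte pages (the text does not give a page size)


class Tlb:
    """Small fully associative TLB with LRU replacement (assumed policy)."""

    def __init__(self, entries: int):
        self.entries = entries
        self.map = OrderedDict()  # vpn -> pfn, oldest entry first

    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)  # mark as most recently used
            return self.map[vpn]
        return None

    def fill(self, vpn, pfn):
        if vpn in self.map:
            self.map.move_to_end(vpn)
        elif len(self.map) >= self.entries:
            self.map.popitem(last=False)  # evict least recently used entry
        self.map[vpn] = pfn


class DataTranslation:
    """Two-level data TLB backed by a page walker (modeled as a dict)."""

    def __init__(self, page_table: dict):
        self.l1 = Tlb(32)   # 32-entry first level, as in the text
        self.l2 = Tlb(96)   # 96-entry second level, as in the text
        self.page_table = page_table

    def translate(self, vaddr: int):
        """Return (physical address, level that supplied the translation)."""
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        source = "l1"
        pfn = self.l1.lookup(vpn)
        if pfn is None:
            source = "l2"
            pfn = self.l2.lookup(vpn)
            if pfn is None:
                source = "walker"
                pfn = self.page_table[vpn]  # KeyError stands in for a fault
                self.l2.fill(vpn, pfn)
            self.l1.fill(vpn, pfn)
        return (pfn << PAGE_BITS) | offset, source
```

In this model, a page that has been pushed out of the 32-entry first level is still found in the 96-entry second level, so only a miss in both levels invokes the walker.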
Optimal Cache Management. To enable optimal use of the cache hierarchy, the IA-64 instruction set architecture defines a set of memory locality hints for managing the capacity of specific hierarchy levels. These hints indicate the temporal locality of each access at each level of the hierarchy. The processor uses them to determine allocation and replacement strategies for each cache level. Additionally, the IA-64 architecture allows a bias hint, indicating that the software intends to modify the data of a given cache line. The bias hint brings a line into the cache with ownership, so a later store to that line does not need a separate coherence transaction to gain ownership, which reduces MESI protocol latency.
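The benefit of the bias hint can be illustrated with a textbook MESI model that counts coherence transactions for a load followed by a store to the same line. This is a generic sketch, not the Itanium bus protocol: the transaction counts, the choice of the exclusive state for a biased load, and the function names are assumptions for illustration.

```python
from enum import Enum


class State(Enum):
    M = "modified"
    E = "exclusive"
    S = "shared"
    I = "invalid"


def load(state: State, other_sharers: bool, bias: bool):
    """Return (new_state, bus_transactions) for a load in a textbook MESI model."""
    if state is not State.I:
        return state, 0                      # hit: no coherence traffic
    if bias:
        # Assumed behavior: fetch with ownership, invalidating other copies.
        return State.E, 1
    # Plain read: line is shared if another cache also holds it.
    return (State.S if other_sharers else State.E), 1


def store(state: State):
    """Return (new_state, bus_transactions) for a store."""
    if state in (State.M, State.E):
        return State.M, 0                    # already owned: silent upgrade
    # From S an invalidate is needed; from I a read-for-ownership.
    return State.M, 1


def load_then_store(other_sharers: bool, bias: bool):
    """Total coherence transactions for a load followed by a store."""
    state, n_load = load(State.I, other_sharers, bias)
    state, n_store = store(state)
    return state, n_load + n_store
```

In this model, when another cache shares the line, the unbiased sequence costs two transactions (read, then invalidate), while the biased load pays for ownership once and the store completes without further traffic.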
Table 7.2 lists the hint bits and their mapping to cache behavior. If data is hinted to be non-temporal for a particular cache level, that data is simply not allocated to the cache. (On the L2 cache, to simplify the control logic, the processor implements this algorithm approximately. The data can be allocated to the cache, but the least recently used, or LRU, bits are modified to mark the line as the next target for replacement.) Note that the nearest cache level to feed the floating-point unit is the L2 cache. Hence, for floating-point loads, the behavior is modified to reflect this shift (an NT1 hint on a floating-point access is treated like an NT2 hint on an integer access, and so on).
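The difference between strict non-allocation and the approximate L2 behavior described above can be sketched with a single cache set. This is a simplified model under stated assumptions: the set uses true LRU ordering, a hit with a non-temporal hint leaves the ordering unchanged, and the `approximate` flag and class name are hypothetical. The mapping from individual hint bits to levels in Table 7.2 is not modeled.

```python
class LruSet:
    """One cache set with LRU replacement and a non-temporal hint."""

    def __init__(self, ways: int, approximate: bool = True):
        self.ways = ways
        # approximate=True models the L2 behavior from the text: allocate the
        # line but mark it as the next replacement target.
        # approximate=False models strict non-allocation.
        self.approximate = approximate
        self.lines = []  # index 0 = next victim (LRU), last = MRU

    def access(self, tag, non_temporal: bool = False) -> bool:
        """Return True on a hit, False on a miss."""
        if tag in self.lines:
            if not non_temporal:
                self.lines.remove(tag)       # promote to MRU
                self.lines.append(tag)
            return True
        if non_temporal and not self.approximate:
            return False                     # strict policy: do not allocate
        if len(self.lines) == self.ways:
            self.lines.pop(0)                # evict the LRU line
        if non_temporal:
            self.lines.insert(0, tag)        # allocate as the next victim
        else:
            self.lines.append(tag)           # allocate as MRU
        return False
```

With the approximate policy, a non-temporal fill displaces at most one resident line, and the next miss to that set reclaims the non-temporal line first rather than evicting further temporal data.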
Allowing the software to explicitly provide high-level semantics of the data usage pattern enables more efficient use of the on-chip memory structures, ultimately leading to higher performance for any given cache size and access bandwidth.