The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms.
The processor employs EPIC design concepts for a tighter coupling between hardware and software. In this design style the hardware-software interface lets the software exploit all available compilation time information and efficiently deliver this information to the hardware. It addresses several fundamental performance bottlenecks in modern computers, such as memory latency, memory address disambiguation, and control flow dependencies.
EPIC constructs provide powerful architectural semantics and enable the software to make global optimizations across a large scheduling scope, thereby exposing available instruction-level parallelism (ILP) to the hardware. The hardware takes advantage of this enhanced ILP, providing abundant execution resources. Additionally, it focuses on dynamic runtime optimizations to enable the compiled code schedule to flow through at high throughput. This strategy increases the synergy between hardware and software, and leads to higher overall performance. The processor provides a six-wide, 10-stage-deep pipeline, running at 800 MHz on a 0.18-micron process. This combines abundant resources to exploit ILP with high frequency to minimize the latency of each instruction. The resources consist of four integer units, four multimedia units, two load/store units, three branch units, two extended-precision floating-point units, and two additional single-precision floating-point units (FPUs). The hardware employs dynamic prefetch, branch prediction, nonblocking caches, and a register scoreboard to optimize for compilation-time nondeterminism. Three levels of on-package cache minimize overall memory latency. This includes a 4-Mbyte level-3 (L3) cache, accessed at core speed, providing over 12 Gbytes/s of data bandwidth.
The system bus provides glueless multiprocessor support for up to four-processor systems and can be used as an effective building block for very large systems. The advanced FPU delivers over 3 Gflops of numeric capability (6 Gflops for single precision). The balanced core and memory subsystems provide high performance for a wide range of applications ranging from commercial workloads to high-performance technical computing.
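The quoted FPU figures follow from simple peak-rate arithmetic. As a back-of-the-envelope check (the fused multiply-add counting here is an assumption, not stated in the text: each FPU is taken to retire one multiply-add, i.e. two flops, per cycle, with single precision doubling the per-cycle flop count via the SIMD features discussed later):

```python
# Back-of-the-envelope check on the peak-FLOPS figures quoted above.
# Assumption: each extended-precision FPU retires one fused multiply-add
# (two flops) per cycle; single precision doubles the per-cycle flop count.

CLOCK_HZ = 800e6          # 800 MHz core clock
FMA_UNITS_DP = 2          # two extended-precision FPUs
FLOPS_PER_FMA = 2         # a multiply-add counts as two operations

peak_dp = FMA_UNITS_DP * FLOPS_PER_FMA * CLOCK_HZ   # 3.2 Gflops
peak_sp = 2 * peak_dp                                # 6.4 Gflops

assert peak_dp > 3e9 and peak_sp > 6e9   # "over 3 Gflops (6 Gflops for SP)"
```

This is consistent with the "over 3 Gflops" and "6 Gflops for single precision" claims above.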
In contrast to traditional processors, the machine's core is characterized by hardware support for the key ISA constructs that embody the EPIC design style. This includes support for speculation, predication, explicit parallelism, register stacking and rotation, branch hints, and memory hints. We describe the hardware support for these novel constructs, assuming a basic level of familiarity with the IA-64 architecture.
EPIC hardware. The Itanium processor introduces a number of unique microarchitectural features to support the EPIC design style. These features focus on the following areas:
• supplying plentiful fast, parallel, and pipelined execution resources, exposed directly to the software;
• supporting the bookkeeping and control for new EPIC constructs such as predication and speculation; and
• providing dynamic support to handle events that are unpredictable at compilation time, so that the compiled code flows through the pipeline at high throughput.
Fig. 7.5 presents a conceptual view of the EPIC hardware. It illustrates how the various EPIC instruction set features map onto the micropipelines in the hardware.
The core of the machine is the wide execution engine, designed to provide the computational bandwidth needed by ILP-rich EPIC code that abounds in speculative and predicated operations. The execution control is augmented with a bookkeeping structure called the advanced load address table (ALAT) to support data speculation, and with hardware to manage the deferral of exceptions on speculative execution. The hardware control for speculation is quite simple: adding an extra bit to the data path supports deferred exception tokens. The controls for both the register scoreboard and bypass network are enhanced to accommodate predicated execution.
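The ALAT bookkeeping can be pictured with a minimal sketch (purely illustrative; the real table also tracks access sizes and register types, and matches on partial address tags): an advanced load allocates an entry, an intervening store to the same address invalidates it, and the check at the original load site tests whether the entry survived.

```python
# A minimal, illustrative model of ALAT bookkeeping for data speculation.
# Entry allocation, invalidation, and lookup are shown; everything else
# about the real structure is abstracted away.

class ALAT:
    def __init__(self):
        self.entries = {}  # target register -> speculatively loaded address

    def advanced_load(self, reg, addr):
        """ld.a: record the address the hoisted load read from."""
        self.entries[reg] = addr

    def store(self, addr):
        """A later store invalidates any entry whose address matches."""
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def check(self, reg):
        """ld.c/chk.a: True if no conflicting store occurred."""
        return reg in self.entries

alat = ALAT()
alat.advanced_load("r4", 0x1000)  # load hoisted above a possibly aliasing store
alat.store(0x2000)                # store elsewhere: entry survives
assert alat.check("r4")           # speculation succeeded, result usable
alat.store(0x1000)                # aliasing store: entry removed
assert not alat.check("r4")       # check fails -> recovery code reloads
```

The check instruction thus replaces the expensive runtime disambiguation hardware of dynamically scheduled machines with a simple table lookup.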
Operands are fed into this wide execution core from the 128-entry integer and floating-point register files. The register file addressing undergoes register remapping, in support of register stacking and rotation. The register management hardware is enhanced with a control engine called the register stack engine that is responsible for saving and restoring registers that overflow or underflow the register stack. An instruction dispersal network feeds the execution pipeline.
This network uses explicit parallelism and instruction templates to efficiently issue fetched instructions onto the correct instruction ports, both eliminating complex dependency detection logic and streamlining the instruction routing network. A decoupled fetch engine exploits advanced prefetch and branch hints to ensure that the fetched instructions will come from the correct path and that they will arrive early enough to avoid cache miss penalties. Finally, memory locality hints are employed by the cache subsystem to improve the cache allocation and replacement policies, resulting in a better use of the three levels of on-package cache and all associated memory bandwidth.
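The register stacking and rotation behind this remapping reduce to small address arithmetic. The following sketch models it in simplified form (the constants, parameter names, and wrap-around handling are assumptions; real IA-64 remapping also covers floating-point and predicate register rotation and frame allocation):

```python
# Simplified sketch of register remapping for stacking and rotation.
# r0..r31 are static; r32..r127 form the stacked/rotating region.

NUM_STACKED = 96  # r32..r127

def remap(vreg, bof=0, rrb=0, sor=0):
    """Map a virtual register number to a physical one.
    bof: bottom-of-frame offset from register stacking.
    rrb: register rotation base; sor: size of the rotating region."""
    if vreg < 32:
        return vreg                       # static registers, never remapped
    r = vreg - 32
    if r < sor:                           # rotating portion of the frame
        r = (r + rrb) % sor
    return 32 + (r + bof) % NUM_STACKED   # stacked portion wraps physically

# After rotation advances by one, virtual r32 names the physical register
# that virtual r33 named before:
assert remap(32, rrb=1, sor=8) == remap(33, rrb=0, sor=8)
```

When a called procedure's frame overflows the physical stacked registers, the register stack engine spills the oldest frames to memory and restores them on return, invisibly to the executing code.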
EPIC features allow software to more effectively communicate high-level semantic information to the hardware, thereby eliminating redundant or inefficient hardware and leading to a more effective design. Notably absent from this machine are complex hardware structures seen in dynamically scheduled contemporary processors. Reservation stations, reorder buffers, and memory ordering buffers are all replaced by simpler hardware for speculation. Register alias tables used for register renaming are replaced with the simpler and semantically richer register-remapping hardware. Expensive register dependency-detection logic is eliminated via the explicit parallelism directives that are precomputed by the software.
Using EPIC constructs, the compiler optimizes the code schedule across a very large scope. This scope of optimization far exceeds the limited hardware window of a few hundred instructions seen on contemporary dynamically scheduled processors. The result is an EPIC machine in which the close collaboration of hardware and software enables high performance with a greater degree of overall efficiency.
Overview of the EPIC core. The engineering team designed the EPIC core of the Itanium processor to be a parallel, deep, and dynamic pipeline that enables ILP-rich compiled code to flow through at high throughput. At the highest level, three important directions characterize the core pipeline:
• wide EPIC hardware delivering a new level of parallelism (six instructions/clock),
• deep pipelining (10 stages) enabling high frequency of operation, and
• dynamic hardware for runtime optimization and handling of compilation-time indeterminacies.
New level of parallel execution. The processor provides hardware for these execution units: four integer ALUs, four multimedia ALUs, two extended-precision floating-point units, two additional single-precision floating-point units, two load/store units, and three branch units. The machine can fetch, issue, execute, and retire six instructions each clock cycle. Given the powerful semantics of the IA-64 instructions, this expands to many more operations being executed each cycle.
Fig. 7.6 illustrates two examples demonstrating the level of parallel operation supported for various workloads. For enterprise and commercial codes, the MII/MBB template combination in a bundle pair provides six instructions or eight parallel operations per clock (two load/store, two general-purpose ALU operations, two postincrement ALU operations, and two branch instructions). Alternatively, an MIB/MIB pair allows the same mix of operations, but with one branch hint and one branch operation, instead of two branch operations.
For scientific code, the use of the MFI template in each bundle enables 12 parallel operations per clock (loading four double-precision operands to the registers and executing four double-precision floating-point, two integer ALU, and two postincrement ALU operations). For digital content creation codes that use single-precision floating point, the SIMD (single instruction, multiple data) features in the machine effectively enable up to 20 parallel operations per clock (loading eight single-precision operands, executing eight single-precision floating-point, two integer ALU, and two postincrementing ALU operations).
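The operation counts in these examples reduce to simple tallies. This sketch restates the text's own arithmetic (postincrement address updates are counted as separate ALU operations, as the text does):

```python
# Per-clock operation tallies for the three template mixes above.
# Each count mirrors the enumeration in the text.

commercial = 2 + 2 + 2 + 2   # MII/MBB: loads/stores, ALU, postincrement, branch
scientific = 4 + 4 + 2 + 2   # MFI/MFI: DP loads, DP flops, integer ALU, postinc
media      = 8 + 8 + 2 + 2   # MFI/MFI + SIMD: SP loads, SP flops, ALU, postinc

assert (commercial, scientific, media) == (8, 12, 20)
```

In each case the six issued instructions expand into substantially more architectural operations, which is the sense in which the six-wide machine "expands to many more operations" per cycle.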