Instruction bundles capable of full-bandwidth dispersal

First bundle* | Second bundle
MIH | MLI, MFI, MIB, MBB, or MFB
MFI or MLI | MLI, MFI, MIB, MBB, BBB, or MFB
MII | MBB, BBB, or MFB
MMI | BBB
MFH | MII, MLI, MFI, MIB, MBB, MFB
*B slots support branches and branch hints.
*H designates a branch hint operation in the B slot.
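As a reading aid, the table can be expressed as a small lookup. This is only a sketch: it assumes the first-bundle and second-bundle entries pair up row by row in the order listed, and the names `FULL_BANDWIDTH_PAIRS` and `disperses_at_full_bandwidth` are hypothetical, not part of any Itanium tool or specification.

```python
# Pairings transcribed from the table, assuming the first-bundle and
# second-bundle entries correspond row by row in the order listed.
# H marks a branch hint occupying a B slot.
FULL_BANDWIDTH_PAIRS = {
    "MIH": {"MLI", "MFI", "MIB", "MBB", "MFB"},
    "MFI": {"MLI", "MFI", "MIB", "MBB", "BBB", "MFB"},
    "MLI": {"MLI", "MFI", "MIB", "MBB", "BBB", "MFB"},
    "MII": {"MBB", "BBB", "MFB"},
    "MMI": {"BBB"},
    "MFH": {"MII", "MLI", "MFI", "MIB", "MBB", "MFB"},
}


def disperses_at_full_bandwidth(first: str, second: str) -> bool:
    """Return True if the (first, second) template pair is listed in the table."""
    return second in FULL_BANDWIDTH_PAIRS.get(first, set())
```

For example, `disperses_at_full_bandwidth("MII", "MFB")` is true under this transcription, while `disperses_at_full_bandwidth("MMI", "MII")` is not, since the table lists only BBB as a second bundle after MMI.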
Efficient register remapping. After dispersal, the next step in preparing incoming instructions for execution involves implementing the register stacking and rotation functions.
Register stacking is an IA-64 technique that significantly reduces function call and return overhead. It ensures that all procedural input and output parameters are in specific register locations, without requiring the compiler to perform register-register or memory-register moves. On procedure calls, a fresh register frame is simply stacked on top of existing frames in the large register file, without the need for an explicit save of the caller's registers. This enables low-overhead procedure calls, providing significant performance benefit on codes that are heavy in calls and returns, such as those in object-oriented languages.
Register rotation is an IA-64 technique that enables software-pipelined loops with very low overhead. It broadens the applicability of compiler-driven software pipelining to a wide variety of integer codes. Rotation provides a form of register renaming that gives every iteration of a software-pipelined loop a fresh copy of the loop variables. This is accomplished by accessing the registers through an indirection based on the iteration count.
Both stacking and rotation require the hardware to remap the register names. This remapping translates the incoming virtual register specifiers into outgoing physical register specifiers, which are then used to perform the actual lookup in the various register files. Stacking can be thought of as simply adding an offset to the virtual register specifier. In a similar fashion, rotation can be viewed as an offset-modulo add. The remapping function supports both stacking and rotation for the integer register specifiers, but only rotation for the floating-point and predicate register specifiers.
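The offset and offset-modulo view of remapping can be sketched in a few lines. This is an illustrative model, not the processor's logic: it assumes a 128-entry integer file in which the first 32 registers are never remapped and the remaining 96 are stacked, and the parameter names `frame_base`, `rot_size`, and `rot_base` are hypothetical stand-ins for the stacking offset, the size of the rotating region, and the rotation offset.

```python
NUM_STATIC = 32    # assumption: r0-r31 are static and bypass remapping
NUM_STACKED = 96   # assumption: r32-r127 form the stacked region


def remap_integer(virtual: int, frame_base: int, rot_size: int, rot_base: int) -> int:
    """Map a virtual integer register specifier to a physical one.

    Rotation is applied first (offset-modulo add within the rotating region),
    then stacking (offset add, wrapping within the stacked region).
    """
    if not 0 <= virtual < NUM_STATIC + NUM_STACKED:
        raise ValueError("register specifier out of range")
    if virtual < NUM_STATIC:
        return virtual  # static registers are not remapped

    index = virtual - NUM_STATIC
    if index < rot_size:
        # Rotation: offset-modulo add, confined to the rotating region.
        index = (index + rot_base) % rot_size
    # Stacking: add the frame offset, wrapping around the stacked region.
    return NUM_STATIC + (index + frame_base) % NUM_STACKED
```

With `frame_base=10` and no rotation, virtual register 32 maps to physical register 42. With an 8-register rotating region and `rot_base=1`, virtual register 39 wraps around to physical register 32, while register 40, outside the rotating region, is unaffected. The two additions per integer specifier mirror the text's point that integer remapping needs more adders than floating-point or predicate remapping, which only rotate.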
The Itanium processor efficiently supports the register remapping for both register stacking and rotation with a set of adders and multiplexers contained in the pipeline's REN stage. The stacking logic requires only one 7-bit adder for each specifier, and the rotation logic requires either one (predicate or floating-point) or two (integer) additional 7-bit adders. The extra adder on the integer side is needed due to the interaction of stacking with rotation. Therefore, for full six-syllable execution, a total of ninety-eight 7-bit adders and 42 multiplexers implement the combination of integer, floating-point, and predicate remapping for all incoming source and destination registers. The total area taken by this function is less than 0.25 square mm.
The register-stacking model also requires special handling when software allocates more virtual registers than are currently physically available in the register file. A special state machine, the register stack engine (RSE), handles this case, termed stack overflow. This engine observes all stacked register allocation or deallocation requests. When an overflow is detected on a procedure call, the engine silently takes control of the pipeline, spilling registers to a backing store in memory until sufficient physical registers are available. In a similar manner, the engine handles the converse situation, termed stack underflow, when registers need to be restored from a backing store in memory. While these registers are being spilled or filled, the engine simply stalls instructions waiting on the registers; no pipeline flushes are needed to implement the register spill/restore operations.
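The overflow and underflow behavior described above can be illustrated with a toy model that only counts registers. It is a sketch under simplifying assumptions, not the RSE's actual algorithm: frames do not overlap, spilling always starts with the oldest frame, a return restores the caller's frame in full, and the default capacity of 96 stacked registers is an assumption. The class and method names are hypothetical.

```python
class RegisterStackSketch:
    """Toy model that tracks how many registers of each frame are resident."""

    def __init__(self, physical_registers: int = 96):
        self.capacity = physical_registers
        self.frames = []   # [frame_size, resident_count] per frame, oldest first
        self.spilled = 0   # total registers written to the backing store
        self.filled = 0    # total registers restored from the backing store

    def _resident_total(self) -> int:
        return sum(resident for _, resident in self.frames)

    def call(self, frame_size: int) -> None:
        """Stack a fresh frame, spilling older frames on overflow."""
        if frame_size > self.capacity:
            raise ValueError("frame larger than the physical register file")
        for frame in self.frames:  # oldest frames are spilled first
            excess = self._resident_total() + frame_size - self.capacity
            if excess <= 0:
                break
            moved = min(excess, frame[1])
            frame[1] -= moved
            self.spilled += moved
        self.frames.append([frame_size, frame_size])

    def ret(self) -> None:
        """Pop the current frame, refilling the caller's frame on underflow."""
        self.frames.pop()
        if self.frames:
            caller = self.frames[-1]
            self.filled += caller[0] - caller[1]
            caller[1] = caller[0]
```

For instance, with 96 physical registers, calls that allocate frames of 50, 40, and then 60 registers force 54 registers out to the backing store on the third call; the two subsequent returns restore 4 and then 50 of them. In this model, as in the text, the program never issues explicit save or restore instructions.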
Register stacking and rotation combine to provide significant performance benefits for a variety of applications, at the modest cost of a number of small adders, an additional pipeline stage, and control logic for a programmer-invisible register stack engine.
Large, multiported register files. The processor provides an abundance of registers and execution resources. The 128-entry integer register file supports eight read ports and six write ports. Note that four ALU operations require eight read ports and four write ports from the register file, while pending load data returns need two additional write ports (two returns per cycle). The read and write ports can adequately support two memory and two integer instructions every clock. The IA-64 instruction set includes a feature known as postincrement, in which the address register of a memory operation can be incremented as a side effect of the operation. This is supported by simply using two of the four ALU write ports. (These two ALUs and write ports would otherwise have been idle when memory operations are issued off their ports.)
The floating-point register file also consists of 128 registers, supports double extended-precision arithmetic, and can sustain two memory ports in parallel with two multiply-accumulate units. This combination of resources requires eight read and four write ports. The register write ports are separated in even and odd banks, allowing each memory return to update a pair of floating-point registers.
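The port counts quoted for the two register files can be checked with simple arithmetic. The integer tally follows the text directly (two sources and one result per ALU operation, plus one write per load return). The floating-point breakdown is an assumption made for illustration: three source reads per multiply-accumulate and one read per memory port for stores, with one write per unit and per memory return. The function names are hypothetical.

```python
def integer_ports(alu_ops: int = 4, load_returns: int = 2) -> tuple[int, int]:
    """Return (read_ports, write_ports) for the integer register file."""
    reads = alu_ops * 2              # two source operands per ALU operation
    writes = alu_ops + load_returns  # one result per ALU op, one per load return
    return reads, writes


def fp_ports(fmac_units: int = 2, memory_ports: int = 2) -> tuple[int, int]:
    """Return (read_ports, write_ports) for the floating-point register file.

    Assumed breakdown: a multiply-accumulate (a * b + c) reads three sources,
    and each memory port reads one register for a store.
    """
    reads = fmac_units * 3 + memory_ports
    writes = fmac_units + memory_ports  # one result per unit, one per memory return
    return reads, writes
```

Under these assumptions, `integer_ports()` gives eight reads and six writes and `fp_ports()` gives eight reads and four writes, matching the figures in the text.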
The other large register file is the predicate register file. This register file has several unique characteristics: each entry is 1 bit, it has many read and write ports (15 reads/11 writes), and it supports a "broadside" read or write of the entire register file. As a result, it has a distinct implementation, as described in the "Implementing predication elegantly" section (p. 286).