After instructions are fetched in the front-end, they move into the middle pipeline, which disperses instructions, implements the architectural renaming of registers, and delivers operands to the wide parallel hardware. The hardware resources in the back-end of the machine are organized around nine issue ports. The instruction and operand delivery hardware maps the six incoming instructions onto the nine issue ports and remaps the virtual register identifiers specified in the source code onto the physical registers used to access the register file. It then provides the source data to the execution core. The dispersal and renaming hardware exploits high-level semantic information provided by the IA-64 software, efficiently enabling greater ILP and reduced instruction path length.
Explicit parallelism directives. The instruction dispersal mechanism disperses instructions presented by the decoupling buffer to the processor's issue ports. The processor has a total of nine issue ports, capable of issuing up to two memory instructions (ports M0 and M1), two integer instructions (ports I0 and I1), two floating-point instructions (ports F0 and F1), and three branch instructions (ports B0, B1, and B2) per clock. The processor's 17 execution units are fed through the M, I, F, and B groups of issue ports.
The decoupling buffer feeds the dispersal logic in a bundle-granular fashion (up to two bundles, or six instructions, per cycle), with a fresh bundle being presented each time one is consumed. Dispersal from the two bundles is instruction-granular: the processor disperses as many instructions as can be issued (up to six) in left-to-right order. The dispersal algorithm is fast and simple, with instructions being dispersed to the first available issue port, subject to two constraints: detection of instruction independence and detection of resource oversubscription.
Independence. The processor must ensure that all instructions issued in parallel are either independent or contain only allowed dependencies (such as a compare instruction feeding a dependent conditional branch). This requirement is easily met by using the stop-bits feature of the IA-64 ISA to explicitly communicate parallel instruction semantics. Instructions between consecutive stop bits are deemed independent, so the instruction independence detection hardware is trivial. This contrasts with traditional RISC processors, which are required to perform O(n²) comparisons (typically dozens) between source and destination register specifiers to determine independence.
Oversubscription. The processor must also guarantee that there are sufficient execution resources to process all the instructions that will be issued in parallel. Detecting oversubscription is simplified by the IA-64 ISA feature of instruction bundle templates. Each instruction bundle not only specifies three instructions but also contains a 4-bit template field, indicating the type of each instruction: memory (M), integer (I), branch (B), and so on. By examining template fields from the two bundles (a total of only 8 bits), the dispersal logic can quickly determine the number of memory, integer, floating-point, and branch instructions incoming every clock.
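The two constraints can be made concrete with a minimal Python sketch of left-to-right dispersal. It is an illustration under stated assumptions, not the processor's actual logic: the `disperse` function, the `(type, stop_after)` slot representation, and the rule that any instruction of a given type may use any port of that type are simplifications introduced here. The port names are those listed earlier in the text.

```python
from typing import List, Tuple

# Issue ports per instruction type, as listed in the text.
PORTS = {
    "M": ["M0", "M1"],
    "I": ["I0", "I1"],
    "F": ["F0", "F1"],
    "B": ["B0", "B1", "B2"],
}

# One slot: (instruction type, True if a stop bit follows this instruction).
# This representation is an assumption made for the sketch.
Slot = Tuple[str, bool]


def disperse(window: List[Slot]) -> List[Tuple[int, str]]:
    """Assign slots to ports left to right; return (slot index, port) pairs.

    Dispersal ends at the first instruction whose port group is already
    full (oversubscription), or just after the first stop bit (the end of
    the run of instructions declared independent by the compiler).
    """
    next_free = {kind: 0 for kind in PORTS}
    issued = []
    for index, (kind, stop_after) in enumerate(window[:6]):
        ports = PORTS[kind]
        if next_free[kind] == len(ports):
            break  # no port of this type left: oversubscribed
        issued.append((index, ports[next_free[kind]]))
        next_free[kind] += 1
        if stop_after:
            break  # later instructions may depend on earlier ones
    return issued


def window_from(templates: str, stops: Tuple[int, ...] = ()) -> List[Slot]:
    """Build a window from type letters, e.g. "MFIMIB", with stop positions."""
    return [(kind, i in stops) for i, kind in enumerate(templates)]


if __name__ == "__main__":
    print(disperse(window_from("MFIMIB")))       # MFI + MIB, no stop bits
    print(disperse(window_from("MIIMII")))       # MII + MII, no stop bits
    print(disperse(window_from("MIIMII", (1,)))) # stop bit after slot 1
```

In this model, an MFI bundle followed by an MIB bundle with no stop bits disperses all six instructions, while two MII bundles stop at the fifth instruction because both integer ports are already taken. Note that no register specifier is ever compared: independence comes entirely from the stop bits, and resource checks need only the per-type counts.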
This is a hardware simplification resulting from the IA-64 instruction set architecture. Unlike conventional instruction set architectures, the instruction encoding itself doesn't need to be examined to determine the type of each operation. This feature removes decoders that would otherwise be required to examine many bits of the encoded instruction to determine the instruction's type and associated issue port. A second key advantage of the template-based dispersal strategy is that certain instruction types can occur only in specific locations within any bundle. As a result, the dispersal interconnection network can be significantly optimized; the routing required from dispersal to issue ports is only about half of that required for a fully connected crossbar.
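The slot restriction can be illustrated by tabulating which instruction types appear in each bundle position. The sketch below assumes the template list from the IA-64 architecture definition, which is not given in this text; L and X, which the text does not discuss, denote the two halves of a long-immediate instruction in the MLX template. The helper name `types_per_slot` is hypothetical.

```python
from typing import List

# Slot types of the bundle templates, in slot order. This list is an
# assumption taken from the IA-64 architecture definition, not from the text.
TEMPLATES = ["MII", "MLX", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]


def types_per_slot(templates: List[str]) -> List[List[str]]:
    """For each of the three slot positions, list the types that can occur there."""
    return [sorted({template[pos] for template in templates}) for pos in range(3)]


if __name__ == "__main__":
    for pos, kinds in enumerate(types_per_slot(TEMPLATES)):
        print(f"slot {pos}: {', '.join(kinds)}")
```

Under this list, the first slot never holds an integer or floating-point instruction and the last slot never holds a memory instruction. Restrictions of this kind are what allow the dispersal network to omit the corresponding routes; the sketch does not attempt to reproduce the "about half" figure, which depends on implementation details not covered here.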
Table 7.1 illustrates the effectiveness of the dispersal strategy by enumerating the instruction bundles that may be issued at full bandwidth. As can be seen, a rich mix of instructions can be issued to the machine at high throughput (six per clock). The combination of stop bits and bundle templates, as specified in the IA-64 instruction set, allows the compiler to indicate the independence and instruction-type information directly and effectively to the dispersal hardware. As a result, the hardware is greatly simplified, thereby allowing an efficient implementation of instruction dispersal to a wide execution core.