Parallel Instruction Computing and Instruction Level Parallelism
VLIW or EPIC microarchitectures. The very long instruction word (VLIW) paradigm has been the conceptual basis for Intel's future thrust using the Itanium (IA-64) processors. The promise here is (or was) that much of the hardware complexity could be moved to the software (compiler). The latter would use global program analysis to look for available parallelism and present explicitly parallel instruction computing (EPIC) execution packets to the hardware. The hardware could be relatively simple in that it would avoid much of the dynamic unraveling of parallelism necessary in a modern superscalar processor.
The IA-64 instruction set is based on a set of concepts that we describe as EPIC, for Explicitly Parallel Instruction Computing. Our belief is that EPIC is the next advance beyond RISC that's needed to keep on the performance treadmill defined by Moore's law of doubling performance every 18 months, or an annual growth rate of 1.6 times. Improvements in the underlying silicon technology yield about a 1.2-times annual improvement rate via faster silicon devices. The rest must be made up with improvements in circuit design and in parallel execution, by overlapping the execution of more instructions through deeper pipelining (to enable a higher clock rate) and/or executing more instructions in parallel. This second kind of parallelism is known as instruction-level parallelism, or ILP, and is measured as the number of instructions executed each clock cycle (IPC).
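As a quick sanity check on these rates, the following minimal Python sketch works out how much of the annual 1.6-times growth must come from circuits and ILP once the 1.2-times gain from faster silicon is accounted for (both rates are the figures quoted above):

```python
# Decompose the ~1.6x annual performance growth cited above:
# ~1.2x comes from faster silicon devices alone; the remainder
# must come from circuit design and instruction-level parallelism.
total_growth = 1.6    # annual performance growth (Moore's-law pace)
silicon_growth = 1.2  # annual gain from faster devices

remaining = total_growth / silicon_growth
print(f"required annual gain from circuits + ILP: {remaining:.2f}x")
# -> about 1.33x per year from deeper pipelining (higher clock rate)
#    and/or executing more instructions per clock (higher IPC)
```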
Our belief is that the EPIC techniques will enable us to stay on the curve of increasing levels of ILP [7], [9], [15], [23], [24], [31], [32], [33], [36], [37].
EPIC is based on the premise that the compiler has much better visibility into program execution than the hardware does. Certainly the compiler can look at a much larger window of instructions, and it has far more time for analysis (seconds versus nanoseconds). What it lacks is knowledge of individual dynamic events, such as whether a particular branch is taken or a particular memory access hits in the cache, although profile data, when available, can supply statistical aggregates for these events.
There are three main tenets of an EPIC architecture. It provides
• mechanisms to enable the compiler to arrange the computation efficiently based on its global knowledge,
• sufficient resources such as registers and functional units to perform multiple operations in parallel, and to store the "inventory" of intermediate results, and
• instruction formats that let the compiler communicate to the hardware the key information it has gleaned from the program as it's compiled.
Microprocessors continue on the relentless path to provide more performance. Every new innovation in computing (distributed computing on the Internet, data mining, Java programming, multimedia data streams) requires more cycles and computing power. Even traditional applications such as databases and numerically intensive codes present increasing problem sizes that drive demand for higher performance. Design innovations, compiler technology, manufacturing process improvements, and integrated circuit advances have been driving exponential performance increases in microprocessors. To continue this growth in the future, Hewlett-Packard and Intel architects examined barriers in contemporary designs and found that instruction-level parallelism (ILP) can be exploited for further performance increases.
Background and Objectives. IA-64 is the first architecture to bring ILP features to general-purpose microprocessors. Parallel semantics, predication, data speculation, large register files, register rotation, control speculation, hardware exception deferral, the register stack engine, wide floating-point exponents, and other features contribute to IA-64's primary objective: to expose, enhance, and exploit ILP in today's applications to increase processor performance. ILP pioneers developed many of these concepts to find parallelism beyond traditional architectures. Subsequent industry and academic research significantly extended the earlier concepts, leading to published work that quantified the benefits of these ILP-enhancing features and demonstrated substantial performance improvements. Starting in 1994, the joint HP-Intel IA-64 architecture team leveraged this prior work and incorporated feedback from compiler and processor design teams to engineer a powerful initial set of features. They also carefully designed the instruction set to be expandable to address new technologies and future workloads.
Architectural Basics. A historical problem facing the designers of computer architectures is the difficulty of building in sufficient flexibility to adapt to changing implementation strategies. For example, limits on the number of available instruction bits, the register file size, the number of address-space bits, or even the amount of parallelism a future implementation might employ have constrained how well architectures evolve over time. The Intel-HP architecture team designed IA-64 to permit future expansion by providing sufficient architectural capacity:
• a full 64-bit address space,
• large directly accessible register files,
• enough instruction bits to communicate information from the compiler to the hardware, and
• the ability to express arbitrarily large amounts of ILP.
Fig. 7.1 summarizes the register state; Fig. 7.2 shows the bundle and instruction formats.
Register Resources. IA-64 provides 128 65-bit general registers; 64 of these bits hold data or a memory address, and the remaining bit holds a deferred exception token, the not-a-thing (NaT) bit. The "Control speculation" section provides more details on the NaT bit. In addition to the general registers, IA-64 contains the resources listed below (a small illustrative model follows the list):
• 128 82-bit floating-point registers,
• space for up to 128 64-bit special-purpose application registers (used to support features such as the register stack and software pipelining),
• eight 64-bit branch registers for function call linkage and return, and
• 64 one-bit predicate registers that hold the result of conditional expression evaluation.
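To make the register-state summary concrete, here is a small illustrative Python model. The class and field names are our own invention, not architectural terms, and register widths beyond the general registers' 64 value bits plus the NaT bit are only approximated:

```python
from dataclasses import dataclass, field

@dataclass
class GeneralRegister:
    value: int = 0     # 64 bits of data or a memory address
    nat: bool = False  # 65th bit: the not-a-thing (NaT) deferred-exception token

@dataclass
class RegisterState:
    gr: list = field(default_factory=lambda: [GeneralRegister() for _ in range(128)])
    fr: list = field(default_factory=lambda: [0.0] * 128)  # 82-bit FP, modeled as floats
    ar: dict = field(default_factory=dict)                 # up to 128 application registers
    br: list = field(default_factory=lambda: [0] * 8)      # branch registers (call/return)
    pr: list = field(default_factory=lambda: [False] * 64) # one-bit predicate registers

state = RegisterState()
state.pr[0] = True  # in the real architecture, predicate register 0 always reads as true
```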
Instruction Encoding. Since IA-64 has 128 general and 128 floating-point registers, instruction encodings use 7 bits to specify each of three register operands. Most instructions also have a predicate register argument that requires another 6 bits. In a normal 32-bit instruction encoding, this would leave only 5 bits to specify the opcode. To provide for sufficient opcode space and to enable flexibility in the encodings, IA-64 uses a 128-bit encoding (called a bundle) that has room for three instructions.
Each of the three instructions has 41 bits with the remaining 5 bits used for the template. The template bits help decode and route instructions and indicate the location of stops that mark the end of groups of instructions that can execute in parallel.
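A minimal Python sketch of this packing, assuming the template occupies the bundle's five low-order bits with the three 41-bit slots above it (note the arithmetic: 5 + 3 × 41 = 128, so no encoding bits are wasted):

```python
SLOT_BITS, TEMPLATE_BITS = 41, 5  # 3 * 41 + 5 = 128 bits per bundle

def pack_bundle(template: int, slots: list) -> int:
    """Pack a 5-bit template and three 41-bit instruction slots
    into a 128-bit bundle, returned as a Python int."""
    assert 0 <= template < (1 << TEMPLATE_BITS) and len(slots) == 3
    bundle = template
    for i, slot in enumerate(slots):
        assert 0 <= slot < (1 << SLOT_BITS)
        bundle |= slot << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle

def unpack_bundle(bundle: int):
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & ((1 << SLOT_BITS) - 1)
             for i in range(3)]
    return template, slots

t, s = unpack_bundle(pack_bundle(0b00001, [1, 2, 3]))
assert (t, s) == (1, [1, 2, 3])
```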
Distributing Responsibility. To achieve high performance, most modern microprocessors must determine instruction dependencies, analyze and extract available parallelism, choose where and when to execute instructions, manage all cache and prediction resources, and generally direct all other ongoing activities at runtime. Although intended to reduce the burden on compilers, out-of-order processors still require substantial amounts of microarchitecture-specific compiler support to achieve their fastest speeds. IA-64 strives to make the best trade-offs in dividing responsibility between what the processor must do at runtime and what the compiler can do at compilation time.
Instruction Level Parallelism. Compilers for all current mainstream microprocessors produce code with the understanding that, regardless of how the processor actually executes those instructions, the results will appear as if the instructions had executed one at a time and in the exact order they were written. We refer to such architectures as having sequential in-order execution semantics, or simply sequential semantics.
Conforming to sequential semantics was easy to achieve when microprocessors executed instructions one at a time and in their program-specified order. However, to achieve acceptable performance improvements, designers have had to build multiple-issue, out-of-order execution processors. The IA-64 instruction set addresses this split between the architecture and its implementations by providing parallel execution semantics, so that processors need not examine register dependencies to extract parallelism from a serial program specification, nor reorder instructions to achieve the shortest code sequence.
IA-64 realizes parallel execution semantics in the form of instruction groups. The compiler creates instruction groups so that all instructions in an instruction group can be safely executed in parallel. While such a grouping may seem like a complex task, current compilers already have all of the information necessary to do this. IA-64 just makes it possible for the compiler to express that parallelism.
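As an illustration of how straightforward this grouping can be, the following Python sketch greedily partitions a sequential instruction stream at register dependencies. The instruction representation (destination register, source registers) is our own simplification, not IA-64's, and real compilers apply more refined rules:

```python
def form_groups(instructions):
    """Greedily split a sequential instruction list into groups with no
    read-after-write or write-after-write register dependencies, so
    every group is safe to execute in parallel."""
    groups, current, written = [], [], set()
    for dst, srcs in instructions:  # (destination, source registers)
        if dst in written or any(s in written for s in srcs):
            groups.append(current)  # dependency found: end the group (a "stop")
            current, written = [], set()
        current.append((dst, srcs))
        written.add(dst)
    if current:
        groups.append(current)
    return groups

# r3 and r5 are computed from independent inputs and share a group;
# the final instruction reads r3, so it must start a new group.
print(form_groups([("r3", ("r1", "r2")),
                   ("r5", ("r1", "r2")),
                   ("r4", ("r3", "r1"))]))
# -> [[('r3', ...), ('r5', ...)], [('r4', ...)]]
```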
While instruction groups allow independent computational instructions to be placed together, expressing parallelism in computation related to program control flow requires additional support.
Control parallelism is also present when a program needs to select one of several possible branch targets, each of which might be controlled by a different conditional expression. Such cases would normally need a sequence of individual conditions and branches. IA-64 provides multiway branches that allow several normal branches to be grouped together and executed in a single instruction group.
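A multiway branch can be viewed as evaluating several (predicate, target) pairs at once and taking the first whose predicate holds. A minimal Python rendering of that semantics follows; the function and argument names are ours, for illustration only:

```python
def multiway_branch(pairs, fallthrough):
    """Select the target of the first branch whose predicate is true.
    All predicates were computed earlier, so this selection models a
    single parallel step. `pairs` is a list of (predicate, target)."""
    for predicate, target in pairs:  # priority follows program order
        if predicate:
            return target
    return fallthrough  # no branch taken: continue sequentially

print(multiway_branch([(False, "L1"), (True, "L2"), (True, "L3")], "next"))  # -> L2
```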
Influencing Dynamic Events. While the compiler can handle some activities, hardware better manages many other areas, including branch prediction, instruction caching, data caching, and prefetching. For these cases, IA-64 improves on standard instruction sets by providing an extensive set of hints that the compiler uses to tell the hardware about likely branch behavior (taken or not taken, amount to prefetch at branch target) and memory operations (in what level of the memory hierarchy to cache data). The hardware can then manage these resources more effectively, using a combination of compiler-provided information and histories of runtime behavior.
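As a sketch of how profile data might feed such hints, the function below picks an IA-64-style branch-hint completer from taken/not-taken counts. The completers .sptk, .spnt, .dptk, and .dpnt (static/dynamic, taken/not taken) do exist in IA-64 assembly, but the thresholds and selection policy here are our own illustration:

```python
def choose_branch_hint(taken: int, not_taken: int) -> str:
    """Map profile counts to an IA-64-style branch hint completer.
    Thresholds are illustrative, not taken from any real compiler."""
    total = taken + not_taken
    if total == 0:
        return ".dptk"  # no profile data: defer to the dynamic predictor
    ratio = taken / total
    if ratio > 0.95:
        return ".sptk"  # heavily biased taken: hint it statically
    if ratio < 0.05:
        return ".spnt"  # heavily biased not taken
    return ".dptk" if ratio >= 0.5 else ".dpnt"  # let hardware history decide

print(choose_branch_hint(990, 10))  # -> .sptk
```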
IA-64 not only provides new ways of expressing parallelism in compiled code; it also gives compilers an array of tools for creating additional parallelism.