High ILP Execution Core

The execution core is the heart of the EPIC implementation. It supports data-speculative and control-speculative execution, as well as predicated execution and the traditional functions of hazard detection and branch execution. Furthermore, the processor's execution core provides these capabilities in the context of the wide execution width and powerful instruction semantics that characterize the EPIC design philosophy.

Stall-based scoreboard control strategy. As mentioned earlier, the frequency target of the Itanium processor was governed by several key timing paths, such as the ALU plus bypass and the two-cycle data cache. All of the control paths within the core pipeline had to fit within the given cycle time; detecting and dealing with data hazards was one such key control path. To achieve high performance, a nonblocking cache with a scoreboard-based stall-on-use strategy was adopted. This is particularly valuable in the context of speculation, in which certain load operations may be aggressively boosted to avoid cache miss latencies and the resulting data may potentially not be consumed. For such cases, it is key that 1) the pipeline not be interrupted because of a cache miss, and 2) the pipeline be interrupted only if and when the unavailable data is needed.

Thus, to achieve high performance, the strategy for dealing with detected data hazards is based on stalls: the pipeline stalls only when unavailable data is needed and stalls only as long as the data is unavailable. This strategy allows the entire processor pipeline to remain filled, and the in-flight dependent instructions to be immediately ready to continue as soon as the required data is available. This contrasts with other high-frequency designs, which are based on flushing and require that the pipeline be emptied when a hazard is detected, resulting in reduced performance. On the Itanium processor, innovative techniques reap the performance benefits of a stall-based strategy and yet enable high-frequency operation on this wide machine.

The scoreboard control is also enhanced to support predication. Since most operations within the IA-64 instruction set architecture can be predicated, either the producer or the consumer of a given piece of data may be nullified by having a false predicate. Fig. 7.11 illustrates an example of such a case. Note that if either the producer or consumer operation is nullified via predication, there is no hazard. The processor's scoreboard therefore considers both the producer and consumer predicates, in addition to the normal operand availability, when evaluating whether a hazard exists. This hazard evaluation occurs in the REG (register read) pipeline stage.
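
This predicate-qualified hazard check can be summarized with a small model. The following C sketch is illustrative only; the names and structure (reg_ready, raw_hazard) are hypothetical and not taken from the actual control logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 128

/* Hypothetical scoreboard model: one "ready" bit per register.
   A load that misses the cache clears the bit; the returning fill sets it. */
static bool reg_ready[NUM_REGS];

/* A register hazard exists only if:
   - the consumer actually executes (its predicate is true),
   - the producer actually executes (its predicate is true), and
   - the producer's result is not yet available. */
bool raw_hazard(uint8_t src_reg, bool consumer_pred, bool producer_pred)
{
    if (!consumer_pred || !producer_pred)
        return false;              /* nullified operation: no hazard */
    return !reg_ready[src_reg];    /* stall only if the data is not ready */
}
```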

Given the high frequency of the processor pipeline, there is not sufficient time both to compute the existence of a hazard and to effect a global pipeline stall in a single clock cycle. Hence, we use a unique deferred-stall strategy. This approach allows any dependent consumer instructions to proceed from the REG into the EXE (execute) pipeline stage, where they are then stalled; hence the term deferred stall.



However, the instructions in the EXE stage no longer have read port access to the register file to obtain new operand data. Therefore, to ensure that the instructions in the EXE stage procure the correct data, the latches at the start of the EXE stage (which contain the source data values) continuously snoop all returning data values, intercepting any data that the instruction requires. The logic used to perform this data interception is identical to the register bypass network used to collect operands for instructions in the REG stage. Since instructions observing a deferred stall in the REG stage don't require the use of the bypass network, the EXE-stage instructions can take over the bypass network for the duration of the deferred stall. By reusing the existing register bypass hardware, the deferred-stall strategy is implemented in an area-efficient manner. This allows the processor to combine the benefits of high frequency with stall-based pipeline control, thereby precluding the penalty of pipeline flushes due to replays on register hazards.
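
A rough model of one EXE-stage source latch during a deferred stall is sketched below; the type and function names are hypothetical, and the real hardware reuses the REG-stage bypass comparators rather than dedicated logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one EXE-stage source latch during a deferred stall.
   While the instruction waits in EXE, the latch watches every value coming
   back on the bypass network and captures the one it is waiting for. */
typedef struct {
    uint8_t  reg_id;    /* architectural source register */
    uint64_t value;     /* latched operand value */
    bool     valid;     /* true once the operand has been captured */
} exe_src_latch_t;

/* Called for each returning result (e.g., a cache fill or a late ALU
   result) while the pipeline is in a deferred stall. */
void snoop_result(exe_src_latch_t *src, uint8_t result_reg, uint64_t result)
{
    if (!src->valid && src->reg_id == result_reg) {
        src->value = result;   /* intercept the data, just like a bypass */
        src->valid = true;     /* the stalled instruction may now proceed */
    }
}
```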

Execution resources. The processor provides an abundance of execution resources to exploit ILP. The integer execution core includes two memory and two integer ports, with all four ports capable of executing arithmetic, shift-and-add, logical, compare, and most integer SIMD multimedia operations. The memory ports can also perform load and store operations, including loads and stores with postincrement functionality. The integer ports add the ability to perform the less-common integer instructions, such as test bit, look for zero byte, and variable shift. Additional uncommon instructions are also implemented on only the first integer port.

Implementing predication elegantly. Predication is another key feature of the IA-64 architecture, allowing higher performance by eliminating branches and their associated misprediction penalties. However, predication affects several key aspects of the pipeline design. Predication turns a control dependency (branching on the condition) into a data dependency (execution and forwarding of data dependent upon the value of the predicate). If spurious stalls and pipeline disruptions are introduced during predicated execution, the benefit of branch misprediction elimination is squandered. Care was taken to ensure that predicates are implemented transparently in the pipeline.
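
To make the control-to-data conversion concrete, here is a small, hypothetical C illustration of if-conversion; the comments indicate how the predicated form would map onto predicate-guarded operations, but the code itself is ordinary C, not IA-64 semantics.

```c
#include <stdint.h>

/* Branchy form: the update is control dependent on the comparison. */
int64_t max_branchy(int64_t a, int64_t b)
{
    int64_t r;
    if (a > b)         /* conditional branch; a misprediction costs cycles */
        r = a;
    else
        r = b;
    return r;
}

/* Predicated (if-converted) form: both assignments are issued, each
   guarded by a predicate, and the false-predicated one is nullified.
   The control dependency has become a data dependency on p / !p. */
int64_t max_predicated(int64_t a, int64_t b)
{
    int p = (a > b);           /* compare writes a predicate pair */
    int64_t r = 0;
    if (p)  r = a;             /* (p)  mov r = a   -- nullified when p is 0 */
    if (!p) r = b;             /* (!p) mov r = b   -- nullified when p is 1 */
    return r;
}
```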

The basic strategy for predicated execution is to allow all instructions to read the register file and get issued to the hardware regardless of their predicate value. Predicates are used to configure the data-forwarding network, detect the presence of hazards, control pipeline advances, and conditionally nullify the execution and retirement of issued operations. Predicates also feed the branching hardware. The predicate register file is a highly multi-ported structure. It is accessed in parallel with the general registers in the REG stage. Since predicates themselves are generated in the execution core (from compare instructions, for example) and may be in flight when they're needed, they must be forwarded quickly to the specific hardware that consumes them. Note that predication affects the hazard detection logic by nullifying either data producer or consumer instructions. Consumer nullification is performed after reading the predicate register file (PRF) for the predicate sources of the six instructions in the REG pipeline stage. Producer nullification is performed after reading the predicate register file for the predicate sources for the six instructions in the EXE stage.

Finally, three conditional branches can be executed in the DET pipeline stage; this requires reading three additional predicate sources. Thus, a total of 15 read ports are needed to access the predicate register file. From a write port perspective, 11 predicates can be written every clock: eight from four parallel integer compares, two from a floating-point compare, and one via the stage predicate write feature of loop branches. These read and write ports are in addition to a broadside read and write capability that allows a single instruction to read or write the entire 64-entry predicate register file into or from a single 64-bit integer register. The predicate register file is implemented as a single 64-bit latch, with 15 simple 64:1 multiplexers used as the read ports. Similarly, the 11 write ports are efficiently implemented, each being a 6:64 decoder, with an AND-OR structure used to update the actual predicate register file latch. Broadside reads and writes are easily implemented by reading or writing the contents of the entire 64-bit latch.

In-flight predicates must be forwarded quickly after generation to the point of consumption. Taking advantage of the fact that all predicate-writing instructions have deterministic latency eliminates the costly bypass logic that would otherwise have been needed. Instead, a speculative predicate register file (SPRF) is used and updated as soon as predicate data is computed. The source predicate of any dependent instruction is then read directly from this register file, obviating the need for bypass logic. A separate architectural predicate register file (APRF) is updated only when a predicate-writing instruction retires and is only then allowed to update the architectural state.
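
The latch-and-multiplexer organization can be modeled in a few lines. This is a minimal sketch assuming a plain 64-bit word for the latch; the function names are hypothetical, and the real read/write ports are parallel hardware, not sequential calls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the 64-entry, 1-bit-wide predicate register file
   held in a single 64-bit latch. */
static uint64_t prf;   /* bit i = predicate register PRi */

/* A read port is just a 64:1 select of one bit. */
bool prf_read(unsigned pr)             /* pr in 0..63 */
{
    return (prf >> pr) & 1u;
}

/* A write port decodes the 6-bit register number and merges the selected
   bit into the latch (the AND-OR update structure). */
void prf_write(unsigned pr, bool value)
{
    uint64_t mask = 1ull << pr;
    prf = value ? (prf | mask) : (prf & ~mask);
}

/* Broadside access: move all 64 predicates to/from one integer register. */
uint64_t prf_read_all(void)            { return prf; }
void     prf_write_all(uint64_t bits)  { prf = bits; }
```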

In case of an exception or pipeline flush, the SPRF is copied from the APRF in the shadow of the flush latency, undoing the effect of any misspeculated predicate writes. The combination of the latch-based implementation and the two-file strategy allows an area-efficient and timing-efficient implementation of the highly ported predicate registers.
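
The two-file scheme and its flush recovery can be sketched as follows; the names are hypothetical and the updates are shown as single assignments for clarity.

```c
#include <stdint.h>

/* Hypothetical two-file predicate model: the speculative file (SPRF) is
   updated as soon as a compare executes; the architectural file (APRF)
   is updated only at retirement. */
static uint64_t sprf;   /* read by in-flight consumers, no bypass needed */
static uint64_t aprf;   /* architectural state */

void compare_executes(unsigned pr, int value)      /* EXE stage */
{
    sprf = value ? (sprf | (1ull << pr)) : (sprf & ~(1ull << pr));
}

void compare_retires(unsigned pr, int value)       /* retirement */
{
    aprf = value ? (aprf | (1ull << pr)) : (aprf & ~(1ull << pr));
}

/* On an exception or pipeline flush, discard speculative predicate
   writes by recopying the architectural state. */
void pipeline_flush(void)
{
    sprf = aprf;
}
```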

Fig. 7.12 shows one of the six EXE stage predicates that allow or nullify data forwarding in the data-forwarding network. The other five predicates are handled identically. ANDing the predicate value with the destination-valid signal present in conventional bypass logic networks implements predication control of the bypass network very efficiently. Instructions with false predicates are treated as merely not writing to their destination register. Thus, the impact of predication on the operand-forwarding network is fairly minimal.
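
In model form, predicating the forwarding network amounts to one extra AND term in each bypass-match equation. The sketch below is hypothetical and shows a single forwarding source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bypass-match check for one forwarding source. A producer
   in EXE may forward its result only if it writes a register at all AND
   its predicate is true; otherwise it is treated as not writing. */
bool bypass_match(bool producer_writes_dest,   /* destination-valid signal */
                  bool producer_predicate,     /* EXE-stage predicate      */
                  uint8_t producer_dest,
                  uint8_t consumer_src)
{
    return producer_writes_dest
        && producer_predicate
        && (producer_dest == consumer_src);
}
```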

Optimized speculation support in hardware. With minimal hardware impact, the Itanium processor enables software to hide the latency of load instructions and their dependent uses by boosting them out of their home basic block. This is termed speculation. To perform effective speculation, two key issues must be addressed. First, any exceptions that are detected must be deferrable until an operation's home basic block is encountered; this is termed control speculation. Second, all stores between the boosted load and its home location must be checked for address overlap. If there is an overlap, the latest store should forward the correct data; this is termed data speculation. The Itanium processor provides effective support for both forms of speculation.

In the case of control speculation, normal exception checks are performed for a control-speculative load instruction. In the common case, no exception is encountered, and therefore no special handling is required. On a detected exception, the hardware examines the exception type, software-managed architectural control registers, and page attributes to determine whether the exception should be handled immediately (such as for a TLB miss) or deferred for future handling.

For a deferral, a special deferred-exception token, called the NaT (Not a Thing) bit, is retained for each integer register, and a special floating-point value, called NaTVal and encoded in the NaN space, is set for floating-point registers. This token indicates that a deferred exception was detected. The deferred exception token is then propagated into result registers when any of the source registers indicates such a token. The exception is reported when either a speculation check or a nonspeculative use (such as a store instruction) consumes a register that is flagged with the deferred exception token. In this way, NaT generation leverages traditional exception logic simply, and NaT propagation uses straightforward data path logic.
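
The generation and propagation rules reduce to a line or two of logic per register and per functional unit. A hypothetical C model (names and structure invented for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an integer register with its NaT bit. */
typedef struct {
    uint64_t value;
    bool     nat;      /* deferred-exception token */
} gr_t;

/* NaT generation: a control-speculative load whose exception is deferred
   writes a NaT token instead of usable data. */
gr_t speculative_load_result(bool exception_deferred, uint64_t data)
{
    gr_t r = { data, exception_deferred };
    return r;
}

/* NaT propagation: any computation whose source carries a NaT produces
   a NaT, so the deferred exception follows the dependence chain. */
gr_t add(gr_t a, gr_t b)
{
    gr_t r;
    r.value = a.value + b.value;
    r.nat   = a.nat || b.nat;
    return r;
}

/* A nonspeculative use (a store, or an explicit check) finally reports
   the deferred exception. */
bool must_raise_exception(gr_t src) { return src.nat; }
```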

The existence of NaT bits and NaTVals also affects the register spill-and-fill logic. For explicit software-driven register spills and fills, special move instructions (store.spill and load.fill) are supported that don't take exceptions when encountering NaT'ed data. For floating-point data, the entire data is simply moved to and from memory. For integer data, the extra NaT bit is written into a special register (called UNaT, or user NaT) on spills, and is read back on the load.fill instruction. The UNaT register itself can also be written to memory if more than 64 registers need to be spilled. In the case of implicit spills and fills generated by the register save engine, the engine collects the NaT bits into another special register (called RNaT, or register NaT), which is then spilled (or filled) once for every 64 register save engine stores (or loads).
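
The spill-and-fill handling of integer NaT bits can be sketched as below. This is a simplified, hypothetical model: in particular, the choice of which UNaT bit to use is shown as an explicit parameter, whereas the architecture derives it from the spill address.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t value; bool nat; } gr_t;

static uint64_t unat;   /* user NaT collection register (UNaT) */

/* Hypothetical model of store.spill: the 64-bit value goes to memory and
   the register's NaT bit is collected into the UNaT register. */
void store_spill(uint64_t *mem_slot, unsigned unat_bit, gr_t r)
{
    *mem_slot = r.value;
    if (r.nat) unat |=  (1ull << unat_bit);
    else       unat &= ~(1ull << unat_bit);
}

/* Hypothetical model of load.fill: the value is reloaded and the NaT bit
   is restored from UNaT, so a deferred exception survives a spill/fill. */
gr_t load_fill(const uint64_t *mem_slot, unsigned unat_bit)
{
    gr_t r = { *mem_slot, (unat >> unat_bit) & 1u };
    return r;
}
```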

For data speculation, the software issues an advanced load instruction. When the hardware encounters an advanced load, it places the address, size, and destination register of the load into the ALAT structure. The ALAT then observes all subsequent explicit store instructions, checking for overlaps of the valid advanced load addresses present in the ALAT. In the common case, there's no match, the ALAT state is unchanged, and the advanced load result is used normally. In the case of an overlap, all address-matching advanced loads in the ALAT are invalidated.

After the last undisambiguated store prior to the load's home basic block, an instruction can query the ALAT and find that the advanced load was matched by an intervening store address. In this situation, recovery is needed. When only the load and no dependent instructions were boosted, a load-check (ld.c) instruction is used, and the load instruction is reissued down the pipeline, this time retrieving the updated memory data. As an important performance feature, the ld.c instruction can be issued in parallel with instructions dependent on the load result data. By allowing this optimization, the critical load uses can be issued immediately, allowing the ld.c to effectively be a zero-cycle operation. When the advanced load and its dependent uses were boosted, an advanced check-load (chk.a) instruction traps to a user-specified handler containing special fix-up code that reissues the load instruction and the operations dependent on the load. Thus, support for data speculation was added to the pipeline in a straightforward manner, requiring only the management of a small ALAT in hardware.

As shown in Fig. 7.13, the ALAT is implemented as a 32-entry, two-way set-associative structure. The array is looked up based on the advanced load's destination register ID, and each entry contains an advanced load's physical address, a special octet mask, and a valid bit. The physical address is used to compare against subsequent stores, with the octet mask bits used to track which bytes have actually been advance loaded. These are used in case of partial overlap or in cases where the load and store are of different sizes. In case of a match, the corresponding valid bit is cleared. The later check instruction then simply queries the ALAT to determine whether a valid entry still exists for the checked register.
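
A hypothetical software model of this structure and its three operations (allocate on an advanced load, snoop stores, answer a later check) might look like the sketch below; entry fields, widths, and the replacement policy are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 32-entry, two-way set-associative ALAT model, indexed by
   the advanced load's destination register number. */
#define ALAT_SETS 16
#define ALAT_WAYS 2

typedef struct {
    bool     valid;
    uint8_t  reg_id;      /* destination register of the advanced load */
    uint64_t paddr;       /* physical address, 8-byte aligned here     */
    uint8_t  octet_mask;  /* which of the 8 bytes were actually loaded */
} alat_entry_t;

static alat_entry_t alat[ALAT_SETS][ALAT_WAYS];

/* Advanced load (ld.a): allocate an entry (way-0 replacement for brevity). */
void alat_allocate(uint8_t reg_id, uint64_t paddr, uint8_t octet_mask)
{
    alat_entry_t *e = &alat[reg_id % ALAT_SETS][0];
    e->valid = true;           e->reg_id = reg_id;
    e->paddr = paddr & ~7ull;  e->octet_mask = octet_mask;
}

/* Every subsequent store snoops the ALAT and invalidates any overlapping
   entry, including partial overlaps caught by the octet mask. */
void alat_snoop_store(uint64_t paddr, uint8_t octet_mask)
{
    for (int s = 0; s < ALAT_SETS; s++)
        for (int w = 0; w < ALAT_WAYS; w++) {
            alat_entry_t *e = &alat[s][w];
            if (e->valid && e->paddr == (paddr & ~7ull) &&
                (e->octet_mask & octet_mask))
                e->valid = false;
        }
}

/* Check instruction (ld.c / chk.a): succeeds only if a valid entry still
   exists for the load's destination register; otherwise recovery runs. */
bool alat_check(uint8_t reg_id)
{
    for (int w = 0; w < ALAT_WAYS; w++) {
        alat_entry_t *e = &alat[reg_id % ALAT_SETS][w];
        if (e->valid && e->reg_id == reg_id)
            return true;
    }
    return false;
}
```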

The combination of a simple ALAT for the data speculation, in conjunction with NaT bits and small changes to the exception logic for control speculation, eliminates the two fundamental barriers that software has traditionally encountered when boosting instructions. By adding this modest hardware support for speculation, the processor allows the software to take advantage of the compiler's large scheduling window to hide memory latency, without the need for complex dynamic scheduling hardware.

Parallel zero-latency delay-executed branching. Achieving the highest levels of performance requires a robust control flow mechanism. The processor's branch-handling strategy is based on three key directions. First, branch semantics providing more program details are needed to allow the software to convey complex control flow information to the hardware. Second, aggressive use of speculation and predication will progressively lead to an emptying out of basic blocks, leaving clusters of branches. Finally, since the data flow from compare to dependent branch is often very tight, special care needs to be taken to enable high performance for this important case. The processor optimizes across all three of these fronts.

The processor efficiently implements the powerful branch vocabulary of the IA-64 instruction set architecture. The hardware takes advantage of the new semantics for improved branch handling. For example, the loop count (LC) register indicates the number of iterations in a For-type loop, and the epilogue count (EC) register indicates the number of epilogue stages in a software-pipelined loop.

By using the loop count information, high performance can be achieved by software pipelining all loops. Moreover, the implementation avoids pipeline flushes for the first and last loop iterations, since the actual number of iterations is effectively communicated to the hardware. By examining the epilogue count register information, the processor automatically generates correct stage predicates for the epilogue iterations of the software-pipelined loop. This step leverages the predicate-remapping hardware along with the branch prediction information from the loop count register-based branch predictor.
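
The taken/not-taken decision of a counted software-pipelined loop branch can be summarized with a simplified model. The sketch below follows the commonly described LC/EC behavior of an IA-64 counted-loop branch such as br.ctop, but it omits register rotation and is not a statement of the exact architectural definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of a counted software-pipelined loop branch.
   LC counts remaining source iterations, EC counts epilogue stages.
   Register rotation and the stage-predicate write are only noted in
   comments; this sketch covers just the taken/not-taken decision. */
bool counted_loop_branch(uint64_t *lc, uint64_t *ec, bool *stage_pred)
{
    if (*lc != 0) {            /* prologue/kernel: more source iterations */
        (*lc)--;
        *stage_pred = true;    /* a new pipeline stage is enabled */
        /* rotate registers */
        return true;           /* branch taken: continue the loop */
    }
    if (*ec > 1) {             /* epilogue: drain the software pipeline */
        (*ec)--;
        *stage_pred = false;   /* no new iteration is started */
        /* rotate registers */
        return true;
    }
    if (*ec == 1)
        (*ec)--;
    return false;              /* fall through: the loop is finished */
}
```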

Unlike conventional processors, the Itanium processor can execute up to three parallel branches per clock. This is implemented by examining the three controlling conditions (either predicates or the loop count/epilogue count values) for the three parallel branches, and performing a priority encode to determine the earliest taken branch. All side effects of later instructions are automatically squashed within the branch execution unit itself, preventing any architectural state update from branches in the shadow of a taken branch. Given that the powerful branch prediction in the front end contains tailored support for multiway branch prediction, minimal pipeline disruptions can be expected from this parallel branch execution.

Finally, the processor optimizes for the common case of a very short distance between the branch and the instruction that generates the branch condition. The IA-64 instruction set architecture allows a conditional branch to be issued concurrently with the integer compare that generates its condition code; no stop bit is needed. To accommodate this important performance optimization, the processor pipelines the compare-branch sequence. The compare instruction is performed in the pipeline's EXE stage, with the result known by the end of the EXE clock. To accommodate the delivery of this condition to the branch hardware, the processor executes all branches in the DET stage. (Note that the presence of the DET stage isn't an overhead needed solely for branching; this stage is also used for exception collection and prioritization, and for the second clock of execution for integer-SIMD operations.) Thus, any branch issued in parallel with the compare that generates its condition will be evaluated in the DET stage, using the predicate results created in the previous (EXE) stage. In this manner, the processor can easily handle the case of compare and dependent branches issued in parallel.
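
A minimal model of the three-way branch resolution (the priority encode over the controlling conditions) is sketched below; the names and types are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of resolving up to three branches issued in one
   clock: a priority encode finds the earliest taken branch, and anything
   behind it in program order is squashed. */
typedef struct {
    bool     taken_condition; /* controlling predicate (or LC/EC condition) */
    uint64_t target;          /* branch target address */
} branch_slot_t;

/* Returns the index of the earliest taken branch (0..2), or -1 if none
   is taken; *next_ip receives the resulting fetch address. */
int resolve_branches(const branch_slot_t br[3],
                     uint64_t fall_through, uint64_t *next_ip)
{
    for (int i = 0; i < 3; i++) {
        if (br[i].taken_condition) {   /* earliest taken branch wins */
            *next_ip = br[i].target;
            return i;                  /* later slots are squashed */
        }
    }
    *next_ip = fall_through;
    return -1;
}
```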

As a result of branch execution in the DET stage, in the rare case of a full pipeline flush due to a branch misprediction, the processor will incur a branch misprediction penalty of nine pipeline bubbles. Note that we expect this to occur rarely, given the aggressive multitier branch prediction strategy in the front end. Most branches should be predicted correctly using one of the four progressive resteers in the front end.

The combination of enhanced branch semantics, three-wide parallel branch execution, and zero-cycle compare-to-branch latency allows the processor to achieve high performance on control-flow-dominated codes, in addition to its high performance on more computation-oriented data-flow-dominated workloads.

