Implementation of cache hints

Hint          Semantics                 L1 response        L2 response                     L3 response
NTA           Nontemporal (all levels)  Don't allocate     Allocate, mark as next replace  Don't allocate
NT2           Nontemporal (2 levels)    Don't allocate     Allocate, mark as next replace  Normal allocation
NT1           Nontemporal (1 level)     Don't allocate     Normal allocation               Normal allocation
T1 (default)  Temporal                  Normal allocation  Normal allocation               Normal allocation
Bias          Intent to modify          Normal allocation  Allocate into exclusive state   Allocate into exclusive state
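In source code, these hints are attached to loads, stores, and explicit prefetches (the Itanium lfetch instruction takes .nt1/.nt2/.nta completers). As a rough, portable illustration of the same idea, the sketch below uses GCC's __builtin_prefetch, whose locality argument (0-3) expresses similar temporal/nontemporal intent; the 64-element prefetch distance is an arbitrary assumption:

    #include <stddef.h>

    /* Sum a large array that will not be revisited soon. A locality
     * argument of 0 asks for nontemporal treatment, playing the role
     * of the NTA hint in the table above; 3 would request the default
     * temporal (T1) behavior. This is a portable analogue, not the
     * Itanium lfetch instruction itself. */
    double sum_streaming(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 64 < n)    /* prefetch distance: illustrative */
                __builtin_prefetch(&a[i + 64], 0, 0);  /* rw=0 (read), locality=0 */
            s += a[i];
        }
        return s;
    }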

 

System Bus. The processor uses a multidrop, shared system bus to provide four-way glueless multiprocessor system support. No additional bridges are needed to build up to a four-way system. Systems with eight or more processors are designed through clusters of these nodes using high-speed interconnects. Note that multidrop buses are a cost-effective way to build high-performance four-way systems for commercial transaction processing and e-business workloads. These workloads often have highly shared writeable data and demand high throughput and low latency on transfers of modified data between caches of multiple processors.

In a four-processor system, the transaction-based bus protocol allows up to 56 pending bus transactions (including 32 read transactions) on the bus at any given time. An advanced MESI coherence protocol helps in reducing bus invalidation transactions and in providing faster access to writeable data. The cache-to-cache transfer latency is further improved by an enhanced "defer mechanism," which permits efficient out-of-order data transfers and out-of-order transaction completion on the bus. A deferred transaction on the bus can be completed without reusing the address bus. This reduces data return latency for deferred transactions and efficiently uses the address bus. This feature is critical for scalability beyond four-processor systems. The 64-bit system bus uses a source-synchronous data transfer to achieve 266 Mtransfers/s, which enables a bandwidth of 2.1 Gbytes/s. The combination of these features makes the Itanium processor system a scalable building block for large multiprocessor systems.
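The quoted bandwidth follows directly from the bus parameters: a 64-bit bus moves 8 bytes per transfer, and 8 bytes x 266 x 10^6 transfers/s is approximately 2.1 Gbytes/s.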

The Itanium processor is the first IA-64 processor and is designed to meet the demanding needs of a broad range of enterprise and scientific workloads. Through its use of EPIC technology, the processor fundamentally shifts the balance of responsibilities between software and hardware. The software performs global scheduling across the entire compilation scope, exposing ILP to the hardware. The hardware provides abundant execution resources, manages the bookkeeping for EPIC constructs, and focuses on dynamic fetch and control flow optimizations to keep the compiled code flowing through the pipeline at high throughput. The tighter coupling and increased synergy between hardware and software enable higher performance with a simpler and more efficient design. Additionally, the Itanium processor delivers significant value propositions beyond just performance. These include support for 64 bits of addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and multiprocessor platforms.



With minimal hardware impact, the Itanium processor enables software to hide the latency of load instructions and their dependent uses by boosting them out of their home basic block.
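In C terms, the transformation looks roughly like the sketch below (function and variable names are illustrative). On Itanium the compiler performs it with a speculative load (ld.s) hoisted above the branch and a later check (chk.s) that recovers if the load would have faulted; the pointer test in the C version is only a stand-in for that machinery:

    /* Before: the load sits in its home basic block, so its latency
     * lands directly on the dependent use. */
    int f_original(int *p, int ok)
    {
        if (ok)
            return *p + 1;
        return 0;
    }

    /* After boosting: the load is issued before the branch, so its
     * latency overlaps the control flow. On Itanium this is an ld.s
     * whose deferred fault is checked by chk.s in the guarded block;
     * the null test here merely approximates that in portable C. */
    int f_boosted(int *p, int ok)
    {
        int t = (p != 0) ? *p : 0;
        if (ok)
            return t + 1;
        return 0;
    }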

IA-32 Compatibility. Another key feature of the Itanium processor is its full support of the IA-32 instruction set in hardware (Fig. 7.14). This includes support for running a mix of IA-32 applications and IA-64 applications on an IA-64 operating system, as well as IA-32 applications on an IA-32 operating system, in both uniprocessor and multiprocessor configurations. The IA-32 engine makes use of the EPIC machine's registers, caches, and execution resources. To deliver high performance on legacy binaries, the IA-32 engine dynamically schedules instructions. The IA-64 Seamless Architecture is defined to enable running IA-32 system functions in native IA-64 mode, thus delivering native performance levels on the system functionality.

Floating-Point Feature Set. The FPU in the processor is quite advanced. The native 82-bit hardware provides efficient support for multiple numeric programming models, including single-, double-, extended-, and mixed-mode-precision computations. The wide-range 17-bit exponent enables efficient support for extended-precision library functions as well as fast emulation of quad-precision computations. The large 128-entry register file provides adequate register resources. The FPU execution hardware is based on the floating-point multiply-add (FMAC) primitive, which is an effective building block for scientific computation. The machine provides execution hardware for four double-precision or eight single-precision flops per clock. This abundant computation bandwidth is balanced with adequate operand bandwidth from the registers and memory subsystem. With judicious use of data prefetch instructions, as well as cache locality and allocation management hints, software can effectively arrange the computation for sustained high utilization of the parallel hardware.
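The FMAC primitive computes a x b + c with a single rounding; C99 exposes the same primitive as fma() in <math.h>. A minimal sketch of how a dot product, the workhorse of much scientific code, decomposes entirely into FMAC operations:

    #include <math.h>
    #include <stddef.h>

    /* Each iteration is exactly one fused multiply-add, i.e. one FMAC.
     * With two pipelined FMAC units, two such operations (four
     * double-precision flops) can issue per clock, which is the peak
     * rate cited above -- provided the operand feed keeps up. */
    double dot(const double *x, const double *y, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s = fma(x[i], y[i], s);
        return s;
    }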

FMAC Units. The FPU supports two fully pipelined, 82-bit FMAC units that can execute single-, double-, or extended-precision floating-point operations. This delivers a peak of four double-precision flops/clock, or 3.2 Gflops at 800 MHz. The FMAC units execute FMA, FMS, FNMA, FCVTFX, and FCVTXF operations. When bypassed to one another, the latency of the FMAC arithmetic operations is five clock cycles. The processor also provides support for executing two SIMD floating-point instructions in parallel. Since each instruction issues two single-precision FMAC operations (or four single-precision flops), the peak execution bandwidth is eight single-precision flops/clock, or 6.4 Gflops at 800 MHz. Two supplemental single-precision FMAC units support this computation. (Since the read of an 82-bit register actually yields two single-precision SIMD operands, the second operand in each case is peeled off and sent to the supplemental SIMD units for execution.) The high computational rate on single precision is especially suitable for digital content creation workloads.
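Both peak figures are simple products: each FMAC operation counts as two flops (a multiply and an add), so two FMAC units x 2 flops x 800 MHz = 3.2 Gflops in double precision, and two SIMD instructions x 4 single-precision flops x 800 MHz = 6.4 Gflops in single precision.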

The divide operation is done in software and can take advantage of the twin fully pipelined FMAC hardware. Software-pipelined divide operations can yield high throughput on division and square-root operations common in 3D geometry codes.
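A minimal sketch of the FMA-based Newton-Raphson refinement such a software divide typically uses. On Itanium the initial approximation comes from the frcpa instruction; here it is simply a parameter, and the iteration count and final scaling are illustrative rather than a correctly rounded implementation:

    #include <math.h>

    /* Refine a reciprocal approximation y ~= 1/x:
     *   e = 1 - x*y;  y = y + y*e
     * Both steps are single FMAs, so several independent divides can
     * be software-pipelined through the two FMAC units for throughput. */
    double recip_refine(double x, double y0)
    {
        double y = y0;
        for (int i = 0; i < 3; i++) {       /* iteration count: illustrative */
            double e = fma(-x, y, 1.0);     /* e = 1 - x*y   (one FMA) */
            y = fma(y, e, y);               /* y = y + y*e   (one FMA) */
        }
        return y;
    }

    /* a / b as a * (1/b); production code adds an FMA-based correction
     * step to recover the correctly rounded quotient. */
    double soft_div(double a, double b, double y0)
    {
        return a * recip_refine(b, y0);
    }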

The machine also provides one hardware pipe for execution of FCMPs and other operations (such as FMERGE, FPACK, FSWAP, FLogicals, reciprocal, and reciprocal square root). Latency of the FCMP operations is two clock cycles; latency of the other floating-point operations is five clock cycles.

Operand Bandwidth. Care has been taken to ensure that the high computational bandwidth is matched with operand feed bandwidth (Fig. 7.15). The 128-entry floating-point register file has eight read and four write ports. Every cycle, the eight read ports can feed two extended-precision FMACs (each with three operands) as well as two floating-point stores to memory. The four write ports can accommodate two extended-precision results from the two FMAC units and the results from two load instructions each clock. To increase the effective write bandwidth into the FPU from memory, the floating-point registers are divided into odd and even banks. This enables the two physical write ports dedicated to load returns to write four values per clock to the register file (two to each bank), using two ldf-pair instructions. The ldf-pair instructions must obey the restriction that the pair of consecutive memory operands being loaded sends one operand to an even register and the other to an odd register for proper use of the banks.

The earliest cache level to feed the FPU is the unified L2 cache (96 Kbytes). Two ldf-pair instructions can load four double-precision values from the L2 cache into the registers. The latency of loads from this cache to the FPU is nine clock cycles. For data beyond the L2 cache, the bandwidth to the L3 cache is two double-precision operands/clock (one 64-byte line every four clock cycles).
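The L3 figure is again simple arithmetic: one 64-byte line every four clocks is 16 bytes/clock, i.e. two 8-byte double-precision operands per clock.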

Obviously, to achieve the peak rating of four double-precision floating-point operations per clock cycle, one needs to feed the FMACs with six operands per clock. The L2 memory can feed a peak of four operands per clock. The remaining two need to come from the register file. Hence, with the right amount of data reuse, and with appropriate cache management strategies aimed at ensuring that the L2 cache is well primed to feed the FPU, many workloads can deliver sustained performance at near the peak floating-point operation rating. For data without locality, use of the NT2 and NTA hints enables the data to effectively stream into the FPU through the next level of memory.

FPU and Integer Core Coupling. The floating-point pipeline is coupled to the integer pipeline. Register file read occurs in the REG stage, with seven stages of execution extending beyond the REG stage, followed by floating-point write back. Safe instruction recognition (SIR) hardware enables delivery of precise exceptions on numeric computation. In the FP1 (or EXE) stage, an early examination of operands is performed to determine the possibility of numeric exceptions on the instructions being issued. If the instructions are unsafe (have the potential to raise exceptions), a special form of hardware micro-replay is incurred. This mechanism enables instructions in the floating-point and integer pipelines to flow freely in all situations in which no exceptions are possible.

The FPU is coupled to the integer data path via transfer paths between the integer and floating-point register files. These transfers (setf, getf) are issued on the memory ports and made to look like memory operations (since they need register ports on both the integer and floating-point register files). While setf can be issued on either the M0 or M1 port, getf can only be issued on the M0 port. Transfer latency from the FPU to the integer registers (getf) is two clocks. The latency for the reverse transfer (setf) is nine clocks, since this operation appears like a load from the L2 cache. The FPU is enhanced to support integer multiply inside the FMAC hardware. Under software control, operands are transferred from the integer registers to the FPU using setf. After multiplication is complete, the result is transferred to the integer registers using getf. This sequence takes a total of 18 clocks (nine for setf, seven for fmul to write the registers, and two for getf). The FPU can execute two integer multiply-add (XMA) operations in parallel. This is very useful in cryptographic applications. The presence of twin XMA pipelines at 800 MHz allows over 1,000 decryptions per second on 1,024-bit RSA (Rivest-Shamir-Adleman) using private keys (server-side encryption/decryption).
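The XMA forms (xma.l for the low 64 bits, xma.hu for the high 64 bits of an unsigned 64 x 64 multiply plus a 64-bit addend) supply exactly the primitive that schoolbook bignum multiplication needs. A hedged sketch in C, using the GCC/Clang __int128 extension as a stand-in for the xma.l/xma.hu pair:

    #include <stdint.h>
    #include <stddef.h>

    /* One column step: a*b + c + carry_in, split into a low word and a
     * carry-out. On Itanium this maps onto an xma.l / xma.hu pair, and
     * the twin XMA pipelines let two such steps proceed in parallel. */
    static uint64_t mul_add_step(uint64_t a, uint64_t b, uint64_t c,
                                 uint64_t *carry)
    {
        unsigned __int128 t = (unsigned __int128)a * b + c + *carry;
        *carry = (uint64_t)(t >> 64);   /* high word: xma.hu's result */
        return (uint64_t)t;             /* low word:  xma.l's result  */
    }

    /* r[0..2n-1] = x[0..n-1] * y[0..n-1]; r must be zero-initialized.
     * This schoolbook loop is the inner kernel of RSA-sized modular
     * arithmetic like the 1,024-bit decryptions mentioned above. */
    void bignum_mul(uint64_t *r, const uint64_t *x,
                    const uint64_t *y, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint64_t carry = 0;
            for (size_t j = 0; j < n; j++)
                r[i + j] = mul_add_step(x[j], y[i], r[i + j], &carry);
            r[i + n] = carry;
        }
    }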

FPU Controls. The FPU controls for operating precision and rounding are derived from the floating-point status register (FPSR). This register also contains the numeric execution status of each operation. The FPSR also supports speculation in floating-point computation. Specifically, the register contains four parallel fields, or tracks, for both controls and flags to support three parallel speculative streams in addition to the primary stream.

Special attention has been placed on delivering high performance for speculative streams. The FPU provides high throughput in cases where the FCLRF instruction is used to clear status from speculative tracks before forking off a fresh speculative chain. No stalls are incurred on such changes. In addition, the FCHKF instruction (which checks for exceptions on speculative chains on a given track) is also supported efficiently. Interlocks on this instruction are track-granular, so that no interlock stalls are incurred if floating-point instructions in the pipeline are only targeting the other tracks. However, changes to the control bits in the FPSR (made via the FSETC instruction or a MOV from a general register to the FPSR) have a latency of seven clock cycles.

The FPU feature set is balanced to deliver high performance across a broad range of computational workloads. This is achieved through the combination of abundant execution resources, ample operand bandwidth, and a rich programming environment.

