Virtual memory is the core of an operating system's multitasking and protection mechanisms. Compared to 32-bit virtual memory, management of 64-bit address spaces requires new mechanisms primarily because of the increase in address space size: 32 bits can map 4 Gbytes, while 64 bits can map 16 billion Gbytes of virtual space.
A linear 32-bit page table requires 1 million page table entries (assuming a 4-Kbyte page size), and can reside in physical memory. A linear 64-bit page table would be 4 billion times larger, too big to be physically mapped in its entirety. Additionally, 64-bit applications are likely to populate the virtual address space more sparsely. Due to larger data structures than those in 32-bit applications, these applications may have a larger footprint in physical memory.
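The sizes quoted above follow directly from the address widths. The short sketch below reproduces the arithmetic; the function name is illustrative only, and the 4-Kbyte page size is the one assumed in the text.

```python
PAGE_SIZE = 4 * 1024  # 4-Kbyte pages, as assumed in the text
GBYTE = 1 << 30

def linear_page_table_entries(address_bits, page_size=PAGE_SIZE):
    """Entries in a flat (linear) page table: one per virtual page."""
    return (1 << address_bits) // page_size

entries_32 = linear_page_table_entries(32)
entries_64 = linear_page_table_entries(64)

print((1 << 32) // GBYTE)        # 4 Gbytes of 32-bit virtual space
print((1 << 64) // GBYTE)        # 17179869184 = 16 * 2**30 Gbytes
print(entries_32)                # 1048576, about 1 million entries
print(entries_64 // entries_32)  # 4294967296, about 4 billion times larger
```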
All of these effects result in more pressure on the processor's address translation structure: the translation look-aside buffer. While growing the size of on-chip TLBs helps, IA-64 provides several architectural mechanisms that allow operating systems to significantly increase the use of available capacity:
• Regions and protection keys enable much higher degrees of TLB entry sharing.
• TLB entries are tagged with address space identifiers (called region IDs) to avoid TLB flushing on context switch.
Regions. As shown in Fig. 7.4, bits 63 to 61 of a virtual address index into eight region registers that contain 24-bit region identifiers (RIDs). The 24-bit RID is concatenated with the virtual page number (VPN) to form a unique lookup into the TLB. The TLB lookup generates two main items: the physical page number and access privileges (keys, access rights, and access bits, among others). The region registers allow the operating system to concurrently map 8 out of 2^24 possible address spaces, each 2^61 bytes in size. The operating system uses the RID to distinguish shared and private address spaces. Typically, operating systems assign specific regions to specific uses. For example, region 0 may be used for user private application data, region 1 for shared libraries and text images, region 2 for mapping of shared files, and region 7 for mapping of the operating system kernel itself. On context switch, instead of invalidating the entire TLB, the operating system only rewrites the user's private region registers with the RID of the switched-to process. The RIDs of shared regions remain in place, so the same TLB entries, such as those for shared code or data, can be shared between different processes.
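The region lookup and the flush-free context switch can be modeled in a few lines. This is a toy sketch, not the hardware algorithm: the class name, the RID values, the 4-Kbyte page size, and the dictionary standing in for the TLB are all assumptions made for illustration, and the VPN is taken simply as the bits between the page offset and the region-select bits.

```python
VRN_SHIFT = 61                      # bits 63..61 select a region register
PAGE_SHIFT = 12                     # assumption: 4-Kbyte pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1
REGION_OFFSET_MASK = (1 << VRN_SHIFT) - 1

class ToyTLB:
    def __init__(self):
        self.region_regs = [0] * 8  # each holds a 24-bit RID
        self.entries = {}           # (rid, vpn) -> physical page number

    def insert(self, rid, vpn, ppn):
        self.entries[(rid, vpn)] = ppn

    def translate(self, vaddr):
        rid = self.region_regs[vaddr >> VRN_SHIFT]
        vpn = (vaddr & REGION_OFFSET_MASK) >> PAGE_SHIFT
        ppn = self.entries.get((rid, vpn))   # lookup tagged by RID + VPN
        if ppn is None:
            return None                      # TLB miss
        return (ppn << PAGE_SHIFT) | (vaddr & OFFSET_MASK)

# Hypothetical RIDs: one shared library space, two private process spaces.
SHARED_RID, PROC_A_RID, PROC_B_RID = 0x10, 0xA1, 0xB2
tlb = ToyTLB()
tlb.region_regs[1] = SHARED_RID
tlb.insert(SHARED_RID, 0x5, 0x900)
tlb.insert(PROC_A_RID, 0x5, 0x111)
tlb.insert(PROC_B_RID, 0x5, 0x222)

private_addr = (0 << VRN_SHIFT) | (0x5 << PAGE_SHIFT) | 0x34
shared_addr = (1 << VRN_SHIFT) | (0x5 << PAGE_SHIFT) | 0x34

tlb.region_regs[0] = PROC_A_RID     # process A is running
a_private, a_shared = tlb.translate(private_addr), tlb.translate(shared_addr)
tlb.region_regs[0] = PROC_B_RID     # context switch: rewrite one register
b_private, b_shared = tlb.translate(private_addr), tlb.translate(shared_addr)
```

After the switch, the same private virtual address resolves to a different physical page, while both processes hit the same entry for the shared region. No entry is invalidated at any point.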
Protection Keys. While RIDs provide efficient sharing of region-size objects, software is often interested in sharing objects at a smaller granularity, such as in object databases or operating system message queues. IA-64 protection key registers (PKRs) provide page-granular control over access while continuing to share TLB entries among multiple processes. As shown in Fig. 7.4, each TLB entry contains a protection key field that is inserted into the TLB when creating that translation. When a memory reference hits in the TLB, the processor looks up the matching entry's key in the PKRs. A key match results in additional access rights being consulted to grant or deny the memory reference. If the lookup fails, hardware generates a key miss fault.
The software key miss handler can now manage the PKR contents as a cache of most recently used protection keys on a per-process basis. This allows processes with different permission levels to access shared data structures and use the same TLB entry. Direct address sharing is very useful for multiprocess computations that communicate through shared data structures; one example is producer-consumer multithreaded applications.
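The key check and the software-managed PKR cache described above might be sketched as follows. Everything here is hypothetical: the class and function names, the four-entry capacity, and the representation of rights as a string of letters are illustration choices, not architectural values.

```python
from collections import OrderedDict

class KeyMissFault(Exception):
    """Raised when a TLB entry's key is not present in the PKRs."""

class ProtectionKeyRegisters:
    def __init__(self, capacity=4):     # capacity is an assumption
        self.capacity = capacity
        self.keys = OrderedDict()       # key -> set of permitted accesses

    def check(self, key, access):
        if key not in self.keys:
            raise KeyMissFault(key)
        self.keys.move_to_end(key)      # mark as most recently used
        return access in self.keys[key]

    def load(self, key, rights):
        if key not in self.keys and len(self.keys) >= self.capacity:
            self.keys.popitem(last=False)   # evict least recently used key
        self.keys[key] = set(rights)
        self.keys.move_to_end(key)

def access_page(pkrs, process_keys, tlb_key, access):
    """Return True if the access is granted, False if it is denied."""
    try:
        return pkrs.check(tlb_key, access)
    except KeyMissFault:
        # Software handler: refill from the process's own key table.
        if tlb_key not in process_keys:
            return False
        pkrs.load(tlb_key, process_keys[tlb_key])
        return pkrs.check(tlb_key, access)

# One shared page tagged with key 7; each process has its own rights.
producer_pkrs, consumer_pkrs = ProtectionKeyRegisters(), ProtectionKeyRegisters()
producer_can_write = access_page(producer_pkrs, {7: "rw"}, 7, "w")
consumer_can_write = access_page(consumer_pkrs, {7: "r"}, 7, "w")
consumer_can_read = access_page(consumer_pkrs, {7: "r"}, 7, "r")
```

Both processes reference the same key, and therefore the same TLB entry, yet the producer may write while the consumer may only read.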
The IA-64 region model provides protection and sharing at a large granularity. Protection keys are orthogonal to regions and allow fine-grain page-level sharing. In both cases, TLB entries and page tables for shared objects can be shared, without requiring unnecessary duplication of page tables and TLB entries in the form of virtual aliasing.
IA-64 Floating-Point Architecture. The IA-64 FP architecture is a unique combination of features targeted at graphical and scientific applications. It supports both high computation throughput and high-precision formats. The inclusion of integer and logical operations allows extra flexibility to manipulate FP numbers and use the FP functional units for complex integer operations.
The primary computation workhorse of the FP architecture is the FMAC instruction, which computes A * B + C with a single rounding. Traditional FP add and subtract operations are variants of this general instruction. Divide and square root are supported using a sequence of FMAC instructions that produce correctly rounded results. Using primitives for divide and square root simplifies the hardware and allows overlapping with other operations. For example, a group of divides can be software pipelined to provide much higher throughput than a dedicated nonpipelined divider.
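To make the single-rounding property and the FMAC-based divide concrete, the sketch below emulates a fused multiply-add with exact rational arithmetic and builds a Newton-Raphson division from it. The function names, the linear reciprocal seed, and the iteration count are assumptions chosen for illustration. The sketch handles only finite operands with a positive divisor and, unlike the sequences the text refers to, makes no claim of correct rounding.

```python
import math
from fractions import Fraction

def fma(a, b, c):
    """a*b + c with a single rounding, emulated with exact rationals."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def reciprocal_seed(b):
    """Crude first guess for 1/b (b > 0), accurate to about 4 bits."""
    m, e = math.frexp(b)                  # b = m * 2**e with 0.5 <= m < 1
    return math.ldexp(48 / 17 - (32 / 17) * m, -e)

def fma_divide(a, b, iterations=4):
    """Approximate a/b for finite a and b > 0 using FMA steps."""
    y = reciprocal_seed(b)
    for _ in range(iterations):
        e = fma(-b, y, 1.0)               # error of the current reciprocal
        y = fma(y, e, y)                  # Newton-Raphson refinement
    q = a * y                             # first quotient estimate
    r = fma(-b, q, a)                     # remainder a - b*q
    return fma(r, y, q)                   # corrected quotient

# Single rounding keeps a term that separate multiply and add lose.
a, b, c = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0
two_roundings = a * b + c                 # product rounds to 1.0 first
one_rounding = fma(a, b, c)               # exact product is 1 - 2**-60

print(two_roundings)                      # 0.0
print(one_rounding == -2.0**-60)          # True
print(fma_divide(1.0, 3.0))
```

Because each step is an ordinary multiply-add with no dependence on a dedicated divider, several independent divisions can have their steps interleaved, which is what makes software pipelining of divides possible.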
The XMA instruction computes A * B + C with the FP registers interpreted as 64-bit integers. This reuses the FP functional units for integer computation. XMA greatly accelerates the wide integer computations common to cryptography and computer security. Logical and field manipulation instructions are also included to simplify math libraries and special-case handling.
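As an illustration of why an integer multiply-add helps wide arithmetic, the sketch below multiplies a multi-word integer by a 64-bit word, one multiply-add per word, with the carry fed in as the addend. The helper names are hypothetical, and the operands are treated as unsigned 64-bit values, which is an assumption of this sketch.

```python
MASK64 = (1 << 64) - 1

def xma_low(a, b, c):
    """Low 64 bits of a*b + c for unsigned 64-bit operands."""
    return (a * b + c) & MASK64

def xma_high(a, b, c):
    """High 64 bits of a*b + c (the sum always fits in 128 bits)."""
    return (a * b + c) >> 64

def mul_limbs_by_word(limbs, word):
    """Multiply a little-endian list of 64-bit limbs by one 64-bit word."""
    out, carry = [], 0
    for limb in limbs:
        out.append(xma_low(limb, word, carry))   # result word
        carry = xma_high(limb, word, carry)      # carry into next word
    out.append(carry)
    return out

def from_limbs(limbs):
    """Reassemble a little-endian limb list into one Python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

x = 0xFFFFFFFFFFFFFFFF0123456789ABCDEF           # 128-bit operand
word = 0xFEDCBA9876543210
product = from_limbs(mul_limbs_by_word([x & MASK64, x >> 64], word))
print(product == x * word)                       # True
```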
The large 128-element FP register file allows very fast access to a large number of FP (or sometimes integer) variables. Each register is 82 bits wide, which extends a double-extended format with two additional exponent bits. These extra exponent bits enable simpler math library routines that avoid special-case testing. A register's contents can be treated as a single (32-bit), double (64-bit), or double-extended (80-bit) formatted floating-point number that complies with the IEEE/ANSI 754 standard. Additionally, a pair of single-precision numbers can be packed into an FP register. Most FP operations can operate on these packed pairs to double the operation rate of single-precision computation. This feature is especially useful for graphics applications in which graphic transforms are nearly doubled in performance over a traditional approach.
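The register width and the packed-pair idea can be illustrated as follows. The field split assumes the usual 80-bit double-extended layout (1 sign bit, 15 exponent bits, 64 significand bits) plus the two extra exponent bits mentioned above. The function names and the choice of which value occupies the high half are assumptions, and each lane is computed in double precision and then rounded to single, so the sketch is not a faithful single-rounding operation.

```python
import struct

SIGN_BITS, EXPONENT_BITS, SIGNIFICAND_BITS = 1, 15 + 2, 64
print(SIGN_BITS + EXPONENT_BITS + SIGNIFICAND_BITS)   # 82

def pack_pair(hi, lo):
    """Pack two single-precision values into one 64-bit integer."""
    (bits,) = struct.unpack(">Q", struct.pack(">ff", hi, lo))
    return bits

def unpack_pair(bits):
    """Recover the two single-precision lanes as a (hi, lo) tuple."""
    return struct.unpack(">ff", struct.pack(">Q", bits))

def packed_multiply_add(a, b, c):
    """Apply a*b + c to both lanes of three packed operands."""
    (ah, al), (bh, bl), (ch, cl) = map(unpack_pair, (a, b, c))
    return pack_pair(ah * bh + ch, al * bl + cl)

a = pack_pair(1.5, 2.0)
b = pack_pair(2.0, 0.5)
c = pack_pair(0.25, 1.0)
print(hex(a))                                   # 0x3fc0000040000000
print(unpack_pair(packed_multiply_add(a, b, c)))  # (3.25, 2.0)
```

One packed operation produces two single-precision results, which is the source of the doubled operation rate described in the text.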
All of the parallel features of IA-64 (predication, speculation, and register rotation) are available to FP instructions. Their capabilities are especially valuable in loops. For example, regular data access patterns, such as recurrences, are very efficient with rotation. The needed value can be retained for as many iterations as necessary without traditional copy operations. Also, if statements in the middle of software-pipelined loops are simply handled with predication.
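The way rotation keeps earlier values of a recurrence available without copies can be shown with a toy model. The class below is a hypothetical stand-in for a rotating register file, not a description of the hardware: rotating changes only a base index, so the value written as register 0 in one iteration is visible as register 1 in the next.

```python
class RotatingRegisters:
    """Toy rotating register file: logical names shift by one per rotate."""

    def __init__(self, size):
        self.phys = [0.0] * size
        self.base = 0

    def __getitem__(self, n):
        return self.phys[(self.base + n) % len(self.phys)]

    def __setitem__(self, n, value):
        self.phys[(self.base + n) % len(self.phys)] = value

    def rotate(self):
        # What was register n becomes visible as register n + 1.
        self.base = (self.base - 1) % len(self.phys)

def recurrence(a, b, x0, x1, steps):
    """Compute x[i] = a*x[i-1] + b*x[i-2] with no register copies."""
    regs = RotatingRegisters(3)
    regs[1], regs[2] = x1, x0         # regs[1] = x[i-1], regs[2] = x[i-2]
    out = []
    for _ in range(steps):
        regs[0] = a * regs[1] + b * regs[2]
        out.append(regs[0])
        regs.rotate()                 # newest value is now regs[1]
    return out

print(recurrence(1.0, 1.0, 0.0, 1.0, 5))   # [1.0, 2.0, 3.0, 5.0, 8.0]
```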
To improve the exposed parallelism in FP programs, the IEEE standard-mandated flags can be maintained in any of four different status fields. The flag values are later committed with an instruction similar to the speculative check. This allows full conformance to the standard without loss of parallelism and performance.
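A rough model of the idea: an operation hoisted above a condition records its flags in an alternate status field, and the flags reach the main field only if the result turns out to be needed. The class, the method names, and the use of field 0 as the main field and field 1 as the speculative one are assumptions made for this sketch.

```python
import math

class StatusFields:
    """Four independent sets of IEEE exception flags (toy model)."""

    def __init__(self):
        self.fields = [set() for _ in range(4)]

    def divide(self, a, b, field):
        """Divide, recording any raised flag in the chosen status field."""
        if math.isnan(a) or math.isnan(b):
            return math.nan
        if b == 0.0:
            if a == 0.0:
                self.fields[field].add("invalid")
                return math.nan
            self.fields[field].add("divide-by-zero")
            return math.copysign(math.inf, a) * math.copysign(1.0, b)
        return a / b

    def commit(self, src, dst=0):
        """Merge flags from a speculative field into the main field."""
        self.fields[dst] |= self.fields[src]
        self.fields[src].clear()

    def discard(self, src):
        """Drop flags raised by speculation that was not needed."""
        self.fields[src].clear()

def guarded_ratio(status, a, b, use_result):
    q = status.divide(a, b, field=1)   # hoisted above the condition
    if use_result:
        status.commit(src=1)           # flags become visible only now
        return q
    status.discard(1)
    return None

status = StatusFields()
guarded_ratio(status, 1.0, 0.0, use_result=False)
flags_after_unused = set(status.fields[0])
result = guarded_ratio(status, 1.0, 0.0, use_result=True)
flags_after_used = set(status.fields[0])
```

The divide can be scheduled early in both cases, yet the main field reports a divide-by-zero only when the program actually consumes the result.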