Questions for Self-Testing

1. What does Moore's law state?

2. What correction did Gordon E. Moore make to the original statement of his law ten years later?

3. What principles were used to build the VLIW architecture?

4. Characterize the IA-64.

5. What kind of parallelism does the EPIC technology use to increase throughput?

6. What instruction set is the IA-64 oriented toward?

7. What is the essence of the very long instruction word method?

8. Explain the three basic principles on which the EPIC architecture is built.

9. What is the length of a VLIW instruction?

10. How many elementary instructions may a VLIW instruction contain?

11. How is a token formed according to the VLIW technology?

12. What information does the VLIW format contain?

13. What determines the maximum number of elementary instructions in a VLIW instruction?

14. What functions does the compiler perform in the VLIW architecture?

Predication

Branching is a major cause of lost performance in many applications. To help reduce the negative effects of branches, processors use branch prediction so they can continue to execute instructions while the branch direction and target are being resolved. To achieve this, instructions after a branch are executed speculatively until the branch is resolved. Once the branch is resolved, the processor determines either that its prediction was correct and the speculative instructions can be committed, or that those instructions must be thrown away and the correct set of instructions fetched and executed. When the prediction is wrong, the processor will have executed instructions along both paths, but sequentially (first the predicted path, then the correct path); the cost of an incorrect prediction is therefore quite high.

In general, predication is performed in IA-64 by evaluating conditional expressions with compare (cmp) operations and saving the resulting true (1) or false (0) values in a special set of 1-bit predicate registers. Nearly all instructions can be predicated. This simple, clean concept provides a very powerful way to increase the ability of an IA-64 processor to exploit parallelism, reduce the performance penalties of branches (by removing them), and support advanced code motion that would be difficult or impossible in instruction sets without predication.
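As a minimal sketch of how this looks in IA-64 assembly (register and predicate numbers are chosen arbitrarily for illustration), an if-then-else such as "if (a == b) c++; else c--;" can be if-converted so that both arms are issued and the predicates decide which result takes effect:

         cmp.eq  p1, p2 = r4, r5    // p1 = (r4 == r5), p2 = complement of p1
    (p1) adds    r6 = 1, r6         // then-arm: takes effect only if p1 is 1
    (p2) adds    r6 = -1, r6        // else-arm: takes effect only if p2 is 1

The branch is gone entirely, so there is nothing left to mispredict.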

Scheduling and Speculation. Compilers attempt to increase parallelism by scheduling instructions based on predictions about likely control paths. Paths are made of sequences of instructions that are grouped into basic blocks. Basic blocks are groups of instructions with a single entry point and a single exit point. The exit point can be a multiway branch. If a particular sequence of basic blocks is likely to be in the flow of control, the compiler can consider the instructions in these blocks as a single group for the purpose of scheduling code. Fig. 7.3 illustrates a program fragment with multiple basic blocks and possible control paths. The highlighted blocks indicate those most likely to be executed.



Since these regions of blocks have more instructions than individual basic blocks, there is a greater opportunity to find parallel work. However, to exploit this parallelism, compilers must move instructions past barriers related to control flow and data flow. Instructions that are scheduled before it is known whether their results will be used are called speculative. Of the code written by a programmer, only a small percentage is actually executed at runtime. The task of choosing important instructions, determining their dependencies, and specifying which instructions should be executed together is algorithmically complex and time-consuming. In non-EPIC architectures, the processor does much of this work at runtime. However, a compiler can perform these tasks more efficiently because it has more time, memory, and a larger view of the program than the hardware.

The compiler will optimize the execution time of the commonly executed blocks by choosing the instructions that are most critical to the execution time of the critical region as a whole. Within these regions, the compiler performs instruction selection, prioritization, and reordering. Without the IA-64 features, these kinds of transformations would be difficult or impossible for a compiler to perform. The key features enabling these transformations are control speculation, data speculation, and predication.

Control Speculation. IA-64 can reduce the dynamic effects of branches by removing them; however, not all branches can or should be removed using predication. Those that remain affect both the processor at runtime and the compiler during compilation.

Since loads have a longer latency than most computational instructions and they tend to start time-critical chains of instructions, any constraints placed on the compiler's ability to perform code motion on loads can limit the exploitation of parallelism. One such constraint relates to properly handling exceptions. For example, load instructions may attempt to reference data to which the program hasn't been granted access. When a program makes such an illegal access, it usually must be terminated. Additionally, all exceptions must also be delivered as though the program were executed in the order the programmer wrote it. Since moving a load past a branch changes the sequence of memory accesses relative to the control flow of the program, non-EPIC architectures constrain such code motion.

IA-64 provides a new class of load instructions called speculative loads, which can safely be scheduled before one or more prior branches. In the block where the programmer originally placed the load, the compiler schedules a speculation check (chk.s).
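A minimal sketch of the usage model (register numbers, the branch condition, and the recovery label are illustrative assumptions, not taken from the text):

         ld8.s   r6 = [r8]          // load hoisted above the branch; a fault
                                    // is deferred by marking r6's NaT bit
         cmp.eq  p1, p0 = r3, r0    // branch condition (illustrative)
    (p1) br.cond.dptk L_skip        // the branch the load was originally below
         chk.s   r6, recover        // home block: if the load deferred a
                                    // fault, branch to recovery code that
                                    // re-executes the load non-speculatively
         add     r9 = r6, r7        // normal use of the loaded value
    L_skip:

If the speculative load would have faulted, the fault is only surfaced when the program actually reaches the chk.s, so the original exception behavior is preserved.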

Data Speculation. Popular programming languages such as C provide pointer data types for accessing memory. However, pointers often make it impossible for the compiler to determine what location in memory is being referenced. More specifically, such references can prevent the compiler from knowing whether a store and a subsequent load reference the same memory location, preventing the compiler from reordering the instructions.

IA-64 solves this problem with instructions that allow the compiler to schedule a load before one or more prior stores, even when the compiler is not sure if the references overlap. This is called data speculation; its basic usage model is analogous to control speculation.
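A minimal sketch, analogous to the control-speculation example (register numbers and the recovery label are again illustrative assumptions):

        ld8.a     r6 = [r8]         // advanced load scheduled ahead of the
                                    // store; its address is recorded in the
                                    // ALAT (advanced load address table)
        st8       [r9] = r7         // possibly aliasing store; if [r9]
                                    // overlaps [r8], the ALAT entry is removed
        chk.a.clr r6, recover       // if the entry is gone, the store
                                    // interfered: branch to recovery code
                                    // that reloads r6 and re-executes its uses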

Register Model. Most architectures provide a relatively small set of compiler-visible registers (usually 32). However, the need for higher performance has caused chip designers to create larger sets of physical registers (typically around 100), which the processor then manages dynamically, even though the compiler sees only a subset of those registers.

The IA-64 general-register file provides 128 registers visible to the compiler. This approach is more efficient than a hardware-managed register file because a compiler can tell when the program no longer needs the contents of a specific register. These general registers are partitioned into two subsets: 32 static and 96 stacked, which can be renamed under software control. The 32 static registers (r0 to r31) are managed in much the same way as registers in a standard RISC architecture.

The stacked registers implement the IA-64 register stack. This mechanism automatically provides a compiler with a set of up to 96 fresh registers (r32 to r127) upon procedure entry. While the register stack provides the compiler with the illusion of unlimited register space across procedure calls, the hardware actually saves and restores on-chip physical registers to and from memory.

By explicitly managing registers using the register allocation instruction (alloc), the compiler controls the way the physical register space is used.

The compiler specifies the number of registers that a routine requires by using the alloc instruction. Alloc can also specify how many of these registers are local (which are used for computation within the procedure), and how many are output (which are used to pass parameters when this procedure calls another). The stacked registers in a procedure always start at r32.

On a call, the registers are renamed such that the local registers from the previous stack frame are hidden, and what were the output registers of the calling routine now have register numbers starting at r32 in the called routine. The freshly called procedure would then perform its own alloc, setting up its own local registers (which include the parameter registers it was called with), and its own output registers (for when it, in turn, makes a call). On a return, this renaming is reversed, and the stack frame of the calling procedure is restored.
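A minimal sketch of this mechanism (frame sizes, register numbers, and routine names are illustrative assumptions):

    caller:
        alloc   r32 = ar.pfs, 0, 4, 2, 0   // 0 inputs, 4 locals, 2 outputs:
                                           // frame is r32-r37, outputs r36-r37
        mov     r36 = r10                  // first argument into an output reg
        mov     r37 = r11                  // second argument
        br.call.sptk b0 = callee           // on the call, r36-r37 are renamed
                                           // to r32-r33 inside the callee

    callee:
        alloc   r34 = ar.pfs, 2, 1, 0, 0   // 2 inputs (the parameters, now
                                           // r32-r33), 1 local (r34, holding
                                           // the previous frame marker)
        add     r8 = r32, r33              // compute with the renamed parameters
        mov     ar.pfs = r34               // restore the previous frame marker
        br.ret.sptk b0                     // return: the caller's frame reappears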

The register stack, of course, has only a finite number of physical registers. When procedures request more registers than are currently available, an automatic register stack engine (RSE) spills the registers of preceding procedures to memory in parallel with the execution of the called procedure. Similarly, on return from a call, the RSE can restore registers from memory.

As described here, RSE behavior is synchronous; however, IA-64 allows processors to be built with asynchronous RSEs that can speculatively spill and fill registers in the background while the processor core continues normal execution. This allows spills and fills to be performed on otherwise unused memory ports before the spills and fills are actually needed.

Compared to conventional architectures, IA-64's register stack removes all the save and restore instructions, reduces data traffic during a call or return, and shortens the critical path around calls and returns.

Software Pipelining. Computers are very good at performing iterative tasks, and for this reason many programs include loop constructs that repeatedly perform the same operations. Since these loops generally encompass a large portion of a program's execution time, it's important to expose as much loop-level parallelism as possible.

Although instructions in a loop are executed frequently, they may not offer a sufficient degree of parallel work to take advantage of all of a processor's execution resources. Conceptually, overlapping one loop iteration with the next can often increase the parallelism. This is called software pipelining, since a given loop iteration is started before the previous iteration has finished; it is analogous to the way hardware pipelining works. While this approach sounds simple, without sufficient architectural support a number of issues limit the effectiveness of software pipelining, because handling them requires many additional instructions:

• managing the loop count,

• handling the renaming of registers for the pipeline,

• finishing the work in progress when the loop ends,

• starting the pipeline when the loop is entered, and

• unrolling to expose cross-iteration parallelism.

In some cases this overhead could increase code size by as much as 10 times the original loop code. Because of this, software pipelining is typically only used in special technical computing applications in which loop counts are large and the overheads can be amortized.

With IA-64, most of the overhead associated with software pipelining can be eliminated. Special application registers to maintain the loop count (LC) and the pipeline length for draining the software pipeline (the epilogue count, or EC) help reduce overhead from loop counting and testing for loop termination in the body of the loop.

In conjunction with the loop registers, special loop-type branches perform several activities depending on the type of branch. They

• automatically decrement the loop counters after each iteration,

• test the loop count values to determine if the loop should continue, and

• cause a subset of the general, floating-point, and predicate registers to be automatically renamed after each iteration by decrementing a register rename base (rrb) register.

For each rotation, all the rotating registers appear to move up one higher register position, with the last rotating register wrapping back around to the bottom. Each rotation effectively advances the software pipeline by one stage. The set of general registers that rotate are programmable using the alloc instruction. The set of predicate (p16 to p63) and floating (f32 to f127) registers that rotate is fixed. Instructions br.ctop and br.cexit provide support for counted loops (similar instructions exist to support pipelining of while-type loops).
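As a minimal sketch of a two-stage software-pipelined copy loop (the trip count, register choices, and latency handling are simplified illustrative assumptions):

          alloc   r40 = ar.pfs, 0, 16, 0, 8  // rotating region r32-r39
          mov     ar.lc = 99                 // LC = N - 1: 100 kernel iterations
          mov     ar.ec = 2                  // EC = pipeline depth: 2 stages
          mov     pr.rot = 1 << 16           // p16 = 1, p17-p63 = 0
    loop:
    (p16) ld8     r32 = [r8], 8              // stage 1: load, post-increment
    (p17) st8     [r9] = r33, 8              // stage 2: store the value loaded
                                             // one iteration ago (that r32 has
                                             // rotated into r33 by now)
          br.ctop.sptk loop                  // decrement LC, rotate registers
                                             // and predicates; once LC reaches
                                             // 0, EC counts down and p16 is
                                             // fed 0, draining the pipeline

The rotating predicates provide the prologue and epilogue automatically: the store stage switches on one iteration after the load stage, and switches off one iteration after it while the pipeline drains.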

The rotating predicates are important because they serve as pipeline stage valid bits, allowing the hardware to automatically drain the software pipeline by turning instructions on or off depending on whether the pipeline is starting up, executing, or draining. Mahlke et al. provide some highly optimized specific examples of how software pipelining and rotating registers can be used.

The combination of these loop features and predication enables the compiler to generate compact code, which performs the essential work of the loop in a highly parallel form. All of this can be done with the same amount of code as would be needed for a non-software-pipelined loop. Since there is little or no code expansion required to software-pipeline loops in IA-64, the compiler can use software pipelining much more aggressively as a general loop optimization, providing increased parallelism for a broad set of applications.

Although out-of-order hardware approaches can approximate a software-pipelined approach, they require much more complex hardware, and do not deal as well with problems such as recurrences (where one loop iteration creates a value consumed by a later loop iteration).

Summary of Parallelism Features. These parallelism tools work in a synergistic fashion, each supporting the other. For example, program loops may contain loads and stores through pointers. Data speculation allows the compiler to use the software-pipelining mechanism to fully overlap the execution, even when the loop uses pointers that may be aliased. Also, scheduling a load early often requires scheduling it out of its basic block and ahead of an earlier store. Speculative advanced loads allow both control and data speculation mechanisms to be used at once. This increased ILP keeps parallel hardware functional units busier, executing a program's critical path in less time. While designers and architects have a model for how IA-64 features will be implemented and used, we anticipate new ways to use the IA-64 architecture as software and hardware designs mature. Each day brings discoveries of new code-generation techniques and new approaches to old algorithms. These discoveries are validating that ILP does exist in programs, and the more you look, the more you find.

ILP is one level of parallelism that IA-64 exploits, but we continue to pursue other sources of parallelism through on-chip and multichip multiprocessing approaches. To achieve the best overall performance, it is best to start with the highest-performance uniprocessor and then combine those processors into multiprocessor systems.

In the future, as software and hardware technologies evolve, and as the size and computation demands of workloads continue to grow, ILP features will be vital to allow processors' continued increases in performance and scalability. The Intel-HP architecture team designed the IA-64 from the ground up to be ready for these changes and to provide excellent performance over a wide range of applications.

