The Design Space of Register Renaming Techniques

Register renaming is a technique to remove false data dependencies - write after read (WAR) and write after write (WAW) - that occur in straight-line code between register operands of subsequent instructions. By eliminating the related precedence requirements in the execution sequence of the instructions, renaming increases the average number of instructions that are available for parallel execution per cycle. This results in increased IPC (the number of instructions executed per cycle).

The identification and exploration of the design space of register renaming lead to a comprehensive understanding of this intricate technique. The design space of register renaming is spanned by four main dimensions: the scope of register renaming, the layout of the rename buffers, the method of register mapping, and the rename rate. Relevant aspects of the design space give rise to eight basic alternatives for register renaming. In addition, the kind of operand fetch policy significantly affects how the processor carries out the rename process, which doubles the eight basic alternatives to 16 possible implementation schemes. As register renaming is usually implemented in conjunction with shelving, the underlying microarchitecture is assumed to employ shelving.

Register Renaming. The principle of register renaming is straightforward. If the processor encounters an instruction that addresses a destination register, it temporarily writes the instruction's result into a dynamically allocated rename buffer rather than into the specified destination register. Consider, for instance, the following WAR dependency:

    i1: add ..., r2, ...;   [r2 is a source operand of i1]
    i2: mul r2, ..., ...;   [r2 is the destination of i2]

Here the destination register of i2 (r2) is renamed, say to r33. Instruction i2 then becomes

    i2': mul r33, ..., ...; [r33 is the renamed destination]

and its result is written into r33 instead of into r2. This resolves the previous WAR dependency between i1 and i2. In subsequent instructions, however, references to source registers must be redirected to the rename buffers allocated to them as long as this renaming remains valid [11], [33], [44], [61], [62].
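To illustrate how subsequent source references are redirected, the following minimal C sketch models a renamer with a simple mapping array. The names (rename_map, rename_dest, and so on) and the buffer numbering are hypothetical and merely mirror the r33 example above; this is a sketch of the principle, not of any particular processor.

    #include <stdio.h>

    #define NUM_ARCH_REGS 32

    /* Hypothetical renamer sketch: each architectural register may be mapped to a
       rename buffer; sources are redirected through the current mapping, and a
       destination receives a freshly allocated rename buffer. */
    static int rename_map[NUM_ARCH_REGS];   /* -1 = not renamed */
    static int next_free_buffer = 33;       /* arbitrary start, chosen to match r33 above */

    static int rename_source(int reg) {     /* redirect a source reference, if mapped */
        return rename_map[reg] >= 0 ? rename_map[reg] : reg;
    }

    static int rename_dest(int reg) {       /* allocate a new rename buffer for a destination */
        rename_map[reg] = next_free_buffer++;
        return rename_map[reg];
    }

    int main(void) {
        for (int i = 0; i < NUM_ARCH_REGS; i++) rename_map[i] = -1;

        printf("i1 reads r%d\n", rename_source(2));             /* i1: r2 used as a source, read as is    */
        printf("i2 writes r%d\n", rename_dest(2));              /* i2: r2 as destination renamed to r33   */
        printf("later read of r2 -> r%d\n", rename_source(2));  /* subsequent sources redirected to r33   */
        return 0;
    }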
Tomasulo's Algorithm. Let us discuss the history of the first pipelined processors, the earliest superscalars, the development of out-of-order and speculative techniques, as well as important developments in the accompanying compiler technology. It is generally agreed that one of the first general-purpose pipelined computers was Stretch, the IBM 7030. Stretch followed the IBM 704 and had the goal of being 100 times faster than the 704. The goals were a "stretch" of the state of the art at that time, hence the nickname. The plan was to obtain a factor of 1.6 from overlapping fetch, decode, and execute, using a four-stage pipeline; apparently the rest was to come from much more hardware and faster logic. Stretch was also a training ground for both the architect of the IBM 360 and the architect of the IBM RS/6000.

Control Data Corporation (CDC) delivered what is considered to be the first supercomputer, the CDC 6600, in 1964. The core instructions of Cray's subsequent computers have many similarities to those of the original CDC 6600. The CDC 6600 was unique in many ways. The interaction between pipelining and instruction set design was understood, and the instruction set was kept simple to promote pipelining. The CDC 6600 also used advanced packaging technology. CDC developed the original 2-bit branch prediction scheme and also explored several techniques for enhancing instruction issue [27], [34], [37], [44], [53], [61], [62], [63], [65].

The IBM 360/91 introduced many new concepts, including dynamic detection of memory hazards, generalized forwarding, and reservation stations. This approach is commonly called Tomasulo's algorithm, after Robert Tomasulo, an engineer who worked on the project, in recognition of his pioneering work on out-of-order processing. The internal organization of the 360/91 shares many features with the Pentium III and Pentium 4, as well as with several other microprocessors. One major difference was that there was no branch prediction in the 360/91 and hence no speculation. Another major difference was that there was no commit unit, so once instructions finished execution, they updated the registers immediately. Out-of-order instruction commit led to imprecise interrupts, which proved to be unpopular and led to the commit units found in dynamically scheduled pipelined processors since that time. Although the 360/91 was not a success, its key ideas were resurrected later and exist in some form in the majority of modern microprocessors.

The centralized hazard-detection scheme for pipelined instruction execution, first applied in the CDC 6600, allows instructions to execute out of order whenever sufficient resources are available and no data dependencies exist. Such out-of-order execution means that several instructions may be in the execution stage at the same time. This requires several functional units, with which the centralized scheme cooperates to control the advancement of every instruction from one stage to the next.

In the IBM System/360 Model 91, Tomasulo's algorithm was first applied to organize concurrent instruction execution in the presence of hazards. It combines two methods: renaming of registers and buffering (copying) of source operands out of the register file. In this algorithm, hazard detection and execution control are distributed. For this purpose, reservation stations are attached to the pipelined functional units; each station determines when the instruction it holds can begin execution. Results are forwarded directly to the functional units, bypassing the registers, and the common data bus allows all units waiting for an operand to load it simultaneously. A reservation station holds an instruction that has been issued and is awaiting execution in the corresponding functional unit, together with the information needed to control that functional unit while it executes the instruction stored there.

For hazard management, all the buffers and reservation stations have tag fields; a 4-bit value is sufficient to denote one of the five reservation stations or one of the six load buffers. (This use of the term differs from a cache tag, which is a field in a memory-hierarchy table containing the address information required to identify whether the associated block corresponds to a requested word.) The tag field describes which functional unit will supply the result needed as a source operand; unused values, such as zero, indicate that the operand is already available.
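A minimal C sketch of a reservation-station entry with such tag fields follows. The field names (q_j, v_j, and so on) follow textbook descriptions of Tomasulo's scheme rather than the actual 360/91 hardware, so treat the layout as illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative reservation-station entry in the spirit of Tomasulo's scheme.
       A tag of 0 means "operand already available"; a nonzero tag names the
       reservation station or load buffer that will produce the operand. */
    typedef struct {
        bool     busy;        /* entry holds an instruction awaiting execution        */
        uint8_t  op;          /* operation to be performed                            */
        uint8_t  q_j, q_k;    /* producer tags of the two source operands             */
        uint64_t v_j, v_k;    /* buffered copies of the operands (valid when tag = 0) */
    } ReservationStation;

    /* When a result is broadcast on the common data bus together with its tag,
       every waiting entry whose q_j or q_k matches captures the value at once. */
    void cdb_broadcast(ReservationStation *rs, int n, uint8_t tag, uint64_t value) {
        for (int i = 0; i < n; i++) {
            if (rs[i].busy && rs[i].q_j == tag) { rs[i].v_j = value; rs[i].q_j = 0; }
            if (rs[i].busy && rs[i].q_k == tag) { rs[i].v_k = value; rs[i].q_k = 0; }
        }
    }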
Tags in Tomasulo's algorithm thus refer to the buffers and units that will supply a result; once an instruction has been issued to a reservation station, the register numbers themselves are no longer consulted. Because the hazard-detection logic is distributed, broadcasting a result on the common data bus allows several instructions that are waiting for the same result and already have their other operand to begin execution simultaneously. WAW and WAR hazards, which arise while operands are still accessible in registers, are removed by buffering the source operands; a WAR hazard can also be removed by renaming the destination register together with buffering of the result until all outstanding references to the old version of the register have been satisfied.

A precursor to register renaming was introduced for floating-point instructions in 1967 by Tomasulo in the IBM 360/91, a scalar supercomputer of that time that pioneered both pipelining and shelving (dynamic instruction issue). The 360/91 renamed floating-point registers to preserve the logical consistency of the program execution rather than to remove false data dependencies. Tjaden and Flynn first suggested the use of register renaming for removing false data dependencies, though only for a limited set of instructions corresponding more or less to the load instructions; they did not, however, use the term "register renaming." Keller introduced this designation in 1975 and extended renaming to cover all instructions that include a destination register. He also described how to implement register renaming in processors. Even so, owing to the complexity of the technique, almost two decades passed after its conception before register renaming came into widespread use in superscalars at the beginning of the 1990s.

Early superscalars such as the HP PA 7100, Sun SuperSparc, DEC Alpha 21064, MIPS R8000, and Intel Pentium typically did not use renaming. Renaming appeared gradually, first in a restricted form called partial renaming, in the early 1990s in the IBM RS/6000 (Power1), Power2, PowerPC 601, and NexGen Nx586 processors. Full renaming emerged later, beginning in 1992, first in the high-end models of the IBM ES/9000 mainframe family and then in the PowerPC 603. Subsequently, renaming spread into virtually all superscalar processors, with the notable exception of Sun's UltraSparc line. At present, register renaming is considered a standard feature of performance-oriented superscalar processors.

The main dimensions of register renaming are as follows:
- scope of register renaming,
- layout of the renamed registers,
- method of register mapping, and
- rename rate,
as indicated in subsequent sections.

Scope of Register Renaming. To indicate how extensively the processor makes use of renaming, let us distinguish between partial and full renaming. Partial renaming is restricted to one or only a few instruction types, for instance, only to floating-point instructions. Early processors typically employed this incomplete form of renaming; the Power1 (RS/6000), Power2, PowerPC 601, and Nx586 are examples. Of these, the Power1 (RS/6000) renames only floating-point loads. As the Power1 has only a single floating-point unit, it executes floating-point instructions in sequence, so there is no need to rename floating-point register-to-register instructions. The Power2 introduces multiple floating-point units and for this reason extends renaming to all floating-point instructions.
The PowerPC 601 renames only the link and count registers. The Nx586, which has an integer core, obviously restricted renaming to fixed-point instructions. Full renaming, by contrast, covers all instructions that include a destination register.

Rename Buffer Layout. Rename buffers establish the actual framework for renaming. There are three essential design aspects in their layout:
- the type of rename buffers,
- the number of rename buffers, and
- the number of read and write ports.

The choice of which type of rename buffers to use in a processor has a far-reaching impact on the implementation of the rename process. Given its importance, designers must consider the various design options. To simplify this presentation, let us initially assume a common architectural register file for all processed data types, and then extend the discussion to the split-register scenario that is commonly employed. There are four fundamentally different ways to implement rename buffers: a) using a merged architectural and rename register file, b) employing a stand-alone rename register file, c) keeping renamed values in the reorder buffer (ROB), or d) keeping them in the shelving buffers.

In the first approach, rename buffers are implemented along with the architectural registers in the same physical register file, called the merged architectural and rename register file, or the merged register file for short. Here, both architectural and rename registers are dynamically allocated to particular registers of the same physical file. Each physical register of the merged architectural and rename register file is at any time in one of four possible states, which reflect the actual use of the physical register:
- uncommitted (available) state,
- used as an architectural register (architectural register state),
- used as a rename buffer that does not yet contain the result of the associated instruction (rename buffer, not-valid state), and
- used as a rename buffer that already contains the result of the associated instruction (rename buffer, valid state).

During instruction processing, the states of the physical registers change as described below and as indicated in the state transition diagram in Fig. 6.11. As part of initialization, the first n physical registers are assigned to the architectural registers, where n is the number of registers declared by the instruction set architecture (ISA). These registers are set to the architectural register (AR) state; the remaining physical registers take on the available state. When an issued instruction includes a destination register, a new rename buffer is needed. For this reason, one physical register is selected from the pool of available registers and allocated to the destination register concerned. Accordingly, its state is set to the rename buffer, not-valid state and its valid bit is reset. After the associated instruction finishes execution, the produced result is written into the allocated rename buffer. Its valid bit is set, and its state changes to rename buffer, valid. Later, when the associated instruction completes, the allocated rename buffer is declared to be the architectural register that implements the destination register specified in the just-completed instruction. Its state then changes to the architectural register state to reflect this.
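The state transitions just described can be summarized in a small C sketch. The enum and function names are illustrative, and the assertions stand in for the checks implied by Fig. 6.11; this is a model of the state machine, not of any particular implementation.

    #include <assert.h>

    /* Possible states of a physical register in a merged
       architectural and rename register file (cf. Fig. 6.11). */
    typedef enum {
        AVAILABLE,          /* uncommitted, free for allocation                       */
        ARCH_REG,           /* holds the committed value of an architectural register */
        RENAME_NOT_VALID,   /* allocated as a rename buffer, result not yet produced  */
        RENAME_VALID        /* allocated as a rename buffer, result already written   */
    } PhysRegState;

    typedef struct {
        PhysRegState state;
        long         value;  /* meaningful only in ARCH_REG and RENAME_VALID */
    } PhysReg;

    /* Issue: allocate an available physical register as a rename buffer. */
    void allocate_rename(PhysReg *r) {
        assert(r->state == AVAILABLE);
        r->state = RENAME_NOT_VALID;
    }

    /* Finish: the execution unit writes the result into the rename buffer. */
    void write_result(PhysReg *r, long v) {
        assert(r->state == RENAME_NOT_VALID);
        r->value = v;
        r->state = RENAME_VALID;
    }

    /* Complete: the rename buffer becomes the architectural register. */
    void commit(PhysReg *r) {
        assert(r->state == RENAME_VALID);
        r->state = ARCH_REG;
    }

    /* Reclaim the previous instance of the architectural register, or cancel
       a speculative rename buffer; in both cases the register becomes available. */
    void reclaim(PhysReg *r) {
        r->state = AVAILABLE;
    }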
Note that, in contrast to the other rename buffer types, no data transfer is required to update the architectural registers in merged architectural and rename register files; only the status of the related registers needs to be changed. Finally, when an old architectural register is reclaimed, it is set to the available state. Assuming dispatch-bound operand fetching, one possibility for reclaiming such old architectural registers is to keep track of the physical registers that were the previous instances of the same architectural register and to reclaim the previous instance when the instruction incorporating the new instance completes. Also, instructions that have not yet completed must be canceled on exceptions or when speculative instructions turn out to have been executed incorrectly. In that case, allocated rename buffers in both the rename buffer, not-valid and the rename buffer, valid states are deallocated and returned to the available state. In addition, the corresponding mappings, kept in either the mapping table or the rename buffers themselves (as discussed later), must be canceled. Merged architectural and rename register files are employed, for instance, in the high-end (520-based) models of the IBM ES/9000 mainframes, in the Power and R1x000 processors, and in the Alpha 21264.

All other alternatives separate the rename buffers from the architectural registers. In the first "separated" variant, a stand-alone rename register file (or rename register file for short) is used exclusively to implement rename buffers. The PowerPC 603/620 and PA8x00 processors are examples of processors using rename register files.

Alternatively, renaming can be based on the reorder buffer (ROB). The ROB has recently been widely used to preserve the sequential consistency of instruction execution. When a ROB is used, an entry is assigned to each issued instruction for the duration of its execution. It is quite natural to use this entry for renaming as well, basically by extending it with a new field that holds the result of the instruction. Examples of processors using the ROB for renaming are the Am29000 superscalar, the K5, K6, Pentium Pro, Pentium II, Pentium III, and Pentium 4. The ROB can even be extended further to serve as a central shelving buffer as well. In this case, the ROB is occasionally designated as the DRIS (deferred scheduling, register renaming instruction shelf). The Lightning processor proposal and the K6 made use of this solution. Because the Lightning proposal, which dates back to the beginning of the 1990s, was too ambitious in light of the technology available at that time, it could not be economically implemented and never reached the market.

The last conceivable implementation alternative of rename buffers is to use the shelving buffers for renaming. In this case, each shelving buffer must be extended functionally to also perform the task of a rename buffer. This alternative has a drawback, however, resulting from the different deallocation mechanisms of shelving and rename buffers. While shelving buffers can be reclaimed as soon as the instruction has been dispatched, rename buffers can be deallocated only later, not earlier than the instruction's completion. Thus, a deeper analysis is needed to reveal the appropriateness of using shelving buffers for renaming. To date, no processor has chosen this alternative.
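As a sketch of the ROB-based variant described above, the C type below extends a reorder buffer entry with a destination identifier and a result field so that the entry itself can serve as the rename buffer. The field names and widths are assumptions made for illustration only.

    #include <stdbool.h>
    #include <stdint.h>

    /* A ROB entry extended for renaming: besides the usual bookkeeping it
       holds the destination register number and a slot for the result, which
       is copied to the architectural register when the instruction completes. */
    typedef struct {
        bool     allocated;     /* entry is in use                                */
        bool     finished;      /* the instruction has finished execution         */
        uint8_t  dest_reg;      /* architectural destination register number      */
        bool     result_valid;  /* the result field already holds the value       */
        uint64_t result;        /* renamed value, written when execution finishes */
    } RobEntry;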
Instruction Shelving Principle. In early superscalars, decoded and executable instructions are issued immediately to the execution units. Under this scheme, however, control and data dependencies, as well as busy execution units, cause issue bottlenecks. The basic technique used to remove these issue bottlenecks is instruction shelving, also known as dynamic instruction issue. Shelving presumes the availability of dedicated buffers, called shelving buffers, in front of the execution units. The processor first issues instructions into available shelving buffers without checking for data or control dependencies or for busy execution units. As data dependencies and busy execution units no longer restrict instruction issue, the issue bottleneck of early superscalars is removed. In a second step, instructions held in the shelving buffers are dispatched for execution. During dispatching, instructions are checked for dependencies, and dependency-free instructions are forwarded to the execution units. At present there is no consensus on the use of the terms instruction issue and instruction dispatch; both terms are used in both possible interpretations.

ROB Principle. The reorder buffer is implemented basically as a circular buffer whose entries are allocated and deallocated by means of two revolving pointers. It operates as follows. When instructions are issued, a ROB entry is allocated to each instruction, strictly in program order. Each ROB entry keeps track of the execution status of the associated instruction. The ROB allows instructions to complete (commit, retire) only in program order, by permitting an instruction to complete only if it has finished its execution and all preceding instructions have already completed. In this way, instructions update the program state in exactly the same way as a sequential processor would have done. After an instruction has completed, the associated ROB entry is deallocated and becomes eligible for reuse.

Operand Fetch Policies. If a processor uses the issue-bound fetch policy, it fetches referenced register operands during instruction issue, that is, while it forwards decoded instructions into the shelving buffers. In contrast, the dispatch-bound fetch policy defers operand fetching until executable instructions are forwarded from the shelving buffers to the execution units. When a processor fetches operands issue bound, the shelving buffers hold the source operand values. In contrast, with dispatch-bound operand fetching, shelving buffers have much shorter entries, as they contain only the register identifiers.
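The difference between the two fetch policies shows up directly in the width of a shelving-buffer entry, as the contrasting C layouts below suggest; the field names and sizes are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>

    /* Shelving-buffer entry under issue-bound operand fetching: the operand
       values themselves are copied in at issue time, so the entry is wide. */
    typedef struct {
        uint16_t opcode;
        bool     src1_valid, src2_valid;   /* value already fetched or forwarded    */
        uint64_t src1_value, src2_value;   /* the source operand values themselves  */
        uint8_t  dest_rename_index;        /* rename buffer allocated to the result */
    } ShelfEntryIssueBound;

    /* Under dispatch-bound fetching the entry only names the (renamed) source
       registers; the values are read when the instruction is dispatched. */
    typedef struct {
        uint16_t opcode;
        uint8_t  src1_rename_index, src2_rename_index;  /* register identifiers only */
        uint8_t  dest_rename_index;
    } ShelfEntryDispatchBound;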
When the processor employs the split-register principle, distinct fixed-point and floating-point register files are needed for merged files and for stand-alone rename register files. In this case, separate data paths are also needed to access the fixed-point and floating-point registers. Recent processors typically incorporate split rename registers. When renaming takes place within the ROB, usually a single mechanism is maintained to preserve the sequential consistency of instruction execution. Then all renamed instructions are kept in the same ROB queue, despite the use of split architectural register files for fixed-point and floating-point data. In this case, clearly, each ROB entry must be long enough to hold either fixed-point or floating-point data.

Number of Rename Buffers. Rename buffers keep register results temporarily until instructions complete. Taking into account that not every instruction produces a register result, a processor needs at most as many rename buffers as the maximum number of instructions in execution, that is, issued but not yet completed. Issued but not yet completed instructions are either
- held in shelving buffers waiting for execution (if shelving is employed),
- being processed in the execution units,
- in the load queue waiting for cache access (if there is a load queue), or
- in the store queue waiting for completion and subsequent forwarding to the cache to execute the required store operation (if there is a store queue).

Thus, the maximum number of instructions that may have been issued but not yet completed in the processor (n_p,max) is

    n_p,max = w_dw + n_EU + n_Lq + n_Sq,    (6.7)

where
- w_dw is the width of the dispatch window (the total number of shelving buffers),
- n_EU is the number of execution units that may operate in parallel,
- n_Lq is the number of entries in the load queue, and
- n_Sq is the number of entries in the store queue.

Assuming a worst-case design approach, from this formula we can determine the total number of rename buffers required (n_r,max) as

    n_r,max = w_dw + n_EU + n_Lq,    (6.8)

since instructions held in the store queue do not require rename buffers. Furthermore, if the processor includes a ROB, then based on Equation (6.7) the total number of ROB entries required (n_ROB,max) is

    n_ROB,max = n_p,max.    (6.9)

Nevertheless, if the processor has fewer rename buffers or fewer ROB entries than the worst-case approach requires (as given by Equations 6.8 and 6.9), issue blockages can occur due to missing free rename buffers or ROB entries. With a decreasing number of entries provided, we expect a smooth, slight performance degradation. Hence, a stochastic design approach is also feasible, in which the required number of entries is derived from the tolerated level of performance degradation. Based on Equations (6.7) to (6.9), the following relations typically hold among the width of the processor's dispatch window (w_dw), the total number of rename buffers (n_r), and the reorder width (n_ROB), which equals the total number of ROB entries available:

    w_dw < n_r ≤ n_ROB.    (6.10)
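As a quick numerical illustration of Equations (6.7) to (6.9), the short program below plugs in hypothetical machine parameters; the values are invented for the example and do not describe any real processor.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical machine parameters (for illustration only). */
        int w_dw = 20;   /* dispatch window width: total shelving buffer entries */
        int n_EU = 6;    /* execution units that may operate in parallel         */
        int n_Lq = 8;    /* load queue entries                                   */
        int n_Sq = 8;    /* store queue entries                                  */

        int n_p_max   = w_dw + n_EU + n_Lq + n_Sq;  /* Eq. (6.7): issued but not yet completed       */
        int n_r_max   = w_dw + n_EU + n_Lq;         /* Eq. (6.8): store-queue entries need no buffer */
        int n_ROB_max = n_p_max;                    /* Eq. (6.9): one ROB entry per such instruction */

        printf("n_p,max   = %d\n", n_p_max);    /* 42 */
        printf("n_r,max   = %d\n", n_r_max);    /* 34 */
        printf("n_ROB,max = %d\n", n_ROB_max);  /* 42 */
        return 0;
    }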
Number of Read and Write Ports. Taking current practice into account, in the following discussion we assume split register files. Clearly, as many read ports are required in the rename buffers as there are data items that the rename buffers may need to supply in any one cycle. Note that rename buffers supply the required operands for the instructions to be executed and also forward the results of completed instructions to the addressed architectural registers. The number of operands that must be delivered in the same cycle depends first of all on whether the processor fetches operands during instruction issue or during instruction dispatch.

If operands are fetched issue bound, the rename buffers need to supply the operands for all instructions that are issued into the shelving buffers in the same cycle. Thus, both the fixed-point and the floating-point rename buffers are expected to deliver in each cycle all required operands for up to as many instructions as the issue rate. This means that in recent four-way superscalar processors the fixed-point and the floating-point rename buffers typically need to supply 8 and 12 operands, respectively, assuming up to two fixed-point operands in each fixed-point instruction and up to three floating-point operands in each floating-point instruction. If, however, there are some issue restrictions, the required number of read ports decreases accordingly.

In contrast, if the processor uses the dispatch-bound fetch policy, the rename buffers must provide the operands for all instructions that are forwarded from the dispatch window (instruction window) for execution in the same cycle. In this case, the fixed-point rename buffers need to supply the required fixed-point operands for the integer and load-store units (including register operands for the specified address calculations and fixed-point data for fixed-point store instructions). The floating-point rename buffers need to deliver operands for the floating-point units (floating-point register data) and also for the load-store units (floating-point operands of floating-point store instructions). In the Power3, for instance, this implies the following read-port requirements. The fixed-point rename buffers need 12 read ports (up to 3 × 2 operands for the 3 integer units as well as 2 × 2 address operands and 2 × 1 data operands for the 2 load-store units). The floating-point rename registers, on the other hand, need 8 read ports (up to 2 × 3 operands for the 2 floating-point units and 2 × 1 operands for the 2 load-store units).

In addition, if rename buffers are implemented separately from the architectural registers, the rename buffers must forward in each cycle as many result values to the architectural registers as the completion rate (retire rate) of the processor. Since recent processors usually complete up to four instructions per cycle, this task typically increases the required number of read ports in the rename buffers by four.

Too many read ports in a register file may unduly increase the physical size of the data path and consequently the cycle time. To avoid this problem, a few high-performance processors (such as the Power2, Power3, and Alpha 21264) implement two copies of particular register files. The Power2 duplicates the fixed-point architectural register file, the Power3 doubles both the fixed-point rename file and the architectural file, and the Alpha 21264 provides two copies of the fixed-point merged architectural and rename register file. As a result, fewer read ports are needed in each of the copies. For example, with two copies of its fixed-point register files, the Power3 needs only 10 read ports in each copy instead of 16 read ports in a single fixed-point register file.
A drawback of this approach, however, is that a scheme is also required to keep both copies coherent.

Now let us turn to the required number of write ports (input ports). Since in each cycle the rename buffers need to accept all results produced by the execution units, they must provide as many write ports as the number of results the execution units may produce per cycle. The fixed-point rename buffers receive results from the available integer execution units and from the load-store units (fetched fixed-point data). In contrast, the floating-point rename buffers hold the results of the floating-point execution units and of the load-store units (fetched floating-point data). Most results are single data items requiring one write port. However, there are a few exceptions. Execution units that generate two data items require two write ports as well, as the load-store units of PowerPC processors do: after executing a load-with-update instruction, these units return both the fetched data value and the updated address value.

Register Mapping Methods. During renaming, the processor needs to allocate rename buffers to the destination registers of the instructions (or, usually, to every instruction in order to simplify the logic). It must also keep track of the mappings actually in use and deallocate rename buffers that are no longer needed. Accordingly, this aspect of the design space has three components:
- the allocation scheme of the rename buffers,
- the method of keeping track of actual mappings, and
- the deallocation scheme of the rename buffers.

As far as the allocation scheme is concerned, rename buffers are usually allocated to instructions during instruction issue. If rename buffers are assigned to instructions as early as instruction issue, rename buffer space is wasted, since a rename buffer is not really needed until the result becomes available in the last execution cycle. Delaying the allocation of rename buffers to instructions saves rename buffer space. Various schemes have been proposed for this, such as virtual renaming and others; in fact, a virtual allocation scheme has already been introduced in the Power3.

Established mappings must be maintained until they are invalidated. There are two possibilities for keeping track of the actual mapping of particular architectural registers to allocated rename buffers. The processor can use a mapping table for this, or it can track the actual register mapping within the rename buffers themselves (Fig. 6.13). A mapping table has as many entries as there are architectural registers provided by the instruction set architecture (usually 32). Each entry holds a status bit (called the entry valid bit in Fig. 6.13a), which indicates whether the associated architectural register is renamed. Each valid entry supplies the index of the rename buffer allocated to the architectural register belonging to that entry (called the RB index). For instance, Fig. 6.13a shows that the mapping table holds a valid entry for architectural register r7 containing the RB index 12. This indicates that architectural register r7 is currently renamed to rename buffer 12.
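A minimal C sketch of such a mapping table follows, reproducing the r7-to-rename-buffer-12 example; the table size, function names, and the -1 convention for "not renamed" are assumptions made for the illustration.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_ARCH_REGS 32   /* architectural registers defined by the ISA */

    /* One mapping-table entry per architectural register (cf. Fig. 6.13a). */
    typedef struct {
        bool    entry_valid;   /* the architectural register is currently renamed */
        uint8_t rb_index;      /* index of the rename buffer allocated to it      */
    } MapEntry;

    static MapEntry map_table[NUM_ARCH_REGS];

    /* Renaming a destination register: record the newly allocated rename buffer. */
    void map_dest(uint8_t arch_reg, uint8_t rb_index) {
        map_table[arch_reg].entry_valid = true;
        map_table[arch_reg].rb_index    = rb_index;
    }

    /* Renaming a source register: return the RB index if a valid mapping exists,
       otherwise -1 to indicate that the architectural register holds the value. */
    int map_source(uint8_t arch_reg) {
        return map_table[arch_reg].entry_valid ? map_table[arch_reg].rb_index : -1;
    }

    int main(void) {
        map_dest(7, 12);                          /* r7 is renamed to rename buffer 12 */
        printf("r7 -> RB %d\n", map_source(7));   /* prints: r7 -> RB 12               */
        printf("r3 -> %d\n", map_source(3));      /* prints: r3 -> -1 (not renamed)    */
        return 0;
    }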
Obviously, for split architectural register files, separate fixed-point and floating-point mapping tables are needed. As discussed earlier, mapping tables should provide one read port for each source operand that may be fetched in any one cycle and one write port for each rename buffer that may be allocated in any one cycle.

The other, fundamentally different alternative for keeping track of actual register mappings relies on an associative mechanism (see Fig. 6.13b). In this case there is no mapping table; instead, each rename buffer holds the identifier of the associated architectural register (usually the register number of the renamed destination register) and additional status bits. These entries are set up during instruction issue, when a particular rename buffer is allocated to a specified destination register. As Fig. 6.13b shows, each rename buffer then holds the following five pieces of information:
- a status bit indicating that the rename buffer is actually allocated (called the entry valid bit in the figure),
- the identifier of the associated architectural register (Dest. reg. no.),
- a further status bit, called the latest bit, whose role is explained below,
- another status bit, called the value valid bit, which shows whether the actual value of the associated architectural register has already been generated, and
- the value itself, provided that the value valid bit signals an already produced result.

The latest bit marks the most recent allocation of a given architectural register if it has more than one valid allocation due to repeated renaming. For instance, in our example architectural register r7 has two subsequent allocations; of these, entry 12 is the latest one, as its latest bit is set. Thus, in Fig. 6.13b, renaming source register r7 would yield the RB index 12. This method of renaming source registers requires an associative lookup over all entries, searching for the latest allocation of the given source register.

With issue-bound operand fetching, source registers are both renamed and accessed during the issue process. For this reason, processors in this case usually integrate renaming and operand accessing and therefore keep track of the register mapping within the rename buffers. With dispatch-bound operand fetching, however, these tasks are separated: source registers are renamed during instruction issue, whereas the source operands are accessed while the processor dispatches the instructions to the execution units. Therefore, in this case, processors typically use mapping tables.

If rename buffers are no longer needed, they should be reclaimed (deallocated). The deallocation scheme depends on key aspects of the overall renaming process: in particular, on the allocation scheme of the rename buffers, the type of rename buffers used, the method of keeping track of actual allocations, and even on whether issue-bound or dispatch-bound operand fetching is used. Lack of space, however, precludes a detailed discussion of this aspect of the design space here.
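The associative alternative of Fig. 6.13b can be sketched in C as follows; the entry layout mirrors the five fields listed above, while the buffer count and function name are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_RENAME_BUFFERS 16   /* illustrative size */

    /* Rename-buffer entry used for associative mapping (cf. Fig. 6.13b). */
    typedef struct {
        bool     entry_valid;   /* the buffer is currently allocated                 */
        uint8_t  dest_reg_no;   /* architectural register this buffer renames        */
        bool     latest;        /* marks the most recent allocation of that register */
        bool     value_valid;   /* the result has already been produced              */
        uint64_t value;         /* the result itself, if value_valid is set          */
    } RenameBuffer;

    static RenameBuffer rb[NUM_RENAME_BUFFERS];

    /* Associative lookup: search all entries for the latest valid allocation of the
       given source register; return its index, or -1 if the register is not renamed
       and the architectural register file must be read instead. */
    int lookup_source(uint8_t arch_reg) {
        for (int i = 0; i < NUM_RENAME_BUFFERS; i++) {
            if (rb[i].entry_valid && rb[i].latest && rb[i].dest_reg_no == arch_reg)
                return i;   /* e.g. r7 would yield entry 12 in the text's example */
        }
        return -1;
    }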
Rename Rate. As its name suggests, the rename rate is the maximum number of renames that a processor can perform in a cycle. Basically, the processor should rename all instructions issued in the same cycle to avoid performance degradation; thus, the rename rate should equal the issue rate. This is easier said than done, since it is not at all easy to implement a high rename rate (four or higher). Two reasons make it difficult. First, for higher rename rates the detection and handling of inter-instruction dependencies during renaming become more complex. Second, higher rename rates require a larger number of read and write ports on register files and on mapping tables. For instance, the four-way superscalar R10000 can issue any combination of four fixed-point and floating-point instructions. Accordingly, its fixed-point mapping table needs 12 read ports and 4 write ports, and its floating-point mapping table requires 16 read ports and 4 write ports. This many ports are needed because, in this processor, fixed-point instructions can refer to up to three and floating-point instructions to up to four source operands.
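The port counts quoted for the R10000 follow directly from the issue rate and the maximum number of source operands per instruction; the short program below merely reproduces that arithmetic (the variable names are ours, not the processor's).

    #include <stdio.h>

    int main(void) {
        int issue_rate          = 4;  /* four-way superscalar                           */
        int max_fx_src_operands = 3;  /* source operands per fixed-point instruction    */
        int max_fp_src_operands = 4;  /* source operands per floating-point instruction */
        int dests_per_instr     = 1;  /* one destination rename per instruction         */

        printf("FX mapping table: %d read ports, %d write ports\n",
               issue_rate * max_fx_src_operands,   /* 12 */
               issue_rate * dests_per_instr);      /*  4 */
        printf("FP mapping table: %d read ports, %d write ports\n",
               issue_rate * max_fp_src_operands,   /* 16 */
               issue_rate * dests_per_instr);      /*  4 */
        return 0;
    }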