2. Why do RISC designs generally provide only a minimal set of instructions from which more complex instructions can be constructed?
3. What is the basic idea of RISC?
4. What architectural ideas were realized in the first RISC computer?
5. What architectural features distinguish the Berkeley RISC and Stanford MIPS projects?
Pipelining
Pipelining is an implementation technique in which multiple instructions are overlapped in execution, much like an assembly line.
The Flynn classification is based on the number of concurrent instruction streams and data streams and distinguishes four classes of architectures: SISD (single instruction stream / single data stream); SIMD (single instruction stream / multiple data streams); MISD (multiple instruction streams / single data stream); MIMD (multiple instruction streams / multiple data streams).
One Instruction per Clock Cycle. A goal to which the designers of all RISC processors aspire is the ability to sustain an execution rate of one instruction per cycle, or possibly even more than one instruction per cycle. It is important to understand that in a literal sense a processor cannot really execute an instruction in one cycle. It is difficult to imagine a processor that could take an instruction and instantly cause everything required by the instruction to happen at once. That is simply not the way processors are designed. The only way to effectively execute an instruction in a single cycle is through the use of pipelines, in which several instructions are being executed at once, or through the use of instruction lookahead or parallel execution units. The general idea is to do more than one thing at a time and to execute instructions in parallel to the greatest extent possible. In practice, most RISC processors currently do not quite achieve the one instruction per clock goal, but they come much closer than CISC chips.
Pipelining is a general term used to describe the design approach in which a complex operation is broken down into stages so that the stages can be executed in parallel. In RISC processors the application of this technique to basic instruction interpretation and execution is a key to achieving high throughput. Furthermore, many of the other characteristics of RISC, such as simplified instruction sets, are consequences of the decision to pipeline instruction execution [11], [38], [56], [57], [61], [62], [66].
To pipeline instruction execution, the execution of an individual instruction is broken down into a sequence of separate phases. The exact division of work between the phases varies between processors, but the following is a common description:
• Instruction Fetch (IF). This phase is the actual loading of the instruction from memory into the processor.
• Instruction Decode (ID). During this phase, the instruction is decoded. The decoding phase on a RISC processor is much simpler than the decoding that is required by CISC processors.
• Operand Fetch (OF). During the operand fetch phase, operands (typically operands in registers) are read from the registers into an appropriate place on the chip where the operation is performed.
• Operate (OP). The operation is actually executed during this phase.
• Write Back (WB). The result of an operation is written back into a register (or into a memory location).
These phases are the logical phases of instruction execution. They are not necessarily distinct in the sense that they must be executed sequentially, each in its own separate cycle. It is possible, for example, to overlap the instruction decode and operand fetch phases if the format of the instructions and the chip's logic allow it, resulting in a pipeline with four stages as shown in Fig. 3.2. If we look at these phases, and consider that they are supposed to be executed independently, we can immediately see why it would be quite difficult to pipeline a typical CISC machine. Consider an ADD instruction on the 68030. The first problem is that the instruction can vary in length from 2 to 14 bytes, depending on the addressing mode. The instruction fetch phase needs to know the length in order to fetch the bytes of the instruction. In the non-pipelined implementation used by the 68030, this is not a problem, since the necessary words are loaded from the instruction stream as they are needed. In a pipelined approach, however, we are deliberately trying to avoid this kind of intermixed processing. All these problems add up to make it impossible or at least extremely difficult to fully pipeline a CISC instruction set like that found on the x86. A RISC processor, on the other hand, is specially designed to facilitate the pipelined approach.
Both fetching an instruction and decoding an instruction on most RISC processors are simplified. When fetching an instruction, the processor on most machines can assume that the instruction length is 32 bits. This means that the instruction fetch phase simply loads a word from memory, without needing to examine any of the initial bits of the instruction to determine the number of bytes in the instruction. Instruction decoding is simplified because there are very few instruction formats. The few formats that do exist place fields such as the opcode and operand fields in consistent positions. This means that instruction decoding and the process of identifying the operands are much simplified. When it comes to fetching the operands, an important simplification is that only load and store instructions can reference memory. As we will see later, loads in particular raise some special problems, which we will deal with separately. By requiring all operands to be in registers for other instructions, such as arithmetic and logical operations, the process of fetching operands is reduced to referencing the necessary registers. Again, this is a much simpler process than interpreting the complex addressing modes of the CISC processors. With these kinds of simplifications, it is not only practical but also relatively straightforward to pipeline the execution of the successive phases. Suppose we have the following sequence of operations:
ADD R1, R2, R3 ; instruction 1 (result into R1)
ADD R4, R5, R6 ; instruction 2
ADD R7, R8, R9 ; instruction 3
ADD R10, R11, R12 ; instruction 4
The state of the pipeline just as the fourth instruction is being fetched, and with the first three instructions in various stages of execution, is as follows:
1. The IF phase is loading instruction 4.
2. The ID/OF phase is interpreting instruction 3 and fetching its operands (R8 and R9).
3. The OP phase is executing instruction 2 (adding the contents of R5 and R6).
4. The WB phase is writing back the result from instruction 1 (the sum of R2 and R3 to be stored in R1).
Although each instruction takes 4 clocks to execute, the throughput is one instruction per clock, since a new instruction can enter the IF phase on each clock. As long as the pipeline is not disturbed, this throughput can be maintained, and the RISC goal of executing one instruction per clock is achieved.
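The overlap can be visualized with a simple timing chart (our own reconstruction of the idea behind Fig. 3.2; the clock numbering is illustrative):

Clock:          1      2      3      4      5      6      7
Instruction 1:  IF     ID/OF  OP     WB
Instruction 2:         IF     ID/OF  OP     WB
Instruction 3:                IF     ID/OF  OP     WB
Instruction 4:                       IF     ID/OF  OP     WB

Reading across a row shows that each instruction still takes four clocks from fetch to write back; reading down a column shows all four stages busy with four different instructions, which is where the one-instruction-per-clock throughput comes from.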
Keeping the Pipeline Flowing. In the example just above, the four ADD instructions were completely independent. This is not always the case, since simple-minded code frequently uses values just computed in subsequent instructions. A more typical example of a sequence of instructions would be
ADD R1, R2, R3 ; instruction 1 (sets R1)
ADD R5, R1, R4 ; instruction 2 (uses R1, sets R5)
ADD R7, R5, R6 ; instruction 3 (uses R5, sets R7)
ADD R8, R7, R8 ; instruction 4 (uses R7, sets R8)
In this sequence, each instruction needs results from the previous instruction. Looking again at Fig. 3.2 it seems that this would hold up the pipeline, because the operand fetch phase of each instruction execution seems to need the result stored by the write back phase of the previous instruction, and the result isn't there soon enough.
This problem is solved by using a technique called forwarding, in which the result from one operation is immediately fed into the next operation in the special case where the output register of one operation is the same as one of the input registers of the next operation. Since the instruction formats are simple, this case can be detected early on, and arrangements can be made to forward the necessary result so that it does not become necessary to hold up the pipeline. Logically what we have is an extra connection between pipeline stages (Fig. 3.3). By forwarding results in this manner, the pipeline can be kept flowing at maximum speed, so such "conflicts" in the use of registers do not reduce the instruction throughput.
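As a rough software model of the forwarding check (a sketch only; real forwarding is done by dedicated bypass wiring, and the three-address instruction representation here is our own), consider:

# Minimal sketch of operand forwarding, assuming a simple three-address
# instruction form (dest, src1, src2). Hardware performs this comparison
# with bypass paths; this just models the decision.

def read_operands(instr, registers, prev_instr, prev_result):
    """Fetch src1/src2, forwarding prev_result when the previous
    instruction's destination matches one of our sources."""
    def value(reg):
        if prev_instr is not None and reg == prev_instr["dest"]:
            return prev_result          # bypass: take the result off the ALU output
        return registers[reg]           # normal path: read the register file
    return value(instr["src1"]), value(instr["src2"])

# Example: ADD R1, R2, R3 followed by ADD R5, R1, R4.
regs = {f"R{i}": i for i in range(8)}   # toy register file: R2=2, R3=3, R4=4, ...
i1 = {"dest": "R1", "src1": "R2", "src2": "R3"}
i2 = {"dest": "R5", "src1": "R1", "src2": "R4"}
r1 = regs["R2"] + regs["R3"]            # i1's result, not yet written back
a, b = read_operands(i2, regs, i1, r1)  # a comes from the bypass, not regs["R1"]
print(a + b)                            # 9 = (2 + 3) + 4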
Loads and the Pipeline. Load instructions are different from other instructions in one important respect: the data may not be immediately available in a register when it is requested, since there may be a delay of several cycles before the data arrives. It is certainly not economically practical to build large memories that will yield data fast enough to keep a pipeline supplied if the data is needed immediately. As time goes on the technology for memories improves, and faster memory has become cheaper. However, it is a good guess that processor technology will always march along as fast as or even faster than memory technology, so the discrepancy will always exist. Consider the following sequence of instructions:
L R2, CLUNK
L R3, JUNK
ADD R4, R2, R3
A key part of the RISC approach is to make heavy use of fast caches to reduce effective memory access time, but even if JUNK is in the cache it is still quite difficult to make the data immediately available to the following instruction. If we could guarantee that there would always be a sufficient number of intervening instructions between the load and the use of the data to allow the data to arrive, then the pipelined execution could proceed without delay. One approach would be to require that compilers (and humans writing in assembly language) ensure that a sufficient number of instructions do intervene, inserting NOP (no operation) instructions if necessary. The trouble with this approach is that we would be forced to consider the worst case (a cache miss with the slowest possible memory) all the time. Such assumptions are much too pessimistic because most data references will result in cache hits and the data will be available much earlier. If compilers were forced to assume the worst case all the time, the code generated would be severely and unnecessarily degraded. The most common approach, used on almost all RISC processors, is to implement an interlock scheme. Whenever a value is loaded from memory, a Scoreboard is used to keep track of the register into which the value is being loaded so that it cannot be used until the data arrives. If the register is referenced before the data arrives, the pipeline is held up until the data is available. If, on the other hand, the data arrives before it is needed, the register is unmarked in the Scoreboard, and the pipeline can proceed.
This extra step of checking that the data is available is just the sort of complication that we prefer to avoid in a pipelined design, where we want to keep the complexity of the hardware at a minimum. Nevertheless, we do need some solution to the problem of data arriving late.
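In software terms the interlock can be sketched roughly as follows (a toy model; the latency numbers are invented for illustration and do not describe any particular chip):

# Toy model of a load Scoreboard: a register is marked busy when a load
# is issued and unmarked when the data arrives; any instruction that
# reads a busy register stalls the pipeline until then.

class Scoreboard:
    def __init__(self):
        self.busy = {}                      # register -> cycle when its data arrives

    def issue_load(self, reg, now, latency):
        self.busy[reg] = now + latency      # mark the destination register busy

    def stall_cycles(self, regs, now):
        """Cycles the pipeline must wait before these registers are readable."""
        arrivals = [self.busy[r] for r in regs if r in self.busy]
        return max([a - now for a in arrivals] + [0])

sb = Scoreboard()
sb.issue_load("R2", now=0, latency=1)        # cache hit: data ready at cycle 1
sb.issue_load("R3", now=1, latency=6)        # cache miss: data ready at cycle 7
print(sb.stall_cycles(["R2", "R3"], now=2))  # an ADD reading R2 and R3 at cycle 2 stalls 5 cycles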
Among the microprocessors that we examine in detail, one, the MIPS, takes a slightly different approach. It represents a partial attempt to implement the method of having the compiler insert sufficient instructions to guarantee that the data is available. On the MIPS, if the data is in the cache, then it is guaranteed to appear in the appropriate register one instruction later. Consider the following piece of code with no NOPs inserted.
L R3, GUNK; load GUNK into R3
ADD R4, R3, 1
The ADD instruction in this sequence appears too early, since even if there is a cache hit, GUNK will not arrive in time. The first instruction requests that GUNK be loaded into R3, but even if the variable GUNK happens to be in the data cache, it will require one additional cycle to be loaded. However, if one NOP is inserted:
L R3, GUNK
NOP
ADD R4, R3, 1
then as long as GUNK is in the cache, the data is guaranteed to be available in time for the ADD without the need to have the hardware check. This position in the instruction sequence is known as a load delay slot.
Of course, we have to ask what happens if GUNK is not in the cache. In this case the pipeline is held up right away, at the point of the load, so that the data is definitely available on time. Note that this contrasts with the interlock method in that the pipeline is unconditionally held up on a cache miss.
Store Instructions. Store instructions do not pose the same difficulties as loads, in that the processor does not need to wait around for the result of a store before subsequent computations can complete. However, consider the sequence
ST R1, A; store R1 to A
L R2, A; reload A into R2
We do have to be careful that the value of A loaded into R2 is the value just stored from R1 and not some previous value of A.
To maintain the one clock per instruction performance, this requires some special logic in the load/store circuits that detects this case and handles it properly. It is not always easy to provide this extra logic, so an alternative approach is to hold things up until the store completes. In particular, some early versions of the SPARC take this approach and suffer as a result from a rather lengthy 3-clock store instruction.
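The case-detection logic can be pictured along the following lines (a simplified sketch of the idea only; real designs use store buffers with hardware address comparators, and the addresses and values here are made up):

# Sketch of store-to-load bypassing: if a load's address matches a store
# that has not yet reached memory, the load takes its value from the
# pending store instead of from memory.

pending_store = {"addr": 0x1000, "value": 42}   # ST R1, A still in flight
memory = {0x1000: 7}                            # stale value of A in memory

def load(addr):
    if pending_store is not None and pending_store["addr"] == addr:
        return pending_store["value"]   # bypass: forward the value being stored
    return memory[addr]                 # no conflict: normal memory read

print(load(0x1000))   # 42, the value just stored, not the stale 7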
Instruction Scheduling. In the case of the MIPS approach, we have to put at least one instruction between the load and the use of the loaded data. In the previous example, we showed a NOP as the inserted instruction, but of course it is much better to do something useful than to waste time doing nothing! Often it is possible for the compiler to rearrange instructions so that the requirement that there be an intervening instruction is met without needing to insert a NOP. For example, suppose that the initial code is something like
L R2, JUNK
L R3, GUNK
ADD R4, R2, R3
L R5, XYZ
L R6, DEF
ADD R7, R5, R6
We assume for this example that the processor allows multiple loads to proceed simultaneously. This is not true on all RISC processors. Instruction scheduling is often quite processor-dependent and will be governed by a detailed set of rules indicating what overlaps and delays apply to a particular processor. Instead of inserting a NOP after the instructions that load GUNK and DEF, it is possible to rearrange the instructions as follows:
L R2, JUNK
L R3, GUNK
L R5, XYZ
L R6, DEF
ADD R4, R2, R3
ADD R7, R5, R6
With the instructions rearranged, NOP instructions are not needed, since the new sequence, which is functionally equivalent to the initial sequence, meets the requirement of not using data immediately after it is loaded. Of course, the compiler must be sure that any rearrangement does not affect the meaning of the program; this is the sort of condition that compilers are used to dealing with in other kinds of optimization. Using the MIPS approach, only minimal scheduling is required: enough to fill each delay slot, inserting no-ops where nothing useful can be found. With the interlock approach, scheduling is not required to keep the program correct, but it matters more for efficiency. In the MIPS case, there is no point in moving the reference further than one instruction from the load, but in the interlock approach, if a cache miss occurs, it pays to have as much separation as possible between the load and the reference. Looking at our rearranged code in the previous example, the MIPS conditions were met, and for the MIPS there is no point in doing anything more. However, for an interlock approach, the instruction ADD R7, R5, R6 occurs rather soon after the load for R6. If DEF is in the cache, there is no problem, but if we get a cache miss in loading DEF, then this ADD will interlock. So in the interlock case, the compiler should look for additional instructions to insert before the ADD. Of course, there is no point in inserting NOP instructions, but if some useful instructions can be found to insert, then the performance in the case of a cache miss will be improved.
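The flavor of this rearrangement can be captured in a small sketch (a toy scheduler for a single load delay slot; real compilers schedule over dependence graphs with per-processor latency tables, and the instruction representation here is our own):

# Toy model of filling a load delay slot by code rearrangement. Each
# instruction is described by the registers it reads and writes; memory
# operands are ignored for simplicity.

def conflicts(a, b):
    # True if a and b cannot be reordered relative to one another.
    return bool(set(a["reads"]) & set(b["writes"]) or
                set(a["writes"]) & set(b["reads"]) or
                set(a["writes"]) & set(b["writes"]))

def fill_delay_slot(code, i):
    """If code[i] is a load whose result is used by the very next
    instruction, try to move a later, independent instruction between
    them; otherwise return the code unchanged."""
    load, use = code[i], code[i + 1]
    if load["op"] != "L" or load["writes"][0] not in use["reads"]:
        return code                      # no hazard at this position
    for j in range(i + 2, len(code)):
        cand = code[j]
        movable = (not conflicts(cand, load) and
                   all(not conflicts(cand, code[k]) for k in range(i + 1, j)))
        if movable:
            # Hoist cand into the delay slot, keeping everything else in order.
            return code[:i + 1] + [cand] + code[i + 1:j] + code[j + 1:]
    return code                          # nothing movable: a NOP would be needed

code = [
    {"op": "L",   "reads": [],           "writes": ["R2"]},  # L   R2, JUNK
    {"op": "L",   "reads": [],           "writes": ["R3"]},  # L   R3, GUNK
    {"op": "ADD", "reads": ["R2", "R3"], "writes": ["R4"]},  # ADD R4, R2, R3
    {"op": "L",   "reads": [],           "writes": ["R5"]},  # L   R5, XYZ
]
print([ins["writes"][0] for ins in fill_delay_slot(code, 1)])
# ['R2', 'R3', 'R5', 'R4'] -- the load of R5 now separates GUNK's load from its use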
Branching and Branch Delays. Now consider a program for adding a list of n numbers, a generalization of the program outlined in Fig. 3.4,a. The addresses of the memory locations containing the n numbers are symbolically given as NUM1, NUM2, ..., NUMn, and the resulting sum is to be placed in memory location SUM. Instead of using a long list of Add instructions, it is possible to place a single Add instruction in a program loop and arrange to have it executed the required number of times. This is illustrated in Fig. 3.4,b. The fundamental idea in program loops is to cause a straight-line sequence of instructions to be executed repeatedly. The number of repetitions, which obviously depends on the problem, must be controlled by means of additional instructions. In the above example, register R1 is used as a counter. It is initially loaded with the value n from memory location N. The instruction
Decrement R1
decrements the contents of R1 by 1 each time through the loop. We now introduce the concept of a branch instruction. To cause a return to the first instruction of the loop, the conditional branch instruction
Branch>0 LOOPSTART
is used. The effect of executing such an instruction is as follows. Branching takes place to the instruction at the location indicated if the specified condition is fulfilled; otherwise, straight-line sequencing of instructions continues. The mechanics of causing a branch during the execution of a conditional branch instruction are simple. If it is determined that branching is to take place, the PC is loaded with the address named in the instruction; otherwise the PC is incremented in the usual way. Branch conditions are usually related to the result of an arithmetic or logic operation. In our example, the desired condition is that the result of the most recent arithmetic operation is greater than 0. Since the Decrement instruction immediately precedes the conditional branch instruction, this means that a branch to LOOPSTART occurs as long as the contents of R1 remain greater than 0. After the loop has been executed n times, R1 will have been decremented to 0, and the branch to LOOPSTART will not occur. Instead, the next instruction moves the final result from R0 into the memory location SUM, and the program execution stops with the execution of the Halt instruction.
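Putting the pieces together, the loop of Fig. 3.4,b (not reproduced here) might be written roughly as follows; the exact mnemonics are our assumptions, in the same style as the instructions quoted above, and the stepping of the Add instruction's address field is discussed in the next paragraph:

           Load      N, R1         ; load the count n into counter R1
           Clear     R0            ; R0 will accumulate the sum
LOOPSTART: Add       NUM1, R0      ; add the next number (address field stepped each pass)
           Decrement R1            ; one repetition completed
           Branch>0  LOOPSTART     ; branch back while R1 > 0
           Move      R0, SUM       ; store the final sum
           Halt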
The capability for testing conditions, and subsequently choosing one of a set of alternative ways to continue computation, has many more applications than just loop control. This capability is reflected in the instruction set of all computers and is fundamental to the programming of most nontrivial tasks. Until now, we have considered only instructions where operand addresses are explicitly given within the instruction. This simple addressing mechanism means that the address field of the Add instruction in the specified block of instructions in Fig. 3.4,b must be modified in successive passes through the loop. It is possible to do this by adding 1 to the address field of the Add instruction. This is required if a single instruction must reference successive entries in the list of numbers.

Another important problem associated with the use of an instruction pipeline occurs when a branch is encountered in an executing program. There are two aspects to this problem. First, the instruction following the branch has already entered the pipeline by the time the processor realizes it has a branch on its hands. Second, the instruction that is needed next will not have been loaded by the instruction fetch phase in time to keep the pipeline running at full speed.
Real programs contain a surprisingly high proportion of jump and call instructions; as many as 20% of the executing instructions are branches in many programs. This means that execution efficiency of branches is important and must be addressed as part of the RISC design if the goal of one instruction per clock is to be approached. The most common technique used on RISC machines for dealing with the branch problem is called a delayed branch (or a delayed jump). When a branch is taken, since the instruction immediately following the branch is already in the pipeline, rather than sending a message along saying "Hey, send that instruction back, it shouldn't have been put on the conveyer belt," it is allowed to proceed and be fully executed.
The instruction fetch phase realizes that the flow of control has been changed by the second instruction after the branch and can load the target instruction in time for it to enter the pipeline one instruction later. To see how this works, consider the following sequence of instructions.
L R9, 4 ; load the value 4 into R9
JMP LBL ; delayed branch: the next instruction still executes
ADD R9, 1 ; in the delay slot, so R9 becomes 5
ADD R9, 1 ; skipped: control has already transferred to LBL
LBL:
L R3, R9 ; R3 receives the value of R9, that is, 5
When the JMP instruction is interpreted, the first ADD is already in the pipeline and is allowed to proceed. The next instruction to enter the pipeline is the load instruction just after the label LBL, and so the final result stored in R3 is 5. We should note that the description in this module is a little idealized. Not all RISC processors can guarantee 1-clock execution of jumps, but this goal certainly can be achieved and is achieved by at least some of the processors that we consider in this book. Some of the processors we will look at, notably the SPARC, provide a feature to deal with this problem. In the case of other processors such as the MIPS, the problem of filling branch delay slots can be tricky.
Register Structure of the CPU. In keeping with RISC principles, the register structure of the CPU is far simpler than that of typical CISC processors. The general structure of the MIPS chip is shown in Fig. 3.5. There are thirty-two 32-bit registers, which, with two exceptions, are treated in a uniform way by the instruction set. The sole exceptions are that register 0 always contains the value 0, while the link register, register 31, is used for procedure linkage. The only other CPU registers on the machine are the program counter (PC) and two registers, HI and LO, that are used in multiplication and division. The coprocessor CP0 has a separate set of registers whose use is related to memory management and exception handling.
Even though all the registers are identical in terms of the hardware (other than the two exceptions noted above), the MIPS User's Manual does suggest some software conventions as to how they should be used. Register R29, for example, is used by convention as the stack pointer. Registers R8 through R15 are used by a compiler to store temporary results whose values are not preserved across procedure calls.
Without such clearly stated conventions, one possible disadvantage of having completely indistinguishable registers is that there is a potential for software anarchy. If the implementers of a C compiler and a Pascal compiler, for example, choose to use different registers as the stack pointer, someone implementing software using more than one language will have trouble, for example, in calling procedures written in C from Pascal. By establishing appropriate software conventions as an adjunct to the hardware design, these kinds of incompatibilities are avoided. Software conventions such as these often appear in hardware reference manuals even though, from a purely formal point of view, they have nothing to do with the hardware.
Even though this flexibility exists in the hardware, such conventions will nevertheless become embedded in operating systems and other software in such a way that from a programmer's view they have almost the same status as a hardware convention. For example, the IBM 370 has registers that are largely indistinguishable from each other. Very early on, however, IBM committed itself to a badly designed, but standard, calling sequence. The fact that compilers designed for the 370 have had to meet these conventions has resulted in a higher procedure call overhead than necessary. Programmers have been forced to live with this design for a very long time.
The Instruction Pipeline. The acronym MIPS was originally derived from the phrase Microprocessor without Interlocked Pipeline Stages. This acronym was used by the Stanford MIPS project but has been officially abandoned by MIPS Computer Systems, Inc. Nevertheless, we shall see that one of the features that distinguish the MIPS from the other RISC processors described in this text is a reliance on software conventions rather than hardware interlocks in handling load delay slots. The MIPS uses a five-stage pipeline (shown in Fig. 3.6). This pipeline consists of the following stages (we assume in this description that no stalls occur):
• During the instruction fetch stage (IF), the instruction address is calculated and the instruction is then loaded from the instruction cache.
• During the read stage (RD), any register operands required by the instruction are read from the CPU registers; instruction decoding is overlapped with this phase.
• During the ALU phase, one of two different operations can occur. In the case of a general computational instruction, the operation is actually performed on the operands, which must be in registers. In the case of load and store instructions, the effective address calculation is performed.
• During the memory (MEM) phase, any operands that are in memory are accessed if they are required by the instruction.
• The writeback (WB) phase writes back the ALU result or the result of a load to the appropriate register.
An instruction does not really take a single cycle, but rather several overlapped cycles (see Fig. 3.7), five in the case of the MIPS; that is, the latency of an instruction is 5 cycles, but the throughput can approach one instruction per cycle. Remember that two things could potentially affect the instruction pipeline: instructions that change the flow of the program (jumps and branches) and data dependencies between instructions when values are being loaded into registers. The data dependency disruptions in the pipeline are solved in most RISC machines by having the hardware interlock the instructions; that is, a Scoreboard is used to record which registers are busy, and a subsequent instruction requiring a busy register is delayed until the data arrives.
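In general, if a k-stage pipeline executes n instructions with no stalls, the total time is k + (n - 1) clocks: k clocks for the first instruction to flow through all the stages, plus one additional clock for each instruction after it. For the five-stage MIPS pipeline, 100 instructions therefore take 5 + 99 = 104 clocks, or about 1.04 clocks per instruction, which is why throughput approaches one instruction per cycle even though the latency of each instruction is 5 cycles.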
The feature that distinguishes the MIPS from other RISC processors is that it avoids the need for implementing hardware interlocks by shifting the responsibility from the hardware to the software. Compilers and assemblers for the MIPS must be sure to fill a load delay slot either with a useful instruction or, in the worst case, with a no-op, since there is no hardware interlock. In addition, compilers and assemblers must make sure that all instructions placed after a branch can be executed whether the branch is taken or not. A typical instruction to load a word from memory into a register is written as: LW R2, A
A rule in the MIPS reference manual states: All load operations have a latency of one instruction. That is, the instruction immediately following a load cannot use the contents of the register which will be loaded with the data being fetched from storage. An exception is that the target register for the load word left and load word right instructions may be specified as the same register used as the destination of a load instruction that immediately precedes it.
It is hard to tell whether this simply gives advice or states a definite rule. Such a warning should probably be read to mean that the effect of referencing R2 in the next instruction is undefined. Perhaps you get the old value, or some rubbish value, or some other strange behavior. The behavior may be predictable, but the hardware designer does not care to tell you what happens, so you must avoid the problem.
The Instruction Set. The MIPS, in keeping both the number of instructions and the number of instruction formats to a minimum, is quintessentially RISC. This should not come as much of a surprise, since the MIPS is, after all, the descendant of the original Stanford MIPS, one of the machines that in some sense define what it means to be RISC.

The register-type (abbreviated R-type) format contains the standard 6-bit opcode field, followed by three 5-bit register fields, rs, rt, and rd. The last two fields of this format, shamt (for Shift Amount) and funct (for Function), have specialized uses in connection with certain instructions. The Shift Left Logical (SLL) instruction, for example, uses the shamt field to specify the number of bits by which to shift.

The immediate-type (I-type) instruction format has four fields and may be thought of as a variant of the R-type format that is used when an immediate operand is needed. This format consists of a 6-bit opcode field, a 5-bit target register field rt, a 5-bit source register field rs, and a 16-bit immediate field. The opcode determines how these fields are used. As an example, consider the load and store instructions, all of which fall into this format. To load a value from memory into the target register rt, the register specified by rs is used as a base address into memory, and the immediate field is used as an offset from that base. The Add Immediate instruction, on the other hand, uses the immediate field to hold a 16-bit signed constant that is then added to the contents of the source register, with the result placed in the target register. The branch instructions use the immediate field as a branch displacement from the current program counter location.

The jump-type (J-type) instruction format also uses a 6-bit opcode field, which is followed by a 26-bit target field that specifies the address to which to jump. Because instructions are all 32 bits long, this format uses the somewhat standard trick of extending the range of a jump by appending two zero bits to the end of the target field. This can be done because all instructions are required to be aligned on a 4-byte boundary, and therefore the two low-order bits of an instruction address are always 0. This means that the effective range of addressing is plus or minus 2^27 bytes. Even though the only instructions that use this format are the call instructions and the unconditional jumps, those are the two crucial instances where this kind of full addressing capability is needed.
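The field layout just described is simple enough that extraction reduces to shifts and masks. The following sketch is ours, not MIPS code; the opcode values shown for LW, J, and JAL are the standard MIPS-I encodings:

# Sketch of MIPS-I instruction field extraction, using the field widths
# described above: R-type (op, rs, rt, rd, shamt, funct), I-type
# (op, rs, rt, 16-bit immediate), J-type (op, 26-bit target).

def decode(word):
    op = (word >> 26) & 0x3F             # 6-bit opcode
    if op == 0:                          # opcode 0 marks R-type on MIPS
        return {"fmt": "R",
                "rs": (word >> 21) & 0x1F,
                "rt": (word >> 16) & 0x1F,
                "rd": (word >> 11) & 0x1F,
                "shamt": (word >> 6) & 0x1F,
                "funct": word & 0x3F}
    if op in (0x02, 0x03):               # J and JAL use the J-type format
        # 26-bit target shifted left 2: instructions are word-aligned
        return {"fmt": "J", "op": op, "target": (word & 0x03FFFFFF) << 2}
    imm = word & 0xFFFF
    if imm & 0x8000:                     # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {"fmt": "I", "op": op,
            "rs": (word >> 21) & 0x1F,
            "rt": (word >> 16) & 0x1F,
            "imm": imm}

# Example: LW R2, 8(R4) encodes as op=0x23, rs=4, rt=2, imm=8.
print(decode((0x23 << 26) | (4 << 21) | (2 << 16) | 8))
# {'fmt': 'I', 'op': 35, 'rs': 4, 'rt': 2, 'imm': 8}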
Interrupts. Hardware interrupts are handled like any other exception, by branching to the general exception vector. There are six hardware interrupt lines, which can be individually masked and unmasked by setting and resetting bits in the status register. The cause register contains a code indicating that a hardware interrupt has occurred, and the trap routine reacts by processing the hardware interrupt, communicating with the corresponding device as needed.