1. Why has the capability of realizing a stack data structure in the main memory become a common feature in modern computers?
2. Is it worthwhile to design computers that are stack-oriented to a great extent?
3. Might it be advantageous to have a machine in which the stack structure is its dominant feature?
4. Which of the following technical decisions is more promising: building a stack in the main memory or organizing a stack using a set of registers?
5. What two approaches may be used to build a stack from a set of registers?
6. How are hardware registers used in the HP-computer?
7. What peculiarities do stack markers used in the HP-computer have?
8. What basic strategy is used in stack computers?
9. What format do "Stack" instructions have in the HP-computer?
10. What are the architecture peculiarities of the FRISC-3?
11. What are the architecture peculiarities of the SF1?
12. What does the SF1 have two busses for?
13. What high level languages are supported by the SF1?
14. What is the instruction counter used for in stack computers?
15. What purposes is the FRISC-3 designed for?
4.3. CISC + RISC = PENTIUM
The Pentium processor is the next generation member of the Intel x86 microprocessor family. The Pentium processor implements several enhancements to increase performance. The two instruction pipelines and floating-point unit on the Pentium processor are capable of independent operation. Each pipeline issues frequently used instructions in a single clock. Together, the dual pipes can issue two integer instructions in one clock, or one floating-point instruction (under certain circumstances, two floating-point instructions) in one clock. Branch prediction is implemented in the Pentium processor. To support this, the Pentium processor implements two prefetch buffers, one to prefetch code in a linear fashion, and one that prefetches code according to the Branch Target Buffer (BTB), so that the needed code is almost always prefetched before it is needed for execution [25], [28], [44], [60], [61], [62].
The Pentium processor includes separate code and data caches integrated on chip to meet its performance goals. Each cache is 8 Kbytes in size, with a 32-byte line size, and is 2-way set associative. Each cache has a dedicated Translation Lookaside Buffer (TLB) to translate linear addresses to physical addresses. The data cache is configurable to be writeback or writethrough on a line-by-line basis and follows the MESI protocol. The data cache tags are triple ported to support two data transfers and an inquire cycle in the same clock. The code cache is an inherently write-protected cache. The code cache tags are also triple ported to support snooping and split line accesses. Individual pages can be configured as cacheable or non-cacheable by software or hardware. The caches can be enabled or disabled by software or hardware. The Pentium processor has increased the data bus to 64 bits to improve the data transfer rate. Burst read and burst writeback cycles are supported by the Pentium processor. In addition, bus cycle pipelining has been added to allow two bus cycles to be in progress simultaneously. The Pentium processor Memory Management Unit contains optional extensions to the architecture which allow 4-Mbyte page sizes [3], [4], [34].
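The cache geometry stated above (8 Kbytes, 32-byte lines, 2-way set associative) implies 128 sets per cache, so an address splits into a 5-bit line offset, a 7-bit set index, and a tag made of the remaining upper bits. The following C fragment is a minimal sketch of that arithmetic only. The type CacheAddr and the function split_address are hypothetical names introduced for illustration, and the sketch does not claim to show which address bits the actual hardware uses.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry taken from the text: 8-Kbyte cache, 32-byte lines, 2-way. */
enum {
    CACHE_BYTES = 8 * 1024,
    LINE_BYTES  = 32,
    WAYS        = 2,
    NUM_SETS    = CACHE_BYTES / (LINE_BYTES * WAYS)   /* = 128 */
};

/* Hypothetical helper type: the three fields of a cache address. */
typedef struct {
    uint32_t tag;     /* upper bits, compared against the stored tags */
    uint32_t set;     /* selects one of the 128 sets                  */
    uint32_t offset;  /* byte position inside the 32-byte line        */
} CacheAddr;

/* Split a 32-bit address into tag, set index, and line offset. */
static CacheAddr split_address(uint32_t addr)
{
    CacheAddr a;
    a.offset = addr % LINE_BYTES;
    a.set    = (addr / LINE_BYTES) % NUM_SETS;
    a.tag    = addr / (LINE_BYTES * NUM_SETS);
    /* Sanity check: the three fields reassemble the original address. */
    assert((a.tag * NUM_SETS + a.set) * LINE_BYTES + a.offset == addr);
    return a;
}
```

Two addresses whose set fields are equal compete for the same two ways, which is why the set index, rather than the full address, determines where a line may be placed.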
The Pentium processor has added significant data integrity and error detection capability. Data parity checking is still supported on a byte-by-byte basis. Address parity checking and internal parity checking features have been added, along with a new exception, the machine check exception. In addition, the Pentium processor has implemented functional redundancy checking to provide maximum error detection of the processor and the interface to the processor. When functional redundancy checking is used, a second processor, the "checker", executes in lock step with the "master" processor. The checker samples the master's outputs, compares those values with the values it computes internally, and asserts an error signal if a mismatch occurs.
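The comparison performed by the checker can be pictured with a very small software model. The sketch below assumes that the master's sampled outputs and the checker's internally computed values are available as two arrays with one entry per clock; the function name frc_first_mismatch and this array representation are illustrative assumptions and do not describe the actual pin-level mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first clock at which the value computed by the
 * checker differs from the output sampled from the master, or -1 if the
 * two sequences agree on every clock (no error signal would be asserted). */
static long frc_first_mismatch(const uint32_t *master_out,
                               const uint32_t *checker_out,
                               size_t clocks)
{
    assert(master_out != NULL && checker_out != NULL);
    for (size_t i = 0; i < clocks; i++) {
        if (master_out[i] != checker_out[i])
            return (long)i;   /* mismatch: error would be signalled here */
    }
    return -1;
}
```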
As more and more functions are integrated on chip, the complexity of board level testing is increased. To address this, the Pentium processor has increased test and debug capability. The Pentium processor implements IEEE Boundary Scan (Standard 1149.1). In addition, the Pentium processor has specified 4 breakpoint pins that correspond to each of the debug registers and externally indicate a breakpoint match. Execution tracing provides external indications when an instruction has completed execution in either of the two internal pipelines, or when a branch has been taken. System management mode has been implemented along with some extensions to the SMM architecture.
Fig. 4.10 shows a block diagram of the Pentium processor. The block diagram shows the two instruction pipelines, the "u" pipe and the "v" pipe. The u-pipe can execute all integer and floating-point instructions. The v-pipe can execute simple integer and floating-point instructions. The separate caches are shown, the code cache and data cache. The data cache has two ports, one for each of the two pipes (the tags are triple ported to allow simultaneous inquire cycles). The data cache has a dedicated Translation Lookaside Buffer (TLB) to translate linear addresses to the physical addresses used by the data cache. The code cache, branch target buffer and prefetch buffers are responsible for getting raw instructions into the execution units of the Pentium processor. Instructions are fetched from the code cache or from the external bus. Branch addresses are remembered by the branch target buffer. The code cache TLB translates linear addresses to physical addresses used by the code cache.
The decode unit decodes the prefetched instructions so that the Pentium processor can execute them. The control ROM contains the microcode which controls the sequence of operations that must be performed to implement the Pentium processor architecture. The control ROM unit has direct control over both pipelines. The Pentium processor contains a pipelined floating-point unit that provides a significant floating-point performance advantage over previous generations of Intel x86 processors.
Component Operation. The Pentium processor has an optimized superscalar micro-architecture capable of executing two instructions in a single clock. A 64-bit external bus, separate 8-Kbyte data and instruction caches, write buffers, branch prediction, and a pipelined floating-point unit combine to sustain the high execution rate. These architectural features and their operation are discussed in this chapter.
Pipeline and Instruction Flow. As in the Intel486 CPU, integer instructions traverse a five-stage pipeline. The pipeline stages are as follows: PF - Prefetch; D1 - Instruction Decode; D2 - Address Generate; EX - Execute (ALU and Cache Access); WB - Writeback. Fig. 4.11 shows how instructions move through the Intel486 CPU pipeline. Unlike the Intel486 microprocessor, the Pentium processor is a superscalar machine capable of executing two instructions in parallel. Two five-stage pipelines operate in parallel, allowing integer instructions to execute in a single clock in each pipeline. Fig. 4.12 depicts instruction flow in the Pentium processor. The pipelines in the Pentium processor are called the "u" and "v" pipes, and the process of issuing two instructions in parallel is termed "pairing." The u-pipe can execute any instruction in the Intel architecture, while the v-pipe can execute "simple" instructions as defined in the "Instruction Pairing Rules" Section of this chapter. When instructions are paired, the instruction issued to the v-pipe is always the next sequential instruction after the one issued to the u-pipe.
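The benefit of pairing can be estimated with simple arithmetic. In an idealized five-stage pipeline with no stalls, the first issue slot needs five clocks to reach the end of WB and every later issue slot completes one clock after the previous one; each pair shares a single issue slot. The following C sketch computes that idealized figure. The function name pipeline_clocks and the no-stall assumption are ours and are not taken from the processor documentation.

```c
#include <assert.h>

enum { PIPELINE_STAGES = 5 };   /* PF, D1, D2, EX, WB */

/* Idealized clock count for n instructions of which `pairs` pairs were
 * issued together (two instructions per issue slot). Assumes no stalls,
 * no prefixes, and no mispredicted branches. */
static unsigned long pipeline_clocks(unsigned long n, unsigned long pairs)
{
    assert(2 * pairs <= n);               /* a pair consumes two instructions */
    unsigned long issue_slots = n - pairs;
    if (issue_slots == 0)
        return 0;
    return issue_slots + (PIPELINE_STAGES - 1);
}
```

For ten instructions, pipeline_clocks(10, 0) gives 14 clocks for a single pipeline, while pipeline_clocks(10, 5) gives 9 clocks when every instruction pairs.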
Pentium Processor Pipeline Description and Improvements. The Pentium processor pipeline has been optimized to achieve higher throughput. The first stage of the pipeline is the Prefetch (PF) stage, in which instructions are prefetched from the on-chip instruction cache or memory. Because the Pentium processor has separate caches for instructions and data, prefetches no longer conflict with data references for access to the cache. If the requested line is not in the code cache, a memory reference is made. In the PF stage, two independent pairs of line-size (32-byte) prefetch buffers operate in conjunction with the branch target buffer. This allows one prefetch buffer to prefetch instructions sequentially, while the other prefetches according to the branch target buffer predictions. The prefetch buffers alternate their prefetch paths. The next pipeline stage is Decode1 (D1), in which two parallel decoders attempt to decode and issue the next two sequential instructions. The Pentium processor requires an extra D1 clock to decode instruction prefixes. Prefixes are issued to the u-pipe at the rate of one per clock without pairing. After all prefixes have been issued, the base instruction is issued and paired according to the pairing rules. The one exception to this is that the Pentium processor will decode near conditional jumps (long displacement) in the second opcode map (0Fh prefix) in a single clock in either pipeline.
The D1 stage is followed by Decode2 (D2), in which the addresses of memory-resident operands are calculated, as in the Intel486 CPU. In the Intel486 CPU, instructions containing both a displacement and an immediate, or instructions containing a base and index addressing mode, require an additional D2 clock to decode. The Pentium processor removes both of these restrictions and is able to issue instructions in these categories in a single clock.
The Pentium processor uses the Execute (EX) stage of the pipeline for both ALU operations and for data cache access; therefore those instructions specifying both an ALU operation and a data cache access will require more than one clock in this stage. In EX all u-pipe instructions and all v-pipe instructions except conditional branches are verified for correct branch prediction. The final stage is Writeback (WB) where instructions are enabled to modify processor state and complete execution. In this stage v-pipe conditional branches are verified for correct branch prediction.
During their progression through the pipeline, instructions may be stalled due to certain conditions. Both the u-pipe and v-pipe instructions enter and leave the D1 and D2 stages in unison. When an instruction in one pipe is stalled, the instruction in the other pipe is also stalled at the same pipeline stage. Thus both the u-pipe and the v-pipe instructions enter the EX stage in unison. Once in EX, if the u-pipe instruction is stalled, then the v-pipe instruction (if any) is also stalled. If the v-pipe instruction is stalled, then the instruction paired with it in the u-pipe is allowed to advance. No successive instructions are allowed to enter the EX stage of either pipeline until the instructions in both pipelines have advanced to WB.
Branch Prediction and Instruction Prefetch. In the PF stage, two independent pairs of line-size (32-byte) prefetch buffers operate in conjunction with the branch target buffer. Only one prefetch buffer actively requests prefetches at any given time. Prefetches are requested sequentially until a branch instruction is fetched. When a branch instruction is fetched, the branch target buffer (BTB) predicts whether the branch will be taken or not. If the branch is predicted not taken, prefetch requests continue linearly. On a predicted taken branch, the other prefetch buffer is enabled and begins to prefetch as though the branch were taken. If a branch is discovered to be mispredicted, the instruction pipelines are flushed and prefetching activity starts over. The dynamic branch prediction algorithm speculatively runs code fetch cycles to addresses corresponding to instructions executed some time in the past. Such code fetch cycles are run based on past execution history, regardless of whether the instructions retrieved are relevant to the currently executing instruction sequence. One effect of the branch prediction mechanism is that the Pentium processor may run code fetch bus cycles to retrieve instructions which are never executed. Although the opcodes retrieved are discarded, the system must complete the code fetch bus cycle by returning BRDY#. It is particularly important that the system return BRDY# for all code fetch cycles, regardless of the address. Furthermore, it is possible that the Pentium processor may run speculative code fetch cycles to addresses beyond the end of the current code segment (CS). Although the Pentium processor may prefetch beyond the CS limit, it will not attempt to execute beyond the CS limit; it will raise a GP fault instead. Thus, segmentation cannot be used to prevent speculative code fetches to inaccessible areas of memory.
On the other hand, the Pentium processor never runs code fetch cycles to inaccessible pages, so the paging mechanism guards against both the fetch and execution of instructions in inaccessible pages.
For memory reads and writes, both segmentation and paging prevent the generation of bus cycles to inaccessible regions of memory.
Instruction Pairing Rules. The Pentium processor can issue one or two instructions every clock. In order to issue two instructions simultaneously, they must satisfy the following conditions: both instructions in the pair must be "simple" as defined below; there must be no read-after-write or write-after-write register dependencies between them; neither instruction may contain both a displacement and an immediate; instructions with prefixes (other than the 0F prefix of JCC instructions) can only occur in the u-pipe.
Simple instructions are entirely hardwired; they do not require any microcode control and, in general, execute in one clock. The exceptions are the ALU mem, reg and ALU reg, mem instructions, which are three- and two-clock operations respectively. Sequencing hardware is used to allow them to function as simple instructions. The following integer instructions are considered simple and may be paired: 1. mov reg, reg/mem/imm; 2. mov mem, reg/imm; 3. ALU reg, reg/mem/imm; 4. ALU mem, reg/imm; 5. inc reg/mem; 6. dec reg/mem; 7. push reg/mem; 8. pop reg; 9. lea reg, mem; 10. jmp/call/jcc near; 11. nop.
In addition, conditional and unconditional branches may be paired only if they occur as the second instruction in the pair. They may not be paired with the next sequential instruction. Also, SHIFT/ROT by 1 and SHIFT by imm may pair as the first instruction in a pair.
The register dependencies that prohibit instruction pairing include implicit dependencies via registers or flags not explicitly encoded in the instruction. For example, an ALU instruction in the u-pipe (which sets the flags) may not be paired with an ADC (Add with Carry) or an SBB (Subtract with Borrow) instruction in the v-pipe. There are two exceptions to this rule. The first is the commonly occurring sequence of compare and branch, which may be paired. The second exception is pairs of pushes or pops. Although these instructions have an implicit dependency on the stack pointer, special hardware is included to allow these common operations to proceed in parallel. Although in general two paired instructions may proceed in parallel independently, there is an exception for paired "read-modify-write" instructions. Read-modify-write instructions are ALU operations with an operand in memory. When two of these instructions are paired, there is a sequencing delay of two clocks in addition to the three clocks required to execute the individual instructions. Although instructions may execute in parallel, their behavior as seen by the programmer is exactly the same as if they were executed sequentially.
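The explicit pairing conditions listed in this section can be collected into a single predicate. The C sketch below is a simplified model under stated assumptions: the Insn structure, its field names, and the register bit masks are hypothetical, and only explicitly encoded general registers are tracked. Implicit dependencies through the flags or the stack pointer, together with the compare-and-branch and push/pop exceptions, are deliberately left out of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit masks for the explicitly encoded registers. */
enum { REG_EAX = 1 << 0, REG_ECX = 1 << 1, REG_EDX = 1 << 2, REG_EBX = 1 << 3 };

/* Hypothetical, simplified description of a decoded instruction. */
typedef struct {
    bool    simple;      /* belongs to the list of "simple" instructions   */
    bool    has_disp;    /* encodes a displacement                         */
    bool    has_imm;     /* encodes an immediate                           */
    bool    has_prefix;  /* has a prefix other than 0F on a JCC            */
    bool    is_branch;   /* jmp/call/jcc near                              */
    uint8_t reads;       /* mask of registers read                         */
    uint8_t writes;      /* mask of registers written                      */
} Insn;

/* True if u (first, u-pipe) and v (next sequential, v-pipe) may pair
 * according to the explicit rules in the text. */
static bool can_pair(const Insn *u, const Insn *v)
{
    assert(u != NULL && v != NULL);
    if (!u->simple || !v->simple)     return false;  /* both must be simple      */
    if (u->is_branch)                 return false;  /* branch only as second    */
    if (u->has_disp && u->has_imm)    return false;  /* disp + imm not allowed   */
    if (v->has_disp && v->has_imm)    return false;
    if (v->has_prefix)                return false;  /* prefixed: u-pipe only    */
    if (u->writes & v->reads)         return false;  /* read-after-write         */
    if (u->writes & v->writes)        return false;  /* write-after-write        */
    return true;
}
```

Applied to the inner loop discussed later in this section, the model accepts mov with add and cmp with jle, and rejects add followed by cmp because cmp reads the register that add writes.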
Branch Prediction Consideration. The Pentium processor uses a Branch Target Buffer (BTB) to predict the outcome of branch instructions, which minimizes pipeline stalls due to prefetch delays. The processor accesses the BTB with the address of the instruction in the D1 stage. In the event of a correct prediction, a branch will execute without pipeline stalls or flushes. Branches which miss the BTB are assumed to be not taken. Conditional and unconditional near branches and near calls execute in 1 clock and may be executed in parallel with other integer instructions. A mispredicted branch (whether a BTB hit or miss) or a correctly predicted branch with the wrong target address will cause the pipelines to be flushed and the correct target to be fetched. Incorrectly predicted unconditional branches will incur an additional three clock delay, incorrectly predicted conditional branches in the u-pipe will incur an additional three clock delay, and incorrectly predicted conditional branches in the v-pipe will incur an additional four clock delay. The benefits of branch prediction are illustrated in the following example. Consider the following loop from a benchmark program for computing prime numbers:
for (k=i+prime;k<=SIZE;k+=prime) flags[k]=FALSE;
A popular compiler generates the following assembly code:
(prime is allocated to ecx, k is allocated to edx, and al contains the value FALSE)
inner_loop:
mov byte ptr flags[edx],al
add edx, ecx
cmp edx, SIZE
jle inner_loop
Each iteration of this loop will execute in 6 clocks on the Intel486 CPU. On the Pentium processor, the mov is paired with the add, and the cmp with the jle. With branch prediction, each loop iteration executes in 2 clocks.
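How rarely a loop-closing branch such as the jle above is mispredicted can be illustrated with a small model of one BTB entry. The text states only that branches missing the BTB are assumed not taken; the 2-bit saturating counter used below, and the choice to allocate an entry in the "strongly taken" state on the first taken branch, are assumptions made for this sketch and are not claimed to be the Pentium processor's actual algorithm. The names BtbEntry, btb_predict, btb_update and loop_mispredictions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One hypothetical BTB entry: valid bit plus a 2-bit counter (0..3). */
typedef struct {
    bool    valid;
    uint8_t counter;   /* 0,1 = predict not taken; 2,3 = predict taken */
} BtbEntry;

/* A BTB miss (invalid entry) predicts not taken, as stated in the text. */
static bool btb_predict(const BtbEntry *e)
{
    return e->valid && e->counter >= 2;
}

/* Update the entry with the actual outcome of the branch. */
static void btb_update(BtbEntry *e, bool taken)
{
    if (!e->valid) {
        if (taken) {               /* assumed policy: allocate on first taken */
            e->valid = true;
            e->counter = 3;
        }
        return;
    }
    if (taken) {
        if (e->counter < 3) e->counter++;
    } else {
        if (e->counter > 0) e->counter--;
    }
}

/* Count mispredictions of a loop-closing branch that is taken on every
 * iteration except the last one, starting from an empty BTB entry. */
static unsigned loop_mispredictions(unsigned iterations)
{
    assert(iterations >= 1);
    BtbEntry e = { false, 0 };
    unsigned wrong = 0;
    for (unsigned i = 0; i < iterations; i++) {
        bool taken = (i + 1 < iterations);
        if (btb_predict(&e) != taken)
            wrong++;
        btb_update(&e, taken);
    }
    return wrong;
}
```

Under these assumptions the branch is mispredicted only on the first iteration (BTB miss) and on the loop exit, so the misprediction count does not grow with the number of iterations.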
Write Buffers and Memory Ordering. The Pentium processor has two write buffers, one corresponding to each of the pipelines, to enhance the performance of consecutive writes to memory. These write buffers are one quad-word (64 bits) wide and can be filled simultaneously in one clock, e.g., by two simultaneous write misses in the two instruction pipelines. Writes in these buffers are driven out on the external bus in the order they were generated by the processor core. No reads (as a result of a cache miss) are reordered around previously generated writes sitting in the write buffers. The implication of this is that the write buffers will be flushed or emptied before a subsequent bus cycle is run on the external bus (unless BOFF# is asserted and a writeback cycle becomes pending).
The Pentium processor supports strong write ordering only. That is, writes generated by the Pentium processor will be driven to the bus or updated in the cache in the order that they occur. The Pentium processor will not write to E or M-state lines in the data cache if there is a write in either write buffer, if a write cycle is running on the bus, or if EWBE# is inactive. Note that only memory writes are buffered and I/O writes are not.
Serializing Operations. After executing certain instructions the Pentium processor serializes instruction execution. This means that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed. The prefetch queue is flushed as a result of serializing operations. The Pentium processor serializes instruction execution after executing one of the following instructions: MOV to Debug Register, MOV to Control Register, INVD, INVLPG, IRET, IRETD, LGDT, LLDT, LIDT, LTR, WBINVD, CPUID, RSM and WRMSR. Notes:
1. The CPUID instruction can be executed at any privilege level to serialize instruction execution.
2. When the Pentium processor serializes instruction execution, it ensures that it has completed any modifications to memory, including flushing any internally buffered stores; it then waits for the EWBE# pin to go active before fetching and executing the next instruction. Pentium processor systems may use the EWBE# pin to indicate that a store is pending externally. In this manner, a system designer may ensure that all externally pending stores will complete before the Pentium processor begins to fetch and execute the next instruction.
3. The Pentium processor does not generally writeback the contents of modified data in its data cache to external memory when it serializes instruction execution. Software can force modified data to be written back by executing the WBINVD instruction.
4. Whenever an instruction is executed to enable/disable paging, this instruction must be followed with a jump. The instruction at the target of the branch is fetched with the new value of PG (i.e., paging enabled/disabled); however, the jump instruction itself is fetched with the previous value of PG. Pentium processors have slightly different requirements to enable and disable paging.
5. Whenever an instruction is executed to change the contents of CR3 while paging is enabled, the next instruction is fetched using the translation tables that correspond to the new value of CR3. Therefore the next instruction and the sequentially following instructions should have a mapping based upon the new value of CR3.
6. The Pentium processor implements branch-prediction techniques to improve performance by prefetching the destination of a branch instruction before the branch instruction is executed. Consequently, instruction execution is not generally serialized when a branch instruction is executed.
7. Although the I/O instructions are not "serializing" because the processor does not wait for these instructions to complete before it prefetches the next instruction, they do have the following properties that cause them to function in a manner that is identical to previous generations. I/O reads are not re-ordered within the processor; they wait for all internally pending stores to complete. Note that the Pentium processor does not sample the EWBE# pin during reads. If necessary, external hardware must ensure that externally pending stores are complete before returning BRDY#. The OUT and OUTS instructions are also not "serializing," as they do not stop the prefetcher. They do, however, ensure that all internally buffered stores have completed, that EWBE# has been sampled active indicating that all externally pending stores have completed, and that the I/O write has completed before they begin to execute the next instruction.
Linefill and Writeback Buffers. In addition to the write buffers corresponding to each of the internal pipelines, the Pentium processor has three writeback buffers. Each of the writeback buffers is 1 deep and 32 bytes (1 line) wide. There is a dedicated replacement writeback buffer which stores writebacks caused by a linefill that replaces a modified line in the data cache. There is one external snoop writeback buffer that stores writebacks caused by an inquire cycle that hits a modified line in the data cache. Finally, there is an internal snoop writeback buffer that stores writebacks caused by an internal snoop cycle that hits a modified line in the data cache. Write cycles are driven to the bus with the following priority: contents of external snoop writeback buffer; contents of internal snoop writeback buffer; contents of replacement writeback buffer; contents of write buffers.
Note that the contents of whichever write buffer was written into first are driven to the bus first. If both write buffers were written to in the same clock, the contents of the u-pipe buffer are written out first.
The Pentium processor also implements two line fill buffers, one for the data cache and one for the code cache. As information (data or code) is returned to the Pentium processor for a cache line fill, it is written into the line fill buffer. After the entire line has been returned to the processor, it is transferred to the cache. Note that the processor requests the needed information first and uses that information as soon as it is returned. The Pentium processor does not wait for the line fill to complete before using the requested information.
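The priority order for write cycles, including the tie-break between the two write buffers, can be expressed as a small selection function. The following C sketch is illustrative only: the BufferState structure, the WriteSource enumeration, and the idea of recording the clock at which each write buffer was filled are assumptions introduced here to model the rule, not a description of the hardware arbitration logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical identifiers for the possible sources of a write cycle. */
typedef enum {
    SRC_NONE,
    SRC_EXT_SNOOP_WB,     /* external snoop writeback buffer */
    SRC_INT_SNOOP_WB,     /* internal snoop writeback buffer */
    SRC_REPLACEMENT_WB,   /* replacement writeback buffer    */
    SRC_WRITE_BUF_U,      /* u-pipe write buffer             */
    SRC_WRITE_BUF_V       /* v-pipe write buffer             */
} WriteSource;

/* Hypothetical snapshot of which buffers currently hold data. */
typedef struct {
    bool ext_snoop_wb_full;
    bool int_snoop_wb_full;
    bool replacement_wb_full;
    bool u_buf_full;
    bool v_buf_full;
    unsigned long u_buf_clock;   /* clock at which the u buffer was written */
    unsigned long v_buf_clock;   /* clock at which the v buffer was written */
} BufferState;

/* Pick the buffer whose contents are driven to the bus next. */
static WriteSource next_write_source(const BufferState *s)
{
    assert(s != NULL);
    if (s->ext_snoop_wb_full)   return SRC_EXT_SNOOP_WB;
    if (s->int_snoop_wb_full)   return SRC_INT_SNOOP_WB;
    if (s->replacement_wb_full) return SRC_REPLACEMENT_WB;
    if (s->u_buf_full && s->v_buf_full) {
        /* Older write first; on a tie the u-pipe buffer goes first. */
        return (s->v_buf_clock < s->u_buf_clock) ? SRC_WRITE_BUF_V
                                                 : SRC_WRITE_BUF_U;
    }
    if (s->u_buf_full) return SRC_WRITE_BUF_U;
    if (s->v_buf_full) return SRC_WRITE_BUF_V;
    return SRC_NONE;
}
```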
If a linefill causes a modified line in the data cache to be replaced, the replaced line will remain in the cache until the line fill is complete. After the line fill is complete, the line being replaced is moved into the replacement writeback buffer and the new line fill is moved into the cache.