Floating-Point on the SPARC

The SPARC architecture includes an IEEE 754-compatible floating-point specification. Although the hardware does not implement the entire standard, it provides enough support that the full standard can be implemented without great difficulty.

The first implementations of the SPARC architecture depended on a separate IEEE-compatible floating-point coprocessor. When the floating-point hardware is not integrated onto the chip itself, different implementations of the processor may exhibit different floating-point characteristics. SPARC avoids this problem by making the specification of the floating-point behavior part of the architecture.

The floating-point load and store operations are issued by the integer unit rather than the floating-point unit. This is important because of the way the load and store instructions behave. In a typical implementation of the SPARC architecture, floating-point operations are overlapped with integer unit operations. However, the idea is that this overlap should be completely transparent to the program, which should be able to think of the floating-point instructions as being executed sequentially in a strictly synchronized manner. This is easily enough achieved with operations like addition, using techniques similar to those used to interlock integer load instructions. The result register of the addition is marked busy, and any attempt to use it before the addition is complete will hold things up.
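The interlock described above can be pictured as a tiny scoreboard: each in-flight operation marks its result register busy until a latency has elapsed, and any instruction that reads that register stalls. The register names and latencies in this Python sketch are invented purely for illustration; they do not describe any actual SPARC implementation.

```python
# Minimal scoreboard sketch of the interlock described in the text: the
# result register of an in-flight operation is marked busy, and any
# instruction that reads it stalls until the result is ready.
# All latencies are invented for the example.

busy_until = {}  # register name -> cycle at which its value becomes available


def issue(cycle, dest, srcs, latency):
    """Issue an operation at `cycle`; stall while any source register is busy."""
    start = max([cycle] + [busy_until.get(r, 0) for r in srcs])
    busy_until[dest] = start + latency
    return start  # cycle at which the operation actually begins


# fadd %f0,%f1 -> %f2 issued at cycle 0, with an assumed 3-cycle latency
issue(0, "f2", ["f0", "f1"], 3)
# a store that reads %f2 at cycle 1 must wait until cycle 3
print(issue(1, "mem", ["f2"], 1))  # → 3
```

The point is only that the stall is automatic: the consumer never sees a stale value, which is exactly the transparency the text describes.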

On the SPARC the floating-point store is handled by the IU, which guarantees it is synchronized in the same sense that ordinary stores are synchronized. A subsequent load of the memory location, by either a floating-point or an integer load, is guaranteed to always get the new value, obviating the need for any explicit synchronization.

Floating-Point Registers. The floating-point register set includes thirty-two 32-bit registers, organized so that they can be used as 32-bit, 64-bit, and 128-bit operands, with only 80 bits used for the extended-precision format (although extended precision is optional and not currently provided by SPARC chips). The floating-point register model of the SPARC does not include register windows or any of the complications that arise with the integer unit register set. A compiler for the SPARC must treat these registers the same way a compiler treats the general registers on a processor without register windows.

Even though the presence of register windows simplifies register allocation as far as the integer unit is concerned, the compiler writer is still required to manage floating-point registers in the standard manner, that is, without register windows. This somewhat undermines the argument that register windows simplify the job of writing the compiler: in the case of floating-point registers, allocation of registers in separate procedures is no longer completely independent, and optimal register allocation can be done only at link time. Compared to other floating-point implementations, the SPARC specification is somewhat richer than that of some other RISC chips, because it includes the square root and remainder functions. On the other hand, it is not as complete as that of the Intel 387, which also includes transcendental functions such as sine and cosine.
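As a rough sketch of this register organization, a wider operand occupies a run of adjacent 32-bit registers. The alignment rule used below (doubles start on even-numbered registers, quads on multiples of four) is the conventional SPARC encoding, but treat it here as an assumption for illustration rather than a statement about any particular chip.

```python
# Sketch of how thirty-two 32-bit FP registers alias wider operands.
# The even-pair / multiple-of-four alignment rule is assumed for the example.

def double_operand(reg):
    """Return the pair of 32-bit registers holding one 64-bit operand."""
    if reg % 2 != 0:
        raise ValueError("double-precision operands must start on an even register")
    return (reg, reg + 1)


def quad_operand(reg):
    """Return the four 32-bit registers holding one 128-bit operand."""
    if reg % 4 != 0:
        raise ValueError("quad-precision operands must start on a multiple of four")
    return tuple(range(reg, reg + 4))


print(double_operand(2))  # %f2 and %f3 together hold one double
print(quad_operand(4))    # %f4 through %f7 together hold one quad
```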



Overlapped Multiplication and Addition. A specific implementation of the SPARC's floating-point unit may be built from separate functional units. The first implementation of SPARC, for example, consisted of a floating-point controller, a floating-point multiplier, and an ALU that included the logic to perform addition and subtraction (we will refer to this as the adder). A floating-point unit in which the addition and multiplication logic are on separate chips can perform two operations in parallel. The dispatch logic can keep issuing floating-point instructions as long as there is a free unit to handle them.

The processor could, for example, dispatch a multiplication to the multiplication unit and then dispatch a subtract instruction to the adder unit without waiting for the multiply to complete. A third floating-point operation would have to wait for one of the previous ones to complete. The amount of overlap is an architectural decision: there is nothing to prevent an implementation from having many multipliers and many adders, allowing a significant overlap of instructions.

Taking Advantage of Overlapped Operations. To see the usefulness of having separate add and multiply units, consider the standard problem of computing the product of two matrices. The inner loop involves computing a dot product:

P(I,J) = A(I,1)*B(1,J) + A(I,2)*B(2,J) + A(I,3)*B(3,J) + ...

If you think about the way data flows in that computation, you can do a multiply, and when you get a result, you can start adding it to the sum while starting on the next multiply. If you program a matrix multiply in the obvious naive way, this scheme means that you are always overlapping multiplication and addition. It may be possible to double the floating-point throughput if there are separate multiplication and addition units.
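A back-of-the-envelope model makes the benefit concrete. The cycle counts below assume one non-pipelined multiplier and one adder with made-up latencies; only the relative numbers matter, not the absolute values.

```python
# Toy schedule for an n-term dot product on a machine with one multiply
# unit and one add unit that can run in parallel. Latencies are invented.

MUL_LAT = 3  # cycles per multiply (assumed)
ADD_LAT = 2  # cycles per add (assumed)


def sequential_cycles(n):
    """Every operation waits for the previous one: n multiplies, n-1 adds."""
    return n * MUL_LAT + (n - 1) * ADD_LAT


def overlapped_cycles(n):
    """Each add folds in a product while the next multiply is in flight."""
    mul_done = 0  # cycle at which the most recent product is ready
    add_done = 0  # cycle at which the running sum is up to date
    for i in range(n):
        mul_done += MUL_LAT  # the multiplier works on products back to back
        if i >= 1:           # n-1 adds, each needing a product and the old sum
            add_done = max(add_done, mul_done) + ADD_LAT
    return max(mul_done, add_done)


print(sequential_cycles(4), overlapped_cycles(4))  # → 18 14
```

With these assumed latencies a four-term dot product drops from 18 cycles to 14; as n grows, the add time hides almost entirely behind the multiplies.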

The SPARC architecture allows any number of functional units, so the maximum number of simultaneous floating-point operations is not specified. In the matrix multiplication example, we could take advantage of separate multiply and add units without special programming. However, there are other cases where the style of programming is affected by the number of functional units. Consider the polynomial: R = a*x^4 + b*x^3 + c*x^2 + d*x + e.

The familiar efficient scheme for evaluating this polynomial uses Horner's rule, which reduces the number of multiplications by factoring as follows: R = a * x; R = R + b; R = R * x; R = R + c; R = R * x; R = R + d; R = R * x; R = R + e.
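Horner's rule transcribes directly into code. This Python sketch follows the sequence in the text exactly (four multiplications and four additions, each depending on the previous result); the coefficient values in the example call are chosen arbitrarily.

```python
# Horner's rule for R = a*x^4 + b*x^3 + c*x^2 + d*x + e, written as the
# strictly serial sequence given in the text.

def horner(a, b, c, d, e, x):
    r = a * x
    r = r + b
    r = r * x
    r = r + c
    r = r * x
    r = r + d
    r = r * x
    r = r + e
    return r


# 1*2^4 + 2*2^3 + 3*2^2 + 4*2 + 5 = 16 + 16 + 12 + 8 + 5
print(horner(1, 2, 3, 4, 5, 2.0))  # → 57.0
```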

This approach minimizes the number of operations (a total of eight operations are required). However, it is not suitable for taking advantage of parallel execution units, since each computation involves the previous result. Consider the following alternative computation scheme:

T1 := d * x; X2 := x * x;

T2 := c * X2; T3 := T1 + e; X3 := X2 * x;

T4 := T2 + T3; T5 := b * X3; X4 := X3 * x;

T6 := T4 + T5; T7 := a * X4; R := T6 + T7;

If we have two multiplication units and two addition units, all of which can operate in parallel, then the set of operations on each line can be executed in parallel. This means that in such circumstances, this computation scheme is faster than Horner's rule (five operation times instead of eight), even though the total number of operations is greater (11 instead of 8).
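The five-step scheme can be checked directly. Here is a Python transcription that groups the mutually independent operations of each step on one line; in sequential Python the grouping is only a comment, of course, but it shows which operations could issue to parallel units.

```python
# The alternative evaluation scheme from the text: 11 operations in 5
# parallel steps, versus Horner's 8 operations in 8 serial steps.

def parallel_poly(a, b, c, d, e, x):
    t1 = d * x;   x2 = x * x                  # step 1: two independent ops
    t2 = c * x2;  t3 = t1 + e;  x3 = x2 * x   # step 2: three independent ops
    t4 = t2 + t3; t5 = b * x3;  x4 = x3 * x   # step 3: three independent ops
    t6 = t4 + t5; t7 = a * x4                 # step 4: two independent ops
    return t6 + t7                            # step 5: final add


# Same value as Horner's rule for the same arbitrary coefficients.
print(parallel_poly(1, 2, 3, 4, 5, 2.0))  # → 57.0
```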

Handling Traps in Overlapped Operations. One difficulty with this kind of overlapped approach is that one of these instructions may cause a trap, due to the requirements of the IEEE standard. The traps are purely synchronous: a floating-point trap is always signaled on a floating-point instruction, never on one of the instructions executing between two floating-point operations.

Even synchronous traps can be a problem if they are signaled on an instruction other than the one causing the trap condition. The floating-point unit on the MIPS pre-examines the operand exponents to ensure that this never occurs, but on the SPARC, the expectation is that a trap caused by one operation may be signaled on a subsequent operation. Consider the following pattern of operations:

FPOP × (instruction A)

FPOP + (instruction B)

FPOP × (instruction C)

If a trap occurs during execution of instruction C in this example, it might be instruction A that actually caused the trouble. Meanwhile instruction B may be completed, waiting to get started, or at some intermediate stage of execution. The trap routine at this stage has to have some way of knowing what is going on.

What sort of things must be done by a trap routine? It might be necessary to abandon the computation completely and go on to the next one. In that case it is sufficient to have a scheme for flushing out the operations that have not completed. But you may also want to substitute some value in the computation and then carry on; again, the IEEE standard requires this capability. A typical example is that on an underflow trap, you may want to record the underflow in some sort of log file and then replace the result with 0 or some other small value. There are different numerical reasons for proceeding in different ways. On a divide overflow, you may want to register this fact and then replace the result with positive infinity. In this case, the operating system must simulate things as though the trap had occurred at the appropriate instruction, with the proper value substituted. But it cannot simply jump back and resume from that instruction, because all sorts of IU instructions may have executed in between and obviously cannot be reexecuted. There must be a way for the floating-point unit to tell the processor in some detail what has happened.
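The substitution policy just described might look like the following sketch. This is plain Python standing in for a trap handler: the trap kinds, the logging list, and the function name are all invented for illustration, not part of any SPARC interface.

```python
# Hedged sketch of the substitution policy from the text: on underflow,
# log the event and flush the result to zero; on divide overflow,
# substitute an appropriately signed infinity. Pure Python stand-in,
# not actual trap-handling code.

import math

trap_log = []  # stand-in for the "log file" mentioned in the text


def handle_fp_trap(kind, operand_sign=1.0):
    trap_log.append(kind)  # record the fact that the trap occurred
    if kind == "underflow":
        return 0.0  # replace the result with zero
    if kind == "divide_overflow":
        return math.copysign(math.inf, operand_sign)
    raise ValueError("unhandled trap kind: " + kind)


print(handle_fp_trap("underflow"))        # → 0.0
print(handle_fp_trap("divide_overflow"))  # → inf
```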

That is the purpose of the privileged Store Double Floating-point Queue (SDFQ) instruction. The idea is that it be used in standard code supplied by the operating system and executed only as part of an exception routine operating in supervisor state. SDFQ stores a double word. The first word is the address of the floating-point instruction that caused the trap (which in this case would be instruction A). The second word is a copy of that instruction. This copy is often not really needed, since it could be loaded from the program, but it saves the trap routine from having to access the instructions itself, with the attendant paging problems. Furthermore, in a memory-mapped environment with separate data and instruction spaces, it may not be straightforward to load instructions as data. Since the floating-point unit has obviously picked up the instruction, it can easily give it back.

When the processor executes an SDFQ instruction, the first element that it gives you is the one corresponding to the instruction that caused the trap (instruction A in this example). At this stage you can examine the instruction and decode it. You can also interrogate the status flags in the floating-point unit to see what kind of interrupt occurred and what the situation was when the interrupt occurred. You can then either reexecute the instruction, having fixed up some of the operands, or you can just simulate the instruction and provide an appropriate substitute result.

Now we have dealt with the instruction that caused the trap (instruction A). But what about instructions B and C? They have to be simulated, since control will finally return from the trap routine to just past instruction C. To solve this problem, we issue another SDFQ. It gives us another double word, which references the first uncompleted floating-point instruction (in this case instruction B). The exception routine now reexecutes instruction B. If this instruction causes a trap as well, some software convention is needed to decide what should happen when several instructions cause traps, and how to signal that clearly to the application program. Finally, the third SDFQ yields the final instruction, the one on which the trap was actually signaled (instruction C in this example).

There is a status bit in the floating-point status register that shows whether the queue is empty, so the exception routine issues SDFQ instructions until this bit is set. The maximum number of queue entries is one of the implementation parameters of the architecture; it depends on the maximum number of floating-point instructions that can be executing simultaneously. In general, the exception routine will be written to handle any number of elements in the queue, so that it works regardless of the overlap provided by the implementation.
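Putting the pieces together, the exception routine's drain loop can be sketched as follows. The queue contents, the instruction mnemonics, and the emulate callback are all invented for the example; only the drain-until-empty structure mirrors the text.

```python
# Sketch of an exception routine draining a deferred-trap queue. Each
# entry pairs the trapping instruction's address with a copy of the
# instruction, as the text describes for SDFQ. Contents are invented.

from collections import deque

fq = deque([
    (0x1000, "fmuls %f0,%f1,%f2"),  # instruction A: actually caused the trap
    (0x1008, "fadds %f3,%f4,%f5"),  # instruction B: not yet completed
    (0x1010, "fmuls %f6,%f7,%f8"),  # instruction C: where the trap was signaled
])


def sdfq():
    """Model of the privileged SDFQ instruction: pop one (address, insn) pair,
    or return None when the queue-empty status would be set."""
    return fq.popleft() if fq else None


def drain_queue(emulate):
    """Issue SDFQ until the queue is empty, emulating each pending instruction."""
    handled = []
    entry = sdfq()
    while entry is not None:
        addr, insn = entry
        emulate(addr, insn)  # fix up, reexecute, or simulate the instruction
        handled.append(addr)
        entry = sdfq()
    return handled


print(drain_queue(lambda addr, insn: None))  # → [4096, 4104, 4112]
```

The loop works for any queue depth, which is the property the text asks of the exception routine: it need not know how much overlap the implementation provides.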

