Until now in this chapter we have been exclusively concerned with signed, fixed-point numbers. It has been convenient to consider them as integers, that is, with an implied binary point at the right end of each number. It would have been just as easy to consider that the binary point is at the left end, thus dealing with fractions. In the 2's-complement system, the signed value F, represented by the n-bit binary fraction
B = b0.b-1b-2 … b-(n-1)
is given by
F(B) = -b0 × 2^0 + b-1 × 2^-1 + b-2 × 2^-2 + … + b-(n-1) × 2^-(n-1)
where the range of F is -1 ≤ F ≤ -2^-(n-1), F = 0, or 2^-(n-1) ≤ F ≤ 1 - 2^-(n-1).
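As a concrete illustration of this formula, the short Python sketch below evaluates an n-bit 2's-complement fraction supplied as a list of bits; the function name and the bit-list representation are illustrative choices, not part of the text.

```python
def tc_fraction_value(bits):
    """Value of the n-bit 2's-complement fraction b0.b-1b-2...b-(n-1).

    bits[0] is the sign bit b0 (weight -2^0 = -1); bits[i] has weight 2^-i.
    """
    value = -bits[0]
    for i, b in enumerate(bits[1:], start=1):
        value += b * 2.0 ** -i
    return value

# 4-bit examples:
print(tc_fraction_value([1, 0, 0, 0]))   # 1.000 -> -1.0
print(tc_fraction_value([0, 1, 1, 1]))   # 0.111 -> +0.875
print(tc_fraction_value([1, 0, 1, 1]))   # 1.011 -> -0.625
```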
Consider the range of values representable in these fixed-point formats. Let us assume 16-bit signed values. Interpreted as integers, the value range is -32,768 (= -2^15) to +32,767 (= +2^15 - 1). If we consider them to be fractions, the range is approximately ±3 × 10^-5 (≈ ±2^-15) to ±1. Neither of these ranges is sufficient for scientific calculations, which might involve parameters like Avogadro's number (6.0247 × 10^23 mole^-1) or Planck's constant (6.6254 × 10^-27 erg·s). Hence, there is a need to easily accommodate both very large integers and very small fractions. This means that a facility should be provided for both representing numbers and operating on numbers in such a way that the position of the binary point is variable and is automatically adjusted as computation proceeds. In such a case, the binary point is said to float, and the numbers are called floating-point numbers. This distinguishes them from fixed-point numbers, whose binary point is always in the same position.
Because the position of the binary point in a floating-point number is variable, it must be given explicitly in the floating-point representation. For example, in the familiar decimal scientific notation, numbers may be written as 6.0247 × 10^23, 6.6254 × 10^-27, -1.0341 × 10^2, -7.3000 × 10^-14, etc. These numbers are said to be given to five significant digits. The scale factor (10^23, 10^-27, etc.) indicates the true position of the decimal point with respect to the significant digits. By convention, the decimal point is placed to the right of the first (nonzero) significant digit, and the number is said to be normalized. Note that the base 10 in the scale factor is fixed and does not need to appear explicitly in the machine representation of a floating-point number. The sign, the significant digits, and the exponent in the scale factor comprise the representation. We are thus motivated to define a floating-point number representation as one in which a number is represented by its sign, followed by a string of significant digits, commonly called a mantissa, and an exponent to an implied base. Let us state a general form for such numbers in the decimal system and then relate the form to a comparable binary representation. A widely used form is ±X1.X2X3X4X5X6X7 × 10^±Y1Y2, where Xi and Yi are decimal digits.
Seven significant digits and a scale-factor range of 10^-99 to 10^+99 are sufficient for a wide range of scientific calculations. As we shall see, it is possible to approximate this range and mantissa precision in a binary representation that occupies 32 bits. Since 2^24 ≈ 1.7 × 10^7, a 24-bit number can approximately represent a seven-digit decimal number. Therefore, 24 bits are assigned to represent the mantissa in the binary representation. One bit is needed for the sign of the number, leaving 7 bits for a signed exponent.
A specific binary format for floating-point numbers is shown in Fig. 2.25a. Let us first assume that the implied base is 2 and that the 7-bit signed exponent is expressed as a 2's-complement integer. The 24-bit mantissa is considered to be a fraction with the binary point at its left, and the sign of the number is given in the leftmost bit of the format. To retain as many significant bits as possible, the fractional mantissa is kept in a normalized form in which, for nonzero values, its leftmost bit is always 1. Thus the magnitude of the mantissa M is either 0 or lies in the range 1/2 ≤ M < 1. A number that is not in this form can always be put in normalized form by shifting the fraction and adjusting the exponent, assuming that exponent overflow/underflow does not occur. Fig. 2.25b shows an unnormalized value 0.001… × 2^9 and its normalized version 0.1… × 2^7. A 7-bit, 2's-complement exponent has a range of -64 to +63, which means that the scale factor has a range of 2^-64 to 2^63, not large enough to represent the desired scale factor of 10^-99 to 10^99. If we reduce the size of the mantissa to allocate more bits to the exponent, then we will not be able to approximate the desired seven-decimal-digit accuracy. The solution that has been used in a number of computers is to change the value of the implied base in the scale factor. The base should be of the form 2^q so that a right or left shift of the mantissa by q binary positions with respect to its binary point corresponds to an increase or decrease of 1 in the exponent of the scale factor, respectively. If we let the implied base be 16, then the range of the scale factor becomes 16^-64 to 16^63, which corresponds approximately to the decimal range 10^-76 to 10^76. This representation for floating-point numbers has both a reasonable range and a reasonable number of bits in the mantissa. (However, as we shall see later, a floating-point standard has been developed in which a significantly larger range and mantissa accuracy are provided.) Since the base is now 16, shifting of the mantissa to perform normalization must take place in steps of 4-bit shifts, each of which corresponds to the smallest change (±1) in the exponent. A representation is now considered to be normalized if any of the leading 4 bits of its mantissa is 1. This is often called hexadecimal normalization.
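The normalization step itself is simple to express in software. The Python sketch below assumes the 24-bit mantissa is held as an integer with the binary point at its left and that the exponent is the signed base-16 exponent, so the value represented is (mantissa / 2^24) × 16^exponent; the function name and this representation are illustrative only.

```python
def hex_normalize(mantissa, exponent):
    """Shift the 24-bit mantissa left one hex digit (4 bits) at a time,
    decreasing the base-16 exponent by 1 per shift, until at least one of
    the leading 4 bits is 1.  Exponent underflow is not checked."""
    if mantissa == 0:
        return 0, 0                       # zero stays all 0s
    while mantissa & 0xF00000 == 0:       # leading hex digit is 0
        mantissa = (mantissa << 4) & 0xFFFFFF
        exponent -= 1
    return mantissa, exponent

# 0.0A0000 (hex) * 16^3 = 160.0  normalizes to  0.A00000 (hex) * 16^2
m, e = hex_normalize(0x0A0000, 3)
print(hex(m), e)                          # 0xa00000 2
```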
It should be pointed out that the gain in range achieved by using a base of 16 in the scale factor may result in a lower precision mantissa. Even though 24 bits are still used for the mantissa, the fact that hexadecimal normalization is used means that the leading 3 bits of a mantissa might be 0s. Thus, in some cases, only 21 significant bits are retained in the mantissa. This is in contrast with the use of the base 2 in the scale factor, where 24 bits of precision are always maintained.
Another change in the format of a floating-point number is useful. Instead of representing the exponent in a signed 2's-complement integer format, we represent it in excess-64 format. In this format, an exponent having the signed value E is represented by the value E' = E + 64. Since the desired range for E is -64 ≤ E ≤ 63, the excess-64 value E' will be in the range 0 ≤ E' ≤ 127. The smallest scale factor, 16^-64, is then represented by seven 0s, and the largest scale factor, 16^+63, is represented by seven 1s. This change facilitates the use of simple circuitry for determining the relative size of two floating-point numbers. An unnormalized value and its corresponding normalized version in the excess-64, base-16 scale factor scheme are shown in Fig. 2.25c. The value 0 is represented by all zeros. As computations proceed, a number that does not fall in the representable range may be generated. This means that its normalized representation requires an exponent less than -64 or greater than +63. In the first case, we say that underflow has occurred, and in the second case, we say that overflow has occurred. Events like this are generally called arithmetic exceptions. A uniform way to handle exceptions in a computer system is to raise an interrupt when they occur. The interrupt-service routine can then take action as specified by the user or by a system convention. For example, on underflow the decision might be to set the value to 0 and proceed. A large number range is a significant feature of a floating-point system. However, the user convenience provided by automatic handling of the variable position of the binary point with respect to the significant bits is the most important feature of such systems.
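To make the 32-bit format concrete, here is a small Python sketch that assembles and takes apart such a word. It assumes the Fig. 2.25c layout is, from left to right, the sign bit, the 7-bit excess-64 exponent, and the 24-bit mantissa; the helper names are illustrative.

```python
def pack(sign, exponent, mantissa):
    """Assemble sign (1 bit), excess-64 exponent (7 bits), 24-bit mantissa."""
    e_excess = exponent + 64                  # excess-64 encoding
    assert 0 <= e_excess <= 127
    return (sign << 31) | (e_excess << 24) | (mantissa & 0xFFFFFF)

def unpack(word):
    sign = word >> 31
    exponent = ((word >> 24) & 0x7F) - 64     # recover the signed exponent
    mantissa = word & 0xFFFFFF
    return sign, exponent, mantissa

def value(word):
    sign, exponent, mantissa = unpack(word)
    return (-1) ** sign * (mantissa / 2 ** 24) * 16.0 ** exponent

w = pack(0, 2, 0xA00000)                      # +0.A00000 (hex) * 16^2
print(hex(w), value(w))                       # 0x42a00000 160.0
```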
Arithmetic Operations on Floating-Point Numbers. The rules we give below apply to hexadecimal-normalized 24-bit fraction mantissas and scale factors that have an implied base of 16 and an explicit 7-bit signed exponent in excess-64 format. An example is shown in Fig. 2.25c. The rules are only intended to specify the major steps needed in performing the four operations. The possibility that overflow or underflow might occur is not handled. Furthermore, intermediate results for both mantissas and exponents might require more than 24 and 7 bits, respectively, for their representation. Both of these aspects of the operations need to be carefully considered when designing an arithmetic processor. Addition and subtraction require the mantissas to be shifted with respect to each other before they are added or subtracted when their exponents differ. Let us consider a decimal example in which we wish to add 2.9400 × 10^2 to 4.3100 × 10^4. We rewrite 2.9400 × 10^2 as 0.0294 × 10^4 and then perform addition of the mantissas to get 4.3394 × 10^4. A general rule for addition and subtraction may be stated as follows.
ADD/SUBTRACT Rule.
1. Choose the number with the smaller exponent and shift its mantissa right (in 4-bit steps) a number of steps equal to the difference in exponents.
2. Set the exponent of the result equal to the larger exponent.
3. Perform addition/subtraction on the mantissas and determine the sign of the result.
4. Normalize the resulting value, if necessary, and then use the first 24 bits after the binary point (truncated, as discussed later) as the mantissa of the result.
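The following Python sketch walks through these four steps on operands given as (sign, exponent, mantissa) triples with a base-16 scale factor. It is only an illustration: shifted-out bits are simply chopped, guard bits are not kept, and exponent overflow/underflow is not checked. The function name and triple representation are assumptions of the sketch, not something defined in the text.

```python
def fp_add_sub(a, b, subtract=False):
    """ADD/SUBTRACT rule on (sign, base-16 exponent, 24-bit mantissa) triples."""
    sa, ea, ma = a
    sb, eb, mb = b
    if subtract:
        sb ^= 1                                  # A - B is A + (-B)

    # Step 1: shift the mantissa with the smaller exponent right by one hex
    # digit per unit of exponent difference.
    n = abs(ea - eb)
    if ea >= eb:
        mb >>= 4 * n
        e = ea                                   # Step 2: larger exponent
    else:
        ma >>= 4 * n
        e = eb

    # Step 3: add/subtract the mantissas and determine the sign of the result.
    t = (-ma if sa else ma) + (-mb if sb else mb)
    s, m = (1, -t) if t < 0 else (0, t)

    # Step 4: normalize in 4-bit steps and keep the first 24 fraction bits.
    if m == 0:
        return 0, 0, 0
    if m >= 1 << 24:                             # carry out of the fraction
        m >>= 4                                  # one hex-digit right shift
        e += 1
    while m & 0xF00000 == 0:                     # leading hex digit is 0
        m = (m << 4) & 0xFFFFFF
        e -= 1
    return s, e, m

# 160.0 + 2.5:  160 = 0.A00000 (hex) * 16^2,  2.5 = 0.280000 (hex) * 16^1
s, e, m = fp_add_sub((0, 2, 0xA00000), (0, 1, 0x280000))
print(s, e, hex(m))                              # 0 2 0xa28000  (= 162.5)
```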
Multiplication and division are somewhat easier than addition and subtraction in that no alignment of mantissas is needed.
MULTIPLY Rule.
1. Add the exponents and subtract 64.
2. Multiply the mantissas and determine the sign of the result.
3. Normalize the resulting value, if necessary, and then use the first 24 bits after the binary point (truncated) as the mantissa of the result.
DIVIDE Rule.
1. Subtract the exponents and add 64.
2. Divide the mantissas and determine the sign of the result.
3. Normalize the resulting value, if necessary, and then use the first 24 bits after the binary point (truncated) as the mantissa of the result.
The addition and subtraction of 64 in the above two rules is a result of using the excess-64 notation for exponents.
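A short Python check makes this bias handling explicit. The sketch assumes both exponents are already in excess-64 form and ignores range checking; the function names are illustrative.

```python
def mul_exponent(ea_excess, eb_excess):
    # (EA + 64) + (EB + 64) carries the bias twice; subtracting 64 leaves
    # the correct excess-64 value (EA + EB) + 64.
    return ea_excess + eb_excess - 64

def div_exponent(ea_excess, eb_excess):
    # (EA + 64) - (EB + 64) cancels the bias entirely, so 64 is added back.
    return ea_excess - eb_excess + 64

# 16^3 * 16^-5 = 16^-2  and  16^3 / 16^-5 = 16^8
print(mul_exponent(3 + 64, -5 + 64))   # 62 = -2 + 64
print(div_exponent(3 + 64, -5 + 64))   # 72 =  8 + 64
```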
Guard Bits and Rounding. Although the mantissas of initial operands and final results are limited to 24 bits, it is important to retain extra bits, often called guard bits, during the intermediate steps. This enables retaining maximum accuracy in the results. The operation of removing guard bits in generating final results raises an important issue. The problem is that a binary fraction must be truncated to give a shorter fraction that is an approximation to the longer value. This problem also arises in other situations, for instance, in the conversion from decimal to binary fractions.
There are a number of ways that truncation can be done. The simplest way is to remove the guard bits and make no changes in the retained bits. This is called chopping. Suppose we wish to truncate a 6-bit fraction to a 3-bit fraction by this method. All fractions in the range 0.b-1b-2b-3000 to 0.b-1b-2b-3111 will be truncated to 0.b-1b-2b-3. The error in the 3-bit result obviously ranges from 0 to 0.000111. It is more convenient to say that, in general, the error in chopping ranges from 0 to almost 1 in the least significant position of the retained bits. In our example, this is the b-3 position. The result of chopping is called a biased approximation because the error is not symmetrical about 0.
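A minimal Python sketch of chopping, assuming the fraction is held as an integer with the binary point at its left; the names are illustrative.

```python
def chop(frac, total_bits, keep):
    """Drop the low (total_bits - keep) guard bits; retained bits unchanged."""
    return frac >> (total_bits - keep)

# 0.101110 chopped to 3 bits -> 0.101 (error 0.000110, always >= 0)
print(bin(chop(0b101110, 6, 3)))   # 0b101
```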
The next simplest method of truncation is Von Neumann rounding. If the bits to be removed are all zeroes, they are simply dropped, with no changes to the retained bits. However, if any of the bits to be removed are 1, the least significant bit of the retained bits is set to 1. In our 6-bit to 3-bit truncation example, all 6-bit fractions with b-4b-5b-6 not equal to 000 will be truncated to 0.b-1b-21. It is easy to see that the error in this truncation method ranges between -1 and +1 in the LSB position of the retained bits. Although the range of error is larger with this technique than it is with chopping, the maximum magnitude is the same, and the approximation is unbiased because the error range is symmetrical about 0.
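Von Neumann rounding differs only in forcing the retained LSB to 1 whenever any discarded bit is 1, as in this sketch (same integer representation of fractions as above):

```python
def von_neumann_round(frac, total_bits, keep):
    """Drop the guard bits; if any of them was 1, set the retained LSB to 1."""
    drop = total_bits - keep
    retained = frac >> drop
    if frac & ((1 << drop) - 1):      # at least one discarded bit is 1
        retained |= 1
    return retained

print(bin(von_neumann_round(0b100001, 6, 3)))   # 0.100001 -> 0.101
print(bin(von_neumann_round(0b101000, 6, 3)))   # 0.101000 -> 0.101 (unchanged)
```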
It is advantageous to use unbiased approximations if a large number of operands and operations are involved in generating a few results and if it can be assumed that the individual errors are approximately symmetrically distributed over the error range. Positive errors should tend to offset negative errors as the computation proceeds. From a statistical standpoint, we might then expect the results to have a high probability of being very accurate.
The third truncation method is a rounding procedure. It achieves the closest approximation to the number being truncated, and it is an unbiased technique. The procedure is as follows. A 1 is added to the LSB position of the bits to be retained if there is a 1 in the MSB position of the bits being removed. Thus, 0.b-1b-2b-31… rounds to 0.b-1b-2b-3 + 0.001, and 0.b-1b-2b-30… rounds to 0.b-1b-2b-3. This provides the desired approximation except for the case where the bits to be removed are 10…0. This is a tie situation; the longer value is halfway between the two closest truncated representations. In order to break the tie in an unbiased way, one possibility is to choose the retained bits to be the nearest even number. In terms of our 6-bit example, the value 0.b-1b-20100 is truncated to the value 0.b-1b-20, and 0.b-1b-21100 is truncated to 0.b-1b-21 + 0.001. The descriptive phrase "round to the nearest number or nearest even number in case of a tie" is sometimes used to refer to this truncation technique. The error range is approximately -1/2 to +1/2 in the LSB position of the retained bits.
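A sketch of this round-to-nearest-even rule, again on fractions held as integers; note that rounding up can carry out of the retained bits, which a full implementation would have to renormalize. The function name is illustrative.

```python
def round_nearest_even(frac, total_bits, keep):
    """Round to the nearest 'keep'-bit fraction, breaking ties to even."""
    drop = total_bits - keep
    retained = frac >> drop
    discarded = frac & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if discarded > half or (discarded == half and retained & 1):
        retained += 1           # may carry out of the retained bits
    return retained

# The text's tie examples, with b-1 b-2 = 1 0:
print(bin(round_nearest_even(0b100100, 6, 3)))   # 0b100 (tie, already even)
print(bin(round_nearest_even(0b101100, 6, 3)))   # 0b110 (tie, rounds to even)
```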
Implementation of Floating-Point Operations. The implementation of floating-point operations involves considerable circuitry. These operations may also be implemented by software routines. In either case, provision must be made for input and output conversion to and from the user's decimal representation of numbers. In many computers, floating-point operations are available at the basic machine instruction level. Hardware implementations range from serial through highly parallel forms, analogous to the range of hardware multipliers discussed earlier.
As an example of the implementation of floating-point operations, let us consider the block diagram for a hardware implementation of addition and subtraction on 32-bit floating-point operands that have the format shown in Fig. 2.25c. Let the signs, exponents, and mantissas of operands A and B be represented by SA, EA, MA and SB, EB, MB, respectively. Following the ADD/SUBTRACT rule, we see that the first step is to compare exponents to determine how far to shift the mantissa of the number with the smaller exponent. The 7-bit subtractor circuit in the upper left corner of Fig. 2.26 determines this shift-count value, n. The magnitude of the difference EA - EB, which is n, is sent to the SHIFTER unit. The range of n is restricted to 0, 1, …, 7, where n = 7 if |EA - EB| ≥ 7; otherwise, n = |EA - EB|. If n = 7, it is possible to determine the result immediately as being equal to the larger operand (or its negative). However, this option is not explicitly shown in Fig. 2.26. The sign of the difference resulting from the exponent comparison determines which mantissa is to be shifted. Therefore, the sign is sent to the SWAP network in the upper right corner of the figure. If the sign is 0, then EA ≥ EB, and the mantissas MA and MB are sent straight through the SWAP network. This results in MB being sent to the SHIFTER, to be shifted n hex positions to the right. The other mantissa, MA, is sent directly to the Mantissa adder-subtractor. If the sign is 1, then EA < EB, and the mantissas are swapped before being sent to the SHIFTER. This completes step 1 of the ADD/SUBTRACT rule.
Step 2 is performed by the two-way multiplexer, MPX, in the bottom left corner of Fig. 2.26. The exponent of the result, E, is tentatively determined as EA if EA ≥ EB, or EB if EA < EB. The sign of the difference resulting from the exponent comparison operation in step 1 determines this.
Step 3 involves the major component, the Mantissa adder-subtractor in the middle of the figure. The CONTROL logic determines whether the mantissas are to be added or subtracted. This is decided by the signs of the operands, SA and SB, and the operation, Add or Subtract, which is to be performed on the operands. The CONTROL logic also determines the sign of the result SR. For example, if A is negative (SA = 1), B is positive (SB = 0), and the operation is A - B, then the mantissas are added and the sign of the result is negative (SR = 1). On the other hand, if A and B are both positive and the operation is A - B, then the mantissas are subtracted. The sign of the result SR now depends on the mantissa subtraction operation. For instance, if EA > EB, then MA - shifted MB will be positive and the result will be positive. But if EB > EA, then MB - shifted MA will be positive and the result will be negative. This example shows that the sign from the exponent comparison is also required as an input to the CONTROL network. When EA = EB and the mantissas are subtracted, the sign of the Mantissa adder-subtractor output is crucial in determining the sign of the result. The reader should now be able to construct the complete truth table for the CONTROL network.
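One possible software formulation of these CONTROL decisions is sketched below. It is not the actual circuit, the signal names are illustrative, and it assumes both operands are normalized, so that the exponent comparison alone decides which magnitude is larger whenever the exponents differ.

```python
def control(sa, sb, op_sub, ea_lt_eb, mantissa_diff_negative):
    """Decide whether the mantissa magnitudes are added or subtracted, and
    determine the sign SR of the result.

    sa, sb                  -- operand signs (0 = +, 1 = -)
    op_sub                  -- 1 if the requested operation is A - B
    ea_lt_eb                -- sign of the exponent comparison (1 if EA < EB)
    mantissa_diff_negative  -- 1 if the Mantissa adder-subtractor output is
                               negative; with normalized operands this can
                               happen only when the exponents are equal
    Returns (add_mantissas, sr).
    """
    sb_eff = sb ^ op_sub                 # A - B is treated as A + (-B)
    if sa == sb_eff:
        return True, sa                  # like signs: add magnitudes
    # Unlike signs: subtract magnitudes; the result takes the sign of the
    # operand with the larger magnitude.
    larger_is_b = bool(ea_lt_eb or mantissa_diff_negative)
    return False, sb_eff if larger_is_b else sa

# A negative, B positive, operation A - B: mantissas added, result negative.
print(control(sa=1, sb=0, op_sub=1, ea_lt_eb=0, mantissa_diff_negative=0))
# (True, 1)
```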
Step 4 of the ADD/SUBTRACT rule is the normalization of the result mantissa M produced by step 3. The number of leading zeros in M determines the number X of hex-digit shifts to be applied to M. Then, the normalized value is truncated to generate the 24-bit mantissa, MR, of the result. The value X is also subtracted from the tentative result exponent E to generate the true result exponent ER. We should note that it is possible that a single hex-digit right shift might be needed to normalize the result. This would be the case if two mantissas of the form 0.1xx… are added. The vector M would then have the form 1.xxx… . This would correspond to an X value of -1 in the figure.
We have not given any details on the guard bits that need to be carried along with intermediate mantissa values. In general, only a few bits are needed, depending upon the truncation technique used to generate the 24-bit normalized mantissa of the result.
A few comments are in order about the actual hardware that might be used to implement the blocks in Fig. 2.26. The two 7-bit subtractors and the Mantissa adder-subtractor can be implemented by combinational logic as discussed earlier in this chapter. Since their outputs are required in sign and magnitude form, some modifications to our earlier discussions need to be made. A combination of 1's-complement arithmetic and sign and magnitude representation is often used. There is considerable flexibility in the implementation of the SHIFTER and the output normalization operation. To make these parts inexpensive, they should be constructed as shift registers. However, if speed of execution is important, they can be built in a more combinational manner. Fig. 2.26 is organized along the lines of part of the floating-point hardware used in some IBM computers.