

and Architecture 10th Edition



# Chapter 10 Computer Arithmetic



## Arithmetic & Logic Unit (ALU)

- Part of the computer that actually performs arithmetic and logical operations on data
- All of the other elements of the computer system are there mainly to bring data into the ALU for it to process and then to take the results back out
- Based on the use of simple digital logic devices that can store binary digits and perform simple Boolean logic operations





Operands for arithmetic and logic operations are presented to the ALU in registers, and the results of an operation are stored in registers in binary.

Figure 10.1 ALU Inputs and Outputs



## Integer Representation



- In the binary number system arbitrary numbers can be represented with:
  - The digits zero and one
  - The minus sign (for negative numbers)
  - The period, or *radix point* (for numbers with a fractional component)
- For purposes of computer storage and processing we do not have the benefit of special symbols for the minus sign and radix point
- Only binary digits (0,1) may be used to represent numbers

## Sign-Magnitude Representation



There are several alternative conventions used to represent negative as well as positive integers

- All of these alternatives involve treating the most significant (leftmost) bit in the word as a sign bit
  - If the sign bit is 0 the number is positive
  - If the sign bit is 1 the number is negative

Sign-magnitude representation is the simplest form that employs a sign bit

Drawbacks:

- Addition and subtraction require a consideration of both the signs of the numbers and their relative magnitudes to carry out the required operation
- There are two representations of 0

Because of these drawbacks, sign-magnitude representation is rarely used in implementing the integer portion of the ALU

# Table 10.1 Characteristics of Twos Complement Representation and Arithmetic

| Range                             | $-2^{n-1}$ through $2^{n-1}-1$                                                                                                                       |
|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| Number of Representations of Zero | One                                                                                                                                                  |
| Negation                          | Take the Boolean complement of each bit of the corresponding positive number, then add 1 to the resulting bit pattern viewed as an unsigned integer. |
| Expansion of Bit Length           | Add additional bit positions to the left and fill in with the value of the original sign bit.                                                        |
| Overflow Rule                     | If two numbers with the same sign (both positive or both negative) are added, then overflow occurs if and only if the result has the opposite sign.  |
| Subtraction Rule                  | To subtract $B$ from $A$ , take the twos complement of $B$ and add it to $A$ .                                                                       |

#### Table 10.2

Alternative Representations for 4-Bit Integers

| Decimal<br>Representation | Sign-Magnitude<br>Representation | Twos Complement<br>Representation | Biased<br>Representation |
|---------------------------|----------------------------------|-----------------------------------|--------------------------|
| +8                        | _                                | _                                 | 1111                     |
| +7                        | 0111                             | 0111                              | 1110                     |
| +6                        | 0110                             | 0110                              | 1101                     |
| +5                        | 0101                             | 0101                              | 1100                     |
| +4                        | 0100                             | 0100                              | 1011                     |
| +3                        | 0011                             | 0011                              | 1010                     |
| +2                        | 0010                             | 0010                              | 1001                     |
| +1                        | 0001                             | 0001                              | 1000                     |
| +0                        | 0000                             | 0000                              | 0111                     |
| -0                        | 1000                             | _                                 | _                        |
| -1                        | 1001                             | 1111                              | 0110                     |
| -2                        | 1010                             | 1110                              | 0101                     |
| -3                        | 1011                             | 1101                              | 0100                     |
| -4                        | 1100                             | 1100                              | 0011                     |
| -5                        | 1101                             | 1011                              | 0010                     |
| -6                        | 1110                             | 1010                              | 0001                     |
| -7                        | 1111                             | 1001                              | 0000                     |
| -8                        | _                                | 1000                              | _                        |

| -128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|------|----|----|----|---|---|---|---|
|      |    |    |    |   |   |   |   |

(a) An eight-position two's complement value box

|   | -128 | 64 | 32 | 16 | 8 | 4 | 2  | 1  |       |
|---|------|----|----|----|---|---|----|----|-------|
|   | 1    | 0  | 0  | 0  | 0 | 0 | 1  | 1  |       |
| • | -128 |    |    |    |   |   | +2 | +1 | =-125 |

(b) Convert binary 10000011 to decimal



(c) Convert decimal –120 to binary

Figure 10.2 Use of a Value Box for Conversion Between Twos Complement Binary and Decimal

As you can see in Figure 10.2a, the most negative twos complement number that can be represented is -2<sup>n-1</sup>; if any of the bits other than the sign bit is one, it adds a positive amount to the number.
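The value-box method of Figure 10.2 can be sketched in Python. This is a minimal illustration rather than code from the text; the function names `twos_complement_to_int` and `int_to_twos_complement` are hypothetical.

```python
def twos_complement_to_int(bits: str) -> int:
    """Sum the value-box weights of every position that holds a 1."""
    n = len(bits)
    # Leftmost position weighs -2^(n-1); the rest weigh +2^(n-2) ... +2^0.
    weights = [-(1 << (n - 1))] + [1 << i for i in range(n - 2, -1, -1)]
    return sum(w for w, b in zip(weights, bits) if b == "1")


def int_to_twos_complement(value: int, n: int = 8) -> str:
    """Encode a signed integer as an n-bit twos complement string."""
    if not -(1 << (n - 1)) <= value < (1 << (n - 1)):
        raise ValueError(f"{value} does not fit in {n} bits")
    # Masking to n bits maps negative values onto their twos complement pattern.
    return format(value & ((1 << n) - 1), f"0{n}b")


print(twos_complement_to_int("10000011"))  # -125, as in Figure 10.2b
print(int_to_twos_complement(-120))        # 10001000, as in Figure 10.2c
```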



## Range Extension

- Range of numbers that can be expressed is extended by increasing the bit length
- In sign-magnitude notation this is accomplished by moving the sign bit to the new leftmost position and filling in with zeros
- This procedure will not work for twos complement negative integers
  - Rule is to move the sign bit to the new leftmost position and fill in with copies of the sign bit
  - For positive numbers, fill in with zeros, and for negative numbers, fill in with ones
  - This is called sign extension
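Both range-extension rules can be sketched on bit strings. The helper names below are hypothetical, and the sketch assumes the leftmost character of the string is the sign bit.

```python
def extend_sign_magnitude(bits: str, new_width: int) -> str:
    """Move the sign bit to the new leftmost position and fill with zeros."""
    if new_width < len(bits):
        raise ValueError("new width is smaller than the original width")
    return bits[0] + "0" * (new_width - len(bits)) + bits[1:]


def sign_extend(bits: str, new_width: int) -> str:
    """Twos complement rule: fill the new positions with copies of the sign bit."""
    if new_width < len(bits):
        raise ValueError("new width is smaller than the original width")
    return bits[0] * (new_width - len(bits)) + bits


print(extend_sign_magnitude("10010010", 16))  # -18 in sign-magnitude
print(sign_extend("11101110", 16))            # -18 in twos complement
print(sign_extend("00010010", 16))            # +18 in twos complement
```

Applying the sign-magnitude rule to the twos complement pattern `11101110` would give `1000000001101110`, which is not -18; that is why the separate sign-extension rule is needed.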

## **Fixed-Point Representation**

The radix point (binary point) is fixed and assumed to be to the right of the rightmost digit

Programmer can use the same representation for binary fractions by scaling the numbers so that the binary point is implicitly positioned at some other location


## Negation

- Twos complement operation
  - Take the Boolean complement of each bit of the integer (including the sign bit)
  - Treating the result as an unsigned binary integer, add 1

- The negative of the negative of that number is itself:

-18 = 11101110 (twos complement)  
bitwise complement = 00010001  
add 1: 00010001 + 1 = 00010010 = +18


## Negation Special Case 1



0 = 00000000 (twos complement)

Bitwise complement = 11111111

Add 1 to LSB + 1

Result 1 00000000

Overflow is ignored, so:

$$-0 = 0$$



## **Negation Special Case 2**



-128 = 10000000 (twos complement)

Bitwise complement = 01111111

Add 1 to LSB + 1

Result 10000000

So -(-128) = -128, which is incorrect: +128 cannot be represented in 8 bits.

Monitor the MSB (sign bit): it should change during negation.
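The negation procedure and its two special cases can be checked with a short sketch. The names `negate` and `negation_overflowed` are hypothetical; the second function implements the "monitor the sign bit" check under the assumption that a negative input whose negation is still negative signals the anomaly.

```python
def negate(bits: str) -> str:
    """Twos complement negation: complement every bit, then add 1."""
    n = len(bits)
    mask = (1 << n) - 1
    complement = ~int(bits, 2) & mask
    # Any carry out of the most significant bit is discarded by the mask.
    return format((complement + 1) & mask, f"0{n}b")


def negation_overflowed(bits: str) -> bool:
    """True when the sign bit of a negative operand fails to change."""
    return bits[0] == "1" and negate(bits)[0] == "1"


print(negate("11101110"))               # 00010010 (+18)
print(negate("00000000"))               # 00000000 (special case 1)
print(negate("10000000"))               # 10000000 (special case 2)
print(negation_overflowed("10000000"))  # True
```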



Addition proceeds as if the two numbers were unsigned integers

| $\begin{array}{rcl} 1001 &=& -7 \\ +0101 &=& 5 \\ \hline 1110 &=& -2 \end{array}$<br>(a) (-7) + (+5) | $\begin{array}{rcl} 1100 &=& -4 \\ +0100 &=& 4 \\ \hline 1\,0000 &=& 0 \end{array}$<br>(b) (-4) + (+4) |
|------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| $\begin{array}{rcl} 0011 &=& 3 \\ +0100 &=& 4 \\ \hline 0111 &=& 7 \end{array}$<br>(c) (+3) + (+4) | $\begin{array}{rcl} 1100 &=& -4 \\ +1111 &=& -1 \\ \hline 1\,1011 &=& -5 \end{array}$<br>(d) (-4) + (-1) |
| $\begin{array}{rcl} 0101 &=& 5 \\ +0100 &=& 4 \\ \hline 1001 &=& \text{Overflow} \end{array}$<br>(e) (+5) + (+4) | $\begin{array}{rcl} 1001 &=& -7 \\ +1010 &=& -6 \\ \hline 1\,0011 &=& \text{Overflow} \end{array}$<br>(f) (-7) + (-6) |

Figure 10.3 Addition of Numbers in Twos Complement Representation



Overflow Rule

If two numbers are added, and they are both positive or both negative, then overflow occurs if and only if the result has the opposite sign.

When two signed twos complement numbers are added, overflow is detected if:

1. both operands are positive and the result is negative, or
2. both operands are negative and the result is positive.


Figures 10.3e and f show examples of overflow. Note that overflow can occur whether or not there is a carry.
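A minimal sketch of twos complement addition with the overflow rule applied to the sign bits. The function name `add_twos_complement` and its return shape are assumptions made for illustration; it reports the carry out separately to show that carry and overflow are independent.

```python
def add_twos_complement(a: str, b: str) -> tuple[str, bool, bool]:
    """Add two n-bit patterns as unsigned integers.

    Returns (n-bit result, carry out, overflow).
    """
    n = len(a)
    total = int(a, 2) + int(b, 2)
    carry = (total >> n) == 1
    result = format(total & ((1 << n) - 1), f"0{n}b")
    # Overflow: operands share a sign and the result has the opposite sign.
    overflow = a[0] == b[0] and result[0] != a[0]
    return result, carry, overflow


for a, b in [("1001", "0101"), ("1100", "0100"), ("0101", "0100"), ("1001", "1010")]:
    print(a, "+", b, "->", add_twos_complement(a, b))
```

The last two calls correspond to Figures 10.3e and 10.3f: one overflows without a carry, the other overflows with a carry.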



Subtraction Rule

To subtract one number (subtrahend) from another (minuend), take the twos complement (negation) of the subtrahend and add it to the minuend.

| $\begin{array}{rcl} 0010 &=& 2 \\ +1001 &=& -7 \\ \hline 1011 &=& -5 \end{array}$<br>(a) M = 2 = 0010<br>S = 7 = 0111<br>-S = 1001 | $\begin{array}{rcl} 0101 &=& 5 \\ +1110 &=& -2 \\ \hline 1\,0011 &=& 3 \end{array}$<br>(b) M = 5 = 0101<br>S = 2 = 0010<br>-S = 1110 |
|------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| $\begin{array}{rcl} 1011 &=& -5 \\ +1110 &=& -2 \\ \hline 1\,1001 &=& -7 \end{array}$<br>(c) M = -5 = 1011<br>S = 2 = 0010<br>-S = 1110 | $\begin{array}{rcl} 0101 &=& 5 \\ +0010 &=& 2 \\ \hline 0111 &=& 7 \end{array}$<br>(d) M = 5 = 0101<br>S = -2 = 1110<br>-S = 0010 |
| $\begin{array}{rcl} 0111 &=& 7 \\ +0111 &=& 7 \\ \hline 1110 &=& \text{Overflow} \end{array}$<br>(e) M = 7 = 0111<br>S = -7 = 1001<br>-S = 0111 | $\begin{array}{rcl} 1010 &=& -6 \\ +1100 &=& -4 \\ \hline 1\,0110 &=& \text{Overflow} \end{array}$<br>(f) M = -6 = 1010<br>S = 4 = 0100<br>-S = 1100 |

Figure 10.4 Subtraction of Numbers in Twos Complement Representation (M – S)
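The subtraction rule can be sketched as negate-then-add. The function name `subtract_twos_complement` is hypothetical. As a stated design choice, overflow is tested on the signs of M and S directly (different signs, and the result's sign differs from M's); this agrees with the addition rule applied to M and -S, and also stays correct when S is the most negative value, whose negation is not representable.

```python
def subtract_twos_complement(m: str, s: str) -> tuple[str, bool]:
    """Compute M - S on n-bit patterns. Returns (n-bit result, overflow)."""
    n = len(m)
    mask = (1 << n) - 1
    neg_s = (~int(s, 2) + 1) & mask           # twos complement of the subtrahend
    result = format((int(m, 2) + neg_s) & mask, f"0{n}b")
    # Overflow: M and S have different signs and the result's sign differs from M's.
    overflow = m[0] != s[0] and result[0] != m[0]
    return result, overflow


cases = [("0010", "0111"), ("0101", "0010"), ("1011", "0010"),
         ("0101", "1110"), ("0111", "1001"), ("1010", "0100")]
for m, s in cases:
    print(m, "-", s, "->", subtract_twos_complement(m, s))
```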



OF = overflow bit

SW = Switch (select addition or subtraction)

Figure 10.6 Block Diagram of Hardware for Addition and Subtraction





Figure 10.7 Multiplication of Unsigned Binary Integers



| C | A    | Q    | M    | Operation      | Cycle        |
|---|------|------|------|----------------|--------------|
| 0 | 0000 | 1101 | 1011 | Initial values |              |
| 0 | 1011 | 1101 | 1011 | Add            | First cycle  |
| 0 | 0101 | 1110 | 1011 | Shift          | First cycle  |
| 0 | 0010 | 1111 | 1011 | Shift          | Second cycle |
| 0 | 1101 | 1111 | 1011 | Add            | Third cycle  |
| 0 | 0110 | 1111 | 1011 | Shift          | Third cycle  |
| 1 | 0001 | 1111 | 1011 | Add            | Fourth cycle |
| 0 | 1000 | 1111 | 1011 | Shift          | Fourth cycle |

(b) Example from Figure 10.7 (product in A, Q)

Figure 10.8 Hardware Implementation of Unsigned Binary Multiplication



Figure 10.9 Flowchart for Unsigned Binary Multiplication
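The shift-and-add procedure can be sketched with the registers C, A, Q, and M named in Figure 10.8. This is an illustrative model using Python integers, not a description of the actual hardware; the function name `multiply_unsigned` is hypothetical.

```python
def multiply_unsigned(multiplicand: int, multiplier: int, n: int = 4) -> int:
    """Shift-and-add multiplication of two n-bit unsigned integers."""
    mask = (1 << n) - 1
    if not (0 <= multiplicand <= mask and 0 <= multiplier <= mask):
        raise ValueError(f"operands must fit in {n} unsigned bits")
    c, a, q, m = 0, 0, multiplier, multiplicand
    for _ in range(n):
        if q & 1:                       # Q0 = 1: add M to A, carry goes to C
            total = a + m
            c, a = total >> n, total & mask
        # Shift C, A, Q right one bit as a single register.
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (c << (n - 1))
        c = 0
    return (a << n) | q                 # 2n-bit product held in A, Q


print(format(multiply_unsigned(0b1011, 0b1101), "08b"))  # 10001111
```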



Figure 10.10 Multiplication of Two Unsigned 4-Bit Integers Yielding an 8-Bit Result



(a) Unsigned integers

(b) Twos complement integers

## Figure 10.11 Comparison of Multiplication of Unsigned and Twos Complement Integers



Figure 10.12 Booth's Algorithm for Twos Complement Multiplication

| A    | Q    | Q<sub>-1</sub> | M    | Operation            | Cycle        |
|------|------|----------------|------|----------------------|--------------|
| 0000 | 0011 | 0              | 0111 | Initial values       |              |
| 1001 | 0011 | 0              | 0111 | $A \leftarrow A - M$ | First cycle  |
| 1100 | 1001 | 1              | 0111 | Shift                | First cycle  |
| 1110 | 0100 | 1              | 0111 | Shift                | Second cycle |
| 0101 | 0100 | 1              | 0111 | $A \leftarrow A + M$ | Third cycle  |
| 0010 | 1010 | 0              | 0111 | Shift                | Third cycle  |
| 0001 | 0101 | 0              | 0111 | Shift                | Fourth cycle |

#### Figure 10.13 Example of Booth's Algorithm (7 × 3)

(a) $(7) \times (3) = (21)$

(b) $(7) \times (-3) = (-21)$

(c) $(-7) \times (3) = (-21)$

(d) $(-7) \times (-3) = (21)$

#### Figure 10.14 Examples Using Booth's Algorithm
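Booth's algorithm can be sketched with the registers A, Q, Q<sub>-1</sub>, and M from Figure 10.13. The function name `booth_multiply` is hypothetical. One assumption is stated loudly: this n-bit sketch rejects a multiplicand equal to -2^(n-1), because A - M would itself overflow an n-bit register in that case.

```python
def booth_multiply(multiplicand: int, multiplier: int, n: int = 4) -> int:
    """Booth's algorithm on n-bit twos complement operands; returns a signed product."""
    low, high = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (low < multiplicand <= high and low <= multiplier <= high):
        raise ValueError("operand out of range for this sketch")
    mask = (1 << n) - 1
    sign = 1 << (n - 1)
    a, q, q_1, m = 0, multiplier & mask, 0, multiplicand & mask
    for _ in range(n):
        if (q & 1, q_1) == (1, 0):
            a = (a - m) & mask          # A <- A - M
        elif (q & 1, q_1) == (0, 1):
            a = (a + m) & mask          # A <- A + M
        # Arithmetic shift right of A, Q, Q-1: the sign bit of A is replicated.
        q_1 = q & 1
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (a & sign)
    product = (a << n) | q
    # Interpret the 2n-bit pattern in A, Q as a twos complement integer.
    return product - (1 << (2 * n)) if product & (1 << (2 * n - 1)) else product


for x, y in [(7, 3), (7, -3), (-7, 3), (-7, -3)]:
    print(x, "x", y, "=", booth_multiply(x, y))
```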





#### Figure 10.15 Example of Division of Unsigned Binary Integers



Figure 10.16 Flowchart for Unsigned Binary Division

| A    | Q    | Operation                                        |
|------|------|--------------------------------------------------|
| 0000 | 0111 | Initial value                                    |
| 0000 | 1110 | Shift                                            |
| 1101 |      | Subtract (add 1101, the twos complement of 0011) |
| 0000 | 1110 | Restore, set $Q_0 = 0$                           |
| 0001 | 1100 | Shift                                            |
| 1110 |      | Subtract                                         |
| 0001 | 1100 | Restore, set $Q_0 = 0$                           |
| 0011 | 1000 | Shift                                            |
| 0000 | 1001 | Subtract, set $Q_0 = 1$                          |
| 0001 | 0010 | Shift                                            |
| 1110 |      | Subtract                                         |
| 0001 | 0010 | Restore, set $Q_0 = 0$                           |

Figure 10.17 Example of Restoring Twos Complement Division (7/3)
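The shift, subtract, and restore steps traced in Figure 10.17 can be sketched for nonnegative operands. The function name `restoring_divide` is hypothetical, and as a simplification A is held in a signed Python integer rather than an n-bit twos complement register.

```python
def restoring_divide(dividend: int, divisor: int, n: int = 4) -> tuple[int, int]:
    """Restoring division of n-bit nonnegative integers; returns (quotient, remainder)."""
    mask = (1 << n) - 1
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if not (0 <= dividend <= mask and 0 < divisor <= mask):
        raise ValueError(f"operands must fit in {n} unsigned bits")
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        # Shift A, Q left one bit as a single register.
        a = (a << 1) | (q >> (n - 1))
        q = (q << 1) & mask
        a -= m                          # trial subtraction
        if a < 0:
            a += m                      # restore; Q0 stays 0
        else:
            q |= 1                      # subtraction succeeded; set Q0 = 1
    return q, a


print(restoring_divide(7, 3))  # (2, 1): quotient 0010, remainder 0001
```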

## Floating-Point Representation

#### **Principles**

- With a fixed-point notation it is possible to represent a range of positive and negative integers centered on or near 0
- By assuming a fixed binary or radix point, this format allows the representation of numbers with a fractional component as well

#### ■ Limitations:

- Very large numbers cannot be represented nor can very small fractions
- The fractional part of the quotient in a division of two large numbers could be lost



(b) Examples

Figure 10.18 Typical 32-Bit Floating-Point Format


## Floating-Point Significand

- The final portion of the word
- Any floating-point number can be expressed in many ways

The following are equivalent, where the significand is expressed in binary form:

$$0.110 \times 2^{5} \qquad 110 \times 2^{2} \qquad 0.0110 \times 2^{6}$$

- Normal number
  - The most significant digit of the significand is nonzero



(a) Twos Complement Integers



Figure 10.19 Expressible Numbers in Typical 32-Bit Formats



#### Figure 10.20 Density of Floating-Point Numbers

### **IEEE Standard 754**

The most important floating-point representation is defined in IEEE Standard 754

Standard was developed to facilitate the portability of programs from one processor to another and to encourage the development of sophisticated, numerically oriented programs

Standard has been widely adopted and is used on virtually all contemporary processors and arithmetic coprocessors

IEEE 754-2008 covers both binary and decimal floating-point representations

## IEEE 754-2008



#### Arithmetic format

All the mandatory operations defined by the standard are supported by the format. The format may be used to represent floating-point operands or results for the operations described in the standard.

#### Basic format

This format covers five floating-point representations, three binary and two decimal, whose encodings are specified by the standard, and which can be used for arithmetic. At least one of the basic formats is implemented in any conforming implementation.

#### Interchange format

A fully specified, fixed-length binary encoding that allows data interchange between different platforms and that can be used for storage.



Figure 10.21 IEEE 754 Formats

#### Table 10.3 IEEE 754 Format Parameters

| Parameter                            | binary32               | binary64                | binary128                   |
|--------------------------------------|------------------------|-------------------------|-----------------------------|
| Storage width (bits)                 | 32                     | 64                      | 128                         |
| Exponent width (bits)                | 8                      | 11                      | 15                          |
| Exponent bias                        | 127                    | 1023                    | 16383                       |
| Maximum exponent                     | 127                    | 1023                    | 16383                       |
| Minimum exponent                     | -126                   | -1022                   | -16382                      |
| Approx normal number range (base 10) | $10^{-38}, 10^{+38}$   | $10^{-308}, 10^{+308}$  | $10^{-4932}, 10^{+4932}$    |
| Trailing significand width (bits)\*  | 23                     | 52                      | 112                         |
| Number of exponents                  | 254                    | 2046                    | 32766                       |
| Number of fractions                  | $2^{23}$               | $2^{52}$                | $2^{112}$                   |
| Number of values                     | $1.98 \times 2^{31}$   | $1.99 \times 2^{63}$    | $1.99 \times 2^{127}$       |
| Smallest positive normal number      | $2^{-126}$             | $2^{-1022}$             | $2^{-16382}$                |
| Largest positive normal number       | $2^{128} - 2^{104}$    | $2^{1024} - 2^{971}$    | $2^{16384} - 2^{16271}$     |
| Smallest subnormal magnitude         | $2^{-149}$             | $2^{-1074}$             | $2^{-16494}$                |

\* Not including implied bit and not including sign bit
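Several rows of Table 10.3 follow directly from the storage width and the exponent width. The sketch below derives them; the function name `ieee754_binary_parameters` and the dictionary keys are hypothetical, and the sketch assumes the layout of one sign bit, an exponent field, and a trailing significand.

```python
def ieee754_binary_parameters(storage_bits: int, exponent_bits: int) -> dict[str, int]:
    """Derive binary-format parameters from the storage and exponent widths."""
    fraction_bits = storage_bits - exponent_bits - 1   # one bit is the sign
    bias = (1 << (exponent_bits - 1)) - 1
    max_exp, min_exp = bias, 1 - bias
    return {
        "exponent_bias": bias,
        "maximum_exponent": max_exp,
        "minimum_exponent": min_exp,
        "trailing_significand_width": fraction_bits,
        # The all-0s and all-1s exponent fields are reserved.
        "number_of_exponents": (1 << exponent_bits) - 2,
        "log2_smallest_normal": min_exp,
        "log2_smallest_subnormal": min_exp - fraction_bits,
        # (2 - 2^-f) * 2^max_exp, written as an exact integer.
        "largest_normal": ((1 << (fraction_bits + 1)) - 1) << (max_exp - fraction_bits),
    }


for name, (width, exp) in {"binary32": (32, 8), "binary64": (64, 11)}.items():
    p = ieee754_binary_parameters(width, exp)
    print(name, p["exponent_bias"], p["minimum_exponent"], p["log2_smallest_subnormal"])
```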

## Additional Formats

#### **Extended Precision Formats**

- Provide additional bits in the exponent (extended range) and in the significand (extended precision)
- Lessens the chance of a final result that has been contaminated by excessive roundoff error
- Lessens the chance of an intermediate overflow aborting a computation whose final result would have been representable in a basic format
- Affords some of the benefits of a larger basic format without incurring the time penalty usually associated with higher precision

#### **Extendable Precision Format**

- Precision and range are defined under user control
- May be used for intermediate calculations but the standard places no constraint on format or length



## Table 10.4 IEEE Formats

| Format                                                       | Arithmetic Format | Basic Format | Interchange Format |
|--------------------------------------------------------------|-------------------|--------------|--------------------|
| binary16                                                     |                   |              | X                  |
| binary32                                                     | X                 | X            | X                  |
| binary64                                                     | X                 | X            | X                  |
| binary128                                                    | X                 | X            | X                  |
| binary $\{k\}$<br>$(k = n \times 32 \text{ for } n > 4)$     | X                 |              | X                  |
| decimal64                                                    | X                 | X            | X                  |
| decimal128                                                   | X                 | X            | X                  |
| decimal $\{k\}$<br>$(k = n \times 32 \text{ for } n > 4)$    | X                 |              | X                  |
| extended precision                                           | X                 |              |                    |
| extendable precision                                         | X                 |              |                    |
Table 10.5
Interpretation of IEEE 754 Floating-Point Numbers (page 1 of 3)

|                         | Sign   | Biased exponent | Fraction                | Value             |
|-------------------------|--------|-----------------|-------------------------|-------------------|
| positive zero           | 0      | 0               | 0                       | 0                 |
| negative zero           | 1      | 0               | 0                       | -0                |
| plus infinity           | 0      | all 1s          | 0                       | $\infty$          |
| minus infinity          | 1      | all 1s          | 0                       | $-\infty$         |
| quiet NaN               | 0 or 1 | all 1s          | $\neq$ 0; first bit = 1 | qNaN              |
| signaling NaN           | 0 or 1 | all 1s          | $\neq$ 0; first bit = 0 | sNaN              |
| positive normal nonzero | 0      | 0 < e < 255     | f                       | $2^{e-127}(1.f)$  |
| negative normal nonzero | 1      | 0 < e < 255     | f                       | $-2^{e-127}(1.f)$ |
| positive subnormal      | 0      | 0               | $f \neq 0$              | $2^{e-126}(0.f)$  |
| negative subnormal      | 1      | 0               | $f \neq 0$              | $-2^{e-126}(0.f)$ |

#### (a) binary32 format
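The binary32 rows of Table 10.5 can be exercised by splitting a 32-bit word into its sign, biased exponent, and fraction fields and then applying the value rules. This sketch uses Python's standard `struct` module to obtain the bit pattern; the function names `decode_binary32` and `interpret_binary32` are hypothetical, and the input must be representable in binary32.

```python
import math
import struct


def decode_binary32(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, fraction) of x stored as binary32."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word >> 31, (word >> 23) & 0xFF, word & 0x7FFFFF


def interpret_binary32(sign: int, e: int, f: int) -> float:
    """Apply the value rules of the binary32 table to the three fields."""
    s = -1.0 if sign else 1.0
    if e == 255:                                   # all 1s: infinity or NaN
        return s * math.inf if f == 0 else math.nan
    if e == 0:                                     # zero or subnormal: 2^-126 * (0.f)
        return s * math.ldexp(f, -126 - 23)
    return s * math.ldexp((1 << 23) | f, e - 127 - 23)   # normal: 2^(e-127) * (1.f)


print(decode_binary32(1.0))        # (0, 127, 0)
print(decode_binary32(-0.0))       # (1, 0, 0)
print(decode_binary32(math.inf))   # (0, 255, 0)
print(interpret_binary32(*decode_binary32(-6.5)))  # -6.5
```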

Table 10.5
Interpretation of IEEE 754 Floating-Point Numbers (page 2 of 3)

|                         | Sign   | Biased exponent | Fraction                | Value              |
|-------------------------|--------|-----------------|-------------------------|--------------------|
| positive zero           | 0      | 0               | 0                       | 0                  |
| negative zero           | 1      | 0               | 0                       | -0                 |
| plus infinity           | 0      | all 1s          | 0                       | $\infty$           |
| minus infinity          | 1      | all 1s          | 0                       | $-\infty$          |
| quiet NaN               | 0 or 1 | all 1s          | $\neq$ 0; first bit = 1 | qNaN               |
| signaling NaN           | 0 or 1 | all 1s          | $\neq$ 0; first bit = 0 | sNaN               |
| positive normal nonzero | 0      | 0 < e < 2047    | f                       | $2^{e-1023}(1.f)$  |
| negative normal nonzero | 1      | 0 < e < 2047    | f                       | $-2^{e-1023}(1.f)$ |
| positive subnormal      | 0      | 0               | $f \neq 0$              | $2^{e-1022}(0.f)$  |
| negative subnormal      | 1      | 0               | $f \neq 0$              | $-2^{e-1022}(0.f)$ |

#### (b) binary64 format

Table 10.5
Interpretation of IEEE 754 Floating-Point Numbers (page 3 of 3)

|                         | Sign   | Biased exponent | Fraction                | Value               |
|-------------------------|--------|-----------------|-------------------------|---------------------|
| positive zero           | 0      | 0               | 0                       | 0                   |
| negative zero           | 1      | 0               | 0                       | -0                  |
| plus infinity           | 0      | all 1s          | 0                       | $\infty$            |
| minus infinity          | 1      | all 1s          | 0                       | $-\infty$           |
| quiet NaN               | 0 or 1 | all 1s          | $\neq$ 0; first bit = 1 | qNaN                |
| signaling NaN           | 0 or 1 | all 1s          | $\neq$ 0; first bit = 0 | sNaN                |
| positive normal nonzero | 0      | 0 < e < 32767   | f                       | $2^{e-16383}(1.f)$  |
| negative normal nonzero | 1      | 0 < e < 32767   | f                       | $-2^{e-16383}(1.f)$ |
| positive subnormal      | 0      | 0               | $f \neq 0$              | $2^{e-16382}(0.f)$  |
| negative subnormal      | 1      | 0               | $f \neq 0$              | $-2^{e-16382}(0.f)$ |

#### (c) binary128 format

**Table 10.6 Floating-Point Numbers and Arithmetic Operations** 

| Floating Point Numbers                              | Arithmetic Operations                                                                 |
|-----------------------------------------------------|---------------------------------------------------------------------------------------|
| $X = X_S \cdot B^{X_E}$<br>$Y = Y_S \cdot B^{Y_E}$  | $X + Y = (X_S \cdot B^{X_E - Y_E} + Y_S) \cdot B^{Y_E}$, for $X_E \le Y_E$            |
|                                                     | $X - Y = (X_S \cdot B^{X_E - Y_E} - Y_S) \cdot B^{Y_E}$, for $X_E \le Y_E$            |
|                                                     | $X \cdot Y = (X_S \cdot Y_S) \cdot B^{X_E + Y_E}$                                     |
|                                                     | $\dfrac{X}{Y} = \left(\dfrac{X_S}{Y_S}\right) \cdot B^{X_E - Y_E}$                    |

#### **Examples:**

$$X = 0.3 \cdot 10^{2} = 30$$

$$Y = 0.2 \cdot 10^{3} = 200$$

$$X + Y = (0.3 \cdot 10^{2-3} + 0.2) \cdot 10^{3} = 0.23 \cdot 10^{3} = 230$$

$$X - Y = (0.3 \cdot 10^{2-3} - 0.2) \cdot 10^{3} = (-0.17) \cdot 10^{3} = -170$$

$$X \cdot Y = (0.3 \cdot 0.2) \cdot 10^{2+3} = 0.06 \cdot 10^{5} = 6000$$

$$X \div Y = (0.3 \div 0.2) \cdot 10^{2-3} = 1.5 \cdot 10^{-1} = 0.15$$
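The four formulas can be checked on the worked example with exact rational arithmetic. In this sketch a number is a hypothetical `(significand, exponent)` pair in base `B = 10`; the function names are illustrative, and `Fraction` is used so that no rounding obscures the algebra.

```python
from fractions import Fraction

B = 10  # base used in the worked example


def fp_add(x, y):
    """Align to the larger exponent, then add significands."""
    (xs, xe), (ys, ye) = sorted([x, y], key=lambda p: p[1])   # ensures xe <= ye
    return xs * Fraction(B) ** (xe - ye) + ys, ye


def fp_sub(x, y):
    """X - Y computed as X + (-Y)."""
    return fp_add(x, (-y[0], y[1]))


def fp_mul(x, y):
    """Multiply significands, add exponents."""
    return x[0] * y[0], x[1] + y[1]


def fp_div(x, y):
    """Divide significands, subtract exponents."""
    return x[0] / y[0], x[1] - y[1]


def value(p):
    """Exact value of a (significand, exponent) pair."""
    return p[0] * Fraction(B) ** p[1]


X = (Fraction(3, 10), 2)   # 0.3 * 10^2 = 30
Y = (Fraction(2, 10), 3)   # 0.2 * 10^3 = 200
for op in (fp_add, fp_sub, fp_mul, fp_div):
    print(op.__name__, float(value(op(X, Y))))
```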



Figure 10.22 Floating-Point Addition and Subtraction ( $Z \leftarrow X \pm Y$ )



Figure 10.23 Floating-Point Multiplication (Z← X× Y)



Figure 10.24 Floating-Point Division ( $Z \leftarrow X/Y$ )

```
 x = 1.000.....00 × 2^1          x = .100000 × 16^1
-y = 0.111.....11 × 2^1         -y = .0FFFFF × 16^1
 z = 0.000.....01 × 2^1          z = .000001 × 16^1
   = 1.000.....00 × 2^-22          = .100000 × 16^-4
```

(a) Binary example, without guard bits (left)

(c) Hexadecimal example, without guard bits (right)

```
 x = 1.000.....00 0000 × 2^1     x = .100000 00 × 16^1
-y = 0.111.....11 1000 × 2^1    -y = .0FFFFF F0 × 16^1
 z = 0.000.....00 1000 × 2^1     z = .000000 10 × 16^1
   = 1.000.....00 0000 × 2^-23     = .100000 00 × 16^-5
```

(b) Binary example, with guard bits (left)

(d) Hexadecimal example, with guard bits (right)

#### Figure 10.25 The Use of Guard Bits


## **Precision Considerations**

## Rounding

- IEEE standard approaches:
  - Round to nearest:
    - The result is rounded to the nearest representable number.
  - Round toward $+\infty$:
    - The result is rounded up toward plus infinity.
  - Round toward $-\infty$:
    - The result is rounded down toward negative infinity.
  - Round toward 0:
    - The result is rounded toward zero.
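The four rounding directions can be illustrated with Python's standard `decimal` module, which offers analogous modes. This is a base-10 illustration of the directions only, not a model of binary hardware; the helper name `round_all` is hypothetical, and ties in "round to nearest" are assumed to go to the even neighbor (`ROUND_HALF_EVEN`).

```python
from decimal import Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR, ROUND_HALF_EVEN

MODES = {
    "nearest": ROUND_HALF_EVEN,        # round to nearest, ties to even (assumed)
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
    "toward 0": ROUND_DOWN,
}


def round_all(value: str, places: str = "0.01") -> dict[str, Decimal]:
    """Round a decimal string to the given quantum under each mode."""
    quantum = Decimal(places)
    return {name: Decimal(value).quantize(quantum, rounding=mode)
            for name, mode in MODES.items()}


for name, rounded in round_all("-2.675").items():
    print(f"{name:12s} {rounded}")
```

For a negative value, rounding toward $+\infty$ and rounding toward 0 agree, while rounding toward $-\infty$ moves away from zero.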

## **Interval Arithmetic**

- Provides an efficient method for monitoring and controlling errors in floating-point computations by producing two values for each result
- The two values correspond to the lower and upper endpoints of an interval that contains the true result
- The width of the interval indicates the accuracy of the result
- If the endpoints are not representable then the interval endpoints are rounded down and up respectively
- If the range between the upper and lower bounds is sufficiently narrow then a sufficiently accurate result has been obtained

Rounding toward minus infinity and rounding toward plus infinity are useful in implementing interval arithmetic
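A rough sketch of interval addition follows. Standard Python does not expose the hardware rounding mode, so as a stated substitute this sketch widens each computed endpoint by one representable step with `math.nextafter` (Python 3.9 or later); that is more conservative than true directed rounding but still encloses the exact sum. The names `interval_add` and `contains` are hypothetical.

```python
import math
from fractions import Fraction

Interval = tuple[float, float]


def interval_add(x: Interval, y: Interval) -> Interval:
    """Add two intervals, pushing the endpoints outward by one step each."""
    lo = math.nextafter(x[0] + y[0], -math.inf)   # stands in for rounding toward -inf
    hi = math.nextafter(x[1] + y[1], math.inf)    # stands in for rounding toward +inf
    return lo, hi


def contains(interval: Interval, exact: Fraction) -> bool:
    """Check enclosure using exact rational comparison."""
    return Fraction(interval[0]) <= exact <= Fraction(interval[1])


x = (0.1, 0.1)
y = (0.2, 0.2)
z = interval_add(x, y)
exact = Fraction(0.1) + Fraction(0.2)   # exact sum of the two stored values
print(z, contains(z, exact))
```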

### **Truncation**

- Round toward zero
- Extra bits are ignored
- Simplest technique
- A consistent bias toward zero in the operation
  - Serious bias because it affects every operation for which there are nonzero extra bits

# IEEE Standard for Binary Floating-Point Arithmetic: Infinity

Infinity is treated as the limiting case of real arithmetic, with the infinity values given the following interpretation:

$$-\infty < (\text{every finite number}) < +\infty$$

#### For example:

$$5 + (+\infty) = +\infty$$

$$5 \div (+\infty) = +0$$

$$5 - (+\infty) = -\infty$$

$$5 + (-\infty) = -\infty$$

$$5 - (-\infty) = +\infty$$

$$5 \times (+\infty) = +\infty$$

$$(-\infty) + (-\infty) = -\infty$$

$$(-\infty) - (+\infty) = -\infty$$

$$(+\infty) - (-\infty) = +\infty$$
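Python floats are binary64 values on common platforms, so these limiting-case rules can be observed directly. The snippet below is a quick check, not part of the standard; the dictionary name `results` is illustrative.

```python
import math

inf = math.inf

results = {
    "5 + (+inf)": 5 + inf,
    "5 - (+inf)": 5 - inf,
    "5 + (-inf)": 5 + (-inf),
    "5 - (-inf)": 5 - (-inf),
    "5 * (+inf)": 5 * inf,
    "5 / (+inf)": 5 / inf,
    "(+inf) + (+inf)": inf + inf,
    "(-inf) + (-inf)": (-inf) + (-inf),
    "(-inf) - (+inf)": (-inf) - inf,
    "(+inf) - (-inf)": inf - (-inf),
}

for expression, value in results.items():
    print(f"{expression} = {value}")
```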

# Table 10.7 Operations that Produce a Quiet NaN

| Operation       | Quiet NaN Produced by                                                                                                                       |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Any             | Any operation on a signaling NaN                                                                                                            |
| Add or subtract | Magnitude subtraction of infinities:<br>$(+\infty) + (-\infty)$<br>$(-\infty) + (+\infty)$<br>$(+\infty) - (+\infty)$<br>$(-\infty) - (-\infty)$ |
| Multiply        | $0 \times \infty$                                                                                                                           |
| Division        | $\frac{0}{0}$ or $\frac{\infty}{\infty}$                                                                                                    |
| Remainder       | $x \text{ REM } 0 \text{ or } \infty \text{ REM } y$                                                                                        |
| Square root     | $\sqrt{x}$ where $x < 0$                                                                                                                    |
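Some rows of Table 10.7 can be reproduced with Python floats. Only the cases that Python evaluates to NaN are shown; note that Python itself raises an exception for `0.0 / 0.0` and for `math.sqrt` of a negative number instead of returning NaN, which is a language choice and not part of the table. The dictionary name `nan_cases` is illustrative.

```python
import math

inf = math.inf

nan_cases = {
    "(+inf) + (-inf)": inf + (-inf),
    "(-inf) + (+inf)": (-inf) + inf,
    "(+inf) - (+inf)": inf - inf,
    "(-inf) - (-inf)": (-inf) - (-inf),
    "0 * inf": 0.0 * inf,
    "inf / inf": inf / inf,
}

for expression, value in nan_cases.items():
    print(f"{expression} -> {value} (isnan: {math.isnan(value)})")

# A NaN propagates through later arithmetic and compares unequal to itself.
x = nan_cases["0 * inf"] + 1.0
print(math.isnan(x), x == x)
```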



(a) 32-bit format without subnormal numbers



(b) 32-bit format with subnormal numbers

#### Figure 10.26 The Effect of IEEE 754 Subnormal Numbers

## Summary

Chapter 10: Computer Arithmetic

- ALU
- Integer representation
  - Sign-magnitude representation
  - Twos complement representation
  - Range extension
  - Fixed-point representation
- Floating-point representation
  - Principles
  - IEEE standard for binary floating-point representation
- Integer arithmetic
  - Negation
  - Addition and subtraction
  - Multiplication
  - Division
- Floating-point arithmetic
  - Addition and subtraction
  - Multiplication and division
  - Precision considerations
  - IEEE standard for binary floating-point arithmetic