

#### **Digital Design and Computer Architecture: ARM® Edition**

Sarah L. Harris and David Money Harris



Digital Design and Computer Architecture: ARM<sup>®</sup> Edition © 2015

Chapter 7 <1>



# Chapter 7 :: Topics

- Introduction
- Performance Analysis
- Single-Cycle Processor
- Multicycle Processor
- Pipelined Processor
- Advanced Microarchitecture








### Introduction

- Microarchitecture: how to implement an architecture in hardware
- Processor:
  - Datapath: functional blocks
  - Control: control signals







### Microarchitecture

- Multiple implementations for a single architecture:
  - Single-cycle: Each instruction executes in a single cycle
  - Multicycle: Each instruction is broken up into a series of shorter steps
  - Pipelined: Each instruction is broken up into a series of steps, and multiple instructions execute at once





#### **Processor Performance**

#### Program execution time

Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)

#### Definitions:

- CPI: cycles/instruction
- Clock period: seconds/cycle
- IPC: instructions/cycle = 1/CPI

#### Challenge is to satisfy constraints of:

- Cost
- Power
- Performance





#### **ARM Processor**

- Consider **subset** of ARM instructions:
  - Data-processing instructions:
    - ADD, SUB, AND, ORR
    - with register and immediate Src2, but no shifts
  - Memory instructions:
    - LDR, STR
    - with positive immediate offset
  - Branch instructions:
    - B





### Architectural State Elements

#### **Determines everything about a processor:**

- Architectural state:
  - 16 registers (including PC)
  - Status register
- Memory





#### **ARM Architectural State Elements**








- Datapath
- Control








- **Datapath:** start with LDR instruction
- Example: LDR R1, [R2, #5], which has the general form LDR Rd, [Rn, imm12]







# Single-Cycle Datapath: LDR fetch

#### **STEP 1:** Fetch instruction











# Single-Cycle Datapath: LDR Reg Read

#### **STEP 2:** Read source operands from RF









# Single-Cycle Datapath: LDR Immed.

#### **STEP 3:** Extend the immediate







# Single-Cycle Datapath: LDR Address

#### **STEP 4:** Compute the memory address







# Single-Cycle Datapath: LDR Mem Read

#### **STEP 5:** Read data from memory and write it back to register file







#### Single-Cycle Datapath: PC Increment

#### **STEP 6:** Determine address of next instruction
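The six LDR steps can be traced in a few lines of Python. This is only a behavioral sketch, not the hardware: the function name `execute_ldr`, the field tuple, and the dictionary used as data memory are illustrative assumptions, and the case where Rd is R15 is not modeled.

```python
def execute_ldr(fields, regs, data_mem, pc):
    """Behavioral sketch of LDR Rd, [Rn, imm12] on the single-cycle datapath.

    fields   -- (rn, rd, imm12) taken from the fetched instruction (STEP 1)
    regs     -- list of 16 register values (hypothetical register file model)
    data_mem -- dict mapping byte address to word (hypothetical data memory)
    """
    rn, rd, imm12 = fields
    # STEP 2: read the base register; R15 reads as the current PC plus 8
    src_a = pc + 8 if rn == 15 else regs[rn]
    # STEP 3: zero-extend the 12-bit immediate
    ext_imm = imm12 & 0xFFF
    # STEP 4: compute the memory address with the ALU
    address = (src_a + ext_imm) & 0xFFFFFFFF
    # STEP 5: read data memory and write the value back to Rd
    regs[rd] = data_mem[address]
    # STEP 6: address of the next instruction
    return pc + 4


regs = [0] * 16
regs[2] = 0x20
next_pc = execute_ldr((2, 1, 5), regs, {0x25: 1234}, pc=0x100)  # LDR R1, [R2, #5]
```

After the call, `regs[1]` holds the word stored at address 0x25 and `next_pc` is 0x104.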










#### Single-Cycle Datapath: Access to PC

PC can be source/destination of instruction

- **Source:** R15 must be available in Register File
  - PC is read as the current PC plus 8
- **Destination:** Be able to write result to PC








# Single-Cycle Datapath: STR

#### **Expand datapath to handle STR:**

• Write data in Rd to memory
















#### With immediate Src2:

- Read from Rn and Imm8 (ImmSrc chooses the zero-extended Imm8 instead of Imm12)
- Write ALUResult to register file
- Write to Rd



ADD Rd, Rn, imm8





#### With register Src2:

- Read from Rn and Rm (instead of Imm8)
- Write ALUResult to register file
- Write to Rd



ADD Rd, Rn, Rm








### Single-Cycle Datapath: B

#### **Calculate branch target address:**

- BTA = (ExtImm) + (PC + 8)
- ExtImm = Imm24 << 2 and sign-extended







# Single-Cycle Datapath: ExtImm



| ImmSrc<sub>1:0</sub> | ExtImm                                                    | Description                          |
|----------------------|-----------------------------------------------------------|--------------------------------------|
| 00                   | {24'b0, Instr<sub>7:0</sub>}                              | Zero-extended imm8                   |
| 01                   | {20'b0, Instr<sub>11:0</sub>}                             | Zero-extended imm12                  |
| 10                   | {6{Instr<sub>23</sub>}, Instr<sub>23:0</sub>, 2'b00}      | Sign-extended imm24, multiplied by 4 |
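A minimal Python sketch of the extend unit follows. The function name `extend` is an assumption for illustration. For ImmSrc = 10 it applies the left shift by 2 stated on the branch slide (ExtImm = Imm24 << 2 and sign-extended), so the result is a full 32-bit value.

```python
def extend(instr, imm_src):
    """Return the 32-bit ExtImm selected by ImmSrc (behavioral sketch)."""
    if imm_src == 0b00:
        # Zero-extended imm8 from Instr[7:0]
        return instr & 0xFF
    if imm_src == 0b01:
        # Zero-extended imm12 from Instr[11:0]
        return instr & 0xFFF
    if imm_src == 0b10:
        # Sign-extend imm24 from Instr[23:0], then multiply by 4
        imm24 = instr & 0xFFFFFF
        if imm24 & 0x800000:
            imm24 -= 1 << 24
        return (imm24 << 2) & 0xFFFFFFFF
    raise ValueError("ImmSrc = 11 is not defined in the table")
```

For example, an imm24 of 0xFFFFFF (that is, -1) extends to 0xFFFFFFFC, a branch offset of -4 bytes.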








- FlagW<sub>1:0</sub>: Flag Write signal, asserted when ALUFlags should be saved (i.e., on instruction with S=1)
- ADD, SUB update all flags (NZCV)
- AND, ORR only update NZ flags










## Control Unit: Main Decoder

| Op | Funct<sub>5</sub> | Funct<sub>0</sub> | Type   | Branch | MemtoReg | MemW | ALUSrc | ImmSrc | RegW | RegSrc | ALUOp |
|----|-------------------|-------------------|--------|--------|----------|------|--------|--------|------|--------|-------|
| 00 | 0                 | X                 | DP Reg | 0      | 0        | 0    | 0      | XX     | 1    | 00     | 1     |
| 00 | 1                 | X                 | DP Imm | 0      | 0        | 0    | 1      | 00     | 1    | X0     | 1     |
| 01 | X                 | 0                 | STR    | 0      | X        | 1    | 1      | 01     | 0    | 10     | 0     |
| 01 | X                 | 1                 | LDR    | 0      | 1        | 0    | 1      | 01     | 1    | X0     | 0     |
| 10 | X                 | X                 | B      | 1      | 0        | 0    | 1      | 10     | 0    | X1     | 0     |
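The Main Decoder is a pure truth table, so a lookup is enough to model it. The sketch below is an illustration, not the book's HDL; the names `main_decoder`, `COLUMNS`, and `ROWS` are assumptions. It takes Op = 10 for B, the ARM branch encoding that the multicycle Instr Decoder table also uses, and keeps don't-cares as the strings "X", "XX", "X0", and "X1".

```python
COLUMNS = ("Branch", "MemtoReg", "MemW", "ALUSrc", "ImmSrc", "RegW", "RegSrc", "ALUOp")

# One row per instruction type, in the column order above
ROWS = {
    "DP Reg": (0, 0, 0, 0, "XX", 1, "00", 1),
    "DP Imm": (0, 0, 0, 1, "00", 1, "X0", 1),
    "STR":    (0, "X", 1, 1, "01", 0, "10", 0),
    "LDR":    (0, 1, 0, 1, "01", 1, "X0", 0),
    "B":      (1, 0, 0, 1, "10", 0, "X1", 0),
}


def main_decoder(op, funct5, funct0):
    """Map Op, Funct5 (I bit), and Funct0 (L bit) to the control signals."""
    if op == 0b00:
        kind = "DP Imm" if funct5 else "DP Reg"
    elif op == 0b01:
        kind = "LDR" if funct0 else "STR"
    elif op == 0b10:
        kind = "B"
    else:
        raise ValueError("Op = 11 is outside the supported subset")
    return kind, dict(zip(COLUMNS, ROWS[kind]))
```

For instance, `main_decoder(0b01, 0, 1)` selects the LDR row, with MemtoReg = 1 and RegW = 1.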




#### Review: ALU

| ALUControl <sub>1:0</sub> | Function |
|---------------------------|----------|
| 00                        | Add      |
| 01                        | Subtract |
| 10                        | AND      |
| 11                        | OR       |
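A behavioral sketch of this ALU in Python is shown below. The function name `alu` and the flag tuple order (N, Z, C, V) are assumptions for illustration. Subtraction is modeled as A + NOT B + 1, so C follows the ARM convention of being set when no borrow occurs.

```python
MASK = 0xFFFFFFFF


def alu(a, b, alu_control):
    """Return (result, (N, Z, C, V)) for 32-bit operands."""
    a &= MASK
    b &= MASK
    carry = overflow = 0
    if alu_control in (0b00, 0b01):
        # Add, or subtract as a + ~b + 1 (two's complement)
        operand = b if alu_control == 0b00 else (~b & MASK)
        total = a + operand + (alu_control & 1)
        result = total & MASK
        carry = total >> 32
        # Overflow: both adder inputs share a sign that differs from the result
        overflow = (((a ^ result) & (operand ^ result)) >> 31) & 1
    elif alu_control == 0b10:
        result = a & b
    else:
        result = a | b
    negative = result >> 31
    zero = int(result == 0)
    return result, (negative, zero, carry, overflow)
```

C and V are only meaningful for Add and Subtract; the sketch returns 0 for both on AND and OR.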








# Control Unit: ALU Decoder

| ALUOp | Funct<sub>4:1</sub><br>(cmd) | Funct<sub>0</sub><br>(S) | Type   | ALUControl<sub>1:0</sub> | FlagW<sub>1:0</sub> |
|-------|------------------------------|--------------------------|--------|--------------------------|---------------------|
| 0     | X                            | X                        | Not DP | 00                       | 00                  |
| 1     | 0100                         | 0                        | ADD    | 00                       | 00                  |
|       |                              | 1                        |        |                          | 11                  |
|       | 0010                         | 0                        | SUB    | 01                       | 00                  |
|       |                              | 1                        |        |                          | 11                  |
|       | 0000                         | 0                        | AND    | 10                       | 00                  |
|       |                              | 1                        |        |                          | 10                  |
|       | 1100                         | 0                        | ORR    | 11                       | 00                  |
|       |                              | 1                        |        |                          | 10                  |

- *FlagW*<sub>1</sub> = 1: *NZ* (*Flags*<sub>3:2</sub>) should be saved
- *FlagW*<sub>0</sub> = 1: *CV* (*Flags*<sub>1:0</sub>) should be saved
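The ALU Decoder table can be read as a small lookup keyed by cmd, with the S bit gating FlagW. The sketch below is illustrative; `DP_COMMANDS` and `alu_decoder` are assumed names, not identifiers from the slides.

```python
# cmd (Funct[4:1]) -> (mnemonic, ALUControl, FlagW when S = 1)
DP_COMMANDS = {
    0b0100: ("ADD", 0b00, 0b11),
    0b0010: ("SUB", 0b01, 0b11),
    0b0000: ("AND", 0b10, 0b10),
    0b1100: ("ORR", 0b11, 0b10),
}


def alu_decoder(alu_op, cmd, s):
    """Return (ALUControl, FlagW) for the four supported DP instructions."""
    if alu_op == 0:
        # Not a data-processing instruction: add, and save no flags
        return 0b00, 0b00
    _, alu_control, flag_w_if_s = DP_COMMANDS[cmd]
    return alu_control, (flag_w_if_s if s else 0b00)
```

With this sketch, `alu_decoder(1, 0b1100, 1)` yields ALUControl = 11 and FlagW = 10, the ORR row with S = 1.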










#### Single-Cycle Control: PC Logic

**PCS** = 1 if PC is written by an instruction or branch (B):

PCS = ((Rd == 15) & RegW) | Branch

- If instruction is executed: PCSrc = PCS
- Else: PCSrc = 0 (i.e., PC = PC + 4)
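The PC logic reduces to two Boolean expressions. A minimal sketch, assuming a hypothetical function name `pc_src` and a `cond_ex` argument that is 1 when the instruction's condition is satisfied:

```python
def pc_src(rd, reg_w, branch, cond_ex):
    """Return PCSrc: 1 selects the computed result, 0 selects PC + 4."""
    # PCS = ((Rd == 15) & RegW) | Branch
    pcs = (rd == 15 and bool(reg_w)) or bool(branch)
    # PCSrc follows PCS only when the instruction is actually executed
    return int(pcs and bool(cond_ex))
```

A branch whose condition fails therefore leaves PCSrc at 0, and the processor falls through to PC + 4.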






## Single-Cycle Control








## Single-Cycle Control: Cond. Logic








# **Conditional Logic**



#### **Function:**

1. Check if instruction should execute (if not, force PCSrc, RegWrite, and MemWrite to 0)
2. Possibly update Status Register (Flags<sub>3:0</sub>)






#### Single-Cycle Control: Conditional Logic








#### **Conditional Logic: Conditional Execution**



The instruction is executed (CondEx = 1) only if the condition flags ( $Flags_{3:0}$ ) satisfy the condition mnemonic ( $Cond_{3:0}$ )




### **Review: Condition Mnemonics**

| Cond<sub>3:0</sub> | Mnemonic     | Name                                | CondEx                           |
|--------------------|--------------|-------------------------------------|----------------------------------|
| 0000               | EQ           | Equal                               | $Z$                              |
| 0001               | NE           | Not equal                           | $\bar{Z}$                        |
| 0010               | CS / HS      | Carry set / Unsigned higher or same | $C$                              |
| 0011               | CC / LO      | Carry clear / Unsigned lower        | $\bar{C}$                        |
| 0100               | MI           | Minus / Negative                    | $N$                              |
| 0101               | PL           | Plus / Positive or zero             | $\bar{N}$                        |
| 0110               | VS           | Overflow / Overflow set             | $V$                              |
| 0111               | VC           | No overflow / Overflow clear        | $\bar{V}$                        |
| 1000               | HI           | Unsigned higher                     | $\bar{Z}C$                       |
| 1001               | LS           | Unsigned lower or same              | $Z \ OR \ \bar{C}$               |
| 1010               | GE           | Signed greater than or equal        | $\overline{N \oplus V}$          |
| 1011               | LT           | Signed less than                    | $N \oplus V$                     |
| 1100               | GT           | Signed greater than                 | $\bar{Z}(\overline{N \oplus V})$ |
| 1101               | LE           | Signed less than or equal           | $Z \ OR \ (N \oplus V)$          |
| 1110               | AL (or none) | Always / unconditional              | ignored                          |
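The condition check can be sketched as a lookup from Cond to a Boolean function of the flags. The function name `cond_ex` is an assumption; the expressions follow the standard ARM condition definitions (for example, HI requires C set and Z clear).

```python
def cond_ex(cond, n, z, c, v):
    """Return True if an instruction with this Cond field should execute."""
    n, z, c, v = bool(n), bool(z), bool(c), bool(v)
    ge = n == v  # NOT (N XOR V)
    checks = {
        0b0000: z,               # EQ
        0b0001: not z,           # NE
        0b0010: c,               # CS / HS
        0b0011: not c,           # CC / LO
        0b0100: n,               # MI
        0b0101: not n,           # PL
        0b0110: v,               # VS
        0b0111: not v,           # VC
        0b1000: (not z) and c,   # HI
        0b1001: z or (not c),    # LS
        0b1010: ge,              # GE
        0b1011: not ge,          # LT
        0b1100: (not z) and ge,  # GT
        0b1101: z or (not ge),   # LE
        0b1110: True,            # AL: flags are ignored
    }
    return checks[cond]
```

Each pair of adjacent codes (EQ/NE, HI/LS, GT/LE, and so on) is a condition and its complement, which is a quick sanity check on the table.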




### Conditional Logic: Update (Set) Flags



#### **Recall:**

- ADD, SUB update all Flags
- AND, ORR update NZ only
- So Flags status register has two write enables: FlagW<sub>1:0</sub>






## Review: ALU Decoder

| ALUOp | Funct<sub>4:1</sub><br>(cmd) | Funct<sub>0</sub><br>(S) | Type   | ALUControl<sub>1:0</sub> | FlagW<sub>1:0</sub> |
|-------|------------------------------|--------------------------|--------|--------------------------|---------------------|
| 0     | X                            | X                        | Not DP | 00                       | 00                  |
| 1     | 0100                         | 0                        | ADD    | 00                       | 00                  |
|       |                              | 1                        |        |                          | 11                  |
|       | 0010                         | 0                        | SUB    | 01                       | 00                  |
|       |                              | 1                        |        |                          | 11                  |
|       | 0000                         | 0                        | AND    | 10                       | 00                  |
|       |                              | 1                        |        |                          | 10                  |
|       | 1100                         | 0                        | ORR    | 11                       | 00                  |
|       |                              | 1                        |        |                          | 10                  |

- *FlagW*<sub>1</sub> = 1: *NZ* (*Flags*<sub>3:2</sub>) should be saved
- *FlagW*<sub>0</sub> = 1: *CV* (*Flags*<sub>1:0</sub>) should be saved








#### Conditional Logic: Update (Set) Flags



**FlagW**<sub>1:0</sub> = 10 AND **CondEx** = 1 (unconditional) => **FlagWrite**<sub>1:0</sub> = 10
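The gating of FlagW by CondEx, and the two separate write enables, can be sketched as follows. The function names and the (N, Z, C, V) tuple layout are illustrative assumptions.

```python
def flag_write(flag_w, cond_ex):
    """FlagWrite = FlagW when the instruction executes, otherwise 00."""
    return flag_w if cond_ex else 0b00


def update_flags(flags, alu_flags, flag_write_bits):
    """Return the new (N, Z, C, V) given the current flags and the ALU flags."""
    n, z, c, v = flags
    alu_n, alu_z, alu_c, alu_v = alu_flags
    if flag_write_bits & 0b10:  # FlagWrite[1]: save N and Z
        n, z = alu_n, alu_z
    if flag_write_bits & 0b01:  # FlagWrite[0]: save C and V
        c, v = alu_c, alu_v
    return (n, z, c, v)
```

With FlagWrite = 10, as in the case above, N and Z take the ALU values while C and V keep their previous contents.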






#### Example: ORR












#### **Extended Functionality: CMP**

#### No change to datapath






| ALUOp | Funct <sub>4:1</sub><br>(cmd) | Funct <sub>0</sub><br>(S) | Type   | ALUControl <sub>1:0</sub> | FlagW <sub>1:0</sub> | NoWrite |
|-------|-------------------------------|---------------------------|--------|---------------------------|----------------------|---------|
| 0     | X                             | X                         | Not DP | 00                        | 00                   | 0       |
| 1     | 0100                          | 0                         | ADD    | 00                        | 00                   | 0       |
|       |                               | 1                         |        |                           | 11                   | 0       |
|       | 0010                          | 0                         | SUB    | 01                        | 00                   | 0       |
|       |                               | 1                         |        |                           | 11                   | 0       |
|       | 0000                          | 0                         | AND    | 10                        | 00                   | 0       |
|       |                               | 1                         |        |                           | 10                   | 0       |
|       | 1100                          | 0                         | ORR    | 11                        | 00                   | 0       |
|       |                               | 1                         |        |                           | 10                   | 0       |
|       | 1010                          | 1                         | CMP    | 01                        | 11                   | 1       |






#### **Extended Functionality: Shifted Register**



#### No change to controller






### **Review: Processor Performance**

#### **Program Execution Time**

- = (#instructions)(cycles/instruction)(seconds/cycle)
- = #instructions × CPI × $T_c$






# Single-Cycle Performance



#### T<sub>c</sub> limited by critical path (LDR)






# Single-Cycle Performance

- Single-cycle critical path:

$$T_{c1} = t_{pcq\_PC} + t_{mem} + t_{dec} + \max[t_{mux} + t_{RFread}, t_{sext} + t_{mux}] + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup}$$

- Typically, limiting paths are memory, ALU, and register file:

$$T_{c1} = t_{pcq\_PC} + 2t_{mem} + t_{dec} + t_{RFread} + t_{ALU} + 2t_{mux} + t_{RFsetup}$$










# Single-Cycle Performance Example

| Element             | Parameter      | Delay (ps) |
|---------------------|----------------|------------|
| Register clock-to-Q | $t_{pcq\_PC}$  | 40         |
| Register setup      | $t_{setup}$    | 50         |
| Multiplexer         | $t_{mux}$      | 25         |
| ALU                 | $t_{ALU}$      | 120        |
| Decoder             | $t_{dec}$      | 70         |
| Memory read         | $t_{mem}$      | 200        |
| Register file read  | $t_{RFread}$   | 100        |
| Register file setup | $t_{RFsetup}$  | 60         |

$$T_{c1} = t_{pcq\_PC} + 2t_{mem} + t_{dec} + t_{RFread} + t_{ALU} + 2t_{mux} + t_{RFsetup}$$
  
= [40 + 2(200) + 70 + 100 + 120 + 2(25) + 60] ps  
= **840 ps**
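The sum is easy to check mechanically. The sketch below plugs the table's delays into the simplified critical-path formula; the dictionary keys and the function name are illustrative choices.

```python
# Element delays from the table, in picoseconds
DELAYS_PS = {
    "t_pcq_PC": 40,
    "t_setup": 50,
    "t_mux": 25,
    "t_ALU": 120,
    "t_dec": 70,
    "t_mem": 200,
    "t_RFread": 100,
    "t_RFsetup": 60,
}


def single_cycle_period_ps(d):
    """Tc1 = t_pcq_PC + 2*t_mem + t_dec + t_RFread + t_ALU + 2*t_mux + t_RFsetup."""
    return (d["t_pcq_PC"] + 2 * d["t_mem"] + d["t_dec"] + d["t_RFread"]
            + d["t_ALU"] + 2 * d["t_mux"] + d["t_RFsetup"])


tc1_ps = single_cycle_period_ps(DELAYS_PS)  # 840
```

The two memory accesses contribute 400 ps, almost half of the cycle.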





# Single-Cycle Performance Example

Program with 100 billion instructions:

Execution Time = #instructions × CPI × $T_c$ = $(100 \times 10^9)(1)(840 \times 10^{-12} \text{ s})$ = 84 seconds





- Single-cycle:
  - + simple
  - − cycle time limited by longest instruction (LDR)
  - − separate memories for instruction and data
  - − 3 adders/ALUs
- Multicycle processor addresses these issues by breaking instruction into shorter steps:
  - shorter instructions take fewer steps
  - can re-use hardware
  - cycle time is faster










- Single-cycle:
  - + simple
  - − cycle time limited by longest instruction (LDR)
  - − separate memories for instruction and data
  - − 3 adders/ALUs
- Multicycle:
  - + higher clock speed
  - + simpler instructions run faster
  - + reuse expensive hardware on multiple cycles
  - − sequencing overhead paid many times
- Same design steps as single-cycle:
  - first datapath
  - then control





# **Multicycle State Elements**

Replace Instruction and Data memories with a single unified memory – more realistic







## Multicycle Datapath: Instruction Fetch

#### **STEP 1:** Fetch instruction









## Multicycle Datapath: LDR Register Read

#### **STEP 2:** Read source operands from RF









## Multicycle Datapath: LDR Address

#### **STEP 3:** Compute the memory address







## Multicycle Datapath: LDR Memory Read

#### **STEP 4:** Read data from memory







## Multicycle Datapath: LDR Write Register

#### **STEP 5:** Write data back to register file







## Multicycle Datapath: Increment PC

#### **STEP 6:** Increment PC










## Multicycle Datapath: Access to PC

### PC can be read/written by instruction

- Read: R15 (PC+8) available in Register File














#### Example: ADD R1, R15, R2

- R15 needs to be read as PC+8 from Register File (RF) in 2<sup>nd</sup> step
- So (also in 2<sup>nd</sup> step) PC + 8 is produced by ALU and routed to R15 input of RF
  - SrcA = PC (which was already updated in step 1 to PC+4)
  - SrcB = 4
  - ALUResult = PC + 8
- ALUResult is fed to R15 input port of RF in 2<sup>nd</sup> step (which is then routed to RD1 output of RF)
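The arithmetic behind the PC + 8 value can be traced directly. In this sketch the function name and the example address are assumptions for illustration.

```python
def r15_read_value(instr_addr):
    """Value read from R15 by the instruction stored at instr_addr."""
    # Step 1 (Fetch): the PC register is updated to PC + 4
    pc_after_fetch = instr_addr + 4
    # Step 2 (Decode): SrcA = PC (already PC + 4), SrcB = 4
    alu_result = pc_after_fetch + 4
    # ALUResult is routed to the R15 input of the register file
    return alu_result


value = r15_read_value(0x1000)  # 0x1008
```

The ALU adds only 4 in the second step because the fetch step has already advanced the PC once.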












## Multicycle Datapath: Access to PC

#### PC can be read/written by instruction

- Read: R15 (PC+8) available in Register File
- Write: Be able to write result of instruction to PC










#### Example: SUB R15, R8, R3

- Result of instruction needs to be written to the PC register
- ALUResult already routed to the PC register, just assert PCWrite












# Multicycle Datapath: STR

#### Write data in Rd to memory








## Multicycle Datapath: Data-processing

#### With immediate addressing (i.e., an immediate *Src2*), no additional changes needed for datapath







## Multicycle Datapath: Data-processing

#### With register addressing (register *Src2*): Read from Rn and Rm








# Multicycle Datapath: B

#### Calculate branch target address:

- BTA = (*ExtImm*) + (PC+8)
- *ExtImm* = *Imm24* << 2 and sign-extended








# Multicycle Control





Then, Conditional Logic












# Multicycle Control: Decoder



#### **ALU Decoder and PC Logic same as single-cycle**






# Multicycle Control: Instr Decoder

- $RegSrc_0 = (Op == 10_2)$
- $RegSrc_1 = (Op == 01_2)$
- $ImmSrc_{1:0} = Op$

| Instruction  | Op | Funct<sub>5</sub> | Funct<sub>0</sub> | RegSrc<sub>0</sub> | RegSrc<sub>1</sub> | ImmSrc<sub>1:0</sub> |
|--------------|----|-------------------|-------------------|--------------------|--------------------|----------------------|
| LDR          | 01 | X                 | 1                 | 0                  | X                  | 01                   |
| STR          | 01 | X                 | 0                 | 0                  | 1                  | 01                   |
| DP immediate | 00 | 1                 | X                 | 0                  | X                  | 00                   |
| DP register  | 00 | 0                 | X                 | 0                  | 0                  | 00                   |
| B            | 10 | X                 | X                 | 1                  | X                  | 10                   |
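Because the don't-cares can be filled freely, all three outputs depend on Op alone. A minimal sketch, with `instr_decoder` as an assumed name:

```python
def instr_decoder(op):
    """Return (RegSrc0, RegSrc1, ImmSrc) from the 2-bit Op field."""
    reg_src0 = int(op == 0b10)  # 1 only for B
    reg_src1 = int(op == 0b01)  # 1 for memory instructions (X for LDR in the table)
    imm_src = op                # ImmSrc simply copies Op
    return reg_src0, reg_src1, imm_src
```

For LDR this sketch returns RegSrc1 = 1, which is a legal choice because the table marks that entry as a don't-care.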












# Multicycle Control: Main FSM







# Main Controller FSM: Fetch








# Main Controller FSM: Decode








# Main Controller FSM: Address








# Main Controller FSM: Read Memory







# Multicycle ARM Processor







### Main Controller FSM: LDR







### Main Controller FSM: STR







### Main Controller FSM: Data-processing














# Multicycle Controller FSM

| State    | Datapath μOp                |
|----------|-----------------------------|
| Fetch    | Instr ← Mem[PC]; PC ← PC+4  |
| Decode   | ALUOut ← PC+4               |
| MemAdr   | ALUOut ← Rn + Imm           |
| MemRead  | Data ← Mem[ALUOut]          |
| MemWB    | Rd ← Data                   |
| MemWrite | Mem[ALUOut] ← Rd            |
| ExecuteR | ALUOut ← Rn op Rm           |
| ExecuteI | ALUOut ← Rn op Imm          |
| ALUWB    | Rd ← ALUOut                 |
| Branch   | PC ← R15 + offset           |
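The state names suggest the path each instruction type takes through the FSM. The sequences below are an assumption inferred from the state names and their micro-operations, since the state diagram itself is not reproduced here; they agree with the cycle counts given in the performance slides (B: 3, DP and STR: 4, LDR: 5). The names `STATE_SEQUENCES` and `cycles` are illustrative.

```python
# Assumed state order per instruction type (inferred, not taken from a diagram)
STATE_SEQUENCES = {
    "LDR":    ["Fetch", "Decode", "MemAdr", "MemRead", "MemWB"],
    "STR":    ["Fetch", "Decode", "MemAdr", "MemWrite"],
    "DP_REG": ["Fetch", "Decode", "ExecuteR", "ALUWB"],
    "DP_IMM": ["Fetch", "Decode", "ExecuteI", "ALUWB"],
    "B":      ["Fetch", "Decode", "Branch"],
}


def cycles(kind):
    """Number of clock cycles an instruction of this type occupies."""
    return len(STATE_SEQUENCES[kind])
```

Every sequence starts with Fetch and Decode, so the instruction type only matters from the third cycle on.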








# Multicycle Control







# Multicycle Control: Cond. Logic








# Single-Cycle Conditional Logic








## Multicycle Conditional Logic












- Instructions take different numbers of cycles:
  - 3 cycles: B
  - 4 cycles: DP, STR
  - 5 cycles: LDR
- CPI is weighted average
- SPECINT2000 benchmark:
  - 25% loads
  - 10% stores
  - 13% branches
  - 52% data-processing (DP)

Average CPI = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
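The weighted average can be reproduced in a few lines. The dictionary layout and names are illustrative assumptions; the fractions and cycle counts are the ones listed above.

```python
# Instruction class -> (fraction of instructions, cycles per instruction)
MIX = {
    "branch": (0.13, 3),
    "data-processing": (0.52, 4),
    "store": (0.10, 4),
    "load": (0.25, 5),
}


def average_cpi(mix):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


cpi = average_cpi(MIX)  # approximately 4.12
```

Loads are only a quarter of the mix but contribute 1.25 of the 4.12 cycles, because each one takes five cycles.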







Multicycle critical path:

- Assumptions:
  - RF is faster than memory
  - writing memory is faster than reading memory

$$T_{c2} = t_{pcq} + 2t_{mux} + \max(t_{ALU} + t_{mux}, t_{mem}) + t_{setup}$$










| Element             | Parameter     | Delay (ps) |
|---------------------|---------------|------------|
| Register clock-to-Q | $t_{pcq}$     | 40         |
| Register setup      | $t_{setup}$   | 50         |
| Multiplexer         | $t_{mux}$     | 25         |
| ALU                 | $t_{ALU}$     | 120        |
| Decoder             | $t_{dec}$     | 70         |
| Memory read         | $t_{mem}$     | 200        |
| Register file read  | $t_{RFread}$  | 100        |
| Register file setup | $t_{RFsetup}$ | 60         |

$$T_{c2} = t_{pcq} + 2t_{mux} + \max[t_{ALU} + t_{mux}, t_{mem}] + t_{setup}$$

= [40 + 2(25) + 200 + 50] ps = **340 ps**
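As a quick check of the arithmetic, the multicycle critical-path expression can be evaluated directly. This is a sketch using the delays from the table; the dictionary keys and the function name are illustrative.

```python
# Element delays in picoseconds, taken from the table above.
delays_ps = {"t_pcq": 40, "t_setup": 50, "t_mux": 25, "t_ALU": 120, "t_mem": 200}


def multicycle_cycle_time(d):
    """T_c2 = t_pcq + 2*t_mux + max(t_ALU + t_mux, t_mem) + t_setup."""
    return (d["t_pcq"] + 2 * d["t_mux"]
            + max(d["t_ALU"] + d["t_mux"], d["t_mem"])
            + d["t_setup"])


t_c2 = multicycle_cycle_time(delays_ps)
print(f"T_c2 = {t_c2} ps")
```

With these delays the memory read (200 ps) dominates the ALU-plus-mux path (145 ps), so memory sets the cycle time.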





For a program with **100 billion** instructions executing on a **multicycle** ARM processor

- **CPI** = 4.12 cycles/instruction
- Clock cycle time:  $T_{c2}$  = 340 ps

**Execution Time** = (# instructions) × CPI × $T_{c2}$ = $(100 \times 10^{9})(4.12)(340 \times 10^{-12})$ = 140 seconds

This is **slower** than the single-cycle processor (84 sec.)





# **Review: Single-Cycle ARM Processor**








# Review: Multicycle ARM Processor








# **Pipelined ARM Processor**

- Temporal parallelism
- Divide single-cycle processor into 5 stages:
  - Fetch
  - Decode
  - Execute
  - Memory
  - Writeback
- Add pipeline registers between stages
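To make the temporal parallelism concrete, the sketch below prints which instruction occupies each of the five stages in every cycle of an ideal pipeline. It assumes no hazards or stalls; the function name and the `I1`, `I2`, ... labels are illustrative.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]


def pipeline_schedule(num_instructions):
    """Return {cycle: {stage: instruction index}} for an ideal, hazard-free pipeline."""
    schedule = {}
    for instr in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            cycle = instr + offset + 1  # instruction i enters Fetch in cycle i + 1
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


schedule = pipeline_schedule(3)
for cycle in sorted(schedule):
    row = ", ".join(f"{stage}: I{instr + 1}" for stage, instr in schedule[cycle].items())
    print(f"cycle {cycle}: {row}")

total_cycles = max(schedule)  # cycles until the last instruction leaves Writeback
```

Under these assumptions, N instructions finish in N + 4 cycles, so for long programs the throughput approaches one instruction per cycle.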





# Single-Cycle vs. Pipelined

#### Single-Cycle



#### **Pipelined**

[Figure: pipelined timing diagram. Instructions 1-3 each pass through Fetch Instruction, Dec Read Reg, Execute ALU, Memory Read / Write, and Wr Reg, with each instruction starting one stage after the previous one.]




## **Pipelined Processor Abstraction**








# Single-Cycle & Pipelined Datapath

Single-Cycle








# **Corrected Pipelined Datapath**



- WA3 must arrive at same time as Result
- Register file written on falling edge of CLK






# **Optimized Pipelined Datapath**





Remove adder by using *PCPlus4F* after *PC* has been updated to *PC*+4






# **Pipelined Processor Control**



- Same control unit as single-cycle processor
- Control delayed to proper pipeline stage






## Pipeline Hazards

- Occur when an instruction depends on a result from an instruction that hasn't completed
- Types:
  - Data hazard: register value not yet written back to register file
  - Control hazard: next instruction not decided yet (caused by branch)





#### Data Hazard








### Handling Data Hazards

- Insert NOPs in code at compile time
- Rearrange code at compile time
- Forward data at run time
- Stall the processor at run time





# **Compile-Time Hazard Elimination**

- Insert enough NOPs for result to be ready
- Or move independent useful instructions forward










- Check if register read in Execute stage matches register written in Memory or Writeback stage
- If so, forward result














- Execute stage register matches Memory stage register?
  - Match\_1E\_M = (RA1E == WA3M)
  - Match\_2E\_M = (RA2E == WA3M)
- Execute stage register matches Writeback stage register?
  - Match\_1E\_W = (RA1E == WA3W)
  - Match\_2E\_W = (RA2E == WA3W)
- If it matches, forward result:
  - if (Match\_1E\_M • RegWriteM) ForwardAE = 10;
  - else if (Match\_1E\_W • RegWriteW) ForwardAE = 01;
  - else ForwardAE = 00;
- ForwardBE is the same, but uses Match\_2E\_M and Match\_2E\_W
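The forwarding equations can be modelled in software as follows. This is a behavioural sketch, not the hardware description: signal names follow the slide in snake_case, and the helper names `forward_select` and `forwarding_unit` are illustrative.

```python
def forward_select(ra_e, wa3_m, reg_write_m, wa3_w, reg_write_w):
    """Return the 2-bit forwarding select for one Execute-stage source register."""
    match_m = ra_e == wa3_m   # source matches register written by the Memory stage
    match_w = ra_e == wa3_w   # source matches register written by the Writeback stage
    if match_m and reg_write_m:
        return 0b10           # Memory-stage match takes priority (newest value)
    if match_w and reg_write_w:
        return 0b01           # otherwise use the Writeback-stage match
    return 0b00               # no match: no forwarding


def forwarding_unit(ra1_e, ra2_e, wa3_m, reg_write_m, wa3_w, reg_write_w):
    """Compute (ForwardAE, ForwardBE) for the two Execute-stage sources."""
    forward_ae = forward_select(ra1_e, wa3_m, reg_write_m, wa3_w, reg_write_w)
    forward_be = forward_select(ra2_e, wa3_m, reg_write_m, wa3_w, reg_write_w)
    return forward_ae, forward_be


# Source 1 matches the Memory-stage destination, source 2 the Writeback-stage one.
print(forwarding_unit(1, 2, wa3_m=1, reg_write_m=True, wa3_w=2, reg_write_w=True))
```

Checking the Memory stage first matters when both stages write the same register: the Memory-stage instruction is the more recent one, so its result is the one the Execute stage should see.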






## Stalling








## Stalling Hardware








# Stalling Logic

- Is either source register in the Decode stage the same as the one being written in the Execute stage?
  - Match\_12D\_E = (RA1D == WA3E) + (RA2D == WA3E)
- Is an LDR in the Execute stage AND Match\_12D\_E?
  - ldrstall = Match\_12D\_E • MemtoRegE
  - StallF = StallD = FlushE = ldrstall
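A behavioural sketch of the load-use stall check is shown below. It mirrors the equations on this slide; the function name `ldr_stall_controls` and the returned dictionary are illustrative choices, not part of the original design files.

```python
def ldr_stall_controls(ra1_d, ra2_d, wa3_e, mem_to_reg_e):
    """Return StallF, StallD, and FlushE for a possible load-use hazard."""
    # Does either Decode-stage source match the Execute-stage destination?
    match_12d_e = (ra1_d == wa3_e) or (ra2_d == wa3_e)
    # Stall only if the Execute-stage instruction is a load (MemtoRegE asserted).
    ldr_stall = match_12d_e and bool(mem_to_reg_e)
    return {"StallF": ldr_stall, "StallD": ldr_stall, "FlushE": ldr_stall}


# LDR writing R1 is in Execute while the next instruction reads R1 in Decode.
print(ldr_stall_controls(ra1_d=1, ra2_d=5, wa3_e=1, mem_to_reg_e=True))
```

Stalling Fetch and Decode holds the dependent instruction in place for one cycle, while flushing Execute inserts a bubble so the held instruction is not executed twice.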







### **Control Hazards**

- B:
  - Branch not determined until the Writeback stage of pipeline
  - Instructions after branch fetched before branch occurs
  - These 4 instructions must be flushed if branch happens
- Writes to PC (R15) are handled similarly





### **Control Hazards**



#### **Branch misprediction penalty**

- Number of instructions flushed when branch is taken (4)
- May be reduced by determining BTA earlier






## Early Branch Resolution

- Determine BTA in Execute stage
  - Branch misprediction penalty = 2 cycles
- Hardware changes
  - Add a branch multiplexer before *PC* register to select BTA from *ALUResultE*
  - Add BranchTakenE select signal for this multiplexer (only asserted if branch condition satisfied)
  - *PCSrcW* now only asserted for writes to *PC*





#### Pipelined processor with Early BTA








#### Control Hazards with Early BTA








## **Control Stalling Logic**

- *PCWrPendingF* = 1 if write to PC in Decode, Execute or Memory
  - PCWrPendingF = PCSrcD + PCSrcE + PCSrcM
- **Stall Fetch** if *PCWrPendingF* (or *ldrStallD*, as before)
  - StallF = ldrStallD + PCWrPendingF
- **Flush Decode** if *PCWrPendingF* OR PC is written in Writeback OR branch is taken
  - FlushD = PCWrPendingF + PCSrcW + BranchTakenE
- **Flush Execute** if branch is taken (or *ldrStallD*, as before)
  - FlushE = ldrStallD + BranchTakenE
- **Stall Decode** if *ldrStallD* (as before)
  - StallD = ldrStallD
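The stall and flush equations above translate directly into boolean logic. The following is a behavioural sketch with the slide's signal names in snake_case; the function name `hazard_controls` is illustrative.

```python
def hazard_controls(ldr_stall_d, pc_src_d, pc_src_e, pc_src_m, pc_src_w, branch_taken_e):
    """Compute the stall and flush signals from the control stalling equations."""
    # A write to the PC is pending somewhere in Decode, Execute, or Memory.
    pc_wr_pending_f = bool(pc_src_d or pc_src_e or pc_src_m)
    return {
        "StallF": bool(ldr_stall_d or pc_wr_pending_f),
        "StallD": bool(ldr_stall_d),
        "FlushD": bool(pc_wr_pending_f or pc_src_w or branch_taken_e),
        "FlushE": bool(ldr_stall_d or branch_taken_e),
    }


# A taken branch resolved in Execute flushes the two younger instructions.
print(hazard_controls(False, False, False, False, False, branch_taken_e=True))
```

With a taken branch and nothing else pending, only FlushD and FlushE are asserted, which corresponds to the 2-cycle misprediction penalty of early branch resolution.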





#### **ARM Pipelined Processor with Hazard Unit**








- SPECINT2000 benchmark:
  - 25% loads
  - 10% stores
  - 13% branches
  - 52% data-processing
- Suppose:
  - 40% of loads used by next instruction
  - 50% of branches mispredicted
- What is the average CPI?

Load CPI = 1 when not stalling, 2 when stalling

So, $CPI_{LDR} = 1(0.6) + 2(0.4) = 1.4$

Branch CPI = 1 when not stalling, 3 when stalling

So, $CPI_{B} = 1(0.5) + 3(0.5) = 2$

**Average CPI** = (0.25)(1.4) + (0.1)(1) + (0.13)(2) + (0.52)(1) = 1.23
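The same calculation can be written so that the stall assumptions are explicit parameters. This is a sketch: the mix, the 40% and 50% fractions, and the stall penalties (1 cycle per load-use stall, 2 cycles per mispredicted branch) come from the slide, and the variable names are illustrative.

```python
mix = {"load": 0.25, "store": 0.10, "branch": 0.13, "dp": 0.52}

load_use_fraction = 0.40     # loads whose result is used by the next instruction
mispredict_fraction = 0.50   # branches that are mispredicted
load_stall_cycles = 1        # extra cycles per load-use stall (CPI 2 instead of 1)
branch_penalty_cycles = 2    # extra cycles per misprediction (CPI 3 instead of 1)

cpi_load = 1 + load_use_fraction * load_stall_cycles
cpi_branch = 1 + mispredict_fraction * branch_penalty_cycles

pipelined_cpi = (mix["load"] * cpi_load + mix["store"] * 1
                 + mix["branch"] * cpi_branch + mix["dp"] * 1)
print(f"CPI_load = {cpi_load:.1f}, CPI_branch = {cpi_branch:.1f}, average = {pipelined_cpi:.2f}")
```

Setting `mispredict_fraction` lower shows how better branch prediction moves the average CPI toward the ideal value of 1.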






## **Pipelined Performance**

- Pipelined processor critical path:

$$T_{c3} = \max\left[\begin{array}{ll}
t_{pcq} + t_{mem} + t_{setup} & \text{Fetch} \\
2(t_{RFread} + t_{setup}) & \text{Decode} \\
t_{pcq} + 2t_{mux} + t_{ALU} + t_{setup} & \text{Execute} \\
t_{pcq} + t_{mem} + t_{setup} & \text{Memory} \\
2(t_{pcq} + t_{mux} + t_{RFwrite}) & \text{Writeback}
\end{array}\right]$$





| Element             | Parameter     | Delay (ps) |
|---------------------|---------------|------------|
| Register clock-to-Q | $t_{pcq}$     | 40         |
| Register setup      | $t_{setup}$   | 50         |
| Multiplexer         | $t_{mux}$     | 25         |
| ALU                 | $t_{ALU}$     | 120        |
| Memory read         | $t_{mem}$     | 200        |
| Register file read  | $t_{RFread}$  | 100        |
| Register file setup | $t_{RFsetup}$ | 60         |
| Register file write | $t_{RFwrite}$ | 70         |

**Cycle time:** $T_{c3} = 2(t_{RFread} + t_{setup}) = 2[100 + 50]$ ps = **300 ps**
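To see why the Decode stage sets the cycle time, each stage's delay can be evaluated from the per-stage expressions of the pipelined critical path. This is a sketch using the table's delays; the dictionary and variable names are illustrative.

```python
# Element delays in picoseconds, from the table above.
d = {"t_pcq": 40, "t_setup": 50, "t_mux": 25, "t_ALU": 120,
     "t_mem": 200, "t_RFread": 100, "t_RFwrite": 70}

# Per-stage delay expressions for the pipelined processor.
stage_delay_ps = {
    "Fetch":     d["t_pcq"] + d["t_mem"] + d["t_setup"],
    "Decode":    2 * (d["t_RFread"] + d["t_setup"]),
    "Execute":   d["t_pcq"] + 2 * d["t_mux"] + d["t_ALU"] + d["t_setup"],
    "Memory":    d["t_pcq"] + d["t_mem"] + d["t_setup"],
    "Writeback": 2 * (d["t_pcq"] + d["t_mux"] + d["t_RFwrite"]),
}

# The slowest stage determines the clock period.
critical_stage = max(stage_delay_ps, key=stage_delay_ps.get)
t_c3 = stage_delay_ps[critical_stage]
print(f"T_c3 = {t_c3} ps (limited by {critical_stage})")
```

The factor of 2 in the Decode and Writeback terms reflects that the register file is accessed in half a cycle, so those paths must fit in half of the clock period.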





Program with 100 billion instructions

**Execution Time** = (# instructions) × CPI × $T_c$ = $(100 \times 10^{9})(1.23)(300 \times 10^{-12})$ = 36.9 seconds





#### **Processor Performance Comparison**

| Processor    | Execution Time (seconds)       | Speedup (single-cycle as baseline)    |
|--------------|--------------------------------|---------------------------------------|
| Single-cycle | 84                             | 1                                     |
| Multicycle   | 140                            | 0.6                                   |
| Pipelined    | 36.9                           | 2.28                                  |
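The table entries can be reproduced from the execution-time equation. In this sketch the multicycle and pipelined times are computed from the CPI and cycle-time values derived in this section, while the single-cycle time of 84 seconds is taken as given from the slides. The names used are illustrative.

```python
INSTRUCTIONS = 100e9  # 100 billion instructions


def execution_time(instructions, cpi, cycle_time_s):
    """Execution time = (# instructions) x CPI x (seconds per cycle)."""
    return instructions * cpi * cycle_time_s


single_cycle_s = 84.0  # baseline quoted on the slides, not recomputed here
times_s = {
    "Multicycle": execution_time(INSTRUCTIONS, 4.12, 340e-12),
    "Pipelined": execution_time(INSTRUCTIONS, 1.23, 300e-12),
}
speedup = {name: single_cycle_s / t for name, t in times_s.items()}

for name in times_s:
    print(f"{name}: {times_s[name]:.1f} s, speedup {speedup[name]:.2f}")
```

A speedup below 1 for the multicycle design means it is slower than the single-cycle baseline for this instruction mix.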






## Advanced Microarchitecture

- Deep Pipelining
- Micro-operations
- Branch Prediction
- Superscalar Processors
- Out of Order Processors
- Register Renaming
- SIMD
- Multithreading
- Multiprocessors





# **Deep Pipelining**

- 10-20 stages typical
- Number of stages limited by:
  - Pipeline hazards
  - Sequencing overhead
  - Power
  - Cost





#### Micro-operations

- Decompose more complex instructions into a series of simple instructions called *micro-operations* (*micro-ops* or μ-ops)
- At run-time, complex instructions are decoded into one or more micro-ops
- Used heavily in CISC (complex instruction set computer) architectures (e.g., x86)
- Used for some ARM instructions, for example:

| Complex Op       | Micro-op Sequence              |
|------------------|--------------------------------|
| LDR R1, [R2], #4 | LDR R1, [R2]<br>ADD R2, R2, #4 |

Without μ-ops, would need 2nd write port on the register file







#### Micro-operations

- Allow for dense code (fewer memory accesses)
- Yet preserve simplicity of RISC hardware
- ARM strikes balance by choosing instructions that:
  - Give better code density than pure RISC instruction sets (such as MIPS)
  - Enable more efficient decoding than CISC instruction sets (such as x86)





## **Branch Prediction**

- Guess whether branch will be taken
  - Backward branches are usually taken (loops)
  - Consider history to improve guess
- Good prediction reduces fraction of branches requiring a flush





# **Branch Prediction**

- Ideal pipelined processor: CPI = 1
- Branch misprediction increases CPI
- Static branch prediction:
  - Check direction of branch (forward or backward)
  - If backward, predict taken
  - Else, predict not taken
- Dynamic branch prediction:
  - Keep history of last several hundred (or thousand) branches in *branch target buffer*, record:
    - Branch destination
    - Whether branch was taken





### **Branch Prediction Example**

        MOV R1, #0      ; R1 = sum
        MOV R0, #0      ; R0 = i
    FOR                 ; for (i=0; i<10; i=i+1)
        CMP R0, #10
        BGE DONE
        ADD R1, R1, R0
        ADD R0, R0, #1
        B   FOR
    DONE





#### **1-Bit Branch Predictor**

- Remembers whether branch was taken the last time and does the same thing
- Mispredicts first and last branch of loop





#### **2-Bit Branch Predictor**



#### **Only mispredicts last branch of loop**
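The behaviour of both predictors on a 10-iteration loop can be simulated in a few lines. This sketch tracks a single conditional loop-exit branch that is not taken on each iteration and taken once on exit, and runs the loop three times in a row. The 2-bit predictor is modelled as a saturating counter (0 = strongly not taken, 3 = strongly taken), and both predictors start in a not-taken state. These encodings and initial states are assumptions made for illustration.

```python
def loop_branch_outcomes(iterations=10):
    """Exit-branch outcomes for one run of the loop: not taken per iteration, taken on exit."""
    return [False] * iterations + [True]


def mispredictions_1bit(outcomes, last_taken):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken
    return misses, last_taken


def mispredictions_2bit(outcomes, counter):
    """2-bit saturating counter: predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses, counter


def run_loops(predictor, state, runs=3):
    """Run the loop several times, carrying predictor state across runs."""
    per_run = []
    for _ in range(runs):
        misses, state = predictor(loop_branch_outcomes(), state)
        per_run.append(misses)
    return per_run


one_bit = run_loops(mispredictions_1bit, False)
two_bit = run_loops(mispredictions_2bit, 0)
print("1-bit mispredictions per run:", one_bit)
print("2-bit mispredictions per run:", two_bit)
```

Once the loop has run before, the 1-bit predictor misses both the first and the last branch of each run, while the 2-bit predictor needs two consecutive surprises to change its prediction and so misses only the last one.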






### Superscalar

- Multiple copies of datapath execute multiple instructions at once
- Dependencies make it tricky to issue multiple instructions at once








#### Superscalar Example










### Superscalar with Dependencies








## Out of Order Processor

- Looks ahead across multiple instructions
- Issues as many instructions as possible at once
- Issues instructions out of order (as long as no dependencies)
- Dependencies:
  - RAW (read after write): one instruction writes, later instruction reads a register
  - WAR (write after read): one instruction reads, later instruction writes a register
  - WAW (write after write): one instruction writes, later instruction writes a register
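The three dependency types can be detected by comparing the destination and source registers of two instructions in program order. This is a minimal sketch: each instruction is represented as a hypothetical `(dest, srcs)` tuple, and the function name is illustrative.

```python
def classify_dependencies(first, second):
    """Return the set of register dependencies of `second` on the earlier `first`.

    Each instruction is a (dest, srcs) tuple: dest is the register written
    (or None), srcs is a tuple of registers read.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    deps = set()
    if dest1 is not None and dest1 in srcs2:
        deps.add("RAW")   # first writes, second reads the same register
    if dest2 is not None and dest2 in srcs1:
        deps.add("WAR")   # first reads, second writes the same register
    if dest1 is not None and dest1 == dest2:
        deps.add("WAW")   # both write the same register
    return deps


# ADD R1, R2, R3 followed by SUB R4, R1, R5: SUB reads R1 after ADD writes it.
print(classify_dependencies(("R1", ("R2", "R3")), ("R4", ("R1", "R5"))))
```

Only RAW is a true data dependency. WAR and WAW arise from reusing register names, which is what register renaming eliminates.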





# Out of Order Processor

- Instruction level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3)
- Scoreboard: table that keeps track of:
  - Instructions waiting to issue
  - Available functional units
  - Dependencies







### Out of Order Processor Example








### **Register Renaming**








## SIMD

- Single Instruction Multiple Data (SIMD)
  - Single instruction acts on multiple pieces of data at once
  - Common application: graphics
  - Perform short arithmetic operations (also called *packed arithmetic*)
- For example, add eight 8-bit elements
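Packed arithmetic on eight 8-bit elements can be emulated in software by treating a 64-bit integer as eight independent lanes. This sketch assumes wraparound (modulo 256) addition in each lane. Real SIMD instruction sets also offer saturating variants, which are not modelled here. The function name is illustrative.

```python
def packed_add8(a, b):
    """Add eight 8-bit lanes packed into 64-bit integers, with no carry between lanes."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift   # keep 8 bits: wraps modulo 256
    return result


total = packed_add8(0x0102030405060708, 0x1010101010101010)
print(hex(total))
```

SIMD hardware performs all eight lane additions at once with a single instruction, where this loop performs them one at a time.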







### Advanced Architecture Techniques

- Multithreading
  - Word processor: threads for typing, spell checking, printing
- Multiprocessors
  - Multiple processors (cores) on a single chip





# **Threading: Definitions**

- Process: program running on a computer
  - Multiple processes can run at once: e.g., surfing the Web, playing music, writing a paper
- Thread: part of a program
  - Each process can have multiple threads: e.g., a word processor may have threads for typing, spell checking, printing





## Threads in Conventional Processor

- One thread runs at once
- When one thread stalls (for example, waiting for memory):
  - Architectural state of that thread stored
  - Architectural state of waiting thread loaded into processor and it runs
  - Called context switching
- Appears to user like all threads running simultaneously





# Multithreading

- Multiple copies of architectural state
- Multiple threads active at once:
  - When one thread stalls, another runs immediately
  - If one thread can't keep all execution units busy, another thread can use them
- Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput

#### Intel calls this "hyperthreading"







# Multiprocessors

- Multiple processors (cores) with a method of communication between them
- Types:
  - Homogeneous: multiple cores with shared main memory
  - Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone)
  - Clusters: each core has own memory system





## **Other Resources**

- Hennessy & Patterson, *Computer Architecture: A Quantitative Approach*
- Conferences:
  - www.cs.wisc.edu/~arch/www/
  - ISCA (International Symposium on Computer Architecture)
  - HPCA (International Symposium on High Performance Computer Architecture)



