Chapter 4: Q12P (page 363)

This exercise is intended to help you understand the cost/complexity/performance trade-offs of forwarding in a pipelined processor. Problems in this exercise refer to pipelined data paths from Figure 4.45. These problems assume that, of all the instructions executed in a processor, the following fraction of these instructions have a particular type of RAW data dependence. The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both). We assume that the register write is done in the first half of the clock cycle and that register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependencies are not counted because they cannot result in data hazards. Also, assume that the CPI of the processor is 1 if there are no data hazards.
EX to 1st only	MEM to 1st only	EX to 2nd only	MEM to 2nd only	EX to 1st and MEM to 2nd	Other RAW Dependences
5%	20%	5%	10%	10%	10%
Assume the following latencies for individual pipeline stages. For the EX stage, latencies are given separately for a processor without forwarding and for a processor with different kinds of forwarding.
IF	ID	EX (no FW)	EX (full FW)	EX (FW from EX/MEM only)	EX (FW from MEM/WB only)	MEM	WB
150 ps	100 ps	120 ps	150 ps	140 ps	130 ps	120 ps	100 ps
4.12.1 If we use no forwarding, what fraction of cycles are we stalling due to data hazards?
4.12.2 If we use full forwarding (forward all results that can be forwarded), what fraction of cycles are we stalling due to data hazards?
4.12.3 Let us assume that we cannot afford to have three-input Muxes that are needed for full forwarding. We have to decide if it is better to forward only from the EX/MEM pipeline register (next-cycle forwarding) or only from the MEM/WB pipeline register (two-cycle forwarding). Which of the two options results in fewer data stall cycles?
4.12.4 For the given hazard probabilities and pipeline stage latencies, what is the speedup achieved by adding full forwarding to a pipeline that had no forwarding?
4.12.5 What would be the additional speedup (relative to a processor with forwarding) if we added time-travel forwarding that eliminates all data hazards? Assume that the yet-to-be-invented time-travel circuitry adds 100 ps to the latency of the full-forwarding EX stage.
4.12.6 Repeat 4.12.3 but this time determine which of the two options results in a shorter time per instruction.

Short Answer

4.12.1. If no forwarding is used, about 46% of cycles are stall cycles.

4.12.2. If full forwarding is used, about 17% of cycles are stall cycles.

4.12.3. Forwarding only from MEM/WB results in fewer stall cycles than forwarding only from EX/MEM.

4.12.4. The speedup achieved by adding full forwarding to a pipeline that had no forwarding is 1.54.

4.12.5. The additional speedup (relative to a processor with full forwarding) from adding time-travel forwarding that eliminates all data hazards is 0.72, which is a slowdown.

4.12.6. Forwarding only from MEM/WB results in the shorter time per instruction, 202.5 ps.

Step by step solution

Determine forwarding and Stall cycles

A data hazard occurs when an instruction depends on data produced by an earlier instruction that is still in the pipeline.

Pipeline registers hold the ALU result. To prevent data hazards, the values in these registers can be passed directly to subsequent instructions. This is called forwarding.

Forwarding eliminates many data hazards. To detect them, the forwarding unit compares the destination registers of the previous instructions with the source registers of the current instruction.

The results of the previous instructions are taken from the pipeline registers before they are written back to the register file and are forwarded to the instruction that needs them.

A delay in execution of the instruction to resolve a hazard is known as a stall.
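The register comparison performed by the forwarding unit can be sketched in a few lines of Python. This is only an illustrative model, not the textbook's hardware description: the names `PipelineReg` and `forward_select` are hypothetical, and the conditions follow the usual MIPS-style checks (the producer writes a register, that register is not `$0`, and it matches the consumer's source register).

```python
from dataclasses import dataclass


@dataclass
class PipelineReg:
    """Hypothetical view of the fields the forwarding unit inspects."""
    reg_write: bool  # the instruction held in this register writes a register
    rd: int          # its destination register number


def forward_select(rs: int, ex_mem: PipelineReg, mem_wb: PipelineReg) -> str:
    """Pick the source of an ALU operand that reads register rs."""
    # Prefer EX/MEM: it holds the most recent value of the register.
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == rs:
        return "EX/MEM"
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == rs:
        return "MEM/WB"
    # No match: the value read from the register file is up to date.
    return "REGFILE"


# A consumer reading $11 while an instruction writing $11 sits in EX/MEM
print(forward_select(11, PipelineReg(True, 11), PipelineReg(True, 10)))  # EX/MEM
```

Checking EX/MEM before MEM/WB matters when both hold a write to the same register, because the younger result must win.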

Determine stall cycles due to data hazards.

4.12.1 The problem refers to the pipelined datapath of Figure 4.45 in the textbook.

That figure shows the following sequence of instructions in the pipeline:

lw $10, 20($1)

sub $11, $2, $3

add $12, $3, $4

lw $13, 24($1)

add $14, $15, $6

The fraction of executed instructions with each type of RAW data dependence is as follows:

EX to 1st only	MEM to 1st only	EX to 2nd only	MEM to 2nd only	EX to 1st and MEM to 2nd	Other RAW Dependences
5%	20%	5%	10%	10%	10%

The assumptions given in the question are:

The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both).
The register write is done in the first half of the clock cycle and the register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependencies are not counted because they cannot result in data hazards.
The CPI of the processor is 1 if there are no data hazards.

In the five-stage pipeline, these instructions currently occupy the following stages:

lw $10, 20($1): Write Back

sub $11, $2, $3: Memory

add $12, $3, $4: Execution

lw $13, 24($1): Instruction Decode

add $14, $15, $6: Instruction Fetch

From the given assumptions and the figure, without forwarding a dependence on the 1st next instruction results in 2 stall cycles. The stall is also 2 cycles if the dependence is on both the 1st and 2nd next instructions. A dependence on only the 2nd next instruction results in 1 stall cycle.

So, the CPI and stall cycles are calculated as follows:

An instruction with no data hazard has a CPI of 1. Add 2 stall cycles for EX to 1st only, MEM to 1st only, and EX to 1st and MEM to 2nd, and 1 stall cycle for EX to 2nd only and MEM to 2nd only.

CPI:

$\begin{array}{rcl} \text{CPI} & = & 1 + (5\% + 20\% + 10\%) \times 2 + (5\% + 10\%) \times 1 \\ & = & 1 + (0.35 \times 2) + (0.15 \times 1) \\ & = & 1 + 0.7 + 0.15 \\ & = & 1.85 \end{array}$

Stall cycles:

$\begin{array}{rcl} \frac{0.85}{1.85} & = & 0.459 \\ & \approx & 46\% \end{array}$
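The no-forwarding calculation can be checked with a short Python sketch. The dependence fractions come from the problem table and the stall penalties from the reasoning above. The variable names are illustrative only.

```python
# Fraction of instructions with each RAW dependence type (problem table).
fractions = {
    "EX to 1st only": 0.05,
    "MEM to 1st only": 0.20,
    "EX to 2nd only": 0.05,
    "MEM to 2nd only": 0.10,
    "EX to 1st and MEM to 2nd": 0.10,
}

# Stall cycles per dependence type when there is no forwarding.
stalls_no_fw = {
    "EX to 1st only": 2,
    "MEM to 1st only": 2,
    "EX to 2nd only": 1,
    "MEM to 2nd only": 1,
    "EX to 1st and MEM to 2nd": 2,
}

# Average stall cycles per instruction is the weighted sum of penalties.
stall_cycles = sum(fractions[k] * stalls_no_fw[k] for k in fractions)
cpi = 1 + stall_cycles                # base CPI of 1 plus stalls
stall_fraction = stall_cycles / cpi   # share of all cycles spent stalling

print(f"CPI = {cpi:.2f}, stall fraction = {stall_fraction:.1%}")
# CPI = 1.85, stall fraction = 45.9%
```

"Other RAW Dependences" is left out because, under the stated assumptions, those dependences never cause a stall.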

Determine stall cycles due to data hazards with full forwarding

4.12.2 The same dependence fractions apply as in 4.12.1:

EX to 1st only	MEM to 1st only	EX to 2nd only	MEM to 2nd only	EX to 1st and MEM to 2nd	Other RAW Dependences
5%	20%	5%	10%	10%	10%

With full forwarding, the only RAW dependences that still cause stalls are those from the MEM stage of one instruction to the 1st next instruction. Each of these causes only one stall cycle.

CPI:

$\begin{array}{rcl} \text{CPI} & = & 1 + 20\% \times 1 \\ & = & 1 + 0.20 \\ & = & 1.20 \end{array}$

Stall Cycles:

$\begin{array}{rcl} \frac{0.20}{1.20} & = & 0.167 \\ & \approx & 17\% \end{array}$
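As a quick check of the full-forwarding case, the sketch below keeps only the one dependence type that still stalls (MEM to 1st, one cycle) and recomputes the CPI and stall fraction. The names are illustrative.

```python
# With full forwarding, only "MEM to 1st only" still stalls, for one cycle.
mem_to_1st_fraction = 0.20   # from the problem table
stall_per_dependence = 1     # one stall cycle for a load followed by its user

stall_cycles = mem_to_1st_fraction * stall_per_dependence
cpi = 1 + stall_cycles
stall_fraction = stall_cycles / cpi

print(f"CPI = {cpi:.2f}, stall fraction = {stall_fraction:.1%}")
# CPI = 1.20, stall fraction = 16.7%
```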

Determine the option that results in fewer stall cycles.

4.12.3 Assume that the three-input Muxes needed for full forwarding are not affordable, so only one of the two forwarding paths can be implemented.

With forwarding only from the EX/MEM pipeline register (next-cycle forwarding), EX to 1st dependences are satisfied without stalls, but every other dependence (including EX to 1st combined with MEM to 2nd) incurs a one-cycle stall.

Stall cycles of EX/MEM:

$0.2 + 0.05 + 0.1 + 0.1 = 0.45$

So forwarding only from EX/MEM costs 0.45 stall cycles per instruction.

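With forwarding only from the MEM/WB pipeline register (two-cycle forwarding), dependences on the 2nd next instruction cause no stalls. Dependences on the 1st next instruction (EX to 1st, MEM to 1st, and EX to 1st combined with MEM to 2nd) each incur a one-cycle stall, because the consumer must wait until the producer has completed its MEM stage and its result is in the MEM/WB register.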

Stall cycles of MEM/WB:

$0.05 + 0.2 + 0.1 = 0.35$

Comparing the two options, forwarding only from MEM/WB is better: 0.35 stall cycles per instruction versus 0.45 for EX/MEM.
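The comparison can be reproduced with a small Python sketch. The per-dependence stall penalties below are the ones used in this solution for each single-path option. The dictionary and function names are illustrative.

```python
# Fraction of instructions with each RAW dependence type (problem table).
fractions = {
    "EX to 1st only": 0.05,
    "MEM to 1st only": 0.20,
    "EX to 2nd only": 0.05,
    "MEM to 2nd only": 0.10,
    "EX to 1st and MEM to 2nd": 0.10,
}

# Stall cycles per dependence type, as assumed in this solution.
penalties = {
    "EX/MEM only": {
        "EX to 1st only": 0,
        "MEM to 1st only": 1,
        "EX to 2nd only": 1,
        "MEM to 2nd only": 1,
        "EX to 1st and MEM to 2nd": 1,
    },
    "MEM/WB only": {
        "EX to 1st only": 1,
        "MEM to 1st only": 1,
        "EX to 2nd only": 0,
        "MEM to 2nd only": 0,
        "EX to 1st and MEM to 2nd": 1,
    },
}


def stall_cycles(option: str) -> float:
    """Average stall cycles per instruction for one forwarding option."""
    return sum(fractions[k] * penalties[option][k] for k in fractions)


for option in penalties:
    print(f"{option}: {stall_cycles(option):.2f} stall cycles per instruction")

best = min(penalties, key=stall_cycles)
print("Fewer stalls:", best)  # Fewer stalls: MEM/WB only
```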

Determine the speedup achieved by adding full forwarding to a pipeline that had no forwarding.

4.12.4 The CPI is 1.85 without forwarding and 1.20 with full forwarding (from 4.12.1 and 4.12.2).

IF	ID	EX (no FW)	EX (full FW)	EX (FW from EX/MEM only)	EX (FW from MEM/WB only)	MEM	WB
150 ps	100 ps	120 ps	150 ps	140 ps	130 ps	120 ps	100 ps

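The clock cycle time is set by the slowest pipeline stage. From the table above, that is 150 ps both without forwarding (IF takes 150 ps) and with full forwarding (IF and EX both take 150 ps).

Time per instruction without forwarding:

$1.85 \times 150 ps = 277.5 ps$

Time per instruction with full forwarding:

$1.20 \times 150 ps = 180 ps$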

Speed up:

$\frac{277.5}{180} = 1.54$
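The speedup can be verified with a short sketch that takes the clock period as the latency of the slowest stage and multiplies it by the CPI. Stage latencies are from the problem table and CPIs from 4.12.1 and 4.12.2. The helper name `time_per_instruction_ps` is illustrative.

```python
# Latencies (ps) of the stages that do not depend on the forwarding scheme.
fixed_stage_ps = {"IF": 150, "ID": 100, "MEM": 120, "WB": 100}

# EX-stage latency (ps) and CPI for each scheme.
ex_stage_ps = {"no forwarding": 120, "full forwarding": 150}
cpi = {"no forwarding": 1.85, "full forwarding": 1.20}


def time_per_instruction_ps(scheme: str) -> float:
    """CPI times the clock period, where the period is the slowest stage."""
    clock_ps = max(max(fixed_stage_ps.values()), ex_stage_ps[scheme])
    return cpi[scheme] * clock_ps


speedup = (time_per_instruction_ps("no forwarding")
           / time_per_instruction_ps("full forwarding"))
print(f"Speedup = {speedup:.2f}")  # Speedup = 1.54
```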

Determine the additional speedup achieved by adding time-travel forwarding to a pipeline with full forwarding

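4.12.5 Assume that the yet-to-be-invented time-travel circuitry adds 100 ps to the latency of the full-forwarding EX stage, so the EX stage takes 150 ps + 100 ps = 250 ps and becomes the slowest stage.

The additional speedup (relative to a processor with full forwarding) from time-travel forwarding that eliminates all data hazards is calculated as follows:

Time per instruction with full forwarding (CPI from 4.12.2):

$1.20 \times 150 ps = 180 ps$

Time per instruction with time-travel forwarding (CPI of 1):

$1 \times 250 ps = 250 ps$

Speedup:

$\frac{180}{250} = 0.72$

A speedup below 1 means that time-travel forwarding actually slows the processor down.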
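The time-travel case can be sketched the same way. The only assumptions beyond the problem statement are the variable names. The 100 ps penalty is added to the 150 ps full-forwarding EX stage, and the CPI drops to 1.

```python
# Stage latencies (ps) with full forwarding, from the problem table.
full_fw_stage_ps = {"IF": 150, "ID": 100, "EX": 150, "MEM": 120, "WB": 100}

# Time-travel forwarding adds 100 ps to the EX stage.
time_travel_stage_ps = dict(full_fw_stage_ps, EX=full_fw_stage_ps["EX"] + 100)

# The clock period is the latency of the slowest stage.
full_fw_clock_ps = max(full_fw_stage_ps.values())          # 150
time_travel_clock_ps = max(time_travel_stage_ps.values())  # 250

full_fw_time_ps = 1.20 * full_fw_clock_ps          # CPI from 4.12.2
time_travel_time_ps = 1.0 * time_travel_clock_ps   # no data-hazard stalls

speedup = full_fw_time_ps / time_travel_time_ps
print(f"Speedup = {speedup:.2f}")  # Speedup = 0.72
```

The removed stalls save 0.2 cycles per instruction, but the longer clock period costs more than that saves.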

Determine which of the two options from 4.12.3 has shorter time per instruction.

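4.12.6 From 4.12.3, the stall cycles per instruction give the following CPIs:

CPI of EX/MEM = 1.45

CPI of MEM/WB = 1.35

In both cases the EX stage (140 ps or 130 ps) is faster than the 150 ps IF stage, so the clock cycle time is 150 ps.

Time per instruction with EX/MEM forwarding:

$1.45 \times 150 ps = 217.5 ps$

Time per instruction with MEM/WB forwarding:

$1.35 \times 150 ps = 202.5 ps$

Comparing the two, forwarding only from MEM/WB has the shorter time per instruction.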
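A final sketch compares the two single-path options by time per instruction. EX latencies come from the problem table and the CPIs from 4.12.3. The names are illustrative.

```python
# Latencies (ps) of the stages that do not depend on the forwarding scheme.
fixed_stage_ps = {"IF": 150, "ID": 100, "MEM": 120, "WB": 100}

# EX-stage latency (ps) and CPI for each single-path forwarding option.
options = {
    "EX/MEM only": {"ex_ps": 140, "cpi": 1.45},
    "MEM/WB only": {"ex_ps": 130, "cpi": 1.35},
}

time_ps = {}
for name, opt in options.items():
    # IF (150 ps) stays the slowest stage, so the clock period is 150 ps.
    clock_ps = max(max(fixed_stage_ps.values()), opt["ex_ps"])
    time_ps[name] = opt["cpi"] * clock_ps
    print(f"{name}: {time_ps[name]:.1f} ps per instruction")

print("Shorter time per instruction:", min(time_ps, key=time_ps.get))
```

Because the clock period is the same for both options, the option with the lower CPI is also the one with the shorter time per instruction.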
