Warning: foreach() argument must be of type array|object, bool given in /var/www/html/web/app/themes/studypress-core-theme/template-parts/header/mobile-offcanvas.php on line 20

This exercise is intended to help you understand the cost/complexity/performance trade-offs of forwarding in a pipelined processor. Problems in this exercise refer to pipelined data paths from Figure 4.45. These problems assume that, of all the instructions executed in a processor, the following fraction of these instructions have a particular type of RAW data dependence. The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both). We assume that the register write is done in the first half of the clock cycle and that register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependencies are not counted because they cannot result in data hazards. Also, assume that the CPI of the processor is 1 if there are no data hazards.

Ex to 1st only

MEM to 1st only

EX to 2nd only

MEM to 2nd only

EX to 1st and MEM to 2nd

Other RAW Dependences

5%

20%

5%

10%

10%

10%

Assume the following latencies for individual pipeline stages. For the EX stage, latencies are given separately for a processor without forwarding and for a processor with different kinds of forwarding.

IF

ID

EX(no FW)

EX (full FW)

EX(FW from EX/MEM only)

Ex(FW from MEM/WB only)

MEM

WB

150ps

100ps

120ps

150ps

140ps

130ps

120ps

100ps

4.12.1 If we use no forwarding, what fraction of cycles are we stalling due to data hazards?

4.12.2 If we use full forwarding (forward all results that can be forwarded), what fraction of cycles are we staling due to data hazards?

4.12.3 Let us assume that we cannot afford to have three-input Muxes that are needed for full forwarding. We have to decide if it is better to forward only from the EX/MEM pipeline register (next-cycle forwarding) or only from the MEM/WB pipeline register (two-cycle forwarding). Which of the two options results in fewer data stall cycles?

4.12.4 For the given hazard probabilities and pipeline stage latencies, what is the speedup achieved by adding full forwarding to a pipeline that had no forwarding?

4.12.5 What would be the additional speedup (relative to a processor with forwarding) if we added time-travel forwarding that eliminates all data hazards? Assume that the yet-to-be-invented time-travel circuitry adds 100 ps to the latency of the full-forwarding EX stage.

4.12.6 Repeat 4.12.3 but this time determine which of the two options results in a shorter time per instruction.

Short Answer

Expert verified

4.12.1. If no forwarding is used, the stall cycles are 46%.

4.12.2. If full forwarding is used, the stall cycles are 17%.

4.12.3.MEM/WB has fewer stall cycles compared to EX/MEM.

4.12.4. The speedup achieved by adding full forwarding to a pipeline that had no forwarding is 1.54.

4.12.5.The additional speedup (relative to a processor with forwarding) if we added time-travel forwarding that eliminates all data hazards is 0.72

4.10.6 MEM/WB results in a shorter time per instruction with 202.5 ps.

Step by step solution

01

Determine forwarding and Stall cycles

In a sequence of instructions, when the instructions depend on other instructions data for execution, then there occurs a data hazard.

Pipeline registers will hold the ALU result, to prevent data hazards, the values of the registers can be forwarded to the subsequent instruction. This is called forwarding.

The data hazards will be eliminated by forwarding. The destination registers of the previous instructions will be compared with the source registers of the current instruction by the forwarding unit to detect data hazards.

The results of the previous instruction will be fetched before they are written back to the register file and forwarded to the next instruction.

A delay in execution of the instruction to resolve a hazard is known as a stall.

02

Determine stall cycles due to data hazards.

4.12.1 Given the reference of Figure 4.45 in the text, refer to the book.

From the figure reference, the following sequence of instructions is used in the problem.

lw $10, 20($1)

sub $11, $2, $3

add $12, $3, $4

lw $13, 24($1)

add $14, $15, $6

For the above sequence of instructions, the, CPI for each instruction is as follows:

Ex to 1st only

MEM to 1st only

EX to 2nd only

MEM to 2nd only

EX to 1st and MEM to 2nd

Other RAW Dependences

5%

20%

5%

10%

10%

10%

The given assumptions in the question:

  • The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both).
  • The register write is done in the first half of the clock cycle and the register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependencies are not counted because they cannot result in data hazards.
  • The CPI of the processor is 1 if there are no data hazards.

From the 5 stage pipeline, the current stages of the instructions.

lw $10, 20($1)- Write back

sub $11, $2, $3-Memory

add $12, $3, $4-Execution

lw $13, 24($1)-Instruction Decode

add $14, $15, $6-Instrcution Fetch

From the given assumptions and the figure, the dependence on the 1stnext instruction results in 2 stall cycles. For the stall, it takes 2 cycles if the dependences are to both 1st and 2nd next instruction. Dependences to only the 2nd next instruction result in 1 stall.

So, the CPI and Stall Cycles are calculated as follows:

For instruction, that has no data hazard CPI is 1, then 2 stalls for EX to 1st, MEM to 1st andEX to 1st and MEM to 2nd. Finally 1 stall for EX to 2nd only and MEM to 2nd only.

CPI:

1+10percent+20percent+5percent×2+5percent+10percent×1=1+35percent×215percent×1=1+0.35×2+0.15×1=1+0.7+0.15=1.85

Stall cycles:

0.851.85=0.4590.459×100=45.9=46Percent

03

Determine stall cycles due to data hazards with full forwarding

4.12.2 Refer the mentioned figure 4.45 for the instruction and the CPI is as follows:

CPI for each instruction is as follows:

Ex to 1st only

MEM to 1st only

EX to 2nd only

MEM to 2nd only

EX to 1st and MEM to 2nd

Other RAW Dependences

5%

20%

5%

10%

10%

10%

Considering the full forwarding, the MEM stage of one instruction to the 1st next instruction causes RAW dependences. These will cause only one stall cycle.

CPI:

1+20percent=1+0.20=1.21

Stall Cycles:

0.201.20=1.66666671.6666667×100=16.7=17percent

04

Determine the option that results in fewer stall cycles.

4.12.3 Assuming that the three-input Muxes are not affordable, they are needed for full forwarding.

From the given in the question:

To forward only from the EX/MEM pipeline register (next-cycle forwarding), EX to 1st dependences can be done without stalls. But other dependences incur one cycle stall.

Stall cycles of EX/MEM:

0.2+0.05+0.1+0.1=0.45

Comparing both the options, MEM/WB is the better one with 0.45 stall cycles.

To forward only from the MEM/WB pipeline register (two-cycle forwarding), EX to 2nd has no stalls. Because, to forward the next instruction it has to wait for the instruction to complete the MEM stage.

Stall cycles of MEM/WB:

0.05+0.2+0.1=0.35

Comparing both the options, MEM/WB is the better one with 0.35 stall cycles.

05

Determine the speedup achieved by adding full forwarding to a pipeline that had no forwarding.

4.12.4 The CPI without forwarding and full forwarding is 1.85 and 1.20 (From 4.12.1 and 4.12.2).

IF

ID

EX(no FW)

EX (full FW)

EX(FW from EX/MEM only)

Ex(FW from MEM/WB only)

MEM

WB

150ps

100ps

120ps

150ps

140ps

130ps

120ps

100ps

From the above table, the latencies for individual pipelining can be considered.

Clock cycle time without forwarding:

1.85×150ps=277.5ps

Clock cycle time with forwarding:

1.20×150ps=180ps

Speed up:

277.5180=1.54

06

Determine the additional speedup achieved by adding full forwarding to a pipeline that had no forwarding 

4.12.5Assuming that the yet-to-be-invented time-travel circuitry adds 100 ps to the latency of the full-forwarding EX stage.

The additional speedup (relative to a processor with forwarding) if time-travel forwarding is added that eliminates all data hazards is calculated as follows:

With full forwarding: (CPI from 4.12.4)

1.20×150ps=180ps

Time-travel forwarding:

1×250ps=250ps

Speedup:

180250=0.72

07

Determine which of the two options from 4.12.3 has shorter time per instruction.

4.12.6

CPI of EX/MEM=1.45

CPI of MEM/WB=1.35

Clock cycle of EX/MEM:

1.45×150ps=217.5

Clock cycle of MEM/WB:

1.35×150ps=202.5ps

Comparing the clock cycles, MEM/WB has shorter time per instruction.

Unlock Step-by-Step Solutions & Ace Your Exams!

  • Full Textbook Solutions

    Get detailed explanations and key concepts

  • Unlimited Al creation

    Al flashcards, explanations, exams and more...

  • Ads-free access

    To over 500 millions flashcards

  • Money-back guarantee

    We refund you if you fail your exam.

Over 30 million students worldwide already upgrade their learning with Vaia!

One App. One Place for Learning.

All the tools & learning materials you need for study success - in one app.

Get started for free

Most popular questions from this chapter

In this exercise, we examine how pipelining affects the clock cycle time of the processor. Problems in this exercise assume that individual stages of the datapath have the following latencies:

IF

ID

EX

MEM

WB

250ps

350ps

150ps

300ps

200ps

Also, assume that instructions executed by the processor are broken down as follows:

alu

beq

lw

sw

45%

20%

20%

15%

4.8.1 [5] What is the clock cycle time in a pipelined and non-pipelined processor?

4.8.2 [10] What is the total latency of an LW instruction in a pipelined and non-pipelined processor?

4.8.3 [10] If we can split one stage of the pipelined datapath into two new stages, each with half the latency of the original stage, which stage would you split and what is the new clock cycle time of the processor? 4.8.4 [10] Assuming there are no stalls or hazards, what is the utilization of the data memory?

4.8.5 [10] Assuming there are no stalls or hazards, what is the utilization of the write-register port of the “Registers” unit? 4.8.6 [30] Instead of a single-cycle organization, we can use a multi-cycle organization where each instruction takes multiple cycles but one instruction finishes before another is fetched. In this organization, an instruction only goes through stages it actually needs (e.g., ST only takes 4 cycles because it does not need the WB stage). Compare clock cycle times and execution times with singlecycle, multi-cycle, and pipelined organization.

When silicon chips are fabricated, defects in materials (e.g., silicon) and manufacturing errors can result in defective circuits. A very common defect is for one wire to affect the signal in another. This is called a cross-talk fault. A special class of cross-talk faults is when a signal is connected to a wire that has a constant logical value (e.g., a power supply wire). In this case we have a stuck-at-0 or a stuckat-1 fault, and the affected signal always has a logical value of 0 or 1, respectively. The following problems refer to bit 0 of the Write Register input on the register fi le in Figure 4.24. 4.6.1 [10] Let us assume that processor testing is done by filling the PC, registers, and data and instruction memories with some values (you can choose which values), letting a single instruction execute, then reading the PC, memories, and registers. These values are then examined to determine if a particular fault is present. Can you design a test (values for PC, memories, and registers) that would determine if there is a stuck-at-0 fault on this signal? 4.6.2 [10] Repeat 4.6.1 for a stuck-at-1 fault. Can you use a single test for both stuck-at-0 and stuck-at-1? If yes, explain how; if no, explain why not. 4.6.3 [60] If we know that the processor has a stuck-at-1 fault on this signal, is the processor still usable? To be usable, we must be able to convert any program that executes on a normal MIPS processor into a program that works on this processor. You can assume that there is enough free instruction memory and data memory to let you make the program longer and store additional data. Hint: the processor is usable if every instruction “broken” by this fault can be replaced with a sequence of “working” instructions that achieve the same effect. 4.6.4 [10] Repeat 4.6.1, but now the fault to test for is whether the “MemRead” control signal becomes 0 if RegDst control signal is 0, no fault otherwise. 4.6.5 [10] Repeat 4.6.4, but now the fault to test for is whether the “Jump” control signal becomes 0 if RegDst control signal is 0, no fault otherwise.

This exercise is intended to help you understand the cost/complexity/performance trade-offs of forwarding in a pipelined processor. Problems in this exercise refer to pipelined data paths from Figure 4.45. These problems assume that, of all the instructions executed in a processor, the following fraction of these instructions have a particular type of RAW data dependence. The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both). We assume that the register write is done in the first half of the clock cycle and that register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependencies are not counted because they cannot result in data hazards. Also, assume that the CPI of the processor is 1 if there are no data hazards.

Ex to 1st only

MEM to 1st only

EX to 2nd only

MEM to 2nd only

EX to 1st and MEM to 2nd

Other RAW Dependences

5%

20%

5%

10%

10%

10%

Assume the following latencies for individual pipeline stages. For the EX stage, latencies are given separately for a processor without forwarding and for a processor with different kinds of forwarding.

IF

ID

EX(no FW)

EX (full FW)

EX(FW from EX/MEM only)

Ex(FW from MEM/WB only)

MEM

WB

150ps

100ps

120ps

150ps

140ps

130ps

120ps

100ps

4.12.1 If we use no forwarding, what fraction of cycles are we stalling due to data hazards?

4.12.2 If we use full forwarding (forward all results that can be forwarded), what fraction of cycles are we staling due to data hazards?

4.12.3 Let us assume that we cannot afford to have three-input Muxes that are needed for full forwarding. We have to decide if it is better to forward only from the EX/MEM pipeline register (next-cycle forwarding) or only from the MEM/WB pipeline register (two-cycle forwarding). Which of the two options results in fewer data stall cycles?

4.12.4 For the given hazard probabilities and pipeline stage latencies, what is the speedup achieved by adding full forwarding to a pipeline that had no forwarding?

4.12.5 What would be the additional speedup (relative to a processor with forwarding) if we added time-travel forwarding that eliminates all data hazards? Assume that the yet-to-be-invented time-travel circuitry adds 100 ps to the latency of the full-forwarding EX stage.

4.12.6 Repeat 4.12.3 but this time determine which of the two options results in a shorter time per instruction.

Question: For the problems in this exercise, assume that there are no pipeline stalls and that the breakdown of executed instructions is as follows: add addi not beq lw sw 20% 20% 0% 25% 25% 10% 4.5.1 [10] In what fraction of all cycles is the data memory used? 4.5.2 [10] In what fraction of all cycles is the input of the sign-extend circuit needed? What is this circuit doing in cycles in which its input is not needed.

In this exercise, we examine how resource hazards, control hazards, and Instruction Set Architecture (ISA) design can affect pipelined execution. Problems in this exercise refer to the following fragment of MIPS code:

sw r16,12(r6)

lw r16,8(r6)

beq r5,r4,Label # Assume r5!=r4

add r5,r1,r4

slt r5,r15,r4

Assume that individual pipeline stages have the following latencies:

IF

ID

EX

MEM

WB

200ps

120ps

150ps

190ps

100ps

4.10.1 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we only have one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the 5-stage pipeline that only has one memory? We have seen that data hazards can be eliminated by addingnops to the code. Can you do the same with this structural hazard? Why?

4.10.2 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we change load/store instructions to use a register (without an offset) as the address, these instructions no longer need to use the ALU. As a result, MEM and EX stages can be overlapped and the pipeline has only 4 stages. Change this code to accommodate this changed ISA. Assuming this change does not affect clock cycle time, what speedup is achieved in this instruction sequence?

4.10.3 Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage?

4.10.4. Given these pipeline stage latencies, repeat the speedup calculation from 4.10.2, but take into account the (possible) change in clock cycle time. When EX and MEM are done in a single stage, most of their work can be done in parallel. As a result, the resulting EX/MEM stage has a latency that is the larger of the original two, plus 20 ps needed for the work that could not be done in parallel.

4.10.5Given these pipeline stage latencies, repeat the speedup calculation from 4.10.3, taking into account the (possible) change in clock cycle time. Assume that the latency ID stage increases by 50% and the latency of the EX stage decrease by 10ps when branch outcome resolution is moved from EX to I

4.10.6 Assuming stall-on-branch and no delay slots, what is the new clock cycle time and execution time of this instruction sequence ifbeqaddress computation is moved to the MEM stage? What is the speedup from this change? Assume that the latency of the EX stage is reduced by 20 ps and the latency of the MEM stage is unchanged when branch outcome resolution is moved from EX to MEM.

See all solutions

Recommended explanations on Computer Science Textbooks

View all explanations

What do you think about this solution?

We value your feedback to improve our textbook solutions.

Study anywhere. Anytime. Across all devices.

Sign-up for free