Warning: foreach() argument must be of type array|object, bool given in /var/www/html/web/app/themes/studypress-core-theme/template-parts/header/mobile-offcanvas.php on line 20

This exercise is intended to help you understand the cost/complexity/performance trade-offs of forwarding in a pipelined processor. Problems in this exercise refer to pipelined data paths from Figure 4.45. These problems assume that, of all the instructions executed in a processor, the following fraction of these instructions have a particular type of RAW data dependence. The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both). We assume that the register write is done in the first half of the clock cycle and that register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependencies are not counted because they cannot result in data hazards. Also, assume that the CPI of the processor is 1 if there are no data hazards.

Ex to 1st only

MEM to 1st only

EX to 2nd only

MEM to 2nd only

EX to 1st and MEM to 2nd

Other RAW Dependences

5%

20%

5%

10%

10%

10%

Assume the following latencies for individual pipeline stages. For the EX stage, latencies are given separately for a processor without forwarding and for a processor with different kinds of forwarding.

IF

ID

EX(no FW)

EX (full FW)

EX(FW from EX/MEM only)

Ex(FW from MEM/WB only)

MEM

WB

150ps

100ps

120ps

150ps

140ps

130ps

120ps

100ps

4.12.1 If we use no forwarding, what fraction of cycles are we stalling due to data hazards?

4.12.2 If we use full forwarding (forward all results that can be forwarded), what fraction of cycles are we staling due to data hazards?

4.12.3 Let us assume that we cannot afford to have three-input Muxes that are needed for full forwarding. We have to decide if it is better to forward only from the EX/MEM pipeline register (next-cycle forwarding) or only from the MEM/WB pipeline register (two-cycle forwarding). Which of the two options results in fewer data stall cycles?

4.12.4 For the given hazard probabilities and pipeline stage latencies, what is the speedup achieved by adding full forwarding to a pipeline that had no forwarding?

4.12.5 What would be the additional speedup (relative to a processor with forwarding) if we added time-travel forwarding that eliminates all data hazards? Assume that the yet-to-be-invented time-travel circuitry adds 100 ps to the latency of the full-forwarding EX stage.

4.12.6 Repeat 4.12.3 but this time determine which of the two options results in a shorter time per instruction.

Short Answer

Expert verified

4.12.1. If no forwarding is used, the stall cycles are 46%.

4.12.2. If full forwarding is used, the stall cycles are 17%.

4.12.3.MEM/WB has fewer stall cycles compared to EX/MEM.

4.12.4. The speedup achieved by adding full forwarding to a pipeline that had no forwarding is 1.54.

4.12.5.The additional speedup (relative to a processor with forwarding) if we added time-travel forwarding that eliminates all data hazards is 0.72

4.10.6 MEM/WB results in a shorter time per instruction with 202.5 ps.

Step by step solution

01

Determine forwarding and Stall cycles

In a sequence of instructions, when the instructions depend on other instructions data for execution, then there occurs a data hazard.

Pipeline registers will hold the ALU result, to prevent data hazards, the values of the registers can be forwarded to the subsequent instruction. This is called forwarding.

The data hazards will be eliminated by forwarding. The destination registers of the previous instructions will be compared with the source registers of the current instruction by the forwarding unit to detect data hazards.

The results of the previous instruction will be fetched before they are written back to the register file and forwarded to the next instruction.

A delay in execution of the instruction to resolve a hazard is known as a stall.

02

Determine stall cycles due to data hazards.

4.12.1 Given the reference of Figure 4.45 in the text, refer to the book.

From the figure reference, the following sequence of instructions is used in the problem.

lw $10, 20($1)

sub $11, $2, $3

add $12, $3, $4

lw $13, 24($1)

add $14, $15, $6

For the above sequence of instructions, the, CPI for each instruction is as follows:

Ex to 1st only

MEM to 1st only

EX to 2nd only

MEM to 2nd only

EX to 1st and MEM to 2nd

Other RAW Dependences

5%

20%

5%

10%

10%

10%

The given assumptions in the question:

  • The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both).
  • The register write is done in the first half of the clock cycle and the register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependencies are not counted because they cannot result in data hazards.
  • The CPI of the processor is 1 if there are no data hazards.

From the 5 stage pipeline, the current stages of the instructions.

lw $10, 20($1)- Write back

sub $11, $2, $3-Memory

add $12, $3, $4-Execution

lw $13, 24($1)-Instruction Decode

add $14, $15, $6-Instrcution Fetch

From the given assumptions and the figure, the dependence on the 1stnext instruction results in 2 stall cycles. For the stall, it takes 2 cycles if the dependences are to both 1st and 2nd next instruction. Dependences to only the 2nd next instruction result in 1 stall.

So, the CPI and Stall Cycles are calculated as follows:

For instruction, that has no data hazard CPI is 1, then 2 stalls for EX to 1st, MEM to 1st andEX to 1st and MEM to 2nd. Finally 1 stall for EX to 2nd only and MEM to 2nd only.

CPI:

1+10%+20%+5%×2+5%+10%×1=1+35%×2+15%×1=1+0.35×2+0.15×1=1+0.7+0.15=1.85

Stall Cycles:

0.851.85=0.4590.459×100=45.9=46%

03

Determine stall cycles due to data hazards with full forwarding

.12.2 Refer the mentioned figure 4.45 for the instruction and the CPI is as follows:

CPI for each instruction is as follows:

Ex to 1st only

MEM to 1st only

EX to 2nd only

MEM to 2nd only

EX to 1st and MEM to 2nd

Other RAW Dependences

5%

20%

5%

10%

10%

10%

Considering the full forwarding, the MEM stage of one instruction to the 1st next instruction causes RAW dependences. These will cause only one stall cycle.

CPI:

1+20%=1+0.20=1.20

Stall Cycles:

0.201.20=1.66666671.666667×100=16.7=17%

04

Determine the option that results in fewer stall cycles.

4.12.3 Assuming that the three-input Muxes are not affordable, they are needed for full forwarding.

From the given in the question:

To forward only from the EX/MEM pipeline register (next-cycle forwarding), EX to 1st dependences can be done without stalls. But other dependences incur one cycle stall.

Stall cycles of EX/MEM:

$0.2+0.05+0.1+0.1=0.45$

To forward only from the MEM/WB pipeline register (two-cycle forwarding), EX to 2nd has no stalls. Because, to forward the next instruction it has to wait for the instruction to complete the MEM stage.

Stall cycles of MEM/WB:

$0.05+0.2+0.1=0.35$

Comparing both the options, MEM/WB is the better one with 0.35 stall cycles.

05

Determine the speedup achieved by adding full forwarding to a pipeline that had no forwarding.

4.12.4 The CPI without forwarding and full forwarding is 1.85 and 1.20 (From 4.12.1 and 4.12.2).

IF

ID

EX(no FW)

EX (full FW)

EX(FW from EX/MEM only)

Ex(FW from MEM/WB only)

MEM

WB

150ps

100ps

120ps

150ps

140ps

130ps

120ps

100ps

From the above table, the latencies for individual pipelining can be considered.

Clock cycle time without forwarding:

1.85×150ps=277.5ps

Clock cycle time with forwarding:

1.20×150ps=180ps

Speed up:

277.5180=1.54

06

Determine the additional speedup achieved by adding full forwarding to a pipeline that had no forwarding.

4.12.5Assuming that the yet-to-be-invented time-travel circuitry adds 100 ps to the latency of the full-forwarding EX stage.

The additional speedup (relative to a processor with forwarding) if time-travel forwarding is added that eliminates all data hazards is calculated as follows:

With full forwarding: (CPI from 4.12.4)

1.20×150ps=180ps

Time-travel forwarding:

1×250ps = 250ps

Speedup:

180250=0.72

07

Determine which of the two options from 4.12.3 has shorter time per instruction.

4.12.6

CPI of EX/MEM=1.45

CPI of MEM/WB=1.35

Clock cycle of EX/MEM:

1.45×150ps = 217.5

Clock cycle of MEM/WB:

1.35×150ps = 202.5ps

Comparing the clock cycles, MEM/WB has shorter time per instruction.

Unlock Step-by-Step Solutions & Ace Your Exams!

  • Full Textbook Solutions

    Get detailed explanations and key concepts

  • Unlimited Al creation

    Al flashcards, explanations, exams and more...

  • Ads-free access

    To over 500 millions flashcards

  • Money-back guarantee

    We refund you if you fail your exam.

Over 30 million students worldwide already upgrade their learning with Vaia!

One App. One Place for Learning.

All the tools & learning materials you need for study success - in one app.

Get started for free

Most popular questions from this chapter

4.18 In this exercise we compare the performance of 1-issue and 2-issue processors, taking into account program transformations that can be made to optimize for 2-issue execution. Problems in this exercise refer to the following loop (written in C):

for(i=0;i!=j;i+=2)

b[i]=a[i]–a[i+1];

When writing MIPS code, assume that variables are kept in registers as follows, and that all registers except those indicated as Free are used to keep various variables, so they cannot be used for anything else.

I

j

a

b

c

Free

R5

R6

R1

R2

R3

R10, R11, R12

4.18.1 [10] Translate this C code into MIPS instructions. Your translation should be direct, without rearranging instructions to achieve better performance.

4.18.2 [10] If the loop exits aft er executing only two iterations, draw a pipeline diagram for your MIPS code from 4.18.1 executed on a 2-issue processor shown in Figure 4.69. Assume the processor has perfect branch prediction and can fetch any two instructions (not just consecutive instructions) in the same cycle.

4.18.3 [10] Rearrange your code from 4.18.1 to achieve better performance on a 2-issue statically scheduled processor from Figure 4.69.

4.18.4 [10] Repeat 4.18.2, but this time use your MIPS code from 4.18.3. 4.18.5 [10] What is the speedup of going from a 1-issue processor to a 2-issue processor from Figure 4.69? Use your code from 4.18.1 for both 1-issue and 2-issue, and assume that 1,000,000 iterations of the loop are executed. As in 4.18.2, assume that the processor has perfect branch predictions, and that a 2-issue processor can fetch any two instructions in the same cycle.

4.18.6 [10] Repeat 4.18.5, but this time assume that in the 2-issue processor one of the instructions to be executed in a cycle can be of any kind, and the other must be a non-memory instruction

Question: When silicon chips are fabricated, defects in materials (e.g., silicon) and manufacturing errors can result in defective circuits. A very common defect is for one wire to affect the signal in another. This is called a cross-talk fault. A special class of cross-talk faults is when a signal is connected to a wire that has a constant logical value (e.g., a power supply wire). In this case we have a stuck-at-0 or a stuckat-1 fault, and the affected signal always has a logical value of 0 or 1, respectively. The following problems refer to bit 0 of the Write Register input on the register fi le in Figure 4.24. 4.6.1 [10] Let us assume that processor testing is done by filling the PC, registers, and data and instruction memories with some values (you can choose which values), letting a single instruction execute, then reading the PC, memories, and registers. These values are then examined to determine if a particular fault is present. Can you design a test (values for PC, memories, and registers) that would determine if there is a stuck-at-0 fault on this signal? 4.6.2 [10] Repeat 4.6.1 for a stuck-at-1 fault. Can you use a single test for both stuck-at-0 and stuck-at-1? If yes, explain how; if no, explain why not. 4.6.3 [60] If we know that the processor has a stuck-at-1 fault on this signal, is the processor still usable? To be usable, we must be able to convert any program that executes on a normal MIPS processor into a program that works on this processor. You can assume that there is enough free instruction memory and data memory to let you make the program longer and store additional data. Hint: the processor is usable if every instruction “broken” by this fault can be replaced with a sequence of “working” instructions that achieve the same effect. 4.6.4 [10] Repeat 4.6.1, but now the fault to test for is whether the “MemRead” control signal becomes 0 if RegDst control signal is 0, no fault otherwise. 4.6.5 [10] Repeat 4.6.4, but now the fault to test for is whether the “Jump” control signal becomes 0 if RegDst control signal is 0, no fault otherwise.

This exercise is intended to help you understand the cost/complexity/performance trade-offs of forwarding in a pipelined processor. Problems in this exercise refer to pipelined data paths from Figure 4.45. These problems assume that, of all the instructions executed in a processor, the following fraction of these instructions have a particular type of RAW data dependence. The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both). We assume that the register write is done in the first half of the clock cycle and that register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependencies are not counted because they cannot result in data hazards. Also, assume that the CPI of the processor is 1 if there are no data hazards.

Ex to 1st only

MEM to 1st only

EX to 2nd only

MEM to 2nd only

EX to 1st and MEM to 2nd

Other RAW Dependences

5%

20%

5%

10%

10%

10%

Assume the following latencies for individual pipeline stages. For the EX stage, latencies are given separately for a processor without forwarding and for a processor with different kinds of forwarding.

IF

ID

EX(no FW)

EX (full FW)

EX(FW from EX/MEM only)

Ex(FW from MEM/WB only)

MEM

WB

150ps

100ps

120ps

150ps

140ps

130ps

120ps

100ps

4.12.1 If we use no forwarding, what fraction of cycles are we stalling due to data hazards?

4.12.2 If we use full forwarding (forward all results that can be forwarded), what fraction of cycles are we staling due to data hazards?

4.12.3 Let us assume that we cannot afford to have three-input Muxes that are needed for full forwarding. We have to decide if it is better to forward only from the EX/MEM pipeline register (next-cycle forwarding) or only from the MEM/WB pipeline register (two-cycle forwarding). Which of the two options results in fewer data stall cycles?

4.12.4 For the given hazard probabilities and pipeline stage latencies, what is the speedup achieved by adding full forwarding to a pipeline that had no forwarding?

4.12.5 What would be the additional speedup (relative to a processor with forwarding) if we added time-travel forwarding that eliminates all data hazards? Assume that the yet-to-be-invented time-travel circuitry adds 100 ps to the latency of the full-forwarding EX stage.

4.12.6 Repeat 4.12.3 but this time determine which of the two options results in a shorter time per instruction.

Consider the following loop.

loop:lw r1,0(r1)

and r1,r1,r2

lw r1,0(r1)

lw r1,0(r1)

beq r1,r0,loop

Assume that perfect branch prediction is used (no stalls due to control hazards), that there are no delay slots, and that the pipeline has full forwarding support. Also, assume that many iterations of this loop are executed before the loop exits.

4.11.1 Show a pipeline execution diagram for the third iteration of this loop, from the cycle in which we fetch the first instruction of that iteration up to(but not including) the cycle in which we can fetch the first instruction of the next iteration. Show all instructions that are in the pipeline during these cycles (not just those from the third iteration).

4.11.2 How often (as a percentage of all cycles) do we have a cycle in which all five pipeline stages are doing useful work?

When processor designers consider a possible improvement to the processor datapath, the decision usually depends on the cost/performance trade-off . In the following three problems, assume that we are starting with a datapath from Figure 4.3, where I-Mem, Add, Mux, ALU, Regs, D-Mem, and Control blocks have latencies of 400 ps, 100 ps, 30 ps, 120 ps, 200 ps, 350 ps, and 100 ps, respectively, and costs of 1000, 30, 10, 100, 200, 2000, and 500, respectively. Consider the addition of a multiplier to the ALU. This addition will add 300 ps to the latency of the ALU and will add a cost of 600 to the ALU. The result will be 5% fewer instructions executed since we will no longer need to emulate the MUL instruction. 4.3.1 [10] What is the clock cycle time with and without this improvement? 4.3.2 [10] What is the speedup achieved by adding this improvement? 4.3.3 [10] Compare the cost/performance ratio with and without this improvement.

See all solutions

Recommended explanations on Computer Science Textbooks

View all explanations

What do you think about this solution?

We value your feedback to improve our textbook solutions.

Study anywhere. Anytime. Across all devices.

Sign-up for free