
In this exercise, we examine how resource hazards, control hazards, and Instruction Set Architecture (ISA) design can affect pipelined execution. Problems in this exercise refer to the following fragment of MIPS code:

sw r16,12(r6)
lw r16,8(r6)
beq r5,r4,Label  # Assume r5!=r4
add r5,r1,r4
slt r5,r15,r4

Assume that individual pipeline stages have the following latencies:

Stage     IF       ID       EX       MEM      WB
Latency   200 ps   120 ps   150 ps   190 ps   100 ps

4.10.1 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we only have one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the 5-stage pipeline that only has one memory? We have seen that data hazards can be eliminated by adding NOPs to the code. Can you do the same with this structural hazard? Why?

4.10.2 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we change load/store instructions to use a register (without an offset) as the address, these instructions no longer need to use the ALU. As a result, MEM and EX stages can be overlapped and the pipeline has only 4 stages. Change this code to accommodate this changed ISA. Assuming this change does not affect clock cycle time, what speedup is achieved in this instruction sequence?

4.10.3 Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage?

4.10.4 Given these pipeline stage latencies, repeat the speedup calculation from 4.10.2, but take into account the (possible) change in clock cycle time. When EX and MEM are done in a single stage, most of their work can be done in parallel. As a result, the resulting EX/MEM stage has a latency that is the larger of the original two, plus 20 ps needed for the work that could not be done in parallel.

4.10.5 Given these pipeline stage latencies, repeat the speedup calculation from 4.10.3, taking into account the (possible) change in clock cycle time. Assume that the latency of the ID stage increases by 50% and the latency of the EX stage decreases by 10 ps when branch outcome resolution is moved from EX to ID.

4.10.6 Assuming stall-on-branch and no delay slots, what is the new clock cycle time and execution time of this instruction sequence if beq address computation is moved to the MEM stage? What is the speedup from this change? Assume that the latency of the EX stage is reduced by 20 ps and the latency of the MEM stage is unchanged when branch outcome resolution is moved from EX to MEM.

Short Answer


4.10.1 The total execution time of this instruction sequence on the 5-stage pipeline that has only one memory is 11 cycles.

No, the structural hazard cannot be eliminated by adding NOPs, because NOPs are fetched like any other instructions.

4.10.2 Assuming that the given change does not affect the clock cycle time, the speedup achieved is 1.13.

4.10.3 The speedup achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage, is 1.10.

4.10.4 The speedup achieved under the given conditions is 1.07.

4.10.5 Assuming that the latency of the ID stage increases by 50% and the latency of the EX stage decreases by 10 ps when branch outcome resolution is moved from EX to ID, the speedup achieved is 1.10.

4.10.6 Assuming that the latency of the EX stage is reduced by 20 ps and the latency of the MEM stage is unchanged when branch outcome resolution is moved from EX to MEM, the speedup achieved by the change is 0.92.

Step by step solution

01

Determine the 5-staged pipeline and the hazards.

The 5-stage pipeline has five steps:

  1. The instruction will be fetched from the memory.
  2. The fetched instruction will be decoded by reading the registers.
  3. The decoded instruction will be executed with the values read from the registers.
  4. The operands in the data memory will be accessed.
  5. The result will be written back into a register.

Hazards:

A hazard is a situation in pipelining in which the next instruction cannot be executed in the next clock cycle.

A structural hazard occurs when the hardware cannot support the combination of instructions that the pipeline wants to execute in the same clock cycle.

A data hazard occurs when the data needed to execute an instruction is not yet available in the required clock cycle.

A control hazard occurs when the wrong instruction is fetched because the branch decision is not yet known to the pipeline.

02

Determine the total execution time.

4.10.1 It is given that all branches are perfectly predicted. If there is only one memory (for both instructions and data), a structural hazard occurs whenever an instruction must be fetched in the same cycle in which another instruction accesses data.

This hazard must always be resolved in favor of the instruction that accesses data, so the instruction fetch is stalled.

The pipelined execution of the given instructions is shown below:

Cycle            1    2    3    4    5    6    7    8    9    10   11
sw r16,12(r6)    IF   ID   EX   MEM  WB
lw r16,8(r6)          IF   ID   EX   MEM  WB
beq r5,r4,Label            IF   ID   EX   MEM  WB
add r5,r1,r4                    **   **   IF   ID   EX   MEM  WB
slt r5,r15,r4                                  IF   ID   EX   MEM  WB

Therefore, the total execution time is 11 cycles.

In the pipeline diagram above, ** marks a stall cycle in which an instruction cannot be fetched because a load or store instruction accesses the single memory in that same cycle. (The beq occupies the MEM stage in cycle 6 but does not access data memory, so it does not conflict with the fetch of add.)

NOPs cannot be added to eliminate this structural hazard, because NOP instructions are fetched from the same memory like any other instructions, so they would suffer the same fetch conflict.
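As an illustrative cross-check (not part of the textbook solution), the 11-cycle result can be reproduced with a short Python sketch. It assumes a 5-stage pipeline in which an instruction's MEM stage occurs three cycles after its IF, and it delays a fetch whenever an earlier load or store occupies the single memory in that cycle.

```python
# Hypothetical sketch: reproduce the 4.10.1 cycle count for this fragment with a
# single instruction/data memory. Second tuple element = True if the instruction
# accesses data memory in its MEM stage.
instructions = [
    ("sw r16,12(r6)",   True),
    ("lw r16,8(r6)",    True),
    ("beq r5,r4,Label", False),
    ("add r5,r1,r4",    False),
    ("slt r5,r15,r4",   False),
]

fetch_cycle = []   # IF cycle assigned to each instruction, in program order
next_fetch = 1
for _ in instructions:
    candidate = next_fetch
    # Stall the fetch while an earlier data access uses the memory in that cycle.
    # (MEM is IF + 3 here; none of the memory instructions are themselves stalled
    # in this fragment, so this simple offset is sufficient.)
    while any(uses_mem and fc + 3 == candidate
              for (_name, uses_mem), fc in zip(instructions, fetch_cycle)):
        candidate += 1
    fetch_cycle.append(candidate)
    next_fetch = candidate + 1

total_cycles = fetch_cycle[-1] + 4   # IF plus the last instruction's remaining 4 stages
print(fetch_cycle)    # [1, 2, 3, 6, 7]
print(total_cycles)   # 11
```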

03

Determine the speedup achieved by the given condition.

4.10.2 All branches in the code are perfectly predicted, so there are no control hazards.

If load/store instructions are changed to use a register (without an offset) as the address, they no longer need the ALU. The MEM and EX stages can then be overlapped, and the pipeline has only 4 stages.

Assuming that this change does not affect the clock cycle time, the speedup achieved by this instruction sequence is as follows:

Instructions executed = 5
Cycles with 5 stages = (5 - 1) + 5 = 9
Cycles with 4 stages = (5 - 1) + 4 = 8
Speedup is the ratio of the 5-stage cycle count to the 4-stage cycle count:
Speedup = 9 / 8 = 1.13
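A minimal, hypothetical Python check of this arithmetic (assuming the usual formula: the first instruction occupies one cycle per stage and each later instruction adds one cycle):

```python
# Hypothetical check of the 4.10.2 speedup, assuming an unchanged clock cycle time.
def total_cycles(n_instructions, n_stages):
    # First instruction takes n_stages cycles; each additional instruction adds 1.
    return n_stages + (n_instructions - 1)

cycles_5_stage = total_cycles(5, 5)   # 9
cycles_4_stage = total_cycles(5, 4)   # 8
print(cycles_5_stage / cycles_4_stage)   # 1.125, i.e. about 1.13
```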

04

Determine the speedup achieved by the given condition.

4.10.3 This question assumes stall-on-branch and no delay slots.

The speedup achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage, can be calculated as follows:

The stall delays the next instruction fetch until the branch outcome is known. When the branch is resolved in the EX stage, each branch causes two stall cycles; when it is resolved in the ID stage, each branch causes one stall cycle.

Instructions executed = 5
Branches executed = 1
Branch resolved in EX stage: Cycles = (5 - 1) + 5 + 1 × 2 = 11
Branch resolved in ID stage: Cycles = (5 - 1) + 5 + 1 × 1 = 10

Speedup is the ratio of the EX-stage cycle count to the ID-stage cycle count:

Speedup = 11 / 10 = 1.10
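The same formula, extended with branch stall cycles, can be checked with a short hypothetical snippet (the helper total_cycles is purely illustrative):

```python
# Hypothetical check of the 4.10.3 speedup under stall-on-branch.
def total_cycles(n_instructions, n_stages, n_branches, stalls_per_branch):
    return (n_instructions - 1) + n_stages + n_branches * stalls_per_branch

cycles_branch_in_ex = total_cycles(5, 5, 1, 2)   # 11 (two stall cycles per branch)
cycles_branch_in_id = total_cycles(5, 5, 1, 1)   # 10 (one stall cycle per branch)
print(cycles_branch_in_ex / cycles_branch_in_id)   # 1.1
```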

05

Determine the speedup achieved by the given condition.

4.10.4

The pipeline stage latencies are given; repeat the speedup calculation from 4.10.2, but this time take the (possible) change in clock cycle time into account. The combined EX/MEM stage has a latency equal to the larger of the original two, plus 20 ps for the work that could not be done in parallel.

The clock cycle time is equal to the latency of the longest-latency stage. After combining EX and MEM, the EX/MEM stage becomes the longest-latency stage.

From the given latencies:

Stage     IF       ID       EX       MEM      WB
Latency   200 ps   120 ps   150 ps   190 ps   100 ps

Clock cycle with 5 stages = 200 ps
Clock cycle with 4 stages = max(150 ps, 190 ps) + 20 ps = 210 ps

The speedup is as follows:

Speedup = (9 × 200 ps) / (8 × 210 ps) = 1800 / 1680 = 1.07
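A brief hypothetical check, assuming the clock cycle time equals the latency of the slowest stage:

```python
# Hypothetical check of the 4.10.4 speedup with the combined EX/MEM stage.
latency = {"IF": 200, "ID": 120, "EX": 150, "MEM": 190, "WB": 100}   # ps

cycle_5_stage = max(latency.values())                                     # 200 ps (IF)
ex_mem = max(latency["EX"], latency["MEM"]) + 20                          # 210 ps
cycle_4_stage = max(latency["IF"], latency["ID"], ex_mem, latency["WB"])  # 210 ps

speedup = (9 * cycle_5_stage) / (8 * cycle_4_stage)   # 1800 / 1680
print(round(speedup, 2))   # 1.07
```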

06

Determine the speedup achieved by the given condition.

4.10.5

Repeat the speedup calculation of 4.10.3, considering the change in clock cycle time.

Also, assume that the latency of the ID stage increases by 50% and the latency of the EX stage decreases by 10 ps when branch outcome resolution is moved from EX to ID.

The speedup achieved is:

New ID latency = 120 ps × 1.5 = 180 ps
New EX latency = 150 ps - 10 ps = 140 ps
New cycle time = 200 ps (IF is still the longest-latency stage)
Old cycle time = 200 ps
Speedup = (11 × 200 ps) / (10 × 200 ps) = 1.10
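And a matching hypothetical check for 4.10.5, assuming IF still sets the clock cycle after the ID and EX latencies change:

```python
# Hypothetical check of the 4.10.5 speedup when the branch resolves in ID.
old_latency = {"IF": 200, "ID": 120, "EX": 150, "MEM": 190, "WB": 100}   # ps
new_latency = dict(old_latency, ID=120 * 1.5, EX=150 - 10)               # ID=180, EX=140

old_cycle = max(old_latency.values())   # 200 ps
new_cycle = max(new_latency.values())   # still 200 ps (IF)

# 11 cycles with the branch resolved in EX, 10 with it resolved in ID (see 4.10.3).
print((11 * old_cycle) / (10 * new_cycle))   # 1.1
```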

07

Determine the speedup achieved by the given condition

4.10.6 Assumptions: stall-on-branch and no delay slots. If the beq address computation is moved to the MEM stage, the new clock cycle time and execution time are calculated as follows.

The clock cycle time remains unchanged: the 20 ps reduction in EX latency has no effect on the clock cycle time because EX is not the longest-latency stage.

For each branch, the change adds one additional stall cycle, so the number of cycles increases and the speedup is below 1.

Cycles with branch in EX = 4 + 5 + 1 × 2 = 11
Execution time (branch in EX) = 11 × 200 ps = 2200 ps
Cycles with branch in MEM = 4 + 5 + 1 × 3 = 12
Execution time (branch in MEM) = 12 × 200 ps = 2400 ps

Therefore, the speedup is 2200 / 2400 = 0.92.
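Finally, a hypothetical check of the 4.10.6 numbers (cycle time, execution times, and speedup):

```python
# Hypothetical check of 4.10.6: branch resolution moves from EX to MEM.
latency = {"IF": 200, "ID": 120, "EX": 150 - 20, "MEM": 190, "WB": 100}   # ps
cycle_time = max(latency.values())            # 200 ps, unchanged (IF still longest)

cycles_branch_in_ex  = (5 - 1) + 5 + 1 * 2    # 11 cycles, 2 stalls per branch
cycles_branch_in_mem = (5 - 1) + 5 + 1 * 3    # 12 cycles, 3 stalls per branch

time_branch_in_ex  = cycles_branch_in_ex * 200          # 2200 ps (old design)
time_branch_in_mem = cycles_branch_in_mem * cycle_time  # 2400 ps (new design)
print(cycle_time, time_branch_in_mem, round(time_branch_in_ex / time_branch_in_mem, 2))
# 200 2400 0.92
```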


