
In this exercise, we examine how pipelining affects the clock cycle time of the processor. Problems in this exercise assume that individual stages of the datapath have the following latencies:

IF       ID       EX       MEM      WB
250 ps   350 ps   150 ps   300 ps   200 ps

Also, assume that instructions executed by the processor are broken down as follows:

alu    beq    lw     sw
45%    20%    20%    15%

4.8.1 [5] What is the clock cycle time in a pipelined and non-pipelined processor?

4.8.2 [10] What is the total latency of an LW instruction in a pipelined and non-pipelined processor?

4.8.3 [10] If we can split one stage of the pipelined datapath into two new stages, each with half the latency of the original stage, which stage would you split and what is the new clock cycle time of the processor?

4.8.4 [10] Assuming there are no stalls or hazards, what is the utilization of the data memory?

4.8.5 [10] Assuming there are no stalls or hazards, what is the utilization of the write-register port of the “Registers” unit?

4.8.6 [30] Instead of a single-cycle organization, we can use a multi-cycle organization where each instruction takes multiple cycles but one instruction finishes before another is fetched. In this organization, an instruction only goes through stages it actually needs (e.g., ST only takes 4 cycles because it does not need the WB stage). Compare clock cycle times and execution times with single-cycle, multi-cycle, and pipelined organizations.

Short Answer


4.8.1

350 ps is the required clock cycle time in a pipelined processor.

1250 ps is the required clock cycle time in a non-pipelined processor.

4.8.2

The total latency of an LW instruction in a pipelined processor is 1750 ps.

The total latency of an LW instruction in a non-pipelined processor is 1250 ps.

4.8.3

The ID stage should be split; the new clock cycle time is 300 ps.

4.8.4

There will be 35% utilization of the data memory for the given condition.

4.8.5

There will be 65% utilization of the write-register port for the given condition.

4.8.6

The multi-cycle organization averages 4.20 cycles per instruction.

The single-cycle organization takes 1250 ps per instruction, equivalent to 3.57 of the 350 ps clock cycles.

Step by step solution

01

Define the concept.

4.8.1

For the pipelined processor,

The clock cycle time is the latency of a single pipeline stage, so it is set by the slowest (longest-latency) stage.

The clock cycle time in a pipelined processor is not computed as the sum of the latencies of all the stages.

For the non-pipelined processor,

The clock cycle time in a non-pipelined processor is computed as the sum of the latencies of all the stages.
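To make both definitions concrete, here is a minimal Python sketch (not part of the original solution) that computes the two clock cycle times from the given stage latencies; the dictionary name and layout are my own:

# Stage latencies from the exercise, in picoseconds.
STAGE_LATENCY_PS = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}

# Pipelined: the clock must accommodate the slowest single stage.
pipelined_cycle_ps = max(STAGE_LATENCY_PS.values())

# Non-pipelined: one instruction per cycle, so the clock must cover all stages in sequence.
non_pipelined_cycle_ps = sum(STAGE_LATENCY_PS.values())

print(pipelined_cycle_ps)      # 350
print(non_pipelined_cycle_ps)  # 1250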

4.8.2

Given that,

Individual stages of the datapath    Latency
IF                                   250 ps
ID                                   350 ps
EX                                   150 ps
MEM                                  300 ps
WB                                   200 ps

Also given that,

Instruction    Breakdown
lw             20%

For the pipelined processor,

The clock cycle time is 350 ps (the latency of the slowest stage, ID).

An LW instruction passes through all 5 stages, so its latency is (350 × 5) ps = 1750 ps.

The total latency of the LW instruction in a pipelined processor is 1750 ps.

For the non-pipelined processor,

The latency is the sum of the stage latencies: (250 + 350 + 150 + 300 + 200) ps = 1250 ps.

The total latency of the LW instruction in a non-pipelined processor is 1250 ps.
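As a cross-check, the same stage-latency dictionary reproduces both LW latency figures; this is only an illustrative sketch under the exercise's 5-stage assumption:

STAGE_LATENCY_PS = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}

cycle_ps = max(STAGE_LATENCY_PS.values())        # 350 ps pipelined clock
num_stages = len(STAGE_LATENCY_PS)               # LW occupies all 5 stages

# Pipelined: LW spends one full 350 ps clock cycle in each of the 5 stages.
lw_latency_pipelined_ps = cycle_ps * num_stages  # 350 * 5 = 1750

# Non-pipelined: LW finishes in one long cycle equal to the sum of the latencies.
lw_latency_non_pipelined_ps = sum(STAGE_LATENCY_PS.values())  # 1250

print(lw_latency_pipelined_ps, lw_latency_non_pipelined_ps)   # 1750 1250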

4.8.3

Suppose one stage of the pipelined datapath is split into two new stages, each with half the latency of the original stage.

The best choice is the longest stage, ID (350 ps), which becomes two 175 ps stages. The clock cycle time is then set by the next-longest stage, MEM, so the new clock cycle time is 300 ps.
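The choice of stage to split can also be checked mechanically; the sketch below (the helper function is hypothetical, not from the solution) tries splitting each stage in turn and reports the resulting clock cycle time:

STAGE_LATENCY_PS = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}

def cycle_after_split(stage):
    """Clock cycle time (ps) if `stage` is split into two halves of equal latency."""
    remaining = {name: ps for name, ps in STAGE_LATENCY_PS.items() if name != stage}
    half = STAGE_LATENCY_PS[stage] / 2
    return max(max(remaining.values()), half)

for stage in STAGE_LATENCY_PS:
    print(stage, cycle_after_split(stage))
# Only splitting ID (350 -> 2 x 175) lowers the clock, to 300 ps (now limited by MEM);
# splitting any other stage leaves ID as the 350 ps bottleneck.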

4.8.4

The data memory is accessed only by the lw and sw instructions, so its utilization is the sum of their fractions of the instruction mix.

The instruction “lw” (load word) makes up 20% of the mix.

The instruction “sw” (store word) makes up 15% of the mix.

Hence, (20 + 15)% = 35%.
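The utilization is just the weighted fraction of instructions that access the data memory; a small Python sketch (the dictionary names are my own) makes the calculation explicit:

# Instruction mix from the exercise, as fractions of all executed instructions.
INSTRUCTION_MIX = {"alu": 0.45, "beq": 0.20, "lw": 0.20, "sw": 0.15}

# Only loads and stores access the data memory (in the MEM stage).
USES_DATA_MEMORY = {"lw", "sw"}

data_memory_utilization = sum(
    frac for instr, frac in INSTRUCTION_MIX.items() if instr in USES_DATA_MEMORY
)
print(f"{data_memory_utilization:.0%}")  # 35%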

4.8.5

The write-register port of the “Registers” unit is used by the instructions that write a result back to the register file: the alu instructions and the lw (load word) instruction. (Branches and stores do not write a register.)

The alu instructions make up 45% of the mix.

The lw instruction makes up 20% of the mix.

Hence, (45 + 20)% = 65%.
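The same pattern gives the write-register port utilization, under the assumption stated above that only alu and lw write a register; the names below are again my own:

INSTRUCTION_MIX = {"alu": 0.45, "beq": 0.20, "lw": 0.20, "sw": 0.15}

# alu results and loaded values are written back to a register; branches and stores are not.
WRITES_REGISTER = {"alu", "lw"}

write_port_utilization = sum(
    frac for instr, frac in INSTRUCTION_MIX.items() if instr in WRITES_REGISTER
)
print(f"{write_port_utilization:.0%}")  # 65%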

4.8.6

In the multi-cycle organization, the store instruction takes only 4 cycles because it does not need the WB stage; only lw goes through all 5 stages, so the calculation treats lw (20% of the mix) as a 5-cycle instruction and the remaining 80% of instructions as 4-cycle instructions.

For the multi-cycle organization,

((0.20 × 5) + (0.80 × 4)) = 1.00 + 3.20 = 4.20

The multi-cycle organization averages 4.20 cycles per instruction.

For the single-cycle organization,

1250 ps / 350 ps = 3.57

Each 1250 ps single-cycle instruction is equivalent to 3.57 of the 350 ps cycles.

02

 Determine the calculation.

4.8.1

Given that,

Individual stages of the datapath    Latency
IF                                   250 ps
ID                                   350 ps
EX                                   150 ps
MEM                                  300 ps
WB                                   200 ps

For the pipelined processor,

The clock cycle time is set by the slowest stage, ID, at 350 ps.

350 ps is the required clock cycle time in a pipelined processor.

For the non-pipelined processor,

The clock cycle time is the sum of the stage latencies: (250 + 350 + 150 + 300 + 200) ps = 1250 ps.

1250 ps is the required clock cycle time in a non-pipelined processor.

4.8.2

Given that,

Individual stages of the datapath    Latency
IF                                   250 ps
ID                                   350 ps
EX                                   150 ps
MEM                                  300 ps
WB                                   200 ps

For the pipelined processor,

The clock cycle time is 350 ps, and the LW instruction passes through all 5 stages.

So, the latency of 5 cycles is (350 × 5) ps = 1750 ps.

The total latency of the LW instruction in a pipelined processor is 1750 ps.

For the non-pipelined processor,

The latency is the sum of the stage latencies: (250 + 350 + 150 + 300 + 200) ps = 1250 ps.

The total latency of the LW instruction in a non-pipelined processor is 1250 ps.

4.8.3

Suppose one stage of the pipelined datapath is split into two new stages, each with half the latency of the original stage.

The stage to split is ID, the longest at 350 ps; it becomes two 175 ps stages, leaving MEM at 300 ps as the longest stage.

According to the given condition, the new clock cycle time is 300 ps.

4.8.4

Given that,

Instruction    Breakdown
alu            45%
beq            20%
lw             20%
sw             15%

The data memory is accessed only by the lw and sw instructions.

The instruction “lw” (load word) makes up 20% of the mix.

The instruction “sw” (store word) makes up 15% of the mix.

Hence, (20 + 15)% = 35%.

There will be 35% utilization of the data memory for the given condition.

4.8.5

Given that,

Instruction    Breakdown
alu            45%
beq            20%
lw             20%
sw             15%

The write-register port is used by the instructions that write a result back to the register file: the alu instructions and the lw (load word) instruction.

The alu instructions make up 45% of the mix.

The lw instruction makes up 20% of the mix.

Hence, (45 + 20)% = 65%.

There will be 65% utilization of the write-register port for the given condition.

4.8.6

In the multi-cycle organization, the store instruction takes only 4 cycles because it does not need the WB stage; the calculation treats lw (20% of the mix) as a 5-cycle instruction and all other instructions (80%) as 4-cycle instructions.

For the multi-cycle organization,

((0.20 × 5) + (0.80 × 4)) = 1.00 + 3.20 = 4.20 cycles per instruction on average, at a 350 ps clock cycle.

For the single-cycle organization,

1250 ps / 350 ps = 3.57, so each 1250 ps single-cycle instruction is equivalent to 3.57 of the 350 ps cycles.
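To put the three organizations side by side, the following sketch (not part of the original solution, variable names are my own) converts each to an average time per instruction in picoseconds. It follows the solution's simplification that lw takes 5 cycles and every other instruction takes 4 in the multi-cycle organization, and assumes an ideal, stall-free pipeline:

STAGE_LATENCY_PS = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}
INSTRUCTION_MIX = {"alu": 0.45, "beq": 0.20, "lw": 0.20, "sw": 0.15}
CYCLES_PER_INSTR = {"alu": 4, "beq": 4, "lw": 5, "sw": 4}   # the solution's simplification

cycle_ps = max(STAGE_LATENCY_PS.values())    # 350 ps clock for multi-cycle and pipelined

# Single-cycle: every instruction takes one long 1250 ps cycle.
single_cycle_ps = sum(STAGE_LATENCY_PS.values())

# Multi-cycle: weighted average number of 350 ps cycles per instruction.
avg_cycles = sum(INSTRUCTION_MIX[i] * CYCLES_PER_INSTR[i] for i in INSTRUCTION_MIX)
multi_cycle_ps = avg_cycles * cycle_ps

# Pipelined, with no stalls or hazards: one instruction completes every cycle.
pipelined_ps = cycle_ps

print(f"multi-cycle : {avg_cycles:.2f} cycles x {cycle_ps} ps = {multi_cycle_ps:.0f} ps per instruction")
print(f"single-cycle: {single_cycle_ps} ps per instruction = {single_cycle_ps / cycle_ps:.2f} cycles of {cycle_ps} ps")
print(f"pipelined   : {pipelined_ps} ps per instruction")

Under these assumptions the per-instruction times come out to 1250 ps (single-cycle), 1470 ps (multi-cycle), and 350 ps (pipelined).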


Most popular questions from this chapter

This exercise explores how exception handling affects pipeline design. The first three problems in this exercise refer to the following two instructions:

Instruction 1: BNE R1,R2,Label

Instruction 2: LW R1,0(R1)

4.17.1 Which exceptions can each of these instructions trigger? For each of these exceptions, specify the pipeline stage in which it is detected.

4.17.2 If there is a separate handler address for each exception, show how the pipeline organization must be changed to be able to handle this exception. You can assume that the addresses of these handlers are known when the processor is designed.

4.17.3 If the second instruction is fetched right after the first instruction, describe what happens in the pipeline when the first instruction causes the first exception you listed in 4.17.1. Show the pipeline execution diagram from the time the first instruction is fetched until the time the first instruction of the exception handler is completed.

4.17.4 In vectored exception handling, the table of exception handler

addresses is in data memory at a known (fixed) address. Change the pipeline to implement this exception handling mechanism. Repeat 4.17.3 using this modified pipeline and vectored exception handling.

4.17.5 We want to emulate vectored exception handling (described in 4.17.4) on a machine that has only one fixed handler address. Write the code that should be at that fixed address. Hint: this code should identify the exception, get the right address from the exception vector table, and transfer execution to that handler.

This exercise is intended to help you understand the relationship between forwarding, hazard detection, and ISA design. Problems in this exercise refer to the following sequence of instructions, and assume that it is executed on a 5-stage pipelined datapath:

add r5,r2,r1

lw r3,4(r5)

lw r2,0(r2)

or r3,r5,r3

sw r3,0(r5)

4.13.1 [5] If there is no forwarding or hazard detection, insert nops to ensure correct execution.

4.13.2 [10] Repeat 4.13.1 but now use nops only when a hazard cannot be avoided by changing or rearranging these instructions. You can assume register R7 can be used to hold temporary values in your modified code.

4.13.3 [10] If the processor has forwarding, but we forgot to implement the hazard detection unit, what happens when this code executes?

4.13.4 [20] If there is forwarding, for the first five cycles during the execution of this code, specify which signals are asserted in each cycle by hazard detection and forwarding units in Figure 4.60.

4.13.5 [10] If there is no forwarding, what new inputs and output signals do we need for the hazard detection unit in Figure 4.60? Using this instruction sequence as an example, explain why each signal is needed.

4.13.6 [20] For the new hazard detection unit from 4.13.5, specify which output signals it asserts in each of the first five cycles during the execution of this code.

In this exercise, we examine how resource hazards, control hazards, and Instruction Set Architecture (ISA) design can affect pipelined execution. Problems in this exercise refer to the following fragment of MIPS code:

sw r16,12(r6)

lw r16,8(r6)

beq r5,r4,Label # Assume r5!=r4

add r5,r1,r4

slt r5,r15,r4

Assume that individual pipeline stages have the following latencies:

IF       ID       EX       MEM      WB
200 ps   120 ps   150 ps   190 ps   100 ps

4.10.1 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we only have one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the 5-stage pipeline that only has one memory? We have seen that data hazards can be eliminated by adding nops to the code. Can you do the same with this structural hazard? Why?

4.10.2 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we change load/store instructions to use a register (without an offset) as the address, these instructions no longer need to use the ALU. As a result, MEM and EX stages can be overlapped and the pipeline has only 4 stages. Change this code to accommodate this changed ISA. Assuming this change does not affect clock cycle time, what speedup is achieved in this instruction sequence?

4.10.3 Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage?

4.10.4. Given these pipeline stage latencies, repeat the speedup calculation from 4.10.2, but take into account the (possible) change in clock cycle time. When EX and MEM are done in a single stage, most of their work can be done in parallel. As a result, the resulting EX/MEM stage has a latency that is the larger of the original two, plus 20 ps needed for the work that could not be done in parallel.

4.10.5 Given these pipeline stage latencies, repeat the speedup calculation from 4.10.3, taking into account the (possible) change in clock cycle time. Assume that the latency of the ID stage increases by 50% and the latency of the EX stage decreases by 10 ps when branch outcome resolution is moved from EX to ID.

4.10.6 Assuming stall-on-branch and no delay slots, what is the new clock cycle time and execution time of this instruction sequence if beq address computation is moved to the MEM stage? What is the speedup from this change? Assume that the latency of the EX stage is reduced by 20 ps and the latency of the MEM stage is unchanged when branch outcome resolution is moved from EX to MEM.

This exercise is intended to help you understand the relationship between delay slots, control hazards, and branch execution in a pipelined processor. In this exercise, we assume that the following MIPS code is executed on a pipelined processor with a 5-stage pipeline, full forwarding, and a predict-taken branch predictor:

lw r2,0(r1)

label1: beq r2,r0,label2 # not taken once, then taken

lw r3,0(r2)

beq r3,r0,label1 # taken

add r1,r3,r1

label2: sw r1,0(r2)

4.14.1 [10] Draw the pipeline execution diagram for this code, assuming there are no delay slots and that branches execute in the EX stage.

4.14.2 [10] Repeat 4.14.1, but assume that delay slots are used. In the given code, the instruction that follows the branch is now the delay slot instruction for that branch.

4.14.3 [20] One way to move the branch resolution one stage earlier is to not need an ALU operation in conditional branches. The branch instructions would be “bez rd,label” and “bnez rd,label”, and it would branch if the register has and does not have a zero value, respectively. Change this code to use these branch instructions instead of beq. You can assume that register R8 is available for you to use as a temporary register, and that an seq (set if equal) R-type instruction can be used.

Section 4.8 describes how the severity of control hazards can be reduced by moving branch execution into the ID stage. This approach involves a dedicated comparator in the ID stage, as shown in Figure 4.62. However, this approach potentially adds to the latency of the ID stage, and requires additional forwarding logic and hazard detection.

4.14.4 [10] Using the first branch instruction in the given code as an example, describe the hazard detection logic needed to support branch execution in the ID stage as in Figure 4.62. Which type of hazard is this new logic supposed to detect?

4.14.5 [10] For the given code, what is the speedup achieved by moving branch execution into the ID stage? Explain your answer. In your speedup calculation, assume that the additional comparison in the ID stage does not affect clock cycle time.

4.14.6 [10] Using the first branch instruction in the given code as an example, describe the forwarding support that must be added to support branch execution in the ID stage. Compare the complexity of this new forwarding unit to the complexity of the existing forwarding unit in Figure 4.62.

The basic single-cycle MIPS implementation in Figure 4.2 can only implement some instructions. New instructions can be added to an existing Instruction Set Architecture (ISA), but the decision whether or not to do that depends, among other things, on the cost and complexity the proposed addition introduces into the processor datapath and control. The first three problems in this exercise refer to the new instruction:

Instruction: LWI Rt,Rd(Rs)

Interpretation: Reg[Rt] = Mem[Reg[Rd]+Reg[Rs]]

4.2.1 [10] Which existing blocks (if any) can be used for this instruction?

4.2.2 [10] Which new functional blocks (if any) do we need for this instruction?

4.2.3 [10] What new signals do we need (if any) from the control unit to support this instruction?
