Warning: foreach() argument must be of type array|object, bool given in /var/www/html/web/app/themes/studypress-core-theme/template-parts/header/mobile-offcanvas.php on line 20

The importance of having a good branch predictor depends on how often conditional branches are executed. Together with branch predictor accuracy, this will determine how much time is spent stalling due to mispredicted branches. In this exercise, assume that the breakdown of dynamic instructions into various instruction categories is as follows:

R-Type

BEQ

JMP

LW

SW

40%

25%

5%

25%

5%

Also, assume the following branch predictor accuracies:

Always-Taken

Always-Not-Taken

2-Bit

45%

55%

85%

4.15.1 [10] Stall cycles due to mispredicted branches increase the CPI. What is the extra CPI due to mispredicted branches with the always-taken predictor? Assume that branch outcomes are determined in the EX stage, that there are no data hazards, and that no delay slots are used.

4.15.2 [10] Repeat 4.15.1 for the “always-not-taken” predictor.

4.15.3 [10] Repeat 4.15.1 for for the 2-bit predictor.

4.15.4 [10] With the 2-bit predictor, what speedup would be achieved if we could convert half of the branch instructions in a way that replaces a branch instruction with an ALU instruction? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced. 4.17 Exercises 367 4.15.5 [10] With the 2-bit predictor, what speedup would be achieved if we could convert half of the branch instructions in a way that replaced each branch instruction with two ALU instructions? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced.

4.15.6 [10] Some branch instructions are much more predictable than others. If we know that 80% of all executed branch instructions are easy-to-predict loop-back branches that are always predicted correctly, what is the accuracy of the 2-bit predictor on the remaining 20% of the branch instructions?

Short Answer

Expert verified

4.15.1

is 0.41 for the predictor of “always-taken”.

4.15.2

is 0.34 for the predictor of “always-not-taken”.

4.15.3

is 0.113 for the predictor of “2-bit”.

4.15.4

The speed up with the 2-bit predictor will be 1.054.

4.15.5

The speed up with the 2-bit predictor will be 0.94.

4.15.6

The accuracy will be 25%.

Step by step solution

01

Define the concept.

4.15.1

3 stall cycles are caused for each not correctly predicted branch that are predicted by the always-taken predictor.

TheBranch predictor “always-taken”has the accuracy of 45%.

4.15.2

3 stall cycles are caused for each not correctly predicted branch that are predicted by the “always-not-taken” predictor.

TheBranch predictor “always-not-taken”has the accuracy of 55%.

4.15.3

3 stall cycles are caused for each not correctly predicted branch that are predicted by the “2-bit” predictor.

TheBranch predictor “2-bit”has the accuracy of 85%.

4.15.4

CPI of Predicted branches in a correct way was 1.

Now, these are converted into ALU instruction, the CPI of these are also 1.

CPI of converted ALU instructions from the predicted instructions in the incorrect way was also 1.

Speed-up=CPIAtinitialstageCPIAftertheconversion

4.15.5

After converting the branch instruction, these instruction consumes some additional time for execution.

Speed-up=CPIAtinitialstageCPIAftertheconversion

4.15.6

The accuracy will be on the non-loop back branches =B×0.05B×0.20

Where,B×0.85 is for the prediction in a correct way,

B×0.05is for the predicted non-loop back in a correct way,

Easily the percentage of the predictable loop-back branches with respect to all executed branch instructions is 80%.

And the remaining branch instruction is 20%.

02

Determine the calculation.

4.15.1

Given that,

Instruction

The break-down

R-Type

40%

BEQ

25%

JMP

5%

LW

25%

SW

5%

Also given that,

Branch predictor

Accuracy

Always-Taken

45%

Always-Not-Taken

55%

2-Bit

85%

3 stall cycles are caused for each not correctly predicted branch that are predicted by the “always-taken” predictor.

TheadditionalCPI=3×1-0.45×0.25=3×0.55×0.25=0.4125=0.41

4.15.2

Given that,

Instruction

The break-down

R-Type

40%

BEQ

25%

JMP

5%

LW

25%

SW

5%

Also given that,

Branch predictor

Accuracy

Always-Taken

45%

Always-Not-Taken

55%

2-Bit

85%

3 stall cycles are caused for each not correctly predicted branch that are predicted by the “always-not-taken” predictor.

role="math" localid="1655195714378" TheadditionalCPI=3×1-0.55×0.25=3×0.45×0.25=0.3375=0.34

4.15.3

Given that,

Instruction

The break-down

R-Type

40%

BEQ

25%

JMP

5%

LW

25%

SW

5%

Also given that,

Branch predictor

Accuracy

Always-Taken

45%

Always-Not-Taken

55%

2-Bit

85%

3 stall cycles are caused for each not correctly predicted branch that are predicted by the “2-bit” predictor.

TheadditionalCPI=3×1-0.85×0.25=3×0.15×0.25=0.1125=0.113

4.15.4

Let’s consider whether the predicted instruction in a correct way or not has a no different chance of being interchanged.

If the half branch is replaced by the ALU instructions.

CPIAtinitialstage=1+3×1-0.85×0.25=1+3×0.15×0.25=1+0.1125=1.1125=1.113

CPIAftertheconversion=1+3×1-0.85×0.25×0.5=1+3×0.15×0.25×0.5=1+0.05625=1.056

Then the speed up with the 2-bit predictor will be 1.1131.056=1.054

4.15.5

Let’s consider whether the predicted instruction in a correct way or not has a no different chance of being interchanged.

If the half branch is converted in such a way that each branch instruction is interchanged by the two ALU instructions.

CPIAtinitialstage=1+3×1-0.85×0.25=1+3×0.15×0.25=1+0.1125=1.1125=1.113

CPIAftertheconversion=1+1+3×1-0.85×0.25×0.5=1+1+3×0.15×0.25×0.5=1+1+0.45×0.25×0.5=1+0.18125=1.181

Then the speed up with the 2-bit predictor = 1.1131.181=0.94

4.15.6

Given information: The percentage of the easily predictable loop-back branches with respect to all executed branch instructions is 80%.

For the prediction in a correct way,

B×0.85

For the predicted non-loop back in a correct way,

B×0.05

Hence, the accuracy will be on the non-loop back branches =

B×0.05B×0.20=0.25

So, the percentage of the accuracy will be on the non-loop back branches 25%.

Unlock Step-by-Step Solutions & Ace Your Exams!

  • Full Textbook Solutions

    Get detailed explanations and key concepts

  • Unlimited Al creation

    Al flashcards, explanations, exams and more...

  • Ads-free access

    To over 500 millions flashcards

  • Money-back guarantee

    We refund you if you fail your exam.

Over 30 million students worldwide already upgrade their learning with Vaia!

One App. One Place for Learning.

All the tools & learning materials you need for study success - in one app.

Get started for free

Most popular questions from this chapter

This exercise is intended to help you understand the relationship between delay slots, control hazards, and branch execution in a pipelined processor. In this exercise, we assume that the following MIPS code is executed on a pipelined processor with a 5-stage pipeline, full forwarding, and a predict-taken branch predictor:

lw r2,0(r1)

label1: beq r2,r0,label2 # not taken once, then taken

lw r3,0(r2) beq r3,r0,label1 # taken

add r1,r3,r1

label2: sw r1,0(r2)

4.14.1 [10] Draw the pipeline execution diagram for this code, assuming there are no delay slots and that branches execute in the EX stage. 4.14.2 [10] Repeat 4.14.1, but assume that delay slots are used. In the given code, the instruction that follows the branch is now the delay slot instruction for that branch.

4.14.3 [20] One way to move the branch resolution one stage earlier is to not need an ALU operation in conditional branches. The branch instructions would be “bez rd,label” and “bnez rd,label”, and it would branch if the register has and does not have a zero value, respectively. Change this code to use these branch instructions instead of beq. You can assume that register R8 is available for you to use as a temporary register, and that an seq (set if equal) R-type instruction can be used. 366 Chapter 4 The Processor Section 4.8 describes how the severity of control hazards can be reduced by moving branch execution into the ID stage. This approach involves a dedicated comparator in the ID stage, as shown in Figure 4.62. However, this approach potentially adds to the latency of the ID stage, and requires additional forwarding logic and hazard detection.

4.14.4 [10] Using the first branch instruction in the given code as an example, describe the hazard detection logic needed to support branch execution in the ID stage as in Figure 4.62. Which type of hazard is this new logic supposed to detect?

4.14.5 [10] For the given code, what is the speedup achieved by moving branch execution into the ID stage? Explain your answer. In your speedup calculation, assume that the additional comparison in the ID stage does not affect clock cycle time. 4.14.6 [10] Using the first branch instruction in the given code as an example, describe the forwarding support that must be added to support branch execution in the ID stage. Compare the complexity of this new forwarding unit to the complexity of the existing forwarding unit in Figure 4.62.

This exercise explores how exception handling affects pipeline design. The first three problems in this exercise refer to the following two instructions:

Instruction 1

Instruction 2

BNE R1,R2, Label

LW R1,0(R1)

4.17.1 Which exceptions can each of these instructions trigger? For each of these exceptions, specify the pipeline stage in which it is detected.

4.17.2 If there is a separate handler address for each exception, show how the pipeline organization must be changed to be able to handle this exception. You can assume that the addresses of these handlers are known when the processor is designed.

4.17.3 If the second instruction is fetched right after the first instruction, describe what happens in the pipeline when the first instruction causes the first exception you listed in 4.17.1. Show the pipeline execution diagram from the time the first instruction is fetched until the time the first instruction of the exception handler is completed.

4.17.4 In vectored exception handling, the table of exception handler

addresses is in data memory at a known (fixed) address. Change the pipeline to implement this exception handling mechanism. Repeat 4.17.3 using this modified pipeline and vectored exception handling.

4.17.5 We want to emulate vectored exception handling (described in 4.17.4) on a machine that has only one fixed handler address. Write the code that should be at that fixed address. Hint: this code should identify the exception, get the right address from the exception vector table, and transfer execution to that handler.

Question: Problems in this exercise assume that logic blocks needed to implement a processor’s datapath have the following latencies: I-Mem Add Mux ALU Regs D-Mem Sign-Extend Shift-Left-2 200ps 70ps 20ps 90ps 90ps 250ps 15ps 10ps 4.4.1 [10] If the only thing we need to do in a processor is fetch consecutive instructions (Figure 4.6), what would the cycle time be? 4.4.2 [10] Consider a datapath similar to the one in Figure 4.11, but for a processor that only has one type of instruction: unconditional PC-relative branch. What would the cycle time be for this datapath? 4.4.3 [10] Repeat 4.4.2, but this time we need to support only conditional PC-relative branches. The remaining three problems in this exercise refer to the datapath element Shift - left -2: 4.4.4 [10] Which kinds of instructions require this resource? 4.4.5 [20] For which kinds of instructions (if any) is this resource on the critical path? 4.4.6 [10] Assuming that we only support beq and add instructions, discuss how changes in the given latency of this resource affect the cycle time of the processor. Assume that the latencies of other resources do not change.

In this exercise, we examine how data dependences affect execution in the basic 5-stage pipeline described in section 4.5. Problems in this exercise refer to the following sequence of instructions:

or r1,r2,r3

or r2,r1,r4

or r1,r1,r2

Also, assume the following cycle times for each of the options related to forwarding:

Without Forwarding

With Full Forwarding

With ALU-ALU Forwarding Only

250ps

300ps

290ps

4.9.1 Indicate dependences and their type.

4.9.2 Assume there is no forwarding in this pipelined processor. Indicate hazards and add nop instructions to eliminate them.

4.9.3 Assume there is full forwarding. Indicate hazards and addNOPinstructions to eliminate them.

4.9.4 What is the total execution time of this instruction sequence

without forwarding and with full forwarding? What is the speedup achieved by adding full forwarding to a pipeline that had no forwarding?

4.9.5 Addnopinstructions to this code to eliminate hazards if there is ALU-ALU forwarding only (no forwarding from the MEM to the EX stage).

4.9.6 What is the total execution time of this instruction sequence with only ALU-ALU forwarding? What is the speedup over a no-forwarding pipeline?

This exercise is intended to help you understand the cost/complexity/performance trade-offs of forwarding in a pipelined processor. Problems in this exercise refer to pipelined data paths from Figure 4.45. These problems assume that, of all the instructions executed in a processor, the following fraction of these instructions have a particular type of RAW data dependence. The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both). We assume that the register write is done in the first half of the clock cycle and that register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependencies are not counted because they cannot result in data hazards. Also, assume that the CPI of the processor is 1 if there are no data hazards.

Ex to 1st only

MEM to 1st only

EX to 2nd only

MEM to 2nd only

EX to 1st and MEM to 2nd

Other RAW Dependences

5%

20%

5%

10%

10%

10%

Assume the following latencies for individual pipeline stages. For the EX stage, latencies are given separately for a processor without forwarding and for a processor with different kinds of forwarding.

IF

ID

EX(no FW)

EX (full FW)

EX(FW from EX/MEM only)

Ex(FW from MEM/WB only)

MEM

WB

150ps

100ps

120ps

150ps

140ps

130ps

120ps

100ps

4.12.1 If we use no forwarding, what fraction of cycles are we stalling due to data hazards?

4.12.2 If we use full forwarding (forward all results that can be forwarded), what fraction of cycles are we staling due to data hazards?

4.12.3 Let us assume that we cannot afford to have three-input Muxes that are needed for full forwarding. We have to decide if it is better to forward only from the EX/MEM pipeline register (next-cycle forwarding) or only from the MEM/WB pipeline register (two-cycle forwarding). Which of the two options results in fewer data stall cycles?

4.12.4 For the given hazard probabilities and pipeline stage latencies, what is the speedup achieved by adding full forwarding to a pipeline that had no forwarding?

4.12.5 What would be the additional speedup (relative to a processor with forwarding) if we added time-travel forwarding that eliminates all data hazards? Assume that the yet-to-be-invented time-travel circuitry adds 100 ps to the latency of the full-forwarding EX stage.

4.12.6 Repeat 4.12.3 but this time determine which of the two options results in a shorter time per instruction.

See all solutions

Recommended explanations on Computer Science Textbooks

View all explanations

What do you think about this solution?

We value your feedback to improve our textbook solutions.

Study anywhere. Anytime. Across all devices.

Sign-up for free