
In this exercise, we examine how resource hazards, control hazards, and Instruction Set Architecture (ISA) design can affect pipelined execution. Problems in this exercise refer to the following fragment of MIPS code:

sw r16,12(r6)
lw r16,8(r6)
beq r5,r4,Label  # Assume r5!=r4
add r5,r1,r4
slt r5,r15,r4

Assume that individual pipeline stages have the following latencies:

Stage     IF       ID       EX       MEM      WB
Latency   200 ps   120 ps   150 ps   190 ps   100 ps

4.10.1 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we only have one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the 5-stage pipeline that only has one memory? We have seen that data hazards can be eliminated by adding NOPs to the code. Can you do the same with this structural hazard? Why?

4.10.2 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we change load/store instructions to use a register (without an offset) as the address, these instructions no longer need to use the ALU. As a result, MEM and EX stages can be overlapped and the pipeline has only 4 stages. Change this code to accommodate this changed ISA. Assuming this change does not affect clock cycle time, what speedup is achieved in this instruction sequence?

4.10.3 Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage?

4.10.4 Given these pipeline stage latencies, repeat the speedup calculation from 4.10.2, but take into account the (possible) change in clock cycle time. When EX and MEM are done in a single stage, most of their work can be done in parallel. As a result, the resulting EX/MEM stage has a latency that is the larger of the original two, plus 20 ps needed for the work that could not be done in parallel.

4.10.5 Given these pipeline stage latencies, repeat the speedup calculation from 4.10.3, taking into account the (possible) change in clock cycle time. Assume that the latency of the ID stage increases by 50% and the latency of the EX stage decreases by 10 ps when branch outcome resolution is moved from EX to ID.

4.10.6 Assuming stall-on-branch and no delay slots, what is the new clock cycle time and execution time of this instruction sequence if beq address computation is moved to the MEM stage? What is the speedup from this change? Assume that the latency of the EX stage is reduced by 20 ps and the latency of the MEM stage is unchanged when branch outcome resolution is moved from EX to MEM.

Short Answer


4.10.1 The total execution time of this instruction sequence on the 5-stage pipeline that has only one memory is 11 cycles.

No, the structural hazard cannot be eliminated by adding NOPs, because NOPs are fetched like any other instructions.

4.10.2 Assuming that the given change does not affect the clock cycle time, the speedup achieved is 1.13.

4.10.3 The speedup achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage, is 1.10.

4.10.4 The speedup achieved under the given conditions is 1.07.

4.10.5 Assuming that the latency of the ID stage increases by 50% and the latency of the EX stage decreases by 10 ps when branch outcome resolution is moved from EX to ID, the speedup achieved is 1.10.

4.10.6 Assuming that the latency of the EX stage is reduced by 20 ps and the latency of the MEM stage is unchanged when branch outcome resolution is moved from EX to MEM, the speedup achieved by the change is 0.92.

Step by step solution

01

Determine the 5-staged pipeline and the hazards.

The 5-stage pipeline has five steps:

  1. The instruction will be fetched from the memory.
  2. The fetched instruction will be decoded by reading the registers.
  3. The decoded instruction will be executed with the values read from the registers.
  4. The operands in the data memory will be accessed.
  5. The result will be written back into a register.

Hazards:

A hazard is a situation in pipelining in which the next instruction cannot be executed in the next clock cycle.

A structural hazard occurs when the hardware cannot support the combination of instructions that the pipeline wants to execute in the same clock cycle.

A data hazard occurs when the data needed to execute an instruction is not yet available in the required clock cycle.

A control hazard occurs when the wrong instruction is fetched because the branch decision is not yet known to the pipeline.

02

Determine the total execution time.

4.10.1 It is given that all branches are perfectly predicted. If there is only one memory (for both instructions and data), a structural hazard occurs whenever an instruction must be fetched in the same cycle in which another instruction accesses data.

This hazard must always be resolved in favor of the instruction that accesses data, so the instruction fetch is stalled.

The pipelined execution of the given instructions is shown below:

Cycle            1    2    3    4    5    6    7    8    9    10   11
sw r16,12(r6)    IF   ID   EX   MEM  WB
lw r16,8(r6)          IF   ID   EX   MEM  WB
beq r5,r4,Label            IF   ID   EX   MEM  WB
add r5,r1,r4                    **   **   IF   ID   EX   MEM  WB
slt r5,r15,r4                                  IF   ID   EX   MEM  WB

Therefore, the total execution time is 11 cycles.

In the pipeline diagram above, ** marks a stall cycle in which an instruction cannot be fetched because a load or store instruction accesses the single memory in that same cycle. (The beq occupies the MEM stage in cycle 6 but does not access data memory, so it does not conflict with the fetch of add.)

NOPs cannot be added to eliminate this structural hazard, because NOP instructions are fetched from the same memory like any other instructions, so they would suffer the same fetch conflict.
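As an illustrative cross-check (not part of the textbook solution), the 11-cycle result can be reproduced with a short Python sketch. It assumes a 5-stage pipeline in which an instruction's MEM stage occurs three cycles after its IF, and it delays a fetch whenever an earlier load or store occupies the single memory in that cycle.

```python
# Hypothetical sketch: reproduce the 4.10.1 cycle count for this fragment with a
# single instruction/data memory. Second tuple element = True if the instruction
# accesses data memory in its MEM stage.
instructions = [
    ("sw r16,12(r6)",   True),
    ("lw r16,8(r6)",    True),
    ("beq r5,r4,Label", False),
    ("add r5,r1,r4",    False),
    ("slt r5,r15,r4",   False),
]

fetch_cycle = []   # IF cycle assigned to each instruction, in program order
next_fetch = 1
for _ in instructions:
    candidate = next_fetch
    # Stall the fetch while an earlier data access uses the memory in that cycle.
    # (MEM is IF + 3 here; none of the memory instructions are themselves stalled
    # in this fragment, so this simple offset is sufficient.)
    while any(uses_mem and fc + 3 == candidate
              for (_name, uses_mem), fc in zip(instructions, fetch_cycle)):
        candidate += 1
    fetch_cycle.append(candidate)
    next_fetch = candidate + 1

total_cycles = fetch_cycle[-1] + 4   # IF plus the last instruction's remaining 4 stages
print(fetch_cycle)    # [1, 2, 3, 6, 7]
print(total_cycles)   # 11
```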

03

Determine the speedup achieved by the given condition.

4.10.2 All branches in the code are perfectly predicted, so there are no control hazards.

If load/store instructions are changed to use a register (without an offset) as the address, they no longer need the ALU. The MEM and EX stages can then be overlapped, and the pipeline has only 4 stages.

Assuming that this change does not affect the clock cycle time, the speedup achieved by this instruction sequence is as follows:

Instructions executed = 5
Cycles with 5 stages = (5 - 1) + 5 = 9
Cycles with 4 stages = (5 - 1) + 4 = 8
Speedup is the ratio of the 5-stage cycle count to the 4-stage cycle count:
Speedup = 9 / 8 = 1.13
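A minimal, hypothetical Python check of this arithmetic (assuming the usual formula: the first instruction occupies one cycle per stage and each later instruction adds one cycle):

```python
# Hypothetical check of the 4.10.2 speedup, assuming an unchanged clock cycle time.
def total_cycles(n_instructions, n_stages):
    # First instruction takes n_stages cycles; each additional instruction adds 1.
    return n_stages + (n_instructions - 1)

cycles_5_stage = total_cycles(5, 5)   # 9
cycles_4_stage = total_cycles(5, 4)   # 8
print(cycles_5_stage / cycles_4_stage)   # 1.125, i.e. about 1.13
```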

04

Determine the speedup achieved by the given condition.

4.10.3 This question assumes stall-on-branch and no delay slots.

The speedup achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage, can be calculated as follows:

The stall delays the next instruction fetch until the branch outcome is known. When the branch is resolved in the EX stage, each branch causes two stall cycles; when it is resolved in the ID stage, each branch causes one stall cycle.

Instructions executed = 5
Branches executed = 1
Branch resolved in EX stage: Cycles = (5 - 1) + 5 + 1 × 2 = 11
Branch resolved in ID stage: Cycles = (5 - 1) + 5 + 1 × 1 = 10

Speedup is the ratio of the EX-stage cycle count to the ID-stage cycle count:

Speedup = 11 / 10 = 1.10
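The same formula, extended with branch stall cycles, can be checked with a short hypothetical snippet (the helper total_cycles is purely illustrative):

```python
# Hypothetical check of the 4.10.3 speedup under stall-on-branch.
def total_cycles(n_instructions, n_stages, n_branches, stalls_per_branch):
    return (n_instructions - 1) + n_stages + n_branches * stalls_per_branch

cycles_branch_in_ex = total_cycles(5, 5, 1, 2)   # 11 (two stall cycles per branch)
cycles_branch_in_id = total_cycles(5, 5, 1, 1)   # 10 (one stall cycle per branch)
print(cycles_branch_in_ex / cycles_branch_in_id)   # 1.1
```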

05

Determine the speedup achieved by the given condition.

4.10.4

The pipeline stage latencies are given; repeat the speedup calculation from 4.10.2, but this time take the (possible) change in clock cycle time into account. The combined EX/MEM stage has a latency equal to the larger of the original two, plus 20 ps for the work that could not be done in parallel.

The clock cycle time is equal to the latency of the longest-latency stage. After combining EX and MEM, the EX/MEM stage becomes the longest-latency stage.

From the given latencies:

Stage     IF       ID       EX       MEM      WB
Latency   200 ps   120 ps   150 ps   190 ps   100 ps

Clock cycle with 5 stages = 200 ps
Clock cycle with 4 stages = max(150 ps, 190 ps) + 20 ps = 210 ps

The speedup is as follows:

Speedup = (9 × 200 ps) / (8 × 210 ps) = 1800 / 1680 = 1.07
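A brief hypothetical check, assuming the clock cycle time equals the latency of the slowest stage:

```python
# Hypothetical check of the 4.10.4 speedup with the combined EX/MEM stage.
latency = {"IF": 200, "ID": 120, "EX": 150, "MEM": 190, "WB": 100}   # ps

cycle_5_stage = max(latency.values())                                     # 200 ps (IF)
ex_mem = max(latency["EX"], latency["MEM"]) + 20                          # 210 ps
cycle_4_stage = max(latency["IF"], latency["ID"], ex_mem, latency["WB"])  # 210 ps

speedup = (9 * cycle_5_stage) / (8 * cycle_4_stage)   # 1800 / 1680
print(round(speedup, 2))   # 1.07
```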

06

Determine the speedup achieved by the given condition.

4.10.5

Repeat the speedup calculation of 4.10.3, considering the change in clock cycle time.

Also, assume that the latency of the ID stage increases by 50% and the latency of the EX stage decreases by 10 ps when branch outcome resolution is moved from EX to ID.

The speedup achieved is:

New ID latency = 120 ps × 1.5 = 180 ps
New EX latency = 150 ps - 10 ps = 140 ps
New cycle time = 200 ps (IF is still the longest-latency stage)
Old cycle time = 200 ps
Speedup = (11 × 200 ps) / (10 × 200 ps) = 1.10
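And a matching hypothetical check for 4.10.5, assuming IF still sets the clock cycle after the ID and EX latencies change:

```python
# Hypothetical check of the 4.10.5 speedup when the branch resolves in ID.
old_latency = {"IF": 200, "ID": 120, "EX": 150, "MEM": 190, "WB": 100}   # ps
new_latency = dict(old_latency, ID=120 * 1.5, EX=150 - 10)               # ID=180, EX=140

old_cycle = max(old_latency.values())   # 200 ps
new_cycle = max(new_latency.values())   # still 200 ps (IF)

# 11 cycles with the branch resolved in EX, 10 with it resolved in ID (see 4.10.3).
print((11 * old_cycle) / (10 * new_cycle))   # 1.1
```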

07

Determine the speedup achieved by the given condition

4.10.6 Assumptions: stall-on-branch and no delay slots. If the beq address computation is moved to the MEM stage, the new clock cycle time and execution time are calculated as follows.

The clock cycle time remains unchanged: the 20 ps reduction in EX latency has no effect on the clock cycle time because EX is not the longest-latency stage.

For each branch, the change adds one additional stall cycle, so the number of cycles increases and the speedup is below 1.

Cycles with branch in EX = 4 + 5 + 1 × 2 = 11
Execution time (branch in EX) = 11 × 200 ps = 2200 ps
Cycles with branch in MEM = 4 + 5 + 1 × 3 = 12
Execution time (branch in MEM) = 12 × 200 ps = 2400 ps

Therefore, the speedup is 2200 / 2400 = 0.92.
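Finally, a hypothetical check of the 4.10.6 numbers (cycle time, execution times, and speedup):

```python
# Hypothetical check of 4.10.6: branch resolution moves from EX to MEM.
latency = {"IF": 200, "ID": 120, "EX": 150 - 20, "MEM": 190, "WB": 100}   # ps
cycle_time = max(latency.values())            # 200 ps, unchanged (IF still longest)

cycles_branch_in_ex  = (5 - 1) + 5 + 1 * 2    # 11 cycles, 2 stalls per branch
cycles_branch_in_mem = (5 - 1) + 5 + 1 * 3    # 12 cycles, 3 stalls per branch

time_branch_in_ex  = cycles_branch_in_ex * 200          # 2200 ps (old design)
time_branch_in_mem = cycles_branch_in_mem * cycle_time  # 2400 ps (new design)
print(cycle_time, time_branch_in_mem, round(time_branch_in_ex / time_branch_in_mem, 2))
# 200 2400 0.92
```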


