Chapter 6: Q4E (page 565)

Question: Consider the following piece of C code:
for (j=2;j<1000;j++)
    D[j] = D[j-1]+D[j-2];
The MIPS code corresponding to the above fragment is:
      addiu $s2, $zero, 7992
      addiu $s1, $zero, 16
loop: l.d $f0, -16($s1)
      l.d $f2, -8($s1)
      add.d $f4, $f0, $f2
      s.d $f4, 0($s1)
      addiu $s1, $s1, 8
      bne $s1, $s2, loop
Instructions have the following associated latencies (in cycles):
add.d    l.d    s.d    addiu
4        6      1      2
6.4.1 How many cycles does it take for all instructions in a single iteration of the above loop to execute?
6.4.2 When an instruction in a later iteration of a loop depends upon a data value produced in an earlier iteration of the same loop, we say that there is a loop-carried dependence between iterations of the loop. Identify the loop-carried dependences in the above code. Identify the dependent program variable and assembly-level registers. You can ignore the loop induction variable j.
6.4.3 Loop unrolling was described in Chapter 4. Apply loop unrolling to this loop and then consider running this code on a 2-node distributed memory message-passing system. Assume that we are going to use message passing as described in Section 6.7, where we introduce a new operation send(x, y) that sends to node x the value y, and an operation receive() that waits for the value being sent to it. Assume that send operations take a cycle to issue (i.e., later instructions on the same node can proceed on the next cycle), but take 10 cycles to be received on the receiving node. Receive instructions stall execution on the node where they are executed until they receive a message. Produce a schedule for the two nodes assuming an unroll factor of 4 for the loop body (i.e., the loop body will appear 4 times). Compute the number of cycles it will take for the loop to run on the message-passing system.
6.4.4 The latency of the interconnect network plays a large role in the efficiency of message-passing systems. How fast does the interconnect need to be in order to obtain any speedup from using the distributed system described in Exercise 6.4.3?

Short Answer

6.4.1 The first instruction executes once and the loop body executes 998 times, for a total of approximately 20,959 cycles.

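6.4.2 The array elements D[j] and D[j-1] carry loop-carried dependences; the dependent program variable is D. The value computed in $f4 and stored to D[j] in the current iteration is loaded into $f2 (as D[j-1]) in the next iteration and into $f0 (as D[j-2]) in the iteration after that.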
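
The dependence can also be checked mechanically. The following is a minimal sketch, not part of the exercise: a hypothetical Python helper (the name dependence_distances and the small bound n are assumptions for illustration) that records, for each iteration j of D[j] = D[j-1] + D[j-2], which earlier iteration wrote the elements it reads.

```python
def dependence_distances(n):
    """Return the set of iteration distances between a read of D[k]
    and the iteration that wrote D[k], for D[j] = D[j-1] + D[j-2]."""
    last_writer = {}                      # element index -> iteration that wrote it
    distances = set()
    for j in range(2, n):
        for k in (j - 1, j - 2):          # elements read in iteration j
            if k in last_writer:          # D[0] and D[1] are initial inputs
                distances.add(j - last_writer[k])
        last_writer[j] = j                # iteration j writes D[j]
    return distances


print(sorted(dependence_distances(10)))  # [1, 2]
```

A distance of 1 corresponds to the value stored from $f4 being reloaded as D[j-1], and a distance of 2 to the same value being reloaded as D[j-2].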

6.4.3 The loop body running in node 1:

addiu $s1, $zero, 996
l.d $f0, -16($s0)
l.d $f2, -8($s0)
loop:
add.d $f4, $f2, $f0
add.d $f6, $f4, $f2
Send(2, $f4)
Send(2, $f6)
s.d $f4, 0($s0)
s.d $f6, 8($s0)
Receive($f8)
add.d $f10, $f8, $f6
add.d $f0, $f10, $f8
Send(2, $f10)
Send(2, $f0)
s.d $f8, 16($s0)
s.d $f10, 24($s0)
s.d $f0, 32($s0)
Receive($f2)
s.d $f2, 40($s0)
addiu $s0, $s0, 48
bne $s0, $s1, loop
add.d $f4, $f2, $f0
add.d $f6, $f4, $f2
add.d $f10, $f8, $f6
s.d $f4, 0($s0)
s.d $f6, 8($s0)
s.d $f8, 16($s0)

Code at node 2:

addiu $s2, $zero, 0
loop:
Receive($f12)
Receive($f14)
add.d $f16, $f14, $f12
Send(1, $f16)
Receive($f12)
Receive($f14)
add.d $f16, $f14, $f12
Send(1, $f16)
Receive($f12)
Receive($f14)
add.d $f16, $f14, $f12
Send(1, $f16)
Receive($f12)
Receive($f14)
add.d $f16, $f14, $f12
Send(1, $f16)
addiu $s2, $s2, 1
bne $s2, 83, loop

The loop takes 1,463 cycles.

6.4.4 The interconnect network would need to deliver a message within a single cycle for the distributed system to achieve any speedup.

Step by step solution

Determine parallel processing

A multiprocessor system has more than one processor. A parallel processing program is a single program that runs on multiple processors simultaneously. Running independent programs on separate processors is called task-level parallelism. A set of computers connected over a local area network that functions as a single large multiprocessor is called a cluster.

Determine how many cycles it takes for all instructions in a single iteration of the above loop to execute

6.4.1

Given piece of code:

for (j=2;j<1000;j++)
    D[j] = D[j-1]+D[j-2];

The MIPS code corresponding to the above fragment is:

      addiu $s2, $zero, 7992     # $s2 = loop bound (byte offset)
      addiu $s1, $zero, 16       # $s1 = byte offset of D[2]
loop: l.d $f0, -16($s1)          # $f0 = D[j-2]
      l.d $f2, -8($s1)           # $f2 = D[j-1]
      add.d $f4, $f0, $f2        # $f4 = D[j-2] + D[j-1]
      s.d $f4, 0($s1)            # D[j] = $f4
      addiu $s1, $s1, 8          # advance to the next element
      bne $s1, $s2, loop         # repeat until the bound is reached

By straightforward computation, the first instruction executes once and the loop body executes 998 times.

In total, the loop takes approximately 20,959 cycles.
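
The solution does not show how it arrives at this figure. One model that reproduces it is sketched below; it is an illustration under stated assumptions, not necessarily the author's method. It sums the listed latencies with no overlap between instructions, assumes that bne costs 2 cycles (its latency is not given in the table), and charges 1 cycle for the code before the loop. The constant and function names are hypothetical.

```python
# Latencies taken from the exercise's table (cycles).
LATENCY = {"l.d": 6, "add.d": 4, "s.d": 1, "addiu": 2}

BNE_CYCLES = 2      # ASSUMPTION: bne is not listed in the latency table
SETUP_CYCLES = 1    # ASSUMPTION: one cycle charged for the code before the loop
ITERATIONS = 998    # loop-body executions, as stated in the solution

# Instructions of one loop iteration, excluding the branch.
LOOP_BODY = ["l.d", "l.d", "add.d", "s.d", "addiu"]


def cycles_per_iteration():
    """Sum the latencies of one iteration, assuming no overlap."""
    return sum(LATENCY[op] for op in LOOP_BODY) + BNE_CYCLES


def total_cycles():
    """Setup cost plus the per-iteration cost times the iteration count."""
    return SETUP_CYCLES + ITERATIONS * cycles_per_iteration()


print(cycles_per_iteration(), total_cycles())  # 21 20959
```

Under this model a single iteration takes 21 cycles, and 1 + 998 × 21 = 20,959. A different pipeline model (for example, one that overlaps the two loads) would give a different per-iteration count.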

Determine a schedule for the two nodes assuming an unroll factor of 4 for the loop body (i.e., the loop body will appear 4 times) and compute the number of cycles

6.4.3

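Given:

Apply loop unrolling to this loop and consider running the code on a 2-node distributed-memory message-passing system.
Use message passing as described in Section 6.7.
Introduce the operation send(x, y), which sends the value y to node x, and the operation receive(), which waits for a value sent to the node.
A send takes 1 cycle to issue, and the message takes 10 cycles to arrive at the receiving node. A receive stalls its node until the message arrives.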

Schedule code for node 1

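The loop body running in node 1:

# $s0 is assumed to hold the address of D[j]
addiu $s1, $zero, 996
l.d $f0, -16($s0)            # $f0 = D[j-2]
l.d $f2, -8($s0)             # $f2 = D[j-1]
loop:
add.d $f4, $f2, $f0          # D[j]   = D[j-1] + D[j-2]
add.d $f6, $f4, $f2          # D[j+1] = D[j]   + D[j-1]
Send(2, $f4)                 # ship D[j] and D[j+1] to node 2
Send(2, $f6)
s.d $f4, 0($s0)
s.d $f6, 8($s0)
Receive($f8)                 # D[j+2], computed on node 2
add.d $f10, $f8, $f6         # D[j+3] = D[j+2] + D[j+1]
add.d $f0, $f10, $f8         # D[j+4] = D[j+3] + D[j+2]
Send(2, $f10)                # ship D[j+3] and D[j+4] to node 2
Send(2, $f0)
s.d $f8, 16($s0)
s.d $f10, 24($s0)
s.d $f0, 32($s0)
Receive($f2)                 # D[j+5], computed on node 2
s.d $f2, 40($s0)
addiu $s0, $s0, 48           # advance six elements (6 x 8 bytes)
bne $s0, $s1, loop
# remaining elements, computed locally after the loop
add.d $f4, $f2, $f0
add.d $f6, $f4, $f2
add.d $f10, $f8, $f6
s.d $f4, 0($s0)
s.d $f6, 8($s0)
s.d $f8, 16($s0)

Code at node 2:

addiu $s2, $zero, 0          # iteration counter
loop:
Receive($f12)                # wait for two consecutive elements from node 1
Receive($f14)
add.d $f16, $f14, $f12       # their sum is the next element
Send(1, $f16)                # return it to node 1
Receive($f12)
Receive($f14)
add.d $f16, $f14, $f12
Send(1, $f16)
Receive($f12)
Receive($f14)
add.d $f16, $f14, $f12
Send(1, $f16)
Receive($f12)
Receive($f14)
add.d $f16, $f14, $f12
Send(1, $f16)
addiu $s2, $s2, 1
bne $s2, 83, loop            # 83 iterations of four sums each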

The loop takes 1,463 cycles to run on the message-passing system.
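
The dataflow of this schedule can be checked against the serial loop. The sketch below is a hypothetical Python emulation, not part of the solution: it models the two message channels as queues, runs node 2's receive/add/send step whenever node 1 would block on a receive, and compares the result with the serial recurrence. It checks values only and does not model timing, so it does not reproduce the 1,463-cycle figure. The function names, the initial values, and the requirement that n - 2 be a multiple of 6 (so that no epilogue is needed) are assumptions.

```python
from collections import deque


def serial(d0, d1, n):
    """Reference: the original loop D[j] = D[j-1] + D[j-2]."""
    d = [d0, d1] + [0.0] * (n - 2)
    for j in range(2, n):
        d[j] = d[j - 1] + d[j - 2]
    return d


def two_node(d0, d1, n):
    """Emulate the send/receive dataflow of the two-node schedule."""
    if (n - 2) % 6 != 0:
        raise ValueError("n - 2 must be a multiple of 6 in this sketch")
    d = [d0, d1] + [0.0] * (n - 2)
    to_node2, to_node1 = deque(), deque()   # message channels

    def node2_step():
        # Node 2: receive two operands, add them, send the sum back.
        a = to_node2.popleft()
        b = to_node2.popleft()
        to_node1.append(a + b)

    j = 2                                   # element that $s0 points at
    f0, f2 = d[j - 2], d[j - 1]
    while j < n:
        f4 = f2 + f0                        # D[j]
        f6 = f4 + f2                        # D[j+1]
        to_node2.extend((f4, f6))           # Send(2, $f4); Send(2, $f6)
        d[j], d[j + 1] = f4, f6
        node2_step()
        f8 = to_node1.popleft()             # Receive($f8): D[j+2]
        f10 = f8 + f6                       # D[j+3]
        f0 = f10 + f8                       # D[j+4]
        to_node2.extend((f10, f0))          # Send(2, $f10); Send(2, $f0)
        d[j + 2], d[j + 3], d[j + 4] = f8, f10, f0
        node2_step()
        f2 = to_node1.popleft()             # Receive($f2): D[j+5]
        d[j + 5] = f2
        j += 6                              # six elements per iteration
    return d


print(two_node(1.0, 1.0, 20) == serial(1.0, 1.0, 20))  # True
```

Each pass through node 1's loop produces six elements, four computed locally and two returned by node 2, which is why node 2's loop, with four sums per pass, runs half as many times.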

Determine how fast the interconnect needs to be in order to obtain any speedup from using the distributed system described in Exercise 6.4.3

6.4.4 The interconnect network would need to deliver a message within a single cycle for the distributed system to achieve any speedup. This shows why distributed message passing is difficult to use when loops contain loop-carried dependences.
