Warning: foreach() argument must be of type array|object, bool given in /var/www/html/web/app/themes/studypress-core-theme/template-parts/header/mobile-offcanvas.php on line 20

Question:Consider the following piece of C code:

for (j=2;j<1000;j++)

D[j] = D[j−1]+D[j−2];

Th e MIPS code corresponding to the above fragment is:

addiu \(s2,\)zero,7992

addiu \(s1,\)zero,16

loop: l.d \(f0, _16(\)s1)

l.d \(f2, _8(\)s1)

add.d \(f4, \)f0, \(f2

s.d \)f4, 0(\(s1)

addiu \)s1, \(s1, 8

bne \)s1, $s2, loop

Instructions have the following associated latencies (in cycles):

add.d

I.d

s.d

addiu

4

6

1

2

6.4.1 How many cycles does it take for all instructions in a single iteration of the above loop to execute?

6.4.2 When an instruction in a later iteration of a loop depends upon a data value produced in an earlier iteration of the same loop, we say that there is aloop carried dependencebetween iterations of the loop. Identify the loop-carried dependences in the above code. Identify the dependent program variable and assembly-level registers. You can ignore the loop induction variable j.

6.4.3 Loop unrolling was described in Chapter 4. Apply loop

unrolling to this loop and then consider running this code on a 2-node distributed memory message-passing system. Assume that we are going to use message passing as described in Section 6.7, where we introduce a new operation send (x, y) that sends to node x the value y, and an operation receive( ) that waits for the value being sent to it. Assume that send operations take a cycle to issue (i.e., later instructions on the same node can proceed on the next cycle), but take 10 cycles to be received on the receiving node. Receive instructions stall execution on the node where they are executed until they receive a message. Produce a schedule for the two nodes assuming an unroll factor of 4 for the loop body (i.e., the loop body will appear 4 times). Compute the number of cycles it will take for the loop to run on the message passing system.

6.4.4 The latency of the interconnect network plays a large role in the efficiency of message-passing systems. How fast does the interconnect need to be in order to obtain any speedup from using the distributed system described in Exercise 6.4.3

Short Answer

Expert verified

6.4.1 First instruction is executed once and the loop body is executed 998 times.

6.4.2 D[j] and D[j-1] will have loop carried dependencies. $f4 register will be dependent on the current iteration and the $f0 in the next iteration.

6.4.3 The loop body running in node 1:

addiu $s1, $zero, 996

l.d $f0, -16($s0)

l.d $f2, -8($s0)

loop:

add.d $f4, $f2, $f0

add.d $f6, $f4, $f2

Send (2, $f4)

Send (2, $f6)

s.d $f4, 0($s0)

s.d $f6, 8($s0)

Receive($f8)

add.d $f10, $f8, $f6

add.d $f0, $f10, $f8

Send (2, $f10)

Send (2, $f0)

s.d. $f8, 16($s0)

s.d $f10, 24($s0)

s.d $f0 32($s0)

Receive($f2)

s.d $f2 40($s0)

addiu $s0, $s0, 48

bne $s0, $s1, loop

add.d $f4, $f2, $f0

add.d $f6, $f4, $f2

add.d $f10, $f8, $f6

s.d $f4, 0($s0)

s.d $f6, 8($s0)

s.d $f8, 16($s0)

Code at node 2:

addiu $s2, $zero, 0

loop:

Receive ($f12)

Receive ($f14)

add.d $f16, $f14, $f12

Send(1, $f16)

Receive ($f12)

Receive ($f14)

add.d $f16, $f14, $f12

Send(1, $f16)

Receive ($f12)

Receive ($f14)

add.d $f16, $f14, $f12

Send(1, $f16)

Receive ($f12)

Receive ($f14)

add.d $f16, $f14, $f12

Send(1, $f16)

addiu $s2, $s2, 1

bne $s2, 83, loop

The loop takes 1463 cycles .

6.4.4 The loop network needs to respond in a single cycle to achieve speedup.

Step by step solution

01

Determine parallel processing

A Multiprocessor system will have more than one processor. A parallel processing system will run a single program with multiple processors. Independent programs will run independently in multiprocessors is called task-level parallelism. Set of computer systems connected over a local area network that works as a single large multiprocessor simultaneously is called a cluster.

02

Determine the cycles does it take for all instructions in a single iteration of the above loop to execute

6.4.1

Given piece of code:

for (j=2;j<1000;j++)

D[j] = D[j−1]+D[j−2];

The MIPS code corresponding to the above fragment is:

addiu $s2,$zero,7992

addiu $s1,$zero,16

loop: l.d $f0, _16($s1)

l.d $f2, _8($s1)

add.d $f4, $f0, $f2

s.d $f4, 0($s1)

addiu $s1, $s1, 8

bne $s1, $s2, loop

By the straightforward computation, the first instruction will execute once and the loop body is executed 998 times.

Approximately it takes around 20,959 cycles to compute.

03

Determine a schedule for the two nodes assuming an unroll factor of 4 for the loop body (i.e., the loop body will appear 4 times) and compute the number of cycles

6.4.3

Given:

  • Apply loop unrolling to this loop and consider that this code is running on node 2.

  • Use message passing in section 6.7

  • Introduce send(x,y) and the operation receive() .

  • Send() takes 1 cycle to issue and receive () takes 10 cycles to receive

Schedule code for node 1

The loop body running in node 1:

addiu $s1, $zero, 996

l.d $f0, -16($s0)

l.d $f2, -8($s0)

loop:

add.d $f4, $f2, $f0

add.d $f6, $f4, $f2

Send (2, $f4)

Send (2, $f6)

s.d $f4, 0($s0)

s.d $f6, 8($s0)

Receive($f8)

add.d $f10, $f8, $f6

add.d $f0, $f10, $f8

Send (2, $f10)

Send (2, $f0)

s.d. $f8, 16($s0)

s.d $f10, 24($s0)

s.d $f0 32($s0)

Receive($f2)

s.d $f2 40($s0)

addiu $s0, $s0, 48

bne $s0, $s1, loop

add.d $f4, $f2, $f0

add.d $f6, $f4, $f2

add.d $f10, $f8, $f6

s.d $f4, 0($s0)

s.d $f6, 8($s0)

s.d $f8, 16($s0)

Code at node 2:

addiu $s2, $zero, 0

loop:

Receive ($f12)

Receive ($f14)

add.d $f16, $f14, $f12

Send(1, $f16)

Receive ($f12)

Receive ($f14)

add.d $f16, $f14, $f12

Send(1, $f16)

Receive ($f12)

Receive ($f14)

add.d $f16, $f14, $f12

Send(1, $f16)

Receive ($f12)

Receive ($f14)

add.d $f16, $f14, $f12

Send(1, $f16)

addiu $s2, $s2, 1

bne $s2, 83, loop

The loop takes 1463 cycles for the loop to run on the message passing system.

04

Determine How fast does the interconnect need to be in order to obtain any speedup from using the distributed system described in Exercise 6.4.3

6.4.4 The loop network needs to respond in a single cycle to achieve speedup. This creates difficulty in using the distributed message passing when loops contain loop-carried dependencies.

Unlock Step-by-Step Solutions & Ace Your Exams!

  • Full Textbook Solutions

    Get detailed explanations and key concepts

  • Unlimited Al creation

    Al flashcards, explanations, exams and more...

  • Ads-free access

    To over 500 millions flashcards

  • Money-back guarantee

    We refund you if you fail your exam.

Over 30 million students worldwide already upgrade their learning with Vaia!

One App. One Place for Learning.

All the tools & learning materials you need for study success - in one app.

Get started for free

Most popular questions from this chapter

We would like to execute the loop below as efficiently as possible. We have two different machines, a MIMD machine and a SIMD machine.

for (i=0;i<2000;i++)

for(j=0;j<3000;j++)

X_array[i][j] = Y_array[j][i] + 200;

6.11.1 [10] For a 4 CPU MIMD machine, show the sequence of MIPS instructions that you would execute on each CPU. What is the speedup for this MIMD machine?

6.11.2 [10] For an 8-wide SIMD machine (i.e.,8 parallel SIMD functional units), write an assembly program in using your own SIMD extensions to MIPS to execute the loop. Compare the number of instructions executed on the SIMD machine to MIMD machine.

AMD has recently announced that they will be integrating a graphics processing unit with their x86 cores in a single package, though with different clocks for each of the cores. This is an example of a heterogeneous multiprocessor system which we expect to see produced commercially in the near future. One of the key design points will be to allow for fast data communication between the CPU and the GPU. Presently communications must be performed between discrete CPU and GPU chips. But this is changing in AMDs Fusion architecture. Presently the plan is to use multiple (at least 16) PCI express channels for facilitate intercommunication. Intel is also jumping into this arena with their Larrabee chip. Intel is considering to use their QuickPath interconnect technology.

6.15.1Compare the bandwidth and latency associated with these two interconnect technologies.

Consider the following three CPU organizations:

CPU SS: A 2-core superscalar microprocessor that provides out-of-order issue

capabilities on 2 function units (FUs). Only a single thread can run on each core at a time.

CPU MT: A fine-grained multithreaded processor that allows instructions from 2 threads to be run concurrently (i.e., there are two functional units), though only instructions from a single thread can be issued on any cycle.

CPU SMT: An SMT processor that allows instructions from 2 threads to be run

concurrently (i.e., there are two functional units), and instructions from either or both threads can be issued to run on any cycle.

Assume we have two threads X and Y to run on these CPUs that include the

following operations:

Thread X Thread Y

A1 – takes 3 cycles to execute B1 – take 2 cycles to execute

A2 – no dependences B2 – conflicts for a functional unit with B1

A3 – conflicts for a functional unit with A1 B3 – depends on the result of B2

A4 – depends on the result of A3 B4 – no dependences and takes 2 cycles to execute

Assume all instructions take a single cycle to execute unless noted oterwise or they encounter a hazard.

6.9.1 [10] <§6.4> Assume that you have 1 SS CPU. How many cycles will it take to execute these two threads? How many issue slots are wasted due to hazards?

6.9.2 [10] <§6.4> Now assume you have 2 SS CPUs. How many cycles will it take to execute these two threads? How many issue slots are wasted due to hazards?

6.9.3 [10] <§6.4> Assume that you have 1 MT CPU. How many cycles will it take

to execute these two threads? How many issue slots are wasted due to hazards?

Rewrite the code for fact to use fewer instructions.

Figure B.8.8 on page B-55 illustrates the implementation of the register file for the MIPS datapath. Pretend that a new register file is to be built, but that there are only two registers and only one read port, and that each register has only 2 bits of data. Redraw Figure B.8.8 so that every wire in your diagram corresponds to only 1 bit of data (unlike the diagram in Figure B.8.8, in which some wires are 5 bits and some wires are 32 bits). Redraw the registers using D flipflops. You do not need to show how to implement a D flip-flop or a multiplexor.

See all solutions

Recommended explanations on Computer Science Textbooks

View all explanations

What do you think about this solution?

We value your feedback to improve our textbook solutions.

Study anywhere. Anytime. Across all devices.

Sign-up for free