Warning: foreach() argument must be of type array|object, bool given in /var/www/html/web/app/themes/studypress-core-theme/template-parts/header/mobile-offcanvas.php on line 20

Assume for arithmetic, load/store, and branch instructions, a processor have CPIs of 1, 12, and 5, respectively. Also assume that on a single processor a program requires the execution of 2.56E9 arithmetic instructions, 1.28E9 load/store instructions, and 256 million branch instructions. Assume that each processor has a 2 GHz clock frequency.

Assume that, as the program is parallelized to run over multiple cores, the number of arithmetic and load/store instructions per processor is divided by 0.7 x p (where p is the number of processors) but the number of branch instructions per processor remains the same.

1.9.1 Find the total execution time for this program on 1, 2, 4, and 8 processors, and show the relative speedup of the 2, 4, and 8 processor result relative to the single processor result.

1.9.2 If the CPI of the arithmetic instructions was doubled, what would the impact be on the execution time of the program on 1, 2, 4, or 8 processors?

1.9.3 To what should the CPI of load/store instructions be reduced in order for a single processor to match the performance of four processors using the original CPI values?

Short Answer

Expert verified

1.9.1)

Execution time for processor 1: 9.6 sec

Execution time for processor 2: 7.02 sec

Execution time for processor 4: 3.86 sec

Execution time for processor 8: 2.25 sec

Relative speed taken by processor 2: 1.37

Relative speed taken by processor 4: 2.49

Relative speed taken by processor 8: 4.27

1.9.2)

Impact on execution time on 1 processor: 10.88ms

Impact on execution time on 2 processor: 7.954ms

Impact on execution time on 4 processor: 4.297ms

Impact on execution time on 8 processor: 2.468ms

1.9.3)

reduced CPI of load/store instructions is 25%

Step by step solution

01

Consider the following data

CPI for

Arithmeticinstruction:1Load/Storeinstruction:12Branchinstruction:5

No. of Instruction per processor

Arithmetic:2.56×109Load/Store:1.28×109Branch:256×106instructions

Clock rate=

When a program is distributed to operate on multicore processors, the amount of arithmetic and load store operations per processor is divided by 0.7 and multiplied by the number of processors p, but the branch command remains unchanged. There are four processors: 1,2,4,8. To calculate the execution, the processor uses the following method:

Write a formula for execution time for a processor:

Executiontime=clockcycleclockrate.........(1)

Write a formula for clock cycle.

Clockcyles=CPIfpxNo.FPinstr.+CPIintxNo.INTinstr.+CPI1/zxNo.L/Sinstr.+CPlbranchxNo.branchinstr............(2)

Write a formula forRelative speedup

Relativespeedup=Timetakenby1processorTimetakenbypprocessors.....................(3)

02

Execution time for processor 1, processor 2, processor 4, and processor 8

1.9.1

Execution time for processor 1:

Compute the clock cycle using the following strategy:

Clockcycle=CPIfpxNo.FPinstr.+CPlint,*No.INTinstr+CPI1/zxNo.L/Sinstr.+CPIbranch,*No.branchinstr

Therefore,

Clockcycle=2560000000 x1 + 1280000000x12 + 256000000 x5=2560000000 + 15360000000 + 1280000000=19200000000 cycles

Now calculate the execution time with the help of following method:

Executiontimeforaprocessor=clockcycleclockrate

Therefore,

CPUexecutiontime=19200000000cycles/2×109cycles/sec=9600000000/109=9.6sec

Relativespeedup=Timetakenby1processorTimetakenbypprocessors=9.6/9.6=1

Execution time for processor 2:

When a programme is distributed to operate on multicore processors, the amount of arithmetic and load store operations per processor is divided by 0.7 and multiplied by the number of processors p, but the branch command remains unchanged.There are four processors: 1,2,4,8.

Compute the clock cycle using the following strategy:

Clockcyles=CPIfpxNo.FPinstr.+CPlint,xNo.INTinstr+CPI1/zxNo.L/Sinstr.+CPlbranch,*No.branchinstr

Therefore,

clockcycle=25600.7*8*2+12890.7*8*12+256*5=914.23+2742.86+1280=4937.08cycle

Now calculate the execution time with the help of following method:

Executiontimeforaprocessor=clockcycleclockrate

Therefore

CPUexecutiontime=14040000000cycles/2*108cycles/sec=7020000000/109=7.02sec

Relativespeedup=Timetakenby1processorTimetakenby2processors=9.67.02=1.37

Execution time for processor 4:

When a programme is distributed to operate on multicore processors, the amount of arithmetic and load store operations per processor is divided by 0.7 and multiplied by the number of processors p, but the branch command remains unchanged. There are four processors: 1,2,4,8.

Compute the clock cycle using the following strategy:

Clockeyes=CPIfp,xNo.FPinstr.+CPIint,xNo.INTinstr+CPI1/z,xNo.L/Sinstr.+CPlbranch*No.branchinstr

Therefore,

clockcycle=2560000000*1/0.7*4+1280000000*12/0.7*4+256000000*5=7720000000cycles

Now calculate the execution time with the help of following method:

Executiontimeforaprocessor=clockcycleclockrate

Therefore,

CPUexecutiontime=1720000000cycles/2*109cycles/sec=3.86secRelativespeedup=Timetakenby1processorTimetakenby4processors=9.6/3.86=2.49

Execution time for processor 8:

When a programme is distributed to operate on multicore processors, the amount of arithmetic and load store operations per processor is divided by 0.7 and multiplied by the number of processors p, but the branch command remains unchanged. There are four processors: 1,2,4,8.

Compute the clock cycle using the following strategy:

Clockcyles=CPlfpxNo.FPinstr.+CPIint,xNo.INTinstr+CPI1/zxNo.L/Sinstr.+CPlbranch*No.branchinstr

Therefore,

localid="1655192387163" clockcycle=2560000000*10.7*8+1280000000*120.7*8+256000000*5=256000000005.6+153600000005.6+1280000000=4500000000cycles

Now calculate the execution time with the help of following method:

Executiontimeforaprocessor=clockcycleclockrate

Therefore,

CPUexecutiontime=4500000000cycles/2*109Cycle/sec=2250000000/109=2.25sec

Time taken by processor

Relativespeedup=Timetakenby1processorTimetakenby8processor=9.6/225=4.27

03

Execution time with double Arithmetic instructions.

1.9.2

Execution time for processor 1 when Arithmetic instructions are doubled:

If an Arithmetic instruction is doubled (CPI=2) then Execution time for processor 1 can be calculated as:

First calculate the Clock cycle, which can be calculated as:

Clock cycle

=2560×2+1280×12+256×5=21760cycles

Now execution time can be calculated as:

CPUexecutiontime=21760cycles/2x10cycles/sec=10.88ms

Execution time for processor 2 when Arithmetic instructions are doubled:

If an Arithmetic instruction is doubled (CPI=2) then Execution time for processor 2 can be calculated as:

Clockcycle=2560/0.7*2*2+1280/0.7*2*12+256*5=3657.14+1097143+1280=15908.57cycles

Now execution time can be calculated as:

CPUexecutiontime=15908.57cycles/2*109cycles/sec=7.954ms

Execution time for processor 4 when Arithmetic instructions are doubled:

If an Arithmetic instruction is doubled (CPI=2) then Execution time for processor 4 can be calculated as:

clockcycle=25600.7*4*2+12800.7*4*12+256*5=1828.57+5485.71+1280=8594

Now execution time can be calculated as:

CPU execution time

=8594.28cycles/2*109cycles/sec=4.297ms

Execution time for processor 8 when Arithmetic instructions are doubled:

If an Arithmetic instruction is doubled (CPI=2) then Execution time for processor 8 can be calculated as:

clockcycle=25600.7*8*2+12890.7*8*12+256*5=914.23+2742.86+1280=4937.08cycle

Now execution time can be calculated as:

=4937.08cycles/2x109cycles/sec=2.468ms

04

Determine the reduced CPI of load/store instructions.

1.9.3

For 4 processors:

When a programme is distributed to operate on multicore processors, the amount of arithmetic and load store operations per processor is divided by 0.7 and multiplied by the number of processors p, but the branch command remains unchanged. There are four processors: 1,2,4,8.

Therefore,

clock cycle

=2560000000/2.8+15360000000/2.8+1280000000=7720000000cycles

Now calculate the execution time with the help of following method:

Executiontimeforaprocessor=clockcycleclockrate

Therefore,

CPUexecutiontime=7720000000cycles/2x109cycles/sec=3860000000/109=3.86sec

Reducing CPI of a single processor to match the performance of 4 processors:

Compute the clock cycle using the following strategy:

Clockcyles=CPIfpxNo.FPinstr.+CPIintxNo.INTinstr.+CPI1/zxNo.L/Sinstr.+CPIbranchxNo.branchinstr

Therefore,

clock cycle

=2560000000x1+1280000000xa+256000000x5=2560000000+1280000000a+1280000000=3840000000+1280000000xa

Now calculate the execution time with the help of following method:

Executiontimeforaprocessor=clockcycleclockrate

Therefore,

CPU execution time

=3840000000+1280000000*a2*109cycle/sec=38400000002*109+1280000000*a2*109=3.86=1.92+0.64*a=1.94=0.64*a=a=3.03

The reduced CPI is calculated as follows:

Original CPI for load instructions=12

ReducedCPI=aoriginalCPIforloadinstructions=3.03/12=0.25or25%

Thus, the reduced CPI of load/store instructions is 25%.

Unlock Step-by-Step Solutions & Ace Your Exams!

  • Full Textbook Solutions

    Get detailed explanations and key concepts

  • Unlimited Al creation

    Al flashcards, explanations, exams and more...

  • Ads-free access

    To over 500 millions flashcards

  • Money-back guarantee

    We refund you if you fail your exam.

Over 30 million students worldwide already upgrade their learning with Vaia!

One App. One Place for Learning.

All the tools & learning materials you need for study success - in one app.

Get started for free

Most popular questions from this chapter

2 Section 1.10 cites as a pitfall the utilization of a subset of the performance equation as a performance metric. To illustrate this, consider the following two processors. P1 has a clock rate of 4 GHz, average CPI of 0.9, and requires the execution of 5.0E9 instructions. P2 has a clock rate of 3 GHz, an average CPI of 0.75, and requires the execution of 1.0E9 instructions.

1.12.1 [5] <§§1.6, 1.10> One usual fallacy is to consider the computer with the largest clock rate as having the largest performance. Check if this is true for P1 and P2.

1.12.2 [10] <§§1.6, 1.10> Another fallacy is to consider that the processor executing the largest number of instructions will need a larger CPU time. Considering that processor P1 is executing a sequence of 1.0E9 instructions and that the CPI of processors P1 and P2 do not change, determine the number of instructions that P2 can execute in the same time that P1 needs to execute 1.0E9 instructions.

1.12.3 [10] <§§1.6, 1.10> A common fallacy is to use MIPS (millions of

instructions per second) to compare the performance of two different processors, and consider that the processor with the largest MIPS has the largest performance.

Check if this is true for P1 and P2.

1.12.4 [10] <§1.10> Another common performance figure is MFLOPS (millions of floating-point operations per second), defined as

MFLOPS = No. FP operations / (execution time × 1E6)

but this figure has the same problems as MIPS. Assume that 40% of the instructions executed on both P1 and P2 are floating-point instructions. Find the MFLOPS figures for the programs

The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6 GHz and voltage of 1.25 V. Assume that, on average, it consumed 10 W of static power and 90 W of dynamic power. The Core i5 Ivy Bridge, released in 2012, had a clock rate of 3.4 GHz and voltage of 0.9 V. Assume that, on average, it consumed 30 W of static power and 40 W of dynamic power.

1.8.1 For each processor find the average capacitive loads.

1.8.2 Find the percentage of the total dissipated power comprised by static power and the ratio of static power to dynamic power for each technology.

1.8.3 If the total dissipated power is to be reduced by 10%, how much should the voltage be reduced to maintain the same leakage current? Note: power is defined as the product of voltage and current

A.9 [25] Write and test a MIPS assembly language program to compute and print the first 100 prime numbers. A number n is prime if no numbers except 1 and n divide it evenly. You should implement two routines: ■ test_prime (n) Return 1 if n is prime and 0 if n is not prime. ■ main () Iterate over the integers, testing if each is prime. Print the first 100 numbers that are prime. Test your programs by running them on SPIM.

B.30 [15] <§B.6> This exercise is similar to Exercises B.28 and B.29, but this time calculate the relative speeds of a 64-bit adder using ripple carry only, ripple carry of 4-bit groups that use carry lookahead, ripple carry of 16-bit groups that use carry lookahead, and the carry-lookahead scheme from Exercise B.27.

Now we wish to add the instructionaddi(add immediate). Add any necessary changes to the datapath and to the control signals. How many product terms are required in a PLA that implements the control for the single-cycle datapath foraddiu?

See all solutions

Recommended explanations on Computer Science Textbooks

View all explanations

What do you think about this solution?

We value your feedback to improve our textbook solutions.

Study anywhere. Anytime. Across all devices.

Sign-up for free