
Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits.

P1: X[0]++; X[1] = 3;
P2: X[0] = 5; X[1] += 2;

5.17.1 List the possible values of the given cache block for a correct cache coherence protocol implementation. List at least one more value the block could take if the protocol does not ensure cache coherence.

5.17.2 For a snooping protocol, list a valid operation sequence on each processor/cache to finish the above read/write operations.

5.17.3 What are the best-case and worst-case numbers of cache misses needed to execute the listed read/write instructions?

Memory consistency concerns the views of multiple data items. The following data shows two processors and their read/write operations on different cache blocks (A and B initially 0).

P1: A = 1; B = 2; A += 2; B++;
P2: C = B; D = A;

5.17.4 List the possible values of C and D for an implementation that ensures both consistency assumptions on page 470.

5.17.5 List at least one more possible pair of values for C and D if such assumptions are not maintained.

5.17.6 For various combinations of write policies and write allocation policies, which combinations make the protocol implementation simpler?

Short Answer

5.17.1 The possible values of the given cache block for a correct cache coherence protocol implementation arise from the following orderings:

Order 1: X[0]++ (P1); X[1] = 3 (P1); X[0] = 5 (P2); X[1] += 2 (P2)

Order 2: X[0]++ (P1); X[0] = 5 (P2); X[1] = 3 (P1); X[1] += 2 (P2)

Order 3: X[0] = 5 (P2); X[0]++ (P1); X[1] += 2 (P2); X[1] = 3 (P1)

Order 4: X[0]++ (P1); X[0] = 5 (P2); X[1] += 2 (P2); X[1] = 3 (P1)

Order 5: X[0] = 5 (P2); X[0]++ (P1); X[1] = 3 (P1); X[1] += 2 (P2)

Order 6: X[0] = 5 (P2); X[1] += 2 (P2); X[0]++ (P1); X[1] = 3 (P1)

If coherence is not ensured, one processor's updates can be lost; for example, if only P2's updates survive, the block ends up as (5, 2), a value no coherent implementation can produce.

5.17.2 For a snooping protocol, a valid operation sequence on each processor/cache to finish the above read/write operations is as follows:

1. P2 executes X[0] = 5: invalidate X in the other caches, then read and write the X block.
2. P2 executes X[1] += 2: read and write the X block.
3. P1 executes X[0]++: read the X value (the X block becomes shared), send an invalidate message (P2's copy of the X block is invalidated), then write the X block.
4. P1 executes X[1] = 3: write the X block.

5.17.3 Best case: orderings 1 and 6 (two misses). Worst case: orderings 2 and 3 (four misses).

5.17.4 The possible values of C and D for an implementation that ensures both consistency assumptions arise from the following orderings (each processor's operations stay in program order):

Order 1: A = 1; B = 2; A += 2; B++; C = B; D = A. Result: (3, 3)

Order 2: A = 1; B = 2; A += 2; C = B; B++; D = A. Result: (2, 3)

Order 3: A = 1; B = 2; C = B; A += 2; B++; D = A. Result: (2, 3)

Order 4: A = 1; C = B; B = 2; A += 2; B++; D = A. Result: (0, 3)

Order 5: C = B; A = 1; B = 2; A += 2; B++; D = A. Result: (0, 3)

Order 6: A = 1; B = 2; A += 2; C = B; D = A; B++. Result: (2, 3)

Order 7: A = 1; B = 2; C = B; A += 2; D = A; B++. Result: (2, 3)

Order 8: A = 1; C = B; B = 2; A += 2; D = A; B++. Result: (0, 3)

Order 9: C = B; A = 1; B = 2; A += 2; D = A; B++. Result: (0, 3)

Order 10: A = 1; B = 2; C = B; D = A; A += 2; B++. Result: (2, 1)

Order 11: A = 1; C = B; B = 2; D = A; A += 2; B++. Result: (0, 1)

Order 12: C = B; A = 1; B = 2; D = A; A += 2; B++. Result: (0, 1)

Order 13: A = 1; C = B; D = A; B = 2; A += 2; B++. Result: (0, 1)

Order 14: C = B; A = 1; D = A; B = 2; A += 2; B++. Result: (0, 1)

Order 15: C = B; D = A; A = 1; B = 2; A += 2; B++. Result: (0, 0)

The possible (C, D) pairs are therefore (0, 0), (0, 1), (0, 3), (2, 1), (2, 3), and (3, 3).

5.17.5 Result: (2, 0), which can occur if P2 observes the write B = 2 without having observed the earlier write A = 1.

5.17.6 Write-back is simpler.

Step-by-step solution

01

Determine cache coherence and snooping protocols.

A multicore chip contains multiple processors, each with its own cache. When these caches hold copies of data from the same physical address, the copies can end up with different values. This problem is referred to as the cache coherence problem.

Cache coherence protocols will maintain coherence for multiple processors.

There are two types of snooping protocols, Write-invalidate and Write-Update.

The write-invalidate protocol invalidates the copies of the data in all other processors' caches before the writing processor changes its own copy. Write-update instead broadcasts the new data, and all affected caches are updated with it.
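As a concrete illustration, here is a minimal Python sketch of the write-invalidate idea. It is a simplified, hypothetical model (one cached block per processor, a write-through update of memory, no MESI states), not a full protocol implementation:

```python
# Minimal write-invalidate sketch: each cache holds a copy of block X with a
# valid bit; a write first invalidates every other cached copy.

class Bus:
    def __init__(self):
        self.caches = []

    def invalidate_others(self, writer):
        for c in self.caches:
            if c is not writer:
                c.valid = False


class Cache:
    def __init__(self, bus, memory):
        self.bus, self.memory = bus, memory
        self.valid, self.data = False, None
        bus.caches.append(self)

    def read(self):
        if not self.valid:                 # miss: fetch the block from memory
            self.data = list(self.memory)
            self.valid = True
        return self.data

    def write(self, index, value):
        self.bus.invalidate_others(self)   # the write-invalidate step
        block = self.read()
        block[index] = value
        self.memory[index] = value         # write-through, for simplicity


memory = [0, 0]                            # block X = (X[0], X[1])
bus = Bus()
p1, p2 = Cache(bus, memory), Cache(bus, memory)

p2.write(0, 5)                             # P2: X[0] = 5
p1.write(0, p1.read()[0] + 1)              # P1: X[0]++ (sees 5, writes 6)
print(memory[0])                           # 6
```

Because P2's write invalidated P1's (empty) copy and P1's read re-fetched the block, P1's increment sees the value 5 rather than a stale 0.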

02

Determine the possible values of the given cache block for a correct cache coherence protocol implementation, and an additional value possible without coherence.

5.17.1

The given cache block is as follows:

P1: X[0]++; X[1] = 3;
P2: X[0] = 5; X[1] += 2;

The correct cache coherence protocol will maintain the coherence between the multiple processors.

The possible values are as follows:

Order 1:

X[0]++ (P1); X[1] = 3 (P1); X[0] = 5 (P2); X[1] += 2 (P2)

P1 increments X[0], but P2 then overwrites it with 5, so X[0] ends up 5. P1 sets X[1] = 3, and P2 then adds 2, so X[1] ends up 5.

The values will be (5,5)

Order 2:

X[0]++ (P1); X[0] = 5 (P2); X[1] = 3 (P1); X[1] += 2 (P2)

P1 increments X[0] first, but P2 overwrites it with 5, so X[0] ends up 5. P1 sets X[1] = 3, and P2 then adds 2, so X[1] ends up 5.

The values will be (5,5)

Order 3:

X[0] = 5 (P2); X[0]++ (P1); X[1] += 2 (P2); X[1] = 3 (P1)

P2 writes X[0] = 5 first and P1 then increments it, so X[0] ends up 6. P2's X[1] += 2 executes before P1's X[1] = 3, so X[1] ends up 3.

The values will be (6,3)

Order 4:

X[0]++ (P1); X[0] = 5 (P2); X[1] += 2 (P2); X[1] = 3 (P1)

P1 increments X[0] first, but P2 overwrites it with 5, so X[0] ends up 5. P2's X[1] += 2 executes before P1's X[1] = 3, so X[1] ends up 3.

The values will be (5,3)

Order 5:

X[0] = 5 (P2); X[0]++ (P1); X[1] = 3 (P1); X[1] += 2 (P2)

P2 assigns X[0] = 5 first, and P1 then increments it to 6. P1 sets X[1] = 3, and P2 then adds 2, so X[1] ends up 5.

The values will be (6,5)

Order 6:

X[0] = 5 (P2); X[1] += 2 (P2); X[0]++ (P1); X[1] = 3 (P1)

P2 assigns X[0] = 5 and P1 then increments it to 6. P2's addition to X[1] happens first, so P1's X[1] = 3 is the final value.

The values will be (6,3)

If coherence is not ensured, one processor's updates can be lost; for example, if only P2's updates survive, the block ends up as (5, 2), a value no coherent implementation can produce.
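The six orderings above can be checked mechanically: enumerate every interleaving of the two instruction streams that preserves each processor's program order, apply it to a coherent block, and collect the final values. A Python sketch (names are illustrative; each operation is assumed to execute atomically on the shared block):

```python
# P1's and P2's operations on block X = (X[0], X[1]), in program order.
p1 = [("P1", lambda x: (x[0] + 1, x[1])),   # X[0]++
      ("P1", lambda x: (x[0], 3))]          # X[1] = 3
p2 = [("P2", lambda x: (5, x[1])),          # X[0] = 5
      ("P2", lambda x: (x[0], x[1] + 2))]   # X[1] += 2

def interleavings(a, b):
    # All merges of a and b that keep each list's internal order.
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

finals = set()
for order in interleavings(p1, p2):
    x = (0, 0)                              # initially X[0] = X[1] = 0
    for _, op in order:
        x = op(x)
    finals.add(x)

print(sorted(finals))  # [(5, 3), (5, 5), (6, 3), (6, 5)]
```

There are six such interleavings, matching orders 1–6, and the four distinct outcomes (5, 5), (5, 3), (6, 5), and (6, 3) match the values derived above.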

03

Determine the list of valid operation sequences on each processor/cache to finish the above read/write operations for a snooping protocol.

5.17.2.

For a snooping protocol, the list of valid operation sequences on each processor/cache to finish the above read/write operations is as follows:

1. P2 executes X[0] = 5: invalidate X in the other caches, then read and write the X block.
2. P2 executes X[1] += 2: read and write the X block.
3. P1 executes X[0]++: read the X value (the X block becomes shared), send an invalidate message (P2's copy of the X block is invalidated), then write the X block.
4. P1 executes X[1] = 3: write the X block.

The snooping protocol invalidates X in the other caches so that coherence is maintained while P2 writes. P2 then reads and writes the X block. Afterwards, the X block becomes shared with processor P1. P1 reads the X block and sends an invalidate message to P2, which prevents P2 from using its now-stale copy of X. Finally, P1 writes its values into the X block.

04

Determine the best case and worst case of the list.

5.17.3

From the list provided in 5.17.1, the best and worst cases can be decided.

The best cases are orders 1 and 6 because they require only two misses.

The worst cases are orders 2 and 3 because these require 4 cache misses.
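These counts can be double-checked with a short sketch. Under a write-invalidate protocol every operation here writes block X, so a processor misses whenever its copy is cold or was invalidated by the other processor's most recent write:

```python
# The six orderings of 5.17.1, reduced to which processor touches block X.
orders = {
    1: ["P1", "P1", "P2", "P2"],
    2: ["P1", "P2", "P1", "P2"],
    3: ["P2", "P1", "P2", "P1"],
    4: ["P1", "P2", "P2", "P1"],
    5: ["P2", "P1", "P1", "P2"],
    6: ["P2", "P2", "P1", "P1"],
}

def misses(seq):
    valid = {"P1": False, "P2": False}  # both caches start cold
    n = 0
    for p in seq:
        if not valid[p]:                # block absent or invalidated -> miss
            n += 1
            valid[p] = True
        other = "P2" if p == "P1" else "P1"
        valid[other] = False            # every op writes X -> invalidate other copy
    return n

counts = {k: misses(v) for k, v in orders.items()}
print(counts)  # {1: 2, 2: 4, 3: 4, 4: 3, 5: 3, 6: 2}
```

Orders 1 and 6 give two misses (best case), orders 2 and 3 give four (worst case), and orders 4 and 5 fall in between with three.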

05

Determine the possible values of C and D for an implementation that ensures both consistency assumptions.

5.17.4

Each processor's operations remain in program order (P1: A = 1; B = 2; A += 2; B++ and P2: C = B; D = A).

Order 1: A = 1; B = 2; A += 2; B++; C = B; D = A

Result: (3, 3)

Order 2: A = 1; B = 2; A += 2; C = B; B++; D = A

Result: (2, 3)

Order 3: A = 1; B = 2; C = B; A += 2; B++; D = A

Result: (2, 3)

Order 4: A = 1; C = B; B = 2; A += 2; B++; D = A

Result: (0, 3)

Order 5: C = B; A = 1; B = 2; A += 2; B++; D = A

Result: (0, 3)

Order 6: A = 1; B = 2; A += 2; C = B; D = A; B++

Result: (2, 3)

Order 7: A = 1; B = 2; C = B; A += 2; D = A; B++

Result: (2, 3)

Order 8: A = 1; C = B; B = 2; A += 2; D = A; B++

Result: (0, 3)

Order 9: C = B; A = 1; B = 2; A += 2; D = A; B++

Result: (0, 3)

Order 10: A = 1; B = 2; C = B; D = A; A += 2; B++

Result: (2, 1)

Order 11: A = 1; C = B; B = 2; D = A; A += 2; B++

Result: (0, 1)

Order 12: C = B; A = 1; B = 2; D = A; A += 2; B++

Result: (0, 1)

Order 13: A = 1; C = B; D = A; B = 2; A += 2; B++

Result: (0, 1)

Order 14: C = B; A = 1; D = A; B = 2; A += 2; B++

Result: (0, 1)

Order 15: C = B; D = A; A = 1; B = 2; A += 2; B++

Result: (0, 0)

Under the consistency assumptions, the possible (C, D) pairs are therefore (0, 0), (0, 1), (0, 3), (2, 1), (2, 3), and (3, 3).
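The fifteen results above can be reproduced mechanically. Under the consistency assumptions, P1's writes become visible in program order, so P2's read C = B can be placed after any prefix of P1's program, and D = A after an equal or longer prefix. A Python sketch (names are illustrative):

```python
# P1's program, in order: A = 1; B = 2; A += 2; B++
def run_p1(k):
    """Return (A, B) after P1 has executed its first k operations."""
    a, b = 0, 0
    steps = [("a=", 1), ("b=", 2), ("a+", 2), ("b+", 1)]
    for name, v in steps[:k]:
        if name == "a=":
            a = v
        elif name == "b=":
            b = v
        elif name == "a+":
            a += v
        else:
            b += v
    return a, b

# P2's program: C = B; D = A.  C = B executes after i of P1's ops and
# D = A after j of them, with 0 <= i <= j <= 4 (P2's program order).
pairs = set()
for i in range(5):
    for j in range(i, 5):
        c = run_p1(i)[1]
        d = run_p1(j)[0]
        pairs.add((c, d))

print(sorted(pairs))  # [(0, 0), (0, 1), (0, 3), (2, 1), (2, 3), (3, 3)]
```

The six distinct pairs match the results of orders 1–15.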

06

Determine one more possible pair of values for C and D if such assumptions are not maintained.

5.17.5

Suppose P2 observes the write B = 2 before it has observed A = 1 (the writes become visible out of program order). Then C = 2 and D = 0, so the result is (2, 0).

07

Determine which combinations make the protocol implementation simpler.

5.17.6 Write-back facilitates the use of exclusive-access blocks and lowers the frequency of invalidates. It also avoids the write broadcasts that a write-through scheme would need, which require a more complex protocol.

The allocation policy has very little effect on the protocol.

So write-back is simpler than write-through.


Most popular questions from this chapter

One of the biggest impediments to widespread use of virtual machines is the performance overhead incurred by running a virtual machine. Listed below are various performance parameters and application behavior.

Base CPI: 1.5
Privileged O/S accesses per 10,000 instructions: 120
Performance impact to trap to the guest O/S: 15 cycles
Performance impact to trap to VMM: 175 cycles
I/O accesses per 10,000 instructions: 30
I/O access time (includes time to trap to guest O/S): 1100 cycles

(5.15.1) Calculate the CPI for the system listed above assuming that there are no accesses to I/O. What is the CPI if the VMM performance impact doubles? If it is cut in half? If a virtual machine software company wishes to obtain a 10% performance degradation, what is the longest possible penalty to trap to the VMM?

(5.15.2) I/O accesses often have a large impact on overall system performance. Calculate the CPI of a machine using the performance characteristics above, assuming a non-virtualized system. Calculate the CPI again, this time using a virtualized system. How do these CPIs change if the system has the I/O access? Explain why I/O bound applications have a smaller impact from virtualization.

(5.15.3) Compare and contrast the ideas of virtual memory and virtual machines. How do the goals of each compare? What are the pros and cons of each? List a few cases where virtual memory is desired, and a few cases where virtual machines are desired.

(5.15.4) Section 5.6 discusses virtualization under the assumption that the virtualized system is running the same ISA as the underlying hardware. However, one possible use of virtualization is to emulate non-native ISAs. An example of this is QEMU, which emulates a variety of ISAs such as MIPS, SPARC, and PowerPC. What are some of the difficulties involved in this kind of virtualization? Is it possible for an emulated system to run faster than on its native ISA?

In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows:

struct entry {
    int srcIP;          // remote IP address
    char URL[128];      // request URL (e.g., "GET index.html")
    long long refTime;  // reference time
    int status;         // connection status
    char browser[64];   // client browser name
} log[NUM_ENTRIES];

Assume the following processing function for the log:

topK_sourceIP (int hour);

5.19.1 Which fields in a log entry will be accessed for the given log processing function? Assuming 64-byte cache blocks and no prefetching, how many cache misses per entry does the given function incur on average?

5.19.2 How can you reorganize the data structure to improve cache utilization and access locality? Show your structure definition code.

5.19.3 Give an example of another log processing function that would prefer a different data structure layout. If both functions are important, how would you rewrite the program to improve the overall performance? Supplement the discussion with code snippets and data.

For the problems below, use data from “Cache performance for SPEC CPU2000 Benchmarks”(http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the pairs of benchmarks shown in the following table.

a.

Mesa/gcc

b.

mcf/swim

5.19.4 For 64KiB data caches with varying set associativities, what are the miss rates broken down by miss types (cold, capacity, and conflict misses) for each benchmark?

5.19.5 Select the set associativity to be used by a 64KiB L1 data cache shared by both benchmarks. If the L1 cache has to be directly mapped, select the set associativity for the 1 MiB cache.

5.19.6 Give an example in the miss rate table where higher set associativity increases the miss rate. Construct a cache configuration and reference stream to demonstrate this.

Media applications that play audio or video files are part of a class of workloads called “streaming” workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KiB working set sequentially with the following address stream:

0, 2, 4, 6, 8, 10, 12, 14, 16, …

5.5.1 Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, based on the 3C model?

5.5.2 Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

5.5.3 “Prefetching” is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data is found in the prefetch buffer, it is considered as a hit and moved into the cache and the next cache block is prefetched. Assume a two-entry stream buffer and assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?
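As a sanity check of the stream-buffer behavior described in 5.5.3, here is a simplified Python sketch. It ignores cache capacity and conflicts (harmless here, since the sequential stream never revisits a block), and the two-entry buffer and its refill policy are modeled in an illustrative way rather than cycle-accurately:

```python
# Two-entry sequential stream buffer in front of a cache, for the
# sequential byte-address stream 0, 2, 4, ...
BLOCK = 32                              # bytes per cache block (as in 5.5.1)

def run(addresses):
    cache = set()                       # blocks resident in the cache
    stream_buf = []                     # prefetched blocks, at most two
    misses = 0
    for a in addresses:
        b = a // BLOCK
        if b in cache:
            continue                    # ordinary cache hit
        if b in stream_buf:             # prefetch hit: move block into cache
            stream_buf.remove(b)
            cache.add(b)
        else:                           # true miss: fetch the block
            misses += 1
            cache.add(b)
            stream_buf.clear()
        nxt = b + 1                     # refill with sequentially next blocks
        while len(stream_buf) < 2:
            if nxt not in cache and nxt not in stream_buf:
                stream_buf.append(nxt)
            nxt += 1
    return misses

addrs = list(range(0, 512 * 1024, 2))   # 512 KiB working set, stride 2
print(run(addrs))                       # 1 -- only the very first access misses
```

Under these assumptions the buffer always holds the next block by the time it is needed, so only the very first access misses.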

Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of 1.35 references (both instruction and data) per instruction, help find the optimal block size given the following miss rates for various block sizes.

8: 4%
16: 3%
32: 2%
64: 1.5%
128: 1%

5.5.4 What is the optimal block size for a miss latency of 20×B cycles?

5.5.5 What is the optimal block size for a miss latency of 24+B cycles?

5.5.6 For constant miss latency, what is the optimal block size?

Recall that we have two write policies and write allocation policies, and their combinations can be implemented either in L1 or L2 cache. Assume the following choices for L1 and L2 caches:

L1: write through, non-write allocate
L2: write back, write allocate

5.4.1 Buffers are employed between different levels of memory hierarchy to reduce access latency. For this given configuration, list the possible buffers needed between L1 and L2 caches, as well as L2 cache and memory.

5.4.2 Describe the procedure of handling an L1 write-miss, considering the components involved and the possibility of replacing a dirty block.

5.4.3 For a multilevel exclusive cache configuration (a block can only reside in one of the L1 and L2 caches), describe the procedure of handling an L1 write-miss, considering the components involved and the possibility of replacing a dirty block.

Consider the following program and cache behaviors.

Data reads per 1000 instructions: 250
Data writes per 1000 instructions: 100
Instruction cache miss rate: 0.30%
Data cache miss rate: 2%
Block size (bytes): 64

5.4.4 For a write-through, write-allocate cache, what are the minimum read and write bandwidths (measured by byte per cycle) needed to achieve a CPI of 2?

5.4.5 For a write-back, write-allocate cache, assuming 30% of replaced data cache blocks are dirty, what are the minimal read and write bandwidths needed for a CPI of 2?

5.4.6 What are the minimal bandwidths needed to achieve the performance of CPI=1.5?

For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache.

Tag: bits 31-10
Index: bits 9-5
Offset: bits 4-0

5.3.1 What is the cache block size (in words)?

5.3.2 How many entries does the cache have?

5.3.3 What is the ratio between total bits required for such a cache implementation over the data storage bits?

Starting from power on, the following byte-addressed cache references are recorded.

Addresses: 0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180

5.3.4 How many blocks are replaced?

5.3.5 What is the hit ratio?

5.3.6 List the final state of the cache, with each valid entry represented as a record of <index, tag, data>
