Media applications that play audio or video files are part of a class of workloads called “streaming” workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KiB working set sequentially with the following address stream:

0, 2, 4, 6, 8, 10, 12, 14, 16, …

5.5.1 Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, based on the 3C model?

5.5.2 Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

5.5.3 “Prefetching” is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data is found in the prefetch buffer, it is considered as a hit and moved into the cache and the next cache block is prefetched. Assume a two-entry stream buffer and assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?

Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of 1.35 references (both instruction and data) per instruction, help find the optimal block size given the following miss rates for various block sizes.

8: 4%
16: 3%
32: 2%
64: 1.5%
128: 1%

5.5.4 What is the optimal block size for a miss latency of 20×B cycles?

5.5.5 What is the optimal block size for a miss latency of 24+B cycles?

5.5.6 For constant miss latency, what is the optimal block size?

Short Answer

5.5.1

Miss rate = 1/16

5.5.2

If the cache block size is 16 bytes, the miss rate for the given address stream is 1/8.

If the cache block size is 64 bytes, the miss rate for the given address stream is 1/32.

If the cache block size is 128 bytes, the miss rate for the given address stream is 1/64.

The workload is exploiting spatial locality.

5.5.3

The miss rate is 0.00038% ≈ 0%

5.5.4

B = 8 is the optimal block size

5.5.5

B=32 is the optimal block size

5.5.6

B = 128 is optimal

Step by step solution

01

Determine the formulae

Write the formula for the number of blocks in the working set

$\text{number of blocks} = \dfrac{\text{working set size}}{\text{cache block size}}$ …(1)

Write the formula for the number of accesses that fall in each block

$\text{accesses per block} = \dfrac{\text{cache block size}}{\text{address stride}}$ …(2)

Write the formula for calculating the miss rate

$\text{miss rate} = \dfrac{\text{number of misses}}{\text{total accesses}}$ …(3)

02

Determine the miss rate for the given address stream 

Working set accessed by the video streaming workload = 512 KiB

Given address stream = 0, 2, 4, 6, 8, 10, 12, 14, 16, …

Stride between consecutive addresses = 2 bytes

Size of direct-mapped cache = 64 KiB

Cache block size = 32 bytes

First, we calculate the number of blocks in the working set

$\text{number of blocks} = \dfrac{512\ \text{KiB}}{32\ \text{bytes}} = 16\text{K}$

Now we calculate how many accesses fall in each block. Consecutive addresses are 2 bytes apart, so

$\text{accesses per block} = \dfrac{32\ \text{bytes}}{2\ \text{bytes}} = 16$

The first access to each block misses and brings the whole block into the cache; the next 15 accesses hit in that block.

Therefore, a miss occurs on every 16th access of the address stream.

So, the miss rate for the given address stream is $\dfrac{1\ \text{miss}}{16\ \text{accesses}} = \dfrac{1}{16}$

The miss rate is not sensitive to the size of the cache or the working set, because each block is touched once and never reused. It depends only on the block size and the access stride.

03

Categorize the misses

A miss occurs on every 16th access of the address stream, so in every 16 addresses of the stream there are 15 hits and 1 miss.

Therefore, the number of misses is predictable for a given workload.

Based on the 3C model, the misses this workload experiences are cold-start (compulsory) misses: every miss is the first reference to a block that has never been in the cache.
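The 1/16 result and its insensitivity to cache size can be checked with a small simulation. This is a minimal sketch, assuming a single sequential pass over the working set with a 2-byte stride; the function name `direct_mapped_miss_rate` and its parameters are illustrative and not part of the exercise.

```python
def direct_mapped_miss_rate(cache_bytes, block_bytes, working_set_bytes, stride=2):
    """Simulate one sequential pass; return (misses, accesses)."""
    num_lines = cache_bytes // block_bytes
    tags = [None] * num_lines            # one tag per cache line, None = invalid
    misses = accesses = 0
    for addr in range(0, working_set_bytes, stride):
        block = addr // block_bytes      # block number in memory
        index = block % num_lines        # cache line this block maps to
        tag = block // num_lines         # identifies which block occupies the line
        accesses += 1
        if tags[index] != tag:           # cold or replaced line: miss, then fill
            misses += 1
            tags[index] = tag
    return misses, accesses


KIB = 1024
for cache_kib in (16, 64, 1024):
    misses, accesses = direct_mapped_miss_rate(cache_kib * KIB, 32, 512 * KIB)
    print(f"{cache_kib} KiB cache: {misses}/{accesses} = {misses / accesses}")
```

Every cache size in the loop gives 16384 misses out of 262144 accesses (0.0625), which is one miss per 32-byte block regardless of how many lines the cache has.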

04

Re-compute the miss rate if the cache block size is 16 bytes

5.5.2

Given data

Working set accessed by the video streaming workload = 512 KiB

Given address stream = 0, 2, 4, 6, 8, 10, 12, 14, 16, …

Stride between consecutive addresses = 2 bytes

Size of direct-mapped cache = 64 KiB

Case 1:

If cache block size = 16 bytes

$\text{number of blocks} = \dfrac{512\ \text{KiB}}{16\ \text{bytes}} = 32\text{K}$

$\text{accesses per block} = \dfrac{16\ \text{bytes}}{2\ \text{bytes}} = 8$

Therefore, a miss occurs on every 8th access of the address stream.

So, the miss rate for the given address stream is $\dfrac{1\ \text{miss}}{8\ \text{accesses}} = \dfrac{1}{8}$

05

Re-compute the miss rate if the cache block size is 64 bytes

If cache block size = 64 bytes

$\text{number of blocks} = \dfrac{512\ \text{KiB}}{64\ \text{bytes}} = 8\text{K}$

$\text{accesses per block} = \dfrac{64\ \text{bytes}}{2\ \text{bytes}} = 32$

Therefore, a miss occurs on every 32nd access of the address stream.

So, the miss rate for the given address stream is $\dfrac{1\ \text{miss}}{32\ \text{accesses}} = \dfrac{1}{32}$

06

Re-compute the miss rate if the cache block size is 128 bytes

If cache block size = 128 bytes

$\text{number of blocks} = \dfrac{512\ \text{KiB}}{128\ \text{bytes}} = 4\text{K}$

$\text{accesses per block} = \dfrac{128\ \text{bytes}}{2\ \text{bytes}} = 64$

Therefore, a miss occurs on every 64th access of the address stream.

So, the miss rate for the given address stream is $\dfrac{1\ \text{miss}}{64\ \text{accesses}} = \dfrac{1}{64}$
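The three cases follow one pattern: one miss per block, and block size divided by stride accesses per block. A short sketch of that closed form, assuming the 2-byte stride of the given address stream (the helper name `streaming_miss_rate` is made up for illustration):

```python
from fractions import Fraction


def streaming_miss_rate(block_bytes, stride_bytes=2):
    """Miss rate of a sequential stream: one miss per block touched."""
    accesses_per_block = block_bytes // stride_bytes
    return Fraction(1, accesses_per_block)


for block in (16, 32, 64, 128):
    print(f"{block}-byte block: miss rate = {streaming_miss_rate(block)}")
```

Doubling the block size halves the miss rate, which is what a workload that relies only on spatial locality should show.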

07

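Determine the kind of locality this workload exploits

Spatial locality means that when a location is referenced, nearby locations are likely to be referenced soon. Here every access is followed by an access to the adjacent location, so most accesses hit in a block that an earlier miss already brought in. The data is not reused, so there is no temporal locality to exploit.

So, the workload is exploiting spatial locality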

08

Using Prefetching to reduce the miss rate to zero

5.5.3

Given data

Working set accessed by the video streaming workload = 512 KiB

Given address stream = 0, 2, 4, 6, 8, 10, 12, 14, 16, …

Stride between consecutive addresses = 2 bytes

Size of direct-mapped cache = 64 KiB

With a two-entry stream buffer, and a cache latency short enough that a block can be loaded before the computation on the previous block completes, each sequentially adjacent block is already in the stream buffer by the time the workload reaches it.

Prefetching predicts future accesses: while the current block is being processed, the next blocks are loaded ahead of time, so only the very first access of the stream misses and the miss rate drops to nearly zero.

Number of misses = 1

$\text{total accesses} = \dfrac{512\ \text{KiB}}{2\ \text{bytes}} = \dfrac{512 \times 1024}{2} = 262144$

Now, calculate the miss rate

$\text{miss rate} = \dfrac{\text{number of misses}}{\text{total accesses}} = \dfrac{1}{262144} = 0.00000381469$

Therefore, the miss rate is 0.00038% ≈ 0%
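The single-miss count can be reproduced with a toy model of the stream buffer. This is a sketch under stated assumptions, not a description of real hardware: prefetches always complete in time (as the exercise allows), the cache is modeled as a plain set because a sequential pass never returns to an earlier block, and the name `stream_buffer_misses` is hypothetical.

```python
from collections import deque


def stream_buffer_misses(working_set_bytes, block_bytes=32, stride=2, entries=2):
    """Return (misses, accesses) for one sequential pass with a stream buffer."""
    cached = set()                       # blocks currently in the cache
    buffer = deque(maxlen=entries)       # prefetched blocks, oldest first
    next_prefetch = None                 # next sequential block to prefetch
    misses = accesses = 0
    for addr in range(0, working_set_bytes, stride):
        block = addr // block_bytes
        accesses += 1
        if block in cached:
            continue                     # ordinary cache hit
        if block in buffer:
            # Found in the stream buffer: counts as a hit, moves to the cache,
            # and the freed entry is refilled with the next sequential block.
            buffer.remove(block)
            cached.add(block)
            buffer.append(next_prefetch)
            next_prefetch += 1
        else:
            # True miss: fetch the block and restart the stream after it.
            misses += 1
            cached.add(block)
            buffer.clear()
            buffer.extend(range(block + 1, block + 1 + entries))
            next_prefetch = block + 1 + entries
    return misses, accesses


misses, accesses = stream_buffer_misses(512 * 1024)
print(misses, accesses, misses / accesses)
```

Only the access to address 0 misses; every later block is found in the buffer, giving 1 miss in 262144 accesses.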

09

Determine the optimal block size for a miss latency of 20×B cycles

5.5.4

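To determine the optimal block size we compare the AMAT (average memory access time) for each B. The hit time is the same for every block size, so only the miss term, miss rate × miss latency, needs to be compared.

AMAT miss term for B = 8: 0.040 × (20 × 8) = 6.40

AMAT miss term for B = 16: 0.030 × (20 × 16) = 9.60

AMAT miss term for B = 32: 0.020 × (20 × 32) = 12.80

AMAT miss term for B = 64: 0.015 × (20 × 64) = 19.20

AMAT miss term for B = 128: 0.010 × (20 × 128) = 25.60

B = 8 is the optimal block size for a miss latency of 20×B cycles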
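The exercise also gives a base CPI of 1 and 1.35 references per instruction, so the same comparison can be expressed as an overall CPI. A minimal sketch, assuming CPI = base CPI + references per instruction × miss rate × miss latency; the constant and function names are illustrative.

```python
BASE_CPI = 1.0
REFS_PER_INSTR = 1.35
# Miss rates from the exercise, keyed by block size in bytes.
MISS_RATES = {8: 0.04, 16: 0.03, 32: 0.02, 64: 0.015, 128: 0.01}


def cpi(block_bytes, miss_latency):
    """CPI including memory stall cycles for one block size."""
    return BASE_CPI + REFS_PER_INSTR * MISS_RATES[block_bytes] * miss_latency


# Miss latency of 20 x B cycles (exercise 5.5.4).
cpis = {b: cpi(b, 20 * b) for b in MISS_RATES}
for b, value in cpis.items():
    print(f"B = {b:3d}: CPI = {value:.2f}")

best = min(cpis, key=cpis.get)
print("optimal block size:", best)   # 8
```

Scaling the miss term by a constant and adding the base CPI does not change the ordering, so the smallest block still wins under this latency model.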

10

Determine the optimal block size for a miss latency of 24+B cycles

5.5.5

Again, we compare the miss term of the AMAT (miss rate × miss latency) to determine the optimal block size for a miss latency of 24+B cycles

AMAT miss term for B = 8: 0.040 × (24 + 8) = 1.28

AMAT miss term for B = 16: 0.030 × (24 + 16) = 1.20

AMAT miss term for B = 32: 0.020 × (24 + 32) = 1.12

AMAT miss term for B = 64: 0.015 × (24 + 64) = 1.32

AMAT miss term for B = 128: 0.010 × (24 + 128) = 1.52

B = 32 is the optimal block size for a miss latency of 24+B cycles

11

Determine the optimal block size for constant miss latency

B = 128 is optimal

With a constant miss latency the miss term is proportional to the miss rate, so the block size with the lowest miss rate (1% at B = 128) minimizes the total miss latency.
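The selection rule used in these steps generalizes to any latency model: evaluate miss rate × latency for each block size and take the minimum. The sketch below assumes that rule; `best_block_size` is a hypothetical helper, and the constant latency of 100 cycles is an arbitrary illustrative value, since the exercise does not specify one.

```python
# Miss rates from the exercise, keyed by block size in bytes.
MISS_RATES = {8: 0.04, 16: 0.03, 32: 0.02, 64: 0.015, 128: 0.01}


def best_block_size(latency_fn):
    """Return (best block size, per-block miss cost) for a latency model."""
    cost = {b: rate * latency_fn(b) for b, rate in MISS_RATES.items()}
    return min(cost, key=cost.get), cost


# Latency with a fixed part plus a per-byte part (exercise 5.5.5).
best_affine, _ = best_block_size(lambda b: 24 + b)
# Constant latency (exercise 5.5.6); 100 cycles is an assumed example value.
best_constant, _ = best_block_size(lambda b: 100)

print(best_affine)     # 32
print(best_constant)   # 128
```

With a fixed startup cost, larger blocks amortize it until the per-byte transfer time dominates; with a constant latency, only the miss rate matters.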

Most popular questions from this chapter

Question: In this exercise, we will examine space/time optimizations for page tables. The following list provides parameters of a virtual memory system.

Virtual Address (bits): 43
Physical DRAM Installed: 16 GiB
Page Size: 4 KiB
PTE Size (bytes): 4

(5.12.1) For a single-level page table, how many page table entries (PTEs) are needed? How much physical memory is needed for storing the page table?

(5.12.2) Using a multilevel page table can reduce the physical memory consumption of page tables, by keeping active PTEs in physical memory. How many levels of page tables will be needed in this case? And how many memory references are needed for address translation if missing in TLB?

(5.12.3) An inverted page table can be used to further optimize space and time. How many PTEs are needed to store the page table? Assuming a hash table implementation, what are the common case and worst case numbers of memory references needed for servicing a TLB miss?

The following table shows the contents of a 4-entry TLB.

Entry-ID | Valid | VA Page | Modified | Protection | PA Page
1 | 1 | 140 | 1 | RW | 30
2 | 0 | 40 | 0 | RX | 34
3 | 1 | 200 | 1 | RO | 32
4 | 1 | 280 | 0 | RW | 31

(5.12.4) Under what scenarios would entry 2’s valid bit be set to zero?

(5.12.5) What happens when an instruction writes to VA page 30? When would software managed TLB be faster than hardware managed TLB?

(5.12.6) What happens when an instruction writes to VA page 200?

Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for benchmarks with private vs shared L2 cache designs. Assume L1 cache misses once every 32 instructions.

Benchmark A misses-per-instruction: 0.30% (private), 0.12% (shared)
Benchmark B misses-per-instruction: 0.06% (private), 0.03% (shared)

Assume the following hit latencies:

Private Cache: 5
Shared Cache: 20
Memory: 180

5.18.1 Which cache design is better for each of these benchmarks? Use data to support your conclusion.

5.18.2 Shared cache latency increases with the CMP size. Choose the best design if the shared cache latency doubles. Off-chip bandwidth becomes the bottleneck as the number of CMP cores increases. Choose the best design if off-chip memory latency doubles.

5.18.3 Discuss the pros and cons of shared vs. private L2 caches for both single-threaded, multi-threaded, and multiprogrammed workloads, and reconsider them if having on-chip L3 caches.

5.18.4 Assume both benchmarks have a base CPI of 1 (ideal L2 cache). If having a non-blocking cache improves the average number of concurrent L2 misses from 1 to 2, how much performance improvement does this provide over a shared L2 cache? How much improvement can be achieved over private L2?

5.18.5 Assume new generations of processors double the number of cores every 18 months. To maintain the same level of per-core performance, how much more off-chip memory bandwidth is needed for a processor released in three years?

5.18.6 Consider the entire memory hierarchy. What kinds of optimizations can improve the number of concurrent misses?

Recall that we have two write policies and write allocation policies, and their combinations can be implemented either in L1 or L2 cache. Assume the following choices for L1 and L2 caches:

L1: Write through, non-write allocate
L2: Write back, write allocate

5.4.1 Buffers are employed between different levels of memory hierarchy to reduce access latency. For this given configuration, list the possible buffers needed between L1 and L2 caches, as well as L2 cache and memory.

5.4.2 Describe the procedure of handling an L1 write-miss, considering the component involved and the possibility of replacing a dirty block.

5.4.3 For a multilevel exclusive cache configuration (a block can only reside in one of the L1 and L2 caches), describe the procedure of handling an L1 write-miss, considering the component involved and the possibility of replacing a dirty block.

Consider the following program and cache behaviors.

Data Reads per 1000 Instructions: 250
Data Writes per 1000 Instructions: 100
Instruction Cache Miss Rate: 0.30%
Data Cache Miss Rate: 2%
Block Size (bytes): 64

5.4.4 For a write-through, write-allocate cache, what are the minimum read and write bandwidths (measured by byte per cycle) needed to achieve a CPI of 2?

5.4.5 For a write-back, write-allocate cache, assuming 30% of replaced data cache blocks are dirty, what are the minimal read and write bandwidths needed for a CPI of 2?

5.4.6 What are the minimal bandwidths needed to achieve the performance of CPI=1.5?

Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits.

P1: X[0]++; X[1] = 3;
P2: X[0] = 5; X[1] += 2;

5.17.1 List the possible values of the given cache block for a correct cache coherence protocol implementation. List at least one more possible value of the block if the protocol doesn’t ensure cache coherency.

5.17.2 For a snooping protocol, list a valid operation sequence on each processor/cache to finish the above read/write operations.

5.17.3 What are the best-case and worst-case numbers of cache misses needed to execute the listed read/write instructions?

Memory consistency concerns the views of multiple data items. The following data shows two processors and their read/write operations on different cache blocks (A and B initially 0).

P1: A = 1; B = 2; A += 2; B++;
P2: C = B; D = A;

5.17.4 List the possible values of C and D for an implementation that ensures both consistency assumptions on page 470.

5.17.5 List at least one more possible pair of values for C and D if such assumptions are not maintained.

5.17.6 For various combinations of write policies and write allocation policies, which combinations make the protocol implementation simpler?

This exercise examines the impact of different cache designs, specifically comparing associative caches to the direct-mapped caches from Section 5.4. For these exercises, refer to the address stream shown in Exercise 5.2.

(5.7.1) Using the sequence of references from Exercise 5.2, show the final cache contents for a three-way set associative cache with two- word blocks and a total size of 24 words. Use LRU replacement. For each reference identify the index bits, the tag bits, the block offset bits, and if it is a hit or a miss.

(5.7.2) Using the references from Exercise 5.2, show the final cache contents for a fully associative cache with one-word blocks and a total size of 8 words. Use LRU replacement. For each reference identify the index bits, the tag bits, and if it is a hit or a miss.

(5.7.3) Using the references from Exercise 5.2, what is the miss rate for a fully associative cache with two-word blocks and a total size of 8 words, using LRU replacement? What is the miss rate using MRU (most recently used) replacement? Finally what is the best possible miss rate for this cache, given any replacement policy?

Multilevel caching is an important technique to overcome the limited amount of space that a first level cache can provide while still maintaining its speed. Consider a processor with the following parameters:

Base CPI, No Memory Stalls: 1.5
Processor Speed: 2 GHz
Main Memory Access Time: 100 ns
First Level Cache Miss Rate per Instruction: 7%
Second Level Cache, Direct-Mapped Speed: 12 cycles
Global Miss Rate with Second Level Cache, Direct-Mapped: 3.5%
Second Level Cache, Eight-Way Set Associative Speed: 28 cycles
Global Miss Rate with Second Level Cache, Eight-Way Set Associative: 1.5%

(5.7.4) Calculate the CPI for the processor in the table using: 1) only a first level cache, 2) a second level direct-mapped cache, and 3) a second level eight-way set associative cache. How do these numbers change if main memory access time is doubled? If it is cut in half?

(5.7.5) It is possible to have an even greater cache hierarchy than two levels. Given the processor above with a second level, direct-mapped cache, a designer wants to add a third level cache that takes 50 cycles to access and will reduce the global miss rate to 1.3%. Would this provide better performance? In general, what are the advantages and disadvantages of adding a third level cache?

(5.7.6) In older processors such as the Intel Pentium or Alpha 21264, the second level of cache was external (located on a different chip) from the main processor and the first level cache. While this allowed for large second level caches, the latency to access the cache was much higher, and the bandwidth was typically lower because the second level cache ran at a lower frequency. Assume a 512 KiB off-chip second level cache has a global miss rate of 4%. If each additional 512 KiB of cache lowered global miss rates by 0.7%, and the cache had a total access time of 50 cycles, how big would the cache have to be to match the performance of the second level direct-mapped cache listed above? Of the eight way-set associative cache?
