Media applications that play audio or video files are part of a class of workloads called “streaming” workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KiB working set sequentially with the following address stream:

0, 2, 4, 6, 8, 10, 12, 14, 16, …

5.5.1 Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, based on the 3C model?

5.5.2 Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

5.5.3 “Prefetching” is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data is found in the prefetch buffer, it is considered as a hit and moved into the cache and the next cache block is prefetched. Assume a two-entry stream buffer and assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?

Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of 1.35 references (both instruction and data) per instruction, help find the optimal block size given the following miss rates for various block sizes.

| Block size B (bytes) | Miss rate |
| --- | --- |
| 8 | 4% |
| 16 | 3% |
| 32 | 2% |
| 64 | 1.5% |
| 128 | 1% |

5.5.4 What is the optimal block size for a miss latency of 20×B cycles?

5.5.5 What is the optimal block size for a miss latency of 24+B cycles?

5.5.6 For constant miss latency, what is the optimal block size?

Short Answer

5.5.1

Miss rate = 1/16

5.5.2

If the cache block size is 16 bytes, the miss rate for the given address stream is 1/8.

If the cache block size is 64 bytes, the miss rate for the given address stream is 1/32.

If the cache block size is 128 bytes, the miss rate for the given address stream is 1/64.

The workload is exploiting spatial locality.

5.5.3

The miss rate is 0.00038% ≈ 0%.

5.5.4

B = 8 is the optimal block size.

5.5.5

B = 32 is the optimal block size.

5.5.6

B = 128 is optimal

Step by step solution

01

Determine the formulae

Write the formula for the number of blocks in the working set:

Number of blocks = working-set size / cache block size ……(1)

Write the formula for the number of accesses that fall in each block:

Accesses per block = cache block size / stride of the address stream ……(2)

Write the formula for the miss rate:

Miss rate = number of misses / total accesses ……(3)

02

Determine the miss rate for the given address stream

5.5.1

Working set accessed by the video streaming workload = 512 KiB

Given address stream = 0, 2, 4, 6, 8, 10, 12, 14, 16, …

Stride between consecutive addresses = 2 bytes (the addresses are taken to be byte addresses)

Size of direct-mapped cache = 64 KiB

Cache block size = 32 bytes

First, we calculate the number of blocks in the working set:

Number of blocks = working-set size / cache block size = 512 KiB / 32 bytes = 16K

Next, we calculate how many accesses fall in each block:

Accesses per block = cache block size / stride = 32 bytes / 2 bytes = 16

The stream is sequential and never returns to a block, so only the first access to each block misses and the remaining 15 hit.

Therefore, a miss occurs on every 16th access of the address stream.

So, the miss rate for the given address stream is 1 miss / 16 accesses = 1/16.

The miss rate is not sensitive to the size of the cache or the working set, because no block is reused once the stream has moved past it. It is sensitive to the block size and the access pattern.
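The 1/16 result and its insensitivity to cache and working-set size can be checked with a small simulation. This is a minimal sketch, assuming byte addresses and a single sequential pass over the working set; the function name `direct_mapped_miss_rate` and its parameters are illustrative and not part of the exercise.

```python
KIB = 1024

def direct_mapped_miss_rate(cache_bytes, block_bytes, working_set_bytes, stride=2):
    """Simulate one sequential pass (byte addresses assumed) and return the miss rate."""
    num_lines = cache_bytes // block_bytes
    tags = [None] * num_lines            # tag currently held by each cache line
    misses = accesses = 0
    for addr in range(0, working_set_bytes, stride):
        block = addr // block_bytes      # block number in memory
        index = block % num_lines        # direct-mapped: exactly one candidate line
        tag = block // num_lines
        if tags[index] != tag:           # first touch of this block: miss and fill
            misses += 1
            tags[index] = tag
        accesses += 1
    return misses / accesses

print(direct_mapped_miss_rate(64 * KIB, 32, 512 * KIB))    # the configuration of 5.5.1
print(direct_mapped_miss_rate(16 * KIB, 32, 512 * KIB))    # smaller cache
print(direct_mapped_miss_rate(64 * KIB, 32, 2048 * KIB))   # larger working set
```

All three calls give the same value, 0.0625 = 1/16, since every block is touched in one burst and never revisited.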

03

Categorize the misses

A miss occurs on every 16th access of the address stream, so in every 16 addresses of the stream there are 15 hits and 1 miss.

Therefore, the number of misses is predictable for a given workload.

Each miss is the first reference to a block that has never been in the cache, so based on the 3C model the misses this workload experiences are cold-start (compulsory) misses.

04

Re-compute the miss rate if the cache block size is 16 bytes

Given data

Working set accessed by the video streaming workload = 512 KiB

Given address stream = 0, 2, 4, 6, 8, 10, 12, 14, 16, …

Stride between consecutive addresses = 2 bytes

Size of direct-mapped cache = 64 KiB

Case 1:

If cache block size = 16 bytes

Number of blocks = working-set size / cache block size = 512 KiB / 16 bytes = 32K

Accesses per block = cache block size / stride = 16 bytes / 2 bytes = 8

Therefore, a miss occurs on every 8th access of the address stream.

So, the miss rate for the given address stream is 1 miss / 8 accesses = 1/8.

05

Re-compute the miss rate if the cache block size is 64 bytes

If cache block size = 64 bytes

Number of blocks = working-set size / cache block size = 512 KiB / 64 bytes = 8K

Accesses per block = cache block size / stride = 64 bytes / 2 bytes = 32

Therefore, a miss occurs on every 32nd access of the address stream.

So, the miss rate for the given address stream is 1 miss / 32 accesses = 1/32.

06

Re-compute the miss rate if the cache block size is 128 bytes

If cache block size = 128 bytes

Number of blocks = working-set size / cache block size = 512 KiB / 128 bytes = 4K

Accesses per block = cache block size / stride = 128 bytes / 2 bytes = 64

Therefore, a miss occurs on every 64th access of the address stream.

So, the miss rate for the given address stream is 1 miss / 64 accesses = 1/64.
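The three cases follow one pattern: for a purely sequential stream, one access per block misses, so the miss rate is the stride divided by the block size. A short sketch of that closed form, assuming byte addresses and a 2-byte stride (the helper name `streaming_miss_rate` is hypothetical):

```python
from fractions import Fraction

def streaming_miss_rate(block_bytes, stride_bytes=2):
    """Miss rate of a sequential, no-reuse stream: one miss per block touched."""
    # If the stride is at least one block, every access lands in a new block.
    return min(Fraction(1), Fraction(stride_bytes, block_bytes))

for block_bytes in (16, 32, 64, 128):
    print(f"{block_bytes:3d}-byte blocks: miss rate = {streaming_miss_rate(block_bytes)}")
```

Doubling the block size halves the miss rate, which is the signature of spatial locality.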

07

Determine the kind of locality this workload exploits

Spatial locality occurs when a reference to one location makes it likely that nearby locations will be referenced soon. Here every access is to the address immediately after the previous one, and no address is ever referenced again, so there is no temporal locality.

So, the workload is exploiting spatial locality.

08

Using Prefetching to reduce the miss rate to zero

5.5.3

Given data

Working set accessed by the video streaming workload = 512 KiB

Given address stream = 0, 2, 4, 6, 8, 10, 12, 14, 16, …

Stride between consecutive addresses = 2 bytes

Size of direct-mapped cache = 64 KiB

With the two-entry stream buffer and the stated cache latency, each cache block can be loaded before the computation on the previous cache block is completed.

Prefetching predicts future accesses: while the current block is being processed, the next sequential block is already being loaded into the stream buffer, so every block after the first is found there and counted as a hit. Only the very first access misses, which reduces the miss rate to essentially zero.

Number of misses = 1

Total accesses = working-set size / stride = 512 KiB / 2 bytes

= (512 × 1024) / 2 = 262144

Now, calculate the miss rate:

Miss rate = number of misses / total accesses = 1 / 262144 = 0.00000381469

Therefore, the miss rate is 0.00038% ≈ 0%.
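The single-miss count can be reproduced with a small model of the stream buffer. This is a sketch under stated assumptions: byte addresses, prefetches that always complete before they are needed (as the exercise states), and a buffer that is refilled with the next sequential block on every buffer hit. The function name and parameters are illustrative.

```python
from collections import deque

def miss_rate_with_stream_buffer(cache_bytes, block_bytes, working_set_bytes,
                                 stride=2, buffer_entries=2):
    """Sequential pass through a direct-mapped cache backed by a stream buffer."""
    num_lines = cache_bytes // block_bytes
    tags = [None] * num_lines
    buffer = deque()                     # block numbers already prefetched
    misses = accesses = 0
    for addr in range(0, working_set_bytes, stride):
        block = addr // block_bytes
        index, tag = block % num_lines, block // num_lines
        accesses += 1
        if tags[index] == tag:
            continue                     # ordinary cache hit
        if block in buffer:
            # Buffer hit: counts as a hit, block moves to the cache, next block is prefetched.
            buffer.remove(block)
            buffer.append((buffer[-1] if buffer else block) + 1)
        else:
            # True miss: fetch the block and restart the stream behind it.
            misses += 1
            buffer.clear()
            buffer.extend(range(block + 1, block + 1 + buffer_entries))
        tags[index] = tag
    return misses, accesses

misses, accesses = miss_rate_with_stream_buffer(64 * 1024, 32, 512 * 1024)
print(misses, accesses, f"{100 * misses / accesses:.5f}%")
```

Under these assumptions only the access to address 0 misses, giving 1 miss in 262144 accesses.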

09

Determine the optimal block size for a miss latency of 20×B cycles

5.5.4

To determine the optimal block size, we calculate the miss component of the AMAT (average memory access time), miss rate × miss latency, for each block size B. The hit time is the same for every B, so it does not affect the comparison.

AMAT for B = 8: 0.040 × (20 × 8) = 6.40

AMAT for B = 16: 0.030 × (20 × 16) = 9.60

AMAT for B = 32: 0.020 × (20 × 32) = 12.80

AMAT for B = 64: 0.015 × (20 × 64) = 19.20

AMAT for B = 128: 0.010 × (20 × 128) = 25.60

B = 8 is the optimal block size for a miss latency of 20×B cycles.
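The comparison can be tabulated in a few lines. This sketch computes only the miss term (miss rate × miss latency, in cycles per reference) and assumes the hit time is identical for all block sizes; `MISS_RATES` holds the values given in the exercise, and the function name is illustrative.

```python
MISS_RATES = {8: 0.040, 16: 0.030, 32: 0.020, 64: 0.015, 128: 0.010}

def miss_cost_per_reference(miss_latency):
    """Return {block size: miss rate * miss latency} in cycles per reference."""
    return {b: rate * miss_latency(b) for b, rate in MISS_RATES.items()}

costs = miss_cost_per_reference(lambda b: 20 * b)   # miss latency of 20 x B cycles
optimal = min(costs, key=costs.get)

for block_bytes, cycles in costs.items():
    print(f"B = {block_bytes:3d}: {cycles:.2f} cycles")
print("optimal block size:", optimal)
```

Because the latency grows linearly with B while the miss rate falls more slowly, the smallest block wins.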

10

Determine the optimal block size for a miss latency of 24+B cycles

5.5.5

Again, we calculate the miss component of the AMAT (average memory access time), miss rate × miss latency, to determine the optimal block size for a miss latency of 24+B cycles.

AMAT for B = 8: 0.040 × (24 + 8) = 1.28

AMAT for B = 16: 0.030 × (24 + 16) = 1.20

AMAT for B = 32: 0.020 × (24 + 32) = 1.12

AMAT for B = 64: 0.015 × (24 + 64) = 1.32

AMAT for B = 128: 0.010 × (24 + 128) = 1.52

B = 32 is the optimal block size for a miss latency of 24+B cycles.
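The same ranking can be expressed as CPI, using the 1-CPI machine and 1.35 references per instruction given in the exercise. This is a sketch that assumes the base CPI of 1 already covers cache hits, so memory stalls are simply added on top; the names below are illustrative.

```python
MISS_RATES = {8: 0.040, 16: 0.030, 32: 0.020, 64: 0.015, 128: 0.010}
BASE_CPI = 1.0          # from the exercise: a 1-CPI machine
REFS_PER_INSTR = 1.35   # instruction plus data references per instruction

def cpi(block_bytes, miss_latency):
    """Base CPI plus memory stall cycles per instruction (assumed additive)."""
    stall = REFS_PER_INSTR * MISS_RATES[block_bytes] * miss_latency(block_bytes)
    return BASE_CPI + stall

cpis = {b: cpi(b, lambda size: 24 + size) for b in MISS_RATES}
best_block = min(cpis, key=cpis.get)

for block_bytes, value in cpis.items():
    print(f"B = {block_bytes:3d}: CPI = {value:.3f}")
print("optimal block size:", best_block)
```

Scaling by 1.35 and adding 1 does not change the ordering, so the CPI minimum falls at the same block size as the miss term.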

11

Determine the optimal block size for constant miss latency

B = 128 is optimal, because with a constant miss latency, minimizing the miss rate minimizes the total miss latency, and B = 128 has the lowest miss rate (1%).
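The argument can be written out in one line. The symbols here are notation introduced for illustration, not taken from the exercise: t_hit is the hit time, m(B) the miss rate for block size B, and L the constant miss latency.

```latex
\[
\mathrm{AMAT}(B) = t_{\mathrm{hit}} + m(B)\,L,
\qquad L > 0 \text{ constant}
\;\Longrightarrow\;
\arg\min_{B} \mathrm{AMAT}(B) = \arg\min_{B} m(B)
\]
\[
m(8) = 4\%,\quad m(16) = 3\%,\quad m(32) = 2\%,\quad m(64) = 1.5\%,\quad m(128) = 1\%
\;\Longrightarrow\; B = 128
\]
```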

Most popular questions from this chapter

To support multiple virtual machines, two levels of memory virtualization are needed. Each virtual machine still controls the mapping of virtual address (VA) to physical address (PA), while the hypervisor maps the physical address (PA) of each virtual machine to the actual machine address (MA). To accelerate such mappings, a software approach called “shadow paging” duplicates each virtual machine’s page tables in the hypervisor, and intercepts VA to PA mapping changes to keep both copies consistent. To remove the complexity of shadow page tables, a hardware approach called nested page table (NPT) explicitly supports two classes of page tables (VA⇒PA and PA⇒MA) and can walk such tables purely in hardware.

Consider the following sequence of operations: (1) Create process; (2) TLB miss; (3) page fault; (4) context switch;

(5.14.1) What would happen for the given operation sequence for shadow page table and nested page table, respectively?

(5.14.2) Assuming an x86-based 4-level page table in both guest and nested page table, how many memory references are needed to service a TLB miss for native vs. nested page table?

(5.14.3) Among TLB miss rate, TLB miss latency, page fault rate, and page fault latency, which metrics are more important for shadow page table? Which are important for nested page table?

Assume the following parameters for a shadow paging system

| TLB Misses per 1000 Instructions | NPT TLB Miss Latency | Page Faults per 1000 Instructions | Shadowing Page Fault Overhead |
| --- | --- | --- | --- |
| 0.2 | 200 cycles | 0.001 | 30,000 cycles |

(5.14.4) For a benchmark with native execution CPI of 1, what are the CPI numbers if using shadow page tables vs. NPT (assuming only page table virtualization overhead)?

(5.14.5) What techniques can be used to reduce page table shadowing induced overhead?

(5.14.6) What techniques can be used to reduce NPT induced overhead?

In this exercise, we will look at the different ways capacity affects overall performance. In general, cache access time is proportional to capacity. Assume that main memory accesses take 70 ns and that memory accesses are 36% of all instructions. The following table shows data for L1 caches attached to each of two processors, P1 and P2.

|  | L1 Size | L1 Miss Rate | L1 Hit Time |
| --- | --- | --- | --- |
| P1 | 2 KiB | 8.0% | 0.66 ns |
| P2 | 4 KiB | 6.0% | 0.90 ns |

(5.6.1) Assuming that the L1 hit time determines the cycle times for P1 and P2, what are their respective clock rates?

(5.6.2) What is the Average Memory Access Time for P1 and P2?

(5.6.3) Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 and P2? Which processor is faster?

For the next three problems, we will consider the addition of an L2 cache to P1 to presumably make up for its limited L1 cache capacity. Use the L1 cache capacities and hit times from the previous table when solving these problems. The L2 miss rate indicated is its local miss rate.

| L2 Size | L2 Miss Rate | L2 Hit Time |
| --- | --- | --- |
| 1 MiB | 95% | 5.62 ns |

(5.6.4) What is the AMAT for P1 with the addition of an L2 cache? Is the AMAT better or worse with the L2 cache?

(5.6.5) Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 with the addition of an L2 cache?

(5.6.6) Which processor is faster, now that P1 has an L2 cache? If P1 is faster, what miss rate would P2 need in its L1 cache to match P1’s performance? If P2 is faster, what miss rate would P1 need in its L1 cache to match P2’s performance?

In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows:

struct entry {
    int srcIP;            // remote IP address
    char URL[128];        // request URL (e.g., “GET index.html”)
    long long refTime;    // reference time
    int status;           // connection status
    char browser[64];     // client browser name
} log [NUM_ENTRIES];

Assume the following processing function for the log:

topK_sourceIP (int hour);

5.19.1 Which fields in a log entry will be accessed for the given log processing function? Assuming 64-byte cache blocks and no prefetching, how many cache misses per entry does the given function incur on average?

5.19.2 How can you reorganize the data structure to improve cache utilization and access locality? Show your structure definition code.

5.19.3 Give an example of another log processing function that would prefer a different data structure layout. If both functions are important, how would you rewrite the program to improve the overall performance? Supplement the discussion with code snippets and data.

For the problems below, use data from “Cache performance for SPEC CPU2000 Benchmarks”(http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the pairs of benchmarks shown in the following table.

a. Mesa/gcc

b. mcf/swim

5.19.4 For 64KiB data caches with varying set associativities, what are the miss rates broken down by miss types (cold, capacity, and conflict misses) for each benchmark?

5.19.5 Select the set associativity to be used by a 64KiB L1 data cache shared by both benchmarks. If the L1 cache has to be direct-mapped, select the set associativity for the 1 MiB cache.

5.19.6 Give an example in the miss rate table where higher set associativity increases the miss rate. Construct a cache configuration and reference stream to demonstrate this.

Question: For a high-performance system such as a B-tree index for a database, the page size is determined mainly by the data size and disk performance. Assume that on average a B-tree index page is 70% full with fixed-size entries. The utility of a page is its B-tree depth, calculated as log2(entries). The following table shows that for 16-byte entries and a 10-year-old disk with a 10 ms latency and 10 MB/s transfer rate, the optimal page size is 16K.

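| Page Size (KiB) | Page Utility or B-Tree Depth (Number of Disk Accesses Saved) | Index Page Access Cost (ms) | Utility/Cost |
| --- | --- | --- | --- |
| 2 | 6.49 (or log2(2048/16 × 0.7)) | 10.2 | 0.64 |
| 4 | 7.49 | 10.4 | 0.72 |
| 8 | 8.49 | 10.8 | 0.79 |
| 16 | 9.49 | 11.6 | 0.82 |
| 32 | 10.49 | 13.2 | 0.79 |
| 64 | 11.49 | 16.4 | 0.70 |
| 128 | 12.49 | 22.8 | 0.55 |
| 256 | 13.49 | 35.6 | 0.38 |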

(5.10.1) What is the best page size if entries now become 128 bytes?

(5.10.2) Based on 5.10.1, what is the best page size if pages are half full?

(5.10.3) Based on 5.10.2, what is the best page size if using a modern disk with a 3 ms latency and 100 MB/s transfer rate? Explain why future servers are likely to have larger pages.

Keeping “frequently used” (or “hot”) pages in DRAM can save disk accesses, but how do we determine the exact meaning of “frequently used” for a given system? Data engineers use the cost ratio between DRAM and disk access to quantify the reuse time threshold for hot pages. The cost of a disk access is $Disk/accesses_per_sec, while the cost to keep a page in DRAM is $DRAM_MiB/page_size. The typical DRAM and disk costs and typical database page sizes at several time points are listed below:

| Year | DRAM Cost ($/MiB) | Page Size (KiB) | Disk Cost ($/disk) | Disk Access Rate (access/sec) |
| --- | --- | --- | --- | --- |
| 1987 | 5000 | 1 | 15,000 | 15 |
| 1997 | 15 | 8 | 2000 | 64 |
| 2007 | 0.05 | 64 | 80 | 83 |

(5.10.4) What are the reuse time thresholds for these three technology generations?

(5.10.5) What are the reuse time thresholds if we keep using the same 4K page size? What’s the trend here?

(5.10.6) What other factors can be changed to keep using the same page size (thus avoiding software rewrite)? Discuss their likeliness with current technology and cost trends.

This exercise examines the impact of different cache designs, specifically comparing associative caches to the direct-mapped caches from Section 5.4. For these exercises, refer to the address stream shown in Exercise 5.2.

(5.7.1) Using the sequence of references from Exercise 5.2, show the final cache contents for a three-way set associative cache with two-word blocks and a total size of 24 words. Use LRU replacement. For each reference identify the index bits, the tag bits, the block offset bits, and if it is a hit or a miss.

(5.7.2) Using the references from Exercise 5.2, show the final cache contents for a fully associative cache with one-word blocks and a total size of 8 words. Use LRU replacement. For each reference identify the index bits, the tag bits, and if it is a hit or a miss.

(5.7.3) Using the references from Exercise 5.2, what is the miss rate for a fully associative cache with two-word blocks and a total size of 8 words, using LRU replacement? What is the miss rate using MRU (most recently used) replacement? Finally what is the best possible miss rate for this cache, given any replacement policy?

Multilevel caching is an important technique to overcome the limited amount of space that a first level cache can provide while still maintaining its speed. Consider a processor with the following parameters:

| Parameter | Value |
| --- | --- |
| Base CPI, No Memory Stalls | 1.5 |
| Processor Speed | 2 GHz |
| Main Memory Access Time | 100 ns |
| First Level Cache Miss Rate per Instruction | 7% |
| Second Level Cache, Direct-Mapped Speed | 12 cycles |
| Global Miss Rate with Second Level Cache, Direct-Mapped | 3.5% |
| Second Level Cache, Eight-Way Set Associative Speed | 28 cycles |
| Global Miss Rate with Second Level Cache, Eight-Way Set Associative | 1.5% |

(5.7.4) Calculate the CPI for the processor in the table using: 1) only a first level cache, 2) a second level direct-mapped cache, and 3) a second level eight-way set associative cache. How do these numbers change if main memory access time is doubled? If it is cut in half?

(5.7.5) It is possible to have an even greater cache hierarchy than two levels. Given the processor above with a second level, direct-mapped cache, a designer wants to add a third level cache that takes 50 cycles to access and will reduce the global miss rate to 1.3%. Would this provide better performance? In general, what are the advantages and disadvantages of adding a third level cache?

(5.7.6) In older processors such as the Intel Pentium or Alpha 21264, the second level of cache was external (located on a different chip) from the main processor and the first level cache. While this allowed for large second level caches, the latency to access the cache was much higher, and the bandwidth was typically lower because the second level cache ran at a lower frequency. Assume a 512 KiB off-chip second level cache has a global miss rate of 4%. If each additional 512 KiB of cache lowered global miss rates by 0.7%, and the cache had a total access time of 50 cycles, how big would the cache have to be to match the performance of the second level direct-mapped cache listed above? Of the eight-way set associative cache?
