Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for benchmarks with private vs shared L2 cache designs. Assume L1 cache misses once every 32 instructions.

| Benchmark | Private | Shared |
| --- | --- | --- |
| Benchmark A misses-per-instruction | 0.30% | 0.12% |
| Benchmark B misses-per-instruction | 0.06% | 0.03% |

Assume the following hit latencies (in cycles):

| Private Cache | Shared Cache | Memory |
| --- | --- | --- |
| 5 | 20 | 180 |

5.18.1 Which cache design is better for each of these benchmarks? Use data to support your conclusion.

5.18.2 Shared cache latency increases with the CMP size. Choose the best design if the shared cache latency doubles. Off-chip bandwidth becomes the bottleneck as the number of CMP cores increases. Choose the best design if off-chip memory latency doubles.

5.18.3 Discuss the pros and cons of shared vs. private L2 caches for single-threaded, multi-threaded, and multiprogrammed workloads, and reconsider the trade-offs if an on-chip L3 cache is added.

5.18.4 Assume both benchmarks have a base CPI of 1 (with an ideal L2 cache). If having a non-blocking cache improves the average number of concurrent L2 misses from 1 to 2, how much performance improvement does this provide over a shared L2 cache? How much improvement can be achieved over a private L2?

5.18.5 Assume new generations of processors double the number of cores every 18 months. To maintain the same level of per-core performance, how much more off-chip memory bandwidth is needed for a processor released in three years?

5.18.6 Consider the entire memory hierarchy. What kinds of optimizations can improve the number of concurrent misses?

Short Answer


5.18.1 Private cache is best for both of these benchmarks.

5.18.2 The private cache remains the best design both if the shared cache latency doubles and if the off-chip memory latency doubles.

5.18.3

| Workload | Shared L2 | Private L2 |
| --- | --- | --- |
| Single-threaded | No particular advantage or disadvantage | No particular advantage or disadvantage |
| Multi-threaded | Advantage: performs better when threads are tightly coupled and share data frequently; no notable disadvantage | Advantage: prevents contamination and conflict misses between threads |
| Multiprogrammed | Processes rarely communicate, so the disadvantage of higher cache latency dominates | Works well, particularly if the OS keeps each process on the same core |

5.18.4 In both cases, a non-blocking L2 cache would reduce the effective miss latency by overlapping misses, improving performance.

5.18.5 4 times.

5.18.6 Additional bandwidth, dynamic memory schedulers, higher cache associativity, and additional cache levels.

Step by step solution

01

Determine the performance formula and define the non-blocking cache.

Write the formula to find the performance.

AMAT = (L1 misses per instruction) × (L2 hit latency) + (L2 misses per instruction) × (memory latency) … (1)

Since an L1 miss occurs once every 32 instructions, the L1 misses-per-instruction term is 1/32.

A non-blocking cache is one that the processor can continue to reference while the cache is still handling an earlier miss.
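For concreteness, here is a minimal Python sketch of equation (1); the helper name and parameter names are our own, not part of the original solution. It is reused to check the arithmetic in the steps below.

```python
def amat(l1_mpi, l2_hit_latency, l2_mpi, mem_latency):
    """Per-instruction average memory access time, per equation (1).

    l1_mpi: L1 misses per instruction (1/32 in this problem)
    l2_hit_latency: L2 hit latency in cycles
    l2_mpi: L2 misses per instruction
    mem_latency: off-chip memory latency in cycles
    """
    return l1_mpi * l2_hit_latency + l2_mpi * mem_latency
```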

02

Determine which cache design is better for each of these benchmarks.

5.18.1 To know which cache design is better, the performance of both designs must be calculated using the given data.

| Benchmark | Private | Shared |
| --- | --- | --- |
| Benchmark A misses-per-instruction | 0.30% | 0.12% |
| Benchmark B misses-per-instruction | 0.06% | 0.03% |

Assume the following hit latencies (in cycles):

| Private Cache | Shared Cache | Memory |
| --- | --- | --- |
| 5 | 20 | 180 |

Benchmark A:

AMAT_private = (1/32) × 5 + 0.0030 × 180 = 0.70
AMAT_shared = (1/32) × 20 + 0.0012 × 180 = 0.84

Benchmark B:

AMAT_private = (1/32) × 5 + 0.0006 × 180 = 0.26
AMAT_shared = (1/32) × 20 + 0.0003 × 180 = 0.68

For both benchmarks, the private cache is superior.
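As a quick check of the four values above, using the amat() helper sketched in step 01 (rounding to two decimals as the solution does):

```python
# Baseline latencies: L2 hit = 5 (private) or 20 (shared); memory = 180.
print(round(amat(1 / 32, 5, 0.0030, 180), 2))   # Benchmark A, private: 0.70
print(round(amat(1 / 32, 20, 0.0012, 180), 2))  # Benchmark A, shared:  0.84
print(round(amat(1 / 32, 5, 0.0006, 180), 2))   # Benchmark B, private: 0.26
print(round(amat(1 / 32, 20, 0.0003, 180), 2))  # Benchmark B, shared:  0.68
```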

03

Determine the best design if the shared cache latency doubles and if off-chip memory latency doubles.

5.18.2 The doubled shared cache latency (20 → 40 cycles) is applied to the shared design, and the doubled off-chip memory latency (180 → 360 cycles) is applied to the private design.

Benchmark A:

AMAT_private = (1/32) × 5 + 0.0030 × 360 = 1.24
AMAT_shared = (1/32) × 40 + 0.0012 × 180 = 1.47

Benchmark B:

AMAT_private = (1/32) × 5 + 0.0006 × 360 = 0.37
AMAT_shared = (1/32) × 40 + 0.0003 × 180 = 1.30

Comparing both, the private design is again superior.
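The doubled-latency cases check out the same way with the amat() helper from step 01:

```python
# Shared design: L2 hit latency 20 -> 40; private design: memory 180 -> 360.
print(round(amat(1 / 32, 5, 0.0030, 360), 2))   # Benchmark A, private: 1.24
print(round(amat(1 / 32, 40, 0.0012, 180), 2))  # Benchmark A, shared:  1.47
print(round(amat(1 / 32, 5, 0.0006, 360), 2))   # Benchmark B, private: 0.37
print(round(amat(1 / 32, 40, 0.0003, 180), 2))  # Benchmark B, shared:  1.30
```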

04

Determine the pros and cons of shared vs. private L2 caches.

5.18.3

| Workload | Shared L2 | Private L2 |
| --- | --- | --- |
| Single-threaded | No particular advantage or disadvantage | No particular advantage or disadvantage |
| Multi-threaded | Advantage: performs better when threads are tightly coupled and share data frequently; no notable disadvantage | Advantage: prevents contamination and conflict misses between threads |
| Multiprogrammed | Processes rarely communicate, so the disadvantage of higher cache latency dominates | Works well, particularly if the OS keeps each process on the same core |

05

Determine the performance improvement

5.18.4

A shared non-blocking L2 cache reduces L2 latency by allowing hits for one CPU to be serviced while a miss is serviced for another CPU, or by allowing misses from both CPUs to be serviced simultaneously.

A private non-blocking L2 likewise reduces latency, assuming that multiple memory instructions can be executed concurrently.
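One way to put rough numbers on this, under a simplifying assumption of ours (not the book's exact model): if the average number of concurrent L2 misses doubles from 1 to 2, the memory-latency component of the per-instruction stall is effectively halved.

```python
def cpi(l2_hit_latency, l2_mpi, mem_latency, concurrent_misses):
    # Base CPI of 1 (ideal L2) plus per-instruction memory stalls; the
    # assumed overlap of misses divides the memory-latency term.
    return 1.0 + (1 / 32) * l2_hit_latency + l2_mpi * mem_latency / concurrent_misses

for name, hit, mpi in [("A shared", 20, 0.0012), ("A private", 5, 0.0030),
                       ("B shared", 20, 0.0003), ("B private", 5, 0.0006)]:
    speedup = cpi(hit, mpi, 180, 1) / cpi(hit, mpi, 180, 2)
    print(f"{name}: speedup = {speedup:.3f}")
```

Under this model the private design, whose stall time is dominated by off-chip memory latency, benefits more from the non-blocking cache than the shared design does.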

06

Determine how much more off-chip memory bandwidth is needed

5.18.5

Given that new generations of processors double the number of cores every 18 months, a processor released in three years will have gone through two doublings and have four times as many cores. To maintain the same level of per-core performance, it therefore needs four times more off-chip memory bandwidth.
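The factor of four is just two doublings:

```python
# Cores double every 18 months; 36 months = 2 doublings, so total
# off-chip bandwidth must scale by 2**2 = 4 to hold per-core bandwidth.
print(2 ** (36 / 18))  # 4.0
```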

07

Determine optimizations that can improve the number of concurrent misses.

To improve the number of concurrent misses, the following optimizations can be applied across the memory hierarchy:

  • Additional DRAM bandwidth
  • Multi-banked memory systems
  • Dynamic memory schedulers
  • Higher cache associativity
  • Additional levels of cache.


Most popular questions from this chapter

Recall that we have two write policies and write allocation policies, and their combinations can be implemented either in L1 or L2 cache. Assume the following choices for L1 and L2 caches:

| L1 | L2 |
| --- | --- |
| Write through, non-write allocate | Write back, write allocate |

5.4.1 Buffers are employed between different levels of memory hierarchy to reduce access latency. For this given configuration, list the possible buffers needed between L1 and L2 caches, as well as L2 cache and memory.

5.4.2 Describe the procedure of handling an L1 write miss, considering the components involved and the possibility of replacing a dirty block.

5.4.3 For a multilevel exclusive cache configuration (a block can reside in only one of the L1 and L2 caches), describe the procedure of handling an L1 write miss, considering the components involved and the possibility of replacing a dirty block.

Consider the following program and cache behaviors.

| Data Reads per 1000 Instructions | Data Writes per 1000 Instructions | Instruction Cache Miss Rate | Data Cache Miss Rate | Block Size (bytes) |
| --- | --- | --- | --- | --- |
| 250 | 100 | 0.30% | 2% | 64 |

5.4.4 For a write-through, write-allocate cache, what are the minimum read and write bandwidths (measured in bytes per cycle) needed to achieve a CPI of 2?

5.4.5 For a write-back, write-allocate cache, assuming 30% of replaced data cache blocks are dirty, what are the minimal read and write bandwidths needed for a CPI of 2?

5.4.6 What are the minimal bandwidths needed to achieve the performance of CPI=1.5?


For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache.

| Tag | Index | Offset |
| --- | --- | --- |
| 31-10 | 9-5 | 4-0 |

5.3.1 What is the cache block size (in words)?

5.3.2 How many entries does the cache have?

5.3.3 What is the ratio between total bits required for such a cache implementation over the data storage bits?
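A small Python sketch of the arithmetic that the address breakdown above implies (the single valid bit per entry is our assumption for the overhead count):

```python
offset_bits, index_bits, tag_bits = 5, 5, 22   # from bits 4-0, 9-5, 31-10
data_bits = 8 * 2 ** offset_bits               # 32-byte block = 256 data bits
entries = 2 ** index_bits                      # direct-mapped cache entries
print(2 ** offset_bits // 4)                   # block size in words: 8
print(entries)                                 # 32 entries
overhead_bits = tag_bits + 1                   # tag + valid bit per entry
print((data_bits + overhead_bits) / data_bits) # total/data ratio ~ 1.09
```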

Starting from power on, the following byte-addressed cache references are recorded.

Address: 0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180

5.3.4 How many blocks are replaced?

5.3.5 What is the hit ratio?

5.3.6 List the final state of the cache, with each valid entry represented as a record of <index, tag, data>

Media applications that play audio or video files are part of a class of workloads called “streaming” workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KiB working set sequentially with the following address stream:

0, 2, 4, 6, 8, 10, 12, 14, 16, …

5.5.1 Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, based on the 3C model?

5.5.2 Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

5.5.3 “Prefetching” is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data is found in the prefetch buffer, it is considered as a hit and moved into the cache and the next cache block is prefetched. Assume a two-entry stream buffer and assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?
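For intuition on 5.5.1, here is a toy Python simulation of the sequential stream against the 64 KiB direct-mapped cache with 32-byte blocks (a sketch of ours, not part of the original exercise):

```python
block_bytes, cache_bytes = 32, 64 * 1024
num_sets = cache_bytes // block_bytes          # direct-mapped: 1 block per set
tags = [None] * num_sets
misses = accesses = 0
for addr in range(0, 512 * 1024, 2):           # 512 KiB working set, stride 2
    accesses += 1
    blk = addr // block_bytes
    idx = blk % num_sets
    if tags[idx] != blk:                       # miss: fill the block
        tags[idx] = blk
        misses += 1
print(misses / accesses)                       # 0.0625
```

Each 32-byte block serves 16 two-byte accesses before the stream moves on, hence one miss per 16 accesses regardless of cache size.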

Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of 1.35 references (both instruction and data) per instruction, help find the optimal block size given the following miss rates for various block sizes.

  • 8 bytes: 4%
  • 16 bytes: 3%
  • 32 bytes: 2%
  • 64 bytes: 1.5%
  • 128 bytes: 1%

5.5.4 What is the optimal block size for a miss latency of 20×B cycles?

5.5.5 What is the optimal block size for a miss latency of 24+B cycles?

5.5.6 For constant miss latency, what is the optimal block size?
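A sketch of the comparison these three sub-questions ask for, using the stall model stated above (1.35 references per instruction; function and variable names are ours):

```python
miss_rate = {8: 0.04, 16: 0.03, 32: 0.02, 64: 0.015, 128: 0.01}

def best_block(miss_latency):
    # Pick B minimizing stall cycles/instruction = 1.35 * MR(B) * latency(B).
    return min(miss_rate, key=lambda b: 1.35 * miss_rate[b] * miss_latency(b))

print(best_block(lambda b: 20 * b))  # latency 20*B: prints 8
print(best_block(lambda b: 24 + b))  # latency 24+B: prints 32
print(best_block(lambda b: 1))       # constant latency: prints 128
```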

As described in Section 5.7, virtual memory uses a page table to track the mapping of virtual addresses to physical addresses. This exercise shows how this table must be updated as addresses are accessed. The following data constitutes a stream of virtual addresses as seen on a system. Assume 4 KiB pages, a 4-entry fully associative TLB, and true LRU replacement. If pages must be brought in from disk, increment the next largest page number.

4669, 2227, 13916, 34587, 48870, 12608, 49225

TLB

| Valid | Tag | Physical Page Number |
| --- | --- | --- |
| 1 | 11 | 12 |
| 1 | 7 | 4 |
| 1 | 3 | 6 |
| 0 | 4 | 9 |

Page table

| Valid | Physical Page or in Disk |
| --- | --- |
| 1 | 5 |
| 0 | Disk |
| 0 | Disk |
| 1 | 6 |
| 1 | 9 |
| 1 | 11 |
| 0 | Disk |
| 1 | 4 |
| 0 | Disk |
| 0 | Disk |
| 1 | 3 |
| 1 | 12 |

(5.11.1) Given the address stream shown, and the initial TLB and page table states provided above, show the final state of the system. Also list for each reference if it is a hit in the TLB, a hit in the page table, or a page fault.

(5.11.2) Repeat 5.11.1, but this time use 16 KiB pages instead of 4 KiB pages. What would be some of the advantages of having a larger page size? What are some of the disadvantages?

(5.11.3) Show the final contents of the TLB if it is 2-way set associative. Also show the contents of the TLB if it is direct mapped. Discuss the importance of having a TLB to high performance. How would virtual memory accesses be handled if there were no TLB?

There are several parameters that impact the overall size of the page table. Listed below are key page parameters.

| Virtual Address Size | Page Size | Page Table Entry Size |
| --- | --- | --- |
| 32 bits | 8 KiB | 4 bytes |

(5.11.4) Given the parameters shown above, calculate the total page table size for a system running 5 applications that utilize half of the memory available.

(5.11.5) Given the parameters shown above, calculate the total page table size for a system running 5 applications that utilize half of the memory available, given a two-level page table approach with 256 entries. Assume each entry of the main page table is 6 bytes. Calculate the minimum amount of memory required.

(5.11.6) A cache designer wants to increase the size of a 4 KiB virtually indexed, physically tagged cache. Given the page size shown above, is it possible to make a 16 KiB direct-mapped cache, assuming 2 words per block? How would the designer increase the data size of the cache?
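For 5.11.4, the single-level arithmetic follows directly from the parameters above; a quick Python sketch (note that a flat page table's size is fixed by the virtual address space, independent of how much memory the applications actually use):

```python
va_bits, page_bytes, pte_bytes, apps = 32, 8 * 1024, 4, 5
entries = 2 ** va_bits // page_bytes   # 2**19 page-table entries per process
table_bytes = entries * pte_bytes      # one flat table per process
print(table_bytes / 2 ** 20)           # 2.0 MiB per process
print(apps * table_bytes / 2 ** 20)    # 10.0 MiB for 5 applications
```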
