Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for benchmarks with private vs shared L2 cache designs. Assume L1 cache misses once every 32 instructions.

| Benchmark | Private | Shared |
| --- | --- | --- |
| Benchmark A misses-per-instruction | 0.30% | 0.12% |
| Benchmark B misses-per-instruction | 0.06% | 0.03% |

Assume the following hit latencies (in cycles):

| Private Cache | Shared Cache | Memory |
| --- | --- | --- |
| 5 | 20 | 180 |

5.18.1 Which cache design is better for each of these benchmarks? Use data to support your conclusion.

5.18.2 Shared cache latency increases with the CMP size. Choose the best design if the shared cache latency doubles. Off-chip bandwidth becomes the bottleneck as the number of CMP cores increases. Choose the best design if off-chip memory latency doubles.

5.18.3 Discuss the pros and cons of shared vs. private L2 caches for single-threaded, multi-threaded, and multiprogrammed workloads, and reconsider the trade-offs if an on-chip L3 cache is added.

5.18.4 Assume both benchmarks have a base CPI of 1 (with an ideal L2 cache). If having a non-blocking cache improves the average number of concurrent L2 misses from 1 to 2, how much performance improvement does this provide over a shared L2 cache? How much improvement can be achieved over a private L2?

5.18.5 Assume new generations of processors double the number of cores every 18 months. To maintain the same level of per-core performance, how much more off-chip memory bandwidth is needed for a processor released in three years?

5.18.6 Consider the entire memory hierarchy. What kinds of optimizations can improve the number of concurrent misses?

Short Answer


5.18.1 Private cache is best for both of these benchmarks.

5.18.2 The private cache remains the best design both if the shared cache latency doubles and if the off-chip memory latency doubles.

5.18.3

| Workload | Shared L2 | Private L2 |
| --- | --- | --- |
| Single-threaded | No particular advantage or disadvantage | No particular advantage or disadvantage |
| Multi-threaded | Advantage: performs better when threads are tightly coupled and share data frequently; no notable disadvantage | Advantage: prevents contamination and conflict misses between threads |
| Multiprogrammed | Processes rarely communicate, so the disadvantage of higher cache latency dominates | Works well, particularly if the OS keeps each process on the same core |

5.18.4 In both cases, a non-blocking L2 cache would reduce the effective miss latency by overlapping misses, improving performance.

5.18.5 4 times.

5.18.6 Additional bandwidth, dynamic memory schedulers, higher cache associativity, and additional cache levels.

Step by step solution

01

Determine the performance formula and define the non-blocking cache.

Write the formula to find the performance.

AMAT = (L1 misses per instruction) × (L2 hit latency) + (L2 misses per instruction) × (memory latency) … (1)

Since an L1 miss occurs once every 32 instructions, the L1 misses-per-instruction term is 1/32.

A non-blocking cache is one that the processor can continue to reference while the cache is still handling an earlier miss.
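For concreteness, here is a minimal Python sketch of equation (1); the helper name and parameter names are our own, not part of the original solution. It is reused to check the arithmetic in the steps below.

```python
def amat(l1_mpi, l2_hit_latency, l2_mpi, mem_latency):
    """Per-instruction average memory access time, per equation (1).

    l1_mpi: L1 misses per instruction (1/32 in this problem)
    l2_hit_latency: L2 hit latency in cycles
    l2_mpi: L2 misses per instruction
    mem_latency: off-chip memory latency in cycles
    """
    return l1_mpi * l2_hit_latency + l2_mpi * mem_latency
```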

02

Determine which cache design is better for each of these benchmarks.

5.18.1 To know which cache design is better, the performance of both designs must be calculated using the given data.

| Benchmark | Private | Shared |
| --- | --- | --- |
| Benchmark A misses-per-instruction | 0.30% | 0.12% |
| Benchmark B misses-per-instruction | 0.06% | 0.03% |

Assume the following hit latencies (in cycles):

| Private Cache | Shared Cache | Memory |
| --- | --- | --- |
| 5 | 20 | 180 |

Benchmark A:

AMAT_private = (1/32) × 5 + 0.0030 × 180 = 0.70
AMAT_shared = (1/32) × 20 + 0.0012 × 180 = 0.84

Benchmark B:

AMAT_private = (1/32) × 5 + 0.0006 × 180 = 0.26
AMAT_shared = (1/32) × 20 + 0.0003 × 180 = 0.68

For both benchmarks, the private cache is superior.
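As a quick check of the four values above, using the amat() helper sketched in step 01 (rounding to two decimals as the solution does):

```python
# Baseline latencies: L2 hit = 5 (private) or 20 (shared); memory = 180.
print(round(amat(1 / 32, 5, 0.0030, 180), 2))   # Benchmark A, private: 0.70
print(round(amat(1 / 32, 20, 0.0012, 180), 2))  # Benchmark A, shared:  0.84
print(round(amat(1 / 32, 5, 0.0006, 180), 2))   # Benchmark B, private: 0.26
print(round(amat(1 / 32, 20, 0.0003, 180), 2))  # Benchmark B, shared:  0.68
```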

03

Determine the best design if the shared cache latency doubles and if off-chip memory latency doubles.

5.18.2 The doubled shared cache latency (20 → 40 cycles) is applied to the shared design, and the doubled off-chip memory latency (180 → 360 cycles) is applied to the private design.

Benchmark A:

AMAT_private = (1/32) × 5 + 0.0030 × 360 = 1.24
AMAT_shared = (1/32) × 40 + 0.0012 × 180 = 1.47

Benchmark B:

AMAT_private = (1/32) × 5 + 0.0006 × 360 = 0.37
AMAT_shared = (1/32) × 40 + 0.0003 × 180 = 1.30

Comparing both, the private design is again superior.
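The doubled-latency cases check out the same way with the amat() helper from step 01:

```python
# Shared design: L2 hit latency 20 -> 40; private design: memory 180 -> 360.
print(round(amat(1 / 32, 5, 0.0030, 360), 2))   # Benchmark A, private: 1.24
print(round(amat(1 / 32, 40, 0.0012, 180), 2))  # Benchmark A, shared:  1.47
print(round(amat(1 / 32, 5, 0.0006, 360), 2))   # Benchmark B, private: 0.37
print(round(amat(1 / 32, 40, 0.0003, 180), 2))  # Benchmark B, shared:  1.30
```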

04

Determine the pros and cons of shared vs. private L2 caches.

5.18.3

| Workload | Shared L2 | Private L2 |
| --- | --- | --- |
| Single-threaded | No particular advantage or disadvantage | No particular advantage or disadvantage |
| Multi-threaded | Advantage: performs better when threads are tightly coupled and share data frequently; no notable disadvantage | Advantage: prevents contamination and conflict misses between threads |
| Multiprogrammed | Processes rarely communicate, so the disadvantage of higher cache latency dominates | Works well, particularly if the OS keeps each process on the same core |

05

Determine the performance improvement

5.18.4

A shared non-blocking L2 cache reduces L2 latency by allowing hits for one CPU to be serviced while a miss is serviced for another CPU, or by allowing misses from both CPUs to be serviced simultaneously.

A private non-blocking L2 likewise reduces latency, assuming that multiple memory instructions can be executed concurrently.
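One way to put rough numbers on this, under a simplifying assumption of ours (not the book's exact model): if the average number of concurrent L2 misses doubles from 1 to 2, the memory-latency component of the per-instruction stall is effectively halved.

```python
def cpi(l2_hit_latency, l2_mpi, mem_latency, concurrent_misses):
    # Base CPI of 1 (ideal L2) plus per-instruction memory stalls; the
    # assumed overlap of misses divides the memory-latency term.
    return 1.0 + (1 / 32) * l2_hit_latency + l2_mpi * mem_latency / concurrent_misses

for name, hit, mpi in [("A shared", 20, 0.0012), ("A private", 5, 0.0030),
                       ("B shared", 20, 0.0003), ("B private", 5, 0.0006)]:
    speedup = cpi(hit, mpi, 180, 1) / cpi(hit, mpi, 180, 2)
    print(f"{name}: speedup = {speedup:.3f}")
```

Under this model the private design, whose stall time is dominated by off-chip memory latency, benefits more from the non-blocking cache than the shared design does.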

06

Determine how much more off-chip memory bandwidth is needed

5.18.5

Given that new generations of processors double the number of cores every 18 months, a processor released in three years will have gone through two doublings and have four times as many cores. To maintain the same level of per-core performance, it therefore needs four times more off-chip memory bandwidth.
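The factor of four is just two doublings:

```python
# Cores double every 18 months; 36 months = 2 doublings, so total
# off-chip bandwidth must scale by 2**2 = 4 to hold per-core bandwidth.
print(2 ** (36 / 18))  # 4.0
```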

07

Determine optimizations that can improve the number of concurrent misses.

To improve the number of concurrent misses, the following optimizations can be applied across the memory hierarchy:

  • Additional DRAM bandwidth
  • Multi-banked memory systems
  • Dynamic memory schedulers
  • Higher cache associativity
  • Additional levels of cache.


Most popular questions from this chapter

Recall that we have two write policies and write allocation policies, and their combinations can be implemented either in L1 or L2 cache. Assume the following choices for L1 and L2 caches:

| L1 | L2 |
| --- | --- |
| Write through, non-write allocate | Write back, write allocate |

5.4.1 Buffers are employed between different levels of memory hierarchy to reduce access latency. For this given configuration, list the possible buffers needed between L1 and L2 caches, as well as L2 cache and memory.

5.4.2 Describe the procedure of handling an L1 write miss, considering the components involved and the possibility of replacing a dirty block.

5.4.3 For a multilevel exclusive cache configuration (a block can reside in only one of the L1 and L2 caches), describe the procedure of handling an L1 write miss, considering the components involved and the possibility of replacing a dirty block.

Consider the following program and cache behaviors.

| Data Reads per 1000 Instructions | Data Writes per 1000 Instructions | Instruction Cache Miss Rate | Data Cache Miss Rate | Block Size (bytes) |
| --- | --- | --- | --- | --- |
| 250 | 100 | 0.30% | 2% | 64 |

5.4.4 For a write-through, write-allocate cache, what are the minimum read and write bandwidths (measured in bytes per cycle) needed to achieve a CPI of 2?

5.4.5 For a write-back, write-allocate cache, assuming 30% of replaced data cache blocks are dirty, what are the minimal read and write bandwidths needed for a CPI of 2?

5.4.6 What are the minimal bandwidths needed to achieve the performance of CPI=1.5?


For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache.

| Tag | Index | Offset |
| --- | --- | --- |
| 31-10 | 9-5 | 4-0 |

5.3.1 What is the cache block size (in words)?

5.3.2 How many entries does the cache have?

5.3.3 What is the ratio between total bits required for such a cache implementation over the data storage bits?
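A small Python sketch of the arithmetic that the address breakdown above implies (the single valid bit per entry is our assumption for the overhead count):

```python
offset_bits, index_bits, tag_bits = 5, 5, 22   # from bits 4-0, 9-5, 31-10
data_bits = 8 * 2 ** offset_bits               # 32-byte block = 256 data bits
entries = 2 ** index_bits                      # direct-mapped cache entries
print(2 ** offset_bits // 4)                   # block size in words: 8
print(entries)                                 # 32 entries
overhead_bits = tag_bits + 1                   # tag + valid bit per entry
print((data_bits + overhead_bits) / data_bits) # total/data ratio ~ 1.09
```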

Starting from power on, the following byte-addressed cache references are recorded.

Address: 0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180

5.3.4 How many blocks are replaced?

5.3.5 What is the hit ratio?

5.3.6 List the final state of the cache, with each valid entry represented as a record of <index, tag, data>

Media applications that play audio or video files are part of a class of workloads called “streaming” workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KiB working set sequentially with the following address stream:

0, 2, 4, 6, 8, 10, 12, 14, 16, …

5.5.1 Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, based on the 3C model?

5.5.2 Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

5.5.3 “Prefetching” is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data is found in the prefetch buffer, it is considered as a hit and moved into the cache and the next cache block is prefetched. Assume a two-entry stream buffer and assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?
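For intuition on 5.5.1, here is a toy Python simulation of the sequential stream against the 64 KiB direct-mapped cache with 32-byte blocks (a sketch of ours, not part of the original exercise):

```python
block_bytes, cache_bytes = 32, 64 * 1024
num_sets = cache_bytes // block_bytes          # direct-mapped: 1 block per set
tags = [None] * num_sets
misses = accesses = 0
for addr in range(0, 512 * 1024, 2):           # 512 KiB working set, stride 2
    accesses += 1
    blk = addr // block_bytes
    idx = blk % num_sets
    if tags[idx] != blk:                       # miss: fill the block
        tags[idx] = blk
        misses += 1
print(misses / accesses)                       # 0.0625
```

Each 32-byte block serves 16 two-byte accesses before the stream moves on, hence one miss per 16 accesses regardless of cache size.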

Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of 1.35 references (both instruction and data) per instruction, help find the optimal block size given the following miss rates for various block sizes.

  • 8 bytes: 4%
  • 16 bytes: 3%
  • 32 bytes: 2%
  • 64 bytes: 1.5%
  • 128 bytes: 1%

5.5.4 What is the optimal block size for a miss latency of 20×B cycles?

5.5.5 What is the optimal block size for a miss latency of 24+B cycles?

5.5.6 For constant miss latency, what is the optimal block size?
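A sketch of the comparison these three sub-questions ask for, using the stall model stated above (1.35 references per instruction; function and variable names are ours):

```python
miss_rate = {8: 0.04, 16: 0.03, 32: 0.02, 64: 0.015, 128: 0.01}

def best_block(miss_latency):
    # Pick B minimizing stall cycles/instruction = 1.35 * MR(B) * latency(B).
    return min(miss_rate, key=lambda b: 1.35 * miss_rate[b] * miss_latency(b))

print(best_block(lambda b: 20 * b))  # latency 20*B: prints 8
print(best_block(lambda b: 24 + b))  # latency 24+B: prints 32
print(best_block(lambda b: 1))       # constant latency: prints 128
```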

As described in Section 5.7, virtual memory uses a page table to track the mapping of virtual addresses to physical addresses. This exercise shows how this table must be updated as addresses are accessed. The following data constitutes a stream of virtual addresses as seen on a system. Assume 4 KiB pages, a 4-entry fully associative TLB, and true LRU replacement. If pages must be brought in from disk, increment the next largest page number.

4669, 2227, 13916, 34587, 48870, 12608, 49225

TLB

| Valid | Tag | Physical Page Number |
| --- | --- | --- |
| 1 | 11 | 12 |
| 1 | 7 | 4 |
| 1 | 3 | 6 |
| 0 | 4 | 9 |

Page table

| Valid | Physical Page or in Disk |
| --- | --- |
| 1 | 5 |
| 0 | Disk |
| 0 | Disk |
| 1 | 6 |
| 1 | 9 |
| 1 | 11 |
| 0 | Disk |
| 1 | 4 |
| 0 | Disk |
| 0 | Disk |
| 1 | 3 |
| 1 | 12 |

(5.11.1) Given the address stream shown, and the initial TLB and page table states provided above, show the final state of the system. Also list for each reference if it is a hit in the TLB, a hit in the page table, or a page fault.

(5.11.2) Repeat 5.11.1, but this time use 16 KiB pages instead of 4 KiB pages. What would be some of the advantages of having a larger page size? What are some of the disadvantages?

(5.11.3) Show the final contents of the TLB if it is 2-way set associative. Also show the contents of the TLB if it is direct mapped. Discuss the importance of having a TLB to high performance. How would virtual memory accesses be handled if there were no TLB?

There are several parameters that impact the overall size of the page table. Listed below are key page parameters.

| Virtual Address Size | Page Size | Page Table Entry Size |
| --- | --- | --- |
| 32 bits | 8 KiB | 4 bytes |

(5.11.4) Given the parameters shown above, calculate the total page table size for a system running 5 applications that utilize half of the memory available.

(5.11.5) Given the parameters shown above, calculate the total page table size for a system running 5 applications that utilize half of the memory available, given a two-level page table approach with 256 entries. Assume each entry of the main page table is 6 bytes. Calculate the minimum amount of memory required.

(5.11.6) A cache designer wants to increase the size of a 4 KiB virtually indexed, physically tagged cache. Given the page size shown above, is it possible to make a 16 KiB direct-mapped cache, assuming 2 words per block? How would the designer increase the data size of the cache?
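For 5.11.4, the single-level arithmetic follows directly from the parameters above; a quick Python sketch (note that a flat page table's size is fixed by the virtual address space, independent of how much memory the applications actually use):

```python
va_bits, page_bytes, pte_bytes, apps = 32, 8 * 1024, 4, 5
entries = 2 ** va_bits // page_bytes   # 2**19 page-table entries per process
table_bytes = entries * pte_bytes      # one flat table per process
print(table_bytes / 2 ** 20)           # 2.0 MiB per process
print(apps * table_bytes / 2 ** 20)    # 10.0 MiB for 5 applications
```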
