
In this exercise we look at the memory locality properties of matrix computation. The following code is written in C, where elements within the same row are stored contiguously. Assume each word is a 32-bit integer.

for (I = 0; I < 8; I++)
    for (J = 0; J < 8000; J++)
        A[I][J] = B[I][0] + A[J][I];

5.1.1 [5] How many 32-bit integers can be stored in a 16-byte cache block?

5.1.2 [5] References to which variables exhibit temporal locality?

5.1.3 [5] References to which variables exhibit spatial locality?

Locality is affected by both the reference order and data layout. The same computation can also be written below in Matlab, which differs from C by storing matrix elements within the same column contiguously in memory.

for I = 1:8
    for J = 1:8000
        A(I,J) = B(I,0) + A(J,I);
    end
end

5.1.4 [10] How many 16-byte cache blocks are needed to store all 32-bit matrix elements being referenced?

5.1.5 [5] References to which variables exhibit temporal locality?

5.1.6 [5] References to which variables exhibit spatial locality?

Short Answer

5.1.1 - Four 32-bit integers can be stored in a 16-byte cache block.

5.1.2 - The variables I, J, and B[I][0] exhibit temporal locality.

5.1.3 - References to A[I][J] exhibit spatial locality.

5.1.4 - 16,000 16-byte cache blocks are required.

5.1.5 - I, J, and B(I,0) are the variables that exhibit temporal locality.

5.1.6 - References to A(J,I) exhibit spatial locality, because MATLAB stores columns contiguously.

Step by step solution

(5.1.1) Step 1: Calculating the number of integers that can be stored in a 16-byte cache block.

A byte contains 8 bits, so a 16-byte cache block contains 16 × 8 = 128 bits. Each integer is 32 bits wide, so the number of integers per block is 128 / 32 = 4.

Therefore, four 32-bit integers can be stored in a 16-byte cache block.

(5.1.2) Step 2: Temporal locality.

Temporal locality is the tendency of recently accessed data to be accessed again soon; a processor exploits it by keeping recently used items in the cache. The loop indices I and J are read and updated on every iteration, and B[I][0] is re-read on each of the 8000 inner-loop iterations for a given I. References to I, J, and B[I][0] therefore exhibit temporal locality.

(5.1.3) Step 3: Spatial locality.

Spatial locality is the tendency for data at addresses near recently accessed data to be needed soon; this is why, when one word is accessed, the processor loads the entire cache block containing it. Because C stores rows contiguously and the inner loop varies J, successive references A[I][J], A[I][J+1], ... touch consecutive words, so references to A[I][J] exhibit spatial locality.

(5.1.4) Step 4: Calculating the number of 16-byte cache blocks needed to store all 32-bit matrix elements being referenced.

The matrix A has 8 × 8000 = 64,000 elements being referenced. Each element is a 32-bit integer, i.e. 4 bytes, so the elements occupy 64,000 × 4 = 256,000 bytes. A 16-byte block holds four such integers, so the number of blocks needed is 256,000 / 16 = 16,000.

Therefore, 16,000 16-byte cache blocks are needed to store all 32-bit matrix elements being referenced.

(5.1.5) Step 5: Temporal locality.

As in the C version, the indices I and J are accessed on every iteration, and B(I,0) is re-read on each inner-loop iteration for a given I, so references to I, J, and B(I,0) exhibit temporal locality.

(5.1.6) Step 6: Spatial locality.

MATLAB stores elements within the same column contiguously. With J varying in the inner loop, A(J,I) walks down a single column and touches consecutive words, so references to A(J,I) exhibit spatial locality. By contrast, A(I,J) walks across a row and strides through memory, and B(I,0) is a single element per value of I, so neither benefits from spatial locality.

Most popular questions from this chapter

This exercise examines the impact of different cache designs, specifically comparing associative caches to the direct-mapped caches from Section 5.4. For these exercises, refer to the address stream shown in Exercise 5.2.

(5.7.1) Using the sequence of references from Exercise 5.2, show the final cache contents for a three-way set associative cache with two-word blocks and a total size of 24 words. Use LRU replacement. For each reference identify the index bits, the tag bits, the block offset bits, and if it is a hit or a miss.

(5.7.2) Using the references from Exercise 5.2, show the final cache contents for a fully associative cache with one-word blocks and a total size of 8 words. Use LRU replacement. For each reference identify the index bits, the tag bits, and if it is a hit or a miss.

(5.7.3) Using the references from Exercise 5.2, what is the miss rate for a fully associative cache with two-word blocks and a total size of 8 words, using LRU replacement? What is the miss rate using MRU (most recently used) replacement? Finally what is the best possible miss rate for this cache, given any replacement policy?

Multilevel caching is an important technique to overcome the limited amount of space that a first level cache can provide while still maintaining its speed. Consider a processor with the following parameters:

Base CPI, No Memory Stalls: 1.5
Processor Speed: 2 GHz
Main Memory Access Time: 100 ns
First Level Cache Miss Rate per Instruction: 7%
Second Level Cache, Direct-Mapped, Speed: 12 cycles
Global Miss Rate with Second Level Cache, Direct-Mapped: 3.5%
Second Level Cache, Eight-Way Set Associative, Speed: 28 cycles
Global Miss Rate with Second Level Cache, Eight-Way Set Associative: 1.5%

(5.7.4) Calculate the CPI for the processor in the table using: 1) only a first level cache, 2) a second level direct-mapped cache, and 3) a second level eight-way set associative cache. How do these numbers change if main memory access time is doubled? If it is cut in half?

(5.7.5) It is possible to have a cache hierarchy deeper than two levels. Given the processor above with a second level, direct-mapped cache, a designer wants to add a third level cache that takes 50 cycles to access and will reduce the global miss rate to 1.3%. Would this provide better performance? In general, what are the advantages and disadvantages of adding a third level cache?

(5.7.6) In older processors such as the Intel Pentium or Alpha 21264, the second level of cache was external (located on a different chip) from the main processor and the first level cache. While this allowed for large second level caches, the latency to access the cache was much higher, and the bandwidth was typically lower because the second level cache ran at a lower frequency. Assume a 512 KiB off-chip second level cache has a global miss rate of 4%. If each additional 512 KiB of cache lowered global miss rates by 0.7%, and the cache had a total access time of 50 cycles, how big would the cache have to be to match the performance of the second level direct-mapped cache listed above? Of the eight-way set associative cache?

Question: In this exercise, we will examine space/time optimizations for page tables. The following list provides parameters of a virtual memory system.

Virtual Address (bits): 43
Physical DRAM Installed: 16 GiB
Page Size: 4 KiB
PTE Size (bytes): 4

(5.12.1) For a single-level page table, how many page table entries (PTEs) are needed? How much physical memory is needed for storing the page table?

(5.12.2) Using a multilevel page table can reduce the physical memory consumption of page tables, by keeping active PTEs in physical memory. How many levels of page tables will be needed in this case? And how many memory references are needed for address translation if missing in TLB?

(5.12.3) An inverted page table can be used to further optimize space and time. How many PTEs are needed to store the page table? Assuming a hash table implementation, what are the common case and worst case numbers of memory references needed for servicing a TLB miss?

The following table shows the contents of a 4-entry TLB.

Entry-ID   Valid   VA Page   Modified   Protection   PA Page
1          1       140       1          RW           30
2          0       40        0          RX           34
3          1       200       1          RO           32
4          1       280       0          RW           31

(5.12.4) Under what scenarios would entry 2’s valid bit be set to zero?

(5.12.5) What happens when an instruction writes to VA page 30? When would software managed TLB be faster than hardware managed TLB?

(5.12.6) What happens when an instruction writes to VA page 200?

In this exercise, we will examine how replacement policies impact miss rate. Assume a 2-way set associative cache with 4 blocks. To solve the problems in this exercise, you may find it helpful to draw a table like the one below, as demonstrated for the address sequence “0, 1, 2, 3, 4”.

Address of Memory    Hit or   Evicted   Contents of Cache Blocks After Reference
Block Accessed       Miss     Block     Set 0     Set 0     Set 1     Set 1

0                    Miss               Mem[0]
1                    Miss               Mem[0]              Mem[1]
2                    Miss               Mem[0]    Mem[2]    Mem[1]
3                    Miss               Mem[0]    Mem[2]    Mem[1]    Mem[3]
4                    Miss     0         Mem[4]    Mem[2]    Mem[1]    Mem[3]

Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0

(5.13.1) Assuming an LRU replacement policy, how many hits does this address sequence exhibit?

(5.13.2) Assuming an MRU (most recently used) replacement policy, how many hits does this address sequence exhibit?

(5.13.3) Simulate a random replacement policy by flipping a coin. For example, “heads” means to evict the first block in a set, and “tails” means to evict the second block in a set. How many hits does this address sequence exhibit?

(5.13.4) Which address should be evicted at each replacement to maximize the number of hits? How many hits does this address sequence exhibit if you follow this "optimal" policy?

(5.13.5) Describe why it is difficult to implement a cache replacement policy that is optimal for all address sequences.

(5.13.6) Assume you could make a decision upon each memory reference whether or not you want the requested address to be cached. What impact could this have on the miss rate?

One of the biggest impediments to widespread use of virtual machines is the performance overhead incurred by running a virtual machine. Listed below are various performance parameters and application behavior.

Base CPI: 1.5
Privileged O/S Accesses per 10,000 Instructions: 120
Performance Impact to Trap to the Guest O/S: 15 cycles
Performance Impact to Trap to VMM: 175 cycles
I/O Accesses per 10,000 Instructions: 30
I/O Access Time (Includes Time to Trap to Guest O/S): 1100 cycles

(5.15.1) Calculate the CPI for the system listed above assuming that there are no accesses to I/O. What is the CPI if the VMM performance impact doubles? If it is cut in half? If a virtual machine software company wishes to obtain a 10% performance degradation, what is the longest possible penalty to trap to the VMM?

(5.15.2) I/O accesses often have a large impact on overall system performance. Calculate the CPI of a machine using the performance characteristics above, assuming a non-virtualized system. Calculate the CPI again, this time using a virtualized system. How do these CPIs change if the system has half the I/O accesses? Explain why I/O bound applications have a smaller impact from virtualization.

(5.15.3) Compare and contrast the ideas of virtual memory and virtual machines. How do the goals of each compare? What are the pros and cons of each? List a few cases where virtual memory is desired, and a few cases where virtual machines are desired.

(5.15.4) Section 5.6 discusses virtualization under the assumption that the virtualized system is running the same ISA as the underlying hardware. However, one possible use of virtualization is to emulate non-native ISAs. An example of this is QEMU, which emulates a variety of ISAs such as MIPS, SPARC, and PowerPC. What are some of the difficulties involved in this kind of virtualization? Is it possible for an emulated system to run faster than on its native ISA?
