Question: In this exercise, we will examine space/time optimizations for page tables. The following table provides the parameters of a virtual memory system.

Virtual Address (bits): 43
Physical DRAM Installed: 16 GiB
Page Size: 4 KiB
PTE Size (bytes): 4

(5.12.1) For a single-level page table, how many page table entries (PTEs) are needed? How much physical memory is needed for storing the page table?

(5.12.2) Using a multilevel page table can reduce the physical memory consumption of page tables by keeping only active PTEs in physical memory. How many levels of page tables will be needed in this case? And how many memory references are needed for address translation on a TLB miss?

(5.12.3) An inverted page table can be used to further optimize space and time. How many PTEs are needed to store the page table? Assuming a hash table implementation, what are the common case and worst case numbers of memory references needed for servicing a TLB miss?

The following table shows the contents of a 4-entry TLB.

Entry-ID | Valid | VA Page | Modified | Protection | PA Page
1        | 1     | 140     | 1        | RW         | 30
2        | 0     | 40      | 0        | RX         | 34
3        | 1     | 200     | 1        | RO         | 32
4        | 1     | 280     | 0        | RW         | 31

(5.12.4) Under what scenarios would entry 2’s valid bit be set to zero?

(5.12.5) What happens when an instruction writes to VA page 30? When would a software-managed TLB be faster than a hardware-managed TLB?

(5.12.6) What happens when an instruction writes to VA page 200?

Short Answer

(5.12.1)

The page table has 2^31 entries, and the page table size is 8 GiB.

(5.12.2)

Four levels are needed. On a TLB miss, four memory references are needed for address translation.

(5.12.3)

The page table needs 2^22 entries. With a hash table implementation, the common case requires one memory reference, and the worst case requires a number of memory references equal to the number of page table entries (2^22).

(5.12.4)

Entry 2’s valid bit is set to zero when its translation is no longer valid, for example after a context switch to another process.

(5.12.5)

When an instruction writes to VA page 30, the access is a TLB miss. A software-managed TLB is faster than a hardware-managed TLB when the TLB miss handler is already resident in the processor’s cache.

(5.12.6)

When an instruction writes to VA page 200, a protection exception (trap) is raised, because the page is mapped read-only.

Step-by-step solution

01

Formula to calculate the number of page table entries (PTEs) and memory to store page table

(5.12.1)

The number of entries in the page table is calculated using the following formula:

Number of PTEs = (virtual address space) / (page size) ……… (1)

The memory required to store the page table is given as:

Page table size = (number of PTEs) × (PTE size) ……… (2)

02

Calculation of PTEs and page table size

The virtual address uses 43 bits, so the virtual address space is 2^43 bytes. The page size is 4 KiB, which is 2^12 bytes. So, by equation (1), the number of PTEs is:

2^43 / 2^12 = 2^31

Each PTE is 4 bytes, so by equation (2) the page table size is:

2^31 × 4 B = 2^33 B

The page table size is 8 GiB.
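
These two calculations can be sanity-checked with a short C program (a minimal sketch; the constants simply mirror the parameter table at the top of the exercise):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const int va_bits = 43;          /* virtual address width           */
    const int page_bits = 12;       /* 4 KiB page => 12 offset bits    */
    const uint64_t pte_bytes = 4;   /* size of one PTE                 */

    uint64_t num_ptes = 1ULL << (va_bits - page_bits);   /* 2^31 entries */
    uint64_t table_bytes = num_ptes * pte_bytes;         /* 2^33 bytes   */

    printf("PTEs: 2^%d = %llu\n", va_bits - page_bits,
           (unsigned long long)num_ptes);
    printf("page table: %llu bytes = %llu GiB\n",
           (unsigned long long)table_bytes,
           (unsigned long long)(table_bytes >> 30));
    return 0;
}

Running it prints 2^31 = 2147483648 entries and 8589934592 bytes (8 GiB), matching the result above.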

03

Formula to calculate the number of levels required in the multilevel page table

(5.12.2)

Multilevel page tables hold the page table in a tree-like structure. The required number of levels is calculated using the following formula:

Number of levels = ceil( (virtual page number bits) / (bits per level) ), where bits per level = log2(page size / PTE size)

04

Calculation of number of levels and memory references required in case of TLB miss

The virtual address is 43 bits and the page size is 4 KiB, so 12 bits form the page offset and the virtual page number is 43 − 12 = 31 bits. Each page table entry is 4 bytes, so the number of entries that fit on one page is:

4 KiB / 4 B = 2^12 / 2^2 = 2^10 = 1024

Therefore, each level of the tree resolves 10 bits of the virtual page number. The number of levels is:

ceil(31 / 10) = 4

If there is a TLB miss, the number of memory references equals the number of levels in the multilevel page table. The number of levels is four, so four memory references are needed.
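
The same ceiling division can be expressed in C (a sketch under the parameters above; the while loop just recovers log2 of the entries per page):

#include <stdio.h>

int main(void) {
    const int va_bits = 43, page_bits = 12, pte_bytes = 4;

    int vpn_bits = va_bits - page_bits;                   /* 31 VPN bits  */
    int entries_per_page = (1 << page_bits) / pte_bytes;  /* 1024 entries */

    int bits_per_level = 0;                               /* log2(1024) = 10 */
    while ((1 << bits_per_level) < entries_per_page)
        bits_per_level++;

    int levels = (vpn_bits + bits_per_level - 1) / bits_per_level;  /* ceil */
    printf("bits per level: %d, levels: %d\n", bits_per_level, levels);
    return 0;
}

This prints 10 bits per level and 4 levels.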

05

Calculation of PTEs needed to store the inverted page table

(5.12.3)

The number of entries in an inverted page table is given as:

Number of entries = (physical memory size) / (page size)

The physical memory is 16 GiB, which is 2^34 bytes, and the page size is 4 KiB, which is 2^12 bytes. The number of entries is therefore:

2^34 / 2^12 = 2^22

06

Number of memory references in hash table implementation

In the best (common) case there is no hash conflict, so only one memory reference is required to translate the address. In the worst case, every entry hashes to the same bucket and the whole chain must be walked, so the number of memory references equals the number of entries, 2^22. Thus, on a TLB miss, the worst case requires 2^22 memory references.
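
A hashed inverted page table lookup might look like the following C sketch (the structure and hash here are hypothetical illustrations, not the textbook's design; chaining resolves collisions, and it is the chain walk that degenerates in the worst case):

#include <stdint.h>

#define IPT_ENTRIES (1u << 22)   /* one entry per physical page frame */

struct ipt_entry {
    uint64_t vpn;      /* virtual page number held by this frame   */
    uint32_t pid;      /* owning process                           */
    int32_t  next;     /* index of next entry in hash chain, or -1 */
};

struct ipt_entry ipt[IPT_ENTRIES];
int32_t hash_anchor[IPT_ENTRIES];   /* bucket -> first chain index,
                                       initialized to -1 elsewhere  */

/* Returns the frame number for (pid, vpn), or -1 on a page fault.
 * Common case: the first chain entry matches (one memory reference).
 * Worst case: everything hashes to one bucket and the entire chain,
 * up to 2^22 entries, must be walked. */
int32_t ipt_lookup(uint32_t pid, uint64_t vpn) {
    uint32_t bucket = (uint32_t)((vpn ^ pid) % IPT_ENTRIES);  /* toy hash */
    for (int32_t i = hash_anchor[bucket]; i != -1; i = ipt[i].next)
        if (ipt[i].vpn == vpn && ipt[i].pid == pid)
            return i;            /* table index doubles as frame number */
    return -1;                   /* not resident: take a page fault     */
}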

07

Identify the meaning of the valid bit set to 0 in TLB

(5.12.4)

The TLB caches translations of virtual page numbers to physical page numbers. The valid bit indicates whether a TLB entry holds a usable translation: if the bit is 1, the translation is valid; otherwise it is invalid.

Entry 2’s valid bit would be 0 after a context switch to another process, or because the page has been removed from memory; either way, the page is no longer available at the recorded physical address.
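
For illustration, a TLB entry matching the columns of the table could be modeled in C like this (a hypothetical layout; real TLBs pack these fields into hardware registers):

#include <stdint.h>

enum prot { RO, RW, RX };   /* read-only, read-write, read-execute */

struct tlb_entry {
    uint8_t   valid;       /* 1 = usable translation; 0 after a context
                              switch or after the page is evicted     */
    uint32_t  va_page;     /* virtual page number                     */
    uint8_t   modified;    /* dirty bit                               */
    enum prot protection;  /* access rights                           */
    uint32_t  pa_page;     /* physical page number                    */
};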

08

Analyze the scenario of writing to VA page 30

(5.12.5)

The TLB holds translations of virtual pages to physical pages. A write to VA page 30 first searches the TLB for a translation of that page. The TLB contains no entry for VA page 30: its four entries map VA pages 140, 40, 200, and 280 (and the entry for page 40 is invalid in any case). So the access generates a TLB miss.

A software-managed TLB is faster than a hardware-managed TLB when the TLB miss handler code is already resident in the processor’s cache.

09

Analyze the scenario of writing to VA page 200

(5.12.6)

A write to VA page 200 searches the TLB for the physical address. The TLB contains an entry for VA page 200, and its valid bit is 1, so the translation is valid. However, the page’s protection is read-only (RO): there is no write permission, so the page cannot be written. The write therefore raises a protection exception and traps to the operating system.
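
Both behaviors fall out of one lookup routine, sketched below using the hypothetical struct tlb_entry from step 07: a write to VA page 30 matches no valid entry and misses, while a write to VA page 200 matches but fails the protection check.

enum tlb_result { TLB_HIT, TLB_MISS, TLB_PROT_FAULT };

enum tlb_result tlb_write_lookup(struct tlb_entry tlb[], int n,
                                 uint32_t vpn, uint32_t *ppn) {
    for (int i = 0; i < n; i++) {
        if (!tlb[i].valid || tlb[i].va_page != vpn)
            continue;                  /* skip invalid/non-matching entries */
        if (tlb[i].protection != RW)
            return TLB_PROT_FAULT;     /* e.g. write to read-only page 200  */
        tlb[i].modified = 1;           /* set dirty bit on a write hit      */
        *ppn = tlb[i].pa_page;
        return TLB_HIT;
    }
    return TLB_MISS;                   /* e.g. page 30: walk the page table */
}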

Most popular questions from this chapter

In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows:

struct entry {
    int srcIP;          // remote IP address
    char URL[128];      // request URL (e.g., "GET index.html")
    long long refTime;  // reference time
    int status;         // connection status
    char browser[64];   // client browser name
} log[NUM_ENTRIES];

Assume the following processing function for the log:

topK_sourceIP(int hour);

5.19.1 Which fields in a log entry will be accessed for the given log processing function? Assuming 64-byte cache blocks and no prefetching, how many cache misses per entry does the given function incur on average?

5.19.2 How can you reorganize the data structure to improve cache utilization and access locality? Show your structure definition code.

5.19.3 Give an example of another log processing function that would prefer a different data structure layout. If both functions are important, how would you rewrite the program to improve the overall performance? Supplement the discussion with code snippets and data.

For the problems below, use data from “Cache performance for SPEC CPU2000 Benchmarks” (http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the pairs of benchmarks shown in the following table.

a. Mesa/gcc
b. mcf/swim

5.19.4 For 64 KiB data caches with varying set associativities, what are the miss rates broken down by miss types (cold, capacity, and conflict misses) for each benchmark?

5.19.5 Select the set associativity to be used by a 64 KiB L1 data cache shared by both benchmarks. If the L1 cache has to be direct-mapped, select the set associativity for the 1 MiB cache.

5.19.6 Give an example in the miss rate table where higher set associativity increases the miss rate. Construct a cache configuration and reference stream to demonstrate this.

To support multiple virtual machines, two levels of memory virtualization are needed. Each virtual machine still controls the mapping of virtual address (VA) to physical address (PA), while the hypervisor maps the physical address (PA) of each virtual machine to the actual machine address (MA). To accelerate such mappings, a software approach called “shadow paging” duplicates each virtual machine’s page tables in the hypervisor, and intercepts VA-to-PA mapping changes to keep both copies consistent. To remove the complexity of shadow page tables, a hardware approach called nested page table (NPT) explicitly supports two classes of page tables (VA-to-PA and PA-to-MA) and can walk such tables purely in hardware.
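
Conceptually the two mappings compose, as in this simplified C sketch (the one-level arrays are hypothetical stand-ins for the real multi-level page table walks):

#include <stdint.h>

#define NPAGES 16
uint64_t guest_pt[NPAGES];   /* guest VA page -> guest PA page (guest's table) */
uint64_t host_pt[NPAGES];    /* guest PA page -> machine MA page (hypervisor)  */

/* With a nested page table, hardware performs both walks on a TLB miss.
 * Shadow paging instead maintains a merged VA -> MA table in the
 * hypervisor, so only one walk is needed at miss time. */
uint64_t translate_nested(uint64_t va_page) {
    uint64_t guest_pa_page = guest_pt[va_page];
    return host_pt[guest_pa_page];
}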

Consider the following sequence of operations: (1) create process; (2) TLB miss; (3) page fault; (4) context switch.

(5.14.1) What would happen for the given operation sequence for shadow page table and nested page table, respectively?

(5.14.2) Assuming an x86-based 4-level page table in both guest and nested page table, how many memory references are needed to service a TLB miss for native vs. nested page table?

(5.14.3) Among TLB miss rate, TLB miss latency, page fault rate, and page fault latency, which metrics are more important for shadow page table? Which are important for nested page table?

Assume the following parameters for a shadow paging system:

TLB Misses per 1000 instructions: 0.2
NPT TLB Miss Latency: 200 cycles
Page Faults per 1000 instructions: 0.001
Shadowing Page Fault Overhead: 30,000 cycles

(5.14.4) For a benchmark with native execution CPI of 1, what are the CPI numbers if using shadow page tables vs. NPT (assuming only page table virtualization overhead)?

(5.14.5) What techniques can be used to reduce page table shadowing induced overhead?

(5.14.6) What techniques can be used to reduce NPT induced overhead?

In this exercise, we will explore the control unit for a cache controller for a processor with a write buffer. Use the finite state machine found in Figure 5.40 as a starting point for designing your finite state machines. Assume that the cache controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in Section 5.9), but you will add a write buffer with a capacity of one block.

Recall that the purpose of a write buffer is to serve as temporary storage so that the processor doesn’t have to wait for two memory accesses on a dirty miss. Rather than writing back the dirty block before reading the new block, it buffers the dirty block and immediately begins reading the new block. The dirty block can then be written to the main memory while the processor is working.
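
The dirty-miss path the buffer enables can be sketched in C as follows (fake_memory and the block copies are toy stand-ins for the real datapath; the point is that the read is issued before the write-back completes):

#include <stdint.h>
#include <string.h>

static uint8_t fake_memory[1 << 20];   /* toy main memory */

struct write_buffer {
    int      full;
    uint32_t addr;
    uint8_t  data[64];                 /* one cache block */
};

/* Dirty miss with a one-block write buffer: park the victim, fetch the
 * missing block immediately, then drain the buffer afterward (in real
 * hardware, the drain overlaps with the processor's resumed execution).
 * Without the buffer, the write-back would have to finish first. */
void handle_dirty_miss(struct write_buffer *wb,
                       uint32_t victim_addr, const uint8_t *victim,
                       uint32_t miss_addr, uint8_t *incoming) {
    wb->full = 1;
    wb->addr = victim_addr;
    memcpy(wb->data, victim, sizeof wb->data);      /* buffer the victim */
    memcpy(incoming, &fake_memory[miss_addr], 64);  /* read new block    */
    memcpy(&fake_memory[wb->addr], wb->data, 64);   /* drain the buffer  */
    wb->full = 0;
}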

5.16.1 What should happen if the processor issues a request that hits in the cache while a block is being written back to main memory from the write buffer?

5.16.2 What should happen if the processor issues a request that misses in the cache while a block is being written back to main memory from the write buffer?

5.16.3 Design a finite state machine to enable the use of a write buffer.

In this exercise, we will look at the different ways capacity affects overall performance. In general, cache access time is proportional to capacity. Assume that main memory accesses take 70 ns and that memory accesses are 36% of all instructions. The following table shows data for L1 caches attached to each of two processors, P1 and P2.

   | L1 Size | L1 Miss Rate | L1 Hit Time
P1 | 2 KiB   | 8.0%         | 0.66 ns
P2 | 4 KiB   | 6.0%         | 0.90 ns

(5.6.1) Assuming that the L1 hit time determines the cycle times for P1 and P2, what are their respective clock rates?

(5.6.2) What is the Average Memory Access Time for P1 and P2?

(5.6.3) Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 and P2? Which processor is faster?

For the next three problems, we will consider the addition of an L2 cache to P1 to presumably make up for its limited L1 cache capacity. Use the L1 cache capacities and hit times from the previous table when solving these problems. The L2 miss rate indicated is its local miss rate.

L2 Size: 1 MiB
L2 Miss Rate: 95%
L2 Hit Time: 5.62 ns

(5.6.4) What is the AMAT for P1 with the addition of an L2 cache? Is the AMAT better or worse with the L2 cache?

(5.6.5) Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 with the addition of an L2 cache?

(5.6.6) Which processor is faster, now that P1 has an L2 cache? If P1 is faster, what miss rate would P2 need in its L1 cache to match P1’s performance? If P2 is faster, what miss rate would P1 need in its L1 cache to match P2’s performance?

Media applications that play audio or video files are part of a class of workloads called “streaming” workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KiB working set sequentially with the following address stream:

0, 2, 4, 6, 8, 10, 12, 14, 16, …

5.5.1 Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, based on the 3C model?

5.5.2 Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

5.5.3 “Prefetching” is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data is found in the prefetch buffer, it is considered as a hit and moved into the cache and the next cache block is prefetched. Assume a two-entry stream buffer and assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?
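
A toy C simulation of this stream buffer shows why the sequential stream essentially stops missing (an idealized model: prefetch always completes in time, so the buffer always stays two blocks ahead):

#include <stdio.h>

int main(void) {
    const long block = 32;            /* bytes per cache block */
    long misses = 0, total = 0;
    long ahead = -1;                  /* highest block already prefetched */

    for (long addr = 0; addr < 512 * 1024; addr += 2) {   /* 512 KiB stream */
        long blk = addr / block;
        total++;
        if (blk > ahead)
            misses++;                 /* neither cached nor prefetched */
        ahead = blk + 2;              /* buffer holds the next two blocks */
    }
    printf("misses: %ld of %ld accesses\n", misses, total);   /* 1 miss */
    return 0;
}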

Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of 1.35 references (both instruction and data) per instruction, help find the optimal block size given the following miss rates for various block sizes.

Block size (bytes): miss rate
8: 4%
16: 3%
32: 2%
64: 1.5%
128: 1%

5.5.4 What is the optimal block size for a miss latency of 20×B cycles?

5.5.5 What is the optimal block size for a miss latency of 24+B cycles?

5.5.6 For constant miss latency, what is the optimal block size?
