
In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows:

struct entry {
    int srcIP;          // remote IP address
    char URL[128];      // request URL (e.g., “GET index.html”)
    long long refTime;  // reference time
    int status;         // connection status
    char browser[64];   // client browser name
} log[NUM_ENTRIES];

Assume the following processing function for the log:

topK_sourceIP (int hour);

5.19.1 Which fields in a log entry will be accessed for the given log processing function? Assuming 64-byte cache blocks and no prefetching, how many cache misses per entry does the given function incur on average?

5.19.2 How can you reorganize the data structure to improve cache utilization and access locality? Show your structure definition code.

5.19.3 Give an example of another log processing function that would prefer a different data structure layout. If both functions are important, how would you rewrite the program to improve the overall performance? Supplement the discussion with code snippets and data.

For the problems below, use data from “Cache performance for SPEC CPU2000 Benchmarks” (http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the pairs of benchmarks shown in the following table.

a. Mesa/gcc
b. mcf/swim

5.19.4 For 64KiB data caches with varying set associativities, what are the miss rates broken down by miss types (cold, capacity, and conflict misses) for each benchmark?

5.19.5 Select the set associativity to be used by a 64 KiB L1 data cache shared by both benchmarks. If the L1 cache has to be directly mapped, select the set associativity for the 1 MiB L2 cache.

5.19.6 Give an example in the miss rate table where higher set associativity increases the miss rate. Construct a cache configuration and reference stream to demonstrate this.

Short Answer


5.19.1 srcIP and refTime fields. 2 misses per entry.

5.19.2 Group the srcIP and refTime fields into a separate array.

5.19.3 peak_hour (int status);

5.19.4 Cold (compulsory) misses and capacity misses are not affected by associativity; only the conflict misses change. The resulting miss rates vary with the benchmark and data set used.

5.19.5 The best set associativity varies with the benchmark pair and the data set used.

5.19.6 The “Cache performance for SPEC CPU2000” data has many such examples (e.g., apsi, mesa, ammp, mcf).

Cache: 4-block caches, direct-mapped vs. 2-way LRU.

Reference stream (block addresses): 0 2 4 0 2 4 (direct-mapped: 5 misses; 2-way LRU: 6 misses).

Step by step solution

01

Determine the code optimizations that improve log processing speed.

The processing functions scan every log entry, so their speed is bounded by how many cache blocks must be fetched per entry.

The key optimization is to lay out the data so the fields a function actually reads are packed together in memory; one 64-byte block then serves several entries instead of each entry spanning several blocks.

02

Determine which fields in a log entry will be accessed and how many cache misses occur.

5.19.1 The given log structure is as follows:

struct entry {
    int srcIP;          // remote IP address
    char URL[128];      // request URL (e.g., “GET index.html”)
    long long refTime;  // reference time
    int status;         // connection status
    char browser[64];   // client browser name
} log[NUM_ENTRIES];

topK_sourceIP(int hour) reads only the srcIP and refTime fields: refTime selects the entries that fall in the requested hour, and srcIP identifies the sources to be ranked.

With typical 8-byte alignment each entry occupies about 216 bytes, larger than a 64-byte cache block, and srcIP (offset 0) and refTime (offset 136) always fall in different blocks; nor is a block ever shared between the hot fields of consecutive entries. With no prefetching, the function therefore incurs 2 misses per entry on average.
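The exercise does not give the body of topK_sourceIP; the following is a minimal sketch of what it might do, with the hour_of() helper and the hash-indexed counter array introduced purely for illustration. The point is that the loop reads only srcIP and refTime:

#define COUNTERS (1 << 16)
static unsigned count[COUNTERS];  // toy per-source request counters

static int hour_of(long long refTime) {
    return (int)((refTime / 3600) % 24);  // assumes refTime counts seconds
}

void topK_sourceIP(int hour) {
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (hour_of(log[i].refTime) == hour)             // reads refTime
            count[(unsigned)log[i].srcIP % COUNTERS]++;  // reads srcIP
    }
    // ...then report the K sources with the largest counts (omitted)
}

Each iteration touches the two blocks holding that entry's srcIP and refTime and nothing else, which is where the 2 misses per entry come from.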

03

Determine how the data structure can be reorganized to improve cache utilization and access locality.

5.19.2 To improve cache utilization and access locality, group the srcIP and refTime fields into their own array, separate from the large, rarely used fields:

struct entry_hot {      // fields read by topK_sourceIP
    int srcIP;          // remote IP address
    long long refTime;  // reference time
};

struct entry_cold {     // fields the function never touches
    char URL[128];      // request URL (e.g., “GET index.html”)
    int status;         // connection status
    char browser[64];   // client browser name
};

struct entry_hot log_hot[NUM_ENTRIES];
struct entry_cold log_cold[NUM_ENTRIES];

Each entry_hot is 16 bytes (a 4-byte srcIP, 4 bytes of padding, and an 8-byte refTime), so four hot entries fit in one 64-byte block. A scan now misses roughly once per four entries (0.25 misses per entry) instead of twice per entry, and the URL and browser strings no longer pollute the cache.
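With the split layout the scan changes only its indexing; here is the same loop, under the same illustrative assumptions as above:

for (int i = 0; i < NUM_ENTRIES; i++) {
    if (hour_of(log_hot[i].refTime) == hour)             // reads refTime
        count[(unsigned)log_hot[i].srcIP % COUNTERS]++;  // reads srcIP
}
// Four consecutive log_hot entries share one 64-byte block,
// so this loop misses about once every four iterations.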

04

Determine how the code can be rewritten to improve overall performance.

5.19.3 An example of a log processing function that prefers a different layout is:

peak_hour(int status);

peak_hour returns the hour with the most entries that have the given connection status, so it reads the status and refTime fields and ignores srcIP. If both topK_sourceIP and peak_hour are important, group all three frequently read fields (srcIP, refTime, and status) into the hot array and keep only the large strings in the cold array:

struct entry_hot {      // fields read by both functions
    long long refTime;  // reference time
    int srcIP;          // remote IP address
    int status;         // connection status
};                      // 16 bytes: four entries per 64-byte block

struct entry_cold {
    char URL[128];      // request URL (e.g., “GET index.html”)
    char browser[64];   // client browser name
};

struct entry_hot log_hot[NUM_ENTRIES];
struct entry_cold log_cold[NUM_ENTRIES];

Placing the 8-byte refTime first avoids internal padding, keeping the struct at 16 bytes, so either function still misses only about once per four entries. A sketch of peak_hour follows.
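A minimal sketch of peak_hour over the combined hot array; the body is an illustration (the exercise only gives the prototype), reusing the assumed hour_of() helper:

int peak_hour(int status) {
    unsigned hits[24] = {0};  // matching entries per hour of day
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (log_hot[i].status == status)          // reads status
            hits[hour_of(log_hot[i].refTime)]++;  // reads refTime
    }
    int best = 0;
    for (int h = 1; h < 24; h++)  // pick the busiest hour
        if (hits[h] > hits[best]) best = h;
    return best;
}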

05

Determine the miss rate.

5.19.4

5.19.4 Compulsory (cold) misses are independent of the cache configuration, and capacity misses are, by definition, the extra misses of a fully associative cache of the same size; only conflict misses change with associativity.

The capacity miss rate is computed by subtracting the compulsory miss rate from the fully associative miss rate, and the conflict miss rate is then the total miss rate minus the cold and capacity miss rates.

The values are reported as miss rates per instruction, and they vary with the benchmark and data set used. The decomposition is sketched below.
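A small sketch of this 3C decomposition; the inputs would come from the referenced SPEC CPU2000 tables, and the values in main() are hypothetical placeholders:

#include <stdio.h>

// total:       miss rate of the configuration under study
// cold:        compulsory miss rate (misses of an infinite cache)
// fully_assoc: miss rate of a fully associative cache of the same size
static void decompose(double total, double cold, double fully_assoc) {
    double capacity = fully_assoc - cold;   // capacity = fully associative - cold
    double conflict = total - fully_assoc;  // conflict = total - cold - capacity
    printf("cold=%.4f  capacity=%.4f  conflict=%.4f\n", cold, capacity, conflict);
}

int main(void) {
    decompose(0.050, 0.002, 0.035);  // hypothetical miss rates, not SPEC data
    return 0;
}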

06

Determine set associativity.

5.19.5

5.19.5 The appropriate set associativity is the one that minimizes the combined miss rate of the two benchmarks in the referenced SPEC CPU2000 data; since that data depends on the benchmark pair and data set used, no single associativity can be named without it.

07

Determine the miss rate table, cache configuration, and reference stream.

5.19.6

The “Cache performance for SPEC CPU2000” data contains many such examples (apsi, mesa, ammp, and mcf all have configurations where higher associativity yields a higher miss rate); the full tables are too large to reproduce here.

A small configuration that demonstrates the effect:

Cache: 4 one-word blocks, direct-mapped vs. 2-way set associative with LRU replacement.

Reference stream (block addresses): 0, 2, 4, 0, 2, 4

Direct-mapped: blocks 0 and 4 conflict in set 0 while block 2 has set 2 to itself, so the second access to 2 hits and everything else misses: 5 misses. 2-way LRU: all three blocks are even and map to set 0, and three blocks cycling through a two-entry LRU set evict each other on every access: 6 misses. The simulator sketch below verifies both counts.
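A self-contained sketch (in C, an assumption since the exercise gives no code) that simulates both configurations on the stream above and prints the miss counts:

#include <stdio.h>

#define NUM_BLOCKS 4  // total blocks in each cache

// Direct-mapped: one block per set, set chosen by address modulo NUM_BLOCKS.
static int misses_direct(const int *refs, int n) {
    int tags[NUM_BLOCKS], valid[NUM_BLOCKS] = {0}, misses = 0;
    for (int i = 0; i < n; i++) {
        int set = refs[i] % NUM_BLOCKS;
        if (!valid[set] || tags[set] != refs[i]) {  // miss: fill the set
            misses++;
            valid[set] = 1;
            tags[set] = refs[i];
        }
    }
    return misses;
}

// 2-way set associative with LRU: NUM_BLOCKS/2 sets of two ways each.
static int misses_2way_lru(const int *refs, int n) {
    int tags[NUM_BLOCKS / 2][2];
    int valid[NUM_BLOCKS / 2][2] = {{0}};
    int lru[NUM_BLOCKS / 2] = {0};  // which way is least recently used
    int misses = 0;
    for (int i = 0; i < n; i++) {
        int set = refs[i] % (NUM_BLOCKS / 2), way = -1;
        for (int w = 0; w < 2; w++)
            if (valid[set][w] && tags[set][w] == refs[i]) way = w;  // hit
        if (way < 0) {       // miss: replace the LRU way
            misses++;
            way = lru[set];
            valid[set][way] = 1;
            tags[set][way] = refs[i];
        }
        lru[set] = 1 - way;  // the other way becomes LRU
    }
    return misses;
}

int main(void) {
    int refs[] = {0, 2, 4, 0, 2, 4};  // block-address reference stream
    int n = (int)(sizeof refs / sizeof refs[0]);
    printf("direct-mapped: %d misses\n", misses_direct(refs, n));    // prints 5
    printf("2-way LRU:     %d misses\n", misses_2way_lru(refs, n));  // prints 6
    return 0;
}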


Most popular questions from this chapter

One of the biggest impediments to widespread use of virtual machines is the performance overhead incurred by running a virtual machine. Listed below are various performance parameters and application behavior.

Base CPI: 1.5
Privileged O/S Accesses per 10,000 Instructions: 120
Performance Impact to Trap to the Guest O/S: 15 cycles
Performance Impact to Trap to VMM: 175 cycles
I/O Accesses per 10,000 Instructions: 30
I/O Access Time (Includes Time to Trap to Guest O/S): 1100 cycles

(5.15.1) Calculate the CPI for the system listed above assuming that there are no accesses to I/O. What is the CPI if the VMM performance impact doubles? If it is cut in half? If a virtual machine software company wishes to obtain a 10% performance degradation, what is the longest possible penalty to trap to the VMM?

(5.15.2) I/O accesses often have a large impact on overall system performance. Calculate the CPI of a machine using the performance characteristics above, assuming a non-virtualized system. Calculate the CPI again, this time using a virtualized system. How do these CPIs change if the system has half the I/O accesses? Explain why I/O bound applications have a smaller impact from virtualization.

(5.15.3) Compare and contrast the ideas of virtual memory and virtual machines. How do the goals of each compare? What are the pros and cons of each? List a few cases where virtual memory is desired, and a few cases where virtual machines are desired.

(5.15.4) Section 5.6 discusses virtualization under the assumption that the virtualized system is running the same ISA as the underlying hardware. However, one possible use of virtualization is to emulate non-native ISAs. An example of this is QEMU, which emulates a variety of ISAs such as MIPS, SPARC, and PowerPC. What are some of the difficulties involved in this kind of virtualization? Is it possible for an emulated system to run faster than on its native ISA?

In this exercise, we will examine space/time optimizations for page tables. The following list provides parameters of a virtual memory system.

Virtual Address (bits): 43
Physical DRAM Installed: 16 GiB
Page Size: 4 KiB
PTE Size (bytes): 4

(5.12.1) For a single-level page table, how many page table entries (PTEs) are needed? How much physical memory is needed for storing the page table?

(5.12.2) Using a multilevel page table can reduce the physical memory consumption of page tables, by keeping active PTEs in physical memory. How many levels of page tables will be needed in this case? And how many memory references are needed for address translation if missing in TLB?

(5.12.3) An inverted page table can be used to further optimize space and time. How many PTEs are needed to store the page table? Assuming a hash table implementation, what are the common case and worst case numbers of memory references needed for servicing a TLB miss?

The following table shows the contents of a 4-entry TLB.

Entry-ID  Valid  VA Page  Modified  Protection  PA Page
1         1      140      1         RW          30
2         0      40       0         RX          34
3         1      200      1         RO          32
4         1      280      0         RW          31

(5.12.4) Under what scenarios would entry 2’s valid bit be set to zero?

(5.12.5) What happens when an instruction writes to VA page 30? When would software managed TLB be faster than hardware managed TLB?

(5.12.6) What happens when an instruction writes to VA page 200?

Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits.

P1: X[0]++; X[1] = 3;
P2: X[0] = 5; X[1] += 2;

5.17.1 List the possible values of the given cache block for a correct cache coherence protocol implementation. List at least one more possible value of the block if the protocol doesn’t ensure cache coherency.

5.17.2 For a snooping protocol, list a valid operation sequence on each processor/cache to finish the above read/write operations.

5.17.3 What are the best-case and worst-case numbers of cache misses needed to execute the listed read/write instructions?

Memory consistency concerns the views of multiple data items. The following data shows two processors and their read/write operations on different cache blocks (A and B initially 0).

P1: A = 1; B = 2; A += 2; B++;
P2: C = B; D = A;

5.17.4 List the possible values of C and D for an implementation that ensures both consistency assumptions on page 470.

5.17.5 List at least one more possible pair of values for C and D if such assumptions are not maintained.

5.17.6 For various combinations of write policies and write allocation policies, which combinations make the protocol implementation simpler?

In this exercise, we will look at the different ways capacity affects overall performance. In general, cache access time is proportional to capacity. Assume that main memory accesses take 70 ns and that memory accesses are 36% of all instructions. The following table shows data for L1 caches attached to each of two processors, P1 and P2.

      L1 Size  L1 Miss Rate  L1 Hit Time
P1    2 KiB    8.0%          0.66 ns
P2    4 KiB    6.0%          0.90 ns

(5.6.1) Assuming that the L1 hit time determines the cycle times for P1 and P2, what are their respective clock rates?

(5.6.2) What is the Average Memory Access Time for P1 and P2?

(5.6.3) Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 and P2? Which processor is faster?

For the next three problems, we will consider the addition of an L2 cache to P1 to presumably make up for its limited L1 cache capacity. Use the L1 cache capacities and hit times from the previous table when solving these problems. The L2 miss rate indicated is its local miss rate.

L2 Size  L2 Miss Rate  L2 Hit Time
1 MiB    95%           5.62 ns

(5.6.4) What is the AMAT for P1 with the addition of an L2 cache? Is the AMAT better or worse with the L2 cache?

(5.6.5) Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 with the addition of an L2 cache?

(5.6.6) Which processor is faster, now that P1 has an L2 cache? If P1 is faster, what miss rate would P2 need in its L1 cache to match P1’s performance? If P2 is faster, what miss rate would P1 need in its L1 cache to match P2’s performance?

As described in Section 5.7, virtual memory uses a page table to track the mapping of virtual addresses to physical addresses. This exercise shows how this table must be updated as addresses are accessed. The following data constitutes a stream of virtual addresses as seen on a system. Assume 4 KiB pages, a 4-entry fully associative TLB, and true LRU replacement. If pages must be brought in from disk, increment the next largest page number.

4669, 2227, 13916, 34587, 48870, 12608, 49225

TLB

Valid  Tag  Physical Page Number
1      11   12
1      7    4
1      3    6
0      4    9

Page table

Valid  Physical Page or in Disk
1      5
0      Disk
0      Disk
1      6
1      9
1      11
0      Disk
1      4
0      Disk
0      Disk
1      3
1      12

(5.11.1) Given the address stream shown, and the initial TLB and page table states provided above, show the final state of the system. Also list for each reference if it is a hit in the TLB, a hit in the page table, or a page fault.

(5.11.2) Repeat 5.11.1, but this time use 16 KiB pages instead of 4 KiB pages. What would be some of the advantages of having a larger page size? What are some of the disadvantages?

(5.11.3) Show the final contents of the TLB if it is 2-way set associative. Also show the contents of the TLB if it is direct mapped. Discuss the importance of having a TLB to high performance. How would virtual memory accesses be handled if there were no TLB?

There are several parameters that impact the overall size of the page table. Listed below are key page parameters.

Virtual Address Size: 32 bits
Page Size: 8 KiB
Page Table Entry Size: 4 bytes

(5.11.4) Given the parameters shown above, calculate the total page table size for a system running 5 applications that utilize half of the memory available.

(5.11.5) Given the parameters shown above, calculate the total page table size for a system running 5 applications that utilize half of the memory available, given a two level page table approach with 256 entries. Assume each entry of the main page table is 6 bytes. Calculate the minimum amount of memory required.

(5.11.6) A cache designer wants to increase the size of a 4 KiB virtually indexed, physically tagged cache. Given the page size shown above, is it possible to make a 16 KiB direct-mapped cache, assuming 2 words per block? How would the designer increase the data size of the cache?
