
One of the biggest impediments to the widespread use of virtual machines is the performance overhead incurred by running a virtual machine. The table below lists various performance parameters and application behavior.

Base CPI: 1.5
Privileged O/S accesses per 10,000 instructions: 120
Performance impact to trap to the guest O/S: 15 cycles
Performance impact to trap to the VMM: 175 cycles
I/O accesses per 10,000 instructions: 30
I/O access time (includes time to trap to guest O/S): 1100 cycles

(5.15.1) Calculate the CPI for the system listed above assuming that there are no accesses to I/O. What is the CPI if the VMM performance impact doubles? If it is cut in half? If a virtual machine software company wishes to obtain a 10% performance degradation, what is the longest possible penalty to trap to the VMM?

(5.15.2) I/O accesses often have a large impact on overall system performance. Calculate the CPI of a machine using the performance characteristics above, assuming a non-virtualized system. Calculate the CPI again, this time using a virtualized system. How do these CPIs change if the system has half the I/O accesses? Explain why I/O-bound applications suffer a smaller performance impact from virtualization.

(5.15.3) Compare and contrast the ideas of virtual memory and virtual machines. How do the goals of each compare? What are the pros and cons of each? List a few cases where virtual memory is desired, and a few cases where virtual machines are desired.

(5.15.4) Section 5.6 discusses virtualization under the assumption that the virtualized system is running the same ISA as the underlying hardware. However, one possible use of virtualization is to emulate non-native ISAs. An example of this is QEMU, which emulates a variety of ISAs such as MIPS, SPARC, and PowerPC. What are some of the difficulties involved in this kind of virtualization? Is it possible for an emulated system to run faster than on its native ISA?

Short Answer


(5.15.1)

The CPI of the system with no I/O accesses is 3.78. If the VMM performance impact doubles, the CPI rises to 5.88; if it is cut in half, the CPI drops to 2.73. The longest possible penalty to trap to the VMM while keeping the performance degradation within 10% is 14 cycles.

(5.15.2)

The CPI for a non-virtualized system is 4.98. The CPI for the virtualized system is 7.605; with half the I/O accesses it is 5.69. Virtualization has a smaller relative impact on I/O-bound applications because most of their time is spent waiting for I/O accesses to complete, which dwarfs the added trap overhead.

(5.15.3)

Virtual memory gives each application the illusion of a large, private address space, while a virtual machine gives an operating system the illusion of having dedicated physical hardware.

(5.15.4)

Emulating a non-native ISA requires many host instructions per emulated instruction, plus faithful emulation of traps, interrupts, and I/O behavior. An emulated system can run faster than on its native ISA, but only if the emulator dynamically examines and optimizes the emulated code.

Step by step solution

01

Formula to calculate CPI of the given system

(5.15.1)

The calculation of CPI for the system given in the question is done using the following formula:

CPI = Base CPI + (privileged O/S accesses per instruction) × (time to trap to guest O/S + time to trap to VMM)

With 120 privileged accesses per 10,000 instructions, the trap rate per instruction is 120 / 10,000 = 0.012.

02

Calculation of CPI for the given system

CPI of the system with no I/O accesses:

CPI = 1.5 + 0.012 × (15 + 175) = 1.5 + 2.28 = 3.78

CPI of the system if the VMM performance impact doubles (175 → 350 cycles):

CPI = 1.5 + 0.012 × (15 + 350) = 1.5 + 4.38 = 5.88

CPI of the system if the VMM performance impact is cut in half (175 → 87.5 cycles):

CPI = 1.5 + 0.012 × (15 + 87.5) = 1.5 + 1.23 = 2.73

03

Degrading the performance by 10%

The CPI of the system on native hardware (no trap to the VMM) is equal to:

CPI = 1.5 + 0.012 × 15 = 1.68

To keep the performance degradation within 10%, the following inequality must be satisfied, where x is the penalty to trap to the VMM:

1.5 + 0.012 × (15 + x) ≤ 1.1 × 1.68 = 1.848

Solving gives 15 + x ≤ 29, that is, x ≤ 14. The longest possible penalty to trap to the VMM is 14 cycles.
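As a quick check on these numbers, here is a small C sketch; the cpi() helper simply restates the formula above and is not from the textbook:

#include <stdio.h>

/* CPI = base + trap rate * (guest O/S trap + VMM trap) */
double cpi(double base, double trap_rate, double guest_trap, double vmm_trap) {
    return base + trap_rate * (guest_trap + vmm_trap);
}

int main(void) {
    const double rate = 120.0 / 10000.0;  /* privileged accesses per instruction */
    printf("virtualized:     %.2f\n", cpi(1.5, rate, 15, 175));   /* 3.78 */
    printf("VMM doubled:     %.2f\n", cpi(1.5, rate, 15, 350));   /* 5.88 */
    printf("VMM halved:      %.2f\n", cpi(1.5, rate, 15, 87.5));  /* 2.73 */

    /* longest VMM trap penalty that keeps degradation within 10% of native */
    double native = cpi(1.5, rate, 15, 0);                        /* 1.68 */
    double max_vmm = (1.1 * native - 1.5) / rate - 15.0;
    printf("max VMM penalty: %.0f cycles\n", max_vmm);            /* 14 */
    return 0;
}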

04

Calculation of CPI assuming a non-virtualized system

(5.15.2)

The formula to calculate CPI for a non-virtualized system with I/O impact is:

CPI = Base CPI + (privileged accesses per instruction) × (guest O/S trap time) + (I/O accesses per instruction) × (I/O access time)

Using the values given in the question:

CPI = 1.5 + 0.012 × 15 + 0.003 × 1100 = 1.5 + 0.18 + 3.3 = 4.98

05

Calculation of CPI for virtualized system

The CPI using a virtualized system, where every privileged access and every I/O access must additionally trap to the VMM (175 cycles), is:

CPI = 1.5 + 0.012 × (15 + 175) + 0.003 × (1100 + 175) = 1.5 + 2.28 + 3.825 = 7.605

The CPI using a virtualized system with half the I/O accesses is:

CPI = 1.5 + 0.012 × (15 + 175) + 0.0015 × (1100 + 175) = 1.5 + 2.28 + 1.9125 ≈ 5.69

Most of the cost of each I/O access (1100 cycles, which already includes the trap to the guest O/S) is unchanged by virtualization; the VMM adds only 175 cycles on top. Because this fixed overhead is amortized over long I/O times, I/O-bound applications see a smaller relative impact from virtualization.
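The same arithmetic with the I/O term can be checked the same way; this sketch only restates the calculations above:

#include <stdio.h>

int main(void) {
    const double priv_rate = 120.0 / 10000.0;  /* privileged accesses per instruction */
    const double io_rate   = 30.0 / 10000.0;   /* I/O accesses per instruction */

    double non_virt     = 1.5 + priv_rate * 15 + io_rate * 1100;
    double virt         = 1.5 + priv_rate * (15 + 175) + io_rate * (1100 + 175);
    double virt_half_io = 1.5 + priv_rate * (15 + 175) + (io_rate / 2) * (1100 + 175);

    printf("non-virtualized:        %.2f\n", non_virt);      /* 4.98  */
    printf("virtualized:            %.3f\n", virt);          /* 7.605 */
    printf("virtualized, half I/O:  %.2f\n", virt_half_io);  /* 5.69  */
    return 0;
}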

06

Comparing virtual memory and virtual machines

(5.15.3)

Virtual memory provides the illusion of a large, private address space to each application, while a virtual machine provides the illusion of an entire dedicated machine to each operating system. The goals are therefore similar in spirit: both virtualize a shared physical resource and isolate its users from one another.

Virtual memory enables multiprogramming, allowing more applications to run at the same time, and lets a program run even when only the currently needed pages are resident in physical memory. Its main cost is that a page fault is far slower than a DRAM access, so heavy paging degrades performance. Virtual machines reduce the amount of physical hardware required and make disaster recovery quick, but they add runtime overhead, complexity, and upfront cost.

Virtual memory is desired to compensate for a shortage of physical memory by temporarily moving data from RAM to disk storage. Virtual machines are desired for running multiple operating systems on the same hardware and for testing applications and websites across multiple platforms. Because a virtual machine is isolated from the host, it is also a safe place to run risky or untrusted software.

07

Impact of virtualization with non-native ISAs

(5.15.4)

Each ISA differs in how it executes instructions and handles interrupts and traps, so emulating a non-native ISA typically requires several host instructions per emulated instruction. Performance also suffers because device and I/O behavior must be emulated faithfully. An emulated system can nevertheless run faster than on its native ISA if the emulator dynamically examines and optimizes the emulated code, for example through dynamic binary translation.
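To make the per-instruction overhead concrete, here is a minimal fetch/decode/dispatch interpreter sketch in C. The toy guest ISA (OP_ADD, OP_HALT, and the encoding) is invented for illustration and is not how QEMU actually works; it only shows that each emulated instruction costs many host instructions:

#include <stdio.h>
#include <stdint.h>

enum { OP_ADD, OP_HALT };   /* toy guest ISA with two opcodes */

void emulate(const uint32_t *code, uint32_t *regs) {
    uint32_t pc = 0;
    for (;;) {
        uint32_t insn = code[pc++];            /* fetch */
        uint32_t op = insn >> 24;              /* decode */
        uint32_t rd = (insn >> 16) & 0xff;
        uint32_t rs = (insn >> 8) & 0xff;
        uint32_t rt = insn & 0xff;
        switch (op) {                          /* dispatch */
        case OP_ADD: regs[rd] = regs[rs] + regs[rt]; break;
        case OP_HALT: return;
        }
    }
}

int main(void) {
    uint32_t regs[256] = {0};
    regs[1] = 2; regs[2] = 3;
    const uint32_t prog[] = {
        (OP_ADD << 24) | (0 << 16) | (1 << 8) | 2,   /* r0 = r1 + r2 */
        (uint32_t)OP_HALT << 24,
    };
    emulate(prog, regs);
    printf("r0 = %u\n", regs[0]);   /* 5 */
    return 0;
}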


Most popular questions from this chapter

Media applications that play audio or video files are part of a class of workloads called “streaming” workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KiB working set sequentially with the following address stream:

0, 2, 4, 6, 8, 10, 12, 14, 16, …

5.5.1 Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, based on the 3C model?

5.5.2 Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

5.5.3 “Prefetching” is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data is found in the prefetch buffer, it is considered as a hit and moved into the cache and the next cache block is prefetched. Assume a two-entry stream buffer and assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?
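For 5.5.1 and 5.5.3 the miss rates can also be checked with a rough simulation sketch. The cache parameters come from the question; the stream buffer is simplified to holding only the next sequential block, which behaves the same as the two-entry buffer for this purely sequential stream:

#include <stdio.h>

#define CACHE_BYTES (64 * 1024)
#define BLOCK 32
#define SETS (CACHE_BYTES / BLOCK)     /* 2048 direct-mapped sets */
#define WORKING_SET (512 * 1024)

int main(void) {
    static long tags[SETS];
    static int valid[SETS];
    long accesses = 0, misses = 0, misses_nobuf = 0;
    long prefetched = -1;   /* block currently held by the stream buffer */

    for (long addr = 0; addr < WORKING_SET; addr += 2) {
        long block = addr / BLOCK, set = block % SETS;
        accesses++;
        if (valid[set] && tags[set] == block)
            continue;                  /* ordinary cache hit */
        misses_nobuf++;                /* without a stream buffer, a miss */
        if (block != prefetched)
            misses++;                  /* a miss even with the buffer */
        valid[set] = 1;                /* block moves into the cache */
        tags[set] = block;
        prefetched = block + 1;        /* prefetch the next sequential block */
    }
    printf("miss rate, no prefetching: %.4f%%\n",
           100.0 * misses_nobuf / accesses);   /* 6.25% = 2/32 */
    printf("miss rate, stream buffer:  %.4f%%\n",
           100.0 * misses / accesses);         /* ~0%: one cold miss */
    return 0;
}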

Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of 1.35 references (both instruction and data) per instruction, help find the optimal block size given the following miss rates for various block sizes (a sketch evaluating the three cases follows the questions below).

Block size (bytes): miss rate
8: 4%
16: 3%
32: 2%
64: 1.5%
128: 1%

5.5.4 What is the optimal block size for a miss latency of 20×B cycles?

5.5.5 What is the optimal block size for a miss latency of 24+B cycles?

5.5.6 For constant miss latency, what is the optimal block size?
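A small sketch for 5.5.4 through 5.5.6: it computes memory stall cycles per instruction (1.35 references/instruction × miss rate × miss latency) for each block size under both latency models; the block size with the smallest stall value is optimal. For a constant miss latency the lowest miss rate (the largest block) wins, so only the two variable models are tabulated:

#include <stdio.h>

int main(void) {
    const int B[] = {8, 16, 32, 64, 128};
    const double miss_rate[] = {0.04, 0.03, 0.02, 0.015, 0.01};
    const double refs_per_instr = 1.35;

    for (int i = 0; i < 5; i++) {
        double lat_mul = 20.0 * B[i];   /* 5.5.4: miss latency = 20 x B cycles */
        double lat_add = 24.0 + B[i];   /* 5.5.5: miss latency = 24 + B cycles */
        printf("B=%3d: stalls(20xB)=%6.2f  stalls(24+B)=%5.3f\n",
               B[i],
               refs_per_instr * miss_rate[i] * lat_mul,
               refs_per_instr * miss_rate[i] * lat_add);
    }
    return 0;
}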

In this exercise we look at memory locality properties of matrix computation. The following code is written in C, where elements within the same row are stored contiguously. Assume each word is a 32-bit integer.

for (I = 0; I < 8; I++)
    for (J = 0; J < 8000; J++)
        A[I][J] = B[I][0] + A[J][I];
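Because C is row-major, the address of A[I][J] is base + (I × 8000 + J) × 4 bytes, so consecutive values of J touch adjacent words. A two-line check, assuming the 8 × 8000 arrays from the code above:

#include <stdio.h>

int main(void) {
    static int A[8][8000];   /* row-major: rows are contiguous */
    printf("&A[0][1] - &A[0][0] = %ld bytes\n",
           (long)((char *)&A[0][1] - (char *)&A[0][0]));   /* 4 */
    printf("&A[1][0] - &A[0][0] = %ld bytes\n",
           (long)((char *)&A[1][0] - (char *)&A[0][0]));   /* 32000 */
    return 0;
}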

5.1.1 [5] How many 32-bit integers can be stored in a 16-byte cache block?

5.1.2 [5] References to which variables exhibit temporal locality?

5.1.3 [5] References to which variables exhibit spatial locality?

Locality is affected by both the reference order and data layout. The same computation can also be written below in Matlab, which differs from C by storing matrix elements within the same column contiguously in memory.

for I=1:8

for J=1:8000

A(I,J)=B(I,0)+A(J,I);

end

end

5.1.4 [10] How many 16-byte cache blocks are needed to store all 32-bit matrix elements being referenced?

5.1.5 [5] References to which variables exhibit temporal locality?

5.1.6 [5] References to which variables exhibit spatial locality?

Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits.

P1: X[0]++; X[1] = 3;

P2: X[0] = 5; X[1] += 2;

5.17.1 List the possible values of the given cache block for a correct cache coherence protocol implementation. List at least one more possible value of the block if the protocol doesn’t ensure cache coherency.

5.17.2 For a snooping protocol, list a valid operation sequence on each processor/cache to finish the above read/write operations.

5.17.3 What are the best-case and worst-case numbers of cache misses needed to execute the listed read/write instructions?

Memory consistency concerns the views of multiple data items. The following data shows two processors and their read/write operations on different cache blocks (A and B initially 0).

P1: A = 1; B = 2; A += 2; B++;

P2: C = B; D = A;

5.17.4 List the possible values of C and D for an implementation that ensures both consistency assumptions on page 470.

5.17.5 List at least one more possible pair of values for C and D if such assumptions are not maintained.

5.17.6 For various combinations of write policies and write allocation policies, which combinations make the protocol implementation simpler?

In this exercise, we will explore the control unit for a cache controller for a processor with a write buffer. Use the finite state machine found in Figure 5.40 as a starting point for designing your finite state machines. Assume that the cache controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in Section 5.9), but you will add a write buffer with a capacity of one block.

Recall that the purpose of a write buffer is to serve as temporary storage so that the processor doesn’t have to wait for two memory accesses on a dirty miss. Rather than writing back the dirty block before reading the new block, it buffers the dirty block and immediately begins reading the new block. The dirty block can then be written to the main memory while the processor is working.
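As a starting point for 5.16.3, here is a minimal C sketch of how a one-block write buffer changes the dirty-miss path of the finite state machine from Figure 5.40. The state names, signals, and stall policy are illustrative assumptions, not the textbook's controller, and arbitration of the memory port between the buffered write and the new read is ignored:

#include <stdbool.h>

typedef enum { IDLE, COMPARE_TAG, ALLOCATE } State;

typedef struct {
    bool full;   /* true while the buffer still holds a dirty block */
    /* the buffered block's tag and data would live here */
} WriteBuffer;

/* One FSM step; wb->full is cleared elsewhere when memory finishes
 * writing back the buffered block. */
State step(State s, bool valid_request, bool hit, bool dirty,
           bool mem_ready, WriteBuffer *wb) {
    switch (s) {
    case IDLE:
        return valid_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:
        if (hit)
            return IDLE;            /* hits proceed; buffer is unaffected */
        if (dirty) {
            if (wb->full)
                return COMPARE_TAG; /* stall: buffer still draining */
            wb->full = true;        /* park the dirty block, skip the wait */
        }
        return ALLOCATE;            /* immediately start reading the new block */
    case ALLOCATE:
        return mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}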

5.16.1 What should happen if the processor issues a request that hits in the cache while a block is being written back to main memory from the write buffer?

5.16.2 What should happen if the processor issues a request that misses in the cache while a block is being written back to main memory from the write buffer?

5.16.3 Design a finite state machine to enable the use of a write buffer.

In this exercise, we will examine space/time optimizations for page tables. The following list provides parameters of a virtual memory system.

Virtual address: 43 bits
Physical DRAM installed: 16 GiB
Page size: 4 KiB
PTE size: 4 bytes

(5.12.1) For a single-level page table, how many page table entries (PTEs) are needed? How much physical memory is needed for storing the page table?

(5.12.2) Using a multilevel page table can reduce the physical memory consumption of page tables, by keeping active PTEs in physical memory. How many levels of page tables will be needed in this case? And how many memory references are needed for address translation if missing in TLB?

(5.12.3) An inverted page table can be used to further optimize space and time. How many PTEs are needed to store the page table? Assuming a hash table implementation, what are the common case and worst case numbers of memory references needed for servicing a TLB miss?
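For 5.12.1, a back-of-the-envelope check using the parameters above (43-bit virtual addresses, 4 KiB pages, 4-byte PTEs):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const int va_bits = 43, page_bits = 12;   /* 4 KiB page = 2^12 bytes */
    const uint64_t pte_size = 4;

    uint64_t ptes = 1ULL << (va_bits - page_bits);   /* 2^31 entries */
    uint64_t bytes = ptes * pte_size;                /* 2^33 bytes */
    printf("PTEs: 2^%d = %llu\n", va_bits - page_bits,
           (unsigned long long)ptes);
    printf("page table size: %llu GiB\n",
           (unsigned long long)(bytes >> 30));       /* 8 GiB */
    return 0;
}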

The following table shows the contents of a 4-entry TLB.

Entry-ID  Valid  VA Page  Modified  Protection  PA Page
1         1      140      1         RW          30
2         0      40       0         RX          34
3         1      200      1         RO          32
4         1      280      0         RW          31

(5.12.4) Under what scenarios would entry 2’s valid bit be set to zero?

(5.12.5) What happens when an instruction writes to VA page 30? When would a software-managed TLB be faster than a hardware-managed TLB?

(5.12.6) What happens when an instruction writes to VA page 200?
