IEEE 754-2008 contains a half precision that is only 16 bits wide. The leftmost bit is still the sign bit, the exponent is 5 bits wide and has a bias of 15, and the mantissa is 10 bits long. A hidden 1 is assumed. Write down the bit pattern to represent $-1.5625 \times 10^{-1}$, assuming a version of this format which uses an excess-16 format to store the exponent. Comment on how the range and accuracy of this 16-bit floating-point format compare to the single-precision IEEE 754 standard.

Short Answer

Bit pattern to represent $-1.5625 \times 10^{-1}$:

1011000100000000

Step by step solution

Step 1: Convert into binary

We can write $-1.5625 \times 10^{-1}$ as $-0.15625$.

Now we convert it into binary. Since $0.15625 = 2^{-3} + 2^{-5}$, we have

$-0.15625_{10} = -0.00101_2$
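The conversion can be checked with the usual repeated-doubling method: multiply the fraction by 2, take the integer part as the next bit, and continue with the remainder. The following is a minimal sketch; the function name `fraction_to_binary` and the `max_bits` cutoff are illustrative choices, not part of the original solution.

```python
def fraction_to_binary(value, max_bits=16):
    """Convert a fraction in [0, 1) to a binary string by repeated doubling."""
    if not 0 <= value < 1:
        raise ValueError("value must be in [0, 1)")
    bits = []
    while value and len(bits) < max_bits:
        value *= 2
        bit = int(value)       # the integer part is the next binary digit
        bits.append(str(bit))
        value -= bit           # continue with the fractional remainder
    return "0." + ("".join(bits) or "0")


print(fraction_to_binary(0.15625))  # 0.00101
```

The sign is handled separately, so only the magnitude 0.15625 is passed in.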

Step 2: Normalize the binary representation

Since a hidden 1 is assumed, the normalized form is

$-0.00101_2 = -1.01_2 \times 2^{-3}$
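As a quick check of the normalization, Python's `math.frexp` splits a float into a mantissa in $[0.5, 1)$ and a power of two; doubling the mantissa and decrementing the exponent gives the hidden-1 form used here. This is only a sketch, and the helper name `normalize` is an assumption made for illustration.

```python
import math


def normalize(value):
    """Return (sign, significand, exponent) with 1 <= significand < 2."""
    if value == 0:
        raise ValueError("zero has no normalized form")
    sign = 1 if value < 0 else 0
    # frexp gives abs(value) == mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(abs(value))
    # shift one binary place so the leading digit is the hidden 1
    return sign, mantissa * 2, exponent - 1


print(normalize(-0.15625))  # (1, 1.25, -3)
```

The significand 1.25 is $1.01_2$, matching the normalized form above.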

Step 3: Find sign, exponent, and fraction

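The number is negative, so Sign = 1.

The true exponent is $-3$ and the bias is 15, so the stored exponent is $-3 + 15 = 12 = 01100_2$.

Fraction = 0100000000 (the bits 01 after the binary point, padded with zeros to 10 bits).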
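The three fields can be derived mechanically from the normalized form. The sketch below assumes a normal (non-zero, in-range) value and performs no overflow or subnormal handling; `half_fields` and its parameters are hypothetical names. The bias is left as a parameter because the question mentions both a bias of 15 and an excess-16 encoding; this solution uses 15.

```python
import math


def half_fields(value, bias=15, exponent_bits=5, fraction_bits=10):
    """Return (sign, exponent field, fraction field) as bit strings.

    Assumes a normal value that fits the format; no range checks.
    """
    sign = 1 if value < 0 else 0
    mantissa, exponent = math.frexp(abs(value))
    significand, exponent = mantissa * 2, exponent - 1   # 1 <= significand < 2
    stored_exponent = exponent + bias
    # drop the hidden 1 and scale the remainder to an integer fraction field
    fraction = round((significand - 1) * 2 ** fraction_bits)
    return (
        sign,
        format(stored_exponent, f"0{exponent_bits}b"),
        format(fraction, f"0{fraction_bits}b"),
    )


print(half_fields(-0.15625))           # (1, '01100', '0100000000')
print(half_fields(-0.15625, bias=16))  # (1, '01101', '0100000000')
```

Under the excess-16 reading only the exponent field changes, from 12 to 13.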

Step 4: Binary representation

1 01100 0100000000 (sign, exponent, fraction), that is, 1011000100000000
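The result can be cross-checked against Python's `struct` module, whose `"e"` format code packs a value as IEEE 754 binary16 (bias 15). This is a verification sketch only; `half_bits` is an illustrative helper name.

```python
import struct


def half_bits(value):
    """Return the 16-bit IEEE 754 half-precision pattern of value as a string."""
    # pack as big-endian binary16, then reinterpret the two bytes as an integer
    (raw,) = struct.unpack(">H", struct.pack(">e", value))
    return format(raw, "016b")


bits = half_bits(-0.15625)
print(bits[0], bits[1:6], bits[6:])  # 1 01100 0100000000

# decoding the pattern recovers the original value exactly
(decoded,) = struct.unpack(">e", struct.pack(">H", 0b1011000100000000))
print(decoded)  # -0.15625
```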

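The range is much smaller than that of the single-precision IEEE 754 standard because the exponent field is only 5 bits wide instead of 8: the largest finite value in standard half precision is 65504, compared with roughly $3.4 \times 10^{38}$ for single precision. The half-precision format is also less accurate because the fraction is only 10 bits instead of 23, which gives about 3 significant decimal digits rather than about 7.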
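To put numbers on the comparison, the sketch below computes the largest finite value and the spacing between representable values just above 1.0 from the field widths alone. It assumes the standard IEEE 754 layout, where the all-ones exponent field is reserved for infinities and NaNs; the function names are illustrative.

```python
def max_normal(exponent_bits, fraction_bits):
    """Largest finite value of an IEEE 754 style binary format."""
    # the all-ones exponent field is reserved, so the top usable exponent
    # is 2**(exponent_bits - 1) - 1 after removing the bias
    max_exponent = 2 ** (exponent_bits - 1) - 1
    return (2 - 2 ** -fraction_bits) * 2.0 ** max_exponent


def spacing_at_one(fraction_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -fraction_bits


for name, e_bits, f_bits in [("half", 5, 10), ("single", 8, 23)]:
    print(name, max_normal(e_bits, f_bits), spacing_at_one(f_bits))
```

For half precision this gives a maximum of 65504.0 and a spacing of $2^{-10}$; for single precision the maximum is about $3.4 \times 10^{38}$ and the spacing is $2^{-23}$.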
