Warning: foreach() argument must be of type array|object, bool given in /var/www/html/web/app/themes/studypress-core-theme/template-parts/header/mobile-offcanvas.php on line 20

Write a program to construct a dictionary of all "words," defined to be runs of consecutive nonwhitespace, in a given text file. We might then compress the file (ignoring the loss of whitespace information) by representing each word as an index in the dictionary. Retrieve the file rfc791.txt containing [Pos81], and run your program on it. Give the size of the compressed file assuming first that each word is encoded with 12 bits (this should be sufficient), and then that the 128 most common words are encoded with 8 bits and the rest with 13 bits. Assume that the dictionary itself can be stored by using, for each word, length(word) + 1 bytes.

Short Answer

Expert verified
For 12 bits per word, total size = 12 * number of words. For mixed encoding, 128 common words = 8 bits each, rest = 13 bits each. Dictionary size is sum of (len(word) + 1) for all unique words.

Step by step solution

01

Import necessary libraries

Use Python libraries for handling files and collections. You will need `collections` for handling word frequencies efficiently.
02

Read the file

Open and read the content of the file `rfc791.txt`. This will provide you with the text to process.
03

Split the text into words

Use Python string methods to split the text into a list of words, which are sequences of non-whitespace characters.
04

Create a dictionary of word frequencies

Count the frequency of each word in the text using a `Counter` from the `collections` library.
05

Calculate the size of the compressed file (12 bits per word)

Assume each word is encoded with 12 bits. Multiply the total number of words by 12 to get the size in bits.
06

Calculate the size of the compressed file (128 common words with 8 bits, rest with 13 bits)

Identify the 128 most common words. For these words, use 8 bits for encoding. For all other words, use 13 bits. Calculate the total compressed size in bits taking these into account.
07

Calculate dictionary storage size

For the dictionary size, consider that each word uses `len(word) + 1` bytes. Sum this for all unique words to find the total size needed for storing the dictionary.

Key Concepts

These are the key concepts you need to understand to accurately answer the question.

File Compression
File compression is a technique used to reduce the size of files. This is crucial for saving storage space and speeding up file transfers. The idea is to represent data in a more compact form without losing essential information. Various algorithms can achieve these reductions by identifying patterns and redundancies in the data.

In the context of our exercise, we are compressing text files by replacing words with indexes from a dictionary. This simplified version doesn't worry about preserving white spaces. By converting words into indexes, we reduce the overall file size while still retaining the important content. The efficiency of file compression depends on the type of algorithm and the nature of the file's data.

File compression is widely used in everyday applications, including:
  • Archiving data using formats like ZIP or RAR
  • Streaming media using formats like MP3, JPEG, or MP4
  • Reducing the size of software packages for easier download and installation
Word Frequency Analysis
Word frequency analysis involves counting the number of times each word appears in a text. This is a fundamental step in many text processing applications, including file compression. By analyzing word frequencies, we can identify the most commonly used words and optimize how we store them.

Let's break down the process further:
  • Read the complete text from the file
  • Split the text into individual words
  • Count the occurrences of each word

Python's `collections` library is particularly useful here, with its `Counter` class allowing efficient frequency counting.

Word frequency data helps in compressing files more effectively. For example, in our exercise, encoding frequently used words with fewer bits reduces the overall file size more than treating all words equally.

Applications of word frequency analysis include:
  • Text summarization
  • Search engine optimization (SEO)
  • Natural Language Processing (NLP)
Encoding Schemes
Encoding schemes are methods by which data is converted into a specific format for efficient storage and transfer. In the context of file compression, encoding schemes determine how compactly data can be stored. Different schemes offer various trade-offs between compression efficiency and computational complexity.

In our exercise, we explored two encoding schemes:
  • Using 12 bits to encode each word uniformly
  • Using 8 bits for the 128 most frequent words and 13 bits for the rest

The choice of scheme affects the compressed file size. Using fewer bits for more common words can lead to significant size reductions without losing information.

General encoding schemes include:
  • Fixed-Length Encoding: Every symbol is represented by a fixed number of bits. Simple but might be inefficient.
  • Variable-Length Encoding: Symbols are encoded with different lengths of bits, often based on frequency. More efficient for varied data.
  • Huffman Coding: A type of variable-length encoding that uses a priority queue to ensure the most frequent symbols have the shortest codes.

Understanding and choosing the appropriate encoding scheme is essential for effective file compression.

One App. One Place for Learning.

All the tools & learning materials you need for study success - in one app.

Get started for free

Most popular questions from this chapter

Using the programming language of your choice that supports user-defined automatic type conversions, define a type netint and supply conversions that enable assignments and equality comparisons between ints and netints. Can a generalization of this approach solve the problem of network argument marshalling?

Use XDR and htonl to encode a 1000 -element array of integers. Measure and compare the performance of each. How do these compare to a simple loop that reads and writes a 1000-element array of integers? Perform the experiment on a computer for which the native byte order is the same as the network byte order, as well as on a computer for which the native byte order and the network byte order are different.

Let \(p \leq 1\) be the fraction of machines in a network that are big-endian; the remaining \(1-p\) fraction are little-endian. Suppose we choose two machines at random and send an int from one to the other. Give the average number of byte-order conversions needed for both big-endian network byte order and receiver-makes-right, for \(p=0.1, p=0.5\), and \(p=0.9\). Hint: The probability that both endpoints are big-endian is \(p^{2} ;\) the probability that the two endpoints use different byte orders is \(2 p(1-p)\).

Suppose we have a video of two white points moving toward each other at a uniform rate against a black background. We encode it via MPEG. In one I frame the two points are 100 pixels apart; in the next I frame they have merged. The final point of merger happens to lie at the center of a \(16 \times 16\) macroblock. (a) Describe how you might optimally encode the Y component of the intervening B (or P) frames. (b) Now suppose the points are in color, and that the color changes slowly as the points move. Describe what the encoding of the U and V values might look like.

Suppose a file contains the letters \(a, b, c\), and \(d\). Nominally, we require 2 bits per letter to store such a file. (a) Assume the letter \(a\) occurs \(50 \%\) of the time, \(b\) occurs \(30 \%\) of the time, and \(c\) and \(d\) each occur \(10 \%\) of the time. Give an encoding of each letter as a bit string that provides optimal compression. Hint: Use a single bit for \(a\). (b) What is the percentage of compression you achieve above? (This is the average of the compression percentages achieved for each letter, weighted by the letter's frequency.) (c) Repeat this, assuming \(a\) and \(b\) each occur \(40 \%\) of the time, coccurs \(15 \%\) of the time, and \(d\) occurs \(5 \%\) of the time.

See all solutions

Recommended explanations on Computer Science Textbooks

View all explanations

What do you think about this solution?

We value your feedback to improve our textbook solutions.

Study anywhere. Anytime. Across all devices.

Sign-up for free