
Open reading frames in E. coli In this problem, we will search the E. coli genome for open reading frames. The actual genome sequence of E. coli is available on the book's website. (a) Write a program that scans the DNA sequence and records the distance between start and stop codons in each of the three ORFs on the forward strand. You may skip the calculation for the reverse strand. You can find an example of this code implemented in Matlab on the book's website. (b) Plot the distribution of ORF lengths \(L\) and compare it with that expected for random DNA calculated in Problem 4.7. (c) Estimate a cut-off value \(L_{\text{cut}}\), above which the ORFs are statistically significant, that is, the number of observed ORFs with \(L>L_{\text{cut}}\) is much greater than expected by chance. (Problem courtesy of Sharad Ramanathan.)

Short Answer

To solve this problem, a program is written to scan the E. coli DNA sequence, identify the ORFs, and measure the distance between the start and stop codons of each one. The resulting lengths give a distribution of ORF lengths, from which a cut-off value \(L_{\text{cut}}\) is estimated; ORFs longer than this are considered statistically significant.

Step by step solution

01

Analyze the problem

Understanding the problem is the first step; it combines biology and programming. In biology, open reading frames (ORFs) are stretches of sequence, running from a start codon to a stop codon, that have the potential to be translated into proteins. In programming terms, a script must scan the DNA sequence of E. coli and record the distance between the start and stop codons of each ORF on the forward strand.
02

Write the program

This step involves writing a script that reads the DNA sequence, identifies the ORFs in each reading frame, and calculates the distance between each start codon and the stop codon that follows it. Any general-purpose language works; Python, R, and Matlab are all suitable.
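A minimal Python sketch of such a scanner is given here for illustration; it is not the Matlab code from the book's website. It assumes an ORF begins at the first ATG after the previous stop codon in the same frame, and it reports lengths in codons, counting the start codon but not the stop codon. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def orf_lengths(seq, frame):
    """Return ORF lengths in codons for one forward reading frame (0, 1, or 2).

    Assumption: an ORF runs from the first ATG seen after the previous stop
    codon up to (but not including) the next in-frame stop codon.
    """
    seq = seq.upper()
    lengths = []
    start = None  # position of the current ORF's start codon, if any
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon == START_CODON:
                start = i
        elif codon in STOP_CODONS:
            lengths.append((i - start) // 3)
            start = None
    return lengths


def all_forward_orf_lengths(seq):
    """Collect ORF lengths from the three forward reading frames."""
    return [n for frame in range(3) for n in orf_lengths(seq, frame)]
```

An ORF that is still open when the sequence ends is dropped in this sketch, which is a simplification for a circular genome.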
03

Execute the program and gather data

Run the script from step 2 with the E. coli genome sequence as input. The script iterates over the entire DNA sequence, identifying each ORF and calculating the distance between its start and stop codons. The recorded lengths form the distribution of ORF lengths.
04

Plot distribution of ORF lengths

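With the data gathered from running the script, the next step is to plot the distribution of ORF lengths as a histogram. Any plotting tool will do, for example the histogram functions in Matlab, R, or Python. Because short ORFs vastly outnumber long ones, a logarithmic vertical axis makes the tail of long ORFs easier to see.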
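As a rough illustration of the binning behind such a plot, the sketch below counts ORFs per length bin and prints a text histogram with log-scaled bars, using only the standard library. The bin width and the example lengths are arbitrary assumptions, not values from the exercise.

```python
import math
from collections import Counter


def length_histogram(lengths, bin_width=10):
    """Count ORFs per length bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


def text_plot(hist, bin_width=10):
    """Render one line per bin, with bar length growing as log10(count)."""
    lines = []
    for start, count in hist.items():
        bar = "#" * (1 + int(5 * math.log10(count)))
        lines.append(f"{start:>5}-{start + bin_width - 1:<5} {count:>7} {bar}")
    return lines


if __name__ == "__main__":
    example_lengths = [3, 7, 12, 15, 18, 44, 120]  # made-up values
    for line in text_plot(length_histogram(example_lengths)):
        print(line)
```

If matplotlib is installed, a call along the lines of `plt.hist(lengths, bins=100, log=True)` should give the graphical equivalent.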
05

Compare ORF lengths with random DNA

The observed distribution of ORF lengths should be compared with the distribution for random DNA calculated in Problem 4.7. Lengths at which the genome contains many more ORFs than the random model predicts point to ORFs that are unlikely to have arisen by chance.
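Problem 4.7 is not reproduced here, so the sketch below rests on an assumption: the four bases are equally likely and independent, so 3 of the 64 codons are stop codons and the number of codons before the next stop follows a geometric law, \(P(L)=(61/64)^{L}(3/64)\). Whether the start codon is counted shifts \(L\) by one, which hardly matters for long ORFs. The function names are hypothetical.

```python
P_STOP = 3 / 64  # assumption: equal, independent base frequencies


def random_orf_probability(length, p_stop=P_STOP):
    """Probability of `length` non-stop codons followed by a stop codon."""
    return (1 - p_stop) ** length * p_stop


def expected_counts(n_orfs, bin_starts, bin_width, p_stop=P_STOP):
    """Expected number of random ORFs in each bin [start, start + bin_width)."""
    q = 1 - p_stop
    # Summing the geometric law over one bin gives q**start - q**(start + width).
    return {s: n_orfs * (q ** s - q ** (s + bin_width)) for s in bin_starts}
```

Calling `expected_counts` with the same bins as the observed histogram gives a baseline that can be drawn on the same axes.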
06

Estimate cut-off value

After comparing the observed distribution with that of random DNA, estimate a cut-off value \(L_{\text{cut}}\): the length above which the number of observed ORFs is much greater than the number expected by chance. ORFs with \(L > L_{\text{cut}}\) are considered statistically significant, meaning that ORFs this long are unlikely to occur in random sequence.
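One possible way to turn this into a number is sketched below. It assumes the geometric random-DNA model with a stop probability of 3/64 per codon, and it takes \(L_{\text{cut}}\) to be the length at which fewer than one random ORF of that length or longer is expected. The threshold of one and the function name are illustrative choices, not part of the original problem.

```python
import math


def estimate_l_cut(n_orfs, p_stop=3 / 64, max_expected=1.0):
    """Smallest L (in codons) with n_orfs * (1 - p_stop)**L <= max_expected.

    Assumption: random ORF lengths are geometric, so the expected number of
    random ORFs with at least L codons is n_orfs * (1 - p_stop)**L.
    """
    if n_orfs <= max_expected:
        return 0
    q = 1 - p_stop
    return math.ceil(math.log(n_orfs / max_expected) / -math.log(q))
```

For scale, a genome of about 4.6 million base pairs has roughly 4.6 million codon positions across the three forward frames, so about \(4.6\times10^{6}\times 3/64 \approx 2\times10^{5}\) stop codons would be expected by chance. Using that figure as a generous stand-in for the number of ORFs puts the estimate at a few hundred codons. The real genome is not uniformly random, so the cut-off read from the plotted comparison may differ.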

Key Concepts

These are the key concepts you need to understand to accurately answer the question.

Open Reading Frames
Open Reading Frames (ORFs) are a fundamental concept in bioinformatics and genomic analysis. An ORF is a continuous stretch of nucleotides within a DNA sequence that starts with a start codon (usually AUG in RNA, which corresponds to ATG in DNA) and ends with a stop codon (UAA, UAG, or UGA in RNA, corresponding to TAA, TAG, or TGA in DNA).

These frames indicate regions that have the potential to encode proteins. Understanding ORFs is crucial because proteins perform nearly all of the functions necessary for cells to operate.
  • Start Codons: Indicate where the protein-coding region begins.
  • Stop Codons: Signal the end of the protein-coding region.
When scanning a genome for ORFs, the direction (forward or reverse) and the reading frame (there are three possible reading frames on each strand) must be considered. For the provided exercise, focus is given to the forward strand.
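To make the idea of reading frames concrete, the short sketch below splits a sequence into codons for a chosen frame offset and builds the reverse complement, which is what a scan of the reverse strand would read. The helper names are hypothetical, and the exercise itself only requires the forward strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def codons_in_frame(seq, frame):
    """Split seq into codons starting at offset frame (0, 1, or 2).

    A trailing partial codon is dropped.
    """
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    example = "ATGAAATAAC"  # made-up sequence
    for frame in range(3):
        print(frame, codons_in_frame(example, frame))
```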
E. coli Genome
The Escherichia coli (E. coli) genome is a widely studied model in genetics and microbiology. It serves as a vital resource for biological research because of its well-characterized genetic material and the ease with which it can be manipulated and studied.

The E. coli genome is composed of approximately 4.6 million base pairs and contains a well-organized arrangement of coding and non-coding sequences. Researchers analyze this genome to understand bacterial function and evolution, and it also serves as a reference for studying more complex organisms.
  • Order and Structure: The genome is circular and highly compact, encoding thousands of proteins.
  • Functional Significance: Each gene has a specific role, contributing to the survival and adaptation of the bacterium.
By analyzing its ORFs, scientists can identify which regions are likely to encode proteins and are therefore candidates for further study.
DNA Sequence Analysis
DNA Sequence Analysis involves examining the sequence of bases (adenine, thymine, cytosine, and guanine) within a DNA molecule. This analysis allows scientists to identify genes, predict their function, and understand the evolutionary history of organisms.

For the E. coli genome, DNA sequence analysis helps determine the various ORFs and their characteristics. Steps in sequence analysis generally include:
  • Reading Sequence: Obtaining the raw genetic code from databases.
  • Identifying Codons: Finding the start, stop, and intermediate codons that make up ORFs.
  • Recording Data: Documenting the lengths and positions of these ORFs for further analysis.
Scripts written in languages such as Python or Matlab automate these steps, which is essential for a genome millions of base pairs long.
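The first of these steps, reading the sequence, might look like the sketch below. It assumes the genome is stored as a FASTA-style text file with a single record (header lines beginning with `>` followed by lines of bases). The format of the file on the book's website is not stated here, and the file name in the usage line is hypothetical.

```python
def parse_fasta(text):
    """Concatenate the sequence lines of FASTA-style text, skipping headers.

    Assumption: the text holds a single record; blank lines are ignored.
    """
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        parts.append(line.upper())
    return "".join(parts)


def read_fasta_sequence(path):
    """Read a sequence file from disk and return it as one uppercase string."""
    with open(path) as handle:
        return parse_fasta(handle.read())


# Usage (hypothetical file name):
# genome = read_fasta_sequence("ecoli.fasta")
```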
Statistical Significance in Genomics
In genomics, determining the statistical significance of findings is crucial to distinguish between results that are meaningful and those that might be due to random chance. This helps in confirming which genetic sequences have real biological functions or impacts.

To identify statistically significant ORFs, comparisons are made between the observed lengths of ORFs and their expected random lengths. This includes:
  • Distribution Analysis: Analyze the frequency and length distribution of observed ORFs.
  • Threshold Calculation: Determine a cut-off value, referred to as \(L_{\text{cut}}\), which signifies ORFs that show a non-random and potentially significant pattern.
  • Comparison with Random DNA: Contrast the actual distribution of ORFs against a random model to ensure the result isn't coincidental.
This cutoff value helps researchers understand which sequences are more likely to be functionally important, guiding further genomic investigations.

Most popular questions from this chapter

Alignment tools and methods HIV virions wrap themselves with a lipid bilayer membrane as they bud off from infected cells; in this viral membrane envelope are "spikes" composed of two different proteins (actually glycoproteins), gp41 and gp120. The gp denotes glycoprotein, and the number indicates their molecular weight in kilodaltons. gp120 and gp41 together form the trimeric envelope spike on the surface of HIV that functions in viral entry into a host cell. The primary receptor for gp120 is CD4, a protein found mainly on the white blood cells known as T-lymphocytes. gp120 avoids detection by the host immune system through a number of strategies, including rapid changes in sequence due to mutations. In this exercise, you will grab a sequence for gp120 and "blast" it to find related proteins. Use the "Search database" on the LANL HIV Sequence Database site (www.hiv.lanl.gov/) to find a sequence for gp120; the complete gp120 molecule has about 500 amino acids, so a complete DNA sequence will have roughly 1500 base pairs. With the BLAST website (www.ncbi.nlm.nih.gov/blast), open a new window and select "blastx" under the "Basic BLAST" heading. Copy and paste your approximately 1500 nucleotide sequence of gp120 into the top box under "Enter Query Sequence." For the database, select "Protein Data Bank proteins (pdb)." What we are doing is having BLAST translate the gp120 nucleotide sequence into an amino acid sequence and then compare it with amino acid sequences of proteins in the PDB. Note that there are some proteins that appear multiple times in the PDB because their structures have been analyzed and determined more than once or in different contexts. Finally, push the "BLAST" button and wait for your results to appear on a new page. (a) How does BLAST determine the ranking of the results from your search? (b) For your top search result, identify the percentage of sequence identity with your query sequence. Explain how this number is determined. (c) You will notice that BLAST has used gaps in many of your alignments. What is the evolutionary significance of these gaps? (d) For your top hit, how many alignments with this high a score or better would have been expected by chance? (e) Looking at your ranked list of results and using only the "E-value," which hits do you expect to (possibly) have a genuine evolutionary relationship with your gp120 sequence and why?

Mutations of bacteria in our gut (a) The populations of the E. coli in the guts of a collection of humans can be large enough that multiple mutations can occur simultaneously in one bacterium. Suppose that a very particular combination of \(k\) point mutations is required for a pathogenic strain to emerge and that these must all arise in one cell division (as could be the case if the subsets of these mutations are deleterious). With the point mutation rate per base pair per cell division of \(\mu\), what is the probability \(m_{k}\) that this occurs in a single cell division? The simplest assumption is that the probabilities of the different mutations are independent. (b) In a human large intestine, the density of bacteria is estimated to be about \(10^{11.5}\) per milliliter, of which a fraction of about \(10^{-4}\) are E. coli. Estimate how many E. coli per person this implies. In a population of \(N\) humans, with \(n\) E. coli in each of their guts, in \(T\) generations of the E. coli, estimate the total probability \(P_{k}\) that the particular combination of \(k\) mutations occurs at least once. (c) With the population of Silicon Valley over one year, what are the chances this occurs for \(k=2\)? For \(k=3\)? Some crucial factors in your estimate are \(\mu \approx 10^{-10}\) to \(10^{-9}\) mutations per base pair per cell division and the generation time of E. coli: the standard lab result is that E. coli divide every 20 minutes. A low-end estimate for the division rate of E. coli in human guts is about once every few days. Why is this more realistic? Given these and other uncertainties, how big are the uncertainties in your estimates of \(P_{2}\) and \(P_{3}\)? (Problem courtesy of Daniel Fisher.)

Mutual information by another name In the chapter, we introduced the concept of mutual information as the average decrease in the missing information associated with one variable when the value of another variable is known. In terms of probability distributions, this can be written mathematically as \[I=\sum_{y} p(y)\left[-\sum_{x} p(x) \log _{2} p(x)+\sum_{x} p(x | y) \log _{2} p(x | y)\right]\] where the expression in square brackets is the difference in missing information, \(S_{x}-S_{x|y}\), associated with the probability of \(x\), \(p(x)\), and with the probability of \(x\) conditioned on \(y\), \(p(x | y)\). Using the relation between the conditional probability \(p(x | y)\) and the joint probability \(p(x, y)\), \[p(x | y)=\frac{p(x, y)}{p(y)},\] show that the formula for mutual information given in Equation 21.77 can be used to derive the formula used in the chapter (Equation 21.17), namely \[I=\sum_{x, y} p(x, y) \log _{2}\left[\frac{p(x, y)}{p(x) p(y)}\right].\]

The molecular clock In eukaryotes, the majority of individual point mutations are thought to be "neutral" and have little or no effect on phenotype. Only a small fraction of the genome codes for proteins and critical DNA regulatory sequences. Even within coding regions, the redundancy of the genetic code is sufficient to render many mutations "synonymous" (that is, they do not change the amino acid, and hence the protein, encoded by the DNA). The slow accumulation of neutral mutations between two populations can be used as a "molecular clock" to estimate the length of time that has passed since the existence of their last common ancestor. In these estimates, it is common to make the simplifying approximations that (1) most mutations are neutral and (2) the rate of accumulation of neutral mutations is just the average point mutation rate per generation (that is, ignoring other kinds of mutations such as deletions, inversions, etc., as well as variations in and correlations among mutations). (a) With a crude estimate of the point mutation rate of humans of \(10^{-8}\) per base pair per generation, what fraction of the possible nucleotide differences would you expect there to be between chimpanzees and humans given that the fossil record and radiochemical dating indicate their lineages diverged about six million years ago? Compare your estimate with the observed result from sequencing of about \(1.5\%\). (b) Some parasitic organisms (lice are an example) have specialized and co-evolved with humans and chimps separately. A natural hypothesis is that the most recent common ancestor of the human and chimp parasites existed at the same time as that of the human and chimp themselves. How might you test this from DNA sequence data and other information? What are likely to be the largest causes of uncertainty in the estimates? (Problem courtesy of Daniel Fisher.)

Restriction enzymes and sequences (a) Restriction enzymes are proteins that recognize specific sequences at which they cut the DNA. Two commonly used restriction enzymes are HindIII and EcoRI. Look up the recognition sequences that these enzymes each cut and make a sketch of the pattern of cutting they carry out. Consider the approximately 48,000 bp genome of lambda phage and make an estimate of the lengths of the fragments that you would get if the DNA is cut with both the HindIII and EcoRI restriction enzymes. There is a precise mathematical way to do this and it depends upon the length of the recognition sequence (a 5 cutter will have shorter fragments than an 8 cutter); explain that. (b) Find the actual fragment lengths obtained in the lambda genome using these restriction enzymes by going to the New England Biolabs website (www.neb.com) and looking up the tables identifying the sites on the lambda genome that are cut by these different enzymes. How do these cutting patterns compare with your results from (a)? (c) Plot the number of cuts in the lambda genome as a function of the length of the recognition sequence of several commercially available type II restriction enzymes. You can download the list of type II restriction enzymes from the book's website. Combine this plot with a curve showing your theoretical expectation.
