Name two reasons why it is impossible to determine a gene's nucleotide sequence from the amino acid sequence of the polypeptide

I can only think of one reason, which is because different codons can specify the same amino acids. However I am having trouble thinking of another reason.

I can think of at least 3 reasons in addition to the one you gave:

1: As mentioned in the comments, RNA splicing takes place on most protein-coding messenger RNAs in eukaryotes. Introns are spliced out of the pre-mRNA, so they leave no trace in the protein, and alternative splicing means that one gene can give rise to several mRNAs with different codon sequences.

2: Translation is a stateful process, since it depends on the frame of the codon. Therefore, a gene with the sequence GGATGATGATGTAA will encode the same protein as a gene with the sequence ATGATGATGTAA, because translation begins at the start codon (ATG), which sets the reading frame; the two extra leading nucleotides are never translated.

3: Genes contain untranslated regions before the start codon and after the stop codon. These nucleotides cannot be predicted from protein sequence, but are generally important in regulating protein expression.
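
To make the codon-redundancy and reading-frame points concrete, here is a minimal Python sketch. It assumes the standard genetic code and treats the first ATG as the start codon; the function names are illustrative only and are not from any library.

```python
from collections import Counter
from math import prod

# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))  # '*' marks a stop codon


def translate_from_first_atg(dna):
    """Translate from the first ATG until a stop codon (simplifying assumption)."""
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def count_coding_sequences(protein):
    """Number of distinct codon sequences that encode the given peptide."""
    degeneracy = Counter(CODON_TABLE.values())  # codons per amino acid
    return prod(degeneracy[aa] for aa in protein)


# The two sequences from point 2 give the same peptide.
print(translate_from_first_atg("GGATGATGATGTAA"))
print(translate_from_first_atg("ATGATGATGTAA"))
# A three-residue peptide Met-Leu-Ser already has 1 * 6 * 6 possible codings.
print(count_coding_sequences("MLS"))
```

Even before introns and untranslated regions are considered, the number of candidate coding sequences grows multiplicatively with peptide length.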

Introns are not included in the polypeptide sequence.

Silent mutation

Silent mutations are mutations in DNA that do not have an observable effect on the organism's phenotype. They are a specific type of neutral mutation. The phrase silent mutation is often used interchangeably with the phrase synonymous mutation; however, synonymous mutations are not always silent, nor vice versa. [1] [2] [3] [4] [5] Synonymous mutations can affect transcription, splicing, mRNA transport, and translation, any of which could alter phenotype, rendering the synonymous mutation non-silent. [3] The substrate specificity of the tRNA to the rare codon can affect the timing of translation, and in turn the co-translational folding of the protein. [1] This is reflected in the codon usage bias that is observed in many species. Mutations that cause the altered codon to produce an amino acid with similar functionality (e.g. a mutation producing leucine instead of isoleucine) are often classified as silent: if the properties of the amino acid are conserved, the mutation does not usually significantly affect protein function. [6]


Point mutations usually take place during DNA replication. DNA replication occurs when one double-stranded DNA molecule creates two single strands of DNA, each of which is a template for the creation of the complementary strand. A single point mutation can change the whole DNA sequence. Changing one purine or pyrimidine may change the amino acid that the nucleotides code for.

Point mutations may arise from spontaneous mutations that occur during DNA replication. The rate of mutation may be increased by mutagens. Mutagens can be physical, such as radiation from UV rays, X-rays or extreme heat, or chemical (molecules that misplace base pairs or disrupt the helical shape of DNA). Mutagens associated with cancers are often studied to learn about cancer and its prevention.

There are multiple ways for point mutations to occur. First, ultraviolet (UV) light and higher-frequency light are capable of ionizing electrons, which in turn can affect DNA. Second, reactive oxygen molecules with free radicals, which are a byproduct of cellular metabolism, can also be very harmful to DNA. These reactants can lead to both single-stranded and double-stranded DNA breaks. Third, bonds in DNA eventually degrade, which makes it harder to maintain the integrity of DNA. Finally, replication errors can lead to substitution, insertion, or deletion mutations.

Transition/transversion categorization

In 1959 Ernst Freese coined the terms "transitions" or "transversions" to categorize different types of point mutations. [2] [3] Transitions are replacement of a purine base with another purine or replacement of a pyrimidine with another pyrimidine. Transversions are replacement of a purine with a pyrimidine or vice versa. There is a systematic difference in mutation rates for transitions (Alpha) and transversions (Beta). Transition mutations are about ten times more common than transversions.
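
The transition/transversion distinction reduces to a check on base classes. A minimal Python sketch, assuming DNA bases only (the function name is illustrative):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Label a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("expected DNA bases A, C, G or T")
    if ref == alt:
        raise ValueError("no substitution: bases are identical")
    # Same chemical class (purine to purine, pyrimidine to pyrimidine).
    if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
        return "transition"
    # Purine to pyrimidine or the reverse.
    return "transversion"


print(classify_substitution("A", "G"))  # purine to purine
print(classify_substitution("A", "C"))  # purine to pyrimidine
```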

Functional categorization

Nonsense mutations include stop-gain and start-loss. Stop-gain is a mutation that results in a premature termination codon (a stop was gained), which signals the end of translation. This interruption causes the protein to be abnormally shortened. The number of amino acids lost mediates the impact on the protein's functionality and whether it will function whatsoever. [4] Stop-loss is a mutation in the original termination codon (a stop was lost), resulting in abnormal extension of a protein's carboxyl terminus. Start-gain creates an AUG start codon upstream of the original start site. If the new AUG is near the original start site, in-frame within the processed transcript and downstream to a ribosomal binding site, it can be used to initiate translation. The likely effect is additional amino acids added to the amino terminus of the original protein. Frame-shift mutations are also possible in start-gain mutations, but typically do not affect translation of the original protein. Start-loss is a point mutation in a transcript's AUG start codon, resulting in the reduction or elimination of protein production.

Missense mutations code for a different amino acid. A missense mutation changes a codon so that a different amino acid is incorporated and a different protein is created, a non-synonymous change. [4] Conservative mutations result in an amino acid change in which the properties of the amino acid remain the same (e.g., hydrophobic, hydrophilic, etc.). At times, a change to one amino acid in the protein is not detrimental to the organism as a whole. Most proteins can withstand one or two point mutations before their function changes. Non-conservative mutations result in an amino acid that has different properties than the wild type. The protein may lose its function, which can result in a disease in the organism. For example, sickle-cell disease is caused by a single point mutation (a missense mutation) in the beta-hemoglobin gene that converts a GAG codon into GUG, which encodes the amino acid valine rather than glutamic acid. The protein may also exhibit a "gain of function" or become activated, as is the case with the mutation changing a valine to glutamic acid in the BRAF gene; this leads to an activation of the RAF protein, which causes unlimited proliferative signalling in cancer cells. [5] These are both examples of a non-conservative (missense) mutation.

Silent mutations code for the same amino acid (a "synonymous substitution"). A silent mutation does not affect the functioning of the protein. A single nucleotide can change, but the new codon specifies the same amino acid, resulting in an unmutated protein. This type of change is called synonymous change since the old and new codon code for the same amino acid. This is possible because 64 codons specify only 20 amino acids. Different codons can lead to differential protein expression levels, however. [4]
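
The functional categories above can be illustrated by comparing the amino acid encoded before and after a single-base change. This is a minimal Python sketch assuming the standard genetic code; it uses "silent" in the codon-level sense of this section, and the function name and labels are illustrative.

```python
# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)  # '*' marks a stop codon


def classify_point_mutation(codon, position, new_base):
    """Classify a substitution at `position` (0, 1 or 2) within one codon."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "silent"      # same amino acid (synonymous)
    if after == "*":
        return "nonsense"    # stop-gain: premature termination codon
    if before == "*":
        return "stop-loss"   # original termination codon destroyed
    return "missense"        # different amino acid


# Sickle-cell change described in the text: GAG (Glu) to GTG (Val).
print(classify_point_mutation("GAG", 1, "T"))
# Third-position change GAA to GAG keeps glutamic acid.
print(classify_point_mutation("GAA", 2, "G"))
```

The sketch looks at the codon alone, so it cannot tell a conservative missense change from a non-conservative one, nor detect synonymous changes that still alter expression.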

Single base pair insertions and deletions

Sometimes the term point mutation is used to describe insertions or deletions of a single base pair. These have more of an adverse effect on the synthesized protein because the nucleotides are still read in triplets, but in a different frame: a mutation called a frameshift mutation. [4]

Point mutations that occur in non-coding sequences are most often without consequences, although there are exceptions. If the mutated base pair is in the promoter sequence of a gene, then the expression of the gene may change. Also, if the mutation occurs in the splicing site of an intron, then this may interfere with correct splicing of the transcribed pre-mRNA.

By altering just one amino acid, the entire peptide may change, thereby changing the entire protein. The new protein is called a protein variant. If the original protein functions in cellular reproduction then this single point mutation can change the entire process of cellular reproduction for this organism.

Point germline mutations can lead to beneficial as well as harmful traits or diseases. This leads to adaptations based on the environment where the organism lives. An advantageous mutation can create an advantage for that organism and lead to the trait's being passed down from generation to generation, improving and benefiting the entire population. The scientific theory of evolution is greatly dependent on point mutations in cells. The theory explains the diversity and history of living organisms on Earth. In relation to point mutations, it states that beneficial mutations allow the organism to thrive and reproduce, thereby passing its positively affected mutated genes on to the next generation. On the other hand, harmful mutations cause the organism to die or be less likely to reproduce in a phenomenon known as natural selection.

There are different short-term and long-term effects that can arise from mutations. Smaller ones would be a halting of the cell cycle at numerous points. This means that a codon coding for the amino acid glycine may be changed to a stop codon, causing the proteins that should have been produced to be deformed and unable to complete their intended tasks. Because the mutations can affect the DNA and thus the chromatin, it can prohibit mitosis from occurring due to the lack of a complete chromosome. Problems can also arise during the processes of transcription and replication of DNA. These all prohibit the cell from reproduction and thus lead to the death of the cell. Long-term effects can be a permanent changing of a chromosome, which can lead to a mutation. These mutations can be either beneficial or detrimental. Cancer is an example of how they can be detrimental. [6]

Other effects of point mutations, or single nucleotide polymorphisms in DNA, depend on the location of the mutation within the gene. For example, if the mutation occurs in the region of the gene responsible for coding, the amino acid sequence of the encoded protein may be altered, causing a change in the function, protein localization, stability of the protein or protein complex. Many methods have been proposed to predict the effects of missense mutations on proteins. Machine learning algorithms train their models to distinguish known disease-associated from neutral mutations whereas other methods do not explicitly train their models but almost all methods exploit the evolutionary conservation assuming that changes at conserved positions tend to be more deleterious. While majority of methods provide a binary classification of effects of mutations into damaging and benign, a new level of annotation is needed to offer an explanation of why and how these mutations damage proteins. [7]

Moreover, if the mutation occurs in the region of the gene where the transcriptional machinery binds to the DNA, the mutation can affect the binding of the transcription factors because the short nucleotide sequences recognized by the transcription factors will be altered. Mutations in this region can affect the efficiency of gene transcription, which in turn can alter levels of mRNA and, thus, protein levels in general.

Point mutations can have several effects on the behavior and reproduction of a protein depending on where the mutation occurs in the amino acid sequence of the protein. If the mutation occurs in the region of the gene that is responsible for coding for the protein, the amino acid may be altered. This slight change in the sequence of amino acids can cause a change in the function, activation of the protein meaning how it binds with a given enzyme, where the protein will be located within the cell, or the amount of free energy stored within the protein.

If the mutation occurs in the region of the gene where the transcriptional machinery binds to the DNA, the mutation can affect the way in which transcription factors bind. The transcriptional machinery binds to DNA through recognition of short nucleotide sequences. A mutation in this region may alter these sequences and, thus, change the way the transcription factors bind. Mutations in this region can affect the efficiency of gene transcription, which controls both the levels of mRNA and overall protein levels. [8]

Cancer

Point mutations in multiple tumor suppressor proteins cause cancer. For instance, point mutations in Adenomatous Polyposis Coli promote tumorigenesis. [9] A novel assay, Fast parallel proteolysis (FASTpp), might help swift screening of specific stability defects in individual cancer patients. [10]

Neurofibromatosis

Sickle-cell anemia

Sickle-cell anemia is caused by a point mutation in the β-globin chain of hemoglobin, causing the hydrophilic amino acid glutamic acid to be replaced with the hydrophobic amino acid valine at the sixth position.

The β-globin gene is found on the short arm of chromosome 11. The association of two wild-type α-globin subunits with two mutant β-globin subunits forms hemoglobin S (HbS). Under low-oxygen conditions (being at high altitude, for example), the absence of a polar amino acid at position six of the β-globin chain promotes the non-covalent polymerisation (aggregation) of hemoglobin, which distorts red blood cells into a sickle shape and decreases their elasticity. [14]

Hemoglobin is a protein found in red blood cells, and is responsible for the transportation of oxygen through the body. [15] There are two subunits that make up the hemoglobin protein: beta-globins and alpha-globins. [16] Beta-hemoglobin is created from the genetic information on the HBB, or "hemoglobin, beta" gene found on chromosome 11p15.5. [17] A single point mutation in this polypeptide chain, which is 147 amino acids long, results in the disease known as Sickle Cell Anemia. [18] Sickle-cell anemia is an autosomal recessive disorder that affects 1 in 500 African Americans, and is one of the most common blood disorders in the United States. [17] The single replacement of the sixth amino acid in the beta-globin, glutamic acid, with valine results in deformed red blood cells. These sickle-shaped cells cannot carry nearly as much oxygen as normal red blood cells and they get caught more easily in the capillaries, cutting off blood supply to vital organs. The single nucleotide change in the beta-globin means that even the smallest of exertions on the part of the carrier results in severe pain and even heart attack. Below is a chart depicting the first thirteen amino acids in the normal and abnormal sickle cell polypeptide chain. [18]

Sequence for normal hemoglobin
START Val His Leu Thr Pro Glu Glu Lys Ser Ala Val Thr

Sequence for sickle-cell hemoglobin
START Val His Leu Thr Pro Val Glu Lys Ser Ala Val Thr
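
The substitution in the chart above can be located by comparing the two rows position by position. A minimal Python sketch, using the residues exactly as listed (with START at index 0, so that list indices match residue numbers); the function name is illustrative:

```python
normal = "START Val His Leu Thr Pro Glu Glu Lys Ser Ala Val Thr".split()
sickle = "START Val His Leu Thr Pro Val Glu Lys Ser Ala Val Thr".split()


def residue_differences(reference, variant):
    """Return (position, reference_residue, variant_residue) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be the same length")
    return [
        (position, ref, var)
        for position, (ref, var) in enumerate(zip(reference, variant))
        if ref != var
    ]


# Reports the single Glu-to-Val change at the sixth position.
print(residue_differences(normal, sickle))
```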

Tay–Sachs disease

The cause of Tay–Sachs disease is a genetic defect that is passed from parent to child. This genetic defect is located in the HEXA gene, which is found on chromosome 15.

The HEXA gene makes part of an enzyme called beta-hexosaminidase A, which plays a critical role in the nervous system. This enzyme helps break down a fatty substance called GM2 ganglioside in nerve cells. Mutations in the HEXA gene disrupt the activity of beta-hexosaminidase A, preventing the breakdown of the fatty substances. As a result, the fatty substances accumulate to deadly levels in the brain and spinal cord. The buildup of GM2 ganglioside causes progressive damage to the nerve cells. This is the cause of the signs and symptoms of Tay-Sachs disease. [19]

Color blindness

People who are colorblind have mutations in their genes that cause a loss of either red or green cones, and they therefore have a hard time distinguishing between colors. There are three kinds of cones in the human eye: red, green, and blue. Now researchers have discovered that some people with the gene mutation that causes colorblindness lose an entire set of "color" cones with no change to the clearness of their vision overall. [20]

In molecular biology, repeat-induced point mutation or RIP is a process by which DNA accumulates G:C to A:T transition mutations. Genomic evidence indicates that RIP occurs or has occurred in a variety of fungi [21] while experimental evidence indicates that RIP is active in Neurospora crassa, [22] Podospora anserina, [23] Magnaporthe grisea, [24] Leptosphaeria maculans, [25] Gibberella zeae [26] and Nectria haematococca. [27] In Neurospora crassa, sequences mutated by RIP are often methylated de novo. [22]

RIP occurs during the sexual stage in haploid nuclei after fertilization but prior to meiotic DNA replication. [22] In Neurospora crassa, repeat sequences of at least 400 base pairs in length are vulnerable to RIP. Repeats with as low as 80% nucleotide identity may also be subject to RIP. Though the exact mechanism of repeat recognition and mutagenesis are poorly understood, RIP results in repeated sequences undergoing multiple transition mutations.

The RIP mutations do not seem to be limited to repeated sequences. Indeed, for example, in the phytopathogenic fungus L. maculans, RIP mutations are found in single copy regions, adjacent to the repeated elements. These regions are either non-coding regions or genes encoding small secreted proteins including avirulence genes. The degree of RIP within these single copy regions was proportional to their proximity to repetitive elements. [28]

Rep and Kistler have speculated that the presence of highly repetitive regions containing transposons may promote mutation of resident effector genes. [29] The presence of effector genes within such regions is therefore suggested to promote their adaptation and diversification when exposed to strong selection pressure. [30]

As RIP mutation is traditionally observed to be restricted to repetitive regions and not single copy regions, Fudal et al. [31] suggested that leakage of RIP mutation might occur within a relatively short distance of a RIP-affected repeat. Indeed, this has been reported in N. crassa whereby leakage of RIP was detected in single copy sequences at least 930 bp from the boundary of neighbouring duplicated sequences. [32] Elucidating the mechanism by which repeated sequences are detected may help explain how the flanking sequences are also affected.

Mechanism

RIP causes G:C to A:T transition mutations within repeats; however, the mechanism that detects the repeated sequences is unknown. RID is the only known protein essential for RIP. It is a DNA methyltransferase-like protein that, when mutated or knocked out, results in loss of RIP. [33] Deletion of the rid homolog in Aspergillus nidulans, dmtA, results in loss of fertility [34] while deletion of the rid homolog in Ascobolus immersus, masc1, results in fertility defects and loss of methylation induced premeiotically (MIP). [35]

Consequences

RIP is believed to have evolved as a defense mechanism against transposable elements, which resemble parasites by invading and multiplying within the genome. RIP creates multiple missense and nonsense mutations in the coding sequence. This hypermutation of G-C to A-T in repetitive sequences eliminates functional gene products of the sequence (if there were any to begin with). In addition, many of the C-bearing nucleotides become methylated, thus decreasing transcription.

Use in molecular biology

Because RIP is so efficient at detecting and mutating repeats, fungal biologists often use it as a tool for mutagenesis. A second copy of a single-copy gene is first transformed into the genome. The fungus must then mate and go through its sexual cycle to activate the RIP machinery. Many different mutations within the duplicated gene are obtained from even a single fertilization event so that inactivated alleles, usually due to nonsense mutations, as well as alleles containing missense mutations can be obtained. [36]

The cellular reproduction process of meiosis was discovered by Oscar Hertwig in 1876. Mitosis was discovered several years later in 1882 by Walther Flemming.

Hertwig studied sea urchins, and noticed that each egg contained one nucleus prior to fertilization and two nuclei after. This discovery proved that one spermatozoon could fertilize an egg, and therefore proved the process of meiosis. Hermann Fol continued Hertwig's research by testing the effects of injecting several spermatozoa into an egg, and found that the process did not work with more than one spermatozoon. [37]

Flemming began his research of cell division starting in 1868. The study of cells was an increasingly popular topic in this time period. By 1873, Schneider had already begun to describe the steps of cell division. Flemming furthered this description in 1874 and 1875 as he explained the steps in more detail. He also argued with Schneider's findings that the nucleus separated into rod-like structures by suggesting that the nucleus actually separated into threads that in turn separated. Flemming concluded that cells replicate through cell division, to be more specific mitosis. [38]

Matthew Meselson and Franklin Stahl are credited with the discovery of DNA replication. Watson and Crick acknowledged that the structure of DNA did indicate that there is some form of replicating process. However, there was not a lot of research done on this aspect of DNA until after Watson and Crick. People considered all possible methods of determining the replication process of DNA, but none were successful until Meselson and Stahl. Meselson and Stahl introduced a heavy isotope into some DNA and traced its distribution. Through this experiment, Meselson and Stahl were able to prove that DNA reproduces semi-conservatively. [39]

Protein Structure

For an interactive illustration of the protein structure levels, check out the protein folding simulation by LabXchange, which uses hemoglobin as an example and describes the molecular structure in more detail.

As mentioned above, a protein’s shape is critical to its function. For example, an enzyme can bind to a specific substrate at an active site. If this active site is altered because of local changes or changes in overall protein structure, the enzyme may be unable to bind to the substrate. To understand how the protein gets its final shape or conformation, we need to understand the four levels of protein structure: primary, secondary, tertiary, and quaternary. See the image below and click on the information hotspots (labeled with an “i”) for explanations.

As seen in the image above, a strand of amino acids folds on itself, creating a unique shape in the tertiary structure of the protein. This is caused by the chemical properties of the amino acids. The chemical properties of the amino acids determine how this shape occurs. For instance, each amino acid is negatively (-), positively (+), or neutrally (N) charged. Negatively charged amino acids bind with positively charged amino acids (neutrally charged amino acids are not affected). Also, the amino acid called cysteine contains sulfur and sulfurs easily bind with each other, creating a “disulfide bond.” Because of this, cysteines bind with other cysteines. See the table below for a list of all 20 amino acids and their charges. There are other properties that also influence a protein’s shape, such as the amino acid’s polarity. Note that these bonds are not as strong as what is created between amino acids when an amino acid chain is created, but these bonds are strong enough to hold the shape in the protein.

Amino Acid | 3-Letter Abbrev. | 1-Letter Abbrev. | Charge | Disulfide Bond Formation?
Alanine | Ala | A | Neutral | No
Arginine | Arg | R | (+) | No
Asparagine | Asn | N | Neutral | No
Aspartate (Aspartic acid) | Asp | D | (-) | No
Cysteine | Cys | C | Neutral | Yes
Glutamine | Gln | Q | Neutral | No
Glutamate (Glutamic acid) | Glu | E | (-) | No
Glycine | Gly | G | Neutral | No
Histidine | His | H | (+) | No
Isoleucine | Ile | I | Neutral | No
Leucine | Leu | L | Neutral | No
Lysine | Lys | K | (+) | No
Methionine | Met | M | Neutral | No
Phenylalanine | Phe | F | Neutral | No
Proline | Pro | P | Neutral | No
Serine | Ser | S | Neutral | No
Threonine | Thr | T | Neutral | No
Tryptophan | Trp | W | Neutral | No
Tyrosine | Tyr | Y | Neutral | No
Valine | Val | V | Neutral | No

Use the chart above to determine which amino acids may bond together to form the tertiary structure.
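
As a rough aid for this exercise, the chart's two rules (opposite charges attract, cysteines pair with cysteines) can be written as a lookup. This is a simplified Python sketch of the lesson's model only, using the charges exactly as listed in the chart; real tertiary structure also depends on polarity, hydrophobicity, and geometry. The names are illustrative.

```python
# Charges as given in the chart, keyed by one-letter abbreviation.
CHARGE = {
    "A": 0, "R": +1, "N": 0, "D": -1, "C": 0, "Q": 0, "E": -1, "G": 0,
    "H": +1, "I": 0, "L": 0, "K": +1, "M": 0, "F": 0, "P": 0, "S": 0,
    "T": 0, "W": 0, "Y": 0, "V": 0,
}


def side_chain_interaction(first, second):
    """Return the interaction the chart predicts for two residues, or None."""
    if first == "C" and second == "C":
        return "disulfide bond"
    # Opposite signs give a negative product.
    if CHARGE[first] * CHARGE[second] < 0:
        return "ionic attraction"
    return None


print(side_chain_interaction("D", "K"))  # aspartate (-) with lysine (+)
print(side_chain_interaction("C", "C"))  # two cysteines
print(side_chain_interaction("A", "G"))  # two neutral residues
```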

Here is an example of a polypeptide model depicting how charges influence the tertiary structure. The first and second images are the same, except the second image has hotspots with additional information marked with a question mark (?). The key at the bottom of the image is necessary for interpreting the image.

To best understand your protein synthesis worksheet, let’s cover the complete protein synthesis process. It starts with transcription. Special enzymes in the nucleus arrive to gently pull apart the DNA code needed, and RNA begins to transcribe or rewrite the genetic material.

During translation, the mRNA connects with the ribosome and its information is decoded again so that the correct sequence of amino acids will connect to form a polypeptide. It’s important to note here that the ribosome doesn’t make amino acids. It simply joins already-made amino acids in the correct sequence.

The amino acids’ sequence determines its protein’s shape, function, and properties and it can do so thanks to the RNA’s four bases (all of which are nucleotides): adenine (A), cytosine (C), guanine (G), and uracil (U). A codon, as we explained earlier, is a combination of three of these bases in a specific order: UUC, for example.

Some codons tell the ribosome to start (AUG, which also codes for methionine) or stop (UAA, UAG, and UGA), and the rest indicate specific amino acids.
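
The two steps can be sketched in a few lines of Python. This is an illustration only: it assumes the input is the DNA coding strand (so transcription amounts to replacing T with U), uses the standard genetic code, and starts reading at the first AUG. The function names are made up for this example.

```python
# Standard genetic code for RNA, built in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
RNA_CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)  # '*' marks the stop codons UAA, UAG and UGA


def transcribe(coding_strand):
    """Write the mRNA corresponding to a DNA coding strand."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    polypeptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = RNA_CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        polypeptide.append(amino_acid)
    return "".join(polypeptide)


mrna = transcribe("ATGTTCTAA")
print(mrna)             # AUG, UUC, UAA
print(translate(mrna))  # Met then Phe, in one-letter code
```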


Sequence features related to translation initiation:

Shine–Dalgarno sequences:

To determine the effects of SD sequences, we need to be certain that the same mechanism of translation initiation also exists in D. vulgaris. To confirm this, we systematically calculated the average free energy of the potential base pairing of 16S rRNA 3′ tail and upstream of D. vulgaris genes with the free-energy-based method (Osada et al. 1999). A sharp drop of the free energy was observed in the region of −25 to ∼−5 in all genes in the D. vulgaris chromosome and megaplasmid (Figure 1). This trend is highly similar to what has been observed in E. coli (Osada et al. 1999), suggesting that Osada's method works very well in D. vulgaris as well. It also indicates that the same mechanism of translation initiation is employed in D. vulgaris as in most prokaryotes. Interestingly, the average free energy by position is lower in D. vulgaris than in E. coli (Figure 1), which could be due to the high GC composition of the D. vulgaris genome.

Sharp decreases in free energy around positions −25 and −5 (relative to the start codon) are observed in D. vulgaris (DVU) and the megaplasmid of D. vulgaris (DVUA) and in E. coli (ECOLI). The free energy of all genes on each chromosome was averaged by position.

Using the same free-energy-based method (Osada et al. 1999), we calculated the MFE of 16S rRNA–SD base pairing for all D. vulgaris genes. As one might expect, the length of the 16S rRNA 3′ tail and upstream sequences might affect the free energy value calculated. To test this effect, we chose three lengths (13, 20, and 23) of 16S rRNA 3′ tail and two lengths (25 and 50) of mRNA upstream sequences and calculated the MFE for various combinations. In addition to the free-energy-based method, we also employed a probabilistic method to quantify the strength of RBS (Suzek et al. 2001). The putative SD sequence was found with a RBS score higher than the threshold value (Suzek et al. 2001). The result showed that the contribution varied considerably among various methods of SD MFE calculations, with MFE20_50 (20 refers to the length of 16S rRNA 3′ tail and 50 refers to the length of the upstream sequence of genes) explaining the highest variations in mRNA–protein correlation (up to 1.9–3.8% among various data sets, P-value <0.001; Table 1, supplemental Table 1). This result also showed that the SD effect calculated by the probabilistic method does not seem to explain much of the variation in mRNA–protein correlation (mostly ∼1.0%, supplemental Table 1). In addition, only a weak correlation between RBS score and MFE20_50 was observed (r = −0.27, P < 0.0001), suggesting that these two methods may be different in evaluating SD sequences. In this study hereafter, MFE20_50 was used in all analyses involved.

Contribution of various features to the total variation of mRNA–protein correlation

A recent study of E. coli showed that the SD MFE values in highly expressed genes (defined as genes whose protein product can be detected on 2D gels) were lower than in other genes (Lithwick and Margalit 2003). We previously found that the D. vulgaris proteins detected by LC–MS/MS represented the highly expressed genes (supplemental Figure 1; Zhang et al. 2006c). When the MFE20_50 value of the detected proteins was compared with that of all proteins in the D. vulgaris genome, a similar frequency distribution pattern was found (Figure 2), suggesting that overall there was no strong evidence for lower MFE in proteins identified in LC–MS/MS.

Frequency distribution of SD–rRNA interaction minimum free energy for all D. vulgaris proteins and proteins identified in three growth conditions. LL, lactate-log phase; FL, formate-log phase; LS, lactate-stationary phase.

We previously found that mRNA–protein correlation may be different among various functional categories (Nie et al. 2006a). Given the fact that MFE values may be related to mRNA expression and protein abundance, one immediate question is whether MFE values also varied among functional categories of a given gene/protein. The results showed that while most of the functional categories shared similar levels of MFE values, genes from several functional categories, such as amino acid biosynthesis, central intermediary metabolism, energy metabolism, protein fate, and protein synthesis, had lower MFE values (Figure 3, category I). This pattern appears to be consistent when we used corresponding genes of the proteins identified under the three growth conditions examined in this study (Figure 3, categories II, III, and IV).

Minimum free energy of SD–rRNA interaction of genes belonging to various functional categories: (I) all genes, (II) lactate-log condition, (III) formate-log condition, and (IV) lactate-stationary condition. The cellular functional categories are: A, amino acid biosynthesis; B, biosynthesis of cofactors and carriers; C, cell envelope; D, cellular processes; E, central intermediary metabolism; F, DNA metabolism; G, disrupted reading frame; H, energy metabolism; I, fatty acid and phospholipid metabolism; J, hypothetical proteins; K, other categories; L, protein fate; M, protein synthesis; N, biosynthesis of purines and pyrimidines; O, regulatory functions; P, signal transduction; Q, transcription; R, transport and binding proteins; and S, unknown function. The functional categories without any proteins detected are left blank in II, III, and IV. The central, top, and bottom horizontal lines in the plots represent the mean, plus and minus the standard deviation for each function category.

Start codons:

In the D. vulgaris genome, the start codon ATG is the most frequent start codon, consistent with the early conclusion that ATG is a preferred start codon, independent of the G + C content (Rocha et al. 1999). Approximately 82% of the D. vulgaris genes start with this codon, while the less frequently used TTG and GTG codons are found in 5.4 and 13% of the genes, respectively (Figure 4A). This strong bias in start codons was also observed in the proteins detected in various growth conditions (Figure 4A). This observation confirms that the canonical start codon ATG is more translationally optimal than noncanonical start codons. However, overall, the start codon identity explained only 0.1–0.7% of the total variation in mRNA–protein relationship under the three growth conditions as indicated by regression analyses (Table 1).
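
A tally of start codon usage like the one summarized above can be sketched in Python as follows. This is an illustration only, not the authors' pipeline: it assumes each gene is supplied as a coding sequence string beginning at its annotated start codon, and the gene list shown is made-up toy data.

```python
from collections import Counter


def start_codon_usage(coding_sequences):
    """Return the fraction of sequences beginning with each start codon."""
    counts = Counter(
        seq[:3].upper() for seq in coding_sequences if len(seq) >= 3
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Toy input (hypothetical sequences, not D. vulgaris data).
toy_genes = ["ATGAAATAA", "ATGCCCTGA", "GTGAAATAG", "TTGAAATAA"]
for codon, fraction in sorted(start_codon_usage(toy_genes).items()):
    print(codon, f"{fraction:.0%}")
```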

(A) Frequency distribution of start codon usage in all D. vulgaris proteins and proteins identified from three growth conditions. (B) Frequency distribution of minimum free energy of predicted mRNA secondary structure at start codon context of all D. vulgaris proteins and proteins identified from three growth conditions.

Start codon context:

Studies showed that stem–loop structures formed at the start site can affect the accessibility of the SD sequence or start codon for ribosomal binding (de Smit and van Duin 1990, 1994; Rocha et al. 1999). To determine the potential effects of these mRNA secondary structures on protein translation, we computed the minimum free energy for the 60-base sequences spanning the start codon with RNAfold (Hofacker 2003; Hofacker and Stadler 2006). We found that proteins detected in our study tend to have relatively high MFE values compared with all proteins in the D. vulgaris genome (Figure 4B), suggesting that avoidance of mRNA secondary structure might be a strategy for genes to achieve a high translation rate. Overall, the start codon context explains 0.3–2.5% of the variation in mRNA–protein correlation under the three growth conditions (Table 1), which is larger than that explained by start codons alone.

Sequence features related to translation elongation:

Codon usage pattern:

The codon usage in the G + C-rich D. vulgaris genome has not been fully investigated before (Heidelberg et al. 2004). In this study, two approaches were employed to investigate the unequal codon usage in D. vulgaris. The first approach is to compare the extent to which codon usage deviates from the expected frequency between detected proteins and all proteins, to determine which codons are associated with protein translation. Due to the high GC composition of the D. vulgaris genome, the observed unequal frequency of synonymous codons can be simply a result of biased base composition due to mutational bias (Knight et al. 2001). Therefore, we used POD to measure how much the observed codon frequency deviated from the expected frequency determined on the basis of the base composition alone. Briefly, a positive POD value suggests an overrepresentation of the codon, whereas a negative POD indicates an underrepresentation. If the codon usage is associated with gene expression level, we would expect to see significantly different POD values between detected proteins and all proteins. Indeed, we found that except for the Glu and Val codon families, the POD values of at least 1 codon within all the other 16 sense codon families (Met and Trp are excluded because they have only 1 codon) were significantly different (P-values <0.0001) between detected proteins and all proteins (Figure 5, A and B). In most cases, the absolute values of POD are greater in detected proteins than in all proteins, suggesting a stronger bias for these codons in detected proteins. This observation is consistent with previous findings in E. coli and Saccharomyces cerevisiae (Ikemura 1981, 1982, 1985). Taken together, these results indicate that codon usage in D. vulgaris is strongly associated with protein expression.

Comparison of the preferences of synonymous codon usage between all D. vulgaris proteins and proteins identified from three growth conditions. (A) Twofold codon families. (B) Multifold codon families.

The second approach is a correspondence analysis of the RSCU. The first four major trends in codon usage (CR1 to -4) determined by correspondence analysis accounted for 17.1, 4.4, 3.9, and 3.7% of the total variation in RSCU of all D. vulgaris genes, respectively (Wu et al. 2006). We included the first four major axes in codon usage of the D. vulgaris genome in a multiple-regression analysis. The result showed that codon usage alone contributed to 5.3–15.7% of the total variation in mRNA–protein correlation under the three conditions (P-value <0.001) (Table 1).
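As a rough illustration of the quantity being analyzed, RSCU (relative synonymous codon usage) is conventionally the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below assumes that conventional definition and the standard genetic code; the function name rscu and the toy sequence are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, DNA alphabet, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(coding_sequence):
    """Return {codon: RSCU} for every amino acid that occurs in the sequence."""
    seq = coding_sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group sense codons into synonymous families (stop codons are skipped).
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent: RSCU undefined, so it is omitted
        expected = total / len(family)  # count under equal usage
        for c in family:
            values[c] = counts[c] / expected
    return values

# Toy alanine-only sequence: GCT x2, GCC x1, GCA x1, GCG x0.
toy = rscu("GCTGCTGCCGCA")
```

A value above 1 marks a codon used more often than equal usage would predict, and a value below 1 marks one used less often. A correspondence analysis such as the one described above would take a genes-by-codons matrix of these values as its input.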

Amino acid usage:

It was observed that the usage of some amino acids was correlated with gene expression (Akashi 2003; Chanda et al. 2005; Schaber et al. 2005). To determine whether this is also the case for amino acid usage in D. vulgaris, we applied an approach similar to the one used for codon usage, as described in the previous section. The only difference was that this time the frequencies of all synonymous codons were summed. Therefore, if the usage of an amino acid is associated with protein expression, we expect the POD value of this amino acid in detected proteins to be different from that in all proteins. As expected, the POD values of amino acids Asn, Asp, Glu, Ile, Tyr, and Lys in detected proteins are significantly higher than those in all proteins (P-values <0.0001 for all these amino acids) (Figure 6), suggesting that these amino acids are preferred in more abundant proteins.

Comparison of the preferences of amino acid usage between all D. vulgaris proteins and proteins identified from three growth conditions.

Given that amino acid usage in D. vulgaris is associated with protein expression, we employed a multiple-regression analysis to determine its relative contribution to mRNA–protein correlation. A correspondence analysis of the deduced protein sequences of 3522 genes in D. vulgaris was performed to reveal the major trends. The first four trends (AA1 to -4) identified by correspondence analysis explain 18.6, 11.5, 9.7, and 7.0% of the variability in D. vulgaris amino acid usage, respectively (Wu et al. 2006). When we included the first four major axes in amino acid usage of the D. vulgaris genome in a multiple-regression analysis, the result showed that amino acid usage alone contributed to 5.8–11.9% of the total variation in mRNA–protein correlation under the three growth conditions (P-value <0.001) (Table 1).

Sequence features related to translation termination:

Stop codons:

In D. vulgaris, three types of stop codons are all commonly used, with ∼45, 41, and 14% among all genes for TGA, TAG, and TAA, respectively (Figure 7A). However, among the proteins detected under our three growth conditions, TAG seems to be the most preferred stop codon: ∼56–59% of protein-encoding genes use TAG as a stop codon (Figure 7A). Regression analysis showed that stop codon identity was an important feature affecting mRNA–protein correlation. It alone explained 1.3–2.3% of total variation in mRNA–protein correlation under the three growth conditions (Table 1).

(A) Frequency distribution of stop codon usage in all D. vulgaris proteins and proteins identified from three growth conditions. (B) Frequency distribution of tetranucleotide termination signals in all D. vulgaris proteins and proteins identified from three growth conditions.

Stop codon context:

The base immediately downstream of the stop codon was also investigated for its role in translation termination. In D. vulgaris, “C” appears to be preferred after all three stop codons (Figure 7B). The most frequent stop signal is TAGC, which is even higher in proteins identified under the three conditions, suggesting that it might be the optimal stop signal. Simple regression analysis confirmed that stop codon context was an important feature affecting mRNA–protein correlation. It alone contributed 3.7–5.1% of total variation in mRNA–protein correlation under the three growth conditions (Table 1).
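The tetranucleotide stop signal described here is simply the stop codon plus the base that follows it. A minimal sketch of how such signals could be tallied is given below; the input format (a coding sequence paired with its single downstream base), the function name, and the toy records are all assumptions made for illustration, not the study's pipeline.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_signal_counts(records):
    """Count stop-codon-plus-one-base signals.

    records: iterable of (coding_sequence, downstream_base) pairs, where the
    coding sequence ends with its stop codon (hypothetical input format).
    """
    counts = Counter()
    for cds, downstream in records:
        stop = cds[-3:].upper()
        if stop in STOP_CODONS and len(downstream) == 1:
            counts[stop + downstream.upper()] += 1
    return counts

toy_records = [("ATGAAATAG", "C"), ("ATGCCCTAG", "C"), ("ATGGGGTGA", "A")]
signal_counts = stop_signal_counts(toy_records)
```

Dividing each count by the total number of genes gives the kind of frequency distribution shown for the tetranucleotide termination signals.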

Multiple-regression analysis:

To quantitatively determine the relative contribution of each translation efficiency-related feature on mRNA–protein correlation, all sequence features studied above were integrated into a multiple-regression analysis. The results showed that these features together accounted for ∼15.2–26.2% of the total variation in mRNA–protein correlation (Table 1). The P-values of this regression model, for all three conditions, are all <0.00001, which indicates that contribution of these features to the mRNA–protein variation is significant. In addition, the result is very consistent under all three growth conditions.

To further evaluate the multiple-regression model itself, and to verify that the contribution resulting from this multiple regression is genuine, we ran two bootstrap tests by keeping sequence features unchanged for all genes, while randomly permuting their proteomic abundance among the genes so that the proteomic abundance of a given gene is randomly assigned to a different gene. The bootstrap tests were run by randomly selecting 1000 permutations for each test. For each permutation, a multiple regression was fitted and was reported as we did for the real data. The bootstrap P-value is reported as the probability that the simulated R² is larger than the R² associated with the real data. A smaller P-value suggests that the R² obtained for the real model is statistically more significant. The two null models for the bootstrap tests were that (1) the contribution by mRNA levels and all sequence features is no larger than that by the mRNA level alone and (2) the contribution by mRNA levels and all sequence features is no larger than that by mRNA levels and initiation-related sequence features (excluding elongation- and termination-related sequence features), respectively. The results showed that the P-values from the first bootstrap analysis are 0 for the contributions computed under all three growth conditions, and the P-value for the second null hypothesis is <0.0001 for all three conditions. The results demonstrated that the correlation of mRNA expression and protein abundance was affected at a significant level by multiple sequence features related to translational efficiency in D. vulgaris. Among them, amino acid usage and codon usage are the top two factors, followed by stop codon context and SD sequences (Table 1).
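The permutation procedure can be sketched in a few lines. The version below is deliberately simplified and rests on stated assumptions: it uses a single predictor with ordinary least squares rather than the multiple regression used in the study, it takes R² as the statistic being compared (an assumption of this sketch), and the function names, seed, and toy data are hypothetical.

```python
import random

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variance, so nothing can be explained
    return sxy * sxy / (sxx * syy)

def permutation_p_value(feature, abundance, n_permutations=1000, seed=0):
    """Keep the feature fixed, shuffle abundance among genes, and report the
    fraction of shuffles whose R^2 exceeds the R^2 of the real data."""
    rng = random.Random(seed)
    observed = r_squared(feature, abundance)
    shuffled = list(abundance)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if r_squared(feature, shuffled) > observed:
            exceed += 1
    return observed, exceed / n_permutations

# Toy data: abundance is an exact linear function of the feature.
feature = [float(i) for i in range(10)]
abundance = [2.0 * v + 1.0 for v in feature]
observed_r2, p_value = permutation_p_value(feature, abundance)
```

Because the abundance values are reassigned among genes while the sequence features stay fixed, any association that survives in the shuffled data is due to chance, which is what makes the shuffled R² values a usable null distribution.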

9.6 | The Genetic Code

The cellular process of transcription generates messenger RNA (mRNA), a mobile molecular copy of one or more genes with an alphabet of A, C, G, and uracil (U). Translation of the mRNA template converts nucleotide-based genetic information into a protein product. Protein sequences consist of 20 commonly occurring amino acids; therefore, it can be said that the protein alphabet consists of 20 letters (Figure 9.21). Each amino acid is defined by a three-nucleotide sequence called the triplet codon. Different amino acids have different chemistries (such as acidic versus basic, or polar versus nonpolar) and different structural constraints. Variation in amino acid sequence gives rise to enormous variation in protein structure and function.

Figure 9.21 Structures of the 20 amino acids found in proteins are shown. Each amino acid is composed of an amino group (NH3+), a carboxyl group (COO-), and a side chain (blue). The side chain may be nonpolar, polar, or charged, as well as large or small. It is the variety of amino acid side chains that gives rise to the incredible variation of protein structure and function.

The Central Dogma: DNA Encodes RNA; RNA Encodes Protein

The flow of genetic information in cells from DNA to mRNA to protein is described by the Central Dogma (Figure 9.22), which states that genes specify the sequence of mRNAs, which in turn specify the sequence of proteins. The decoding of one molecule to another is performed by specific proteins and RNAs. Because the information stored in DNA is so central to cellular function, it makes intuitive sense that the cell would make mRNA copies of this information for protein synthesis, while keeping the DNA itself intact and protected. The copying of DNA to RNA is relatively straightforward, with one nucleotide being added to the mRNA strand for every nucleotide read in the DNA strand. The translation to protein is a bit more complex because three mRNA nucleotides correspond to one amino acid in the polypeptide sequence. However, the translation to protein is still systematic and colinear, such that nucleotides 1 to 3 correspond to amino acid 1, nucleotides 4 to 6 correspond to amino acid 2, and so on.
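The colinear mapping can be written as simple index arithmetic: amino acid k is specified by nucleotides 3k-2 through 3k of the coding region (counting from 1). The short sketch below illustrates this; the function name and the example sequence are hypothetical.

```python
def codon_for_residue(coding_region, k):
    """Return the codon for amino acid k (1-based), i.e. nucleotides 3k-2..3k."""
    start = 3 * (k - 1)  # 0-based index of the codon's first nucleotide
    return coding_region[start:start + 3]

coding_region = "AUGGGUGUA"
first = codon_for_residue(coding_region, 1)   # nucleotides 1 to 3
second = codon_for_residue(coding_region, 2)  # nucleotides 4 to 6
```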

Figure 9.22 Instructions on DNA are transcribed onto messenger RNA. Ribosomes are able to read the genetic information inscribed on a strand of messenger RNA and use this information to string amino acids together into a protein.

The Genetic Code Is Degenerate and Universal

Given the different numbers of “letters” in the mRNA and protein “alphabets,” scientists theorized that combinations of nucleotides corresponded to single amino acids. Nucleotide doublets would not be sufficient to specify every amino acid because there are only 16 possible two-nucleotide combinations (4^2).

In contrast, there are 64 possible nucleotide triplets (4^3), which is far more than the number of amino acids. Scientists theorized that amino acids were encoded by nucleotide triplets and that the genetic code was degenerate. In other words, a given amino acid could be encoded by more than one nucleotide triplet. This was later confirmed experimentally: Francis Crick and Sydney Brenner used the chemical mutagen proflavin to insert one, two, or three nucleotides into the gene of a virus. When one or two nucleotides were inserted, protein synthesis was completely abolished. When three nucleotides were inserted, the protein was synthesized and functional. This demonstrated that three nucleotides specify each amino acid. These nucleotide triplets are called codons. The insertion of one or two nucleotides completely changed the triplet reading frame, thereby altering the message for every subsequent amino acid (Figure 9.24). Though insertion of three nucleotides caused an extra amino acid to be inserted during translation, the integrity of the rest of the protein was maintained. Scientists painstakingly solved the genetic code by translating synthetic mRNAs in vitro and sequencing the proteins they specified (Figure 9.23).
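Both the counting argument and the reading-frame argument can be checked with a short sketch. The sequence and the inserted bases below are invented for illustration and are not the sequences used in the original experiments.

```python
from itertools import product

NUCLEOTIDES = "ACGU"

# Counting argument: 16 doublets cannot cover 20 amino acids, 64 triplets can.
doublets = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
triplets = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]

def codons(seq):
    """Split a sequence into complete triplets, dropping any leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def insert(seq, position, bases):
    """Insert bases into seq before the given 0-based position."""
    return seq[:position] + bases + seq[position:]

original = "AUGGCUGAAUUU"
plus_one = insert(original, 3, "C")      # shifts the reading frame
plus_three = insert(original, 3, "CCC")  # adds one codon, keeps the frame

frame_original = codons(original)      # ['AUG', 'GCU', 'GAA', 'UUU']
frame_plus_one = codons(plus_one)      # ['AUG', 'CGC', 'UGA', 'AUU']
frame_plus_three = codons(plus_three)  # ['AUG', 'CCC', 'GCU', 'GAA', 'UUU']
```

After the single-base insertion every downstream codon differs from the original, whereas after the three-base insertion the original codons reappear unchanged behind one extra codon.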

Figure 9.23 This figure shows the genetic code for translating each nucleotide triplet in mRNA into an amino acid or a termination signal in a nascent protein. (credit: modification of work by NIH)

The genetic code is universal. With a few exceptions, virtually all species use the same genetic code for protein synthesis. Conservation of codons means that a purified mRNA encoding the globin protein in horses could be transferred to a tulip cell, and the tulip would synthesize horse globin. That there is only one genetic code is powerful evidence that all of life on Earth shares a common origin, especially considering that there are about 10^84 possible combinations of 20 amino acids and 64 triplet codons.


Figure 9.25 The deletion of two nucleotides shifts the reading frame of an mRNA and changes the entire protein message, creating a nonfunctional protein or terminating protein synthesis altogether.

Degeneracy is believed to be a cellular mechanism to reduce the negative impact of random mutations. Codons that specify the same amino acid typically differ by only one nucleotide. In addition, amino acids with chemically similar side chains are encoded by similar codons. This nuance of the genetic code means that a single-nucleotide substitution mutation might either specify the same amino acid, and so have no effect, or specify a similar amino acid, preventing the protein from being rendered completely nonfunctional.
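Degeneracy is also why a protein sequence cannot be read backwards into a unique nucleotide sequence: the number of candidate mRNAs is the product of the number of codons available for each amino acid. The sketch below assumes the standard genetic code and one-letter amino acid symbols; the function name count_encodings is hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code, RNA alphabet, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# How many codons specify each amino acid (stop codons excluded).
CODONS_PER_AA = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

def count_encodings(peptide):
    """Number of distinct coding sequences that translate to this peptide."""
    return prod(CODONS_PER_AA[aa] for aa in peptide)

# Met has 1 codon, Gly has 4, Val has 4.
n_mgv = count_encodings("MGV")  # 1 * 4 * 4 = 16
```

Even a three-residue peptide already has 16 possible coding sequences, and the count grows multiplicatively with length.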


This page looks at how the information coded in messenger RNA is used to build protein chains. It is designed for 16 - 18 year old chemistry students. If you are a biochemistry or biology student, you will probably find it a useful introduction, but will have to look elsewhere to find all the detail you need.

Note: If you have come straight to this page from a search engine, you should be aware that this is the fifth page in a sequence of pages about DNA and RNA. It will all make more sense if you start from the beginning of the sequence with the structure of DNA.

From messenger RNA to a protein chain

A quick overview of the process

You will remember that messenger RNA contains a sequence of bases which, read three at a time, code for the amino acids used to make protein chains. Each of the sets of three bases is known as a codon. The table below repeats one from the previous page:

You may also remember that three codons serve as stop codons, and one (AUG) codes for methionine, but also serves as a start codon. We'll look in some detail at how this works further down the page.

Translating the code into an actual protein chain is complicated by the fact that individual amino acids won't interact with the messenger RNA chain. The amino acids have to be carried to the messenger RNA by another type of RNA known as transfer RNA - abbreviated to tRNA (as opposed to mRNA for messenger RNA).

All of this is controlled by a ribosome - a hugely complicated structure involving protein molecules and yet another form of RNA (ribosomal RNA or rRNA).

This is going to be quite complicated. We'll take it gently and simplify it where possible.

Finding the start point of the messenger RNA

You may find descriptions of this process which imply (although without actually saying so directly) that the messenger RNA starts with the codon AUG - the start codon - at the 5' end. That's not so!

There is a length of RNA upstream of the start codon which isn't actually used to build the protein chain. So how does the system know where to start? How does it find the right AUG codon from all the ones which are probably strung out along the RNA to code for the amino acid methionine?

Note: If you don't know what I mean by the 5' end or by upstream, it is probably because you haven't read these pages from the beginning. Taking short cuts is rarely a good idea!

Ribosomes come in two parts - a smaller bit and a larger bit. The smaller bit is involved in finding the start point.

It attaches to the 5' end of the messenger RNA and moves along it until it comes to a particular pattern of bases which it can bind to. This pattern occurs just before the first occurrence of the AUG codon in the messenger RNA strand.

The ribosome now has to build the protein chain starting with a methionine at the AUG codon it has just found.

Before we can talk about that we have to introduce transfer RNA . . .

Transfer RNA (tRNA) is responsible for carrying amino acids to the messenger RNA and then holding them there in a way that enables them to join together.

Transfer RNA is a short bit of RNA containing about 80 or so bases. These are mostly the same bases as in messenger RNA (A, U, G and C), but it also contains some modified bases which won't concern us at this level. A model of a typical transfer RNA looks like this:

Note: This diagram comes from Wikipedia. Most of the colour coding is irrelevant to this discussion - but note the little bit in grey at the bottom which is where the anti-codon is (see below).

This model is quite difficult to follow, and I am going to simplify it down to pick out the two important bits of it.

At the 3' end of every transfer RNA molecule, the chain ends with the sequence of bases C C A. Remember that the bases in RNA and DNA are attached to a backbone of alternating phosphate and sugar groups. At the very end of the chain is the -OH group on the 3' carbon of a ribose ring.

The amino acid gets attached to this by forming an ester link between this -OH group and the -COOH group of the amino acid. This is no different from the formation of an ester between, say, ethanol and ethanoic acid - except that it is carried out under the influence of an enzyme rather than the more fierce conditions used in the lab.

Note: How the amino acid attaches to the tRNA is just for interest. It is unlikely to be needed for exam purposes at this level.

The other important bit is at the bottom of the molecule, shown in grey in the model. This is known as an anti-codon. As you will see shortly, the anti-codon attaches the transfer RNA with its amino acid to the right place on the messenger RNA molecule.

For chemistry purposes, all we are interested in is the attached amino acid, and the anti-codon, so we can simplify the whole thing down. Here is a very simplified diagram showing the transfer RNA for methionine with the methionine attached:

In the diagram, the anti-codon is for the amino acid methionine. The messenger RNA code for methionine is AUG. If you look at the code in the anti-codon for methionine, it is UAC. That is exactly complementary to AUG. The U in the anti-codon will pair with the A in the messenger RNA; the A in the anti-codon pairs with the U in the mRNA; and the C in the anti-codon pairs with the G in the mRNA.

How does a transfer RNA molecule pick up the right amino acid? This is all under control of enzymes which recognise the shapes of the various amino acid and tRNA molecules and make sure that they pair up properly.

Translation is the name given to the process of turning the coded message in the messenger RNA into the final protein chain.

We left the messenger RNA a little while back with part of a ribosome attached to it at the AUG start codon. The diagram shows this, together with a small part of the RNA base sequence downstream of the start codon needed to make an imaginary protein chain. The bases upstream of the start codon aren't relevant to us once the ribosome has found the place to start from.

None of these diagrams are drawn to scale!

Now two things happen. The transfer RNA carrying a methionine attaches itself to the AUG codon by pairing its anti-codon bases with the complementary bases on the messenger RNA. And the second, bigger part of the ribosome attaches to the system as well.

Now another transfer RNA molecule with its attached amino acid binds to the next codon along the chain. The next codon on the messenger RNA is GGU which codes for glycine (Gly). The anti-codon would therefore have to be CCA. Remember that A pairs with U, and G pairs with C.

Next, the ribosome moves along the messenger RNA chain to the next codon. At the same time a peptide bond is made between the two amino acids, and the first one (the methionine) breaks away from its transfer RNA.

That transfer RNA molecule leaves the ribosome and goes off to pick up another methionine.

Now the process repeats. The next codon is GUA which codes for valine (Val). The anti-codon must be CAU. (If you can't see this at once, stop and think about it. Don't go on until you are happy that you could work out the anti-codon for every codon, and vice versa.)
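If you want to check your anti-codons, the pairing rule can be written as a tiny function. This is only a sketch: it writes the anti-codon lined up base-for-base against the codon, exactly as this page does, and the function name is made up for the example.

```python
# Base pairing in RNA: A with U, and G with C.
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon(codon):
    """Return the anti-codon, written aligned base-for-base with the codon."""
    return "".join(PAIR[base] for base in codon)

for codon in ("AUG", "GGU", "GUA"):
    print(codon, "pairs with", anticodon(codon))
```

Applying the function twice gets you back to the codon you started with, which is the "vice versa" part.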

And again, the ribosome moves forward one codon, a new peptide bond is formed, and the transfer RNA on the left breaks away to be used again later.

And the next transfer RNA with its amino acid comes along . . .

Eventually, of course, this will come to an end. How?

Eventually, the ribosome will come to a stop codon. The three stop codons don't code for any amino acids, and so the process comes to a halt. The protein chain produced up to that point is then released from the ribosome, and then folds itself up into its secondary and tertiary structures.
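The whole sequence of events, from finding the start codon to halting at a stop codon, can be imitated in a short sketch. It makes two simplifying assumptions: it uses the standard genetic code with one-letter amino acid symbols, and it simply starts at the first AUG rather than modelling the binding pattern that the ribosome actually recognises. The function name and the example RNA are invented.

```python
from itertools import product

# Standard genetic code, RNA alphabet, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_from_first_aug(mrna):
    """Translate from the first AUG until a stop codon or the end of the RNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is made
    chain = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the chain is released
            break
        chain.append(amino_acid)
    return "".join(chain)

# Two bases upstream of the start codon, then AUG GGU GUA, then the stop UAA.
chain = translate_from_first_aug("GGAUGGGUGUAUAAGG")  # "MGV"
```

The bases before AUG and after UAA never show up in the chain, so you cannot recover them by looking at the protein alone.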

The final page in this sequence looks very briefly at what happens when the code in DNA becomes changed in some way . . .


Genetic code

The genetic code is the set of rules by which information encoded in genetic material (DNA or RNA sequences) is translated into proteins (amino acid sequences) by living cells.

Specifically, the code defines a mapping between tri-nucleotide sequences called codons and amino acids; every triplet of nucleotides in a nucleic acid sequence specifies a single amino acid.

Because the vast majority of genes are encoded with exactly the same code, this particular code is often referred to as the canonical or standard genetic code, or simply the genetic code, though in fact there are many variant codes; thus, the canonical genetic code is not universal.

For example, in humans, protein synthesis in mitochondria relies on a genetic code that varies from the canonical code.

The genome of an organism is inscribed in DNA, or in some viruses RNA.

The portion of the genome that codes for a protein or an RNA is referred to as a gene.

Those genes that code for proteins are composed of tri-nucleotide units called codons, each coding for a single amino acid.

Each nucleotide sub-unit consists of a phosphate, deoxyribose sugar and one of the 4 nitrogenous nucleotide bases.

The purine bases adenine (A) and guanine (G) are larger and consist of two aromatic rings.

The pyrimidine bases cytosine (C) and thymine (T) are smaller and consist of only one aromatic ring.

In the double-helix configuration, two strands of DNA are joined to each other by hydrogen bonds in an arrangement known as base pairing.

These bonds almost always form between an adenine base on one strand and a thymine on the other strand and between a cytosine base on one strand and a guanine base on the other.

This means that the number of A and T residues will be the same in a given double helix as will the number of G and C residues.
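This equality follows directly from the pairing rule and can be checked with a short sketch. The strand below is invented, and strand orientation is ignored because only the base counts matter here.

```python
from collections import Counter

# Base pairing in DNA: A with T, and G with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def double_helix_counts(strand):
    """Count bases across both strands of the helix built from one strand."""
    partner = "".join(PAIR[base] for base in strand)
    return Counter(strand + partner)

counts = double_helix_counts("ATGGCATTC")
# counts["A"] == counts["T"] and counts["G"] == counts["C"]
```

The equality holds for the double helix as a whole; a single strand taken on its own need not contain equal numbers of A and T.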

In RNA, thymine (T) is replaced by uracil (U), and the deoxyribose is substituted by ribose.