Richard Gayle
UTR Nonsense June 16, 2000
A lot of the biotech community's focus is on the complete sequencing of the human genome. Genomics deals with, not too surprisingly, genes. We are all searching the databases hoping to find novel genes that will be important therapeutics. And rightly so. But we must remember that genes mean nothing by themselves, at least for most living organisms. Their DNA is not useful unless it is transcribed into mRNA. The mRNA is generally useless unless it is translated into protein. Understanding the effects of the 5' and 3' non-coding regions on gene expression will become even more important.
Almost every mRNA made in a eukaryotic cell has some amount of 3' untranslated region (UTR). This is a portion that does not code for protein but contains control regions dealing with such things as mRNA stability. We all know about poly-A addition signals, but there is a large group of other controls that we know of today, and I'm sure we will find more. Some act in trans and others act in cis, and they are involved in a wide range of processes. But I want to talk about some human diseases that are a direct result of alterations in the 3' UTR.
Myotonic dystrophy (DM) occurs in about 1 in 8500 adults, resulting in such things as muscle weakness, endocrine dysfunction or respiratory distress. Its cause has been mapped to a gene that appears to code for a cAMP-dependent protein kinase called DMPK. But the mutation is not in the coding region. It is in the 3' UTR. In the normal population, there are 5-30 repeats of the CTG trinucleotide. But in people with DM, there can be over 1000 repeats of this trinucleotide. No one knows what is really going on. The normal physiological role of DMPK is still uncertain and the disease itself is quite complex. But there are some interesting aspects.
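To make the repeat numbers concrete, here is a minimal Python sketch of how one might count the longest uninterrupted CTG tract in a DNA string. The function name and the flanking sequences are made up for illustration; these are toy strings, not real DMPK sequence, and real repeat sizing is done in the lab rather than by a regular expression.

```python
import re


def longest_ctg_run(dna: str) -> int:
    """Return the repeat count of the longest uninterrupted CTG tract."""
    # Find every maximal stretch made only of back-to-back CTG units.
    runs = re.findall(r"(?:CTG)+", dna.upper())
    # Each repeat unit is 3 bases long; return 0 if no CTG is present.
    return max((len(run) // 3 for run in runs), default=0)


# Hypothetical flanking sequence, chosen only so it does not extend the tract.
FLANK_5 = "GGAATT"
FLANK_3 = "GGGGAA"

normal_allele = FLANK_5 + "CTG" * 12 + FLANK_3      # within the 5-30 range
expanded_allele = FLANK_5 + "CTG" * 1200 + FLANK_3  # "over 1000" repeats

print(longest_ctg_run(normal_allele))
print(longest_ctg_run(expanded_allele))
```

With these toy alleles the expanded tract is 100 times longer than the normal one, which gives a feel for how many extra binding sites a single mutant transcript could carry.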
There is a protein that binds the CUG repeats in the mRNA, called CUG-binding protein (CUG-BP). CUG-BP is homologous to a large family of proteins called ELAV, for embryonic lethal abnormal vision, which also bind mRNA. CUG-BP appears to be involved in regulating the proper splicing of DMPK and its transport out of the nucleus. DMPK also modulates CUG-BP activity in a feedback loop by phosphorylating the CUG-BP protein.
CUG-BP may also be involved in the processing of several other important proteins. Now what happens if there is a 20-50 fold increase in the number of binding sites of a protein in the 3' UTR of DMPK? It might just sequester away a lot of the cell's reservoir of CUG-BP. It would all bind to one mRNA and not be available for proper processing and transport of other mRNAs.
Now it is not that simple, because there appears to be another gene downstream of DMPK whose expression may be altered by the CTG sequences. There may be other transcriptional effects. But something like this might explain why this sort of mutation is dominant. Even if you have a wild-type allele, the mutant allele would siphon off enough CUG-BP to have an effect.
Another aspect of the 3' UTR deals with mRNA stability. α-globin mRNA is extraordinarily stable, accumulating to more than 95% of the total mRNA in erythrocyte precursors. However, a single base change in the termination codon from TAA to CAA results in a tremendous reduction in the half-life of the mRNA. It turns out that there are C-rich regions in the 3' UTR that are bound by a ribonucleoprotein (RNP) complex called the α complex. This complex also appears to bind the poly-A tail. Shortening of the poly-A tail, and the subsequent loss in stability, only occurs when some of the protein components of the α complex are removed. Apparently, in the mutant α-globin mRNA, ribosomes read through the altered termination codon into the 3' UTR and mask the binding sites for this RNP, resulting in faster turnover of the mRNA.
The 3' UTR appears to be a region ripe for exploitation by the cell. It is not under the evolutionary constraints that the 5' end is. Without a promoter, the gene would not work, right? Without the right control regions, it would not be properly expressed. But this leads to a further question. If this is true, how do new genes with new expression patterns ever arise? Say you have a pair of duplicated genes. How does one evolve into a completely novel gene?
Again, some hints to possible processes are being seen. In Drosophila, there is a gene on the X chromosome called Sdic (sperm-specific dynein intermediate chain) that is located between 2 other genes: Cdic (cytoplasmic dynein intermediate chain) and AnnX (the cell-adhesion protein annexin X). Sequencing of this region revealed something extraordinary.
The coding region of Sdic is made up of the last 5 exons of Cdic fused to an intron of AnnX (see figure). The proposed model starts with the Cdic gene followed by the AnnX gene (C--A). A tandem duplication (C--A--C--A) occurs, followed by a deletion (C--A/C--A). This leaves the middle 4 exons of AnnX fused to the last 5 exons of Cdic to form Sdic. But that's not all. The first exon of Sdic is very unusual because it is derived from an intron. All it took was a handful of nucleotide changes to convert this intron into an exon. But wait, there's more, as the infomercial would say.
Because a gene is not any good unless it has a promoter. Well, guess what, a promoter was formed by the juxtaposition of these AnnX and Cdic sequences, along with a fortuitous regulatory element that is testis specific. So, by the duplication and deletion of 2 genes, a promoter and its regulatory sequences were created that resulted in specific expression of this protein in testis. Wow, pretty jury-rigged but the DNA sequence similarities do not lie.
However, what is important about this gene is what it says about the formation of new genes. The promoter and control elements did not "evolve". They were already present in some form, just not in a position to be used. The plasticity of the genome allowed them to be juxtaposed to create a new gene that must be incredibly useful, because the Sdic gene has been repeated almost 10 times. It is very highly expressed. And it spread throughout the Drosophila population very rapidly, in a process called a selective sweep. Essentially, all the flies that lacked this gene were selected against.
This gene is found only in Drosophila melanogaster, and not in any closely related species. It is found in every D. melanogaster population studied. But the degree of polymorphism in the gene is very small, indicating a recent origin. For a gene to have spread throughout a population as rapidly as Sdic did, it must confer a tremendous selective advantage. The fact that it is specific to the testis invites the proposal that it might have been involved in speciation, affecting fertility in the populations bearing it.
So, we have 3' regions that bind proteins that are responsible for the expression of the coding region. We have genomic regions slamming into one another and creating fortuitous promoters and regulatory elements. One of the major consequences of the sequencing projects seems to be the demonstration of just how chaotic the genome is, and how important serendipity is. For anyone who thinks that sequencing the genome will be the pinnacle of biological research... Well, check out the title of this column.