Pacific Symposium on Biocomputing
January 5 Report

Today has the session that I hope will have some of the most interesting presentations, at least for someone like me. It is on DNA Structure, Protein/DNA Interactions and DNA/Protein Expression. I work daily with DNA and proteins. While I understand (kind of) and enjoy (mostly) informatics, I live for DNA and proteins.

This is the closest I got to the water during the day. Back to the show.

I did my post-doc studying protein-DNA interactions, investigating the amino acid residues in the λ phage protein, cro, and its operator DNA. There was a hope that we could decipher the key that determined which amino acids interacted with which nucleotides. Reality made this much more complex than we thought.

The first talk made my day. It was by Craig Benham at Mount Sinai. He discussed the effect of torsional stress on DNA structure and on transcription. There are 2 main ways to alter gene expression: (1) binding of a protein to the DNA; and (2) torsional stress imposed on the DNA. The former has a slow response time, since it depends on changing protein concentrations, and is spatially determined. That is, the protein must find the right spot on the DNA. It can, however, create highly specific patterns of gene expression. The other mechanism for gene control has a fast response time. Locally unwinding a supercoiled double helix rapidly transmits the 'signal' over a great distance. So the action at one location can have effects far away from it. What it loses in specificity it makes up for in speed.

Benham's work examined Stress-Induced DNA Duplex Destabilization (SIDD). You have to have an acronym. In the action of E. coli RNA polymerase, the DNA double helix must be opened up and one strand moved out of the way, so that the other strand can be transcribed. The E. coli chromosome, like most DNA, is supercoiled (i.e., it has added twists to it. Take a rubber band between your two index fingers. Twist one finger around. At some point, the rubber band will start developing extra kinks in it. This is supercoiling.) So, underwinding the DNA can relieve some of the supercoiling and can separate strands.

The energy required to do this is sequence dependent, so destabilization depends on the DNA sequence. In addition, once one part is separated, it makes it easier to separate surrounding base pairs. Now, he presented a lot of math next. (Everyone here has to have one slide showing extensive mathematical formulas with self-defined Greek letters representing different values. I always ignore this slide.) But what it all reduces down to is that the only thing you need to know to calculate the probability of a base pair being open, or even the free energy of unwinding, is the nucleotide sequence. Every other possible parameter is already known.
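To make the "sequence in, probability out" idea concrete, here is a toy sketch. It is not Benham's algorithm: his calculation couples every base pair to the superhelical stress on the whole molecule, and this does not. The per-base opening costs, the window size, and the function name `opening_profile` are all made-up assumptions for illustration. All it shows is that AT-rich stretches, which are cheaper to open, come out with the highest relative opening propensity.

```python
import math

# Illustrative opening costs per base pair, in units of kT.
# These numbers are assumptions for the sketch, NOT Benham's parameters.
OPEN_COST = {"A": 1.0, "T": 1.0, "G": 3.0, "C": 3.0}


def opening_profile(seq, window=10):
    """Relative opening propensity of each window, from sequence alone.

    Each window gets a Boltzmann weight exp(-cost); the weights are then
    normalized so they sum to 1 across all windows.
    """
    seq = seq.upper()
    weights = []
    for start in range(len(seq) - window + 1):
        cost = sum(OPEN_COST[base] for base in seq[start:start + window])
        weights.append(math.exp(-cost))
    total = sum(weights)
    return [w / total for w in weights]


# A GC-rich sequence with an AT-rich island in the middle.
example = "GCGCGCGCGC" + "ATATATATAT" + "GCGCGCGCGC"
profile = opening_profile(example)
most_open = max(range(len(profile)), key=profile.__getitem__)
print("most easily opened window starts at", most_open)
```

The window that falls entirely on the AT-rich island wins, which is the qualitative behavior described in the talk, minus all of the real physics.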

To determine its veracity, he ran his algorithm on pBR322. It found 2 regions with high probability of strand separation. These corresponded exactly with the start and stop of transcription of the β-lactamase gene. In fact, he was able to confirm this experimentally by cleaving pBR322 with a single-strand-specific endonuclease (it cleaves single stranded regions of double stranded DNA). It cleaved at the exact spots expected and with the same relative frequency as the program predicted.

So, you have the potential for finding some promoter sites by simply running the sequence through the program. He then examined the region around the promoter of the ilv gene. Ninety base pairs upstream is the binding site for a protein, integration host factor (IHF). Transcription is only turned on if the DNA is torsionally stressed with about 3% underwinding. This action is not IHF specific, since replacing its binding site with other sites that bind protein also activates the promoter.
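For a feel of what "about 3% underwinding" means in turns of the helix, here is a small worked calculation. It assumes that the 3% refers to the fraction of helical turns removed relative to relaxed DNA (a superhelical density of roughly -0.03), and it uses the textbook figure of about 10.5 base pairs per turn for B-form DNA. The 4361 bp length is that of pBR322, used only as a convenient plasmid size; none of these numbers come from the talk.

```python
# Approximate helical repeat of relaxed B-form DNA (base pairs per turn).
HELICAL_REPEAT_BP = 10.5


def relaxed_linking_number(n_bp):
    """Number of helical turns in a relaxed closed circle of n_bp base pairs."""
    return n_bp / HELICAL_REPEAT_BP


def turns_removed(n_bp, underwinding_fraction):
    """Helical turns removed for a given fractional underwinding."""
    return underwinding_fraction * relaxed_linking_number(n_bp)


PLASMID_BP = 4361  # length of pBR322, used here only as an example size
lk0 = relaxed_linking_number(PLASMID_BP)
delta = turns_removed(PLASMID_BP, 0.03)
print(f"relaxed turns: {lk0:.1f}, turns removed at 3%: {delta:.1f}")
```

Under these assumptions, a plasmid of that size has roughly 415 turns when relaxed, so 3% underwinding amounts to taking out about a dozen of them.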

So, guess what happens when Benham's program is run on the sequences around the ilv promoter? The most destabilized region is exactly at the IHF binding site. Using procedures that cleave single stranded DNA, it was shown that, in the absence of IHF, there was lots of cleavage at the IHF binding site. Increasing the torsional stress increased the amount of cleavage. So the DNA is open in this region, just as the program predicted. Now, add IHF and the open DNA disappears! There is no sign of any cleavage of single stranded DNA. IHF appears to bring the DNA back into double stranded, B-form.

But the torsional stress has to go somewhere. Think of a balloon. If you squeeze one end, the other end will bulge. Same thing happens here. If the normally open area is closed, the torsional stress is transferred down the double helix. So, if you attempt to cleave the single strands in the presence of IHF, you find a new site of cleavage, 80 base pairs away from the IHF site, exactly at the -10 of the promoter. So, binding IHF causes the promoter to be opened up, allowing RNA polymerase to bind and start transcription. Pretty nifty. Binding of a protein at one region of the DNA stabilizes that region but destabilizes another at a distant location.

This approach can be applied to larger regions. They looked at the entire genome for E. coli. It took them 10 days on an R1000 chip. Not too optimized. They found 125 open reading frames that had similar properties to the ilv promoter region. They are in the process of examining these sites to determine whether they do indeed act like the ilv promoter.

SIDD is seen in several other processes, such as nuclear matrix attachment sites, retroviral integration sites, and translocation. He looked at a region upstream from the c-myc gene. He showed that as the torsional stress increases, one region opens up. As the stress increases further, another, adjacent sequence begins to unwind. Both of these sites have been shown to have regulatory proteins bound to them, depending on the cellular environment.

During the question and answer period, he showed an overhead that I found fascinating. They have started examining the yeast genome. Upstream from the CYC1 gene is a very strong SIDD site. There is nothing else with any reasonable free energy anywhere along the DNA for quite some distance. Now, if they remove some of these residues, not only does the main SIDD site disappear, but a whole group of other sites ranging on all sides of the gene now have significant free energies of destabilization. So, mutation at a SIDD site may have far-reaching effects all along the DNA, with huge ramifications for the appropriate regulation of genes.

There was further discussion in the afternoon roundtable. Benham talked quite a bit more about this system. He was asked how universal the stress-induced transcription system was. In E. coli, this is pretty easy to answer. The torsional stresses on the DNA can be strongly controlled by a variety of enzymes. In E. coli, DNA gyrase increases stress by underwinding DNA. Topoisomerase I relaxes stress. The rate constants of the two enzymes are very sensitive to the environment. If the oxygen tension is reduced, DNA gyrase activity increases and topoisomerase I activity decreases. The consequence is that in less than 1 minute the entire chromosome is more stressed, turning on a wide range of genes responsible for anaerobic metabolism and turning off unnecessary ones. So stress-induced regulation of transcription can respond in seconds, much faster than any other type of gene regulation.

In collaboration with David Clark, they looked at creating a minimal transcription module for the yeast gene, CUP1. This is a very ancient gene that acts to protect the organism from copper. It normally has TATA-binding sites, operators, etc. But, it is possible to remove all of these and control expression simply by modulating the torsional stress. Unwind the DNA and off the RNA polymerase goes. Examining this region of the chromosome found a very strong SIDD site. So, perhaps the primordial RNA polymerase just looked for single stranded regions of DNA and off it went.

If true, then everything else is simply there as refinements either to prevent inappropriate binding of RNA polymerase to transiently unwound regions or increase the binding to the appropriate region, and direct the correct positioning of the molecule. Maybe torsional stress is part of every transcription event, either in a positive or negative fashion.

Another person asked if torsional stress could be important in promoters with poor or absent TATA boxes. This is an open question, as is how well this approach could be used to actually predict promoters. Access to eukaryotic promoters by RNA polymerase requires the disassociation of histones, which by itself can result in large amounts of torsional stress being propagated down the chromosome. The program as it is now set up is not built to identify or utilize specific motifs, but they are working hard to find out how well they correlate.

There was a nice talk from Xiaole Liu from Stanford. They are working on algorithms to properly align motifs upstream from the start site of transcription. This program, called BioProspector, has previously been found to work well. It essentially starts with random alignments and slides the sequences over one another to increase a scoring function. She described some recent improvements they have made.

The identity of one particular base often influences the probability of the identity of surrounding bases, over and above any motifs present. So, they need to include a background probability by using Markov dependencies (her big math slide). This helps their discrimination a lot.
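The idea of a Markov background can be sketched in a few lines. This is a generic first-order Markov chain, not BioProspector's actual implementation; the order of the model, the pseudocount, and the function names `train_markov` and `log_prob` are assumptions chosen for illustration. The point is simply that a candidate site is scored against what the neighboring bases would lead you to expect, rather than against flat base frequencies.

```python
import math
from collections import Counter

BASES = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate P(next base | previous base) from background sequences.

    Sequences are assumed to be upper-case strings over A, C, G, T.
    """
    counts = Counter()
    for seq in seqs:
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += 1
    model = {}
    for prev in BASES:
        row_total = sum(counts[(prev, nxt)] for nxt in BASES) + 4 * pseudocount
        for nxt in BASES:
            model[(prev, nxt)] = (counts[(prev, nxt)] + pseudocount) / row_total
    return model


def log_prob(seq, model):
    """Log-probability of seq under the chain, conditioned on its first base."""
    return sum(math.log(model[(prev, nxt)]) for prev, nxt in zip(seq, seq[1:]))


# A made-up, strongly alternating background.
background = ["ATATATATATAT", "TATATATATA"]
model = train_markov(background)
print(log_prob("ATAT", model), log_prob("AGCT", model))
```

With this background, "ATAT" is unsurprising and "AGCT" is not, so a motif finder using the background would give much more credit to a conserved "AGCT" than to yet another run of alternating A and T.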

They originally required 100% conservation of the bases in a particular motif. As we all know, 100% conservation is hardly ever seen. So they have worked out a method to include thresholds for scoring a motif. Finally, they can now look for 2 block motifs and palindromes separated by varying gaps.
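A gapped palindrome here means a half-site followed, after some spacer, by its own reverse complement. The sketch below is only an illustration of that definition, not how BioProspector searches: it demands an exact match in both halves (exactly the 100% conservation they have moved away from), and the function names, half-site length, and gap range are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_gapped_palindromes(seq, half_len, min_gap, max_gap):
    """Return (start, gap) pairs where a half-site of length half_len is
    followed, gap bases later, by its exact reverse complement."""
    hits = []
    for start in range(len(seq) - 2 * half_len - min_gap + 1):
        target = reverse_complement(seq[start:start + half_len])
        for gap in range(min_gap, max_gap + 1):
            second = start + half_len + gap
            # A slice running off the end is shorter than target, so no match.
            if seq[second:second + half_len] == target:
                hits.append((start, gap))
    return hits


# Made-up sequence: half-site TGTGA, a 6-base spacer, then TCACA.
example_site = "CC" + "TGTGA" + "ACGTAC" + "TCACA" + "GG"
print(find_gapped_palindromes(example_site, half_len=5, min_gap=4, max_gap=8))
```

A two-block motif search has the same shape, except that the second block is a different motif rather than the reverse complement of the first.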

But they still need the DNA sequences. They will be using genes that appear to be co-expressed in DNA arrays. The expectation would be that these genes have some sort of DNA sequence that is responsible for their co-expression. Time will tell if they find anything.

Following the afternoon session, we had poster sessions. And great poster sessions they were. Potato chips, soft pretzels, sandwiches, beer and wine. Brought back nice memories of the first meeting I ever attended as a graduate student. Then, I gorged myself on the free food and beverages. Unfortunately, my body can no longer do this. So I actually looked at most of the posters. Maybe I'll talk about them later.

The first evening session dealt with the most recent Critical Assessment of Protein Structure Prediction (CASP4). Here is proof that Randy was eager to hear about this. He wanted to get a really good seat.

There are 200 new genes sequenced every day but only 2 new protein structures. The hope is that modeling will overcome this difference of 2 orders of magnitude. So about every 2 years, there is a competition to predict, using computational methods, the 3-dimensional structure of a group of proteins whose structure is close to being determined (i.e., by NMR or X-ray crystallography).

The first CASP in 1994 had 100 entries. This one had over 50 times as many. A big problem now is evaluating the predictions. Quite a bit was presented about the proper approach for doing this. While human evaluations worked in the old days, there is no way humans can evaluate the huge number seen today.

In addition, it is becoming quite difficult to determine the exact algorithm for properly aligning the sequences. Some predictions may be perfectly accurate over a short region while others are somewhat off but over a larger area. Which is better? There was a lot presented discussing this and I am sure Randy will be happy to explain it to you. I'm a biologist. The numbers go into a black box and I want the answer. I'm not interested in what happens in the black box. Thank goodness some people are or we would probably never see any advancement in structure prediction.
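The "which is better?" problem is easy to show with two made-up predictions. This sketch is not the assessors' actual scoring scheme. It assumes the per-residue errors (in angstroms, after superposition on the experimental structure) are already in hand, and it compares a single overall RMSD with the fraction of residues that fall within a distance cutoff. The error lists, cutoffs, and function names are all hypothetical.

```python
import math


def rmsd(deviations):
    """Root-mean-square deviation over all residues."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


def fraction_within(deviations, cutoff):
    """Fraction of residues whose error is at or below the cutoff."""
    return sum(d <= cutoff for d in deviations) / len(deviations)


# Hypothetical per-residue errors for a 100-residue target.
model_a = [0.5] * 20 + [12.0] * 80   # nearly perfect over a short region
model_b = [3.0] * 100                # somewhat off, but everywhere

for name, errors in (("A", model_a), ("B", model_b)):
    print(name,
          "RMSD:", round(rmsd(errors), 2),
          "within 1 A:", fraction_within(errors, 1.0),
          "within 4 A:", fraction_within(errors, 4.0))
```

By overall RMSD, or with a generous 4 angstrom cutoff, model B looks better. With a strict 1 angstrom cutoff, model A looks better. The ranking depends on the measure, which is why so much of the session went into arguing about the measure.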

The best comment of the session was that the purpose of the prediction committee is not to delineate what the predictions did right to get the correct structure. They are trying to understand why the prediction methods make mistakes, so that they can fix them. So the nitpicking over the best alignment algorithm and best approach for assessing the fit is not as important as finding the areas the predictions got wrong.

The results will appear in Proteins in July 2001. Forty-three targets were used in this competition. Eight of these had totally new folds and were the most difficult. Eleven were structures with homologous members already identified, making their prediction the easiest.

They did pretty well with the easiest ones using comparative modeling. Much worse on the most difficult ones with new folds. But there were several that made very good fits. Mostly the mistakes were in the loops. Most interesting were the ab initio approaches. These create a model using just basic principles. There is no looking at structural databases. There is no threading. This is actually trying to solve the protein folding problem purely computationally.

Some of these were amazingly close. Often the core of the structure was predicted pretty well in the best approaches. Sometimes it was just a problem with connectivity or a loop would be projected in the wrong direction. But this is a lot better than in CASP1 where ab initio methods were not even close on pretty small proteins.

But no one method actually worked well all the time. It is still impossible to know beforehand which prediction will be correct. What this is doing is helping each approach to become better. As one participant said, "No method works perfectly, YET!!" The hope is that as they get better ideas of what went wrong, someone will eventually get a methodology correct. At the moment, although huge strides have been made, nothing looks like it is breaking away from the pack.

Well, the session with the most interesting title is coming up. Bioethics, Fiction Science and the Future of Mankind. Hope it is as fun as the title implies.

It wasn't.

The day started at 8:15 am and I left the last session at 10:30 pm before it was over. A long but fun day.

_____________________________
Presenters - 28. Presentation Methods: Windows with Powerpoint - 15. Overheads - 10. Macs with Powerpoint - 2. Started up in Linux then rebooted into Win2000 with Powerpoint - 1.