Pacific Symposium on Biocomputing
January 4 Report

We had the official start to the conference today. They had planned on 225 people but there are over 400 here. Most seem to be from bioinformatic companies. It is a lot harder to find biologists here this year in the mass of people with computer science degrees.

The morning sessions focused on novel computing solutions for working with the huge amount of data being generated. Instead of using hardware specifically designed to do alignments, most of these presentations dealt with using low cost processors in parallel to accomplish these tasks. Two of the talks dealt with SIMD, one by Leslie Grate and the other by Eric Rice, both at UC Santa Cruz.

I love computer jargon. Trying to figure out what SIMD really means beforehand is totally impossible. So, here it is. Single Instruction Multiple Data. That helps a lot, doesn't it? If I understand it properly, it involves providing the exact same instructions to an array of processors. Each processor is then given different data to work on.
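
For the programmers, here is a minimal Python sketch of the SIMD idea as described above: one instruction is handed to every lane, and each lane applies it to its own piece of data. The names `simd_step` and `lanes` are made up for illustration, and a Python loop only imitates what real SIMD hardware does in lockstep.

```python
import operator

def simd_step(instruction, lanes, operand):
    """Apply one instruction to every lane; each lane holds different data."""
    return [instruction(value, operand) for value in lanes]

lanes = [1, 2, 3, 4]                            # one data item per processor
doubled = simd_step(operator.mul, lanes, 2)     # same instruction everywhere
shifted = simd_step(operator.add, doubled, 10)  # next instruction, again for all lanes
print(doubled)
print(shifted)
```

Note that there is no per-lane branching anywhere in the sketch, which is the part that makes if-then logic awkward on this kind of hardware.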

KESTREL is not only a raptor, it is a small hardware card designed to fit into the PCI slot of most desktop computers. It has 512 very simple processors on it that are connected together. Each reads data from the left and writes data to the right. This is a very simple arrangement and cannot do very complicated work. It does not deal with conditional instructions well (i.e. if-then) because not all processors will be working at the same time.

But it is really good at some things, like doing sequence alignments. What they do is put the query sequence in the processor array - one amino acid per processor. Now, the Smith-Waterman algorithm is something I do not fully understand. They had a great slide of complex math but what I remember was this. The scoring function is dependent on two amino acids, i and j, and adjacent ones i-1 and j-1. Well, i-1 is in the neighboring processor and j-1 is from a previous step that is stored in memory. (Okay, Bob Dubose will probably correct me but this is what I got from the slide.)
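
The dependency pattern described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the KESTREL code: the scoring values (`match`, `mismatch`, `gap`) are placeholders chosen for illustration, a real protein search would use a substitution matrix, and the loop runs serially rather than on a processor array. The textbook recurrence also looks at the cell directly above and the cell directly to the left, not just the diagonal.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using a simple linear gap penalty."""
    prev = [0] * (len(query) + 1)        # row j-1: the previous step kept in memory
    best = 0
    for t in target:                     # one database residue per step
        curr = [0] * (len(query) + 1)
        for i, q in enumerate(query, start=1):
            diag = prev[i - 1] + (match if q == t else mismatch)  # cell (i-1, j-1)
            up = prev[i] + gap           # cell (i, j-1): same position, previous step
            left = curr[i - 1] + gap     # cell (i-1, j): the neighbouring position
            curr[i] = max(0, diag, up, left)   # never below zero: local alignment
            best = max(best, curr[i])
        prev = curr
    return best

print(smith_waterman_score("ACGT", "TTACGTTT"))
```

In the processor-array picture, each value of `i` would live in its own processor, `curr[i - 1]` would come from the neighbour, and `prev` would be whatever that processor stored on the previous step.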

Anyway, what I was struck by was that it took only 19 instructions to code for a Smith-Waterman alignment. This is an incredibly small number. It allows very fast alignments to be performed. Get this. They use 20 MHz processors (the technology is really simple and almost 5 years old). They could search a database of 10 million bases using query sequences less than 512 bases in 13 seconds. This is about 20 times faster than using a single, superfast processor. The cost of this card is about $8000 but could be about $1000 if made in large quantities. And they are using 4 year old technology. They plan on upgrading to 1000 processors, a tenfold faster clock (200 MHz vs. 20 MHz), and better I/O.
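
A quick back-of-the-envelope check of those numbers, in Python. Two things here are my assumptions and were not stated in the talk: that each processor finishes one cell of the alignment matrix every 19 instructions, and that each processor executes one instruction per clock cycle. The other figures are the ones quoted above.

```python
processors = 512
clock_hz = 20e6                 # 20 MHz
instructions_per_cell = 19      # assumption: 19 instructions finish one matrix cell
database_bases = 10_000_000
query_length = 512              # upper bound on query length
reported_seconds = 13

cells = database_bases * query_length                        # matrix cells to fill
peak_cells_per_second = processors * clock_hz / instructions_per_cell
ideal_seconds = cells / peak_cells_per_second                # if every processor stays busy
observed_cells_per_second = cells / reported_seconds

# Planned upgrade: 1000 processors and a tenfold faster clock.
upgrade_factor = (1000 / processors) * 10

print(f"ideal time: {ideal_seconds:.1f} s, reported: {reported_seconds} s")
print(f"observed rate: {observed_cells_per_second / 1e6:.0f} million cells per second")
print(f"upgrade speedup at peak: about {upgrade_factor:.1f}x")
```

Under those assumptions the ideal time works out to 9.5 seconds, so the reported 13 seconds is in the right ballpark once data transfer and shorter queries are allowed for.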

There was also a nice presentation on a method for aligning the three-dimensional structures of proteins by Babu Guda from UC San Diego. Protein structure can provide more concise information than linear alignments. Often motifs can only be seen by examining multiple alignments. There are over 14,000 structures in the PDB that can be used.

The problem with most approaches is that they are usually master-slave. That is, one sequence is the top protein and everything else is aligned to it. This can present biases in the alignment. Well, the way this approach worked is to compute a distance score based on the position of the backbone carbons of the proteins. Then it will make a small, random change (like an insertion) and see if the score improves. If it does, the change is made permanent. If not, it does some computations to determine whether to keep the change. It works best with very large families but comes up with very good alignments.
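
The loop described above (random change, keep it if the score improves, otherwise do a calculation to decide) sounds like a Monte Carlo optimization. Here is a hedged Python sketch of that general technique. The talk did not say which acceptance rule was used, so the Metropolis-style test below is my assumption, and the "alignment" is a toy list of integer offsets with a made-up distance function, not protein backbone coordinates.

```python
import math
import random

def monte_carlo_optimize(state, score, propose, steps=5000, temperature=1.0, seed=0):
    """Minimise score(state) with random moves and a Metropolis-style acceptance test."""
    rng = random.Random(seed)
    current, current_score = state, score(state)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, rng)
        delta = score(candidate) - current_score
        # Improvements are always kept; worse moves are kept with
        # probability exp(-delta / temperature) (assumed rule).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, current_score + delta
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score

# Toy stand-in for an alignment: offsets that should match a hidden target.
target = [3, -1, 4, 0, 2]

def toy_distance(offsets):
    return sum((a - b) ** 2 for a, b in zip(offsets, target))

def shift_one(offsets, rng):
    changed = list(offsets)
    changed[rng.randrange(len(changed))] += rng.choice((-1, 1))
    return changed

best, best_score = monte_carlo_optimize([0] * 5, toy_distance, shift_one)
print(best, best_score)
```

Because no single item is treated as the reference, every part of the state is free to move, which is the point of avoiding the master-slave setup.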

The presentation by Allison Waugh from Stanford looked at distributed computing. This uses a group of computers connected by a network. Each computer can act separately to work on parts of a problem. In this case, they used a program that identified calcium binding sites in proteins. They examined all the 14,000 proteins in the PDB in a little over 10 hours. There were 55 proteins that had a high score but had nothing in the annotations regarding cation binding sites. Several of those look particularly interesting and are being investigated for cation binding.
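
The shape of that kind of job is easy to sketch in Python: every protein is scored independently, so the work can be handed out to whichever worker is free. Everything specific here is hypothetical. The scoring function `site_score` is a made-up stand-in (fraction of acidic residues), not the calcium site program from the talk, and a thread pool on one machine stands in for a group of networked computers.

```python
from concurrent.futures import ThreadPoolExecutor

def site_score(sequence):
    """Hypothetical stand-in score: fraction of acidic residues (D, E)."""
    return sum(sequence.count(residue) for residue in "DE") / len(sequence)

def scan(proteins, workers=4, threshold=0.3):
    """Score every protein independently and return the high scorers."""
    names = list(proteins)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(site_score, (proteins[name] for name in names)))
    return {name: score for name, score in zip(names, scores) if score >= threshold}

# Made-up sequences, purely for illustration.
proteins = {"p1": "DEDKDGE", "p2": "GAVLIPF", "p3": "DDDAAAA"}
hits = scan(proteins)
print(sorted(hits))
```

Since no protein's score depends on any other, adding more workers should shorten the run roughly in proportion, at least until handing out the work becomes the bottleneck.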

We then had the Keynote Lecture by David Haussler. He discussed the current status of the public draft of the human genome. About 1/3 of the genome is finished, meaning that the sequence has been covered 12-fold or more. The rest is in draft form, meaning the coverage is about 4-fold. One interesting fact was that when the draft sequence was released in July, over half a trillion bytes were transferred that day over the Internet without a crash. Maybe Amazon.com should talk with their server group.

He had some caveats about the draft sequence. About 5% is duplicated due to misassembly. So be careful about finding duplications. It may not really be duplicated. But there are some interesting things that can be seen even in the draft sequence. GC rich areas are very gene rich, although there are genes even in AT rich areas. Genes found in AT rich regions tend to have very large introns. The record intron so far is over 500,000 bases long. One gene has 178 exons.

We then had lunch in honor of Dr. Haussler. I have run into the other Immunoids here. Eban Calhoun and Randy Ketchem ate lunch with me. Carl March is here also. It was a nice outdoor meal and got us really charged up for the afternoon session. Unfortunately, as most of you know, sessions after lunch are always tougher. I just wanted a nap.

The afternoon sessions were about methods to examine protein evolution. By looking at the substitution rates at various sites in a protein, you can get an idea of the selective pressures evolution puts on a protein. This approach is just getting going in a big way because we are getting much more complete genomic sequences of a lot of different organisms. One group led by Rich Goldstein at the University of Michigan found that fast changing sites were at hydrophobic regions inside a protein or hydrophilic regions on the outside. Slow changing sites were at hydrophobic regions on the outside or hydrophilic on the inside. Makes sense. If a hydrophobic group is on the outside it is probably there for an important reason and Nature wants it there.

The final talk I will mention was one of the most entertaining. It was given by Gavin Naylor of Iowa State University. Many models of nucleotide change recognize that there are different rates of transitions and transversions, as well as rates for the creation of gaps and deletions. Using these rates, along with the actual data (i.e. aligned sequences), phylogenetic trees have been constructed that attempt to display the genetic relationships of a group of organisms. A problem could arise if the models and rates chosen are wrong for a particular protein. Just as bones can develop into cervical vertebrae or into phalanges, perhaps different proteins have different 'evolvabilities'.
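
For anyone who has not met the terms: a transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T), and a transversion swaps one class for the other. The Python sketch below just counts them between two aligned sequences. It is a counting illustration only, not the model from the talk, and it assumes the input contains nothing but A, C, G, T and the gap character `-`. The example sequences are made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_substitutions(seq_a, seq_b):
    """Count transitions, transversions and gap columns in a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transitions": 0, "transversions": 0, "gaps": 0}
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            counts["gaps"] += 1
        elif a == b:
            continue                                  # identical column, nothing to count
        elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            counts["transitions"] += 1                # A<->G or C<->T
        else:
            counts["transversions"] += 1              # purine <-> pyrimidine
    return counts

counts = count_substitutions("ACGTAC-T", "GCGTCCAT")
print(counts)
```

Counts like these are the raw material; turning them into rates requires a model, and the choice of model is exactly what the talk was questioning.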

So, what to do? Well, we have phylogenetic trees developed for vertebrates by independent means. At least independent from genetic analysis (e.g. fossils). How about we use this tree along with the sequence alignments, and try to determine the model of nucleotide change? That is, let's work backwards and determine what rates and constraints of nucleotide change can produce the sequences that are related by the known phylogenetic tree.

Well, following the analysis of a group of mitochondrial genes, they found that the wobble position could change a lot, particularly at codons that were 4-fold or 6-fold redundant. No surprise here. These are neutral changes, producing no changes in the protein sequence. Codon position and degeneracy were the biggest influences. Hydrophobicity was somewhat important but surprisingly secondary structure did not seem to be important. That is, maintenance of an alpha helix was really not that important in determining the rate of mutation. This is pretty unsettling to a structure/function guy like me. I mean, how important can site-directed mutagenesis be if structure is not really important? Luckily, there were enough caveats to give me breathing room. But a very interesting talk.
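
To see why third-position changes are so often silent, here is a small Python sketch that counts, for any codon, how many third-position bases leave the amino acid unchanged. One loud assumption: it uses the standard genetic code, while vertebrate mitochondria use a slightly different table, so this illustrates the idea and does not reproduce the talk's analysis. The function name `third_position_degeneracy` is my own.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
# Assumption: the vertebrate mitochondrial code differs at a few codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def third_position_degeneracy(codon):
    """Number of third-position bases that give the same amino acid as this codon."""
    amino_acid = CODON_TABLE[codon]
    return sum(CODON_TABLE[codon[:2] + base] == amino_acid for base in BASES)

# Two-base prefixes where any third base gives the same amino acid.
fourfold = sorted({codon[:2] for codon in CODON_TABLE
                   if third_position_degeneracy(codon) == 4})
print(fourfold)
print(third_position_degeneracy("ATG"))
```

At the prefixes listed in `fourfold`, every possible change at the third base is silent, which fits with those sites being free to change quickly.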

Well, enough for today. I did not see any wine at the evening sessions but they were pretty short tonight. Maybe tomorrow.

_____________________________
Presenters - 18. Presentation Methods - Windows with PowerPoint - 11. Overheads - 5. Macs with PowerPoint - 2. Sony Vaio is a very popular model. Everyone from Stanford had one. I also discovered why some people bring laptops to sessions. I saw several playing Solitaire.