This day looked to be a mixed bag. The first morning session was on Natural Language Processing for Biology. They have had something like this almost every year. It always deals with trying to use computational methods to extract information from scientific papers. Automating this would greatly simplify a lot of data entry.
So I am always hopeful that talks on Information Extraction (IE) will produce something useful. Wouldn't it be nice to run a program that could accurately distill an entire paper for you? Abstracts are supposed to do that, but how many times have you found something really important in the body of a paper that was never mentioned in the abstract? And a perfect extraction program would write the perfect abstract for your paper. Anything that makes it easier to write a paper is fine with me.
Unfortunately, IE is much like the protein folding problem. Everyone has their own approach that works okay in their trial environment but does not come close to working in the real world. For example, the talks today only tried to extract information from abstracts. These tools would be best used for Medline search engines: good for librarians, not for scientists. And they did not even work all that well on abstracts.
I guess it would help if we wrote in a simpler fashion. We just love to create complex sentences with lots of clauses, written in the passive voice and filled with indefinite articles. That makes it very hard for the computer to find the direct object, the correct verb, and so on. I mean, sometimes there are sentences even we have trouble parsing (e.g., "Expression of X inhibits the suppression of Y while activating the inhibition of Z for AA and suppressing the activation of C.").
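To see how badly a naive program handles prose like that, here is a toy illustration (just a regular expression, not a real IE system, and certainly not what any of the speakers presented) run against that very sentence:

```python
# Toy illustration: a flat "agent inhibits target" pattern applied to the
# convoluted example sentence above.  Not a real IE system, just a regex.
import re

sentence = ("Expression of X inhibits the suppression of Y while activating "
            "the inhibition of Z for AA and suppressing the activation of C.")

# Naive relation pattern: grab the word before "inhibits" and the first word
# after the next "of", ignoring everything in between.
for agent, target in re.findall(r"(\w+) inhibits .*? of (\w+)", sentence):
    print(f"extracted relation: {agent} inhibits {target}")   # -> "X inhibits Y"
# The sentence actually says X inhibits the *suppression* of Y, so Y should go
# up; the flattened relation gets the biology exactly backwards.
```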
So part of the problem comes from poorly written sentences. That is pretty common, since for some authors English is not their first language, and few of us have ever been taught how to write properly. In fact, my alma mater, Caltech, recently added a graduation requirement that involves writing a popular-style article, suitable for publication in something like Scientific American. The students take a class on scientific writing and have to produce such a piece. Some have even been of high enough quality to be published. So perhaps the next generation of scientists will write gooder ...er, better, and make it easier for programs to properly extract information.
The next session was another one I had held out hope for. It was on using computational methods to construct genetic networks. With all the microarray data coming in, it makes sense to use computers to try to figure out how everything fits together. Some models work well for very specific systems, but there are basic problems with some of these approaches. Microarray data can only examine populations of cells, averaging all the values. However, as I have discussed in my column, recent work indicates that gene expression is often stochastic. Different cells are expressing different genes, but we only see the average. There was a nice example demonstrating how two genes whose expression is not directly linked could still give you the impression that they were induced at the same time.
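To see why the averaging is so dangerous, here is a toy simulation of my own (not from the talk): two genes that switch on at independent, random times in each cell still look co-induced once you average over the whole population, which is all a microarray can do.

```python
# Toy sketch: two genes turning on at independent random times per cell
# still appear co-induced in the population average.
import random

CELLS, TIMEPOINTS = 1000, 20
random.seed(1)

# Each cell flips gene X and gene Y on at its own random time, independently.
on_x = [random.randint(0, TIMEPOINTS - 1) for _ in range(CELLS)]
on_y = [random.randint(0, TIMEPOINTS - 1) for _ in range(CELLS)]

for t in range(TIMEPOINTS):
    avg_x = sum(1 for s in on_x if s <= t) / CELLS   # fraction of cells expressing X
    avg_y = sum(1 for s in on_y if s <= t) / CELLS   # fraction of cells expressing Y
    print(f"t={t:2d}  average X={avg_x:.2f}  average Y={avg_y:.2f}")
# Both averages climb together even though no single cell coordinates X and Y.
```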
So a real caveat with these approaches is that they can infer a pathway but they cannot predict one. That is, the programs can take the data and come up with an answer, but the answer may not correspond to reality. One talk tried to compare the different approaches for constructing genetic networks and found that none of them is really all that great at the moment.
But there are some things that can be done. One report, presented by Lingchong You from the University of Wisconsin-Madison, discussed modeling the life cycle of the T7 bacteriophage. This works because we know what every protein in the system is, when it is expressed, and what it controls. So it is simply a matter of putting the parameters into a program and seeing if it matches reality. Actually, it is not that simple, because I could not do it, but someone else has.
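For the curious, here is a minimal sketch of what "putting the parameters into a program" amounts to. This is not You's model; the species and rate constants are invented, and real T7 has dozens of genes, not two. It just shows the plug-in-and-integrate idea in a few lines of Python.

```python
# Minimal sketch of a parameterized gene cascade, integrated by Euler steps.
# All names and rate constants are made up for illustration.
k_early, k_capsid, k_decay = 0.8, 0.6, 0.05   # assumed rate constants (per minute)
genome, early, capsid = 1.0, 0.0, 0.0         # relative amounts
dt = 0.1                                      # minutes per integration step

for step in range(601):                       # simulate 60 minutes
    d_early = k_early * genome - k_decay * early      # genome drives early protein
    d_capsid = k_capsid * early - k_decay * capsid    # early protein drives capsid
    early += d_early * dt
    capsid += d_capsid * dt
    if step % 150 == 0:
        print(f"t={step*dt:5.1f} min  early={early:6.2f}  capsid={capsid:6.2f}")
# With the real parameter set you would compare curves like these to measured
# protein levels; the hard part is knowing every constant in the first place.
```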
But this presentation displayed the major problem of this year's meeting. So many more people submitted papers than in years past that there are now more speakers, and each is permitted to speak for only 15 minutes, with 5 minutes for questions. That does not allow enough time for adequate background, perspective, or results. Often any real biology is left out or shortened because it comes at the end of the talk and people are rushed for time.
You's presentation included a lot of slides that were just thrown up there without any real time to take them in. It appeared that the model came pretty close to matching real-world data, but there looked to be some differences, and You (the presenter, not you, the reader) really had no time to explain why. It looked like an interesting approach. But if we cannot correctly model a simple genetic network for which we know every step, what hope do we have of correctly constructing a model of a more complex one?
Pedro Romero from SRI had a similar problem with his presentation. They appear to have the complete intermediary metabolism of E. coli in the computer. I am sure all of us remember seeing that huge chart of all the metabolic pathways hanging on a lab wall. Well, Romero seemed to have those pathways on the computer. Again, the slides went by too fast, but it appears that they had all the reactions coded into a model. They also knew which metabolites were manufactured by the bacterium and which were essential ingredients. Or so they thought. The program was run using the ingredients of a medium, M61, as the input; they wanted to see if the simulated bacterium could metabolize and manufacture everything else.
Well, he had to go through it so fast it is hard to say. For instance, he showed one slide that said a necessary metabolite, UDP-galactose, could only be created from galactose and could never be made from glucose. Yet glucose was the only sugar in the medium. So how did the simulated cell grow? I had to go look in the paper in the hard-bound proceedings to find out. But if I need to read the proceedings, why give the talk? The presentation should help explain the paper, not the other way around.
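To make the puzzle concrete, here is a rough sketch of the kind of reachability check such a simulation has to make. It is not SRI's program, and the compounds and reactions are invented, but it shows how a simulated cell gets "stuck" when a required metabolite has no route from what is in the medium.

```python
# Rough sketch of a metabolic reachability check.  Reactions, compounds, and
# the medium are all invented for illustration.
reactions = [
    ({"glucose"}, {"glucose-6-P"}),
    ({"glucose-6-P"}, {"glucose-1-P"}),
    ({"galactose"}, {"galactose-1-P"}),
    ({"galactose-1-P", "UTP"}, {"UDP-galactose"}),   # only route in this toy set
]
medium = {"glucose", "UTP"}            # stand-in for the growth medium
required = {"UDP-galactose"}           # metabolite the cell must be able to make

available = set(medium)
changed = True
while changed:                         # keep firing reactions until nothing new appears
    changed = False
    for substrates, products in reactions:
        if substrates <= available and not products <= available:
            available |= products
            changed = True

missing = required - available
print("can make everything" if not missing else f"stuck: cannot make {missing}")
```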
So, overall, the genetic networking talks were useful. But what I found interesting is that none of them included any biochemistry. The intermediary metabolism model only looked at reaction pathways; it did not include any equilibrium or kinetic constants. Same with the T7 bacteriophage simulation: nothing about kcat or Vmax or anything. In fact, all of the genetic networking work simply looked at connections, yes or no, and whether there was any inhibition or activation. There was nothing about binding constants, enzymatic reactions, or metabolite transport.
Now maybe they can ignore these in their computations. Maybe everything just divides out and you don't really need to know these things, but as a biochemist, I sure hope not. Otherwise, what have we been doing for the last century or so? I'm still hoping that these details will be necessary and that we can spend quite a few more years working with Michaelis-Menten kinetics and doing binding curves to find the KA of a receptor.
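For the record, this is the sort of detail I mean. A few lines with the Michaelis-Menten rate law (made-up constants, of course) say more about how fast a reaction actually runs than any number of yes/no arrows in a network diagram.

```python
# The Michaelis-Menten rate law: v = Vmax * [S] / (Km + [S]).
# Constants below are invented purely for illustration.
def mm_rate(s, vmax, km):
    """Reaction velocity at substrate concentration s."""
    return vmax * s / (km + s)

vmax, km = 10.0, 2.0   # assumed: Vmax in uM/s, Km in uM
for s in (0.5, 2.0, 8.0, 50.0):
    print(f"[S]={s:5.1f} uM  v={mm_rate(s, vmax, km):5.2f} uM/s")
# Velocity is half-maximal right at [S] = Km; without numbers like these, a
# model can only say "A activates B", not how strongly or how fast.
```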
The meetings are over tomorrow, with two potentially interesting sessions: Human Genome Variation and Phylogenetics in the Post-Genomic Era. I have been told that I am a weeny for going to all the sessions. Maybe so, but I am going to go out tomorrow afternoon and look for whales. Wish me luck.
_____________________________
I am no longer going to keep score. I asked and found out that the organizers advised everyone to do their presentations in PowerPoint, in order to facilitate the proceedings. Since they provided a Compaq computer at the front, everyone either used their own laptop or just moved their file over via a floppy. So it is not really possible to determine who created their presentation on what platform. I have to say that several of the Linux gurus were more hacked off than any Mac fan.