Why bioinformatics !== compstructbio, or why Rosetta is not a glorified SVM

I got a comment on this blog that's been eating away at me for the last few months. It was on a post about developments in compstructbio over the last decade, where I had named the program Rosetta as the biggest of them. The offending comment was from a student who said,

"Interestingly, when my bioinformatics lecturer talked about rosetta, I was expecting to say that it was amazing after reading your post. He put its success down to the ridiculous amount of computing power it has and said that we have no idea if the algorithm is any good."

This is almost as wrong as a pork bun vendor hawking steaming hot pork buns in a synagogue on a sabbath. Indeed, it's several pork buns' worth of wrong. Now normally I'd let these kinds of buns slide into the non-kosher basket, but I've talked to enough bioinformaticians out there to realize that it's a pretty widespread attitude. So here, I hope to set the record straight.

So what does our Bioinformaticsexpert claim? Essentially this:

  1. We have no idea if Rosetta is any good
  2. Rosetta relies on a ridiculous amount of computing power
  3. Rosetta's success is not surprising

The short answer is that

  1. we have an extremely clear idea of how good Rosetta is,
  2. Rosetta owes its success to a rather elegant package of ideas, and in an absolute sense, it actually has quite modest computing requirements, and
  3. Rosetta's success was a big shot in the arm for the structural biology community.

Now the long answer, Mr. Bioinformaticsexpert. It's clear Mr. Bioinformaticsexpert knows little about the history of the protein folding problem, because the biggest thing to happen to protein folding is the CASP competition, which was first organized a decade ago. The goal of CASP is to define an accuracy test for any protein-folding algorithm. In CASP, predictors are given a sequence and asked to predict its structure, and only after the predictions are submitted will the actual structure be deposited in the PDB. Indeed, when CASP was first announced, it was a kick in the face to a whole bunch of protein-folding theorists who had claimed to have "solved" the "protein" "folding" problem "in principle".

The measurement of the entries is a very serious business indeed. Designing a measure of accuracy is considered a very important job, and some of the biggest names in structbio (Jane Richardson, Manfred Sippl) have served as judges in CASP. Still, the entries in the first year were so bad that the judges awarded mercy points to entries that were even vaguely globular. So yes, Mr. Bioinformaticsexpert, we have very clear criteria about the accuracy (and difficulty) of protein-folding.

Since CASP3, David Baker's predictions, based on Rosetta (and I say based on because there's a certain amount of manual tweaking), have consistently placed first in CASP. Now there's a distinction between placing first and solving the problem. They haven't solved it yet, but they have gotten closer than anyone else. We know this because we can put numbers on the accuracy of their predictions. The predictions were good, but you wouldn't bet your firstborn on them.

Now strictly speaking, Rosetta is not an algorithm, but a mix of algorithms, statistical potentials, databases and heuristics. Serious protein folding researchers have long given up the idea that there is a single killer algorithm that can solve the problem. To even talk about a single algorithm is so 1993. Might as well turn up to the party with fingerless gloves, do a moonwalk, and sing Purple Rain. At its coarsest level, the protein folding problem can be broken into two parts: developing a good scoring function, and finding ways to search through conformational space. Fundamentally there is no more to it than these two problems. But practically speaking, there are many different ways to break these problems down into tractable programs in action.
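To make the score-plus-search split concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: `toy_score` is a made-up function that simply likes torsion angles near -60 degrees, and `monte_carlo_search` is a plain Metropolis search over one angle per residue. Neither is taken from Rosetta; the point is only that the two pieces are separate and can be swapped independently.

```python
import math
import random

def toy_score(angles):
    """Hypothetical scoring function (lower is better).

    Rewards torsion angles near -60 degrees. A stand-in for a real
    energy function or statistical potential, not anything from Rosetta.
    """
    return sum(1.0 - math.cos(math.radians(a + 60.0)) for a in angles)

def monte_carlo_search(n_residues, n_steps=5000, temperature=0.5, seed=0):
    """Metropolis search over one torsion angle per residue (toy model)."""
    rng = random.Random(seed)
    angles = [rng.uniform(-180.0, 180.0) for _ in range(n_residues)]
    current = toy_score(angles)
    best_angles, best = list(angles), current
    for _ in range(n_steps):
        i = rng.randrange(n_residues)
        old = angles[i]
        angles[i] = rng.uniform(-180.0, 180.0)  # propose a random move
        trial = toy_score(angles)
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if trial <= current or rng.random() < math.exp((current - trial) / temperature):
            current = trial
            if current < best:
                best_angles, best = list(angles), current
        else:
            angles[i] = old  # reject: restore the previous angle
    return best_angles, best

if __name__ == "__main__":
    angles, score = monte_carlo_search(10)
    print(f"best toy score: {score:.3f}")
```

Swap in a better `toy_score` and you are working on the first problem; swap in a smarter move than "pick a random angle" and you are working on the second.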

Most people have long since given up on using chemically realistic atomic force-fields to score conformations. Once you get away from the shackles of atomic interactions, you can get quite creative in thinking up new force-field terms such as statistical potentials, and heuristics based on phenomenological criteria.

Searching through the space is an even more creative act. The trick is to collapse your search space as much as possible by making educated guesses. One thing many people do is to identify secondary structure using bioinformatics means and then fix those parts to helix and strand conformations.

Many of the innovations in Rosetta were in finding clever ways to cut down the search space, which actually made Rosetta more efficient and less computationally intensive than its competitors. An example: one early innovation of Rosetta was to use database-derived local fragments. Rosetta would look up the protein sequence against its database of short fragments, and if there was a hit, then that part of the protein would be fixed to that piece of local structure. Honestly, no one before this had taken local fragment sequence/structure relationships seriously. This brilliant insight alone made Rosetta faster than all its competitors, as these local fragments massively cut down on the search space.
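Here is a rough sketch of the fragment idea as I've described it, again in toy Python. The library contents, the three-residue window, and the names `FRAGMENT_LIBRARY`, `assign_fragments` and `count_free` are all hypothetical; the real fragment database and how Rosetta uses it are far more involved. The sketch only shows how a lookup hit removes residues from the search.

```python
# Hypothetical fragment library: 3-residue sequence -> (phi, psi) per residue.
# The entries are invented for illustration, not taken from any real database.
FRAGMENT_LIBRARY = {
    "AEL": [(-60.0, -45.0)] * 3,   # helix-like torsions
    "VTV": [(-120.0, 130.0)] * 3,  # strand-like torsions
}

def assign_fragments(sequence, library, size=3):
    """Return per-residue torsions; None marks residues with no fragment hit."""
    torsions = [None] * len(sequence)
    i = 0
    while i + size <= len(sequence):
        hit = library.get(sequence[i:i + size])
        if hit is not None:
            torsions[i:i + size] = hit  # fix this stretch to the fragment
            i += size
        else:
            i += 1
    return torsions

def count_free(torsions):
    """Number of residues still left for the conformational search."""
    return sum(1 for t in torsions if t is None)

if __name__ == "__main__":
    seq = "GAELKVTVG"
    torsions = assign_fragments(seq, FRAGMENT_LIBRARY)
    print(count_free(torsions), "of", len(seq), "residues left to search")
```

In this toy case, 6 of the 9 residues are fixed by fragment hits. If each residue could take, say, 3 discrete states, the search shrinks from 3^9 = 19683 conformations to 3^3 = 27.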

Was Rosetta's success surprising? Here, I have to be somewhat anecdotal. Having first gotten into the protein folding problem in the late '90s, I can tell you that I knew several people (Prof Y among them) working in the area who were seriously thinking of moving on to other areas, simply because no serious progress had been made for a very long time. I was looking around for a postdoc at the time and I had just interviewed in the Baker lab. After I came back from the (ultimately unsuccessful) interview, at the next joint group meeting, Prof Y, who himself had just come back from CASP3, looked straight at me with a strange expression of shock and disbelief and said that the guy I had just interviewed with had managed to predict an unbelievable structure of an entirely new fold. I swear he was visibly shaking. It is hard to overestimate how much ju-ju this achievement had at the time, but yes, the success of Rosetta was incredibly surprising, and exciting.

I realize that I probably got a little too worked up over this topic. But it touches on something that matters to me: I think that there is a great divide between bioinformatics and compstructbiology, and it's something that people should be aware of. I'd be the first to admit that my bioinformatics is poor; my knowledge of statistics is laughable; and I can barely construct a phylogenetic tree. Yet I'd challenge a bioinformatician to explain the difference between a free-energy minimum of a canonical ensemble and the energy minimum conformation of a micro-canonical ensemble; or what a covalent bond is, and why that matters to protein folding. In protein folding, there are many subtle biophysical considerations over and above pure computational grunt.

Still, in some ways, this lack of perspective is encouraging, because it means that protein folding is nothing like the trash-talking big-hair mess that it used to be. And people have moved on. Kind of like anything from the 1980s. And that's probably a good thing.