A Fundamental Breakthrough in Protein Folding

In my humble opinion, the biggest paper in protein folding from the last few years was published in the closing weeks of 2011. It is "Protein 3D Structure Computed from Evolutionary Sequence Variation" by Debora Marks, Lucy Colwell and colleagues (and when I say colleagues I include Chris Sander, whom you should all know as a co-author of DSSP). This paper demonstrates the tremendous result that the key structural contacts in a protein can be derived from a multiple sequence alignment. And that these contacts are sufficient to generate reliable structures of the protein. And big proteins at that.

I heard on the grapevine that this paper had a difficult passage to publication. I was both surprised and not surprised by this. I was surprised because this is such a fundamental result that it should have been published in a top-tier journal.

But I was also not surprised, because this paper pushes in a direction different from the mainstream of protein folding. The salient point of the paper is that you should be able to predict a reliable lo-res structure for any protein sequence with only a few hours of computation on a standard desktop computer. No wonder some of the referees got hot-and-bothered and proved an obstacle to publication. (I can sort of understand this, as the accepted test of protein-folding algorithms is CASP. Nevertheless, I've read many protein-folding papers and few offer novel fundamental approaches such as this.) After all, wasn't protein folding supposed to be a computationally difficult problem requiring massive computational resources? Still, I am glad it got published in PLoS ONE (arguably the top open-access journal), as anybody can now read it [flips bird at closed-wall publishing].

What excites me most about this paper is that it brings together several strands of protein-folding research, some of which I have worked in, into one powerful result. The paper draws on contact analysis of protein structures, phylogenetic analysis of multiple sequence alignments, measures of coevolution, and the practical problem of generating structures from distance constraints.

The real breakthrough in this paper is the identification of a sufficiently robust measure of coevolution in a multiple sequence alignment of a protein family. Although coevolution in sequence alignments had been studied before, it was really the work of Rama Ranganathan that got people excited, by showing that coevolution analysis could predict mutations with genuine experimental ramifications. Using his SCA (statistical coupling analysis) measure to identify key contact pairs, Ranganathan found position pairs in the PDZ domain that, when mutated, produced measurable experimental changes. Nevertheless, many of the other top correlated pairs predicted by SCA were ambiguous, with no easy interpretation. There appeared to be a lot of noise in the SCA measure.
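To make "a measure of coevolution" concrete, here is a minimal sketch of the most naive one: plain mutual information between two alignment columns. To be clear, this is neither SCA nor DCA, just a baseline of the kind those measures try to improve on, and the toy alignment and function names are made up for illustration.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in nats) between columns i and j of an alignment.

    `alignment` is a list of equal-length strings, one per sequence.
    """
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    f_i = Counter(col_i)               # single-column residue counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))  # pair counts
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


def rank_pairs(alignment):
    """Score every column pair and return (score, i, j) tuples, best first."""
    length = len(alignment[0])
    scores = [
        (mutual_information(alignment, i, j), i, j)
        for i in range(length)
        for j in range(i + 1, length)
    ]
    return sorted(scores, reverse=True)


# Hypothetical toy alignment: columns 1 and 3 covary perfectly (K with E, R with D).
toy_alignment = ["AKLE", "AKIE", "ARLD", "ARID", "AKLE", "ARVD"]

if __name__ == "__main__":
    for score, i, j in rank_pairs(toy_alignment):
        print(f"columns {i},{j}: MI = {score:.3f}")
```

On a real family, a score like this mixes direct contacts with indirect correlations and phylogenetic bias, which is one way to picture the noise problem described above.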

Since then, many different coevolution measures have been proposed, each with its own ambiguities. The measure used in this paper appears to resolve the ambiguities of the previous measures. It originates in Direct Coupling Analysis (DCA), proposed in 2009 by Martin Weigt and colleagues. This was a much more sophisticated measure of coevolution than its predecessors. It treated the observed correlations as the result of a much richer probability model of pair couplings, which involves calculating a set of hidden parameters. Solving this model required some heavy-duty machine-learning techniques. And it was slooooow. This work was used to analyse inter-domain coupling between two protein families that were known to interact.
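For readers who want the shape of that probability model, here is a sketch of the general form of a pairwise model over the L columns of an alignment, as I understand the DCA family of methods. The symbols (e_ij, h_i, Z, f_i, P^dir) are my own notation for illustration and may not match the papers exactly.

```latex
% Pairwise model over a sequence (A_1, ..., A_L): couplings e_ij and local fields h_i
P(A_1, \dots, A_L) = \frac{1}{Z}
  \exp\Big( \sum_{i<j} e_{ij}(A_i, A_j) + \sum_{i} h_i(A_i) \Big)

% One way to score a pair (i, j): the information carried by the direct coupling alone,
% where f_i(A) is the observed frequency of residue A in column i
\mathrm{DI}_{ij} = \sum_{A,B} P^{\mathrm{dir}}_{ij}(A, B)
  \ln \frac{P^{\mathrm{dir}}_{ij}(A, B)}{f_i(A)\, f_j(B)}
```

The couplings e_ij and fields h_i are the hidden parameters: they are fitted so that the model's single-column and pair frequencies match those observed in the alignment. Scoring pairs by the direct term alone, rather than by raw correlation, is what is meant to separate direct contacts from correlations passed along chains of neighbours.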

Finally, in mid-2011, Weigt released a new version of his measure that improved the performance of DCA by orders of magnitude. This was used to show that the pairs identified by DCA were, in fact, native contact pairs in the corresponding crystal structure. Let me pause right here. This is a massive result. The impressive signal-to-noise of the DCA measure is a huge improvement over all previous measures of coevolution.

Still, this raises the question of how significant the contacts identified by DCA are. Are they sufficient to define the structure of the protein? This is where the current paper comes in, and the answer is a resounding YES. This constitutes the great finding in this paper. Fortunately for Marks, Colwell and friends, all the technology needed to prove this result had already been developed. For instance, the work of two labs I've worked in has been instrumental in honing the insights of contact analysis. Ken Dill has shown that contacts can define a well-defined protein-folding landscape. Michael Lappe has shown that a paltry 8% of native contacts is enough to reconstruct the structure of a protein to within 5 Å. As well, the NMR community has developed robust algorithms to generate structures from distance constraints, such as those in the venerable program CNS.
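Since so much here hinges on "native contacts", a minimal sketch of how a contact map is usually read off a known structure may help. The 8 Å Cα cutoff and the minimum sequence separation below are common conventions that I am assuming for illustration, not values taken from the paper, and the coordinates are a made-up hairpin.

```python
import math


def contact_map(ca_coords, cutoff=8.0, min_separation=3):
    """Return the set of residue pairs (i, j), i < j, in contact.

    Two residues count as in contact when their C-alpha atoms lie within
    `cutoff` angstroms and they are at least `min_separation` apart in
    sequence. Both thresholds are illustrative assumptions.
    """
    contacts = set()
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


# Hypothetical C-alpha trace: two antiparallel strands 5 angstroms apart.
hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]

if __name__ == "__main__":
    for pair in sorted(contact_map(hairpin)):
        print(pair)
```

A set of pairs like this is the ground truth that predicted contacts get compared against, and it is also the kind of input a distance-constraint structure generator works from.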

Marks, Colwell and friends showed that the highly correlated pairs identified with the DCA measure of coevolution, when used as native-contact distance constraints, generate structures that are within 3-5 Å of the native structure for a diverse bunch of protein families, some as long as ~250 amino acids. (You should read the paper to get a better handle on the accuracy.) This is stunning accuracy compared to other ab-initio protein folding algorithms, given that it works across a diverse bunch of proteins, requires only a few hours of calculation on a single machine, and is incredibly robust.
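The step from ranked coevolving pairs to distance constraints can be sketched roughly as follows. This is my own illustration under stated assumptions, not the paper's pipeline: the function names, the sequence-separation filter, the 8 Å upper bound and the toy scores are all hypothetical.

```python
def select_restraints(scored_pairs, n_top, min_separation=5, upper_bound=8.0):
    """Turn scored residue pairs into upper-bound distance restraints.

    `scored_pairs` is an iterable of (i, j, score). Pairs closer than
    `min_separation` in sequence are skipped because they are trivially
    near each other in space. All thresholds are assumed values.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    restraints = []
    for i, j, score in ranked:
        if abs(i - j) < min_separation:
            continue
        restraints.append(
            {"i": i, "j": j, "max_distance": upper_bound, "score": score}
        )
        if len(restraints) == n_top:
            break
    return restraints


def precision(restraints, native_contacts):
    """Fraction of restraints whose residue pair is a true native contact."""
    if not restraints:
        return 0.0
    hits = sum(
        (min(r["i"], r["j"]), max(r["i"], r["j"])) in native_contacts
        for r in restraints
    )
    return hits / len(restraints)


# Hypothetical coevolution scores and native contacts for a toy protein.
toy_scores = [
    (0, 1, 0.9), (2, 10, 0.8), (3, 20, 0.7),
    (5, 7, 0.6), (4, 30, 0.5), (1, 15, 0.1),
]
toy_native = {(2, 10), (4, 30)}

if __name__ == "__main__":
    chosen = select_restraints(toy_scores, n_top=3)
    print([(r["i"], r["j"]) for r in chosen])
    print(precision(chosen, toy_native))
```

The restraint list would then be handed to a structure generator of the kind the NMR community uses, and the precision check is only possible when a native structure is available for comparison.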

Of course, an accuracy of 3 Å is probably the best that one could expect from this method. Since coevolution identifies contacts that are common across a protein family, the only native contacts you would expect to recover are those common to all the proteins in the family. Highly variable regions will not leave a sufficiently strong evolutionary trace.

This method is incredibly exciting because it has worked with longer proteins than other methods (~250 amino acids), and it stands a good chance of attacking even longer proteins. Much as I admire the work of David Baker's Rosetta for fragment-based folding and D. E. Shaw's Desmond for brute-force MD-based folding, these methods have been restricted to small proteins of 100 amino acids or less, with no clear indication that they can elegantly break beyond the complexity wall to larger proteins.

Besides, both these systems are based on old insights. The breakthrough of Rosetta's fragment analysis is 10 years old. Desmond is a technical marvel but rests on a modeling approach that's more than 30 years old. I've been working on coevolution analysis for the last 5 years, so I know well the morass of statistical aberrations in covariation analysis of multiple sequence alignments. The clarity of the DCA measure of coevolution represents a genuine modern breakthrough using highly sexy machine-learning techniques, and the robust nature of the results should make it the canonical method for generating accurate lo-res structures of any protein sequence.