A Decade in Computational Structural Biology

I shake my head and suddenly realize that almost a decade has passed since I first became a computational structural biologist. Feeling that I've earned some kind of perspective, I hereby present to you a list of significant developments in computational structural biology of the noughties:

1. Rosetta.
===========

There is no doubt in my mind that the greatest contribution to computational structural biology in this decade is the suite of programs known as Rosetta, written by David Baker and his lab.

Look at the accomplishments. Rosetta has consistently won the CASP competition for protein folding (see next item) in the new fold category. Nothing else comes close in deducing the structure of a protein from an arbitrary protein sequence. Rosetta was used to create a protein with the first artificial fold. Rosetta was used to design the first artificial enzymatic reaction in a protein.

Each of these accomplishments is staggering in itself, but finding a single computational core across all three makes Rosetta the crown jewel of computational structural biology. I've said it before and I'll say it again: David Baker is on the fast track to Stockholm.

2. Formally Organized Competitions
==================================

Every two years, a whole bunch of computational structural biology labs effectively shut down for business and throw every man, woman and workstation together to attempt to crack a set of problems. This same set of problems is simultaneously being attempted in labs all around the world, as researchers race against the clock to predict the three-dimensional atomic structure of protein sequences published on the CASP protein-folding competition website.

We often say that science is a competition, but it is astonishing to me how the computational structural biology community has embraced formally organized competitions such as CASP. Here, we have pure naked competition, complete with a scoring system, judges, and rankings that determine winners and losers. It has all the drama that you'd expect from a reality TV show: recriminations, anger and tears. And it has taken the field of protein folding much farther than anyone would have imagined 10 years ago.

I still remember the late '90s, when the field of protein folding was an epistemological joke. Every odd month, some theorist would publish an article claiming to have solved protein folding "in principle", yet none of them actually predicted the structure of a sequence with no known structure. Of course, no one was actually stupid enough to make such a prediction in print. Others, more pessimistic, were even saying that it was an unsolvable problem, like the theory of nuclear energy shells. Indeed, the results of the very first CASP showed just what a disaster the protein folding field was at the time. A lot of prominent theorists got egg on their faces.

In hindsight, we can see that CASP worked along a very Darwinian idea: that of defining a fitness landscape. In the field of genetic algorithms, you can use random mutations to find efficient algorithms, but only if you have a robust fitness function. The field of protein folding had been drifting along in some kind of crappy fitness valley, and it wasn't until CASP came along that we could even define what protein folding was in a concrete, definitive way. The targets of CASP could be used to define a good fitness function, which was enough to spur the field to scramble out of the valley and up a fitness peak. This approach has proven so successful that we have now embraced formal competitions for other problems of similar difficulty: CAPRI for protein-protein interaction, and SAMPL for ligand binding.

3. Distributed computation.
===========================

First there were Beowulf clusters on commodity Linux boxes, and now we have cloud computing with Folding@Home and Amazon. Either way, it's become standard practice for computational structural biologists to think beyond a single workstation and fall in love with distributed computing.

As the hardware has gone parallel, so have our simulations. All the standard molecular-dynamics packages are locked in an arms race to squeeze the best performance out of parallel clusters. It is hoped that one day they will actually scale. As the capacity of our calculations has gone up, so has the quality of our simulations. It is virtually de rigueur these days to use explicit solvent in molecular-dynamics simulations.

Nevertheless, there is still a huge way to go before we can happily sample large protein systems to the point where we can merrily pluck out interesting but rare conformational fluctuations from an untainted molecular-dynamics simulation. Until that time comes, be prepared to apply a whole cottage industry of thermodynamic techniques that trick the system into generating those precious rare conformational changes in much less simulation time.

These techniques break up into two groups. The first group of techniques, such as replica exchange, sets up the simulations with weird temperature distributions so that large motions can be generated easily. The second group of techniques, such as thermodynamic averaging and master equations, tries to normalize these weird thermodynamic distributions back down to well-behaved free-energy equilibrium distributions, which one can then compare with experiments. Either way, it's wise to brush up on your math before you wade into these extremely mathemagical waters.
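
To make the first group a little more concrete, here is a minimal sketch of the swap test commonly used in temperature replica exchange: two replicas running at different temperatures trade configurations with a Metropolis-style probability. The function name, the energy units (kcal/mol) and the example numbers are assumptions chosen for illustration, not taken from any particular simulation package.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K); unit choice is an assumption


def swap_probability(energy_i, energy_j, temp_i, temp_j):
    """Metropolis probability of swapping configurations between two replicas.

    energy_i and energy_j are the potential energies of the configurations
    currently held at temperatures temp_i and temp_j (in kelvin).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    exponent = (beta_i - beta_j) * (energy_i - energy_j)
    # Always accept favorable swaps; otherwise accept with Boltzmann weight.
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


# Hypothetical example: the colder replica already holds the lower energy,
# so the swap is only accepted some of the time.
p = swap_probability(energy_i=-100.0, energy_j=-90.0, temp_i=300.0, temp_j=350.0)
print(f"swap accepted with probability {p:.3f}")
```

If the hot replica happens to hold the lower-energy configuration, the swap is always accepted, which is how well-folded conformations found at high temperature trickle down to the temperature you actually care about.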

4. Elastic Network Models
=========================

One running theme in computational structural biology is the search for simple models of proteins. These range from single-bead models to Gō models to two-dimensional lattice models. Ultimately, the litmus test of a simple model is that it provides useful information for researchers in other areas of computational structural biology.

Over the last decade, I've found that the only simple model that provides concrete information for researchers outside the field of simple models is the Elastic Network Model. It is the model where, given a protein structure, you replace all atomic contacts and covalent bonds with simple harmonic springs. You then obtain a network of springs. From this network, it is trivial to extract the dominant vibrational mode.

The motions deduced from these vibrational modes turn out to be exceedingly useful, and they have been shown to reproduce the motions from much longer molecular-dynamics simulations. Elastic Network Models have been used to deduce virus capsid maturation and ribosome motions, systems that are way too large to study in any other way. The speed of the calculation and the fidelity of the results make this an essential tool for computational structural biologists.
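
As a rough illustration of how little machinery this needs, here is a sketch of the scalar (Gaussian Network Model) flavor of the idea: every pair of beads within a cutoff is joined by an identical spring, and the slowest non-trivial mode of the resulting connectivity matrix is extracted. The function names, the 4 Å cutoff and the six-bead toy chain are all assumptions for illustration. Real studies typically use the full 3D version with a proper eigensolver; the power iteration here is just a stdlib-only stand-in, and it assumes the network is connected.

```python
import math
import random


def kirchhoff(coords, cutoff):
    """Connectivity matrix: -1 for each pair within cutoff, contact count on the diagonal."""
    n = len(coords)
    gamma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1.0
                gamma[i][i] += 1.0
                gamma[j][j] += 1.0
    return gamma


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def slowest_mode(gamma, n_iter=2000, seed=0):
    """Return (eigenvalue, eigenvector) of the lowest non-zero mode.

    Power iteration on (shift*I - gamma), with the trivial uniform mode
    projected out at every step.
    """
    n = len(gamma)
    # Gershgorin bound: no eigenvalue exceeds twice the largest diagonal entry.
    shift = 2.0 * max(gamma[i][i] for i in range(n))
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the zero-frequency uniform mode
        gv = matvec(gamma, v)
        v = [shift * x - g for x, g in zip(v, gv)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    eigenvalue = sum(x * g for x, g in zip(v, matvec(gamma, v)))
    return eigenvalue, v


# Toy example: six beads in a straight line, 3.8 angstroms apart.
chain = [(3.8 * i, 0.0, 0.0) for i in range(6)]
eigenvalue, mode = slowest_mode(kirchhoff(chain, cutoff=4.0))
print(eigenvalue, [round(x, 3) for x in mode])
```

For the toy chain, the two ends of the slowest mode move in opposite directions, which is the kind of large-scale, hinge-like motion that these models are good at picking out.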

5. Fragment binding to proteins
===============================

Ligand binding is the great computational problem of the pharmaceutical industry. We already have enough solved structures to provide plentiful targets for drug activity, but without a way to calculate exactly how ligands bind, this doesn't get us very far. Given the commercial nirvana lying on the other side of solving the ligand binding problem, there has been a ton of work on improving ligand-binding prediction systems. Still, the improvements in the field have proceeded at a glacial pace, where the only novel ideas seem to be to get faster computers and to add more expensive dynamics to the simulations.

However, I've found one approach that seems to be genuinely different. This is the idea of using chemical fragments. Instead of studying how a ligand binds, you break up the ligand into its constituent chemical fragments, which are conformationally exceedingly simple. Then you study how the fragments bind to a protein. By studying the binding pattern of fragments, you can back-track to work out what ligand could have bound to it. Although this approach hasn't yet led to a radical overhaul of the ligand-binding problem, I am tipping this as a method to watch.

6. NMR measurements of residue dynamics
=======================================

One of the reasons we do molecular-dynamics simulations is that these simulations provide the only way of seeing how proteins move at the atomic level of detail.

Nevertheless, it is a massive leap of faith to trust only in the simulations. Over the last decade, although NMR has fallen out of favor for solving protein structures, NMR methods have proven their worth in providing detailed information about the motions of proteins. These include the S² order parameters for nanosecond motions, chemical shifts to detect secondary structure, and exchange parameters to measure microsecond transitions in chemical shifts.
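
To show how such a measurement can be checked against a simulation, here is a minimal sketch that estimates an order parameter for one bond vector from a set of trajectory frames, using the standard long-time-limit expression S² = (3/2) Σ ⟨μ_a μ_b⟩² − 1/2 over the Cartesian components of the unit bond vector μ. The function name and the toy inputs are assumptions for illustration, and the sketch assumes the frames have already been superimposed to remove overall tumbling.

```python
import math


def order_parameter(bond_vectors):
    """Estimate S^2 from a list of (x, y, z) bond vectors, one per frame.

    Assumes overall rotation of the molecule has already been removed.
    Returns 1.0 for a perfectly rigid bond and 0.0 for isotropic motion.
    """
    n = len(bond_vectors)
    units = []
    for v in bond_vectors:
        norm = math.sqrt(sum(c * c for c in v))
        units.append([c / norm for c in v])
    total = 0.0
    for a in range(3):
        for b in range(3):
            average = sum(u[a] * u[b] for u in units) / n
            total += average * average
    return 1.5 * total - 0.5


# Hypothetical frames: an N-H bond wobbling slightly around the z axis.
frames = [(0.1, 0.0, 1.0), (-0.1, 0.0, 1.0), (0.0, 0.1, 1.0), (0.0, -0.1, 1.0)]
print(order_parameter(frames))
```

Computing this per residue gives a profile that can be laid directly over the experimental one, residue by residue.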

Since careful NMR experiments can resolve signals to individual residues, these measurements can give exceptionally precise dynamic measurements of proteins in solution. Recently, NMR Residual Dipolar Couplings were used to measure microsecond motions for every single residue in ubiquitin. We now have a very rich set of measurements that capture dynamic motions of proteins. These provide virtually the only checks on the wild world of long-time molecular-dynamics simulations.

7. The inability to go beyond partial charges.
==============================================

In molecular dynamics, we've been promised more sophisticated force fields for a very long time now. Whilst there is no doubt that things such as dipoles and quadrupoles can be programmed, it turns out that there is a very steep speed tradeoff. It may be time to give up on the dream and accept that single partial charges may be sitting at some kind of sweet spot of simulation realism versus efficiency.

8. Conformational changes of single molecules
=============================================

Experiment has raced far ahead of simulation in recent years. In particular, two techniques have been developed that directly measure conformational changes of individual protein molecules.

Fluorescence Resonance Energy Transfer (FRET) can measure the distance between chromophore pairs simply by the amount of fluorescence emitted by the chromophores. If you can attach the chromophores to two residues in a protein, you can measure their distance distributions in solution.
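
The conversion from signal to distance rests on the Förster relation, in which the transfer efficiency falls off with the sixth power of the distance: E = 1 / (1 + (r/R0)^6). Here is a small sketch of that relation and its inverse. The function names and the 50 Å Förster radius are assumptions for illustration; the real R0 depends on the chromophore pair, and turning raw intensities into an efficiency needs corrections that are not shown here.

```python
def fret_efficiency(distance, forster_radius):
    """Transfer efficiency for a chromophore pair separated by `distance`.

    Both arguments must use the same length unit (e.g. angstroms).
    """
    return 1.0 / (1.0 + (distance / forster_radius) ** 6)


def distance_from_efficiency(efficiency, forster_radius):
    """Invert the Forster relation to recover the distance from an efficiency."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return forster_radius * (1.0 / efficiency - 1.0) ** (1.0 / 6.0)


# Hypothetical chromophore pair with an assumed Forster radius of 50 angstroms.
R0 = 50.0
for r in (30.0, 50.0, 70.0):
    print(r, round(fret_efficiency(r, R0), 3))
```

The steep sixth-power dependence is what makes the technique so sensitive around R0, and nearly blind to distances much shorter or longer than that.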

In Atomic Force Microscopy (AFM), sophisticated servo-motors can gently tease apart single proteins using exquisitely tiny forces. Incredibly detailed force-extension profiles of individual proteins can be measured.

These measurements have provided incredibly rich insights, especially into systems where these large conformational changes can be linked to biologically relevant processes.

9. Reconstruction of long dead ancestral proteins
=================================================

For me, the most imaginative technique developed in the last decade is the black art of reviving ancestral proteins.

If we look at the spectrum of living things before us, we see only the furthest leaves in the tree of life. However, bioinformatics has given us techniques to extract the historical traces of evolution from our knowledge of protein sequences. By comparing homologous proteins across species, we can use sequence alignments to construct phylogenetic trees between the species.

The principal insight of groups who study ancestral proteins is that, with our sophisticated sequence alignment programs, not only can we construct phylogenetic trees, but we can also reconstruct the actual sequence of the last common ancestor at each fork of the tree.
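
To give a feel for how an ancestor can be inferred at a fork, here is a toy sketch using the bottom-up pass of Fitch parsimony, a much simpler stand-in for the statistical methods that real reconstructions rely on. At each fork and each alignment column, the ancestor keeps the residues shared by both children, or, if they share none, the union of the two. The tree, the four-residue sequences and the function names are all invented for illustration.

```python
def fitch_sets(node):
    """Bottom-up Fitch pass.

    A node is either an aligned sequence (str) or a pair (left, right) of
    child nodes. Returns one set of candidate residues per alignment column.
    """
    if isinstance(node, str):
        return [{residue} for residue in node]
    left, right = (fitch_sets(child) for child in node)
    # Keep the intersection if the children agree on anything, else the union.
    return [(a & b) or (a | b) for a, b in zip(left, right)]


def ancestral_sequence(node, ambiguous="X"):
    """Candidate ancestor at the root of `node`; ambiguous columns are marked."""
    return "".join(
        next(iter(candidates)) if len(candidates) == 1 else ambiguous
        for candidates in fitch_sets(node)
    )


# Hypothetical aligned sequences from four living species.
tree = (("MKVL", "MKIL"), ("MRVL", "MKVL"))
print(ancestral_sequence(tree))
```

Columns where the descendants disagree with no majority are left ambiguous, which is exactly where the more sophisticated likelihood-based methods earn their keep.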

These ancestral proteins can be cloned and crystallized. These groups have focused on proteins that bind to different ligands in different species. By carefully studying the changes in binding and in the structure over the phylogenetic tree, we can figure out how nature has sculpted the structure of a protein to accommodate the speciation of different organisms over millions of years of evolution.