We can fold it for you wholesale

There is a 40-year-old problem that, if solved, could unlock many secrets of the human genome. The solution of this problem might even enable us to design drugs entirely with computers. This problem is known as the protein-folding problem.

Alas, the protein-folding problem has turned out to be much harder than anyone thought, and although many theoreticians have claimed to have solved it, none of the claims has been entirely satisfactory. This problem has proven so difficult that the protein community has even gone so far as to create a competition where purported solutions can be publicly judged. The competition was created as much to stem the tide of half-baked claims as to spur progress in the field.

So what is protein folding?

A protein is made up of a long chain of hundreds of amino acids, strung together in a row. The exact sequence of amino acids defines the protein and this information is stored in our DNA. Each stretch of DNA that codes for a single protein is first copied into a strand of messenger RNA. Inside the cell, the ribosome, a massive molecular machine, crawls along this RNA strand, feeling its way from base to base. Reading the bases three at a time, the ribosome grabs the corresponding amino acid from the surroundings and splices the amino acids together.
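The ribosome's lookup step can be sketched in a few lines of Python. This is only a toy illustration: the cell reads a messenger-RNA copy of the gene three bases (one codon) at a time, and the table below lists just the handful of codons from the standard genetic code that the example needs. The function name `translate` and the example message are chosen here for illustration.

```python
# Illustrative subset of the standard genetic code (mRNA codons).
# Only the codons used by the example message are listed.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start signal
    "GUG": "Val",
    "CAC": "His",
    "CUG": "Leu",
    "ACU": "Thr",
    "CCU": "Pro",
    "GAG": "Glu",
    "UAA": None,   # stop signal: the chain is released
}


def translate(mrna):
    """Read the message three bases at a time and return the amino acid chain."""
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError("codon not in this toy table: " + codon)
        residue = CODON_TABLE[codon]
        if residue is None:  # stop codon reached
            break
        chain.append(residue)
    return chain


message = "AUGGUGCACCUGACUCCUGAGUAA"
print("-".join(translate(message)))  # Met-Val-His-Leu-Thr-Pro-Glu
```

A full table would have 64 entries; the point is only that the sequence of amino acids is fixed entirely by the sequence of bases.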

The newly-made chain of amino acids, the protein, will ooze out of the ribosome as a long bilious strand, which will then collapse into a solid little granule, many times smaller than the fully extended length of the protein strand. This process has been dubbed protein folding.

It is a general rule that a protein of a certain sequence will always fold into the same 3-dimensional arrangement of atoms. This arrangement is known as the protein structure. To fold into the same structure all the time, the protein has to avoid getting tangled with other molecules, and avoid getting tangled up with itself in the wrong way. That such very long strands of amino acids fold into a solid little granule was one of the great surprises of molecular biology and owes everything to the work of Max Perutz.

The structure in the crystal

It was Max Perutz who, in 1959, derived the first protein structure, that of hemoglobin, after years of blood, sweat and tears. And not just any blood, but whale blood, which was cheap and plentiful, from which Perutz extracted his precious samples of hemoglobin. Years earlier, Perutz had mixed an extract of hemoglobin in a chemical solution and coaxed a delicate flake of crystalline hemoglobin out of the solution. This was exciting because if you have a crystal then you might be able to use x-rays to get the protein structure. In a crystal, all the granules of hemoglobin are ordered in a regular spatial pattern, like the patterns on cheap hotel carpets. These crystals scatter incoming x-rays into a complex pattern of spots, the diffraction pattern, from which you might be able to calculate the 3-dimensional atomic structure of the molecule.

The problem was that hemoglobin was almost 1000 times bigger than any molecule that had previously been solved by x-ray crystallography. It took Perutz almost 20 years to develop the techniques needed to get a good enough diffraction pattern. When he finally deciphered the protein structure, and this was in the infant age of computers, Perutz had to draw the 3-dimensional structure by hand, on glass plates stacked on top of each other. Using the techniques pioneered by Perutz, protein crystallographers have since solved some 22,448 protein structures, all of which can be found in the Protein Data Bank.

The protein folding problem, formalized

There are many research groups out there working on the protein-folding problem, each with its own partial solution. Politics, infighting and intrigue are common, as claims and counterclaims proliferate in the journals. So it was perhaps inevitable that the protein community got together about 10 years ago and created the CASP competition, or the Critical Assessment of Structure Prediction. Every 2 years, CASP pits research groups directly against each other in order to figure out who can best fold a protein.

CASP has effectively formalized the protein-folding problem. Essentially, the problem is this: CASP gives you a protein sequence, a sequence of amino acids. You then have to predict the 3-dimensional protein structure that the sequence must fold into. During the competition, a number of protein sequences are published on the CASP website. These are provided by crystallographers who have just solved the protein structure, but haven't published the structures yet. Protein-folding experts are challenged to submit their predictions of the protein structure. When the crystallographer's structure is published, the sequence is withdrawn from the competition.
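To make "comparing a prediction with the crystallographer's structure" concrete, here is a minimal Python sketch of one common yardstick, the root-mean-square deviation (RMSD) between matching atoms. It is not CASP's actual scoring scheme, and it assumes the two structures list the same atoms in the same order and have already been superimposed. The function name `rmsd` and the coordinates are invented for illustration.

```python
import math


def rmsd(predicted, observed):
    """Root-mean-square deviation between two lists of (x, y, z) positions.

    Assumes both lists describe the same atoms in the same order and that
    the structures are already superimposed (no rotation or shift is fitted).
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty structures of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(predicted, observed):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(predicted))


# Hypothetical three-atom fragments, coordinates in angstroms.
crystal = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
guess = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 2.0, 0.0)]
print(round(rmsd(guess, crystal), 2))
```

The smaller the number, the closer the prediction sits to the experimental structure; a perfect prediction scores zero.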

Predicting the structure of such a protein is a genuine blind test of a solution to the protein-folding problem. So much so that CASP cheekily refers to the competition as an "experiment" where the protein-folders themselves are the experimental subjects.

Gentlemen, start your servers

At the last round of CASP in 2002, which ran for 4 months, 67 different protein sequences were put in competition. In total, 187 different research groups submitted 22,909 predictions.

Like the prince who had to find the foot that would squeeze into the glass slipper that was left behind by Cinderella, the judges of CASP had to find the entry that would squeeze into the protein structure that was solved by the protein crystallographer. And just as the prince found that fitting a good pair of glass slippers required a keen sense of aesthetic judgment, so the judges of CASP found that "classification of protein structure is a rather subjective art." Indeed, part of the judges' role is to provide that special human touch where "bonus or penalty points of +/- 0.5 are given in special situations".

Fortunately, the judges include some of the most respected protein scientists in the world. After much gnashing of teeth, the results are decided, a meeting is convened, and the results are announced, with the 187 research groups numerically ranked in order of success.

In science, it is rare that there is such a brutally direct assessment of success and the results of CASP are a source of acute professional anxiety for many of the contestants. To add to the tension, the results are announced at a special meeting where many bitter rivals are packed in the same room.

In general, we can look at two different proteins and compare just their amino acid sequences. If your sequence is very similar to another sequence that already has a structure in the PDB, you can use the structure of the other protein as a template to guide your prediction. But if your sequence has no relatives amongst the 22,448 known structures, then you are in unexplored territory. Such proteins are classed under the New Fold category in CASP and constitute the most horrifically difficult test for any protein-folding prediction. In this category, a frontrunner has emerged.
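The idea of hunting for a relative with a known structure can be sketched in Python as follows. This is a deliberately crude illustration: it assumes the sequences have already been lined up and are of equal length, whereas real comparisons must first align sequences of different lengths. The function names, the one-letter sequences, the library entries and the 30% cutoff are all hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two pre-aligned sequences agree."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def pick_template(target, library, threshold=30.0):
    """Return (name, identity) of the closest known relative.

    If nothing in the library reaches the threshold, the name is None,
    which corresponds to the "no relatives" situation.
    The 30% default is an illustrative cutoff, not an official one.
    """
    best_name, best_score = None, 0.0
    for name, sequence in library.items():
        score = percent_identity(target, sequence)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Made-up one-letter amino acid sequences standing in for solved structures.
known = {"relative": "MVHLTAEE", "stranger": "GGSGGSGG"}
print(pick_template("MVHLTPEE", known))  # ('relative', 87.5)
```

When the function returns no name at all, the predictor is in New Fold territory and has no template to lean on.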

The guts of Rosetta

In the last two CASP meets, David Baker from the University of Washington, using his program Rosetta, has come first by a hefty margin in the New Fold category. The success of Rosetta has electrified the protein-folding community.

Yet, there are theorists out there who feel slightly queasy when poking through the innards of Rosetta. Theorists such as Wilfred van Gunsteren write programs such as GROMOS, which have the richness of 17th century Dutch paintings. Just as Vermeer was fetishistically obsessed with painting every detail of the Dutch bourgeoisie, right down to the hem-line of the chamber-maid's dress, GROMOS is obsessed with modeling every detail of 21st century atomic physics, right down to the quadrupole expansion of the electron shells of polarizable atoms. The problem with programs like GROMOS is that they are lumbering giants, bloated programs that devour all the computing that you could ever offer, and still beg for more. Although GROMOS is used for many things, attempts to fold a protein have lurched to a stuttering halt, even after months of computing time.

Programs like Rosetta, on the other hand, are more like Impressionist paintings, virtuoso dabs of paint that trick the eye into seeing a protein fold in no time at all. For instance, whereas GROMOS fastidiously models all 6 atoms in carbon rings attached to the protein and each atom in the ring is allowed to wobble, Rosetta models the carbon ring as one fat unmovable atom. Water molecules surrounding the protein? No problem, says Rosetta, we'll just ignore them. Rosetta also uses a clever trick by folding similar proteins from different species of animals, and then averaging all the structures to obtain a consensus structure. In reality, when proteins like hemoglobin fold inside your body, they don't get to watch how hemoglobin folds in rats or flies in order to come to a consensus.
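The "one fat atom" shortcut can be pictured with a small Python sketch. This is not Rosetta's code, only an illustration of the general idea of coarse-graining: six ring atoms are replaced by a single point at their geometric center, so the program tracks one position instead of six. The function name `centroid` and the idealized flat ring (about 1.39 angstroms from center to each carbon) are assumptions made for the example.

```python
import math


def centroid(atoms):
    """Geometric center of a list of (x, y, z) atom positions."""
    n = len(atoms)
    if n == 0:
        raise ValueError("need at least one atom")
    return tuple(sum(atom[i] for atom in atoms) / n for i in range(3))


# An idealized flat six-carbon ring: atoms at the corners of a regular
# hexagon, 1.39 angstroms from the center (illustrative geometry only).
ring = [
    (1.39 * math.cos(math.radians(60 * k)),
     1.39 * math.sin(math.radians(60 * k)),
     0.0)
    for k in range(6)
]

# The detailed model keeps all six positions; the coarse model keeps one.
pseudo_atom = centroid(ring)
print(len(ring), "atoms replaced by 1 pseudo-atom")
```

The saving looks small for one ring, but repeated across every ring in every protein, and combined with ignoring the surrounding water, it is the difference between a calculation that finishes and one that does not.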

Protein folding meets biology

The solution of the protein-folding problem promises to revolutionize molecular biology. Currently, molecular biologists must get down-and-dirty with real living organisms to study proteins. To get a sense of what is involved, Natalie Angier recounts in her superb study Natural Obsessions: Striving to Unlock the Deepest Secrets of the Cancer Cell, how Robert Weinberg and his co-workers discovered the action of the RAS protein in cancer. They first had to extract cancerous DNA from diseased human cells, which they dissolved into thousands of pieces by using stomach-juice extracts. They then made tiny viruses suck up these pieces. By infecting bacteria with these viruses, they tricked bacteria into making the foreign proteins coded by the cancerous DNA pieces. Crushing the bacteria, they then extracted and tested thousands of unknown proteins until they found the one protein that induces cancer in healthy cells.

Robert Weinberg has shown that there is a mutation of some 10 atoms that turns a healthy RAS protein cancerous. That's a tiny difference in a molecule made up of tens of thousands of atoms. It's this kind of detail we need to really understand how diseases work. Weinberg would dearly love to see the structure of these proteins, which might provide clues for designing drugs that could neatly slot inside the RAS protein and stop it cold.

But if Rosetta has licked the protein-folding problem, then we won't need experiments to figure out what these proteins do. Instead, we could predict the structure of the RAS protein in a computer. With a deft click of the mouse, we could simulate how the cancerous RAS protein interacts with any other proteins in the cell.

Why stop there? With sufficient computing power, we could fold every single protein encoded in every gene of every genome that we can lay our hands on. We could fold our genomes wholesale.

Arriving at the Promised Land

If the promise is to accurately predict the structures of proteins from their sequences alone, can we ever hope to seduce molecular biologists into abandoning their lab benches for a desktop computer in order to study complex biological processes? To do that we will have to convince them that the calculated structure is reliable. Really, really reliable. In order to study, say, the cancerous mutations of the RAS protein, they will have to be sure that the position of every atom in the predicted structure from some future protein-folding program is right. And the only way to know this is CASP.

At the last CASP, a hypothetical perfect prediction would have got you a score of 26. The best score was David Baker's Rosetta with 14. Everybody else got 7 or less.

They say that in the valley of the blind, the one-eyed man is king. Rosetta, with a hit rate of roughly 1 in 2, is undeniably the one-eyed king of the valley of the protein-folders. The rest are just stumbling in the dark. Nevertheless, we are still waiting for the coming of the two-eyed man who will be able to predict his way right out of the valley. © 2004