How the hell do I model this thing?

A couple of months ago, I had a great email exchange with Andrey Voronkov from Moscow State University about simulation methods. Andrey provided a lot of great stuff (and handy references) and was happy enough to make it public. Now that I have a little time in my job-hunting hell, I've put it up for your reading pleasure.

Bosco: Hi Andrey, just to start off, what's your system?

Andrey: Hi Bosco, well, I have a number of molecular models derived by homology based modeling from X-ray template with 1.36 A resoulution. These models are extracellular dimeric domains of GPCR. No natural or synthetic ligands for these domains is know ntil now, but one small molecule ligand and one peptide (beta-amyloid) are supposed to bind to these domains.

Bosco: GPCR is a very popular domain indeed. That's a very smart move to choose to work on one of these proteins, as it's been estimated that 70% of drug targets in pharmaceutical companies aim for one of these babies. This is a project with EMPLOYMENT POTENTIAL written all over it!

Andrey: In my studies I have determined that these dimeric domains structures have several potential binding sites both for small organic molecules and for beta-amyloid. But structures of these dimeric CRD-domains have become rather different from template after homology based modeling and some of them lack the binding sites mentioned above. One of the binding sites (the biggest and maybe most important) is on the dimeric interface.

Bosco: It looks like you've done some nice experimental work as membrane proteins are extremely tricky to work with. Yes, homology models only go so far. They are in the end, heuristically based, and most homology modeling systems assume that you're dealing with monomers. If you're worried about dimers, you'll have use some form of molecular dynamics, where you can lean on atomic force-fields to give you some kind of semi-reliable physics to go beyond the homology assumptions.

Andrey: So specifically, I'd like (1) refine the dimeric domains of these homology models in explicit solvent; (2) evaluate the dimers stability perhaps calculating the free-energy; (3) identify the binding site of my beta-amyloid (17 aminoacids) ligands; (4) refine the whole GPCR assembly in a lipid bilayer.

Bosco: So unpacking all this: you're working on the protein-protein problem, the ligand-binding problem, and running MD on a huge transmembrane assembly. The first two problems are theoretical, and the last problem is purely about computing grunt. How much grunt do you have?

Andrey: Unfortunately, even with supercomputing power at Moscow State University I can reach only maximum 50-100 nanoseconds depending on how big is the target. This maybe can work for small molecule ligands or for structures refinement, which have good amino acid identity with the template (for example >70-80%), but not for dimer stability evaluation, not for beta-amyloid drift on the surface of protein, not for the full receptor dynamics and not for the models with 30-50% of aminoacid identity which very likely to have serious structural changes and require partial folding simulations.

Bosco: Yes, now I understand your interest in appoximation models.

Andrey: So the methods which I'm considering, in order of familiarity, is first, implicit solvation models. This can increase timescale in order of magnitude. But I've already tried it and some binding sites undergo serious deformations even in X-ray structure. Maybe I should use another force field (AMBER99SB but not AMBER03) for this purpose.

Bosco: Oh, I've made a cottage industry of using implicit solvent models. It's mainly because I haven't worked on a project that requires high accuracy in sidechain conformation. In the interior, the tight packing of residues, and the shielding from water, makes it less of a problem. But for electrostatic residues on the surface, the implicit solvent model is absolutely awful. Wrong salt-bridges are formed as vacuum electrostatics can't really be eliminated. I wouldn't trust surface residue conformations in implicit solvent.

Andrey: Second, increase the computing power by usage of distributed computing of Drugdiscovery@home project in which I am participating right now – can achieve about 4-5 Tflops corresponding to 2-3 thousands of PC. This can enable to make MD simulations with explicit solvent up to 1 microsecond for 1 biotarget. But is 1 microsecond enough?

Bosco: I've never played around with those screen-saver type things, but I can imagine that the social problems are as much as the technical problems. Still, does this mean you have to phrase your simulation in an ultra-parallel manner? Curious to hear more when you get the chance.

Andrey: Third, coarse-grained molecular-dynamics, followed by more precise calculations. But I am affraid that in this case the deformations of binding sites can be even much more than in implicit solvent. Coarse grained calculations will give very serious errors which will not allow me to understandd the exact positions of ligands, peptide and protein structures. But these may be only reasonable for monomerization/dimerization studies. Here are some references:

  • Liwo A., Czaplewski C., Ołdziej S., Scheraga H.A., Computational techniques for efficient conformational sampling of proteins. // Curr. Opin. Struct. Biol., 2008, V.18., P.134-139.
  • Sansom M.S., Scott K.A., Bond P.J., Coarse-grained simulation: a high-throughput computational approach to membrane proteins. // Biochem. Soc. Trans., 2008, V.36, P.27-32.

Bosco: Yeah, I'm not really sure about the utility of coarse-grained molecular-dynamics. In papers on coarse-grained dynamics, it seems the generated conformations are quite crude, and the rates are out by orders of magnitude. I think coarse-grained dynamics were used to take a peek at long time-scales, back in the day when computing resources were very limited. It's definitely worth a shot, but only to see if your candidate dimerization poses fall apart immediately or not. It can't tell you whether you're in the ball park. But it can definitely tell whether you're in the car park.

Andrey: Fourth, then there's your method: Rotamerically Induced Perturbations. It's a very interesting method for assessment of protein motions as far as I don't need the precise timescale for my tasks. But as I understand this methods assess only motions and can't be used for structure refinement and for prediction of the structures with best stability in the dynamic environment.

  • Ho B.K., Agard D.A., Probing the flexibility of large conformational changes in protein structures through local perturbations. // PLoS Comput. Biol., 2009, V. 5, e1000343.

Bosco: You're right. RIP doesn't generate accurate alternate conformations (although, whilst developing the method, I had crossed my fingers, and prayed to the great beings in the sky that it would). The distorted conformations are quite unrealistic. Still, I believe that RIP provides the cleanest way to identify surface loops/helices/hairpins that are mobile on the microsecond timescale. It is complete in the sense that the results are not dependent on the time of simulation, the protein either distorts to the perturbation, or not. Now one thing that I have found, is that RIP seems to be very good at finding dimer-interface loops when RIP is applied to the corresponding monomer structures. There is a body of theory that says that some dimerization occurs through the stabilization of flexible surface loops in the monomer. This stabilization could be the mechanism that explains the stability of the dimer. I am currently looking into this for a bunch of other proteins.

Andrey: As I understand in RIP, it is also possible to use explicit solvent, which can influence on the χ angles of side chains, not just the implicit solvent. But during the dynamic motions not only side chains can move and influence backbone, the solvent can induce also some significant changes in backbone structure, as backbone movement by itself, right?

Bosco: Yep. I've managed to get RIP working on explicit solvent, and the results are kind of discouraging. What ends up happening is that the water absorbs the perturbation from RIP'ed residue. And if the residue does hit a loop, the energy that is transferred to the loop immediately gets absorbed by the water surrounding the loop. There is practically no motion. In some sense, I was lucky that I developed RIP in an implicit solvent model. Because a surface-area approximation is used to model the water, there is no viscosity in the system. Thus a RIP perturbation in implicit solvent transfers all the added energy to the flexible loop, causing it to move.

Andrey: But at the same time with exact knowledge of binding sites of beta-amyloid and small ligands, is it possible to predict resulting protein structure movements upon ligand binding (if there are any)? Also is it possible to predict dynamics of the elements of full structure of GPCR receptors, especially if explicit solvent and lipid bilayer can be used?

Bosco: Alas, this is the frontier of the ligand-binding field. Researchers are reasonably confident that they can identify good poses for ligands, but only in the cases where the protein does not deform. There's a whole bunch of systems, specifically devoted to lignad-binding with small protein deformations. But progress is very slow.

Andrey: Okay, let's move on to something that's not molecular-dynamics methods CONCOORD. CONCOORD is a monte-carlo method that randomly generates structures which fulfil the distance constraints. Then essential dynamics analysis gives the data on main protein motions. For CONCOORD simulations some reference to the experimental results is required to chose the best structures if we want to use CONCOORD for protein structures refinement. What criteria should I use for example in dimerization and monomerization studies? Can it be the ratio of monomeric/dimeric forms in generated structures. CONCOORD doesn't count for the solvent molecules, which can stabilize some dynamic parts of the protein, right? To get stability information we should have some time-dependent distribution of conformations which is not accessible by CONCOORD (even if the timescale doesn't correspond to a real one). From another point of view some averaged structures can be generated from CONCOORD which can give a picture of protein structures with the highest stability.

  • de Groot B.L., van Aalten D.M., Scheek R.M., Amadei A., Vriend G., Berendsen H.J., Prediction of protein conformational freedom from distance constraints. // Proteins, 1997, V.29, P.240-251.

Bosco: I really like CONCOORD. It's a very simple but elegant idea. In some ways, I see CONCOORD as the algorithmic inverse of Elastic Network Models (ENM). In both CONCOORD and ENM, you start with the contact map of a protein. The difference is, in CONCOORD, a contact is turned into a distance constraint, whilst in ENM (see below), the contact is turned into a spring. With CONCOORD, the structure is allowed to rattle around within these distance constraints. Clearly, tightly packed regions with lots of mutually reinforcing distance constraints won't rattle very much, but loosely packed region will. Playing around with CONCOORD, I think CONCOORD captures the nanosecond regime of flexibility well, as CONCOORD generates as good a match with NMR S order parameters as any other system out there. It cannot really capture loop motions that RIP can. Indeed, you can see that by definition, neither CONCOORD nor ENM will be able to capture loop motions that move independently of the body of the protein. Such loop motions must go from a conformation where contacts are made with the body of the protein to one where the contacts are broken. In both CONCOORD and ENM, contacts are fundamental in their model. As such, I can't see how CONCOORD could help in studying the stability of a dimer, as any contacts in the dimer-interface would be converted into a distance constraint.

Andrey: Yes, I am also interested in Elastic Network Models (ENM). It requires MD simulations anyway or X-ray structure with derived B-factors. Anyway the structure should be refined before the calculation of forces. This method seems to be really interesting and, as I understand it, can simulate large protein motions. But it seems to have some opposite features to RIP because ENM counts only Cα dynamics whereas RIP makes emphasis on side chains, right? There is also a ENM variation that deals with explicit solvent models (REACH method).

  • Moritsugu K., Kurkal-Siebert V., Smith J.C., REACH coarse-grained normal mode analysis of protein dimer interaction dynamics. // Biophys J., 2009, V.97, P.1158-1167.
  • Moritsugu K., Smith J.C., REACH coarse-grained biomolecular simulation: transferability between different protein structural classes. // Biophys J., 2008, V. 95, P.1639-1648.
  • Moritsugu K., Smith J.C., Coarse-grained biomolecular simulation with REACH: realistic extension algorithm via covariance Hessian. // Biophys J., 2007, V.93, P.3460-3469.
  • Su J.G., Li C.H., Hao R., Chen W.Z., Wang C.X., Protein unfolding behavior studied by elastic network model. // Biophys J., 2008, V.94, P.4586-4596.

Bosco: I think I've said it before, but ENM has the most bang-for-bucks in the world of computational structural biology. Use of ENM just requires a structure. From the structure, you derive a contact map, and from the network of contacts, you create a network of springs. So what does ENM model? There's a whole bunch of literature that shows that the collective motion of the ENM network of springs gives a very good approximation of the collective motions of a domain structure. For accuracy, ENM performs as well as normal mode analysis in short-run molecular dynamics in, but in just a fraction of the time. Whilst there is some overlap of ENM with B-factors, and S order parameters, I am convinced that ENM explores very long timescale motions, such as domain rearrangements. I've seen some nice studies that show large differences in the collective modes between the monomer and its dimer. These have been used to understand dyanics of active, especially after the effects of dimerization.

Andrey: As to Monte Carlo methods Rosetta is a very interesting solution. But recently I've read a paper on Hamiltonian Replica Exchange Monte Carlo which seems to outperform significantly the Monte Carlo Simulated annealing of Rosetta and even Temperature Replica Exchange method. I think about usage of this method. But of course one of the interests is to model proteins with explicit water models influence.

  • Shmygelska A., Levitt M., Generalized ensemble methods for de novo structure prediction. // Proc. Natl. Acad. Sci., 2009, V.106, P. 1415-1420.

Bosco: Rosetta is amazing, and it's grown into a beast of a program. Rosetta is definitely best of breed in regards to protein structure prediction, from an ab-initio point of view. However, you have a structure already, so I don't know how far Rosetta gets you. Since the Rosetta scoring function is totally heuristic, designed for predicting monomer structures, it's not going to be very useful for stability studies in dimerization studies. At best, I imagine it might improve some sidechain orientations.

Andrey: This e-mail sounds like a methods review. We can make this discussion public somewhere as far as these questions can be interesting to other researchers.

Bosco: Why, of course. Done.