If you believe what they say in newspapers, you might get the impression that science is a fundamentally atheist institution, irrevocably opposed to religious organizations of any sort. After all, if you wandered down the halls of a typical research institution and conducted a straw poll, chances are you would encounter a fervent atheist, a hardened warrior against the irrationality of religion.
Indeed, it might easily be argued, given the scientist’s supposed penchant for rational scepticism, that science, as an institution, is fundamentally opposed to religious behavior. Of course, there are high profile religious mavericks such as Francis Collins, but one could argue that the liberal attitude of science to social mores means that there will always be room for eccentrics.
However, if you scratch the surface and look carefully around you, you will find that a surprising number of scientists in the middle to junior ranks are religious, often fully committed members of a local church. I was made aware of this several years ago as I got in the elevator with an older, conservative colleague from China. As I greeted him, I noticed that he was listening to his brand new iPod, which at the time was a shiny new thing. I badly wanted one myself, and found it hard to imagine that my colleague was such a fashion junkie. So, curious to know more about his uber-desirable gadget, I asked him what music he was listening to. He answered, “oh, I am not listening to music, I am listening to my pastor’s sermons.”
I believe that the existence of this enclave of religion in scientific institutions is no accident. Rather, it is a direct result of how modern science is organized as a social institution, especially in the United States. Here’s the reason: science, as it’s practised in the United States, is like a factory. It demands that individuals work like pluggable creative units, with few concessions to the fact that they are human beings with desires, needs and frailties.
Science is a hard industry to crack, harder than, say, accounting, but still considerably easier than acting or professional sport. Many years ago, most people who obtained a PhD inevitably found a position as an academic. But after the massive expansion of the higher education system that followed the GI Bill, we have found ourselves with a glut of postdocs compared to the number of positions that allow them to move on up: an army of postdoc researchers serving a much smaller number of professors. Indeed, in elite colleges, a superhuman capacity for hard work is expected, and the process grinds through postdocs like meat.
Structurally, science is an anti-family institution. Graduate students from all around the world are brought into labs all around the United States, where they are expected to slave away for the prime years of their lives on extremely difficult research projects. Job security is poor. The pay is low and one is expected to move at the drop of a hat, or in most cases, at the drop of a grant. It demands of a researcher a monastic commitment at an age when people in other industries are starting families. People who do start families in science normally rely on a spouse who has regular employment.
Science also attracts a certain type of personality. The obsessed and somewhat socially awkward type of person. These are not people who travel well. They are not people who love to master new languages, try out new foods, learn the music of foreigners, learn foreign ritual. Life becomes a rich diet of work but a social desert. Many of these individuals have traveled half-way around the world to live in a foreign country, devoid of social support and other such niceties, and unable to enjoy the cultural comforts of a foreign culture. It is assumed that, with their devotion to science, they will rise above the frailties of, well, being human.
Except many don’t: they get depressed, lonely, and culturally isolated. They have few friends outside the lab, no family or spouses. On weekends, instead of looking for fun and fulfilling things to do, they sit in the lab. The experiments are bound to fail at some point and it is precisely then that the support of a tight-knit social group keeps one sane. If they had one.
Is it any surprise then that these types of scientists find solace in religion? The work of David Sloan Wilson has shed a lot of light on the evolutionary function of religions. The point of religious beliefs is not so much that they are objectively believable in themselves, but that they are necessary to shape how a particular community behaves towards its members. The true function of religions is to provide a social structure. The church groups that spring up around scientists don’t just provide a belief system, they provide a massive infrastructure of support: things to do, a social support group, a ready pool of friends, people who care about you, and more often than not, potential spouses.
It is exactly this that the institution of science does not provide. Nature, one might say, abhors a vacuum, even a social one. Religion has rushed in to fill the psychic void found in the academy. Ironically, it is the social support provided by these church groups that maintains the sanity of these religious scientists. As they cope with their horrendously difficult scientific projects, the frontiers of science advance once again.
A couple of months ago, I had a great email exchange with Andrey Voronkov from Moscow State University about simulation methods. Andrey provided a lot of great stuff (and handy references) and was happy enough to make it public. Now that I have a little time in my job-hunting hell, I’ve put it up for your reading pleasure.
Bosco: Hi Andrey, just to start off, what’s your system?
Andrey: Hi Bosco, well, I have a number of molecular models derived by homology-based modeling from an X-ray template with 1.36 Å resolution. These models are extracellular dimeric domains of a GPCR. No natural or synthetic ligands for these domains are known until now, but one small-molecule ligand and one peptide (beta-amyloid) are supposed to bind to these domains.
Bosco: GPCR is a very popular domain indeed. That’s a very smart move to choose to work on one of these proteins, as it’s been estimated that 70% of drug targets in pharmaceutical companies aim for one of these babies. This is a project with EMPLOYMENT POTENTIAL written all over it!
Andrey: In my studies I have determined that these dimeric domains structures have several potential binding sites both for small organic molecules and for beta-amyloid. But structures of these dimeric CRD-domains have become rather different from template after homology based modeling and some of them lack the binding sites mentioned above. One of the binding sites (the biggest and maybe most important) is on the dimeric interface.
Bosco: It looks like you’ve done some nice experimental work, as membrane proteins are extremely tricky to work with. Yes, homology models only go so far. They are, in the end, heuristically based, and most homology modeling systems assume that you’re dealing with monomers. If you’re worried about dimers, you’ll have to use some form of molecular dynamics, where you can lean on atomic force-fields to give you some kind of semi-reliable physics to go beyond the homology assumptions.
Andrey: So specifically, I’d like to (1) refine the dimeric domains of these homology models in explicit solvent; (2) evaluate the stability of the dimers, perhaps by calculating the free energy; (3) identify the binding site of my beta-amyloid (17 amino acids) ligands; (4) refine the whole GPCR assembly in a lipid bilayer.
Bosco: So unpacking all this: you’re working on the protein-protein problem, the ligand-binding problem, and running MD on a huge transmembrane assembly. The first two problems are theoretical, and the last problem is purely about computing grunt. How much grunt do you have?
Andrey: Unfortunately, even with the supercomputing power at Moscow State University I can reach a maximum of only 50-100 nanoseconds, depending on how big the target is. This may work for small-molecule ligands or for refinement of structures which have good amino acid identity with the template (for example >70-80%), but not for dimer stability evaluation, not for beta-amyloid drift on the surface of the protein, not for the full receptor dynamics, and not for the models with 30-50% amino acid identity, which are very likely to have serious structural changes and require partial folding simulations.
Bosco: Yes, now I understand your interest in approximation models.
Andrey: So the methods which I’m considering, in order of familiarity, are, first, implicit solvation models. This can increase the timescale by an order of magnitude. But I’ve already tried it and some binding sites undergo serious deformations even in the X-ray structure. Maybe I should use another force field (AMBER99SB but not AMBER03) for this purpose.
Bosco: Oh, I’ve made a cottage industry of using implicit solvent models. It’s mainly because I haven’t worked on a project that requires high accuracy in sidechain conformation. In the interior, the tight packing of residues, and the shielding from water, makes it less of a problem. But for electrostatic residues on the surface, the implicit solvent model is absolutely awful. Wrong salt-bridges are formed as vacuum electrostatics can’t really be eliminated. I wouldn’t trust surface residue conformations in implicit solvent.
Andrey: Second, increase the computing power by using the distributed computing of the Drugdiscovery@home project, in which I am participating right now. It can achieve about 4-5 Tflops, corresponding to 2-3 thousand PCs. This can make it possible to run MD simulations with explicit solvent up to 1 microsecond for 1 biotarget. But is 1 microsecond enough?
Bosco: I’ve never played around with those screen-saver type things, but I can imagine that the social problems are as much as the technical problems. Still, does this mean you have to phrase your simulation in an ultra-parallel manner? Curious to hear more when you get the chance.
Andrey: Third, coarse-grained molecular-dynamics, followed by more precise calculations. But I am afraid that in this case the deformations of binding sites can be even worse than in implicit solvent. Coarse-grained calculations will give very serious errors which will not allow me to understand the exact positions of ligands, peptide and protein structures. They may only be reasonable for monomerization/dimerization studies. Here are some references:
- Liwo A., Czaplewski C., Ołdziej S., Scheraga H.A., Computational techniques for efficient conformational sampling of proteins. // Curr. Opin. Struct. Biol., 2008, V.18., P.134-139.
- Sansom M.S., Scott K.A., Bond P.J., Coarse-grained simulation: a high-throughput computational approach to membrane proteins. // Biochem. Soc. Trans., 2008, V.36, P.27-32.
Bosco: Yeah, I’m not really sure about the utility of coarse-grained molecular-dynamics. In papers on coarse-grained dynamics, it seems the generated conformations are quite crude, and the rates are out by orders of magnitude. I think coarse-grained dynamics were used to take a peek at long time-scales, back in the day when computing resources were very limited. It’s definitely worth a shot, but only to see if your candidate dimerization poses fall apart immediately or not. It can’t tell you whether you’re in the ball park. But it can definitely tell whether you’re in the car park.
Andrey: Fourth, then there’s your method: Rotamerically Induced Perturbations. It’s a very interesting method for assessment of protein motions, as I don’t need the precise timescale for my tasks. But as I understand it, this method assesses only motions and can’t be used for structure refinement or for prediction of the structures with the best stability in the dynamic environment.
- Ho B.K., Agard D.A., Probing the flexibility of large conformational changes in protein structures through local perturbations. // PLoS Comput. Biol., 2009, V. 5, e1000343.
Bosco: You’re right. RIP doesn’t generate accurate alternate conformations (although, whilst developing the method, I had crossed my fingers, and prayed to the great beings in the sky that it would). The distorted conformations are quite unrealistic. Still, I believe that RIP provides the cleanest way to identify surface loops/helices/hairpins that are mobile on the microsecond timescale. It is complete in the sense that the results are not dependent on the time of simulation, the protein either distorts to the perturbation, or not. Now one thing that I have found, is that RIP seems to be very good at finding dimer-interface loops when RIP is applied to the corresponding monomer structures. There is a body of theory that says that some dimerization occurs through the stabilization of flexible surface loops in the monomer. This stabilization could be the mechanism that explains the stability of the dimer. I am currently looking into this for a bunch of other proteins.
Andrey: As I understand in RIP, it is also possible to use explicit solvent, which can influence on the χ angles of side chains, not just the implicit solvent. But during the dynamic motions not only side chains can move and influence backbone, the solvent can induce also some significant changes in backbone structure, as backbone movement by itself, right?
Bosco: Yep. I’ve managed to get RIP working in explicit solvent, and the results are kind of discouraging. What ends up happening is that the water absorbs the perturbation from the RIP’ed residue. And if the residue does hit a loop, the energy that is transferred to the loop immediately gets absorbed by the water surrounding the loop. There is practically no motion. In some sense, I was lucky that I developed RIP in an implicit solvent model. Because a surface-area approximation is used to model the water, there is no viscosity in the system. Thus a RIP perturbation in implicit solvent transfers all the added energy to the flexible loop, causing it to move.
Andrey: But at the same time with exact knowledge of binding sites of beta-amyloid and small ligands, is it possible to predict resulting protein structure movements upon ligand binding (if there are any)? Also is it possible to predict dynamics of the elements of full structure of GPCR receptors, especially if explicit solvent and lipid bilayer can be used?
Bosco: Alas, this is the frontier of the ligand-binding field. Researchers are reasonably confident that they can identify good poses for ligands, but only in the cases where the protein does not deform. There’s a whole bunch of systems specifically devoted to ligand-binding with small protein deformations. But progress is very slow.
Andrey: Okay, let’s move on to something that’s not a molecular-dynamics method: CONCOORD. CONCOORD is a Monte Carlo method that randomly generates structures which fulfil the distance constraints. Then essential dynamics analysis gives the data on the main protein motions. For CONCOORD simulations, some reference to experimental results is required to choose the best structures if we want to use CONCOORD for refinement of protein structures. What criteria should I use, for example, in dimerization and monomerization studies? Can it be the ratio of monomeric/dimeric forms in the generated structures? CONCOORD doesn’t account for the solvent molecules, which can stabilize some dynamic parts of the protein, right? To get stability information we should have some time-dependent distribution of conformations, which is not accessible by CONCOORD (even if the timescale doesn’t correspond to a real one). From another point of view, some averaged structures can be generated from CONCOORD which can give a picture of the protein structures with the highest stability.
- de Groot B.L., van Aalten D.M., Scheek R.M., Amadei A., Vriend G., Berendsen H.J., Prediction of protein conformational freedom from distance constraints. // Proteins, 1997, V.29, P.240-251.
Bosco: I really like CONCOORD. It’s a very simple but elegant idea. In some ways, I see CONCOORD as the algorithmic inverse of Elastic Network Models (ENM). In both CONCOORD and ENM, you start with the contact map of a protein. The difference is, in CONCOORD, a contact is turned into a distance constraint, whilst in ENM (see below), the contact is turned into a spring. With CONCOORD, the structure is allowed to rattle around within these distance constraints. Clearly, tightly packed regions with lots of mutually reinforcing distance constraints won’t rattle very much, but loosely packed regions will. Playing around with CONCOORD, I think it captures the nanosecond regime of flexibility well, as CONCOORD generates as good a match with NMR S order parameters as any other system out there. It cannot really capture the loop motions that RIP can. Indeed, you can see that by definition, neither CONCOORD nor ENM will be able to capture loop motions that move independently of the body of the protein. Such loop motions must go from a conformation where contacts are made with the body of the protein to one where the contacts are broken. In both CONCOORD and ENM, contacts are fundamental to the model. As such, I can’t see how CONCOORD could help in studying the stability of a dimer, as any contacts in the dimer-interface would be converted into distance constraints.
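To make the contrast between the two readings of a contact map concrete, here is a minimal Python sketch. It is an illustration only, not the actual CONCOORD or ENM code: the function names, the 8 Å cutoff, the single fixed tolerance and the uniform spring constant are all assumptions made up for this example. The same list of contacts is turned either into hard (lower, upper) distance bounds, or into harmonic springs with an energy penalty.

```python
import numpy as np

def find_contacts(coords, cutoff=8.0):
    """Return (i, j, distance) for every pair of points closer than cutoff."""
    coords = np.asarray(coords, dtype=float)
    contacts = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = float(np.linalg.norm(coords[i] - coords[j]))
            if d < cutoff:
                contacts.append((i, j, d))
    return contacts

def as_distance_bounds(contacts, tolerance=1.0):
    """CONCOORD-like reading: a contact becomes a (lower, upper) bound.
    The fixed tolerance is a hypothetical simplification."""
    return {(i, j): (d - tolerance, d + tolerance) for i, j, d in contacts}

def as_springs(contacts, k=1.0):
    """ENM-like reading: a contact becomes a spring (stiffness, rest length).
    The uniform stiffness k is a hypothetical simplification."""
    return {(i, j): (k, d) for i, j, d in contacts}

def satisfies_bounds(coords, bounds):
    """True if every bounded pair distance lies inside its allowed window."""
    coords = np.asarray(coords, dtype=float)
    return all(lo <= np.linalg.norm(coords[i] - coords[j]) <= hi
               for (i, j), (lo, hi) in bounds.items())

def spring_energy(coords, springs):
    """Sum of harmonic penalties 0.5 * k * (d - d0)^2 over all springs."""
    coords = np.asarray(coords, dtype=float)
    return sum(0.5 * k * (np.linalg.norm(coords[i] - coords[j]) - d0) ** 2
               for (i, j), (k, d0) in springs.items())
```

In the bounds picture a conformation is simply allowed or not, so a sampler can rattle freely inside the windows. In the spring picture every deviation from the starting structure costs energy, which is what makes a normal-mode analysis possible. Either way, a contact across a dimer interface is baked into the model from the start.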
Andrey: Yes, I am also interested in Elastic Network Models (ENM). It requires MD simulations anyway or an X-ray structure with derived B-factors. Anyway the structure should be refined before the calculation of forces. This method seems to be really interesting and, as I understand it, can simulate large protein motions. But it seems to have some opposite features to RIP because ENM counts only Cα dynamics whereas RIP puts the emphasis on side chains, right? There is also an ENM variation that deals with explicit solvent models (REACH method).
- Moritsugu K., Kurkal-Siebert V., Smith J.C., REACH coarse-grained normal mode analysis of protein dimer interaction dynamics. // Biophys J., 2009, V.97, P.1158-1167.
- Moritsugu K., Smith J.C., REACH coarse-grained biomolecular simulation: transferability between different protein structural classes. // Biophys J., 2008, V. 95, P.1639-1648.
- Moritsugu K., Smith J.C., Coarse-grained biomolecular simulation with REACH: realistic extension algorithm via covariance Hessian. // Biophys J., 2007, V.93, P.3460-3469.
- Su J.G., Li C.H., Hao R., Chen W.Z., Wang C.X., Protein unfolding behavior studied by elastic network model. // Biophys J., 2008, V.94, P.4586-4596.
Bosco: I think I’ve said it before, but ENM has the most bang for the buck in the world of computational structural biology. Use of ENM just requires a structure. From the structure, you derive a contact map, and from the network of contacts, you create a network of springs. So what does ENM model? There’s a whole bunch of literature that shows that the collective motion of the ENM network of springs gives a very good approximation of the collective motions of a domain structure. For accuracy, ENM performs as well as normal mode analysis in short-run molecular dynamics, but in just a fraction of the time. Whilst there is some overlap of ENM with B-factors and S order parameters, I am convinced that ENM explores very long timescale motions, such as domain rearrangements. I’ve seen some nice studies that show large differences in the collective modes between the monomer and its dimer. These have been used to understand the dynamics of active sites, especially after the effects of dimerization.
Andrey: As to Monte Carlo methods, Rosetta is a very interesting solution. But recently I’ve read a paper on Hamiltonian Replica Exchange Monte Carlo, which seems to significantly outperform the Monte Carlo simulated annealing of Rosetta and even the Temperature Replica Exchange method. I am thinking about using this method. But of course one of my interests is to model proteins with the influence of explicit water models.
- Shmygelska A., Levitt M., Generalized ensemble methods for de novo structure prediction. // Proc. Natl. Acad. Sci., 2009, V.106, P. 1415-1420.
Bosco: Rosetta is amazing, and it’s grown into a beast of a program. Rosetta is definitely best of breed in regards to protein structure prediction, from an ab-initio point of view. However, you have a structure already, so I don’t know how far Rosetta gets you. Since the Rosetta scoring function is totally heuristic, designed for predicting monomer structures, it’s not going to be very useful for stability studies in dimerization studies. At best, I imagine it might improve some sidechain orientations.
Andrey: This e-mail sounds like a methods review. We can make this discussion public somewhere as far as these questions can be interesting to other researchers.
Bosco: Why, of course. Done.
I would love to have hung out with some of you guys in San Francisco at the 2010 BPS at the Moscone center. But alas, my postdoc contract has ended (send job tips for a computational structural biologist/scientific programmer my way!) and I am now kicking back with family and old friends in Australia. The downtown of San Francisco is tricky to navigate if you want the good stuff (I am skipping the obvious stuff like 3D-Imax Avatar at the Metreon), so here’s a list of my favorites within striking distance of the Moscone center:
- Blue Bottle Cafe for some of the best coffee in the city, and my favorite interior. It’s also a very nice place to hang-out
- the Matador, the best hipster bar downtown
- right next door is Tu Lan, a formidably cheap greasy-spoon Vietnamese that’s been around probably since the 49’ers.
- Katana-ya, best ramen in the city. The soup is glorious, but alas, not good for vegetarians. This place gets packed so go outside normal eating hours.
- Osha for fast and delicious late night Thai food
- there are thousands of Indian restaurants in the Tenderloin, but Little Delhi is the best. And very comfortable. I love their okra.
- another cafe, Epicenter, which is probably a little less busy, and better for some wifi laptop last-minute working.
I shake my head and suddenly realize a decade has almost passed since I first became a computational structural biologist. Feeling that I’ve earnt some kind of perspective, I hereby present to you a list of significant developments in computational structural biology of the naughties:
1. Rosetta.
There is no doubt in my mind that the greatest contribution to computational structural biology in this decade is the suite of programs known as Rosetta, written by David Baker and his lab.
Look at the accomplishments. Rosetta has consistently won the CASP competition for protein folding (see next item) in the new fold category. Nothing else comes close in deducing the structure of a protein from an arbitrary protein sequence. Rosetta was used to create a protein with the first artificial fold. Rosetta was used to design the first artificial enzymatic reaction in a protein.
Each of these accomplishments is staggering in itself, but to find a common computational core across all three makes Rosetta the crown jewel of computational structural biology. I’ve said it before and I’ll say it again, David Baker is on the fast-track to Stockholm.
2. Formally Organized Competitions
Every 2 years, a whole bunch of computational structural biology labs effectively shut down for business, and throw every man, woman and workstation together to attempt to crack a set of problems. This same set of problems is simultaneously being attempted in labs all around the world, as researchers race against a clock to predict the 3 dimensional atomic structure of protein sequences published at the CASP protein-folding competition website.
We often say that science is a competition, but it is astonishing to me how the computational structural biology community has embraced formally organized competitions such as CASP. Here, we have pure naked competition, complete with a scoring system, judges, and rankings that determine winners and losers. It has all the drama that you’d expect from a reality TV show: recriminations, anger and tears. And it has taken the field of protein folding much farther than anyone would have imagined 10 years ago.
I still remember the late 90’s when the field of protein folding was an epistemological joke. Every odd month, some theorist would publish an article claiming that they had solved the protein folding problem “in principle”, yet none of them actually predicted the structure of a sequence with no known structure. Of course, no one was actually stupid enough to make such a prediction in print. Others, more pessimistic, were even saying that it’s an unsolvable problem, like the theory of nuclear energy shells. Indeed, the results of the very first CASP showed just what a disaster the protein folding field was at the time. A lot of prominent theorists got egg on their faces.
In hindsight, we can see that CASP worked along a very Darwinian idea: that of defining a fitness landscape. In the field of genetic algorithms, you can use random mutations to find efficient algorithms, but only if you have a robust fitness function. The field of protein folding had been drifting along in some kind of crappy fitness valley and it wasn’t until CASP came along that we could even define what protein folding was in a concrete, definitive way. The targets of CASP could be used to define a good fitness function, which was enough to spur the field to scramble out of the valley and up a fitness peak. This approach has proven so successful that we have now embraced formal competitions for other problems of similar difficulty: CAPRI for protein-protein interaction, and SAMPL for ligand binding.
3. Distributed computation.
First there were Beowulf clusters on commodity Linux boxes, and now we have cloud computing with Folding@Home and Amazon. Either way, it’s become standard practice for computational structural biologists to think beyond a single workstation, and fall in love with distributed computing.
As the hardware has gone parallel, so have our simulations. All the standard molecular-dynamics packages are locked in an arms race to squeeze the best performance out of parallel clusters. It is hoped that one day they will actually scale. As the capacity of our calculations has gone up, so has the quality of our simulations. It is virtually de rigueur these days to use explicit solvent in molecular-dynamics simulations.
Nevertheless, there is still a huge way to go before we can happily sample large protein systems to the point where we can merrily pluck out interesting but rare conformational fluctuations from an untainted molecular dynamics simulation. Until that time comes, be prepared to apply a whole cottage industry of thermodynamic techniques that trick the system into generating those precious rare conformational changes at a much faster simulation time.
These techniques break up into two groups. The first group of techniques, such as replica-exchange, is used to set up the simulations with weird temperature distributions so that large motions can be generated easily. The second group of techniques, such as thermodynamic averaging and master equations, tries to normalize these weird thermodynamic distributions back down to well-behaved free-energy equilibrium distributions, which one can then compare with experiments. Either way, it’s wise to brush up on your math before you wade into these extremely mathemagical waters.
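For a taste of the math in the first group, here is a minimal Python sketch of the swap test at the heart of temperature replica-exchange. It assumes the standard Metropolis criterion for exchanging two replicas, with energies in kcal/mol and temperatures in kelvin. The function names and the geometric temperature ladder are illustrative choices for this example, not taken from any particular molecular-dynamics package.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def swap_probability(energy_i, temp_i, energy_j, temp_j):
    """Metropolis probability of exchanging the configurations of two replicas.

    P = min(1, exp[(beta_i - beta_j) * (E_i - E_j)]), with beta = 1 / (k_B * T).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)

def attempt_swap(energy_i, temp_i, energy_j, temp_j, rng=random):
    """Return True if the swap is accepted for this random draw."""
    return rng.random() < swap_probability(energy_i, temp_i, energy_j, temp_j)

def temperature_ladder(t_min, t_max, n_replicas):
    """Geometrically spaced temperatures (a common choice, assumed here).
    Requires n_replicas >= 2."""
    ratio = t_max / t_min
    return [t_min * ratio ** (k / (n_replicas - 1)) for k in range(n_replicas)]
```

The swap is always accepted when the colder replica currently holds the higher-energy configuration, so low-energy conformations found at high temperature trickle down the ladder. This is the "weird temperature distribution" in practice, and it is why the second group of techniques is needed to recover ordinary equilibrium statistics afterwards.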
4. Elastic Network Models
One running theme in computational structural biology is the search for simple models of proteins. These range from single-bead models to Gō models, to 2 dimensional lattice models. Ultimately, the litmus test of simple models is that they provide useful information for researchers in other areas of computational structural biology.
Over the last decade, I’ve found that the only simple model that provides concrete information for researchers outside the field of simple models is the Elastic Network Model. It is the model where, given a protein structure, you replace all atomic contacts and covalent bonds with simple harmonic springs. You then obtain a network of springs. From this network, it is trivial to extract the dominant vibrational modes.
The motions deduced from these vibrational modes turn out to be exceedingly useful, and they have been shown to reproduce the motions from much longer molecular-dynamics simulations. Elastic Network Models have been used to deduce virus capsid maturation and ribosome motions, systems that are way too large to study in any other way. The speed of the calculation and the fidelity of these results make this an essential tool for computational structural biologists.
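To show just how little machinery is involved, here is a minimal Python sketch of one common flavor of the idea, the anisotropic network model. It assumes one bead per residue (for instance the Cα atoms), a uniform spring constant `gamma` and a 12 Å cutoff. These values and the function names are assumptions for illustration, not a recommendation. The Hessian of the spring network is built pair by pair, diagonalized, and the six zero-frequency rigid-body modes are skipped.

```python
import numpy as np

def anm_hessian(coords, cutoff=12.0, gamma=1.0):
    """Build the 3N x 3N Hessian of a network of identical harmonic springs."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            r = coords[j] - coords[i]
            d2 = r @ r
            if d2 > cutoff ** 2:
                continue  # no spring between beads that are not in contact
            block = -gamma * np.outer(r, r) / d2
            hessian[3*i:3*i+3, 3*j:3*j+3] = block
            hessian[3*j:3*j+3, 3*i:3*i+3] = block
            hessian[3*i:3*i+3, 3*i:3*i+3] -= block
            hessian[3*j:3*j+3, 3*j:3*j+3] -= block
    return hessian

def slowest_modes(coords, n_modes=3, cutoff=12.0, gamma=1.0):
    """Return eigenvalues and per-bead displacement vectors of the softest modes."""
    eigvals, eigvecs = np.linalg.eigh(anm_hessian(coords, cutoff, gamma))
    # The first six eigenvalues are ~0: rigid translations and rotations.
    # This assumes the network is connected and not collinear.
    vals = eigvals[6:6 + n_modes]
    modes = eigvecs[:, 6:6 + n_modes].T.reshape(n_modes, -1, 3)
    return vals, modes
```

Each returned mode is an array of shape (N, 3), one displacement vector per bead. The smallest non-zero eigenvalues correspond to the softest, most collective motions, which are the ones usually compared against domain rearrangements.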
5. Fragment binding to proteins
Ligand binding is the great computational problem of the pharmaceutical industry. We already have enough solved structures to provide plentiful targets for drug activity, but without a way to calculate exactly how ligands bind, this doesn’t get us very far. Given the commercial nirvana lying on the other side of solving the ligand binding problem, there has been a ton of work in improving ligand-binding prediction systems. Still, the improvements in the field have proceeded at a glacial pace, where the only novel ideas seem to be to get faster computers and to add more expensive dynamics to the simulations.
However, I’ve found one approach that seems to be genuinely different. This is the idea of using chemical fragments. Instead of studying how a ligand binds, you break up the ligand into its constituent chemical fragments, which are conformationally exceedingly simple. Then you study how the fragments bind to a protein. By studying the binding pattern of fragments, you can back-track to work out what ligand could have bound to it. Although this approach hasn’t yet led to a radical overhaul of the ligand-binding problem, I am tipping this as a method to watch.
6. NMR measurements of residue dynamics
One of the reasons we do molecular-dynamics simulations is that these simulations provide the only way of seeing how proteins move at the level of atomic detail.
Nevertheless it is a massive leap of faith to trust only in the simulations. Over the last decade, although NMR has fallen out of favor for solving protein structures, NMR methods have proven their worth in providing detailed information about the motions of proteins. These include the S order parameters for nanosecond motions, chemical shifts to detect secondary structure, and exchange parameters to measure microsecond transitions in chemical shifts.
Since careful NMR experiments can resolve signals to individual residues, these measurements can give exceptionally precise dynamic measurements of proteins in solution. Recently, NMR Residual Dipolar Couplings were used to measure microsecond motions for every single residue in ubiquitin. We now have a very rich set of measurements that capture dynamic motions of proteins. These provide virtually the only checks on the wild world of long-time molecular-dynamics simulations.
7. The inability to go beyond partial charges.
In molecular-dynamics, we’ve been promised more sophisticated force-fields for a very long time now. Whilst there is no doubt that things such as dipoles and quadrupoles can be programmed, it turns out that there is a very steep speed tradeoff. It may be time to give up on the dream and accept that single partial charges may be sitting at some kind of sweet spot of simulation realism versus efficiency.
8. Conformational changes of single molecules
Experiment has raced far ahead of simulation in recent years. In particular, two techniques have been developed that directly measure conformational changes of individual protein molecules.
Fluorescence Resonance Energy Transfer (FRET) can measure the distance between chromophore pairs simply by the amount of fluorescence emitted by the chromophores. If you can attach chromophores to two residues in a protein, you can measure their distance distributions in solution.
In Atomic Force Microscopy (AFM), sophisticated servo-motors can gently tease apart single proteins using exquisitely tiny forces. Incredibly detailed force-extension profiles of individual proteins can be measured.
These measurements have provided incredibly rich insights, especially into systems where these large conformational changes can be linked to biologically relevant processes.
9. Reconstruction of long dead ancestral proteins
For me, the most imaginative technique developed in the last decade is the black art of reviving ancestral proteins.
If we look at the spectrum of living things before us, we see only the furthest leaves in the tree of life. However, bioinformatics has given us techniques to extract the historical traces of evolution from our knowledge of protein sequences. By comparing homologous proteins across species, we can use sequence alignment comparisons to construct phylogeny trees between the species.
The principal insight of groups who study ancestral proteins is that, with our sophisticated sequence alignment programs, not only can we construct phylogeny trees, but we can construct the actual sequence of the last common ancestors at each fork of the tree.
These ancestral proteins can be cloned and crystallized. These groups have focused on proteins that bind to different ligands in different species. By carefully studying the changes in binding and in the structure over the phylogeny tree, we can figure out how nature has sculpted the structure of a protein to accommodate the speciation of different organisms over millions of years of evolution.
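As a toy illustration of how an ancestral sequence can be inferred from an alignment and a tree, here is a sketch using Fitch's parsimony rule, applied column by column to recover the sequence at the root. This is a simplified stand-in: the groups doing this work typically rely on more sophisticated statistical methods, and the tree, species names and sequences below are invented for the example.

```python
def fitch_sets(node, column):
    """Bottom-up Fitch pass: candidate residues at `node` for one column.

    A leaf is a species name (str); an internal node is a (left, right) tuple.
    """
    if isinstance(node, str):
        return {column[node]}
    left, right = node
    a = fitch_sets(left, column)
    b = fitch_sets(right, column)
    # Keep the shared residues if there are any, otherwise take the union.
    return (a & b) or (a | b)


def reconstruct_root(tree, alignment):
    """Infer one most-parsimonious sequence for the root of `tree`."""
    length = len(next(iter(alignment.values())))
    ancestor = []
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        candidates = fitch_sets(tree, column)
        # Ties are broken alphabetically: arbitrary, but deterministic.
        ancestor.append(min(candidates))
    return "".join(ancestor)


# Hypothetical four-species tree and aligned sequences.
tree = (("human", "mouse"), ("chicken", "frog"))
alignment = {"human": "MKV", "mouse": "MKI", "chicken": "MRV", "frog": "MRV"}
root_sequence = reconstruct_root(tree, alignment)
```

The second column shows the limitation of the approach: the two halves of the tree disagree (K versus R), so the data cannot distinguish the two candidates and the choice at the root is arbitrary.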
It’s sad to say this but most people I work with are just not that into science. It’s not that they’re not good at it – they clearly are – it’s just that they’re interested only in their little niche, and are thoroughly satisfied with all that it entails.
Case in point: recently, I was at work looking up a biology textbook on basic paleontology, spurred on by the intriguing hypothesis that the Cambrian explosion was induced by the evolution of eyes. My buddy S wanders by and notices that I am reading a first-year biology textbook. Being a curious guy, he asks me why. That’s when, whilst I was explaining to S the whole theory about eyes and the Cambrian explosion, we were interrupted by a third guy J, who burst out with “Ha ha, that’s funny. You never hear words like Cambrian used on the other side of the lab.”
This struck me as profoundly weird. First, the word “Cambrian” is not a particularly unusual word, especially in a biology environment. An average 6th grader interested in science would have heard of it. Yet, in a lab that does fundamental biology research, it struck the ears of J as strange. Second, J’s reaction was mystifyingly inert. Although the word Cambrian had piqued J’s curiosity, he wasn’t interested in finding out what we were talking about. Instead, he pointed out how strange it was to hear words of science spoken in a laboratory of science.
I think this is emblematic of a certain attitude in science. Many researchers become masters of their own research field but have minimal interest in areas outside their own. For such scientists, it is the qualification of knowing that counts, not the knowing itself. We do research so that we can publish prestigious research so that we can be recognized as prestigious scientists. Anything other than this is a sign of amateurism. Perhaps it’s a way for socially awkward people who played too much Dungeons & Dragons in high school to claw back some kind of respect in a hostile world. There are many people in science like this, and they are not infrequently assholes (although J is a pretty nice guy).
This attitude induces a type of scientific myopia. To wit, a couple of weeks later, I found J sitting in the lunch room staring blankly at a draft of his article. He was stuck writing a synopsis that would explain to lay readers why his research was important. J thought this was a pointless exercise dreamed up by PR lackeys at the journal he was going to send it to. But it’s not. Being able to explain why our research is important reflects an important synthetic ability to see how different bits of science fit together. This requires a vision beyond our own little niche. It seemed to me that J knew everything about his research topic except for why it’s important.
In contrast, there are those of us who are in science because we are irresistibly curious about how the world works. This curiosity is a habit of mind and it’s not something you can switch off. I had a really busy year this year (worked on 4 articles, wrote a job application, released a piece of software) yet I read a programming textbook (Programming Language Pragmatics), a neuroscience pop-sci book (The Brain That Changes Itself), a particle physics pop-sci book (In Search of the Ultimate Building Blocks) and a stat-mech textbook (Noise and Fluctuations). I feel stuck if I am not finding out about some area of science that is outside my comfort zone. I think you can tell which scientists have this kind of curiosity simply by whether they are reading popular science books. These scientists will find time to read whether they have time or not. They have no choice even though they’re running 9-hour experiments on their supercali-fragilistic-fractionator.
The difference from the other type of scientist is stark. One of the grad students in my lab told me that once, when she was talking about this great pop-sci book she was reading, another postdoc D interrupted her with a withering look, “so you read popular science books, do you?” As if it were a bad thing. Yet, I have a feeling that non-curious scientists are in the majority. Doing a totally unscientific straw poll, out of about 17 postdocs and grad students in my lab, I can think of about 7 who I’ve talked to at some point about some kind of awesome science outside our lab’s principal interests.
Recently, I’ve read all sorts of reports that the scientific body is, in general, not interested in social media, even amongst younger scientists. All sorts of explanations have been proffered as to why there is such a low adoption rate, but the one explanation that hasn’t been offered is that perhaps there are not that many engaged scientists in the scientific community. Perhaps most people working in science are just not that into science.
William Faulkner’s “The Sound and the Fury” is one of the most reader-unfriendly books I’ve ever read. Written in 1929, it’s clear that Faulkner was enamoured with the innovations in prose flowing out of Europe such as that diarrhea of words known as stream of consciousness.
But there’s good stream of consciousness and there’s bad stream of consciousness. Good stream of consciousness conjures up a heightened state of mind. By writing from the inner psyche of the protagonist, we get a better sense of the character through free associations and displacements of language. Still, stream of consciousness should flow, ratcheting tension through a logic of comprehensible symbolism.
Bad stream of consciousness grates on the nerves like an open-mike session in a second-rate college town. And the opening section of “The Sound and the Fury” is as bad as anything I’ve read. Written from the point of view of a mentally disabled man, the thoughts are jumbled up with such a profusion of indistinct pronouns that it’s hard to figure out who is talking about what. Characters are introduced willy-nilly. They appear and disappear like a bad itch, and since we’re travelling backwards and forwards in time at will, most characters don’t even have the privilege of having the simple solidity of existence over a paragraph.
It’s the kind of chapter where book critics suggest that you read it several times with colored markers so you can mark out what time period each paragraph belongs to. Meaning that this is an incoherently written piece of writing, where it’s up to you to figure out how to make it coherent. It’s like stumbling on the scribbled notes of an unfinished novel. Given the effort needed to figure out what is happening, the final result is stunningly not worth the effort. The characters are not that interesting. The language is not particularly riveting, and the dialogue is barely serviceable.
The chapters do get more comprehensible as we go along. And I was almost ready to forgive Faulkner when I got to the final chapter. Here, Faulkner introduces a new character called Quentin, who is in fact the niece of a man called Quentin from the second chapter. Given that it was written in that spasmodic style where pronouns are thrown around like confetti, and characters are described with as much care as pigs eat slop, it took me a good hour to figure out that Quentin was not actually Quentin. There’s no real good reason for this, and it’s just a sign of the sloppiness in the writing. And of course, like any piece of literary schlock, the denouement with Quentin involved incest.
Ever since I brought home my large 22” monitor, I’ve been trying to read more PDF files on the computer instead of printing them out on paper first. But there is something lacking in reading from a computer screen. The thing I miss most is that I lose a sense of place in a document. With the document printed on pages of paper, I get a physical sense of where I am simply from the feel of the thickness of the pages and from the physical act of turning a page. I can scan quickly through the document and find certain sections much faster than reading on the computer.
In comparison, a typical PDF viewer only gives you a very poor sense of place in a document via a scrollbar. This does not provide you with any way to scan ahead:

However, many PDF viewers also provide a thumbnail drawer, which allows you to scan ahead several pages:
![]()
However, this particular arrangement ends up with a rather convoluted user interface where you get 3 inconsistent markers for your position in the document: 1) the original scrollbar with full-view document, 2) the highlighted page in the thumbnail drawer and 3) the scrollbar for the thumbnails. Also, the thumbnail view only shows one highlighted page whereas the full view often falls between two pages.
Dragging the markers on the scrollbars provides a rather confusing and fragmentary experience. I have no idea where the thumbnail scrollbar is, or where the highlighted thumbnail will pop up, after dragging the main page scrollbar. This makes it impossible to use these scrollbars to give a visual sense of your place in the document.
Then I realized that you can solve this problem by combining the scrollbars with the thumbnails to give you something like this mockup:

The idea is that we can use the thumbnail view itself as the scrollbar, which I call the thumbnail bar. With this arrangement, the part of the thumbnail displayed in the full page view becomes the marker of the scrollbar. This provides a clean and simple integration between the full page view and the thumbnails.
Nevertheless, you don’t want the thumbnails to be too small, so you won’t be able to display a lot of thumbnails. To compensate for this, I add a third bar, which I call the document bar, which provides an ultra-compressed view of all the thumbnails in the document. By using appropriate coloring or shading, we can color the part of the document shown in the thumbnail bar and the full view. This combined-thumbnail view can become the marker for the document bar.
Finally, moving around the document is simply a matter of dragging the smaller versions of the page view around in the thumbnail bar, or dragging the combined-thumbnail view in the document bar.
In both the document bar and the thumbnail bar, we keep the contiguous vertical arrangement of the individual pages to convey the spatial sense of page in a nice linear fashion. And if we can animate all three together when dragging, then we can viscerally relate the document at all three levels of detail – page, thumbnail and document.
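To show that the three levels can be tied together with simple arithmetic, here is a rough sketch of how the marker positions might be computed. All names (`Span`, `layout`, `thumb_scale`, `bar_height`) and the choice to center the thumbnail bar on the current view are assumptions made for this illustration; a real viewer would also need to handle page gaps, zooming and animation.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: float
    end: float


def clamp(value, low, high):
    return max(low, min(value, high))


def layout(doc_height, view_top, view_height, thumb_scale, bar_height):
    """Compute the markers for the thumbnail bar and the document bar.

    doc_height, view_top and view_height are in full-size document pixels;
    bar_height is the on-screen height shared by both bars.
    Returns (window, page_marker, doc_marker):
      window      - part of the document shown in the thumbnail bar
      page_marker - where the full-page view sits inside the thumbnail bar
      doc_marker  - where the thumbnail window sits inside the document bar
    """
    # How much of the document fits in the thumbnail bar at this scale.
    window_len = min(bar_height / thumb_scale, doc_height)
    # Center the window on the current view, without running off either end.
    view_center = view_top + view_height / 2
    window_start = clamp(view_center - window_len / 2, 0, doc_height - window_len)
    window = Span(window_start, window_start + window_len)

    # The full-page view, expressed in thumbnail-bar pixels.
    page_marker = Span(
        (view_top - window.start) * thumb_scale,
        (view_top + view_height - window.start) * thumb_scale,
    )

    # The document bar squeezes the whole document into bar_height pixels.
    doc_scale = bar_height / doc_height
    doc_marker = Span(window.start * doc_scale, window.end * doc_scale)
    return window, page_marker, doc_marker


window, page_marker, doc_marker = layout(
    doc_height=10000, view_top=2000, view_height=1000,
    thumb_scale=0.2, bar_height=800,
)
```

Dragging either marker would then amount to running the same mapping in reverse: converting a pixel offset in the bar back into a new `view_top`, and recomputing all three spans so they move together.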
I recently went to the San Francisco Public Library Book Fair. In an aircraft hangar of a warehouse, billions and billions of used books were arranged over miles and miles of old folding tables. Readers from all over the Bay Area (and some beyond, I might imagine) pored over the desiccated remains of someone else’s poor reading habits. Some even pushed trolley carts to carry all of their booty.
Over the years I’ve gotten picky with my reading and I prefer to buy wanted books instead of reading what lies at hand. But still, after an afternoon of triage at the book fair, I found some out-of-print gems.
The first was a ratty old Dover book complete with a sickly green cover titled “Aquatic Mammals: Their Adaptations to Life in the Water” by a former president of the American Society of Mammalogists, A. Brazier Howell. It’s a pleasant surprise to read a book by a scientist who has clearly studied his material over a lifetime of idiosyncratic cogitation. So many aspects are considered in this book, from swimming mechanics to breathing changes. The writing is clean and sober, unlike today’s flimsy attention-grabbing scientific chaff that cries out with breathless titles such as “Secret Life of This” and “Secret Life of That”. This is a straight-up biology book, albeit softened a bit for the lay reader.
But I share in the author’s delight over aquatic mammals. My favorite part of zoos is still the sea mammal sections. If you have ever watched a walrus swim right up to the pyrex glass and reverse with a slap of the tail onto the glass, you will know the heft and majesty of aquatic mammal life. I must admit, one of the reasons I am reading this is in light of the Aquatic Ape theory of the origin of humans from the monkey line.
The second is a collection of short stories by one of China’s foremost women writers, Zhang Jie, titled “Love Must Not Be Forgotten”. These are strange stories, telling of thwarted loves and ruined lives. They chart how the irrepressible tides of love course even under the bureaucratic juggernaut of communist China. These are frigid stories illuminated with little shards of emotion, and for me, an alien yet recognizable landscape.
I got a taste for French music living in Belgium almost ten years ago. Before living in francophone Brussels, I had formed, in my adolescent mind, a world populated by languorous women with a detached air of cultured sensuality, whispering an unending stream of soft slurred vowels running over each other like waves on a soft beach. The sound of French does not possess the cut and thrust of English, where the profane and the sexual are cleaved into the puritan and the lustful. In French, every act is a sensual act.
But they never tell you how difficult it is to explore the musical landscape of another language. Indeed, I had forgotten my own experience as an unmanned adolescent trying to get to grips with the mysteries of the Top 40 and other such things. Figuring out the landscape of the music of someone else’s culture is just as hard. Although I could understand the finer distinctions between grunge, metal and indie rock, this did not help me negotiate the differences between nouvelle chanson, les rocks, eurovision, and star academy.
The classics are easy to find, colossal singers etched in good Parisian sandstone such as Edith Piaf and Jacques Brel. But I was more interested in French artists who are still alive. Singers who I may have a chance to see in person, to watch their bodies sway through the cigarette smoke as their voice purrs through cracked-out speakers in a dank hall in Park Slope. Of course, you say, I could just ask someone. But it’s not that easy. It’s hard enough finding people who like the same stuff as you in your own language. But in another language you have to get over the double barrier of meeting someone who can translate and who also shares your taste.
Over the years, I’ve managed to scrape together a small but durable list of great French singers, but I am always on the lookout for more. My prayers were finally answered when I recently stumbled onto a treasure trove of French music, a music blog called Filles Sourires. Their tagline: “Fragile girls. Gainsbourgian guys. Singing in French. Making me sigh. Any questions?” That’s certainly an aesthetic I could buy into.
Most useful was that “Filles Sourires” recently asked their regular contributors for their top 10 lists of French albums released in the last ten years. This list of 100 albums (1 2 3 4 5 6 7 8 9 10) provides a comprehensive survey of the terrain of modern French music. I am gratified to see that many of my favorites are here, but there are also many that I have not heard of. For those of you who are Franco-curious, I suggest that you check these out. Do not be surprised if you have a sudden urge to buy a beret and book a one-way ticket to Paris.
The real bioinformatics journal has not been created yet. Wait, let me explain. What we currently have are traditional journals that deal with articles on bioinformatics. The journals themselves, though they may provide digital PDF files, are still stuck in the paper paradigm. The journal itself is not bioinformatic.
What crack am I smoking? In a recent article on the future of science publishing, Michael Nielsen speculates that the crunch in scientific publishing will not simply lead to an age of open-access science publishing, but in all likelihood, will lead to new forms of publishing centered around services.
Well, there’s one service that I wish bioinformatics journals provided. How about actually providing what you promise? Here’s the situation. You find a 6-year-old article in “BMC Bioinformatics” about some clever little algorithm. Tucked away at the end of the abstract is a link http://some.uni.edu/old_group/people/~postdocx/project69/index.html. You click on the link and several things may happen: 1) if you’re lucky, you get a butt-ugly website with a working program, or 2) you get a program but it doesn’t work with your configuration, or in all likelihood, 3) the website is not found anymore. Without the program, the article is next to useless.
The thing is, the majority of bioinformatics papers are either descriptions of programs or descriptions of data-sets. Unlike traditional experimental or theory papers, code and data are things that grow with time. Programs can be continually improved – they gather bugfixes, and may even be rewritten with better algorithms. Datasets are often regenerated with the latest updates of standard databases, and many genomic analyses are expanded to include more organisms.
You might think that dealing with the changes is going to require a lot of work. But the maintenance of a changing program is actually a solved problem. Most people who are serious about software use Github, or Google Code or any number of software repositories that gracefully handle mutating software. These websites provide excellent integrated services for downloading the program, registering and tracking bugs, and hosting discussion pages, with a solid admin interface, and allow you to look at the history of the program. They make it easy to do all this stuff. Moreover, these repositories provide a permanent location for the program with a clean url.
Nevertheless, the way that academia works is that if you have written a neat little program that solves a bioinformatic problem, you must get it peer-reviewed and published in an academic journal in order to be recognized for official academic business. Thus your program will live on two different websites, where the journal website links to your program website. Of course, one could include the download as a supplementary file on the journal website, but then it’s only a single file download without any kind of proper software infrastructure.
So here’s my idea: why don’t we set up a specialized bioinformatic journal that is tightly integrated with a software repository? Let’s call it Biohub.org or something like that. Users are first encouraged to set up projects in biohub.org like they already do in Google Code or Github, as a software repository. There is nothing particularly cutting edge about that, as there is plenty of existing software to facilitate the construction of a site like that.
Then when a project reaches maturity, an article is written and sent to the editors who run the website. If the article passes peer-review then the project will be registered on the front page of Biohub.org as a peer-reviewed project, and the listing will link directly to the project page on the very same website. A new tab will pop up with the .html of the article, formatted in a way that is consistent with the rest of the software. The article and the software project will be one and the same.
The editors will be responsible for making a printable PDF that goes with the article, and these can be linked to a journal page of the website, which can be made to look like any other academic journal website. You can even slap on an ISSN and register these articles with the relevant scientific literature databases. More importantly, the journal runs the software repository so that the software will be there as long as the journal exists.
This actually makes sense for datasets as well. Datasets change over time just like programs, and if they are stored in a software repository, it provides an easy way to look at older versions of the database, as well as a place for others to submit useful scripts and start discussions (especially since most repositories provide wiki and forum services for every project). Furthermore, if the repository is run as an open-source code repository, then it makes it quite easy for collaborators to be added to projects.
If we get into the habit of setting up bioinformatic software and data in these centralized hubs, we can stitch together a truly bioinformatic journal. No more will the work of some postdoc die on some long forgotten server unplugged in the back of someone’s lab. Our collective scientific output will live in an organic fusion of prose, code and data.
Update: great discussion on friendfeed
