I want to write about possibly the ugliest hack that I have written in bio-informatics. It's ugly because the problem it solves is ugly. But because the problem is so ugly, it's possibly the most beautiful hack that I've ever written.
Here's the problem. One of the standard tools in the arsenal of a computational biologist is the molecular dynamics (MD) simulation. MD packages simulate a molecule using Newtownian mechanics and atom force-field to generate 3-dimensional trajectories of biological molecules. There are a number of standard MD packages out there that everybody uses (AMBER, CHARMM, NAMD and GROMACS). Each package generates the trajectories in its own special format. Surprise, surprise, they don't play well with each other. My problem is: how do you translate from one data format to another (or in my case, from a CHARMM trajectory to an AMBER trajectory)?
It's not uncommon for your average researcher to switch between different MD packages. But switching MD packages is about as painful as stabbing your own eyes out. There is a myriad variety of details that you need to get right just to run an MD package, let alone do any kind of meaningful analysis (ever tried to get the physical units of an AMBER velocity file?). The data-formats between the packages are startlingly different to each other. Really different. These differences will never go away since they are the result of horrendous legacy formats (FORTRAN variable names anyone, anyone)?
The standard molecular dynamics trajectory comes in two files: a topology file and a coordinate file. The coordinate file is the easy one. It is just a list of numbers – for each atom, you have 3 numbers for the position and 3 for the velocity, and these are repeated ad nauseum for each frame in the trajectory. In AMBER, this is stored as a text file. In CHARMM, this is stored in binary.
The coordinate file is utterly useless unless you have the topology file, with which you interpret the coordinate file. In the topology file, you can find all the attributes of the atoms in your molecule for your MD package to work. These include the names, charges, masses of all the atom, but most important of all, it includes the order of atoms listed in the coordinate file. Each package generates its own topology file, and they are deeply incompatible with those of other packages.
I am sure there are people who have contemplated directly translating from one topology file to another but I can assure you that therein lies the way to hell, paved in a thousand bond length parameters.
The problem is that each package makes different assumptions about the chemistry of the molecules. For instance, the names of atoms in a molecule are named differently. In some packages, ambiguous sidechains, such as Histidine are given completely different names depending on its protonation states. Different packages treat charged terminals differently, and some can cope with methyl caps. And each program orders the atoms in completely different ways.
Directly translating from one format to another will require a careful dissecting of the logic in generating the topology files, which is embedded deep in the bowels of the source code of the program. I don't think there is anyone really stupid enough to do that. Oh, the humanity.
So here was my problem. I had written a bunch of analysis scripts with PYMOL, which analyzed AMBER trajectories. PYMOL however, chokes on CHARMM trajectories. I had some CHARMM trajectories from a previous postdoc in the lab that I wanted to analyze. He wasn't around, so I didn't know how to regenerate them with AMBER. I had no option but to attempt to translate the old CHARMM trajectories into AMBER trajectories. And therein, I learnt of the tranlsation problem.
There is, of course, light at the end of the tunnel, and I figured out a hack to translate from CHARMM to AMBER without needing to dig into the source code.
But first, some prequisites. I had already written a PYTHON library that can
- Read both CHARMM and AMBER trajectories.
- Save a snapshot of a trajectory to a PDB file.
- Convert a PDB file to a CHARMM or AMBER topology file with an initial coordinate file.
- String initial coordinate files together into a complete AMBER trajectory.
There's not a great deal of magic in these functions. In many cases, I use PYTHON to generate config files on-the-fly and run the command-line utilities provided by CHARMM and AMBER to do the translations. You should be able to do this without any problems.
So here's the hack. First, let's assume we have the CHARMM trajectory files:
:::bash charmm.psf # the topology file charmm.dcd # the coordinate file
And we want to generate the amber trajectory files:
:::bash amber.top # the topology file amber.trj # the coordinate file
We first open the namd trajectory to read:
:::python namd_traj = NamdTrajectory('namd.psf', 'namd.dcd')
Then we open a new file handle for the output to the amber trajectory:
:::python amber_traj_file = open('amber.trj', 'w')
And here's the main loop:
:::python for i in n_frame: namd_traj.load_frame(i) namd_traj.write_pdb('frame.pdb') convert_pdb_to_amber('frame.pdb', 'amber.top', 'amber.crd') crd = read_amber_crd('amber.crd') amber_traj_file.write(crd)
The key step here is that for each frame, we take the intermediate PDB file, and generate a completely new AMBER topology file and starting coordinate file. My PYTHON library specifically calls an AMBER command-line utility TLEAP to do this.
The ugliest of this hack is in the fact that we have to generate a topology file for every frame. But therein lies the beauty: we've delegated the nasty work of translating to TLEAP, by sending the intermediate PDB file to TLEAP.
We can safely assume that TLEAP generates the same topology file for each frame. The "starting coordinates" for the different frames are actually evolving, and stringing them together results in a brand new spanking AMBER trajectory.
Before I finish, I just have to talk about one subtlety. Converting an arbitrary PDB file to an AMBER topology file, requires some quick-and-dirty field surgery. Typically, I remove all waters and prosthetic groups and make sure the terminii are charged. Otherwise, TLEAP will choke.
So there you have it. My rosetta stone for AMBER and CHARMM.