|
The other day, I was yabbing to Albion, another postdoc in our lab, about the PDB format, which stores the atomic coordinates of protein and DNA molecules. It’s a crusty old format, hobbled mainly by the fact that it uses fixed-width text fields. We both complained bitterly about it, like two shoppers haggling over vegetables at the local market, but what struck me about our tête-à-tête was that we had both felt it necessary to write our own PDB parser. Why did I find that surprising? There are over 30,000 protein structures deposited in the public PDB database. The PDB format is one of the most heavily used data formats in protein research. Surely there must be good PDB parsers written in every major computer language (and there are). Isn’t one of the rules of writing software not to reinvent the wheel?
Here’s the rub. Unless you’re doing something that was envisaged by the writers of the generic parser, it will break. The uses of the PDB format vary enormously. For instance, Albion uses the PDB format to store huge models of microtubules. A microtubule is made up of dozens if not hundreds of individual, separate protein chains. The PDB format reserves a single character field to denote a chain, so if you stick to upper-case letters you get only 26 chains. It also allows only 99999 atoms in a single file, because it reserves 5 characters to store the atom number. Albion must use a non-standard extension to store a microtubule made up of hundreds of chains and hundreds of thousands of atoms.
On the other hand, I only look at small proteins, typically one or two chains, with a ligand or two. However, I do want to be able to read any PDB file from the database and ignore anything that I don’t care about: waters, prosthetic groups, alternate conformations. Of course, both Albion and I could have downloaded a generic parser. But no generic parser can anticipate all the uses for it. 
By design, generic parsers (BioPerl, Biopython) must be strict in order to cover even a semi-reasonable set of possible uses. This makes the code brittle. It will break if your PDB file is not exactly as the parser expects. If either of us had decided to use a generic parser, we would have spent as much time writing code to deal with that parser as we actually spent writing our own. And instead of knowing exactly what our own parsers do at edge cases, we would have spent hours poring over the source code of the generic parser. In Albion’s case, he would have had to hand-code the extensions into the format after the generic parser had read in the PDB file. When he saved his files to disk, he would have had to do some nasty name-mangling so that the generic parser could read them back later. Aaaargh. As for me, since I only need a tiny set of PDB features, I would have had to load the heavy-duty library for the generic parser, then translate its data structures into mine, going through two or three onerous and unnecessary steps. Instead, I now have a very small routine that reads the PDB file directly into my data structures, skipping pretty much everything that does not interest me. It pleases me that someone out there has spent the time to write a PDB parser. It’s great for the day-tripper, but if you work with the PDB format every day, it’s time to roll your own.
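To give a feel for how small such a routine can be, here is a minimal sketch in Python. It is an illustration, not either of the parsers described above: the names `Atom`, `read_atoms` and `_to_int` are made up, the column positions are the standard fixed-width ATOM/HETATM fields, and the policies (keep only blank or "A" alternate locations, drop `HOH` waters, fall back to a running count when the serial field is not a number) are assumptions you would tune to your own files.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def _to_int(field: str, default: int) -> int:
    """Parse an integer field, falling back to `default` if it is not a number."""
    try:
        return int(field)
    except ValueError:
        return default


def read_atoms(lines: Iterable[str], keep_waters: bool = False) -> List[Atom]:
    """Read ATOM/HETATM records by column position and skip everything else."""
    atoms: List[Atom] = []
    for line in lines:
        # Only coordinate records are of interest; too-short lines are ignored.
        if line[:6] not in ("ATOM  ", "HETATM") or len(line) < 54:
            continue
        # Assumption: keep the first alternate conformation only (blank or "A").
        if line[16] not in (" ", "A"):
            continue
        res_name = line[17:20].strip()
        if res_name == "HOH" and not keep_waters:
            continue
        atoms.append(Atom(
            # Files with more than 99999 atoms often carry a non-numeric serial
            # field; in that case number the atoms in the order they are kept.
            serial=_to_int(line[6:11], len(atoms) + 1),
            name=line[12:16].strip(),
            res_name=res_name,
            chain_id=line[21],
            res_seq=_to_int(line[22:26], 0),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
        ))
    return atoms
```

Since the function takes any iterable of lines, an open file handle can be passed straight in, for example `read_atoms(open("model.pdb"))`, where `model.pdb` is a placeholder file name.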
Writing parsers is good for your inner computer scientist’s soul. I’ve written them for email, web scraping, functional languages, medical datasets and a wiki. There’s something satisfying about figuring out exactly what it is you need to find in a text file and all the possible edge cases that might need dealing with. What tools did you use to write the parser? Was it a roll-your-own (e.g., a sequence of Python functions) or did you write a grammar and have it compiled down to a state machine?
I once tried to go the generic route. Over time, the PDB folks have released several different versions of the PDB format. I used Andrew Dalke’s Martel package to write some code that read the format definition files and wrote out a parser. Then I added in lots of special cases to fix all of the common places where people ignore PDB standards. In the end, I had something that fully parsed just about every PDB file I could find. The trouble was, after all of the overhead (parsing a format definition file to write out a parser that parses files that don’t follow the format sucks), it was so slow as to be useless. These days, I basically just read ATOM and HETATM records and try to be smart when the columns overlap.
@Mark, nothing particularly fancy: just reading in a text file and loading in values one line at a time. Like I said, the parsing method is gratifyingly short, and fast. @Michael, I basically read ATOM and HETATM, with some parsing for chains and residues. The other information is sometimes useful. I’ve written a script that just extracts useful information, like the full name of the protein, its resolution and the organism.
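A rough sketch of what such an extraction script might look like, in Python. This is not the script mentioned above: the function name `read_summary` is hypothetical, and it assumes the remediated layout in which the title sits in TITLE records, the resolution in `REMARK 2`, and the organism in an `ORGANISM_SCIENTIFIC:` token inside SOURCE records. Older files with free-text SOURCE records would come back with an empty organism list.

```python
import re
from typing import Dict, Iterable, List, Optional, Union

Summary = Dict[str, Union[str, Optional[float], List[str]]]


def read_summary(lines: Iterable[str]) -> Summary:
    """Collect title, resolution and organism(s) from the header records."""
    title_parts: List[str] = []
    organisms: List[str] = []
    resolution: Optional[float] = None
    for line in lines:
        record = line[:6]
        if record == "TITLE ":
            # Columns 11-80 hold the text; continuation lines are joined by a space.
            title_parts.append(line[10:80].strip())
        elif record == "REMARK" and line[6:10].strip() == "2":
            # NMR entries say "NOT APPLICABLE", which leaves resolution as None.
            match = re.search(r"(\d+\.\d+)\s+ANGSTROMS", line)
            if match:
                resolution = float(match.group(1))
        elif record == "SOURCE":
            match = re.search(r"ORGANISM_SCIENTIFIC:\s*([^;]+)", line[10:80])
            if match:
                organisms.append(match.group(1).strip())
        elif record in ("ATOM  ", "HETATM"):
            # The header is over once coordinates start, so stop reading.
            break
    return {
        "title": " ".join(title_parts),
        "resolution": resolution,
        "organisms": organisms,
    }
```

Stopping at the first coordinate record keeps it fast on large files, since the header is a tiny fraction of the whole.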
Your friend Albion has my sympathies. I think most people who work with PDB files end up writing their own parsers, but it’s easier now, what with the remediation of the PDB. At least the non-standard amino acids and ligands have been standardised and they’ve added a chain ID for every chain. Not very many edge cases left to worry about :) for which I am profoundly thankful. |
||