Parsing PDB files: sometimes you really should reinvent the wheel

01 Oct 2007 // protein

The other day, I was yabbing to Albion, another postdoc in our lab, about the PDB format, which stores the atomic coordinates of protein and DNA molecules. It's a crusty old format, hobbled mainly by the fact that it is a fixed-width text format. We both complained bitterly about it like two Chinese women buying vegetables at the local market, but what struck me about our tête-à-tête was that we both felt it was necessary to write our own PDB parser.

Why did I think that surprising? There are over 30,000 protein structures deposited in the public PDB database. The PDB format is one of the most heavily used data formats in protein research. Surely, there must be good PDB parsers written in every major computer language (and there are). Isn't one of the rules of writing software to not reinvent the wheel?

Here's the rub. Unless you're doing something that was envisaged by the writers of the generic parser, it will break. The uses of the PDB format vary enormously. For instance, Albion uses the PDB format to store huge models of microtubules. A microtubule is made up of dozens if not hundreds of individual, separate protein chains. The PDB format allows only 26 chains, because it reserves a single character field to denote a chain. It also allows only 99999 atoms in a single file, because it reserves 5 characters to store the atom number. Albion must use a non-standard extension to store a microtubule made up of hundreds of chains and hundreds of thousands of atoms. On the other hand, I only look at small proteins, typically one or two chains, with a ligand or two. However, I do want to be able to read any PDB file from the database, and ignore anything that I don't care about: waters, prosthetic groups, alternate conformations.
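To make the atom-number limit concrete, here is a minimal Python sketch. The column positions are those of the standard ATOM record (serial number in columns 7 to 11, chain identifier in column 22); the constant and function names are made up for illustration and do not come from any real parser.

```python
# Column positions (1-based, inclusive) in a standard PDB ATOM record.
SERIAL_COLUMNS = (7, 11)   # atom serial number: 5 characters
CHAIN_COLUMN = 22          # chain identifier: 1 character


def field_width(columns):
    """Number of characters in a field given as (first, last) columns."""
    first, last = columns
    return last - first + 1


def max_serial():
    """Largest atom number that fits in the serial field."""
    return 10 ** field_width(SERIAL_COLUMNS) - 1


def format_serial(serial):
    """Right-justify an atom number into its field; fail if it overflows."""
    width = field_width(SERIAL_COLUMNS)
    text = str(serial)
    if len(text) > width:
        raise ValueError(f"atom number {serial} does not fit in {width} columns")
    return text.rjust(width)
```

Calling format_serial(100000) raises an error in this sketch, which is exactly the wall a model with hundreds of thousands of atoms runs into. Any workaround is a non-standard extension that a strict parser may refuse to read.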

Of course, both Albion and I could have downloaded a generic parser. But no generic parser can anticipate all the uses for it. By design, generic parsers (BioPerl, BioPython) must be strict in order to cover even a semi-reasonable set of possible uses. This makes the code brittle. It will break if your PDB file is not exactly as it expects.

But if either of us had decided to use a generic parser, we would have spent as much time writing code dealing with the parser as we actually spent writing our own parser. And instead of knowing exactly what our own parsers do at edge cases, we would have spent hours poring over the source code of the generic parser.

In Albion's case, he would have to hand-code the extensions into the format after the generic parser had read in the PDB file. When he saved his files to disk, he would have to do some nasty name-mangling so that the generic parser could read them back later. Aaaargh.

As for me, since I only need a tiny set of PDB features, I would have to load the heavy-duty library for the generic parser, then translate its data structures into my data structures, going through two or three onerous and unnecessary steps. Instead, I now have a very small routine that reads the PDB file directly into my data structures, skipping pretty much everything that does not interest me.
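A small routine of that kind might look something like the following Python sketch. It is not the actual routine described above: the Atom class and parse_atoms function are hypothetical names, and the filtering rules (keep only ATOM records, keep only the first alternate conformation) are assumptions chosen for illustration. The column slices follow the standard ATOM record layout.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atoms(lines):
    """Read ATOM records into Atom objects, skipping everything else."""
    atoms = []
    for line in lines:
        # HETATM records (waters, ligands, prosthetic groups), headers and
        # remarks do not start with "ATOM  ", so they are silently ignored.
        if not line.startswith("ATOM  "):
            continue
        # Ignore truncated records that stop before the z coordinate.
        if len(line) < 54:
            continue
        # Column 17 is the alternate-location flag; keep blank or "A" only.
        if line[16] not in (" ", "A"):
            continue
        atoms.append(Atom(
            serial=int(line[6:11]),        # columns 7-11
            name=line[12:16].strip(),      # columns 13-16
            res_name=line[17:20].strip(),  # columns 18-20
            chain=line[21],                # column 22
            res_seq=int(line[22:26]),      # columns 23-26
            x=float(line[30:38]),          # columns 31-38
            y=float(line[38:46]),          # columns 39-46
            z=float(line[46:54]),          # columns 47-54
        ))
    return atoms
```

With a file on disk, this could be used as `with open(path) as f: atoms = parse_atoms(f)`, where path is whatever PDB file you have. Every decision about what to skip sits in a few visible lines, so the behaviour at edge cases is never a mystery. Keeping a ligand, for example, would mean adding one more condition for HETATM records.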

It pleases me that someone out there has spent the time to write a PDB parser. It's great for the day-tripper, but if you work with the PDB format every day, it's time to roll your own.