19 November 2007

match.py - calculating the RMSD of PDB structures

Have you ever wanted a simple program to calculate the RMSD between two protein structures? Well, if you're clever, you might figure out how to do it in PyMOL or VMD, but sometimes you just want to do it from the command line.

Well wait no more, here's my RMSD program, written in Python 2.4, and since it's interpreted, you can look at the source-code to see how it's done. The heart of the RMSD calculation is just 9 lines of Python code. The code is short because it uses the wonderful linear algebra from the widely used numpy module (dmg for macosx10.4 & python2.4). The program is written with Python 2.4 and NumPy 1.0.

download: match.zip [16 kb]

The command inputs includes a string that defines the segments with which you want to align your pdb files. The format is essentially a python list of 2-tuples.

You can use match.py to calculate the RMSD, generate a structure with the optimal rotation, or calculate just the raw RMSD without rotation. The program match.py can also be imported into Python as a module if you need to chain your RMSD calculations.

Interpreting the segment string parameters

For starters, you can consult the rcsb.org website to get a complete description of a PDB file. But if you open a PDB file with a text editor, you will see lines like this:

ATOM     37  O   TYR A   3      -5.392  -1.551   4.271  1.00  0.20           O

'TYR' refers to the residue type in a polymer (protein, DNA) or the name of a small molecule.

The 'A' after TYR refers to the chain identifier if it is a polymer.

In a small molecule, usually, no chain is given, and a small molecule is refered by the residue number '37'

indeed, '37' is the residue number.

So, let's say you have two structures of the same protein with one chain, called chain 'A'. They are in test1.pdb and test2.pdb

The chain has residues 1-10

Then the command line would be

python match.py test1.pdb test2.pdb "[('A:1', 'A:10')]"

If on the other hand your pdb files has two chains, and chain "B" has 20 residues numbered from 30 to 50 then you

python match.py test1.pdb test2.pdb "[('A:1', 'A:10'), ('B:30', 'B:50')]"

If for some reason, the test2.pdb uses chain "E" instead of "A", and "F" instead of "B", then you can do:

python match.py test1.pdb test2.pdb "[('A:1', 'A:10'), ('B:30', 'B:50')]" "[('E:1', 'E:10'), ('F:30', 'F:50')]"