What is it to be alike? When you’re talking about protein structures, you need to align the structures as best you can before you can talk about similarity. This is not a trivial matter. The standard approach is to use the Kabsch least-squares algorithm to find the optimal superposition between two sets of coordinates.
But if you’ve had any experience with real structures or molecular dynamics trajectories, you’ll know this is not a straightforward process, because some regions of the protein may vary much more than others. To get a better fit, you may need to apply “human intuition” to manually select regions that appear to be more stable, and then align the structures only to these regions, letting the rest flop away. Manually selecting the stable regions is, to say the least, somewhat arbitrary.
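To make the manual procedure concrete, here is a minimal sketch of a Kabsch least-squares fit in Python with NumPy. The function name `kabsch_fit` and the boolean `selection` mask are my own illustrative choices, not part of any particular package. The fit is computed only on the selected atoms, and the resulting rotation and translation are then applied to every atom:

```python
import numpy as np


def kabsch_fit(mobile, target, selection=None):
    """Superpose `mobile` onto `target` (both N x 3 arrays of matched atoms).

    `selection` is a hypothetical boolean mask picking the atoms used for the
    fit (the hand-picked "stable" region); the transform is applied to all atoms.
    Returns the fitted coordinates and the RMSD over the selected atoms.
    """
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    if selection is None:
        selection = np.ones(len(mobile), dtype=bool)

    p = mobile[selection]
    q = target[selection]
    p_center = p.mean(axis=0)
    q_center = q.mean(axis=0)

    # 3 x 3 covariance matrix between the two centred coordinate sets
    h = (p - p_center).T @ (q - q_center)
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so we get a proper rotation, not a reflection
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Rotate about the selection centroid, then move onto the target centroid
    fitted = (mobile - p_center) @ rot.T + q_center
    rmsd = np.sqrt(np.mean(np.sum((fitted[selection] - q) ** 2, axis=1)))
    return fitted, rmsd
```

Calling `kabsch_fit(mobile, target)` fits on everything; passing a mask reproduces the "align to the stable bit and let the rest flop away" approach, with all the arbitrariness of choosing the mask left to you.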
Except you don’t need to fudge it anymore! Douglas Theobald and Deborah Wuttke have provided a new way of aligning structures, THESEUS, that automagically distinguishes stable regions from variable regions and aligns to the more stable regions, instead of the everything-in-the-blender of the Kabsch least-squares approach. It’s a beautiful algorithm, and here’s their example of the spectacularly better alignment of NMR models of the Kunitz domain 2sdf, where LS is the Kabsch least-squares approach and ML refers to the Maximum Likelihood approach of Theseus:

The killer application is in aligning crystal structures of different proteins in the same domain family. Typically, you first select a set of “conserved” residues and then do a Kabsch least-squares fit to just those “conserved” residues. This is normally an iterative process where you manually refine which residues count as conserved. Instead, Theseus hooks nicely into multiple-sequence alignment programs, and using the sequence alignment, it can produce a totally automated structural alignment that is every bit as good as, if not better than, a manual approach.
As an example, I’ve been working on the PDZ domains, such as 1x8s:

And using Theseus hooked to Clustalw (plz don’t hit me Paulo) on a bunch of PDZ domains (1be9, 1bfe, 1d5g, 1m5z, 1nf3, 1ry4, 1rzx, 1vj6, 1x8s, 2qkt-a, 2qku-a, 2qku-b, 2qku-c, 2qkv-a, 3pdz), I get this alignment (yellow is non-aligned residues, and green is non-protein ligands):

I never once had to apply “human intuition”. Aligning structures has never been this easy!
I’ve played around with this a little. The big thing that this program is missing is the ability to read trajectory files like dcd or xtc, and write out the corresponding aligned trajectories.
The other neat thing would be a way of taking the alignment that THESEUS kicks out and developing a version of the standard alignment algorithm that does these alignments on-the-fly using a per-atom weighting. This would allow you to use the ‘corrected’ alignment as an order parameter in MD simulations.
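Something like the following is what I have in mind: a rough sketch of a weighted least-squares superposition in Python with NumPy. To be clear, this is not the maximum-likelihood method Theseus uses, just the standard Kabsch fit with per-atom weights. The function names and the inverse-variance weighting are assumptions for illustration; in practice the weights would have to come from somewhere you trust, such as the variances of a previous alignment.

```python
import numpy as np


def weighted_superpose(mobile, target, weights):
    """Weighted Kabsch fit of `mobile` onto `target` (both N x 3 arrays).

    `weights` holds one non-negative number per atom; atoms with larger
    weights dominate the fit. Returns fitted coordinates and weighted RMSD.
    """
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to one

    # Weighted centroids
    p_center = w @ mobile
    q_center = w @ target
    p = mobile - p_center
    q = target - q_center

    # Weighted 3 x 3 covariance matrix, then the usual SVD step
    h = (p * w[:, None]).T @ q
    u, _, vt = np.linalg.svd(h)
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    fitted = p @ rot.T + q_center
    wrmsd = np.sqrt(np.sum(w * np.sum((fitted - target) ** 2, axis=1)))
    return fitted, wrmsd


def inverse_variance_weights(frames, floor=1e-6):
    """Hypothetical weighting: 1 / positional variance of each atom.

    `frames` is an (n_frames, n_atoms, 3) array that has already been
    superposed somehow; `floor` avoids division by zero for rigid atoms.
    """
    frames = np.asarray(frames, dtype=float)
    variance = frames.var(axis=0).sum(axis=1)
    return 1.0 / (variance + floor)
```

The weighted RMSD returned for each frame against a reference is the sort of quantity you could then track as an order parameter.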
@John, neat idea! Give me a spare weekend and I can whip up a python script that should do this.
@Bosco – which point were you referring to? For the first point, I think it would be fairly simple now that gromacs has released a stand-alone portable library to read and write xtc/trr formatted files. See the front page of the Gromacs webpage or:
ftp://ftp.gromacs.org/pub/contrib/xdrfile-1.0.tar.gz
As far as the second point goes, you would have to have run enough MD (or have several xtal structures) to have some confidence in the flexibility of the protein. Maybe you could use B-factors as a guide, or the results of an elastic network model, but I’d feel less confident with these.
A quick search found that someone has already done something along these lines (weighting the RMSD alignment):
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1471868
Hey @Bosco, how about using this utility:
http://pdbjs6.pdbj.org/MAFFTash.3/
If you want to use BlastP, just click on “Need help picking PDB IDs? Use Prep-MAFFTash.” on the right of the panel.
Thanks.
