RMSD: Root Mean Square Deviation

If it's your day job to push proteins in silico then you will one day have to calculate the RMSD of a protein. You've just simulated protein GinormousA for a whole microsecond but you don't even know if GinormousA is stable. You could load up a protein viewer and eyeball the stability of the protein throughout the simulation. But eyeballing is just not going to cut it with your boss. You need to slap a number on it. This is where RMSD comes in.

A protein conformation is basically a set of 3-dimensional coordinates. This means that a conformation is a set of N vectors xn, where each xn has three components. As the protein evolves along the trajectory, the conformation changes, resulting in a new protein conformation and a new set of vectors yn. We capture the difference between conformations through the two sets of vectors xn and yn. To make life easier, we shift the centres-of-mass of xn and yn to the origin of the coordinates (we can always shift them back afterwards). Then the difference can be captured by the square distance:

E = Σn |xn - yn|²

The root mean-square deviation (RMSD) is then

RMSD = √( E / N )
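As a sanity check, this centred, unrotated RMSD takes only a few lines of numpy (a sketch with made-up coordinates, not real PDB data):

```python
import numpy

# Two toy conformations: N = 3 atoms, 3 coordinates each (made-up numbers)
x = numpy.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
y = numpy.array([[1.1, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 1.0]])

# Shift each centre-of-mass (here just the centroid) to the origin
x -= x.mean(axis=0)
y -= y.mean(axis=0)

# E = Σn |xn - yn|², then RMSD = √( E / N )
e = numpy.sum((x - y)**2)
rmsd = numpy.sqrt(e / len(x))   # 1/15 ≈ 0.0667 for these numbers
```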

But hold on, you say, this measure doesn't capture the similarity very well. Just a rigid rotation of the set of yn, which doesn't change the internal arrangement of yn, would distort the RMSD. What we really need is to find the best rotation of yn with respect to xn before taking the RMSD. To rotate the vectors yn, we apply a rotation matrix U to get y'n = U yn. We can then express E as a function of the rotation U:

E = Σn |xn - U yn|²

To find the RMSD, we must first find the rotation Umin that minimizes E.

Matching sets of vectors

The classic papers of Wolfgang Kabsch showed how to minimize E to get the RMSD. The algorithm described in those papers has since become part of the staple diet of protein analysis. As the original proof of the Kabsch equations, which uses 6 different Lagrange multipliers, is rather tedious, we'll describe a much simpler proof using standard linear algebra.

First we expand E (which is a function of U)

E = Σn ( |xn|² + |yn|² ) - 2 Σn xn·U yn

E = E0 - 2 Σn xn·U yn

The first part E0 is invariant under the rotation U. The variation resides in the last term L = Σn xn·U yn. To minimize E, and hence the RMSD, we need to find the matrix Umin that gives the largest possible value Lmax, from which we get:

RMSD = √( Emin / N ) = √( ( E0 - 2 Lmax ) / N )

The trick is to realize that we can promote the sets of vectors xn and yn into the {3 x N} matrices X and Y, whose columns are the vectors xn and yn. Then we can write L as a trace

L = Tr( Xt U Y )

where Xt is an {N x 3} matrix and U Y is a {3 x N} matrix. Juggling the matrices inside the trace (a trace is invariant under cyclic permutations), we get:

L = Tr( Xt U Y )    trace of a {N x N} matrix
   = Tr( U Y Xt )     trace of a {3 x 3} matrix
   = Tr( U R )
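This identity is easy to check numerically. Here's a sketch with made-up coordinates; the vectors are stored as rows of N x 3 numpy arrays, so the correlation matrix R = Y Xt becomes numpy.dot(y.T, x):

```python
import numpy

rng = numpy.random.default_rng(0)
x = rng.standard_normal((5, 3))   # rows are the vectors xn
y = rng.standard_normal((5, 3))   # rows are the vectors yn

# Any orthogonal U will do for the check; build one via a QR decomposition
u, _ = numpy.linalg.qr(rng.standard_normal((3, 3)))

# Direct sum: L = Σn xn·(U yn)
l_direct = sum(numpy.dot(xn, numpy.dot(u, yn)) for xn, yn in zip(x, y))

# Trace form: L = Tr( U R ) with the {3 x 3} matrix R = Y Xt
r = numpy.dot(y.T, x)
l_trace = numpy.trace(numpy.dot(u, r))

assert abs(l_direct - l_trace) < 1e-12
```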

We don't know what U is, but we can study the other matrix R = Y Xt, which is the {3 x 3} correlation matrix between X and Y. By invoking the powerful Singular Value Decomposition theorem, we know that we can always write R as:

R = V S Wt

and obtain the {3 x 3} matrices V and W (which are orthogonal) and S (which is diagonal with non-negative elements σi). We substitute these matrices into the troublesome L term of the square-distance equation:

L = Tr( U V S Wt ) = Tr( S Wt U V )
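If you want to see these factors concretely, numpy.linalg.svd hands back V, the diagonal of S, and Wt directly (a sketch with made-up coordinates):

```python
import numpy

rng = numpy.random.default_rng(1)
x = rng.standard_normal((5, 3))
y = rng.standard_normal((5, 3))
r = numpy.dot(y.T, x)             # {3 x 3} correlation matrix R = Y Xt

# numpy returns R = V S Wt as (V, the diagonal σi of S, Wt)
v, sigma, w_tr = numpy.linalg.svd(r)

assert numpy.allclose(numpy.dot(v.T, v), numpy.eye(3))   # V is orthogonal
assert numpy.all(sigma >= 0.0)                           # S is non-negative
assert numpy.allclose(numpy.dot(v * sigma, w_tr), r)     # V S Wt rebuilds R
```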

As S is diagonal, if we group the matrices Wt U V together into the matrix T, we can get a simple expression for the trace in terms of the σi:

L = Tr( S T ) = σ1 T11 + σ2 T22 + σ3 T33

Writing L in this form, we can figure out the maximum value of L. The secret sauce is in T. From its definition, T = Wt U V is a product of orthogonal matrices, namely Wt, U and V, so T must also be orthogonal.

Since the rows and columns of an orthogonal matrix are unit vectors, the largest possible value for any element of T is 1, or Tij ≤ 1. Now L is a sum of terms linear in the diagonal elements of T, so the maximum value of L occurs when each Tii = 1,

Lmax = Tr( S Tmax ) = σ1 + σ2 + σ3 = Tr ( S )

Plugging this expression for Lmax into the equation for the RMSD, we get

RMSD = √ [ ( E0 - 2 ( σ1 + σ2 + σ3 ) ) / N ]

From this analysis, we can also get the rotation that gives the optimal superposition of the vectors. When every Tii = 1, then T = I, the identity matrix. This immediately gives us an equation for the optimal rotation Umin in terms of the left and right singular vectors of the matrix R:

Tmax = Wt Umin V = I
Umin = W Vt
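A quick numerical check that Umin = W Vt really attains Lmax = Tr(S). This is a sketch with made-up coordinates, ignoring for now the chirality correction discussed below:

```python
import numpy

rng = numpy.random.default_rng(2)
x = rng.standard_normal((5, 3))
y = rng.standard_normal((5, 3))
r = numpy.dot(y.T, x)                      # R = Y Xt

v, sigma, w_tr = numpy.linalg.svd(r)       # R = V S Wt
u_min = numpy.dot(w_tr.T, v.T)             # Umin = W Vt

# At the optimum, L = Tr( Umin R ) equals the sum of the singular values
l_max = numpy.trace(numpy.dot(u_min, r))
assert numpy.isclose(l_max, numpy.sum(sigma))

# No other orthogonal matrix should beat it
q, _ = numpy.linalg.qr(rng.standard_normal((3, 3)))
assert numpy.trace(numpy.dot(q, r)) <= l_max + 1e-12
```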

How to solve for the singular values of R

In the above section, we showed that the RMSD can be expressed in terms of the singular values σi of R, but we didn't say how to find them. Kabsch, who first derived the solution, showed that instead of decomposing R directly, we can work with the matrix Rt R, which is a nice symmetric real matrix and much easier to solve. Expanding Rt R, we get a relation to the S matrix of R:

Rt R = ( W S Vt )( V S Wt ) = W S² Wt.

From this expansion, we can see that the diagonal term of Rt R is S². In other words, the eigenvalues λi of Rt R are related to the singular values σi of R by λi = σi². Also, the matrix of singular vectors W is the same for both R and Rt R.

Since Rt R is a real symmetric matrix, we can use the standard Jacobi transform algorithm to find its eigenvalues, which are exactly the λi. We then substitute σi = √λi into the RMSD formula:

RMSD = √ [ ( E0 - 2 ( √λ1 + √λ2 + √λ3 ) ) / N ]
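With numpy we don't need to hand-code a Jacobi sweep: numpy.linalg.eigvalsh diagonalizes the symmetric matrix Rt R for us, and we can confirm λi = σi² against a direct SVD (a sketch with made-up coordinates):

```python
import numpy

rng = numpy.random.default_rng(3)
x = rng.standard_normal((5, 3))
y = rng.standard_normal((5, 3))
r = numpy.dot(y.T, x)

# Eigenvalues λi of the symmetric {3 x 3} matrix Rt R ...
lam = numpy.linalg.eigvalsh(numpy.dot(r.T, r))

# ... are the squares of the singular values σi of R
sigma = numpy.linalg.svd(r, compute_uv=False)
assert numpy.allclose(numpy.sort(numpy.sqrt(numpy.maximum(lam, 0.0))),
                      numpy.sort(sigma))
```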

The curse of chirality

Unfortunately, there's one more ambiguity that we have to deal with before we get to the Kabsch equation for the RMSD. The ambiguity arises from a degeneracy in the Umin obtained from the SVD theorem. Umin is guaranteed to be orthogonal, but for our RMSD calculation we don't just want an orthogonal matrix, we want a rotation matrix. Orthogonal matrices can have determinants of +1 or -1. We want rotation matrices, which have determinants of +1, and not reflected rotation matrices, which have determinants of -1. A reflection in Umin would scramble the internal arrangement of our set of coordinates.

To restrict the solutions of Umin = W Vt to pure rotations, we check the determinants of W and V:

ω = det( W ) * det( Vt )

If ω = -1 then there is a reflection in Umin, and we must get rid of it. The easiest way to do that is to reflect the smallest principal axis in Umin. In the RMSD formula, this simply reverses the sign of the third (smallest) term √λ3, which, finally, gives the Kabsch formula for the RMSD in terms of λi and ω:

RMSD = √ [ ( E0 - 2 ( √λ1 + √λ2 + ω √λ3 ) ) / N ]
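To see ω do its job, feed the formula a mirrored structure: a reflected copy of a set of coordinates produces ω = -1 (a sketch with made-up coordinates):

```python
import numpy

rng = numpy.random.default_rng(4)
x = rng.standard_normal((5, 3))
y = x.copy()
y[:, 2] = -y[:, 2]                 # mirror y through the xy-plane

r = numpy.dot(y.T, x)
v, sigma, w_tr = numpy.linalg.svd(r)
omega = numpy.linalg.det(v) * numpy.linalg.det(w_tr)
assert numpy.isclose(omega, -1.0)  # the reflection is flagged by ω = -1
```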

Implementations of RMSD in code

In C, see rmsd.c, rmsd.h. If using C++, don't forget to wrap the header in 'extern "C" { }'.

Here's my Python program match.py to calculate the RMSD between PDB structures.

It uses the very excellent numpy library. The heart of the program is one short function:

import numpy

def rmsd(crds1, crds2):
  """Returns the RMSD between 2 sets of [n x 3] numpy arrays,
  assuming the coordinates are already centred on the origin"""
  assert crds1.shape[1] == 3
  assert crds1.shape == crds2.shape
  n_vec = numpy.shape(crds1)[0]
  e0 = numpy.sum(crds1 * crds1) + numpy.sum(crds2 * crds2)
  correlation_matrix = numpy.dot(numpy.transpose(crds1), crds2)
  v, s, w_tr = numpy.linalg.svd(correlation_matrix)
  is_reflection = (numpy.linalg.det(v) * numpy.linalg.det(w_tr)) < 0.0
  if is_reflection:
    s[-1] = -s[-1]   # reverse the sign of the smallest singular value
  e = e0 - 2.0 * numpy.sum(s)
  return numpy.sqrt(max(e, 0.0) / n_vec)