If it's your day job to push proteins *in silico*, then one day you will
have to calculate the RMSD of a protein. You've just simulated protein
GinormousA for a whole microsecond, but you don't even know if
GinormousA is stable. You could load up a protein viewer and eyeball the
stability of the protein throughout the simulation. But eyeballing is
just not going to cut it with your boss. You need to slap a number on
it. This is where RMSD comes in.

A protein conformation is basically a set of 3-dimensional coordinates.
This means that a conformation is a set of N vectors x_{n}, where each
x_{n} has three components. As the protein evolves along the
trajectory, the conformation changes, resulting in a new protein
conformation and a new set of vectors y_{n}. We capture the difference
between conformations through the two sets of vectors x_{n} and
y_{n}. To make life easier, we shift the centre-of-mass of x_{n} and
y_{n} to the origin of the coordinates (we can always shift them back
afterwards). Then the difference can be captured by the square
distance:

E = Σ_{n} |x_{n} - y_{n}|^{2}

The root mean-square deviation (RMSD) over the N vectors is then:

RMSD = √( E / N )
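As a quick sketch in numpy (the coordinates here are made up for illustration):

```python
import numpy

# Two hypothetical conformations: N vectors of 3 coordinates each.
x = numpy.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
y = numpy.array([[1.1, 0.0, 0.0],
                 [0.0, 0.9, 0.0],
                 [0.0, 0.0, 1.0]])

# Shift each centre-of-mass (here, a plain centroid) to the origin.
x = x - x.mean(axis=0)
y = y - y.mean(axis=0)

e = numpy.sum((x - y) ** 2)    # E = sum_n |x_n - y_n|^2
rmsd = numpy.sqrt(e / len(x))  # RMSD = sqrt(E / N)
```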

But hold on, you say: the mean-square measure doesn't measure the
similarity very well. Just a little rotation of the set of y_{n},
which doesn't change the internal arrangement of y_{n}, would distort
the RMSD. What we really need is to find the best rotation of y_{n}
with respect to x_{n} *before* taking the RMSD. To rotate the vectors
y_{n}, we apply a rotation matrix U to get y'_{n} = U y_{n}. We can
then express E as a function of the rotation U:

E = Σ_{n} |x_{n} - U y_{n}|^{2}

To find the RMSD, we must first find the rotation U_{min} that
minimizes E.
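To see the problem concretely, here's a sketch (with made-up coordinates) where y_{n} is just x_{n} rigidly rotated by 90° about the z-axis; the internal geometry is identical, yet the naive RMSD is far from zero:

```python
import numpy

# A hypothetical centred conformation.
x = numpy.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [-1.0, -1.0, 0.0]])

# Rotate by 90 degrees about the z-axis: y_n = U x_n.
theta = numpy.pi / 2
u = numpy.array([[numpy.cos(theta), -numpy.sin(theta), 0.0],
                 [numpy.sin(theta),  numpy.cos(theta), 0.0],
                 [0.0,               0.0,              1.0]])
y = numpy.dot(x, u.T)  # each row is U x_n

# The naive RMSD is large even though x and y have the same shape.
naive_rmsd = numpy.sqrt(numpy.sum((x - y) ** 2) / len(x))
```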

## Matching sets of vectors

The classic papers of Wolfgang Kabsch showed how to minimize E to get the RMSD. The algorithm described in those papers has since become part of the staple diet of protein analysis. As the original proof of the Kabsch equations, which uses 6 different Lagrange multipliers, is rather tedious, we'll describe a much simpler proof using standard linear algebra.

First, we expand E (which is a function of U):

E = Σ_{n} ( |x_{n}|^{2} + |y_{n}|^{2} ) - 2 Σ_{n} x_{n} · U y_{n}

E = E_{0} - 2 Σ_{n} x_{n} · U y_{n}

The first part, E_{0}, is invariant with respect to the rotation U.
The variation resides in the last term, L = Σ_{n} x_{n} · U y_{n}. To
minimize the RMSD, we must find the matrix U_{min} that gives the
largest possible value of L, which we call L_{max}, from which we get:

RMSD = √( E_{min} / N ) = √( ( E_{0} - 2 L_{max} ) / N )
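The decomposition of E is easy to verify numerically; a sketch with arbitrary (made-up) coordinates and an arbitrary rotation:

```python
import numpy

x = numpy.array([[1.0, 2.0, 0.5],
                 [-0.5, 0.0, 1.0],
                 [0.0, -1.0, -0.5]])
y = numpy.array([[0.8, 1.9, 0.7],
                 [-0.4, 0.2, 0.9],
                 [0.1, -1.1, -0.6]])

# An arbitrary rotation: 30 degrees about the z-axis.
t = numpy.pi / 6
u = numpy.array([[numpy.cos(t), -numpy.sin(t), 0.0],
                 [numpy.sin(t),  numpy.cos(t), 0.0],
                 [0.0,           0.0,          1.0]])

uy = numpy.dot(y, u.T)                      # each row is U y_n
e_direct = numpy.sum((x - uy) ** 2)         # E computed directly
e0 = numpy.sum(x ** 2) + numpy.sum(y ** 2)  # the rotation-invariant part
l = numpy.sum(x * uy)                       # L = sum_n x_n . (U y_n)
# e_direct equals e0 - 2 * l for any rotation U
```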

The trick is to realize that we can promote the sets of vectors
x_{n} and y_{n} into the {3 x N} matrices X and Y, whose columns are
the vectors x_{n} and y_{n}. Then we can write L as a trace:

L = Tr( X^{t} U Y )

where X^{t} is an {N x 3} matrix and Y is a {3 x N} matrix.
Juggling the matrices inside the trace, we get:

L = Tr( X^{t} U Y )    (the trace of an {N x N} matrix)

= Tr( U Y X^{t} )    (the trace of a {3 x 3} matrix)

= Tr( U R )

We don't know what U is, but we can study the other matrix, R = Y
X^{t}, which is the {3 x 3} correlation matrix between X and Y. By
invoking the powerful Singular Value Decomposition (SVD) theorem, we
know that we can always write R as:

R = V S W^{t}

and obtain the {3 x 3} matrices V and W (which are orthogonal) and S
(which is diagonal, with non-negative elements σ_{i}). We substitute
these matrices into the troublesome L term of the square-distance
equation:

L = Tr( U V S W^{t} ) = Tr( S W^{t} U V )
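In numpy, the SVD of the {3 x 3} correlation matrix is a single call; a sketch with made-up centred coordinates:

```python
import numpy

x = numpy.array([[1.0, 0.0, 0.2],
                 [0.0, 1.0, -0.1],
                 [-1.0, -1.0, -0.1]])
y = numpy.array([[0.9, 0.1, 0.2],
                 [0.1, 1.1, 0.0],
                 [-1.0, -1.2, -0.2]])

# R = Y X^t, where the columns of X and Y are the vectors x_n and y_n.
r = numpy.dot(numpy.transpose(y), x)

# numpy returns V, the diagonal of S, and W^t such that R = V S W^t.
v, s, w_tr = numpy.linalg.svd(r)
```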

As S is diagonal, if we group the matrices W^{t} U V together
into the matrix T, we can get a simple expression for the trace in terms
of the σ_{i}:

L = Tr( S T ) = σ_{1} T_{11} + σ_{2} T_{22} + σ_{3} T_{33}

Writing L in this form, we can figure out the maximum value of L. The
secret sauce is in T. If we look at the definition of T, we can see
that *T is a product of only orthogonal matrices*, namely the matrices
U, W^{t} and V. T must therefore also be orthogonal.

Thus, the largest possible value for any element of T is 1, or
T_{ij} ≤ 1. Now L is a sum of terms linear in the diagonal elements of
T, so the maximum value of L occurs when each T_{ii} = 1:

L_{max} = Tr( S T_{max} ) = σ_{1} + σ_{2} + σ_{3} = Tr( S )
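As a sanity check, no rotation can push L past Tr(S); a quick sketch with a made-up correlation matrix and one arbitrary rotation:

```python
import numpy

# A made-up 3x3 correlation matrix R.
r = numpy.array([[2.0, 1.0, 0.0],
                 [1.0, 2.0, 0.0],
                 [0.0, 0.0, 1.0]])
s = numpy.linalg.svd(r, compute_uv=False)
l_max = numpy.sum(s)  # sigma_1 + sigma_2 + sigma_3

# L = Tr(U R) for one particular rotation U (0.7 radians about z).
t = 0.7
u = numpy.array([[numpy.cos(t), -numpy.sin(t), 0.0],
                 [numpy.sin(t),  numpy.cos(t), 0.0],
                 [0.0,           0.0,          1.0]])
l = numpy.trace(numpy.dot(u, r))
# l never exceeds l_max, whatever rotation we try
```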

Plugging this L_{max} into the equation for the RMSD, we get:

RMSD = √[ ( E_{0} - 2 ( σ_{1} + σ_{2} + σ_{3} ) ) / N ]

From this analysis, we can also get the rotation that gives the
optimal superposition of the vectors. When each T_{ii} = 1, then T =
I, the identity matrix. This immediately gives us an equation for the
optimal rotation U_{min} in terms of the left and right singular
vectors of the matrix R:

T_{max} = W^{t} U_{min} V = I

U_{min} = W V^{t}
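Putting the pieces together, here's a sketch with made-up coordinates, where y is x rigidly rotated by 90° about the z-axis, so a perfect superposition exists:

```python
import numpy

x = numpy.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [-1.0, -1.0, -1.0]])
# y_n is x_n rotated by 90 degrees about the z-axis.
y = numpy.array([[0.0, 1.0, 0.0],
                 [-1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, -1.0, -1.0]])

r = numpy.dot(numpy.transpose(y), x)  # R = Y X^t
v, s, w_tr = numpy.linalg.svd(r)      # R = V S W^t
u_min = numpy.dot(numpy.transpose(w_tr),
                  numpy.transpose(v))  # U_min = W V^t

# Applying U_min to each y_n recovers x_n exactly in this example.
y_superposed = numpy.dot(y, u_min.T)
```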

## How to solve for R

In the above section, we showed that the RMSD can be expressed in
terms of the singular values σ_{i} of R, but we didn't say how to find
them. Kabsch, who first derived the solution, showed that instead of
solving R directly, we can work with R^{t} R, which is a nice
symmetric real matrix, and much easier to solve. Expanding R^{t} R, we
get a relation to the S matrix of R:

R^{t} R = ( W S V^{t} )( V S W^{t} ) = W S^{2} W^{t}.

From this expansion, we can see that the singular value matrix of
R^{t} R is S^{2}. In other words, the singular values λ_{i} of
R^{t} R are related to the singular values σ_{i} of R by the relation
λ_{i} = σ_{i}^{2}. Also, the right singular vector matrix W is the
same for both R and R^{t} R.

Since R^{t} R is a symmetric real matrix, its eigenvalues are
identical to its singular values λ_{i}. Thus we can use the standard
Jacobi transform algorithm to find the eigenvalues of R^{t} R. We then
substitute σ_{i} = √λ_{i} into the RMSD formula:

RMSD = √[ ( E_{0} - 2 ( √λ_{1} + √λ_{2} + √λ_{3} ) ) / N ]
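Both routes give the same numbers, which is easy to check in numpy (with made-up coordinates); here numpy's symmetric eigensolver stands in for the Jacobi transform:

```python
import numpy

x = numpy.array([[1.0, 0.0, 0.2],
                 [0.0, 1.0, -0.3],
                 [-1.0, -1.0, 0.1]])
y = numpy.array([[0.9, 0.1, 0.1],
                 [0.2, 1.0, -0.2],
                 [-1.1, -1.1, 0.1]])

r = numpy.dot(numpy.transpose(y), x)           # R = Y X^t
sigma = numpy.linalg.svd(r, compute_uv=False)  # singular values of R

rtr = numpy.dot(numpy.transpose(r), r)  # R^t R, symmetric
lam = numpy.linalg.eigvalsh(rtr)        # eigenvalues, ascending order
# lambda_i = sigma_i^2 (clip tiny negative round-off before the sqrt)
sigma_from_lam = numpy.sqrt(numpy.maximum(lam[::-1], 0.0))
```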

## The curse of chirality

Unfortunately, there's one more ambiguity that we have to deal with
before we get to the Kabsch equation for the RMSD. The ambiguity
arises from a degeneracy in the U_{min} obtained from the SVD theorem.
U_{min} is guaranteed to be orthogonal, but for our RMSD calculation
we don't just want orthogonal matrices, we want rotation matrices.
Orthogonal matrices can have a determinant of +1 or -1. We want proper
rotation matrices, which have a determinant of +1, and not reflected
rotation matrices, which have a determinant of -1. A reflection in
U_{min} would scramble the internal arrangement of our set of
coordinates.

To detect a reflection in the solution U_{min} = W V^{t}, we check the
determinants of W and V:

ω = det( W ) * det( V^{t} )

If ω = -1, then U_{min} contains a reflection, and we must get rid of
it. The easiest way to do so is to reflect the smallest principal axis
in U_{min}. In the RMSD formula, this amounts to simply reversing the
sign of the third (smallest) term √λ_{3}, which, finally, gives the
Kabsch formula for the RMSD in terms of λ_{i} and ω:

RMSD = √[ ( E_{0} - 2 ( √λ_{1} + √λ_{2} + ω √λ_{3} ) ) / N ]

## Implementations of RMSD in code

In C, see rmsd.c and rmsd.h. If you are using C++, don't forget to wrap the include in 'extern "C" {}'.

Here's my Python program match.py to calculate the RMSD between PDB structures.

It uses the very excellent numpy library. The heart of the program is this short function, which takes two sets of coordinates that have already been centred on the origin:

```python
import numpy

def rmsd(crds1, crds2):
  """Returns RMSD between 2 sets of [nx3] numpy array"""
  assert(crds1.shape[1] == 3)
  assert(crds1.shape == crds2.shape)
  n_vec = numpy.shape(crds1)[0]
  e0 = numpy.sum(crds1 * crds1) + numpy.sum(crds2 * crds2)
  correlation_matrix = numpy.dot(numpy.transpose(crds1), crds2)
  v, s, w_tr = numpy.linalg.svd(correlation_matrix)
  is_reflection = (numpy.linalg.det(v) * numpy.linalg.det(w_tr)) < 0.0
  if is_reflection:
    s[-1] = -s[-1]  # the omega factor: flip the smallest singular value
  e = max(e0 - 2.0 * numpy.sum(s), 0.0)  # clip round-off below zero
  return numpy.sqrt(e / n_vec)
```