*Originally written 1/7/2005.*

*Get my PDB RMSD tool pdbrmsd in the pdbremix package.*

*I had previously mixed up some matrix dimensions, thx to readers CY L & toto*

If it's your day job to push proteins *in silico* then you will one-day
have to calculate the RMSD of a protein. You've just simulated protein
GinormousA for a whole micro-second but you don't even know if
GinormousA is stable. You could load up a protein viewer and eyeball the
stability of the protein throughout the simulation. But eye-balling is
just not going to cut it with your boss. You need to slap a number on
it. This is where RMSD comes in.

A protein conformation is basically a set of 3-dimensional coordinates.
This means that a conformation is a set of n vectors, x_{n},
where each x_{n} has three components. As the protein evolves
along the trajectory, the conformation changes, resulting in a new
protein conformation and a new set of vectors y_{n}. We capture
the difference between conformations through the two sets of vectors
x_{n} and y_{n}. To make life easier, we shift the
centre-of-mass of y_{n} and x_{n} to the origin of the
coordinates (we can always shift them back afterwards). Then the
difference can be captured by the square distance:

E = Σ

_{n}|x_{n}- y_{n}|^{2}

The root mean-square distance (RMSD) is then

RMSD = √ E / N

But hola, you say, the mean-square measure doesn't measure the
similarity very well. Just a little rotation of the set of
y_{n}, which doesn't change the internal arrangement of
y_{n}, would distort the RMSD. What we really need, is to find
the best rotation of y_{n} with respect to x_{n}
*before* taking the RMSD.

To rotate the vectors y_{n}, we apply a rotation matrix U to get
y'_{n} = y_{n} U. There is a choice to represent a rotation of
y_{n} as y_{n} U, or as U y_{n}. This depends as to whether
we think of the vectors y_{n} as a {1 x 3} matrix or a {3 x 1} matrix.
Here, we choose {1 x 3} matrix, as it will match better with the Python code
referenced below.

We can then express E as a function of the rotation U:

E = Σ

_{n}|x_{n}- y_{n}U|^{2}

To find the RMSD, we must first, find the rotation U_{min} that
minimizes E.

## Matching sets of vectors

The classic papers of Wolfgang Kabsch showed how to minimize E to get the RMSD. The algorithm described in those papers have since become part of the staple diet of protein analysis. As the original proof for the Kasch equations using 6 different Lagrange multipliers is rather tedious, we'll describe a much simpler proof using standard linear algebra.

First we expand E (which is a function of U)

E = Σ

_{n}( |x_{n}|^{2}+ |y_{n}|^{2}) - 2 Σ_{n}x_{n}y_{n}UE = E

_{0}- 2 Σ_{n}x_{n}y_{n}U

The first part E_{0} is invariant with respect to a rotation U.

The variation resides in the last term L = Σ_{n} x_{n}
y_{n} U. The goal to minimize the RMSD is to find the matrix
U_{min} that gives the largest possible value of L_{max}
from which we get:

RMSD = √ E

_{min}/ N = √ ( (E_{0}- 2 L_{max}) / N )

The trick is to realize that we can promote the set of vectors
x_{n} and y_{n} into the {N x 3} matrices X and Y, and apply
the {3 x 3} rotational matrix U to the vectors of Y. Then we can write L as a trace

L = Tr( X

^{t}Y U )

where X^{t} is a {3 x N} matrix and
the transformed values of Y U is a {N x 3}{3 x 3}={N x 3} matrix.

Juggling the matrices inside the trace, we get:

L = Tr( X

^{t}Y U )trace of a {3 x N}{N x 3}{3 x 3} matrix

= Tr( R U )trace of a {3 x 3}{3 x 3} matrix

We don't know what U is, but we can study the other matrix R =
X^{t} Y, which is the {3 x 3} correlation matrix between X and Y.
By invoking the powerful Singular Value Decomposition theorem, we know
that we can always write R as:

R = V S W

^{t}

and obtain the {3 x 3} matrices V and W (which are orthogonal) and S
(which is positive diagonal, with elements σ_{i}). We substitute
these matrices into the troublesome L term of the square-distance
equation:

L = Tr( V S W

^{t}U ) = Tr( S W^{t}U V )

As S is diagonal, if we group the matrices W^{t} U V together
into the matrix T, we can get a simple expression for the trace in terms
of the σ_{i}:

L = Tr( S T ) = σ

_{1}T_{11}+ σ_{2}T_{22}+ σ_{3}T_{33}

Writing L in this form, we can figure out the maximum value of L. The
secret sauce is in T. If we look at the definition of T, we can see that
*T is a product of only orthogonal matrices*, namely the matrices U,
W^{t} and V. T must also be orthogonal.

Thus, the largest possible value for any element of T is 1, or
T_{ij} ≤ 1. Now L is a product of terms linear in the diagonal
elements of T, thus the maximum value of L occurs when the
T_{ii} = 1,

L

_{max}= Tr( S T_{max}) = σ_{1}+ σ_{2}+ σ_{3}= Tr ( S )

Plugging this into the equation for L_{max} into the equation
for RMSD, we get

RMSD = √ [ ( E

_{0}- 2 ( σ_{1}+ σ_{2}+ σ_{3}) ) / N ]

From this analysis, we can also get the rotation that gives the optimal
superposition of the vectors. When T_{ii} = 1, then T = I, the
identity matrix. This immediately gives us an equation for the optimal
rotation U_{min} in terms of the left and right orthogonal
vectors of the matrix R:

T

_{max}= W^{t}U_{min}V = I

U_{min}= W V^{t}

## How to solve for R

In the above section, we showed that the RMSD can be expressed in terms
of the singular values of R (σ_{i}) but we didn't say how to
find them. Kabsch, who first derived the solution, showed that to solve
for σ_{i}, we should solve the matrix R^{t} R, which is
a nice symmetric real matrix, and much easier to solve. Expanding, we
get a relation to the S matrix of R:

R

^{t}R = ( W^{t}S^{t}V^{t})( V S W ) = W^{t}S^{2}W.

From this expansion, we can see that the singular value term of
R^{t} R is S^{2}. In other words, the singular values of
R^{t} R (λ_{i}) are related to the singular values of R
(σ_{i}) by the relation λ_{i} =
σ_{i}^{2}. And also, the right singular vector W is the
same for both R and R^{t} R.

Since R^{t} R is a symmetric real matrix, the eigenvalues are
identical to the singular values λ_{i}. Thus we can use the
standard Jacobi transform algorithm to find the eigenvalues of
R^{t} R, which are identical to the λ_{i}. We substitute
λ_{i} = σ_{i}^{2} into the RMSD formula:

RMSD = √ [ ( E

_{0}- 2 ( √λ_{1}+ √λ_{2}+ √λ_{3}) ) / N ]

## The curse of chirality

Unfortunately, there's one more ambiguity that we will have to deal with
before we get to the Kabsch equation for the RMSD. The ambiguity arises
in a degeneracy in the resulting U_{min} from the SVD theorem.
U_{min} is guaranteed to be orthogonal but for our RMSD
calculation, we don't just want orthogonal matrices, we want rotation
matrices. Orthogonal matrices can have determinants of +1 or -1. We want
rotational matrix, which have determinants of +1 and not reflected
rotational matrices, which have determinants of -1. A reflection in
U_{min} will scramble the internal arrangement of our set of
coordinates.

To restrict the solutions of U_{min} = W V^{t} to
reflections, we check the determinants of W and V:

ω = det( W ) * det( V

^{t})

If ω = -1 then there is a reflection in U_{min} , and we must
get rid of the reflection. The easiest way to get rid of the reflection
is to reflect the smallest principal axes in U_{min}. We can
then simply reverse the sign of the third (smallest) eignevalue
√λ_{3} of the RMSD, which, finally, gives Kabsch formula for
RMSD in terms of λ_{i} and ω:

RMSD = √ [ ( E

_{0}- 2 ( √λ_{1}+ √λ_{2}+ ω √λ_{3}) ) / N ]

## Implementations of RMSD in code

In C, see rmsd.c, rmsd.h. If using C++, don't forget to include 'extern "C" {}'

Here's a Python version in
rmsd.py
under the `numpy_svd_rmsd_rot`

function. This is part of my Python structural
biology library pdbremix.