RMSD: Root Mean Square Deviation

01 Jul 2005 // protein

Updated 2014-05-22: get my PDB RMSD tool pdbrmsd in the pdbremix package.
I had previously mixed up some matrix dimensions; thanks to readers CY L & toto for the corrections.

If it's your day job to push proteins in silico then you will one day have to calculate the RMSD of a protein. You've just simulated protein GinormousA for a whole microsecond but you don't even know if GinormousA is stable. You could load up a protein viewer and eyeball the stability of the protein throughout the simulation. But eyeballing is just not going to cut it with your boss. You need to slap a number on it. This is where RMSD comes in.

A protein conformation is basically a set of 3-dimensional coordinates. This means that a conformation is a set of N vectors xₙ, where each xₙ has three components. As the protein evolves along the trajectory, the conformation changes, resulting in a new conformation and a new set of vectors yₙ. We capture the difference between conformations through the two sets of vectors xₙ and yₙ. To make life easier, we shift the centres-of-mass of xₙ and yₙ to the origin of the coordinates (we can always shift them back afterwards). Then the difference can be captured by the square distance:

E = Σₙ |xₙ - yₙ|²

The root mean-square deviation (RMSD) is then

RMSD = √( E / N )
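In numpy, this plain un-rotated RMSD takes only a few lines. Here's a minimal sketch (the function name naive_rmsd is my own, made up for illustration):

    import numpy as np

    def naive_rmsd(x, y):
        # x, y: {N x 3} arrays whose rows are the vectors xn and yn
        # shift both centres-of-mass to the origin
        x = x - x.mean(axis=0)
        y = y - y.mean(axis=0)
        # E = sum over n of |xn - yn|^2
        e = ((x - y) ** 2).sum()
        return np.sqrt(e / len(x))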

But hold on, you say, the mean-square measure doesn't measure the similarity very well. Just a little rotation of the set of yₙ, which doesn't change the internal arrangement of the yₙ, would distort the RMSD. What we really need is to find the best rotation of yₙ with respect to xₙ before taking the RMSD.

To rotate the vectors yₙ, we apply a rotation matrix U to get y'ₙ = yₙ U. There is a choice to represent the rotation of yₙ as yₙ U or as U yₙ, depending on whether we think of each yₙ as a {1 x 3} matrix or a {3 x 1} matrix. Here, we choose the {1 x 3} convention, as it matches the Python code referenced below.
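As a quick illustration of this row-vector convention, here's a made-up 90° rotation about the z-axis applied to a single vector:

    import numpy as np

    # a {3 x 3} rotation matrix U: a 90-degree rotation about the z-axis
    u = np.array([[ 0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [ 0.0, 0.0, 1.0]])
    y = np.array([1.0, 0.0, 0.0])  # one vector yn, treated as a row
    y_rot = y @ u                  # y'n = yn U, giving [0, 1, 0]
    # for a whole {N x 3} conformation Y, the same expression Y @ u rotates every row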

We can then express E as a function of the rotation U:

E = Σₙ |xₙ - yₙ U|²

To find the RMSD, we must first find the rotation Umin that minimizes E.

Matching sets of vectors

The classic papers of Wolfgang Kabsch showed how to minimize E to get the RMSD. The algorithm described in those papers has since become part of the staple diet of protein analysis. As the original proof of the Kabsch equations, which uses six different Lagrange multipliers, is rather tedious, we'll describe a much simpler proof using standard linear algebra.

First we expand E (which is a function of U), using the fact that a rotation preserves lengths, so that |yₙ U|² = |yₙ|²:

E = Σₙ ( |xₙ|² + |yₙ|² ) - 2 Σₙ xₙ · (yₙ U)

E = E₀ - 2 Σₙ xₙ · (yₙ U)

The first part E₀ is invariant with respect to the rotation U.

The variation resides in the last term, L = Σₙ xₙ · (yₙ U). To minimize the RMSD, we must find the matrix Umin that gives the largest possible value Lmax of this term, from which we get:

RMSD = √( Emin / N ) = √( ( E₀ - 2 Lmax ) / N )

The trick is to realize that we can promote the sets of vectors xₙ and yₙ into the {N x 3} matrices X and Y, and apply the {3 x 3} rotation matrix U to the rows of Y. Then we can write L as a trace:

L = Tr( Xt Y U )

where Xt is a {3 x N} matrix and the rotated matrix Y U is a {N x 3}{3 x 3} = {N x 3} matrix.

Juggling the matrices inside the trace, we get:

L = Tr( Xt Y U )   trace of a {3 x N}{N x 3}{3 x 3} matrix
   = Tr( R U )   trace of a {3 x 3}{3 x 3} matrix
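This trace identity is easy to sanity-check numerically. A small sketch with made-up random coordinates:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((10, 3))   # X, an {N x 3} matrix
    y = rng.standard_normal((10, 3))   # Y, an {N x 3} matrix
    u = np.eye(3)                      # any {3 x 3} rotation; identity will do

    l_direct = sum(xn @ (yn @ u) for xn, yn in zip(x, y))  # sum of xn . (yn U)
    r = x.T @ y                        # R = Xt Y, the {3 x 3} correlation matrix
    l_trace = np.trace(r @ u)          # L = Tr( R U )
    assert np.isclose(l_direct, l_trace)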

We don't know what U is, but we can study the other matrix R = Xt Y, which is the {3 x 3} correlation matrix between X and Y. By invoking the powerful Singular Value Decomposition theorem, we know that we can always write R as:

R = V S Wt

and obtain the {3 x 3} matrices V and W (which are orthogonal) and S (which is diagonal with non-negative elements σᵢ). We substitute these matrices into the troublesome L term of the square-distance equation:

L = Tr( V S Wt U ) = Tr( S Wt U V )   (using the cyclic property of the trace)

As S is diagonal, if we group the matrices Wt U V together into the matrix T = Wt U V, we get a simple expression for the trace in terms of the σᵢ:

L = Tr( S T ) = σ₁ T₁₁ + σ₂ T₂₂ + σ₃ T₃₃

Writing L in this form, we can figure out its maximum value. The secret sauce is in T. From the definition T = Wt U V, we can see that T is a product of orthogonal matrices only, namely U, Wt and V, so T must itself be orthogonal.

Since the rows of an orthogonal matrix are unit vectors, the largest possible value for any element of T is 1, that is, Tᵢⱼ ≤ 1. Now L is a sum of terms linear in the diagonal elements of T, so the maximum of L occurs when every Tᵢᵢ = 1:

Lmax = Tr( S Tmax ) = σ₁ + σ₂ + σ₃ = Tr( S )
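Conveniently, numpy's SVD returns the factors in exactly this order, V, the σᵢ and Wt, so Lmax is one call away. A sketch with made-up coordinates:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((10, 3))
    y = rng.standard_normal((10, 3))
    r = x.T @ y                      # R = Xt Y
    v, sigma, wt = np.linalg.svd(r)  # R = V S Wt; sigma sorted largest first
    l_max = sigma.sum()              # Lmax = sigma1 + sigma2 + sigma3 = Tr( S )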

Plugging this expression for Lmax into the equation for the RMSD, we get

RMSD = √[ ( E₀ - 2 ( σ₁ + σ₂ + σ₃ ) ) / N ]

From this analysis, we can also get the rotation that gives the optimal superposition of the vectors. When every Tᵢᵢ = 1, then T = I, the identity matrix. This immediately gives us an equation for the optimal rotation Umin in terms of the left and right singular vectors of the matrix R:

Tmax = Wt Umin V = I
Umin = W Vt
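A minimal numpy sketch of this optimal rotation (the function name optimal_rotation is made up here, and the reflection problem discussed in the next section is ignored for now):

    import numpy as np

    def optimal_rotation(x, y):
        # x, y: centred {N x 3} coordinate arrays
        v, sigma, wt = np.linalg.svd(x.T @ y)  # R = V S Wt
        return wt.T @ v.T                      # Umin = W Vt

    # usage: y @ optimal_rotation(x, y) superposes the rows of y onto x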

How to solve for R

In the above section, we showed that the RMSD can be expressed in terms of the singular values σᵢ of R, but we didn't say how to find them. Kabsch, who first derived the solution, showed that instead of attacking R directly, we can work with the matrix Rt R, which is a nice symmetric real matrix and much easier to diagonalize. Expanding it, we get a relation to the S matrix of R:

Rt R = ( W St Vt )( V S Wt ) = W S² Wt.

From this expansion, we can see that the singular-value term of Rt R is S². In other words, the eigenvalues λᵢ of Rt R are related to the singular values σᵢ of R by λᵢ = σᵢ². Also, the matrix W of right singular vectors of R doubles as the matrix of eigenvectors of Rt R.

Since Rt R is a real symmetric matrix (and positive semi-definite, so its eigenvalues are non-negative), we can use the standard Jacobi eigenvalue algorithm to find the λᵢ. Substituting σᵢ = √λᵢ into the RMSD formula gives:

RMSD = √[ ( E₀ - 2 ( √λ₁ + √λ₂ + √λ₃ ) ) / N ]
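Kabsch worked with the Jacobi algorithm in the days before canned SVD routines; in numpy, the same eigenvalue route might look like the sketch below (the function name eigenvalue_rmsd is made up, and the reflection correction derived in the next section is still missing):

    import numpy as np

    def eigenvalue_rmsd(x, y):
        # x, y: {N x 3} coordinate arrays
        x = x - x.mean(axis=0)             # centre both conformations
        y = y - y.mean(axis=0)
        e0 = (x * x).sum() + (y * y).sum() # E0 = sum of |xn|^2 + |yn|^2
        r = x.T @ y                        # R = Xt Y
        lam = np.linalg.eigvalsh(r.T @ r)  # eigenvalues of Rt R, ascending
        lam = np.clip(lam, 0.0, None)      # guard against round-off below zero
        return np.sqrt((e0 - 2.0 * np.sqrt(lam).sum()) / len(x))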

The curse of chirality

Unfortunately, there's one more ambiguity to deal with before we get to the Kabsch equation for the RMSD. It arises from a degeneracy in the Umin obtained from the SVD theorem. Umin is guaranteed to be orthogonal, but for our RMSD calculation we don't just want orthogonal matrices, we want rotation matrices. Orthogonal matrices can have determinant +1 or -1. We want proper rotation matrices, which have determinant +1, and not reflected rotation matrices, which have determinant -1. A reflection in Umin would scramble the internal arrangement of our set of coordinates, flipping, for instance, the chirality of every amino acid.

To check whether the solution Umin = W Vt contains a reflection, we look at the determinants of W and V:

ω = det( W ) * det( Vt )

If ω = -1, then there is a reflection in Umin, and we must get rid of it. The easiest way is to reflect the smallest principal axis of Umin, which amounts to reversing the sign of the smallest singular value's contribution, √λ₃, in the RMSD. This, finally, gives the Kabsch formula for the RMSD in terms of the λᵢ and ω:

RMSD = √[ ( E₀ - 2 ( √λ₁ + √λ₂ + ω √λ₃ ) ) / N ]
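Putting the pieces together, here's a condensed numpy sketch of the full Kabsch RMSD (an illustrative version with a made-up name; the maintained implementation is numpy_svd_rmsd_rot in pdbremix, linked below):

    import numpy as np

    def kabsch_rmsd(x, y):
        # x, y: {N x 3} coordinate arrays whose rows are the vectors xn and yn
        x = x - x.mean(axis=0)                 # centre both conformations
        y = y - y.mean(axis=0)
        e0 = (x * x).sum() + (y * y).sum()     # E0 = sum of |xn|^2 + |yn|^2
        v, sigma, wt = np.linalg.svd(x.T @ y)  # R = V S Wt; sigma largest first
        # omega = det(W) det(Vt); -1 signals a reflection hiding in Umin
        omega = np.sign(np.linalg.det(v) * np.linalg.det(wt))
        # flip the sign of the smallest singular value to remove the reflection
        l_max = sigma[0] + sigma[1] + omega * sigma[2]
        e_min = max(e0 - 2.0 * l_max, 0.0)     # clamp round-off below zero
        return np.sqrt(e_min / len(x))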

Implementations of RMSD in code

In C, see rmsd.c and rmsd.h. If you're using C++, don't forget to wrap the headers in 'extern "C" { }'.

Here's a Python version in rmsd.py under the numpy_svd_rmsd_rot function. This is part of my Python structural biology library pdbremix.