How History affects Pattern Matching inside the Genome

I don't think there will be many times in my life that I'll see a basic bioinformatic algorithm published in a Science magazine article but such is the lament of the theorist. Nevertheless, the recent Löytynoja-Goldman paper published in June 2008, is one of those times. Not only that, but the paper also exposes a fundamental flaw in one of the basic algorithms of the trade.

Löytynoja & Goldman show that all current Multiple Sequence Alignment algorithms – the bread-and-butter algorithm for biologists studying similarities in genes across different species – completely fail to detect independent insertions in the alignment of sequences, and thus, erroneously mismatches regions of evolutionary volatility. In their improved MSA algorithm, PRANK, they use history itself to improve the performance of this basic algorithm.

Sequence Alignment

Let me back up a little. One of the basic tenets in molecular biology is that the random mutations of evolution work directly on protein sequences. As one species evolve into another, the protein sequences embedded in the DNA are mutated, one amino acid at a time. The amino acids can be replaced, or deleted, or inserted. Different related species may possess similar proteins where the amino acid sequences of the proteins are virtually identical. A comparison of the number of similar amino acids in one sequence to another will tell you whether two sequences are likely to fold into similar proteins and consequently, whether two species might be evolutionarily related.

The heart of the sequence similarity comparison is the amino-acid substitution matrix, which tabulates the chemical similarity between the 20 different amino acids. Similar amino acids can be substituted for one another, often without adversely affecting the chemistry of the resultant protein. Thus similarity between two protein sequences is calculated, not only by how many amino acids in a row are exactly the same, but wiggle room is given to amino acids deemed to be similar by the substitution matrix.

However, the matching portions can sometimes be separated by gaps. This is considered an insertion from the point of view of the shorter sequence, or a deletion from the point of view of the longer sequence. Sophisticated gap-detection algorithms are used to optimally match these sequences. This process of optimally aligning two protein sequences is called Sequence Alignment and typically, two sequences are considered similar if 30% of the sequence can be aligned. Such sequences are statistically likely to fold into the same 3-dimensional molecular structure, and are thus, the same protein.

You can thus think of sequence alignment as a string comparison, where the substitution matrix and gap-detection serves as a chemical regular expression.

Multiple Sequence Alignment

Comparing two sequences will tell you whether they are evolutionarily related but when you have a whole family of sequences – the same protein from a bunch of related species – then you can extract serious information from simple pattern matching. You can see which part of the protein sequence is conserved across species, and which part varies. In some cases, you can even identify the exact mutation responsible for the divergence of one species from its siblings.

But to see all the interrelationships between the proteins from different species, you have to align them all properly with respect to each other. You have to make a Multiple Sequence Alignment (MSA). And this is where the trouble lies.

In MSA algorithms, a single master sequence is usually constructed for an family of sequences. Let's say we have four sequences:

AT cat, AGT lion, AT mouse, ACT platypus

As it is expensive to maintain gaps in a MSA for technical reasons, the G & C are aligned over a gap in the master sequence:

A-T   cat
AGT   lion
A-T   mouse
ACT   platypus

A-T is the master sequence that best represents the simultaneous alignment of the sequences. Since the platypus was the most divergent to the rest, you might conclude that the ancestral sequence was ACT, where the AT sequences for mouse and cat was due to one independent deletions, and the G in the lion was a mutation.

But here's the rub. Löytynoja & Goldman pointed out that an equally plausible evolutionary path might be that the original sequence was AT, where the platypus inserted a C, and later, the lion separately inserted a G in the same place in the place. But if they were independent mutations then they have absolutely no evolutionary relationship to each other and so you should never align the two insertions with each other. Then the alignment should be:

A--T   cat
AG-T   lion
A--T   mouse
A-CT   platypus

But it's impossible to distinguish these two scenarios without any more information. So instead all existing MSA algorithms assumes the first scenario, and thereby ignores independent insertion events, which would erroneously align independent insertions together, leading to a wrong interpretation of which parts of the sequence are conserved.

MSA also misses independent deletions

A similar problem occurs for independent deletions in the same place. For instance let's say we have these sequences:

AGT    cat
ACGT   lion
ACGT   mouse
ACT    platypus

Due to the cost of maintain gaps in a sequence alignment, it is more important to line up gaps with each other, so a typical MSA would give:

A-GT   cat
ACGT   lion
ACGT   mouse
A-CT   platypus

This aligns the G in cat to the C in platypus. However, it is possible that the ancestor sequence was ACGT, where the cat and platypus undergo deletions at different points in the ancestral sequence to give an alternative alignment:

A-GT   cat
ACGT   lion
ACGT   mouse
AC-T   platypus

Once again, these ambiguities can't be resolved without extra information.

Phylogenic information saves the day.

But Löytynoja & Goldman found out that you can in fact get this extra information quite easily. All you need is to assign a evolutionary relatedness to each sequence. As most of the time, we know what species a protein sequence comes from, we can just extract the evolutionary relatedness from the known evolutionary species tree (or we can use the pair-wise sequence alignment similarity as a proxy for relatedness).

To make full use of the phylogenetic information, we ought to consider closely related species first. By grouping the normal mammals before considering the platypus, we get:

A-T    cat 
AGT    lion
A-T    mouse

Within this group, it is clear that the G in lion is an insertion and that the ancestral sequence of the normal mammals is A-T.

Now we can compare the ACT in platypus to the ancestral sequence of the normal mammals A-T. It is now clear that the C in the platypus is an independent insertion, and so we create a new space in the master alignment for the C in the platypus:

A--T   cat
A-GT   lion
A--T   mouse
AC-T   platypus

Thus we conclude that the ancestral sequence is A—T, where two independent insertions occurred in the evolution of the species as the most likely scenario.

In the case of the possibly independent deletions, we have

AGT    cat
ACGT   lion
ACGT   mouse
ACT    platypus

Grouping the normal mammals:

A-GT   cat
ACGT   lion
ACGT   mouse

We can see that cat has a deletion from the ancestral sequence ACGT. Consequently, when we look at platypus ACT, we compare it to the ancestral sequence of the normal mammals ACGT (and not to a consensus sequence containing the deletion from the cat), which gives a different alignment:

A-GT   cat
ACGT   lion
ACGT   mouse
AC-T   platypus

Evolutionary history of HIV improves alignment of the GP120 protein

A correct MSA is crucial to understand proteins that have mutated a lot such as the GP120 protein of the HIV virus that grapples the immune cells in our body. The HIV virus evolves very quickly and this hyper-accelerated evolution allows HIV to resist many potential drugs.

Now that we have sequenced a number of different strains of HIV, we can use MSA algorithms to identify conserved patterns in the GP120 protein across the different strains of HIV. Using conventional MSA algorithms, this is what we get for part of the sequence, a so-called evolutionary hot-spot:

It appears that there is a volatile region where pieces of different lengths are randomly inserted and deleted. There is no clear pattern to be discerned from this alignment.

However, if we use Löytynoja & Goldman's PRANK to align the sequences, then, by feeding in the evolutionary tree (left) of the different strains of HIV (left), we get a fundamentally different alignment:

Using the improved PRANK MSA, we can now clearly distinguish insertions and deletions appropriate to each branching in the evolutionary tree. Now if we know any unique biological characteristics of a particular strain of HIV, we might be able to identify the exact insertion or deletion in the GP120 protein that might have caused it. History, as always, adds clarity to analysis.

comments powered by Disqus