I don’t think there will be many times in my life that I’ll see a basic bioinformatic algorithm published in a Science magazine article but such is the lament of the theorist. Nevertheless, the recent Löytynoja-Goldman paper published in June 2008, is one of those times. Not only that, but the paper also exposes a fundamental flaw in one of the basic algorithms of the trade.
Löytynoja & Goldman show that all current Multiple Sequence Alignment algorithms – the bread-and-butter algorithm for biologists studying similarities in genes across different species – completely fail to detect independent insertions in the alignment of sequences, and thus, erroneously mismatches regions of evolutionary volatility. In their improved MSA algorithm, PRANK, they use history itself to improve the performance of this basic algorithm.
Let me back up a little. One of the basic tenets in molecular biology is that the random mutations of evolution work directly on protein sequences. As one species evolve into another, the protein sequences embedded in the DNA are mutated, one amino acid at a time. The amino acids can be replaced, or deleted, or inserted. Different related species may possess similar proteins where the amino acid sequences of the proteins are virtually identical. A comparison of the number of similar amino acids in one sequence to another will tell you whether two sequences are likely to fold into similar proteins and consequently, whether two species might be evolutionarily related.
The heart of the sequence similarity comparison is the amino-acid substitution matrix, which tabulates the chemical similarity between the 20 different amino acids. Similar amino acids can be substituted for one another, often without adversely affecting the chemistry of the resultant protein. Thus similarity between two protein sequences is calculated, not only by how many amino acids in a row are exactly the same, but wiggle room is given to amino acids deemed to be similar by the substitution matrix.
However, the matching portions can sometimes be separated by gaps. This is considered an insertion from the point of view of the shorter sequence, or a deletion from the point of view of the longer sequence. Sophisticated gap-detection algorithms are used to optimally match these sequences. This process of optimally aligning two protein sequences is called Sequence Alignment and typically, two sequences are considered similar if 30% of the sequence can be aligned. Such sequences are statistically likely to fold into the same 3-dimensional molecular structure, and are thus, the same protein.
You can thus think of sequence alignment as a string comparison, where the substitution matrix and gap-detection serves as a chemical regular expression.
Multiple Sequence Alignment
Comparing two sequences will tell you whether they are evolutionarily related but when you have a whole family of sequences – the same protein from a bunch of related species – then you can extract serious information from simple pattern matching. You can see which part of the protein sequence is conserved across species, and which part varies. In some cases, you can even identify the exact mutation responsible for the divergence of one species from its siblings.
But to see all the interrelationships between the proteins from different species, you have to align them all properly with respect to each other. You have to make a Multiple Sequence Alignment (MSA). And this is where the trouble lies.
In MSA algorithms, a single master sequence is usually constructed for an family of sequences. Let’s say we have four sequences:
AT cat, AGT lion, AT mouse, ACT platypus
As it is expensive to maintain gaps in a MSA for technical reasons, the G & C are aligned over a gap in the master sequence:
A-T cat AGT lion A-T mouse ACT platypus
A-T is the master sequence that best represents the simultaneous alignment of the sequences. Since the platypus was the most divergent to the rest, you might conclude that the ancestral sequence was ACT, where the AT sequences for mouse and cat was due to one independent deletions, and the G in the lion was a mutation.
But here’s the rub. Löytynoja & Goldman pointed out that an equally plausible evolutionary path might be that the original sequence was AT, where the platypus inserted a C, and later, the lion separately inserted a G in the same place in the place. But if they were independent mutations then they have absolutely no evolutionary relationship to each other and so you should never align the two insertions with each other. Then the alignment should be:
A--T cat AG-T lion A--T mouse A-CT platypus
But it’s impossible to distinguish these two scenarios without any more information. So instead all existing MSA algorithms assumes the first scenario, and thereby ignores independent insertion events, which would erroneously align independent insertions together, leading to a wrong interpretation of which parts of the sequence are conserved.
MSA also misses independent deletions
A similar problem occurs for independent deletions in the same place. For instance let’s say we have these sequences:
AGT cat ACGT lion ACGT mouse ACT platypus
Due to the cost of maintain gaps in a sequence alignment, it is more important to line up gaps with each other, so a typical MSA would give:
A-GT cat ACGT lion ACGT mouse A-CT platypus
This aligns the G in cat to the C in platypus. However, it is possible that the ancestor sequence was ACGT, where the cat and platypus undergo deletions at different points in the ancestral sequence to give an alternative alignment:
A-GT cat ACGT lion ACGT mouse AC-T platypus
Once again, these ambiguities can’t be resolved without extra information.
Phylogenic information saves the day.
But Löytynoja & Goldman found out that you can in fact get this extra information quite easily. All you need is to assign a evolutionary relatedness to each sequence. As most of the time, we know what species a protein sequence comes from, we can just extract the evolutionary relatedness from the known evolutionary species tree (or we can use the pair-wise sequence alignment similarity as a proxy for relatedness).
To make full use of the phylogenetic information, we ought to consider closely related species first. By grouping the normal mammals before considering the platypus, we get:
A-T cat AGT lion A-T mouse
Within this group, it is clear that the G in lion is an insertion and that the ancestral sequence of the normal mammals is A-T.
Now we can compare the ACT in platypus to the ancestral sequence of the normal mammals A-T. It is now clear that the C in the platypus is an independent insertion, and so we create a new space in the master alignment for the C in the platypus:
A--T cat A-GT lion A--T mouse AC-T platypus
Thus we conclude that the ancestral sequence is A—T, where two independent insertions occurred in the evolution of the species as the most likely scenario.
In the case of the possibly independent deletions, we have
AGT cat ACGT lion ACGT mouse ACT platypus
Grouping the normal mammals:
A-GT cat ACGT lion ACGT mouse
We can see that cat has a deletion from the ancestral sequence ACGT. Consequently, when we look at platypus ACT, we compare it to the ancestral sequence of the normal mammals ACGT (and not to a consensus sequence containing the deletion from the cat), which gives a different alignment:
A-GT cat ACGT lion ACGT mouse AC-T platypus
Evolutionary history of HIV improves alignment of the GP120 protein
A correct MSA is crucial to understand proteins that have mutated a lot such as the GP120 protein of the HIV virus that grapples the immune cells in our body. The HIV virus evolves very quickly and this hyper-accelerated evolution allows HIV to resist many potential drugs.
Now that we have sequenced a number of different strains of HIV, we can use MSA algorithms to identify conserved patterns in the GP120 protein across the different strains of HIV. Using conventional MSA algorithms, this is what we get for part of the sequence, a so-called evolutionary hot-spot:
It appears that there is a volatile region where pieces of different lengths are randomly inserted and deleted. There is no clear pattern to be discerned from this alignment.
However, if we use Löytynoja & Goldman’s PRANK to align the sequences, then, by feeding in the evolutionary tree (left) of the different strains of HIV (left), we get a fundamentally different alignment:
Using the improved PRANK MSA, we can now clearly distinguish insertions and deletions appropriate to each branching in the evolutionary tree. Now if we know any unique biological characteristics of a particular strain of HIV, we might be able to identify the exact insertion or deletion in the GP120 protein that might have caused it. History, as always, adds clarity to analysis.
Good write up. I think I may have even learnt something! :)
Sequence alignment gets bandied about at my lab a bit as it is a practical application domain for a flavour of machine learning algorithm called conditional random fields that people here are working on.
A quick hunt around for CRFs and proteins reveals a number of recent papers on the topic. The beauty of CRFs is that they make it relatively easy to model all sorts of probabilistic dependencies and “background” information such as phylogenetics (e.g., the first four references of this recent paper on gene prediction).
This and other types of Bayesian modelling can do some pretty amazing things. For example, you don’t need to specify in advance what the phylogenetics of a set of animals is provided you model, in general, the idea that “similar animals have similar sequences”. Given enough data, there are techniques that will reconstruct the most likely tree that describes the sequence data. Provided the mechanics of the Bayesian model is qualitatively correct the inferred tree can be very close to the real one – plus you get accurate sequence prediction to boot.
Josh Tenenbaum gave a nice talk at last year’s NIPS about the work he’s doing in psychology using Bayesian modelling. He described one experiment where subjects were asked questions of the form “monkeys, horses and cows have the gene X34; Does an elephant have X34?” (Terms like X34 were chosen to suggest something biological but people weren’t expected to know anything about it specifically). Yes/no answers provided information about the perceived similarity of the animals.
Using Bayesian inference to cluster via these similarities led to a phylogenetic tree very close in structure to the real one suggesting that traces of genetic history are perceivable and we model it cognitively.
Thanks for laying this out so clearly. I had heard about this paper but hadn’t taken the time to look it over myself.