Bioinformatics in the Near Concurrent Future

04 Mar 2009 // protein

There is a huge opportunity for a smart young bioinformatician willing to experiment with a new language.

Face it: the future is concurrency and most of the old fogey bioinformaticians of the last generation are not equipped to deal with it. They even have trouble dealing with dual cores on their brand new Macs. Unfortunately, most scientists were trained in imperative languages, from FORTRAN to C++ to Perl. These are terrible languages in which to write concurrent code. Just have a look at some multi-threading libraries in C++.

One day soon, the people who use BLAST or CLUSTALW or any system built upon these programs will find that adding multiple cores to their system will not make these venerable programs run any faster. These programs cannot scale to the number of processors because that would require true concurrency. To exploit true concurrency, these basic bioinformatic programs will need to be rewritten from the ground up.

And they will be. They are too important not to be.

But it would be a huge mistake to rewrite them using concurrent libraries written in C++ or FORTRAN, although I am sure the maintainers of these programs will try. [update: Paulo tells me that this has already been done with up to 20x speedups in some situations. Can we do better? Is this the limit?]

There are in fact languages designed to make concurrency easy. Well, at least easy relative to C++ or FORTRAN. Most of them are kind of niche languages, but a few are not: Erlang, Occam, Haskell and Clojure. Out of these concurrent languages, there is only one serious candidate for bioinformatics and that is Clojure, which is a dialect of Lisp, the most powerful language ever designed.

In many ways, Lisp ought to have been the language designed for bioinformatics. After all, Lisp was originally designed to handle symbolic data in list form. DNA and proteins, the stuff of bioinformatics, are nothing if not symbolic data in list form. But up until recently, if you wanted to use the most powerful language in the world, you would have to settle for one of a bunch of slightly incompatible implementations of Lisp, with questionable portability, and piss-weak libraries.

But not anymore. Clojure is a new dialect of Lisp, designed by Rich Hickey, that cuts through the Gordian knot of impractical Lisp implementations. He did that by implementing Lisp right on top of the Java Virtual Machine (JVM). In one fell swoop he has written a Lisp that a) is completely portable, as the JVM is implemented everywhere, b) has reliable, debugged libraries in the form of JVM libraries and c) is plenty fast enough, as the JVM is a marvellous piece of technology.

Furthermore, Rich Hickey has made his dialect of Lisp practical. It has built-in dynamic arrays and dictionaries. But more importantly, he has made all native Clojure data structures persistent, and added Software Transactional Memory for coordinating changes to shared state. What persistence means is that when you manipulate a data structure in Clojure, you never really change it. Instead, Clojure hands you a new version that shares most of its structure with the old one, so you get a cheap copy with the changes embedded while the original stays untouched. Since no thread can ever see data change underneath it, this means that it is remarkably easy to do concurrent programming in Clojure.
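
To make the idea of a cheap copy concrete, here is a minimal sketch in plain Java, the common ground of the JVM. The class name PersistentList and its methods are hypothetical, made up for illustration, and this is not how Clojure is actually implemented (its vectors and maps use more sophisticated tree structures). It only shows the principle: adding an element builds one new node that points at the old list, so both versions coexist and share their data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal persistent (immutable) list; a sketch, not Clojure's implementation. */
final class PersistentList<T> {
    private static final PersistentList<?> EMPTY = new PersistentList<>(null, null, 0);

    private final T head;
    private final PersistentList<T> tail;
    private final int size;

    private PersistentList(T head, PersistentList<T> tail, int size) {
        this.head = head;
        this.tail = tail;
        this.size = size;
    }

    @SuppressWarnings("unchecked")
    static <T> PersistentList<T> empty() {
        return (PersistentList<T>) EMPTY;
    }

    /** Returns a NEW list with value in front; this list is untouched and becomes the shared tail. */
    PersistentList<T> cons(T value) {
        return new PersistentList<>(value, this, size + 1);
    }

    /** The list without its first element (null when called on the empty list). */
    PersistentList<T> tail() {
        return tail;
    }

    int size() {
        return size;
    }

    /** Copies the elements into an ordinary Java list, front to back. */
    List<T> toList() {
        List<T> out = new ArrayList<>();
        for (PersistentList<T> node = this; node.size > 0; node = node.tail) {
            out.add(node.head);
        }
        return out;
    }

    public static void main(String[] args) {
        PersistentList<Character> acg =
                PersistentList.<Character>empty().cons('G').cons('C').cons('A');
        PersistentList<Character> tacg = acg.cons('T'); // the "cheap copy": one new node

        System.out.println(acg.toList());       // [A, C, G]  (original unchanged)
        System.out.println(tacg.toList());      // [T, A, C, G]
        System.out.println(tacg.tail() == acg); // true: the old list is shared, not copied
    }
}
```

Because nothing is ever modified in place, any number of threads can read acg and tacg at the same time without locks.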

And did I say Clojure interfaces with Java? This means you get strings that understand Unicode, networking libraries and even XML parsers, for free.

So what this means is that a smart young bioinformatician willing to dive into Clojure will be able to build the next generation of BLAST that can run on ten cores, or fifty cores, or even a thousand cores. Imagine performing a map-reduce to align a million sequences from thousands of organisms over a thousand cores. It will be a day sure to take your breath away.
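
For a taste of what that map-reduce looks like on the JVM today, here is a small sketch in plain Java using parallel streams. It is only a stand-in: it counts k-mers rather than aligning anything, it runs on the cores of one machine rather than a cluster, and the names KmerMapReduce, kmers and countKmers are hypothetical. The shape is the point: the map step works on each sequence independently, so it can be spread across cores, and the reduce step merges the partial counts.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

/** Hypothetical sketch: a single-machine map-reduce that counts k-mers across sequences. */
final class KmerMapReduce {

    /** Map step: split one sequence into its overlapping k-mers (none if it is shorter than k). */
    static Stream<String> kmers(String sequence, int k) {
        return IntStream.rangeClosed(0, sequence.length() - k)
                .mapToObj(i -> sequence.substring(i, i + k));
    }

    /** Runs the map step in parallel, then reduces by merging per-k-mer counts. */
    static Map<String, Long> countKmers(List<String> sequences, int k) {
        return sequences.parallelStream()
                .flatMap(sequence -> kmers(sequence, k))
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> sequences = List.of("ACGT", "CGTA", "GT");
        System.out.println(countKmers(sequences, 2)); // {AC=1, CG=2, GT=3, TA=1}
    }
}
```

The sequences are never modified, so no locking is needed in the map step; the stream library handles splitting the work and merging the partial results.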