Bioinformatics in the Near Concurrent Future
There is a huge opportunity for a smart young bioinformatician willing to experiment with a new language.
Face it: the future is concurrency and most of the old fogey bioinformaticians of the last generation are not equipped to deal with it. They even have trouble dealing with dual cores on their brand new Macs. Unfortunately, most scientists were trained in imperative languages of the FORTRAN and ALGOL lineage, from FORTRAN to C++ to Perl. These are terrible languages in which to write concurrent code. Just have a look at some multi-threading libraries in C++.
One day soon, the people who use BLAST or CLUSTALW or any system built upon these programs will find that adding multiple cores to their system will not make these venerable programs run any faster. These programs cannot scale to the number of processors because that would require true concurrency. To exploit true concurrency, these basic bioinformatic programs will need to be rewritten from the ground up.
And they will be. They are too important not to be.
But it would be a huge mistake to rewrite them using concurrent libraries written in C++ or FORTRAN, although I am sure the maintainers of these programs will try to. [update: Paulo tells me that this has already been done, with speedups of up to 20x in some situations. Can we do better? Is this the limit?]
There are, in fact, languages designed to make concurrency easy. Well, at least easy relative to C++ or FORTRAN. Most of them are kind of niche languages, but a few are not: Erlang, Occam, Haskell and Clojure. Out of these concurrent languages, there is only one serious candidate for bioinformatics and that is Clojure, which is a dialect of Lisp, the most powerful language ever designed.
In many ways, Lisp ought to have been the language designed for bioinformatics. After all, Lisp was originally designed to handle symbolic data in list form. DNA and proteins, the stuff of bioinformatics, are nothing if not symbolic data in list form. But up until recently, if you wanted to use the most powerful language in the world, you would have to settle for one of a bunch of slightly incompatible implementations of Lisp, with questionable portability and piss-weak libraries.
But not anymore. Clojure is a new dialect of Lisp, designed by Rich Hickey, that cuts through the Gordian knot of impractical Lisp implementations. He did that by implementing Lisp right on top of the Java Virtual Machine (JVM). In one fell swoop he has written a Lisp that a) is completely portable, as the JVM is implemented everywhere, b) has reliable, debugged libraries in the form of JVM libraries and c) is plenty fast enough, as the JVM is a marvellous piece of technology.
Furthermore, Rich Hickey has made his dialect of Lisp practical. It has built-in dynamic arrays and dictionaries. But more importantly, he has made all native Clojure data structures persistent, and he has added Software Transactional Memory for coordinating changes to shared state. Persistent means that when you manipulate a data structure in Clojure, you never really change it. Instead, Clojure hands you a new version that shares most of its structure with the old one, so you get a cheap copy with the changes embedded. Since nothing ever changes underneath a thread that is reading the data, threads can share it without locks. Effectively, this means that it is remarkably easy to do concurrent programming in Clojure.
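To make the "cheap copy" idea concrete, here is a minimal sketch in Python rather than Clojure, just because it is easy to run. It is only an illustration of structural sharing using an immutable linked list; Clojure's real vectors and maps use wide trees rather than a linked list, and the names `Node`, `from_string` and `to_string` are made up for this example.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """One immutable cell of a linked list of bases (hypothetical helper)."""
    base: str
    rest: Optional["Node"]


def from_string(seq: str) -> Optional[Node]:
    """Build an immutable linked list from a sequence string."""
    node = None
    for base in reversed(seq):
        node = Node(base, node)
    return node


def to_string(node: Optional[Node]) -> str:
    """Walk the list and join the bases back into a string."""
    out = []
    while node is not None:
        out.append(node.base)
        node = node.rest
    return "".join(out)


original = from_string("GATTACA")

# "Change" the first base: build one new cell that points at the old tail.
# Nothing in `original` is modified, and the tail is shared, not copied.
variant = Node("T", original.rest)

print(to_string(original))            # GATTACA
print(to_string(variant))             # TATTACA
print(variant.rest is original.rest)  # True
```

The new version costs one cell, not a full copy, and anyone still holding `original` sees exactly what they saw before.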
And did I say Clojure interfaces with Java? This means you get strings that understand Unicode, internet libraries and even XML parsers, for free.
So what this means is that a smart young bioinformatician willing to dive into Clojure will be able to build the next generation of BLAST that can run on ten cores, or fifty cores, or even a thousand cores. Imagine performing a map-reduce to align a million sequences from thousands of organisms over a thousand cores. It will be a day sure to take your breath away.
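For anyone who hasn't met map-reduce, here is a toy sketch of the shape of the idea, again in Python for convenience. It is not BLAST and not a real alignment: `score` just counts matching positions, and `score`, `best_hit` and the `workers` parameter are all names I made up for illustration. It also assumes the list of targets is not empty.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def score(query: str, target: str) -> int:
    """Toy similarity: number of positions where the two sequences agree."""
    return sum(a == b for a, b in zip(query, target))


def best_hit(query: str, targets: list, workers: int = 4) -> tuple:
    """Score every target independently (map), then keep the best (reduce)."""
    # Map step: each target is scored on its own, so the work can be
    # handed out to a pool of workers in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(pool.map(lambda t: (score(query, t), t), targets))
    # Reduce step: combine the (score, target) pairs by keeping the larger.
    return reduce(max, scored)


targets = ["GATTTCA", "CATTACA", "GGGGGGG", "GATTACA"]
print(best_hit("GATTACA", targets))  # (7, 'GATTACA')
```

The point is that the map step touches no shared mutable state, which is what lets it spread over as many workers as you have. In CPython a thread pool like this one will not actually run the scoring on several cores at once; you would need a process pool for that, which is rather the kind of friction this post is complaining about.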
I heard Clojure mentioned before, but this write up is rather convincing. Thanx for this rather interesting write up!
Now, my immediate thought is, what data types is Clojure good at? Sequences, clearly… others too, graphs maybe? Any pointers on that too?
Very interesting.
Now I need to look at Clojure. :-)
Thanks !
@Egon, Here’s a great piece of clojure code that was going round the inter-tubes using a genetic-algorithm: link
I agree with you and at the same time disagree. There is need for a shift in development, vision and programming languages. And it would be difficult to rewrite these applications from the ground up. But …
… the two examples you gave are off the mark. Blast (as a Wu-Blast implementation, unfortunately not available anymore) is highly scalable and it can run in a 128 multi-threaded environment with speedups of up to 20x. ClustalW also has an MPI implementation that shows speedups of more than 32x depending on the number of concurrent processes.
And by mentioning Clustal, I remember that we also need to shift from the applications “we” use. Mafft is a much better alignment package with 5-6x speedups on Clustal’s speed using only one core.
It’s always good to know these languages, but until you have a large and successful application written in one of these new readily-concurrent languages, it will be difficult to make a shift.
@Paulo, hey thanks for the info! I’m not so hot on the hot bioinformatics programs. I only use regular BLAST and CLUSTAL, like occasionally. As for the last point, it’s precisely why I wrote the post. It’s like a chicken-and-egg argument. Until someone writes a large concurrent application in Haskell or Clojure, no one is going to take the route seriously. However, no one is going to write that application unless the route is already taken seriously. Except, of course, if some foolish young bioinformatician plunges in and does it. If it works, so goes the glory! And hence my blog, which hopefully prods some youngun with time on her hands to do it.
It’s a catch-22. I remember seeing a Haskell application on Bioinformatics, but one needed to “compile” it in order to use it.
I just got an Erlang book and Real World Haskell is free (http://www.realworldhaskell.org/blog/), so I guess it’s in our hands!!
@Paulo, how much time have you got on your hands? Unfortunately, I am going to be very busy this year. Me, I’d love to see a Clojure v. Haskell race for bioinformatics.
Not much spare time, unfortunately. But we can try influencing others … I wish.
I have doubts that lisp will be the language that solves it all …
“the most powerful language ever designed”, LOL !???
I never understand why people get so set in their ways when it comes to language. Multiprocessing and multithreading code in bioinformatics has been around for a while. Although a nice plug for a language that I’ve never heard of, you might want to see what algorithms and programs are available for multiprocessor/cluster/grid computing.
@Andreas, ummm… I do cluster computing all the time. I run parallelizable molecular dynamics simulations every day. I speak from the pain of trying to get multiprocessor stuff to work. If I was set in my ways, I wouldn’t even consider new and better ways of doing what I do.
How about Scala? Haven’t had any experience with it, but I’ve heard that it rocks with concurrency (and also is built on top of the JVM).
Most of the “concurrency” problems I’ve faced in my career have been database related – such as trying to annotate a million genomic positions across just a few dozen annotation tables, or more recently, memory related, as myself and many others have experienced with Velvet assemblies.
The processor stuff sounds sexy but it hasn’t really been the main bottleneck for me yet. I don’t see BLAST or BLAT holding up people so much. That sounds like an imagined problem. Maybe Vmatch could benefit from being rewritten, but even that seems more memory limited.
