Our Parallel future: an introduction to functional programming
For those of you scientific programmers who keep your ears to the wind, you must have heard, from the b-boys at reddit.com to the fiery old-testament missives of Paul Graham, that Lisp is back. Whilst Lisp may never be cool again in the way of the flared jeans, functional programming (of which lisp is the exemplar) is going to get very very big in the next few years.
The fact is, every computational biologist I know has been buying linux clusters left, right and center. And we've been trying to shoehorn our favorite programs into running on these clusters. Problem is, it's very hard to force a procedural program to spawn efficient parallel processes on the nodes of the cluster. Making programs work efficiently in parallel will be the next big bottleneck of programming. Unfortunately, most of the programs we scientific programmers write are the very model of a procedural algorithm. And we can only make these procedural programs go parallel by using drastic software surgery, which is ugly, messy, and wasteful that multi-processor goodness.
Although we are living in the dying throes of the age of Procedural Programming, the next age is already before us: the Age of Functional Programming. Functional Programming is the only sensible way to write programs that can go parallel. Just ask Google. Their MapReduce algorithm is perhaps the most spectacular example of functional programming in the open source community.
If you don't get with functional programming, you're gonna get left behind. So you'd better dip your toes in the water, now. You might even consider trying some functional programming techniques in Python, being the language-paradigm whore that it is.
Here's a common problem to which I can apply some functional programming techniques. I want to sort the lines in a text file by the first column, which is a column of integers, so that I can just look at it:
def sorted_lines(lines, key_func): pairs = [(key_func(l), l) for l in lines] return [l for (v, l) in sorted(pairs)]
def sort_text_file(fname, key_func): lines = open(fname, 'r').read().splitlines() new_lines = sorted_lines(lines, key_func) return '\n'.join(new_lines)
def first_word_int(line): words = line.split() return int(words) # conversion happens here!!
new_text = sort_text_file('my_data_file', first_word_int) open(fname, 'w').write(new_text)
There are three basic elements of functional programming demonstrated here.
In the sorted_lines function, a function key_func is taken as a parameter. This way, bits of code can be passed around as data, and triggered when needed. If we follow the main program loop at the bottom, we can see how the function first_word_int is first passed into sort_text_file, which in turn passes the function into sorted_lines before being used. Using functions as data, you can write a generic piece of code (sorted_lines), that lets you fill in the details later, with whatever code you want.
In the sorted_lines function, a new list is generated and returned. The parameters are not changed. One of the important aspects of functional programming is that you should never use procedures that change the values of an object. Instead a new list of lines is generated from scratch that is returned. By not allowing procedures that change the values of a parameter, you will never have to worry about which procedure changes which variable. You have eliminated side effects. Functions that avoid side effects are crucial in parallel programming as unintended side-effects are one of the biggest cause of asynchronous locking between parallel processes.
In the sorted_lines function, you will notice that keys were generated from a line using the key_func function. But look, we know nothing about the type of return value of key_func. We can happily roll along to the next few lines, where we apply the sorted() function to the new list. Not caring about the type (as long as we apply a sorted() to it) gives us tremendous flexibility in terms of what kind of functions we can substitute in with the key_func parameter.
I can now substitute any kind of function to generate a key from a line. For example, sort the lines with respect to that angle starting from -180 degrees (a slightly perverse example):
def norm_angle_from_col4(line): angle = float(line.split()) while angle < -180: angle += 360 while angle > 180: angle -= 360 return angle
new_text = sort_text_file('list_of_angles',
norm_angle_from_col4) open(fname, 'w').write(new_text)
But look, I didn't need to change sort_text_file nor sort_lines functions at all. Nice.
So how might functional programming help in parallel processing? Well, by separating bits of code out with the guarantee of no side effects, we can learn to optimize different parts using fancy parallel algorithms. For instance, if you have really big files, or you have distributed files over a system, you can rewrite sort_text_file to run this in parallel, using fancy algorithms. But the person who calls sort_text_file doesn't need to know that, she just has to feed in her key_func() substitute. It might even be Google's MapReduce under the hood.
Finally, there is one more functional programming idea that I want to introduce, and that is the idea of the anonymous function. I only bring this up so that you can learn to look a Lisp program in the eye, if only for a brief momemnt. We could have written the first example as
first_word_int = lambda line: int(line.split()) new_text = sort_text_file('text_file', first_word_int)
This is lambda, one of the holy ideas stolen from Lisp and inserted into Python. The keyword lambda tells python not to evaluate the following expression, but to treat it as an expression to be evaluated later. In effect, lambda gives you another way to define a function. And if you understand this, we can now re-write that first line as an anonymous function:
new_text = sort_text_file('text_file', \ lambda line: int(line.split()))
Look at the abstraction of functional programming! It's powerful, no? So now that I've armed you with the cudgel of Functional Programming, you can go out there and beat those Fortran Neatherdalls to death, not in sequential procedural blows, but in parallel.