Why I think Python is good (as in granola) for the scientist who programs.

31 Mar 2007 // programming

This is a flameworthy topic, but I will wade into it anyway – for 95% of scientists, Python is the bomb.

I am not saying that Python is good for everybody; I'm specifically talking about a scientist who programs (as opposed to a scientific programmer). You use the computer a lot but you have modest computing requirements. None of your jobs takes more than a day or two to run on a typical computer. In the end, you're more interested in publishing data in a paper than releasing a software package as manna from the open-source heaven.

For you, my friend, Python is the shizzle. Sure, Perl, Fortran, Pascal, Java, C, and C++ (C's nerdy younger sister) have had their day in the sun, but they lack that certain je ne sais quoi for today's code-savvy scientists. And here, I want to talk about the quoi.

Python has built-in modern data structures

There is no reason in 2007 to write low-level data structures such as variable arrays, iterators, linked lists and hash tables. Hell, you don't even need to write a decent sort algorithm.
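As a rough illustration of what "built-in" buys you, here is a minimal sketch in current Python 3 syntax. The energies and residue names are invented for the example; the point is only that the growable array, the hash table, and the sort all come with the language.

```python
# Hypothetical measurements, invented for illustration.
energies = [3.2, -1.5, 0.7, -4.1]

# A list is a variable-length array, and sorting is built in.
ranked = sorted(energies)

# A dict is a hash table: here it maps residue name -> count.
counts = {}
for residue in ["ALA", "GLY", "ALA", "SER", "ALA"]:
    counts[residue] = counts.get(residue, 0) + 1

# Iteration needs no explicit iterator declaration.
labels = [f"{e:.1f}" for e in ranked]

print(ranked)
print(counts)
print(labels)
```

No linked list, no hand-rolled hash function, no sort routine copied out of a textbook.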

I speak from bitter personal experience.

When I started modeling proteins, some 10 years ago, I was a macho physics undergrad, and I wanted to use the most macho computer language. There was no more macho language than C. I wrote everything in C. That was stupid.

The problem with C is that you have to write a lot of low-level stuff yourself. I was a scientist, not a computer scientist. I had to learn a lot of freshman computer science to do anything mildly complex, like implementing a doubly linked list. It was a complete waste of time.

Before I ended up at Python, I spent a protracted period using C++. In C++, you can use the Standard Template Library, which is essentially a bolted-on library of basic data structures. When I say bolted on, I mean the seams stick out like barbed wire stuck into a stupid cow that was stupid enough to walk into a barbed wire fence. But before you can safely use the STL, you really have to understand how templates work (fearsome meta-programming). What C++ giveth with one hand, it taketh away with the other. Besides, do you really want to type out "vector<int>::iterator iter" every time you want to iterate over a vector?

I learnt enough basic computer science to get stuff done, but not enough to do it well. Modern languages like Python give you these data structures for free, data structures that you will use 95% of the time. Okay, so maybe not for free; maybe you take a speed hit. But (assuming you are not a scientific programmer) the time saved by not writing low-level stuff outweighs any speed improvement that you might gain.

Readability counts

I don't know if you've ever read a theory paper – filled with undecipherable acronyms, unreadable equations, and obscure definitions tucked away in fine print in the methods section. Unless you love theory, it's a serious poke in the eye.

Reading a program with wildly flexible syntax is a little like that (I'm talking to you, Perl). Scientists don't really want to spend their time immersed in the deeper issues of the grammar of a computer language. And the best way not to do that is to use a language that looks as much like pseudo-code as possible.

I don't think it's a huge jump from looking at differential equations to reading pseudo-code. But honestly, grasping s-expressions in Lisp is beyond most scientists, unless you're the kind of person who reads Gödel's On Formally Undecidable Propositions for bed-time reading. As far as sticking to the pseudo-code idiom goes, Python hits the sweet spot (Bruce Eckel calls Python executable pseudo-code). For scientists who program only some of the time, that's precisely what you want.

Another focus of Python is that it is compact: it tries hard to limit the number of elements in the programming language, perhaps at the expense of some expressiveness. This fits into Python's philosophy of one-obvious-way-to-do-something, which concentrates the mind on the most straightforward implementation of an algorithm. You only have to learn one thing once.

Python also aims for prose clarity over too much concision. It uses symbols that translate well from outside the programming world. Python uses ':' after an if clause, because colons are used like that in English prose. Instead of obscure operators such as '||' and '&&', Python provides the more prosaic 'or' and 'and'.
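Here is a small sketch of how that reads in practice. The function name and the thresholds are hypothetical, made up purely to show the colon, 'and' and 'or' doing their work.

```python
def label(value, low, high):
    # Hypothetical classifier: the names and cut-offs are illustrative only.
    if value < low or value > high:
        return "outlier"
    elif value == low or value == high:
        return "edge"
    else:
        return "ok"


def usable(value, low, high):
    # 'and' and 'not' read the way you would say them out loud.
    return value is not None and label(value, low, high) != "outlier"


print(label(5, 0, 10), label(-1, 0, 10), label(10, 0, 10))
print(usable(5, 0, 10), usable(None, 0, 10))
```

Read it aloud and it is very nearly the pseudo-code you would have scribbled on a whiteboard.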

The issue of white-space will forever cause violent disagreements amongst programmers. Me, I love Python's white-space requirements. Semantic white-space cuts the Gordian knot of deciding how to indent source code. It forces everybody to indent code the same way, with the result that everybody's code will look a little more like pseudo-code. That is a good thing for scientists, who probably haven't been inculcated with good programming practices by the burn of failing software projects.

Scriptability

If you've ever done anything in C, you realize that your directories will be filled with a salacious mix of shell scripts, make files, C source files, C object files, and headers. This means that you will have to master not one but three languages – make files, the C programming language, and shell script. (The same holds for Java or Fortran.)

The beauty of Python is that it started off as a scripting language and later blossomed into a full-featured language. Some might say scripting is dirty and unclean, but sometimes you just want to drop into the OS and move a directory. When I first started programming, I didn't understand the power of scripting: being able to drop into the operating system easily (creating directories, moving files, checking domain names), reading files easily, transforming text and data with simple commands. Now I realize that it means I can replace all my shell scripts and make files with Python programs.
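For a taste of what dropping into the operating system looks like, here is a minimal sketch using only the standard library. The directory and file names are hypothetical, and everything happens inside a temporary directory so nothing real gets touched.

```python
import os
import shutil
import tempfile

# Work inside a throwaway directory so the sketch touches nothing real.
workdir = tempfile.mkdtemp()

# Create a nested directory in one call (like `mkdir -p`).
raw_dir = os.path.join(workdir, "run1", "raw")
os.makedirs(raw_dir)

# Create an empty placeholder data file (hypothetical name).
open(os.path.join(raw_dir, "traj.dat"), "w").close()

# Move the whole run directory (like `mv run1 archive`).
archive_dir = os.path.join(workdir, "archive")
shutil.move(os.path.join(workdir, "run1"), archive_dir)

moved_files = os.listdir(os.path.join(archive_dir, "raw"))
print(moved_files)

# Clean up (like `rm -r`).
shutil.rmtree(workdir)
```

Each line does roughly what the equivalent shell command would, except that the paths are ordinary Python strings you can loop over, test, and transform.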

It took me a while to realize that text handling is really important. I found that I was spending a lot of my time handling text. In any project, I've always broken up the calculations into a series of modular programs. You code one program that does one thing, and save the data. Then you take that data and do something else with it. For simplicity, I was saving that data as text files. And therefore, I was reading and writing data files a lot. Python handles the reading, writing and manipulation of text files with grace.
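The save-then-reload pattern described above can be sketched like this. The file name, the column layout and the numbers are assumptions for the sake of the example, not a format I am recommending.

```python
import os
import tempfile

# Hypothetical output of a first program: (residue index, energy) pairs.
results = [(1, -2.5), (2, 0.75), (3, -1.0)]

path = os.path.join(tempfile.mkdtemp(), "energies.txt")

# Program one: save the data as whitespace-separated text.
with open(path, "w") as f:
    f.write("# index energy\n")
    for index, energy in results:
        f.write(f"{index} {energy}\n")

# Program two: read it back, skipping comments and blank lines.
loaded = []
with open(path) as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        index, energy = line.split()
        loaded.append((int(index), float(energy)))

total = sum(energy for _, energy in loaded)
print(loaded, total)
```

The parsing loop is the whole file format: split on white-space, convert, append.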

Scientists who don't want to master the full complexities of shell script, make files and the programming language of their choice can replace all of them with Python.

Python is ecumenical

Python is what is known as a multi-paradigm language. Surprisingly, a few common ones are not. Java forces you to use objects all the time. C doesn't have objects. Lisp doesn't like straightforward procedural programming. C++ is supposedly a multi-paradigm language, but it is as ugly as sin.

Objects are great, when you need them. I am becoming more enamored of functional programming and I even indulge in some meta-programming. But sometimes I just want to bash out a simple script. Python lets you switch between these paradigms easily.
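To show what switching paradigms means, here is the same toy calculation (sum the squares of the even numbers) written three ways. The data and the Accumulator class are invented for illustration.

```python
data = [1, 2, 3, 4, 5, 6]

# Procedural: a plain loop, bashed out as a script.
total = 0
for x in data:
    if x % 2 == 0:
        total += x * x

# Functional: compose filter, map and sum without mutable state.
functional_total = sum(map(lambda x: x * x,
                           filter(lambda x: x % 2 == 0, data)))


# Object-oriented: wrap the state and the rule in a (hypothetical) class.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, x):
        if x % 2 == 0:
            self.total += x * x


acc = Accumulator()
for x in data:
    acc.add(x)

print(total, functional_total, acc.total)
```

All three live happily in the same file, and none of them needs any ceremony the others don't.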

Interpretation is back in style

Back in the day, I used GW-BASIC on my PC clone. It was interpreted and slow as molasses. Then I got Turbo C, which compiled programs to native code, and I swore off interpreted languages for the rest of my life. Boys interpret, real men compile.

That was 15 years ago. Things have progressed a little since then. People often play the slow-speed card when talking about Python, but the fact is, unless you're doing some seriously hefty calculations, Python is plenty fast for you.

There are manifold benefits to interpretation. First, you don't need to compile anything. Compilation is a painful step that requires make files, and object files, and painful worries about binary library dependencies. In Python, the source code is the program. It is totally portable and sharable.

Python has an interpreted command-line mode. This is more useful than you might imagine. Sometimes when you forget some little parts of Python, you can just test it out on the command line, without resorting to digging through the manual. What order are the operators? Are variables scoped locally? How does a string function strip spaces? Just bash it out on the command line.

Trying things out on the command-line is but the first step towards the Read-Eval-Print Loop (REPL) methodology. I don't want to go into it here, but REPL leverages the command-line mode so that you can organically grow your program, one function at a time. Growing a program is a much more pleasant way of writing programs than splurging out a blueprint before even a single line is written. But to use the REPL methodology, you need a language that can support dynamic module loading, duck typing, and all sorts of dynamic-language niceties, which Python does.

Another great bonus with the command-line mode is that I can use Python as a command-line shell on steroids. For instance, sometimes I want to rename a whole bunch of files, and move them around, but with all sorts of non-obvious name changes. The Python command-line gives me the flexibility of rich data structures, string manipulations, and operating-system commands, all at my finger-tips on the command-line.
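Here is a sketch of the kind of renaming job I mean. The file names and the renaming rule (run_<n>_final.dat becomes final_<nnn>.dat) are hypothetical, and it all runs in a temporary directory.

```python
import os
import re
import tempfile

# Set up a scratch directory with some hypothetical files.
workdir = tempfile.mkdtemp()
for name in ["run_3_final.dat", "run_12_final.dat", "notes.txt"]:
    open(os.path.join(workdir, name), "w").close()

# Rename run_<n>_final.dat to final_<nnn>.dat, zero-padding the number
# so the files sort in numerical order. Other files are left alone.
pattern = re.compile(r"run_(\d+)_final\.dat$")
for name in os.listdir(workdir):
    match = pattern.match(name)
    if match:
        new_name = "final_%03d.dat" % int(match.group(1))
        os.rename(os.path.join(workdir, name),
                  os.path.join(workdir, new_name))

renamed = sorted(os.listdir(workdir))
print(renamed)
```

Typed line by line at the interpreter prompt, you can inspect the match objects and the new names before committing to the rename, which is exactly where the command-line mode earns its keep.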

But sir, what if I really need the speed

In professional programming circles, it's often said that the number one error in programming is premature optimization. Most of the time, you really don't know where the speed bottle-neck is. So why pre-empt that ignorance by writing in a raw language that takes 10 times the effort and 10 times as long? You'll probably find that it runs fast enough in Python, with fewer bugs and more readability.

But occasionally, you will want to speed up that one crucial function in your Python program. You've dutifully done the profiling and found that your program spends 95% of its running time in one place.
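If you have never done that dutiful profiling, here is a minimal sketch using the standard library's cProfile and pstats modules. The functions are deliberately silly stand-ins, invented so that one of them is obviously the bottle-neck.

```python
import cProfile
import io
import pstats


def slow_pairwise(n):
    # Hypothetical hot spot: an O(n^2) double loop.
    total = 0
    for i in range(n):
        for j in range(n):
            total += (i - j) ** 2
    return total


def cheap_setup(n):
    # Hypothetical cheap step, for contrast.
    return list(range(n))


def main():
    cheap_setup(200)
    return slow_pairwise(200)


# Profile one call to main().
profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Collect the five most expensive entries, by cumulative time, as text.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists each function with its call count and time, so the one place where the program chokes is measured rather than guessed.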

Before you put your rubber gloves back on and go back to C, you might want to try to write a plug-in module to Python, in C. This will isolate the pain of writing in a low-level language to just a single function.

Even a couple of years ago, contemplating the writing of a plug-in module would be, in scientific terms, extremely non-trivial. But not anymore: there are so many easy ways to write plug-in modules, and all of them are wildly different. You can use SWIG, which wraps a piece of C code. You can use Psyco, a just-in-time compiler that speeds up your Python code as it runs, or Pyrex, which compiles a Python-like language into a C extension. You can even use a number of Fortran wrappers. Python is rather promiscuous once you think about it.

Math libraries

Up until recently, I couldn't see much of a difference between Python and Ruby (most of the points above apply equally well to Ruby). But there's been a development that tips Python over the edge for scientists who program.

Finally, we have a single powerful numerics library in the form of numpy. Numerical libraries are perhaps the single biggest reason that Fortran is still popular amongst old-school scientists. In Fortran, creating vectors and arrays is easy, and there are great libraries for them.

Any language that wants to earn the loyalty of working scientists must have a powerful numerics library. Not only that, for portability and posterity, the library must be standardized across the board, be fast, reliable, and integrated within the core of the language.

Before numpy, Python was hobbled by the fact that there were two competing numerics libraries (numarray and Numeric). This was a bad situation to be in. Something as fundamental as a numerics library had better be standard, or you will find it exceedingly difficult to share code. But now Travis Oliphant has done the Herculean task of merging these two libraries into one slick package that is fast becoming a standard. So much so, I won't be surprised if numpy makes it into the Python standard library.

And to add the icing on the cake, matplotlib has emerged as a robust graphing library that sits on top of numpy. Graphing is an essential tool in the scientist's tool-kit. And there's a rich collection of pluggable Python scientific libraries.

As it stands, it's easy to download the one authoritative numerics library that gives you arrays and vectors in Python. This library will let you do essential statistics that every working scientist needs. Like the ex-girlfriend that you keep coming back to but know you shouldn't, you can finally kiss Fortran goodbye.
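For the flavour of it, here is a minimal sketch of arrays and basic statistics, assuming numpy is installed and imported under the conventional name np. The sample values are invented.

```python
import numpy as np

# Hypothetical measurements.
samples = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Basic statistics are methods on the array itself.
mean = samples.mean()
std = samples.std()  # population standard deviation (ddof=0)

# Arithmetic applies element-wise, with no explicit loop.
scaled = (samples - mean) / std

print(mean, std)
print(scaled)
```

The whole-array arithmetic in the last step is the part that feels like Fortran, minus the declarations.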

Take home message

Python is a compact and scrupulously clean language that lets you do dirty scripting, write in different programming styles, and includes a powerful standard math library, with graphing. It is God's gift to scientists who program but who don't want to spend their time learning how to optimize the processing chip or sink too deep into computer science.