Notes to a young computational biologist

13 Mar 2007 // protein

I was looking back the other day, and realized that I've been doing *cough* computational biology *cough* for more than ten years. That's a big chunk of time by any reckoning, and I have the scars to prove it. Well, not so much physical scars (apart from mouse-related RSI, and back pains from sitting down too much) but mental ones.

The problem, you see, was that I learnt my trade in a wet lab, where there weren't any computational biologists around. So I picked up some bad habits - habits any seasoned computer person mentoring me might have smacked out of me. So, here are some pieces of hard-earned advice I'd like to give you (a budding computational biologist working in a wet lab). Some of the stuff is generic computer advice cribbed from the essential The Pragmatic Programmer, but other stuff is bioinformatics-related.

Learn a scripting language. Data files and directories are your bread and butter. You need a centralized way to create them. Sure, you can remember the sequence of programs to run on the command line - when you are awake. But when you are tired, you will forget something: a displaced period, a missing colon. You need to do everything - make directories, run programs, rename files, clean up temporary files - in one place. Batch files and shell scripts are not flexible enough. You need the real stuff, a scripting language - Perl, Python or Ruby.

Here's what you need in a scripting language: it has to handle text well (a built-in string type and loads of built-in string functions), have an easy interface to all operating system commands, have built-in lists and hash tables, do straightforward file handling, be interpreted (you don't want to compile anything), and allow dynamic programming, that is, dynamic typing and run-time introspection (trust me, this is good for you). What you don't need: speed.
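
To make the "everything in one place" idea concrete, here is a minimal sketch in Python. The function name run_job, the directory layout and the stand-in command are all made up for illustration; swap in whatever program you actually run.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def run_job(workdir, command, result_name):
    """Make a directory, run one program, keep its output, clean up."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    scratch = workdir / "tmp"  # temporary files live here
    scratch.mkdir(exist_ok=True)
    raw_output = scratch / "stdout.txt"

    # Run the external program and capture whatever it prints.
    with open(raw_output, "w") as out:
        subprocess.run(command, stdout=out, check=True)

    # Rename the raw output to something meaningful, then clean up.
    final = workdir / result_name
    shutil.move(str(raw_output), str(final))
    shutil.rmtree(scratch)
    return final


# Stand-in for a real analysis program: the Python interpreter itself.
demo_command = [sys.executable, "-c", "print('energy -123.4')"]
```

Calling run_job("runs/test-01", demo_command, "energy.txt") would leave a single file, runs/test-01/energy.txt, and no scratch directory behind.
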
Learn a programmable statistics package. I've used Matlab and SigmaPlot, and now I use NumPy in Python. Once I used Excel. Yes. Excel. No one told me any better, and I'd bet my bottom dollar that some newbie computational biologist is using Excel this very instant. Don't. Learn to dump all your data into a text file, and feed it into a programmable stats package. You don't need a powerful one. Most of the time, you'll just be calculating means and standard deviations. But worlds open up to you when you can process simple stats with just two lines of code, and transform data with another line or two.
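
As a rough illustration of how little code this takes, the sketch below reads one column from a text file and computes the mean and standard deviation. It uses Python's standard-library statistics module as a stand-in for NumPy so that it runs anywhere; the function names and the file layout (whitespace-separated columns, '#' comments) are assumptions of mine.

```python
import statistics


def read_column(path, column=0):
    """Read one whitespace-separated numeric column, skipping '#' comments."""
    values = []
    with open(path) as handle:
        for line in handle:
            line = line.split("#", 1)[0].strip()
            if line:
                values.append(float(line.split()[column]))
    return values


def summarize(values):
    """The 'two lines of stats': mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def rescale(values, factor):
    """A one-line transformation, e.g. a unit conversion."""
    return [v * factor for v in values]
```

With a data file in hand, mean, sd = summarize(read_column("rmsd.txt")) is the whole analysis.
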
Save everything in text files. You're a computational biologist, you don't need a notebook, but you still need to record everything like any good practicing scientist. Solution: dump everything into a text file. This is also the UNIX way of doing things. Why text? Months later, when you are writing up a paper, you will come back and ask yourself, what exactly did I do in this directory? Well, the solution of last resort is to read through the data files. If it's in text format, you will have a fighting chance of working out what you did. If in binary, forget it. And if you thought you were too he-man to save it, well, I won't even pity you.
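
One low-effort way to keep such a record is a tiny helper that appends timestamped lines to a notes file sitting next to the data. This is only a sketch; the helper name note and the file name NOTES.txt are my own choices.

```python
from datetime import datetime
from pathlib import Path


def note(directory, message, filename="NOTES.txt"):
    """Append a timestamped line to a plain-text notes file in `directory`."""
    path = Path(directory) / filename
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with open(path, "a") as handle:
        handle.write(f"{stamp}  {message}\n")
    return path
```
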
Learn the value of config files. Using config files forces you to write programs that treat a group of parameters in a general way. This will allow you to abstract calculations. In any large project, I've always started by writing a one-off program. Then I abstract out the important parameters into a config file. Once I have a nice config file, I can tweak the calculation by adjusting the config file using a text editor, then re-running the program. And if you decide to make a lot of calculations that are slight variations of each other, just write a script that loops over different config files. Config files are also a convenient shorthand for recording what you did.

What format should you save your config file in? A text file, of course (see "Save everything in text files" above). You might think about saving your configuration file in a standard format. Perhaps XML, or maybe the more readable YAML. Me? I save mine as a text version of a Python dictionary. It's text (readable), and it can be read easily in my scripting language of choice (see "Learn a scripting language" above) thanks to Python's pleasant file handling. And with dynamic introspection in Python I can easily turn the config file into a live object that I can script over.
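
Here is one way the dictionary-as-config idea might look. It is a sketch, not the code I actually use: ast.literal_eval parses the dictionary without executing arbitrary code, SimpleNamespace provides the "live object", and the parameter names in the usage note are invented.

```python
import ast
from types import SimpleNamespace


def load_config(path):
    """Read a config file containing a single Python dict literal."""
    with open(path) as handle:
        data = ast.literal_eval(handle.read())
    if not isinstance(data, dict):
        raise ValueError(f"{path} does not contain a dictionary")
    return SimpleNamespace(**data)


def save_config(path, config):
    """Write the config back out as readable text, one key per line."""
    with open(path, "w") as handle:
        handle.write("{\n")
        for key, value in sorted(vars(config).items()):
            handle.write(f"    {key!r}: {value!r},\n")
        handle.write("}\n")


def variants(base, key, values):
    """Yield copies of `base` that differ only in one parameter."""
    for value in values:
        yield SimpleNamespace(**{**vars(base), key: value})
```

With a file holding {'pdb': 'pdz.pdb', 'temperature': 300.0}, load_config gives an object with a .temperature attribute, and variants(cfg, "temperature", [310.0, 320.0]) produces the slight variations to loop over.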
Delete data you don't need. Okay, this is kinda subjective, but in most places I've worked, I've run out of disk space faster than I can say "grant extension". This has two deleterious effects. First, a full disk slows down your whole system. Second, too many data files confuse the mind. One of the most difficult things to do is to work out what is actually important. After I've done a series of calculations/simulations, I sit back, decide on a representative subset of data, write down notes on it, and then delete the rest. Confused mind pacified somewhat.

Learn the art of renaming data files and directories. This is essentially a sanity check. It won't be too long before you start creating huge directory trees with branches reaching out to the loneliest bytes on your hard disk. At first you'll take it in your stride, cleverly remembering the right combination of cryptic letters and Unix command-line tab power. But before long, you'll find your directories descending into a monstrosity of similar-sounding clauses. It doesn't take long to confuse a directory named "~/pdz-domain-with-ligand-gas-phase-force-field" with one named "~/pdz-domain-with-implicit-force-field-backbone-constraint".

When sub-directories start having very similar names at different hierarchical levels, you know that you should re-order and re-name the sub-directories so that there is no redundancy in naming between levels. Use directories semantically. For instance, I transformed the above directories to "~/gas-phase/pdz-domain/ligand" and "~/implicit-solvent-backbone-constraint/pdz-domain/apo". Each sub-directory name contains a different semantic element of the simulation. If you rename directories intelligently, you will save yourself typing, make it easier to remember where things are, and simplify further analyses over different directories.
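
The payoff of one semantic element per level is that analysis scripts can select runs with a simple pattern. A small sketch, assuming the three-level layout of the example above (force field / protein / state); the function names are mine.

```python
from pathlib import Path


def run_dir(root, force_field, protein, state):
    """Build a path with one semantic element per directory level."""
    return Path(root) / force_field / protein / state


def find_runs(root, force_field="*", protein="*", state="*"):
    """List run directories, with '*' as a wildcard at any level."""
    pattern = f"{force_field}/{protein}/{state}"
    return sorted(p for p in Path(root).glob(pattern) if p.is_dir())
```

For example, find_runs(root, state="apo") picks out every apo simulation regardless of force field or protein.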

You'll thank yourself for doing this the next time you sit down with your boss or a visitor and need to quickly pull out that really interesting figure or data file. You don't want to embarrass yourself by tripping over a hornet's nest of sub-directory extensions.

When using off-the-shelf third party software, always generate complete config files in a script. In my research, I ended up using that ever-green molecular-dynamics behemoth AMBER. AMBER is a general purpose package that accepts a zillion parameters, with some weird inter-dependencies, of which you will only ever use 10. But it will take you weeks to work out which 10 parameters are needed, and in which order, to set up the config file to run AMBER the way you want. Once you know which parameters are which, write a script that takes the 10 parameters that you are interested in and generates a complete, proper config file from scratch. You never want to cut-and-paste config files for an off-the-shelf package by hand. That is the way to madness, because I can guarantee that later on, you will forget some weird dependency that will take you 3 days to debug.
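
A generator of this kind can be very small. The sketch below writes an input file in the style of AMBER's &cntrl namelist; the handful of parameters, their default values and the single dependency check are illustrative assumptions, so check them against the manual for your version before trusting them.

```python
# Illustrative defaults only; verify names and values against the manual.
DEFAULTS = {
    "imin": 0,        # 0 = run dynamics rather than minimization
    "nstlim": 1000,   # number of MD steps
    "dt": 0.002,      # time step in ps
    "ntt": 3,         # thermostat choice (3 = Langevin)
    "gamma_ln": 1.0,  # collision frequency, needed when ntt = 3
    "temp0": 300.0,   # target temperature in K
    "cut": 12.0,      # non-bonded cutoff in angstroms
    "ntpr": 100,      # print energies every ntpr steps
    "ntwx": 100,      # write coordinates every ntwx steps
}


def make_mdin(title, **overrides):
    """Return the full text of an input file, built from scratch."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}

    # Encode each inter-dependency once so it cannot be forgotten later.
    if params["ntt"] == 3 and params["gamma_ln"] <= 0:
        raise ValueError("ntt = 3 needs gamma_ln > 0")

    lines = [title, " &cntrl"]
    lines += [f"   {key} = {value}," for key, value in params.items()]
    lines.append(" /")
    return "\n".join(lines) + "\n"
```

Every run then gets its input from one call such as make_mdin("production run", nstlim=500000), and a mistyped parameter name fails loudly instead of silently.
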
Learn to use a command-line plotting program. Don't hesitate. Learn it now. You're a scientist. Your job is to write papers, and good papers are almost always accompanied by great figures. Unlike the wet-lab guys, your bread and butter is reducing large reams of data into elegant-looking graphs. This will require imaginative data reduction. To do this, you will need precise control of drawing data elements: you need a command-line plotting program. There is no substitute. I started with Excel (the horror, the horror), then SigmaPlot (which is interactive), but then saw my productivity sky-rocket when I moved to Matlab (which has great command-line plotting), and now I use Matplotlib in Python.

The other reason to use command-line plotting is that you can automate the data analysis. You run a script over the weekend, come back on Monday morning, open the auto-generated figure, and see the results visually. Believe me, looking at a graph of your data beats poring over tables of numbers any day. While you're at it, read Edward Tufte's The Visual Display of Quantitative Information.
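
A sketch of what such an auto-generated figure could look like with Matplotlib: reduce each run to a mean and standard deviation, then write an error-bar plot straight to a file. The function names and the shape of the input data are assumptions for illustration.

```python
import statistics


def reduce_runs(runs):
    """Reduce {label: [values...]} to sorted (label, mean, stdev) rows."""
    return [
        (label, statistics.mean(vals), statistics.stdev(vals))
        for label, vals in sorted(runs.items())
    ]


def plot_summary(rows, png_path, ylabel="value"):
    """Write an error-bar plot to a file without opening any window."""
    # Imported here so the data reduction works without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, safe for batch jobs
    import matplotlib.pyplot as plt

    labels = [row[0] for row in rows]
    means = [row[1] for row in rows]
    errors = [row[2] for row in rows]
    positions = list(range(len(rows)))

    fig, ax = plt.subplots()
    ax.errorbar(positions, means, yerr=errors, fmt="o-", capsize=3)
    ax.set_xticks(positions)
    ax.set_xticklabels(labels)
    ax.set_ylabel(ylabel)
    fig.savefig(png_path, dpi=150)
    plt.close(fig)
```

The last line of the weekend script would then be something like plot_summary(reduce_runs(runs), "summary.png", ylabel="RMSD").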
Get a real programming editor: Emacs, vim, Eclipse or TextMate (what I use). Believe it or not, I used Notepad for ages. Some simple things that will save you time are commands that help you jump around your source code (jump to line, find function), auto-indentation, and block cut-and-paste. Learn to use regular expressions and decent search-and-replace. You will use these. A lot. A good programming editor should handle multiple files gracefully. Learn to have many text files open at once. You won't believe how much easier it is to cut-and-paste if both source files are on-screen at once.

Automate everything. Whenever you find yourself running any program repeatedly using a series of slightly varying parameters, that's an opportunity for you to write a script. You will save yourself hours because every step that you automate (and test) is one less thing you will have to remember later on. As you develop your programs, the collection of scripts will mutate into a precious commodity - you might eventually collect the code into a library, which you can take with you from one lab to another. Another benefit is that once you automate these steps, the script becomes a record of all the itsy-bitsy details of your simulation that, one day, someone who is interested in your work may ask you about.
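
The "slightly varying parameters" case might be sketched like this: one function that runs a program over every combination in a grid and stores the exact command line next to each result. The --name=value argument style and the file names are assumptions; adapt them to the program you are driving.

```python
import itertools
import subprocess
import sys
from pathlib import Path


def sweep(program, grid, root):
    """Run `program` once per parameter combination, one directory each."""
    names = sorted(grid)
    finished = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        run_dir = Path(root) / "_".join(f"{k}-{v}" for k, v in params.items())
        run_dir.mkdir(parents=True, exist_ok=True)

        command = list(program) + [f"--{k}={v}" for k, v in params.items()]
        # Keep the exact command line next to the results it produced.
        (run_dir / "command.txt").write_text(" ".join(command) + "\n")
        with open(run_dir / "output.txt", "w") as out:
            subprocess.run(command, stdout=out, check=True)
        finished.append(run_dir)
    return finished


# Stand-in program that just echoes its arguments.
demo_program = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]
```

A call like sweep(demo_program, {"temp": [300, 310], "salt": [0.1]}, "sweeps") creates sweeps/salt-0.1_temp-300 and sweeps/salt-0.1_temp-310, each with its own command.txt and output.txt.
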
Beware of the NIH (Not Invented Here) syndrome. You're probably a computational biologist because you like programming. Programming is a vast world with a zillion little corners that you can play with. But you're also a scientist, and your time is precious. If you are going to be absorbing different branches of biology (DNA databases, protein structures, drug libraries, statistics), you really only have time to develop robust code for one or two systems.

The really hard task is deciding which level of the problem you will be playing around with. A general rule is that it's best to write the code from scratch for the level which fits your problem. Anything below that, use off-the-shelf packages. One useful leg-up is to use a language with rich libraries (Python, Perl, maybe Java, or even Fortran). If you're not developing molecular dynamics algorithms, don't write your own molecular dynamics package. Use mature libraries such as an MPI message-passing library; don't waste time writing your own networking code. Deciding which level is your main interest is, admittedly, easier said than done.

The biggest trap, though, is when you, or your boss, are tempted to play around with the source code of an off-the-shelf package. You want to add this fancy way of integrating forces, or add a really obscure term to the force field. Do this with extreme caution. Yes, you will get kudos for hacking into the source code of AMBER, but it is so easy to get stuck in a dead end. Why? Presumably the maintainers of AMBER are also working on it. As soon as the next version comes out, your hacks will become obsolete.

Personal example: my boss wanted me to implement a special thermometer in a molecular-dynamics simulation. The previous postdoc had hacked into XPLOR (another molecular dynamics package), but that was on another computer system and the source code was lost in the nether regions of some server somewhere in the lab. I decided that I didn't want to hack into XPLOR, nor into AMBER. Instead I figured out a way of running AMBER in tiny steps, and implemented a thermometer on top of AMBER's molecular dynamics, bit by bit. One advantage was that I could plug any molecular dynamics package into my code. Now I have a really useful library that I can take away with me. I can even share the code with other people without requiring them to savage their version of XPLOR or AMBER.
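
The general shape of that approach, as opposed to my actual library, can be sketched as a driver loop that treats the molecular dynamics package as a black box. Everything here is hypothetical: toy_engine stands in for "write input files, launch the package for a few steps, read back the restart file", and the hook stands in for the thermometer.

```python
def run_in_chunks(engine, state, n_chunks, steps_per_chunk, hook):
    """Advance `state` in short bursts, calling `hook` between bursts.

    `engine(state, n_steps)` returns the new state.
    `hook(state)` may measure or modify the state and returns it.
    """
    history = []
    for _ in range(n_chunks):
        state = engine(state, steps_per_chunk)
        state = hook(state)
        history.append(dict(state))
    return state, history


def toy_engine(state, n_steps):
    """Stand-in for a real package: heats the system a little each step."""
    return {
        **state,
        "step": state["step"] + n_steps,
        "temperature": state["temperature"] + 0.5 * n_steps,
    }


def clamp_temperature(target):
    """Build a hook that caps the temperature at `target`."""
    def hook(state):
        return {**state, "temperature": min(state["temperature"], target)}
    return hook
```

Because the driver only ever calls engine(state, n_steps), swapping one package for another means writing a new engine function and nothing else.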

Hope this helps. And happy programming.