I was looking back the other day, and realized that I've been doing *cough* computational biology *cough* for more than ten years. That's a big chunk of time by any reckoning, and I have the scars to prove it. Well, not so much physical scars (apart from mouse-related RSI, and back pains from sitting down too much) but mental ones.
The problem, you see, was that I learnt my trade in a wet-lab, where there weren't any computational biologists around. So I picked up some bad habits - habits any seasoned computer person mentoring me might have smacked me out of. So here are some pieces of hard-earned advice I'd like to give you (budding computational biologist working in a wet lab). Some of the stuff is generic computer advice cribbed from the essential The Pragmatic Programmer, but other stuff is bioinformatics-related.
- Learn a scripting language. Data files and directories are your bread and butter. You need a centralized way to create them. Sure, you can remember the sequence of programs to run on the command line - when you are awake. But when you are tired, you will forget something: a displaced period, a missing colon. You need to do everything - make directories, run programs, rename files, clean up temporary files - in one place. Batch files and shell scripts are not flexible enough. You need the real stuff, a scripting language - Perl, Python or Ruby.
Here's what you need in a scripting language: it has to handle text well (loads of built-in string functions and a built-in string type), have an easy interface to all operating system commands, have built-in lists and hash tables, do straightforward file handling, be interpreted (you don't want to compile anything), and be dynamic, with dynamic typing and introspection (trust me, this is good for you). What you don't need: speed.
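As a rough illustration of doing everything in one place, here is a minimal Python sketch. The directory and file names are hypothetical, and the "program" being run is only a stand-in (the Python interpreter printing one line); swap in your own command.

```python
import subprocess
import sys
from pathlib import Path


def run_job(workdir, name):
    """Make a directory, run a program, rename its output, clean up."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)       # make directories

    raw = workdir / "output.tmp"
    scratch = workdir / "scratch.tmp"
    scratch.write_text("intermediate junk\n")        # pretend temporary file

    # Run the program. This command is a placeholder: the Python
    # interpreter printing one line. Replace cmd with your real command.
    cmd = [sys.executable, "-c", "print('energy 42.0')"]
    with raw.open("w") as out:
        subprocess.run(cmd, stdout=out, check=True)  # fail loudly on errors

    final = workdir / f"{name}.txt"
    raw.replace(final)                               # rename files
    scratch.unlink()                                 # clean up temporary files
    return final
```

Calling run_job("runs/test-01", "energy") would leave a single file, runs/test-01/energy.txt, behind. Because every step lives in one function, there is nothing to remember at 2 a.m.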
- Learn a programmable statistics package. I've used Matlab, SigmaPlot, and now I use NumPy in Python. Once I used Excel. Yes. Excel. No-one told me any better, and I can bet my bottom dollar that some newbie computational biologist is using Excel this very instant. Don't. Learn to dump all your data into a text file, and feed it into a programmable stats package. You don't need a powerful one. Most of the time, you'll just be calculating means and standard deviations. But worlds open up to you when you can process simple stats with just two lines of code, and transform data with another line or two.
- Save everything in text files. You're a computational biologist, you don't need a notebook, but you still need to record everything like any good practicing scientist. Solution: dump everything into a text file. This is also the UNIX way of doing things. Why text? Months later, when you are writing up a paper, you will come back and ask yourself, what exactly did I do in this directory? Well, the solution of last resort is to read through the data files. If it's in text format, you will have a fighting chance of working out what you did. If in binary, forget it. And if you thought you were too he-man to save it, well, I won't even pity you.
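To make the "two lines of code" concrete, here is a small sketch. It assumes a text file with one number per line, and it uses Python's standard-library statistics module instead of NumPy only to stay dependency-free; the function names are made up for illustration.

```python
import statistics
from pathlib import Path


def load_column(path):
    """Read one number per line from a text file, skipping blanks and # comments."""
    lines = Path(path).read_text().splitlines()
    return [float(s) for s in lines if s.strip() and not s.lstrip().startswith("#")]


def summarize(values):
    """The 'two lines' of stats: mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def zscores(values):
    """One more line to transform the data: centre and scale each value."""
    mean, sd = summarize(values)
    return [(v - mean) / sd for v in values]
```

With NumPy the equivalent calls would be numpy.mean(values) and numpy.std(values, ddof=1); note that numpy.std without ddof=1 gives the population rather than the sample standard deviation.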
- Learn the value of config files. Using config files forces you to write programs that treat a group of parameters in a general way. This will allow you to abstract calculations. In any large project, I've always started by writing a one-off program. Then I abstract out the important parameters into a config file. Once I have a nice config file, I can tweak the calculation by adjusting the config file using a text editor, then re-running the program. And if you decide to make a lot of calculations that are slight variations of each other, just write a script that loops over different config files. Config files are also a convenient shorthand for recording what you did.
What format should you save your config file in? Text file, of course (see #3). You might think about saving your configuration file in a standard format. Perhaps XML, or maybe the more readable YAML. Me? I save mine as a text version of a Python dictionary. It's text (readable), and it can be read easily in my scripting language of choice (see #1) thanks to Python's pleasant file handling. And with dynamic introspection in Python I can easily turn the config file into a live object that I can script over.
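One possible way to do this is sketched below. The post does not say how the file is actually loaded; this version assumes ast.literal_eval for reading (so the file cannot execute code) and types.SimpleNamespace for the "live object". The function names and the keys in the usage note are hypothetical.

```python
import ast
from pathlib import Path
from types import SimpleNamespace


def save_config(path, params):
    """Write the parameters as the text form of a Python dictionary."""
    Path(path).write_text(repr(params) + "\n")


def load_config(path):
    """Read the dictionary back and expose its keys as attributes."""
    # literal_eval only accepts Python literals, so the file cannot run code.
    params = ast.literal_eval(Path(path).read_text())
    return SimpleNamespace(**params)
```

After save_config("run.cfg", {"temperature": 300.0, "n_steps": 1000}), the file holds one readable line of text, and load_config("run.cfg").temperature gives the value back. Editing the file by hand in a text editor works just as well.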
- Delete data you don't need. Okay, this is kinda subjective, but in most places I've worked, I've run out of disk space faster than I can say grant extension. This has two deleterious effects. First, a full disk slows down your system globally. Second, too many data files confuse the mind. One of the most difficult things to do is working out what is actually important. After I've done a series of calculations/simulations, I sit back, decide on a representative sub-set of data, write down notes on it, and then delete the rest. Confused mind pacified somewhat.
- Learn the art of renaming data files and directories. This is essentially a sanity check. It won't be too long before you start creating huge directory trees with branches reaching out to the loneliest bytes on your hard disk. At first you'll take it in your stride, cleverly remembering the right combination of cryptic letters and unix command-line tab power. But before long, you'll find your directories descending into a monstrosity of similar sounding clauses. It doesn't take long to confuse a directory named "~/pdz-domain-with-ligand-gas-phase-force-field" with one named "~/pdz-domain-with-implicit-force-field-backbone-constraint".
When sub-directories start having very similar names at different hierarchical levels, you know that you should re-order and re-name the sub-directories so that there is no redundancy in naming between levels. Use directories semantically. For instance, I transformed the above directories to "~/gas-phase/pdz-domain/ligand" and "~/implicit-solvent-backbone-constraint/pdz-domain/apo". Each sub-directory name contains a different semantic element of the simulation. If you rename directories intelligently, you will save yourself typing, make it easy to remember where things are, and simplify further analyses over different directories.
You'll thank yourself when you do this, the next time you sit down with your boss, or a visitor and you need to quickly pull out that really interesting figure or data-file. You don't want to embarrass yourself by tripping over a hornet's nest of sub-directory extensions.
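A semantic layout also pays off in scripts. The sketch below assumes a three-level layout like the example above (simulation conditions, then protein, then state); the function names are invented for illustration.

```python
from pathlib import Path


def make_run_dir(root, conditions, protein, state):
    """Create <root>/<conditions>/<protein>/<state> and return the path."""
    path = Path(root) / conditions / protein / state
    path.mkdir(parents=True, exist_ok=True)
    return path


def list_runs(root):
    """Yield (conditions, protein, state, path) for every third-level directory."""
    for path in sorted(Path(root).glob("*/*/*")):
        if path.is_dir():
            conditions, protein, state = path.relative_to(root).parts
            yield conditions, protein, state, path
```

Because each level means exactly one thing, an analysis over all runs is a single loop over list_runs(root), and selecting, say, only the "apo" runs is a one-line filter on the third field.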
- When using off-the-shelf third party software, always generate complete config files in a script. In my research, I ended up using that ever-green molecular-dynamics behemoth AMBER. AMBER is a general purpose package that accepts a zillion parameters, with some weird inter-dependencies, of which you will only ever use 10. But it will take you weeks to work out which 10 parameters, and in which order, are needed to set up the config file to run AMBER the way you want. Once you know which parameters are which, write a script that takes the 10 parameters that you are interested in and generates a complete, proper config file from scratch. You never want to cut-and-paste config files for an off-the-shelf package by hand. That is the way to madness, because I can guarantee that later on, you will forget some weird dependency that will take you 3 days to debug.
- Learn to use a command-line plotting program. Don't hesitate. Learn it now. You're a scientist. Your job is to write papers, and good papers are almost always accompanied by great figures. Unlike for the wet-lab guys, it's your bread and butter to reduce large reams of data into elegant looking graphs. This will require imaginative data reduction. To do this, you will need precise control over how data elements are drawn: you need a command-line plotting program. There is no substitute. I started with Excel (the horror, the horror), then SigmaPlot (which is interactive), but then saw my productivity sky-rocket when I moved to Matlab (which has great command-line plotting), and now I use Matplotlib in Python.
The other reason to use command-line plotting is that you can automate the data analysis. You run a script over the weekend, and come back in the morning, open the auto-generated figure, and see the results visually. Believe me, looking at a graph of your data beats poring over tables of numbers any day. While you're at it, read Edward Tufte's The Visual Display of Quantitative Information.
- Get a real programming editor. Emacs, vim, Eclipse or TextMate (what I use). Believe it or not, I used Notepad for ages. Some simple things that will save you time: commands that help you jump around your source code (jump to line, find function), auto-indentation, block cut-and-paste. Learn to use regular expressions, and decent search-and-replace. You will use this. A lot. A good programming editor should handle multiple files gracefully. Learn to have many text files opened at once. You won't believe how much easier it is to cut-and-paste if both source files are on-screen at once.
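A config generator can be as small as the sketch below. Every parameter name and the dependency rule are hypothetical placeholders, not real options of AMBER or any other package; the "key = value" output format is likewise an assumption.

```python
# All parameter names below are hypothetical placeholders, not real
# options of AMBER or any other package.
DEFAULTS = {
    "n_steps": 1000,
    "time_step_ps": 0.002,
    "temperature_k": 300.0,
    "thermostat": "none",
    "constrain_bonds": True,
}


def build_config(**overrides):
    """Return the full text of a config file from a handful of parameters."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}

    # Encode the "weird dependencies" once, so you cannot forget them later.
    # (Made-up rule: a long time step needs constrained bonds.)
    if params["time_step_ps"] > 0.001 and not params["constrain_bonds"]:
        raise ValueError("time_step_ps > 0.001 requires constrain_bonds=True")

    # Always write every parameter, in a fixed order, from scratch.
    lines = [f"{key} = {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"
```

The point of the design is that the dependency check lives in code: a forbidden combination fails immediately with a readable message instead of costing three days of debugging.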
- Automate everything. Whenever you find yourself running any program repeatedly using a series of slightly varying parameters, that's an opportunity for you to write a script. You will save yourself hours because every step that you automate (and test) is one less thing you will have to remember later on. As you develop your programs, the collection of scripts will mutate into a precious commodity - you might eventually collect the code into a library, which you can take with you from one lab to another. Another benefit is that if you can automate these steps into a script - the script becomes a record of all the itsy bitsy details of your simulation, that one day, someone who is interested in your work may ask you about.
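As a sketch of what such a script might look like, the function below runs one calculation for every combination of parameter values and appends a one-line record per run to a log file. The run argument is a hypothetical function that wraps your own program; the JSON-lines log format is an assumption chosen because it stays plain text.

```python
import itertools
import json
from pathlib import Path


def sweep(run, grid, log_path):
    """Call run(**params) for every combination in grid and log each call."""
    names = sorted(grid)
    results = []
    with Path(log_path).open("a") as log:
        for values in itertools.product(*(grid[n] for n in names)):
            params = dict(zip(names, values))
            result = run(**params)
            # One JSON line per run: the script's own record of what was done.
            log.write(json.dumps({"params": params, "result": result}) + "\n")
            results.append((params, result))
    return results
```

A call such as sweep(run, {"temperature": [300, 310], "n_steps": [10, 20, 30]}, "sweep.log") performs six runs, and the log file doubles as the record of the itsy bitsy details.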
- Beware of the NIH (Not Invented Here) syndrome. You're probably a computational biologist because you like programming. Programming is a vast world with a zillion little corners that you can play with. But you're also a scientist, and your time is precious. If you are going to be absorbing different branches of biology (DNA databases, protein structures, drug libraries, statistics), you really only have time to develop robust code for one or two systems.
The really hard task is deciding what level of the problem you will be playing around with. A general rule is that it's best to write the code from scratch for the level which fits your problem. For anything below that, use off-the-shelf packages. One useful leg-up is to use a language with rich libraries (Python, Perl, maybe Java, or even Fortran). If you're not developing molecular dynamics algorithms, don't write your own molecular dynamics packages. Use mature libraries like the MPI message-passing library; don't waste time writing your own networking code. Deciding the level that is your main interest is, admittedly, easier said than done.
The biggest trap, though, is when you, or your boss, are tempted to play around with the source code of an off-the-shelf package. You want to add this fancy way of integrating forces, or add a really obscure term to the force-field. Do this with extreme caution. Yes, you will get kudos for hacking into the source code of AMBER, but it is so easy to get stuck in a dead end. Why? Presumably the maintainers of AMBER are also working on it. As soon as the next version comes out, your hacks will become obsolete.
Personal example: my boss wanted me to implement a special thermometer in a molecular-dynamics simulation. The previous postdoc had hacked into XPLOR (another molecular dynamics package), but that was on another computer system and the source code was lost in the nether regions of some server somewhere in the lab. I decided that I didn't want to hack into XPLOR, nor into AMBER. Instead I figured out a way of running AMBER in tiny steps, and implementing a thermometer on top of AMBER's molecular dynamics in piecemeal fashion. One advantage was that I could plug any molecular dynamics package into my code. Now I have a really useful library that I can take away with me. I can even share the code with other people without requiring them to savage their version of XPLOR or AMBER.
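The general shape of that approach can be sketched as follows. This is not the actual thermometer library, whose details are not given here: the engine is a toy stand-in for a real package, the "temperature" is just the mean squared velocity in arbitrary units, and the hook is a simple rescaling rule chosen only for illustration.

```python
def run_in_chunks(engine, state, n_chunks, steps_per_chunk, hook):
    """Drive any MD engine in short bursts, calling hook between bursts.

    engine(state, n_steps) -> new state   (wraps one short run of a package)
    hook(state, chunk)     -> new state   (your own code, kept outside the package)
    """
    for chunk in range(n_chunks):
        state = engine(state, steps_per_chunk)
        state = hook(state, chunk)
    return state


def toy_engine(velocities, n_steps):
    """Toy stand-in for a real package: velocities drift up 1% per step."""
    return [v * 1.01 ** n_steps for v in velocities]


def rescale_to(target):
    """Build a hook that rescales velocities to a target toy temperature."""
    def hook(velocities, chunk):
        current = sum(v * v for v in velocities) / len(velocities)
        factor = (target / current) ** 0.5
        return [v * factor for v in velocities]
    return hook
```

Because run_in_chunks only sees two plain functions, swapping one package for another means writing a new engine wrapper and leaving everything else alone.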
Hope this helps. And happy programming.
Good points Bosco. I came to several similar conclusions after years of experimental computer science work.
I totally agree with the need to learn a scripting language, using configuration files, a good editor, and a good stats package.
I used a mix of shell, ruby, and prolog scripts edited with a mix of vi, SubEthaEdit and eventually TextMate.
I use R for statistical analysis and graphing. It has a great library for hypothesis testing (t-tests, ANOVA, etc), a large active community and great graphing methods. I ended up doing all my graphing as R scripts after going through gnuplot and Grace. I think the quality of plots you can get out of R is superior to the others.
I’ve played around with MatLab as well and it also looks good, though it carries a hefty price tag compared to R’s $0.
An advantage of keeping code, configuration and data in text files that you didn’t mention is the fact that they are then easily version controlled.
Version control should be a point in its own right in your list. Every time you finish an experimental run that results in an important graph or statistic, whack your entire experimental workspace (code, data, configs) into version control and tag it. That way, even if your code moves on, you can “go back in time” and rerun the experiment without the fear that some new feature that you have since added has subtly modified your methodology.
The other advantage is that if, like me, you repeat experiments while slowly varying parameters you can easily dispatch these experiments to multiple machines. Having your entire experimental workspace in a repository means you can ensure the multiple machines you split your run over are all up to date.
Subversion’s probably the best of the bunch here. It’s better than CVS and has the most third-party support and tools.
Hey Bosco!
True words… my friend. I wish somebody had told me that 15 years ago! ;-) (5 more than you, my old friend).
I would add a 12th point.
- Be sure you like what you are doing. A PhD can sometimes be very unrewarding; at least enjoy the process of getting one!
All the best BOSCO,
m
Hi Bosco!
Excellent post, many of these insights are hard won I know. Here is a manual trackback from nodalpoint:
Nice Post !
But I must confess: I don’t master Perl, nor Python… :-)
Hi
I did not know your blog. Found the post really interesting. I am new to the “blogosphere” and I took the liberty of commenting on your post here:
http://blindscientist.genedrift.org/2007/0...
Hope it is OK.
Cheers
Great post.
I have only been working in bioinformatics for about 5 years and I agree with most of your points, although I still use Excel for graphics :).
Good advice. I think I’ve fought my way out of most of the traps you describe.
A related, but subtly different additional point would be to learn how to formulate the question you’re really trying to ask in the world of biology. In other words, learn to develop questions that you can answer through statistical or sampling tests, discrete, small questions, along the way towards solving whatever grand problem in biology forms the basis of your thesis. Because of the large interesting world of computing, sometimes the possibilities seem endless and I find myself in total “exploration” phase, linking together code that could optimize the parameters for the solution if… I ever got to the end of the possibilities of coding such an arrogant solution.
My advice is to look at the world in terms of distributions; a biological question can be tested simply by asking if the model distribution fits the available data. Your real work as a bioinformaticist vs. a biostatistician is, of course, distilling, preprocessing, and correlating the correct data in order to achieve a testable distribution. Your sanity goal in grad school should be no more than to solve something, some small thing, each and every day, no matter how pedantic. Then go home and start again the next day. Rinse, lather, repeat.
Great post: I would have shown it to younger students if I knew some :)
Yet, I’m pretty sure that my “notes to a young computational chemist” would differ. I didn’t understand why my boss never used my scripts and ran calculations by hand. So old-fashioned… He claimed that using my scripts wouldn’t save time.
In fact, it took me a while (almost five years…) to realise that my ab initio calculations are more likely to give wrong results than chemical or biological simulations. That’s why checking (quickly) the output is a necessary step: is there spin contamination? Does the calculation really converge to a chemical solution (wrong SCF guess)? Is the final geometry reasonable? What about the frequencies… My #12 point would be to check the first output before submitting a whole series of calculations. (A fortiori before sending these results to your boss :)
Cheers.
Thanks for the informative post. I think perspectives like these are particularly useful for us students in this field, because so few professors have a real understanding of what our working lives are going to be like.
Your advice is good, but some of the items seemed obvious to me (like not using Excel). I have a few obvious and not so obvious things to add:
1) Script everything. For each project, run all the scripts twice. If you can’t get the same answer twice in a row, then you don’t know the answer. You make an excellent point when you recommend the use of programmable statistics software; if you can’t run your statistical analysis twice and get the same answer, then you don’t know the answer.
2) Store your data in a database. Text files are all very fine, but it’s impossible to store complex data efficiently in text files. Take advantage of a relational database and the power of SQL.
3) Most of the time, the initial question you ask about the data turns out not to be the right question. Be especially careful of this when a wet-lab scientist gives you a data set and tells you the answer she or he wants, because it’s usually an answer that can’t be obtained from the data.
4) I agree with Mark’s comments about version control. Save all your code with a source control system, such as Subversion.
5) Learn XHTML and XML. Learn how to distribute results as HTML pages, because they’re easy for scientists to use.
6) Learn a programming language that can be compiled for speed. Perl is all very fine and usually sufficiently fast, but there will come a time when you need to write an algorithm in a faster language such as C. I once wrote a Perl script that was going to take 2 years to run. When I rewrote the fundamental algorithm in C, I reduced the run time to 2 weeks.
7) Back up your data and code. A hard disk crash can be a disaster.
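Points 1 and 2 in the list above can be sketched together with Python's built-in sqlite3 module. The table name, column names and helper functions are hypothetical, and an in-memory database is used only to keep the example self-contained.

```python
import sqlite3


def load_measurements(rows):
    """Create an in-memory database and insert (sample, value) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE measurement (sample TEXT, value REAL)")
    db.executemany("INSERT INTO measurement VALUES (?, ?)", rows)
    db.commit()
    return db


def mean_per_sample(db):
    """Average value for each sample, in a fixed order."""
    query = (
        "SELECT sample, AVG(value) FROM measurement "
        "GROUP BY sample ORDER BY sample"
    )
    return db.execute(query).fetchall()


def same_answer_twice(analysis, *args):
    """Run an analysis twice and report whether both results agree."""
    return analysis(*args) == analysis(*args)
```

The explicit ORDER BY matters for the "run it twice" check: without it, SQL makes no promise about row order, so two correct runs could still compare as different.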
Dear Bosco, great to see these thoughts. I would like to mention the R-project for statistics software! I also like Conrad's point that SQL is a necessary thing to know today. My courses try to combine script programming (currently in Bioperl) with SQL-interfacing via DBI.
I have only 5 years of experience and I think the same as you. Regarding Excel, yes, I also started with Excel, and the main problem is that it breaks the automatic flow of your work. Then I learned to sort in Python (using set) and stopped using Excel (except to share data with others who like Excel).
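For what it's worth, the sort-with-a-set idiom mentioned in this comment might look like the sketch below, assuming a text file with one identifier per line; the function name is made up.

```python
def unique_sorted(path):
    """Return the distinct non-blank lines of a text file, sorted."""
    with open(path) as handle:
        # The set drops duplicates; sorted() then gives a stable order.
        return sorted({line.strip() for line in handle if line.strip()})
```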
