Bionostalgia: Notes to an Even Younger Computational Biologist

12 May 2009 // protein

This article was one of the first articles I'd written that hit internet paydirt, so I look back on it with some nostalgia. Recently, Paulo Nuin wrote Notes to a bioinformatician – two years later, an update to his response to my article. It only seems fitting that I write a response to his response on his response about my article. I still stand by most of the points I made, but two years on, I'd like to tack on some more points, some based on comments in the original piece:

Little disagreements with Paulo. How do I store information? I store quite a lot of my results as figures in my data directories, ready to insert in a paper or presentation. I also build HTML files that collate a bunch of these figures so I can see all the results together. The fact is that there are many ways to store notes. Pick one you like. My only advice is to make sure that you actually do keep notes, and you don't need to do it on paper. As for deleting data, I differ with Paulo, probably because I am a computational (structural) biologist. I run a lot of simulations that are crap because I tweaked certain parameters to ridiculous values, but I only realized they were crap after the fact. These I delete.
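
As a rough illustration of that kind of collating script, here is a minimal sketch that writes one HTML page showing every figure in a directory. It assumes the figures are PNG, JPG or SVG files sitting in a single folder; the function name `build_gallery` and the page layout are made up for this example and are not the script described above.

```python
import html
from pathlib import Path


def build_gallery(fig_dir, out_name="index.html", exts=(".png", ".jpg", ".svg")):
    """Write an HTML page that shows every figure found in fig_dir."""
    fig_dir = Path(fig_dir)
    # Sort by file name so the page order is stable between runs.
    figures = sorted(p for p in fig_dir.iterdir() if p.suffix.lower() in exts)

    parts = ["<html><body>", f"<h1>{html.escape(fig_dir.name)}</h1>"]
    for fig in figures:
        name = html.escape(fig.name)
        # The page lives next to the figures, so a bare file name works as src.
        parts.append(f'<div><h2>{name}</h2><img src="{name}" width="600"></div>')
    parts.append("</body></html>")

    out_path = fig_dir / out_name
    out_path.write_text("\n".join(parts), encoding="utf-8")
    return out_path
```

Calling `build_gallery("results/run_01")` (a hypothetical directory) would leave an `index.html` next to the figures that can be opened directly in a browser.
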
Version Control. I am a little guilty of not always version-controlling. But I have used it for some of my projects, and it's like taking out a good car insurance policy, especially if it's a project where you're rolling changes back and forth. Once you have it working, you can actually speed up your workflow. Instead of commenting out chunks of code, just delete them. You can always recover the deleted code by reloading a previous version. And now with distributed version control systems, it's even easier to clone a new copy of a project to fiddle around with risky changes than with the old method of copying directories by hand. Distributed version control systems also make merging versions somewhat painless. Version control actually formalizes something we do quite naturally as programmers.
Databases. This is one piece of advice I don't follow myself. Yet. Most of my data is in molecular dynamics trajectories, so I don't really want to stuff it in a database. But I recently started playing with a bioinformatics project and I'm beginning to understand the sense of shoving all your data in a database. Once set up, pulling things out is much easier. There's a whole bunch of great open source databases, from CouchDB to MySQL.
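
To make "pulling things out is much easier" concrete, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. SQLite is swapped in only because it needs no server; it is not one of the databases named above. The table layout, the function names and the rows in the usage note are all invented for illustration.

```python
import sqlite3


def make_db(path=":memory:"):
    """Open (or create) a database with one hypothetical 'proteins' table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proteins ("
        "accession TEXT PRIMARY KEY, name TEXT, length INTEGER)"
    )
    return conn


def add_proteins(conn, rows):
    """Insert (accession, name, length) tuples, replacing existing accessions."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany("INSERT OR REPLACE INTO proteins VALUES (?, ?, ?)", rows)


def longer_than(conn, min_length):
    """Return all proteins longer than min_length, longest first."""
    cur = conn.execute(
        "SELECT accession, name, length FROM proteins "
        "WHERE length > ? ORDER BY length DESC",
        (min_length,),
    )
    return cur.fetchall()
```

With made-up rows such as `("EX001", "example A", 120)` and `("EX002", "example B", 450)` loaded through `add_proteins`, a call like `longer_than(conn, 200)` replaces the loop-and-filter code you would otherwise write over flat files.
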
Concurrency. If you're doing any kind of serious scientific computing, you will run up against the concurrency wall. I have a 12-node cluster that runs my day jobs. This is small bananas. My last group had access to a department-wide 500-node cluster. If you use clusters then you will learn the pain that is concurrency. It can come in many forms: batch jobs, MPI message passing, GPU code. I've spent days writing code just to copy data back and forth from the clusters to my main machine. If you do it properly, you will avoid some of the pain. If you do it without thinking, welcome to a world of it. Some problems are social: we had one asshole administrator running the 500-node cluster who kept on killing our jobs because they were "too small". We ended up going to our boss to lean on his boss to make him install a different scheduler. Make an effort to at least learn about the basic concepts in concurrency: deadlocks, race conditions etc. You don't need a thorough knowledge – no one's expecting you to write a Software Transactional Memory system. Yet.
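
For a feel of what a race condition is, consider several threads bumping a shared counter: each increment is a read, an add and a write, and if two threads read the same value before either writes, one update is lost. The sketch below shows the usual fix, a lock around the read-modify-write. The class and function names are made up for this example, and whether an unlocked version actually misbehaves depends on the interpreter and timing, so only the locked version is shown.

```python
import threading


class Counter:
    """A shared counter; the lock makes each read-modify-write atomic."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the body of this 'with' block.
        with self._lock:
            self.value += 1


def count_in_parallel(n_threads=4, n_increments=10000):
    """Have n_threads threads each increment one shared counter n_increments times."""
    counter = Counter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading the result
    return counter.value
```

With the lock in place the result is always `n_threads * n_increments`. A deadlock is the opposite failure: two threads each holding one lock and waiting forever for the other's.
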
Get your own server. This kind of slipped my mind when I wrote the original list, but I spend an awful lot of time doing web stuff. I build websites for friends, and I built my own. You're probably using your current lab's server. Assuming you're going to be doing interesting stuff and you want to share it with the rest of the world, your website will grow. In the long run it sucks having your website on someone else's server. One day you will leave and they'll delete your account. If you want to dip your toe in the water, I suggest that you first buy a domain name and link it to your current account. You can even have a cool domain name like boscoh.com to hide your butt-ugly web server name. Mine used to be http://newt.phys.unsw.edu/biophysics/~/bho. If you have your own domain name, when you change labs, you just copy over files, redirect the domain name, and no one will be the wiser. You will not get link-rot. Better yet, I suggest that you rent some server space. Dreamhost costs about $100 a year. It's really nice to have your own site that you control, independent of which institution you're at. And if you ever want to do something fancy, like write a Ruby on Rails app, you can.
Web-programming. It's amazing what a little bit of HTML and CSS can do in terms of sharing your data. It doesn't take much to learn, but it does take some learning. Face it, if you're doing bioinformatics, you will either want to share your data or provide programs. You might think your data is so great that other researchers will beat down your door to get it. Wrong! If your data is not packaged in an attractive and easy-to-explore manner, people will give up. I know I do. Learn to put up a decent, clean and usable webpage. Or better yet, learn a templating system and write a program to build a website. The better your web skills, the more data you can feed to the world in a form that people would want to use.
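
A templating system can be as small as Python's built-in `string.Template`. The sketch below fills a page template with a table of results; the template text, the `render_page` function and the two-column layout are assumptions made for this example, and a real site would probably want a fuller templating library.

```python
import html
from string import Template

# $title and $rows are placeholders filled in by render_page().
PAGE = Template("""<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
<table>
<tr><th>Name</th><th>Value</th></tr>
$rows
</table>
</body>
</html>
""")

ROW = Template("<tr><td>$name</td><td>$value</td></tr>")


def render_page(title, results):
    """Render (name, value) pairs as an HTML table inside the page template."""
    rows = "\n".join(
        # Escape everything so stray '<' or '&' in the data cannot break the page.
        ROW.substitute(name=html.escape(str(name)), value=html.escape(str(value)))
        for name, value in results
    )
    return PAGE.substitute(title=html.escape(title), rows=rows)
```

Writing the returned string to a file gives a static page. Because the layout lives in one template, restyling every results page later means editing one place.
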
Learn how to release software. It may seem easy when your boss says, "Just send Professor Asshat your code, and he'll use it." But it turns out to take a little more work than that. People forget that publishing a paper is only the beginning of the process of science dissemination. After publication, the hard work of selling your paper begins. In computational biology, that normally means getting people excited about the program, and putting a working program in their hands.

If the program is not too intensive, 9 times out of 10, a good webapp will be the best solution. In a recent paper I wrote, I had a whole bunch of programs I could test my program against. I ended up choosing two: one because it had a beautiful webapp interface, and the other because it had a working binary for every major platform.

I once wrote a little graphics program. Because I wanted it to work well on different platforms, I had to use wxWidgets and C++. Let me tell you that it's a pain in the ass to maintain multiple platforms, especially on platforms other than the one you use every day. If I did it today, I would write it in JavaScript, like this and this. That way, I avoid C++, I avoid GUI libraries, and I get cross-platform for free on Firefox. And it's totally Web 2.0.

Think very carefully about licensing. I release as much of my own software as I can as open source. Younger guys don't know this, but in bioinformatics we've been very lucky, in that the whole field got into the habit of sharing data and software with other academics. This is unusual. Just ask an organic chemist about how much they pay for data and software.
Libraries and APIs. The famous MIT electrical engineering introductory computer science course recently switched from Scheme to Python. One of the reasons they cited was that today's programmers are more apt to spend their time unravelling someone's library and writing code to work against it, rather than writing code from scratch. If you don't want to be the computing equivalent of Grizzly Adams, I'd advise you to do the same. There is an art to unravelling the mysteries of a freshly downloaded library, but you'll be doing it many times in your career if you want to become a productive researcher. Part and parcel of this is to learn to read crappy documentation, decipher example programs, and practice the hermeneutics of reference manuals. Hopefully after many years of this, you will learn the art of crafting your own beautiful library, with an appealing API, and documenting it with elegant, New Yorker-quality prose. So that I can use it.