The bioinformatic-journal/software hybrid

25 Oct 2009 // protein

The real bioinformatics journal has not been created yet. Wait, let me explain. What we currently have are traditional journals that publish articles on bioinformatics. The journals themselves, though they may provide digital PDF files, are still stuck in the paper paradigm. The journal itself is not bioinformatic.

What crack am I smoking? In a recent article on the future of science publishing, Michael Nielsen speculates that the crunch in scientific publishing will not simply lead to an age of open-access science publishing but, in all likelihood, to new forms of publishing centered around services.

Well, there's one service that I wish bioinformatics journals would provide. How about actually providing what you promise? Here's the situation. You find a 6-year-old article in "BMC Bioinformatics" about some clever little algorithm. Tucked away at the end of the abstract is a link http://some.uni.edu/old_group/people/~postdocx/project69/index.html. You click on the link and one of several things may happen: 1) if you're lucky, you get a butt-ugly website with a working program, 2) you get a program but it doesn't work with your configuration, or, in all likelihood, 3) the website is not found anymore. Without the program, the article is next to useless.

The thing is, the majority of bioinformatics papers are either descriptions of programs or descriptions of datasets. Unlike traditional experimental or theory papers, code and data are things that grow with time. Programs can be continually improved: they gather bugfixes, and may even be rewritten with better algorithms. Datasets are often regenerated with the latest updates of standard databases, and many genomic analyses are expanded to include more organisms.

You might think that dealing with the changes is going to require a lot of work. But the maintenance of a changing program is actually a solved problem. Most people who are serious about software use Github, Google Code, or any number of other software repositories that gracefully handle mutating software. These websites provide excellent integrated services for downloading the program, registering and tracking bugs, and hosting discussion pages, all with a solid admin interface, and they allow you to look at the history of the program. They make it easy to do all this stuff. Moreover, these repositories provide a permanent location for the program with a clean URL.

Nevertheless, the way that academia works is that if you have written a neat little program that solves a bioinformatic problem, you must get it peer-reviewed and published in an academic journal in order to be recognized for official academic business. Thus your program will live on two different websites, where the journal website links to your program website. Of course, one could include the download as a supplementary file on the journal website, but then it's only a single file download without any kind of proper software infrastructure.

So here's my idea: why don't we set up a specialized bioinformatic journal that is tightly integrated with a software repository? Let's call it Biohub.org or something like that. Users are first encouraged to set up projects on Biohub.org, just as they already do on Google Code or Github, as a software repository. There is nothing particularly cutting edge about that, as there is plenty of existing software to facilitate the construction of a site like that.

Then, when a project reaches maturity, an article is written and sent to the editors who run the website. If the article passes peer review, the project will be registered on the front page of Biohub.org as a peer-reviewed project, and the link will point directly to the project page on the very same website. A new tab will pop up on the project page with the .html of the article, formatted in a way that is consistent with the rest of the software. The article and the software project will be one and the same.
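To make the workflow concrete, here is a minimal sketch of how such a site might model a project, assuming a simple three-stage life cycle (draft, submitted, peer-reviewed). None of this exists: the class names, the URL layout under biohub.org, and the default tab names are all made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    DRAFT = "draft"                  # ordinary repository project, no article yet
    SUBMITTED = "submitted"          # article sent to the editors
    PEER_REVIEWED = "peer-reviewed"  # article accepted


@dataclass
class Project:
    slug: str  # hypothetical short name used in the project URL
    status: Status = Status.DRAFT
    article_html: Optional[str] = None

    def url(self) -> str:
        # Assumed URL layout; the one address serves both code and article.
        return f"https://biohub.org/{self.slug}"

    def submit(self, article_html: str) -> None:
        if self.status is not Status.DRAFT:
            raise ValueError("only a draft project can be submitted")
        self.article_html = article_html
        self.status = Status.SUBMITTED

    def accept(self) -> None:
        if self.status is not Status.SUBMITTED:
            raise ValueError("only a submitted project can be accepted")
        self.status = Status.PEER_REVIEWED

    def tabs(self) -> List[str]:
        # The base tabs are assumptions; the article tab appears on acceptance.
        base = ["code", "issues", "wiki"]
        if self.status is Status.PEER_REVIEWED:
            return base + ["article"]
        return base


def front_page(projects: List[Project]) -> List[str]:
    # The front page links straight to the project pages of accepted work.
    return [p.url() for p in projects if p.status is Status.PEER_REVIEWED]
```

The point of the design is that there is only one record and one URL per piece of work. Acceptance flips a status and adds a tab; it does not create a second page somewhere else that can later rot.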

The editors will be responsible for making a printable PDF that goes with the article, and these can be linked from a journal page of the website, which can be made to look like any other academic journal website. You can even slap an ISSN on the journal and register these articles with the relevant scientific literature databases. More importantly, the journal runs the software repository, so the software will be there as long as the journal exists.

This actually makes sense for datasets as well. Datasets change over time just like programs, and storing them in a software repository provides an easy way to look at older versions of the dataset, as well as a place for others to submit useful scripts and start discussions (especially since most repositories provide wiki and forum services for every project). Furthermore, if the repository is run as an open-source code repository, then it is quite easy for collaborators to be added to projects.

If we get into the habit of setting up bioinformatic software and data in these centralized hubs, we can stitch together a truly bioinformatic journal. No more will the work of some postdoc die on some long-forgotten server unplugged in the back of someone's lab. Our collective scientific output will live in an organic fusion of prose, code and data.

Update: great discussion on friendfeed