CSV is my bioinformatic data format of choice

11 Sep 2012 // programming

Boning up on bioinformatics over the last year, I've had to wade through a jungle of wildly different data formats, generated from a menagerie of bioinformatic programs. The funny thing is, of the common text data formats I've come across (XML, JSON, CSV), I find myself increasingly reaching for CSV. I will now argue why, for bioinformatic data, CSV is the best.

I typically write glue programs that munge data and present it in a more usable form. To represent data, I stick to lists and dictionaries. I can't remember who said it, or where, but you can build pretty much any kind of useful data-structure out of lists, dictionaries, lists of dictionaries, and dictionaries of lists. Although I used Object-Oriented Programming a ton in my early days, that was when I was doing mainly GUI stuff and biophysical modeling. When it's data munging, lists and dictionaries are all I need.
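To make that concrete, here's a Python sketch of the kind of structures I mean, with made-up peptide data (the field names are purely for illustration):

# a list of dictionaries: one dictionary per record
peptides = [
    {'sequence': 'MKTAYIAK', 'charge': 2, 'score': 0.93},
    {'sequence': 'LLGNVLVC', 'charge': 3, 'score': 0.47},
]

# a dictionary of lists: one list per column
by_column = {
    'sequence': ['MKTAYIAK', 'LLGNVLVC'],
    'charge': [2, 3],
    'score': [0.93, 0.47],
}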

The nested contortions of XML

XML is the favourite data format in the Java enterprise world. We can be grateful that such a standard exists. It's text, and it's standard. There are XML parsers on pretty much every major programming platform. Even Firefox uses XML, in the form of XUL plugins.

However, XML is a shite fit for large data. The first impression of XML is that it is ridiculously verbose. Verbosity is not necessarily a bad thing. The main problem with the verbosity is that when you look at an XML file, the metadata drowns the content: the textual equivalent of chart junk.

XML borrows its structure from HTML, which is why XML makes the distinction between data and metadata. This makes sense for HTML because it is a markup language, where the text is the data and the tag properties describe the text. Still, it's cumbersome to represent in my programs: the minimal data-structure for an XML element is a dictionary that separates the text from the properties, where the properties are themselves another dictionary.

Alas, bioinformatic data is rarely of the marked-up document type. XML has a hard time with bioinformatic data as it doesn't have a clean syntax for serializing lists of data. And serialization, or list-making, is really what distinguishes big data from documents like HTML.

Just to be pedantic, here's a piece of XML I've had to deal with:

<nsp_information neighboring_bin_smoothing="Y">
  <nsp_distribution bin_no="0" nsp_lower_bound_incl="0.00" nsp_upper_bound_incl="0.00" pos_freq="0.003" neg_freq="0.320" pos_to_neg_ratio="0.01"/>
  <nsp_distribution bin_no="1" nsp_lower_bound_excl="0.00" nsp_upper_bound_incl="0.16" pos_freq="0.004" neg_freq="0.098" pos_to_neg_ratio="0.04"/>
  <nsp_distribution bin_no="2" nsp_lower_bound_excl="0.16" nsp_upper_bound_incl="1.61" pos_freq="0.022" neg_freq="0.069" pos_to_neg_ratio="0.32"/>
  <nsp_distribution bin_no="3" nsp_lower_bound_excl="1.61" nsp_upper_bound_incl="7.04" pos_freq="0.054" neg_freq="0.058" pos_to_neg_ratio="0.92"/>
  <nsp_distribution bin_no="4" nsp_lower_bound_excl="7.04" nsp_upper_bound_incl="19.16" pos_freq="0.090" neg_freq="0.062" pos_to_neg_ratio="1.45"/>
  <nsp_distribution bin_no="5" nsp_lower_bound_excl="19.16" nsp_upper_bound_incl="31.25" pos_freq="0.135" neg_freq="0.071" pos_to_neg_ratio="1.88"/>
  <nsp_distribution bin_no="6" nsp_lower_bound_excl="31.25" nsp_upper_bound_incl="55.84" pos_freq="0.167" neg_freq="0.078" pos_to_neg_ratio="2.13"/>
  <nsp_distribution bin_no="7" nsp_lower_bound_excl="55.84" nsp_upper_bound_incl="98.92" pos_freq="0.270" neg_freq="0.122" pos_to_neg_ratio="2.22"/>
  <nsp_distribution bin_no="8" nsp_lower_bound_excl="98.92" nsp_upper_bound_excl="inf" pos_freq="0.256" neg_freq="0.120" pos_to_neg_ratio="2.12" alt_pos_to_neg_ratio="2.22"/>
</nsp_information>

It's a list, but it looks rather awkward, does it not? I don't have much issue with long descriptive tag names, but it's kind of weird that in XML, the data is embedded in the property tags, rather than as the data itself. This is simply a legacy of HTML, which makes the forced distinction between the "text" and the "markup", and thus one must make an awkward choice: stuff key-value pairs into property tags, or make new tags whose text describes a key-value pair.
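For what it's worth, pulling that snippet into lists and dictionaries is easy enough in Python with the standard xml.etree.ElementTree module. Note that every value lives in an attribute, so the element text is never even touched (a rough sketch, assuming the snippet is saved as nsp.xml):

import xml.etree.ElementTree as etree

# parse the file and collect each <nsp_distribution> as a dictionary
tree = etree.parse('nsp.xml')
bins = [dict(elem.attrib) for elem in tree.getroot().iter('nsp_distribution')]

# every value comes back as a string, so numbers must be converted by hand
for b in bins:
    b['pos_freq'] = float(b['pos_freq'])

print(bins[0])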

JSON maps pretty well to my programming needs

And thus we come to JSON, which solves a lot of my grumblings with XML. Cleverly, JSON is executable Javascript: you can just slurp it in with a highly unsafe eval() call. Text representations in Python look almost the same (the main difference is the capitalisation of boolean values). JSON is just lists and dictionaries. It's no less flexible than XML, in that you can still create shit. But in JSON, there is no forced distinction between the "content" and "properties" of tags. You have lists, dictionaries, lists of dictionaries and dictionaries of lists, which map perfectly to the data structures in my programs.
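In Python, reading JSON is a one-liner with the standard json module, and the result is exactly the lists and dictionaries I was going to build anyway (the data here is made up):

import json

text = '{"name": "nsp_distribution", "bins": [0, 1, 2], "smoothed": true}'

data = json.loads(text)   # safe, unlike eval() in Javascript
print(data['bins'])       # [0, 1, 2]
print(data['smoothed'])   # True, note the Python capitalisation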

Another advantage of JSON, as pointed out on a reddit thread, is that JSON has restricted data-types. This kind of restriction is fantastic because it constrains the programmer to implement sane data structures.

And yet, JSON does not quite satisfy my needs. You see, even though there are parsers for XML and JSON on most programming platforms, on perhaps the most important platform for me, no such parser exists. And that platform, alas, is Microsoft Excel.

CSV at the end of the road can also become the Road

I am a bioinformatician, so I work with biologists. I find it kind of surprising that biologists, though advanced in the ways of killing and fucking up animals, plants and bacteria, are often computer luddites. They'll cling to the version of Microsoft Office that they painstakingly learnt 25 years ago. The biologists will want to see the data in Excel, where they have all the graphing and sorting tools that they are familiar with.

And so I found myself adding a function at the end of my data-processing programs that flattened my data structures and spat out the results in a sequential text file of Comma-Separated Values. My biologists can then happily open up the CSV file in their beloved Microsoft Excel.
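The flattening function is nothing clever. A minimal sketch with the standard csv module, assuming the records are already a list of dictionaries like the made-up peptides above:

import csv

peptides = [
    {'sequence': 'MKTAYIAK', 'charge': 2, 'score': 0.93},
    {'sequence': 'LLGNVLVC', 'charge': 3, 'score': 0.47},
]

# one row per record, with a header row that Excel will happily display
with open('peptides.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['sequence', 'charge', 'score'])
    writer.writeheader()
    writer.writerows(peptides)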

It eventually dawned on me that CSV was a superior format for data presentation, for the simple reason that it's easier to understand data that has been laid out as a sequential table than data in a nested format, even though nesting data-structures a la JSON is more efficient. Some redundancy in the columns allows a tabular layout of the data, which makes it easier to explore, and exploring the data, really, is the whole point of science anyway.

And then I started dealing with really big data files, in JSON. That's when I realized the limit of JSON. The problem with JSON is that it is a nested data format. I've seen my laptop freeze as my standard JSON parser heroically tries to read in a 100 MB nested JSON data structure. It's just impossible to read it all into memory. With really big data, I realized you need to read it in a chunk at a time. Now there are fancy streaming parsers for JSON, but no general solution. For streaming JSON to work, you have to prestructure your JSON data format, adding index files and whatnot. I found it easier to just flatten everything into a sequential data structure and then ... I realized I was making a CSV file!
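A flat CSV file, on the other hand, streams for free. A sketch of chugging through an arbitrarily large file without ever holding more than one row in memory, using the hypothetical peptides.csv from above:

import csv

with open('peptides.csv') as f:
    for row in csv.DictReader(f):
        # each row is a small dictionary; process it and let it go
        if float(row['score']) > 0.9:
            print(row['sequence'])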

In order to manage large data files, it was just easier to convert them to CSV, so that I could read the data back sanely, one line at a time. Indeed, to take advantage of cloud computing, you will have to flatten your data structure so that something like map-reduce can work on it. And then I realized the irony: Microsoft Excel had it right all along.