On the Visualization of Huge Chunks of Data

21 Feb 2008 // programming

I recently had one of those flashes of intuition that has improved my workflow for analyzing large data sets by an order of magnitude.

So, to cut to the chase: what is the best file format in which to create your figures?

Excel charts? No.

Pdf? No.

Png? No.

Latex? No.

No. No. No.

The answer, I submit, is simply the format in which you are reading this post right now. Yes, I am talking about HTML.

Let me explain. Let's say you are like me. You have a huge directory on your hard-drive that stores all the calculations that you've done in the last few months. Let's say you've been scrupulous and ordered all the different parts of the simulations in well-named subdirectories. Your data is dispersed in the directory tree, as well it should be. The directory location of the data tells you exactly what the data is.

So now you want to look at it. By looking at a broad cross-section of the data, you hope to discern patterns in it. I assume that you've reached the stage where you automagically generate .png or .jpeg images (using matplotlib, for instance) to visualize different aspects of the simulations.

This is how I did it a week ago. I would first copy all the images to a working directory. This was somewhat painful, as images in different directories often shared the same file name, so they had to be renamed.

Then I'd have to juggle GIMP, Inkscape and Acorn (or whatever GUI graphics program you use) to pull all the appropriate images together into one large .png image. I might sketch out on paper which images go together in a meaningful layout. Then I'd have to load all the images in, resize them properly, and make sure they lined up in the program. Of course, I'd leave the labeling to the end, because typing in the labels is such a pain.

An alternative would be to print all the images out, letting the operating system's print manager collate them. It's then a crap shoot as to what order they come out in.

This is clearly a pain point in data analysis.

With HTML, everything becomes much much easier.

To generate a figure, I can write a simple HTML file as plain text. Better yet, since HTML is basically just text, I can write a script to generate the HTML file. I typically use Python (aided by the very able HTMLTag.py library), but any scripting language will do. In my script, I can store a mnemonic list of what I want to look at, then write code to expand each mnemonic out to its full directory name before turning it into HTML tags.
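As a rough illustration of the mnemonic idea, here is a minimal sketch. It uses plain string formatting from the standard library rather than HTMLTag.py, and the directory layout, the mnemonics and the function names (`expand`, `img_tag`, `build_page`) are all made up for the example.

```python
from html import escape
from pathlib import PurePosixPath

# Hypothetical layout: every run directory holds a plot with the same name.
SIM_ROOT = PurePosixPath("simulations")
PLOT_NAME = "energy.png"

# Hypothetical mnemonics mapped to the sub-directories they stand for.
MNEMONICS = {
    "cold": "temp_280K/run_01",
    "hot": "temp_350K/run_01",
}


def expand(mnemonic):
    """Turn a short mnemonic into the full path of its plot."""
    return SIM_ROOT / MNEMONICS[mnemonic] / PLOT_NAME


def img_tag(path, alt):
    """Build an <img> tag, escaping the attribute values."""
    src = escape(str(path), quote=True)
    return f'<img src="{src}" alt="{escape(alt, quote=True)}">'


def build_page(mnemonics):
    """Return a bare-bones HTML page showing one image per mnemonic."""
    body = "\n".join(img_tag(expand(m), m) for m in mnemonics)
    return f"<html><body>\n{body}\n</body></html>\n"
```

Writing the string returned by `build_page(["cold", "hot"])` to a file gives a page that pulls both plots straight from their own directories.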

I found that when I was editing images in GIMP, most of the time I was arranging the component images into a tabular form and sticking labels on them. Holy shit! That's exactly what the TABLE tag in HTML does. Now I can just write a few loops that generate a table in HTML, with the component images placed in its cells. I don't even need to copy the images into a working directory, because I tell HTML to look for the images in their original directories.
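The "few loops" might look something like the sketch below. The function name `image_table` and the `src_for` callback are illustrative choices, not part of any library; the callback is assumed to return the path of the image for a given row and column.

```python
from html import escape


def image_table(row_labels, col_labels, src_for):
    """Return an HTML table with one image per (row, column) pair.

    src_for(row, col) is a caller-supplied function giving the image path.
    """
    lines = ["<table>"]
    # Header row: an empty corner cell, then one heading per column.
    header = "".join(f"<th>{escape(c)}</th>" for c in col_labels)
    lines.append(f"  <tr><th></th>{header}</tr>")
    # One table row per row label, with the label in the first cell.
    for r in row_labels:
        cells = "".join(
            f'<td><img src="{escape(src_for(r, c), quote=True)}"></td>'
            for c in col_labels
        )
        lines.append(f"  <tr><th>{escape(r)}</th>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

For example, `image_table(["280K", "350K"], ["energy", "rmsd"], lambda r, c: f"temp_{r}/{c}.png")` lays out four plots in a labeled two-by-two grid.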

Resizing images? Ha! HTML allows you to specify exactly how big you want your image to be, down to the pixel. You can even play with CSS to display only a part of your image. In some cases, I just assign a magnification factor to each image, and the HTML generation script automatically adjusts the width attribute of the appropriate images.
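Both tricks can be sketched in a few lines. The base width of 400 pixels and the function names are assumptions for the example; the cropping approach shown (a fixed-size box with hidden overflow and a shifted image inside it) is just one of several ways to do it in CSS.

```python
from html import escape

BASE_WIDTH = 400  # assumed default display width, in pixels


def scaled_img(src, magnification=1.0, base_width=BASE_WIDTH):
    """Return an <img> tag whose width is scaled by a per-image factor.

    Only the width is set, so the browser keeps the aspect ratio.
    """
    width = round(base_width * magnification)
    return f'<img src="{escape(src, quote=True)}" width="{width}">'


def cropped_img(src, width, height, offset_x=0, offset_y=0):
    """Show only a width-by-height window of the image.

    The window starts offset_x pixels from the left edge and offset_y
    pixels from the top edge of the full image.
    """
    box = f"width:{width}px;height:{height}px;overflow:hidden"
    shift = f"display:block;margin-left:-{offset_x}px;margin-top:-{offset_y}px"
    return (
        f'<div style="{box}">'
        f'<img src="{escape(src, quote=True)}" style="{shift}">'
        "</div>"
    )
```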

Inserting text into your HTML file has never been easier. I can import whole chunks of text from anywhere and everywhere. I can take the directory name of a file and turn it into the label. Since the HTML generation is done in a script, I can even access the original data to fully label the figures.
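Turning a directory name into a label can be as simple as the following sketch. The naming scheme (underscores separating words, one directory level per parameter) is an assumption; a real directory tree would need its own translation rules.

```python
from pathlib import PurePosixPath


def label_from_path(image_path):
    """Derive a caption from the directories that contain an image."""
    # Drop the file name and keep only the directory components.
    parts = PurePosixPath(image_path).parent.parts
    # Assumed convention: underscores in directory names stand for spaces.
    return ", ".join(p.replace("_", " ") for p in parts)
```

With this convention, `label_from_path("temp_280K/run_01/energy.png")` gives the caption "temp 280K, run 01".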

And then I realized that I didn't even need a working directory to store the HTML file. I just store an INDEX.HTML in each of the actual simulation directories, where the data is stored, and then create a master INDEX.HTML at the top of my simulation directory. This points, using hyperlinks, to all the individual INDEX.HTML files buried inside the other directories.
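A master index along these lines could be generated as sketched below. The function names are made up, and the sketch assumes the per-directory pages are named `index.html` in lowercase; on a case-sensitive file system the name passed in has to match the files exactly.

```python
from html import escape
from pathlib import Path


def find_pages(root, name="index.html"):
    """List every index page below root, as paths relative to root."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(name)
        if p.parent != root  # skip the master index itself
    )


def write_master_index(root, name="index.html"):
    """Write a top-level page that links to every index page below root."""
    items = []
    for href in find_pages(root, name):
        label = href.rsplit("/", 1)[0]  # the directory holding the page
        items.append(
            f'<li><a href="{escape(href, quote=True)}">{escape(label)}</a></li>'
        )
    html = "<html><body>\n<ul>\n" + "\n".join(items) + "\n</ul>\n</body></html>\n"
    out = Path(root) / name
    out.write_text(html, encoding="utf-8")
    return out
```

Because the links are relative to the top directory, the master page keeps working when the whole tree is moved or copied.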

Designing a figure in HTML is so much more flexible than fooling around with .png files. I generally find that when editing a big .png file, things start to slow down. Fiddling with alignment and resizing in a moderately complex figure requires a lot of arthritis-inducing mouse twiddling. In contrast, HTML pages are limitless in width and height: you can scroll as far as you like from left to right and from top to bottom. For once, I could create huge tables of images that included the images from all my simulations, not just the ones that could fit across a physical piece of paper. And if I wanted to refine the figure, I would look at the original HTML file and write a new script to generate a new figure.

I have never had such access to my data before: I open the web-browser to the top INDEX.HTML file, and then open up all the other pages in tabs. It really feels like I'm directly browsing my data through the web.

Showing the figures in the web-browser has many other advantages. If you just want to select a handful of images, you can simply drag them onto your desktop. Or, to make a copy of your entire data structure, you can just copy the directory structure of your data, the image files, and the HTML files. As long as you use relative paths, you can upload this straight to a server and show the data to your friends. What about eventual publication? Well, if you're on a Mac then you're in luck. Every web-browser there has a print-to-PDF option (with a scaling of your choice). Then you just open Preview and save to .png or .jpg or whatever.
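Keeping the paths relative is easy to get wrong by hand, so it may be worth letting the script work them out. The helper below is a small sketch (the name `relative_src` is made up) that computes the path to an image as seen from the directory of the HTML page that shows it, using forward slashes as HTML expects.

```python
import posixpath


def relative_src(page_path, image_path):
    """Return image_path relative to the directory containing page_path.

    Both arguments are assumed to be given relative to the same
    starting directory.
    """
    page_dir = posixpath.dirname(page_path) or "."
    return posixpath.relpath(image_path, start=page_dir)
```

For a page at `sims/temp_280K/index.html` that shows `sims/temp_350K/run_01/energy.png`, this gives `../temp_350K/run_01/energy.png`, which stays valid wherever the `sims` tree ends up.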

I already knew that the web-browser lets me look out at things in the world. Little did I know that I could use it to peer at the data right under my nose.