I’ve been making a lot of figures recently, in a general rush of paper making as my current postdoc winds down to an end. One of the most annoying parts of writing a paper is making that final graph, the composite image complete with cute little captions that is to be submitted as “Figure”. At this point, you’ve already created beautiful individual graphs, each carefully proportioned, illustrated, bevelled and perfumed to perfection. But, adding one last insult to injury, you must now condense a bunch of graphs into a composite figure, sub-labeled with the appropriate A’s, B’s and C’s. So you sigh, and reach for Photoshop.

I bet you’ve been here before. You’ve loaded all the graphs into a large empty canvas in Photoshop or GIMP. It’s then a nightmare of cutting-and-pasting, resizing and layer munging. This is fiddly and boring shit, especially if your computer is not a sexy desktop-publishing beast and your programs choke on high-resolution images. Watch as your 600dpi image trails your mouse like a sluggish animal, finding its final resting place 30 seconds after you actually moved your hand. Font control is not exactly the most intuitive operation in a graphics manipulation program, especially if all you want to do is put one A in the right place at the right size.

Stop. There is a better way. After years of pain, I finally made myself learn the Python Imaging Library, and lo and behold, I quickly hacked together some code that lets me do that last assembly step in Python, even including precise placement of the labels. Goodbye GIMP. Goodbye Photoshop.

First you need to install PIL. Then you need to find your system font file. This is very important. Without it, you can’t make the labels. Once PIL is installed, in your Python script, import the three modules that you will need:
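A minimal version of those imports, assuming the three modules are `Image`, `ImageDraw` and `ImageFont`, and assuming you have Pillow (the maintained fork of PIL) installed:

```python
# Assumption: Pillow is installed (pip install Pillow); it provides the PIL package.
from PIL import Image      # open, create, resize and paste images
from PIL import ImageDraw  # draw text labels onto an image
from PIL import ImageFont  # load a TrueType font for the labels
```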
Resizing is the first thing you need to worry about. PIL provides you with a resizing function that does smoothing. Everything works through an Image object, so here’s a function that generates a new Image object resized to a new width:
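Here is a minimal sketch of such a function. The name `resize_to_width` is my own placeholder, and I’m assuming Pillow’s `LANCZOS` filter for the smoothing (older PIL releases called it `ANTIALIAS`):

```python
from PIL import Image


def resize_to_width(img, new_width):
    """Return a smoothed copy of img scaled to new_width, keeping the aspect ratio."""
    width, height = img.size
    new_height = max(1, round(height * new_width / width))
    # LANCZOS is Pillow's high-quality smoothing filter.
    return img.resize((new_width, new_height), Image.LANCZOS)


# Usage on a blank in-memory image
demo = Image.new("RGB", (400, 200), "white")
small = resize_to_width(demo, 100)
print(small.size)  # (100, 50)
```

The original image is left untouched, so you can resize the same source to several widths while you fiddle with the layout.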
Next, you want to stick any smaller image into a bigger image. In this case, I create a helper function that reads straight from an image file (png, or in fact any PIL-friendly image format), and inserts it into an existing Image:
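Something along these lines would do it. This is a sketch rather than the exact helper: the name `insert_image` and its arguments (`x`, `y` for the top-left corner in pixels, `width` for the scaled width) are assumptions of mine:

```python
from PIL import Image


def insert_image(canvas, fname, x, y, width):
    """Paste the image stored in fname into canvas at (x, y), scaled to width.

    canvas is modified in place; (x, y) is the top-left corner in pixels.
    """
    img = Image.open(fname).convert("RGB")
    height = max(1, round(img.height * width / img.width))
    img = img.resize((width, height), Image.LANCZOS)
    canvas.paste(img, (x, y))
    return canvas
```

Because the canvas is changed in place, you can call this once per panel on the same canvas.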
Finally, I have a function that reads parts (which is a list of smaller images and associated parameters that I will explain below), and constructs a brand new png (or whatever name your image is). If png is actually a ‘.tiff’ file, it will automatically write a dpi of 300 (which some journals require). Make sure you know where your system fonts are, because this function needs to find one to do the labeling. The labeling simply goes ‘A’, ‘B’, … following the order of the parts.
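A sketch of how such a function could look. Everything named here is an assumption for illustration: the function name `assemble_figure`, the `(fname, x, y, width)` layout of each part, and the fallback to Pillow’s small built-in font when no font file is given:

```python
import string

from PIL import Image, ImageDraw, ImageFont


def assemble_figure(out_name, size, parts, font_path=None, font_size=40):
    """Build a white canvas of `size`, paste each part, and label them A, B, C...

    parts: list of (fname, x, y, width) tuples (a hypothetical layout format).
    font_path: a TrueType file on your system; None falls back to Pillow's
    small built-in font, in which case font_size is not applied.
    """
    canvas = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(canvas)
    if font_path is None:
        font = ImageFont.load_default()
    else:
        font = ImageFont.truetype(font_path, font_size)
    for label, (fname, x, y, width) in zip(string.ascii_uppercase, parts):
        img = Image.open(fname).convert("RGB")
        height = max(1, round(img.height * width / img.width))
        canvas.paste(img.resize((width, height), Image.LANCZOS), (x, y))
        # The label goes at the top-left corner of its panel.
        draw.text((x, y), label, fill="black", font=font)
    if out_name.lower().endswith((".tif", ".tiff")):
        canvas.save(out_name, dpi=(300, 300))  # some journals require 300 dpi
    else:
        canvas.save(out_name)
    return canvas
```

For real figures you will want to pass `font_path`, since the built-in fallback font is far too small for a 300 dpi figure.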
So here’s how I’d use it. Let’s say you have these PNGs:
Okay, we then create a plot, where all the design is embedded in a bunch of parameters:
Which gives:
On the Mac, you can also add a last line to your script, such as:
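One way to write that last step, as a guarded sketch: I’m assuming the macOS `open` command, and the helper name `preview` is my own:

```python
import os
import subprocess
import sys


def preview(fname):
    """Open fname with the macOS `open` command; return True if it was launched."""
    if sys.platform != "darwin" or not os.path.exists(fname):
        return False
    subprocess.run(["open", fname], check=False)
    return True
```

On any other platform, or if the file is missing, this does nothing and returns `False`, so it is safe to leave at the end of the script.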
Which will open the image in a Preview window.

From here, changing the design is as simple as tweaking the parameters and re-running the script (here are some published figures that I generated using this method). Photo-editing pain is now transformed into elegant number twiddling. You may never have to reach for Photoshop to do figures again. I sure don’t.

As a bonus, I find that I often like to crop figures. Cropping is a pain in the butt because I have to look up the width and height to calculate the right values to crop. Instead, an easier but less exact way is to use fractions to decide how much cropping is to be done:
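A minimal sketch of fractional cropping, assuming each argument is the fraction of the image to trim from that side (the function name and argument names are placeholders of mine):

```python
from PIL import Image


def crop_fraction(img, left=0.0, top=0.0, right=0.0, bottom=0.0):
    """Return a copy of img with the given fraction trimmed off each side."""
    width, height = img.size
    # PIL's crop box is (left, upper, right, lower) in pixels.
    box = (
        int(width * left),
        int(height * top),
        width - int(width * right),
        height - int(height * bottom),
    )
    return img.crop(box)


demo = Image.new("RGB", (400, 200), "white")
print(crop_fraction(demo, left=0.25, bottom=0.5).size)  # (300, 100)
```

Tweaking `left=0.25` to `left=0.3` and re-running is a lot quicker than working out pixel coordinates by hand.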
It seems that we have reached some kind of tipping point where microsecond molecular dynamics simulations are feasible. I’m starting to see many papers published where the only genuine point of interest seems to be that a protein was simulated for MORE THAN A MICROSECOND. Still, a microsecond is not quite long enough to watch biological stuff shuffle through interesting conformations without specifically tuning the simulation. What remains hard is to find something interesting that a microsecond of simulation can capture adequately.

And so, one of the more important papers from the measurement side of things came out in 2008 from Bert de Groot’s group, Recognition Dynamics Up to Microseconds Revealed from an RDC-Derived Ubiquitin Ensemble in Solution, which claims to be the first NMR experiment to measure the complete microsecond dynamics of every residue in a protein. In this case, ubiquitin. The measurements are in the form of the S order parameter, a measure of the spin relaxation of N atoms along the backbone derived from Nuclear Magnetic Resonance. These are measured from Residual Dipolar Couplings. The S order parameter is somewhat ill-defined with respect to the protein conformation, so there is some latitude in how to interpret it. de Groot and coworkers showed that if the S order parameters are interpreted as simulation constraints, then an MD simulation with these constraints produces the same variability as an ensemble of 46 different crystal structures of ubiquitin. This translates the S order parameter into a quasi-experimental RMSD measure, which they call the EROS RMSDf.
With complete microsecond dynamics measurements, we finally have a litmus test for those spanking new microsecond MD simulations. Step up to the plate, D.E. Shaw and coworkers, using their state-of-the-art Molecular Dynamics package DESMOND: Microsecond Molecular Dynamics Simulation Shows Effect of Slow Loop Dynamics on Backbone Amide Order Parameters of Proteins. So how did they do? If we plot the calculated S order parameters against the ones derived from the Residual Dipolar Couplings (actually I plot 1-S as I find it easier to see), we get an okay match of r = 0.63:
But here’s the rub: in the course of my research into long-time dynamics, I came across a whole bunch of great approximations that simulate protein flexibility (actually a reviewer suggested these to me). In particular, I was pretty impressed with CONCOORD, written by the very guy who made the NMR measurements. The idea behind CONCOORD is simple and elegant: replace every bond and inter-atomic interaction in a crystal structure with distance constraints that represent the strength of the interaction, and run a Monte Carlo simulation. Using CONCOORD, in several minutes, I got a decent ensemble, from which I get the CONCOORD RMSDf. Comparing the experimentally derived EROS RMSDf to the CONCOORD RMSDf, I got a match of r = 0.54:
Comparing DESMOND to CONCOORD, the improvement in correlation to experiment was about 0.1. The lesson to learn is that if you’re running MD, you have to run it smart. The reason that CONCOORD performs well for ubiquitin is that the motions of ubiquitin are actually not that large. Compared to the 6 or 7 Å motions of the ligand-binding loops of TIM or DHFR, ubiquitin really doesn’t move all that much. Hence even a stripped-down model such as CONCOORD can reproduce the experiment almost as well as DESMOND. To simulate these motions with MD, you need tens of microseconds of simulations. Minutes of calculation with CONCOORD produce pretty much the same result as hundreds of thousands of hours of CPU time with DESMOND. There’s a lesson in there somewhere.

This article was one of the first articles I’d written that hit internet paydirt, so I look back on it with some nostalgia. Recently, Paulo Nuin wrote Notes to a bioinformatician – two years later., an update to his response to my article. It only seems fitting that I write a response to his response on his response about my article. I still stand by most of the points I made, but 2 years on, I’d like to tack on some more points, some based on comments in the original piece:

1. Little disagreements with Paulo. How do I store information? I store quite a lot of my results as figures in my data directories, ready to insert in a paper or presentation. I also build HTML files that collate a bunch of these figures so I can see all the results together. The fact is that there are many ways to store notes. Pick one you like. My only advice is to make sure that you actually do it, and you don’t need to do it on paper. As for deleting data, I differ with Paulo, probably because I am a computational (structural) biologist. I run a lot of simulations that are crap because I tweaked certain parameters to ridiculous values, but I only realized they were crap after the fact. These I delete.

2. Version Control.
I am a little guilty of not always version-controlling. But I have used it for some of my projects and it’s like taking out a good car insurance policy, especially if it’s a project where you’re rolling changes back and forth. Once you have it working, you can actually speed up your workflow. Instead of commenting out chunks of code, just delete them. You can always recover the deleted code by reloading a previous version. And now with distributed version control systems, it’s even easier to clone a new version of a project to fiddle around with risky changes than the old method of copying directories by hand. Distributed version control systems make merging versions somewhat painless. Version control actually formalizes something we do quite naturally as programmers.

3. Databases. This is one piece of advice I don’t follow myself. Yet. Most of my data is in molecular dynamics trajectories, so I don’t really want to stuff it in a database. But I recently started playing with a bioinformatics project and I’m beginning to understand the sense of shoving all your data in a database. Once set up, pulling things out is much easier. There’s a whole bunch of great open source databases, from CouchDB to MySQL.

4. Concurrency. If you’re doing any kind of serious scientific computing, you will run up against the concurrency wall. I have a 12-node cluster that runs my day jobs. This is small bananas. My last group had access to a department-wide 500-node cluster. If you use clusters then you will learn the pain that is concurrency. It can come in many forms: batch jobs, MPI message passing, GPU code. I’ve spent days writing code just to copy data back and forth from the clusters to my main machine. If you do it properly, you will avoid some of the pain. If you do it without thinking, welcome to a world of it. Some problems are social: we had one asshole administrator running the 500-node cluster who kept on killing our jobs because they were “too small”.
We ended up going to our boss to lean on his boss to make him install a different scheduler. Make an effort to at least learn about the basic concepts in concurrency: deadlocks, race conditions etc. You don’t need a thorough knowledge; no one’s expecting you to write a Software Transactional Memory system. Yet.

5. Get your own server. This kind of slipped my mind when I wrote the original list, but I spend an awful lot of time doing web stuff. I build websites for friends, and I built my own. You’re probably using your current lab’s server. Assuming you’re going to be doing interesting stuff and you want to share it with the rest of the world, your website will grow. In the long run it sucks having your website on someone else’s server. One day you will leave and they’ll delete your account. If you want to dip your toe in the water, I suggest that you first buy a domain name and link it to your current account. You can even have a cool domain name like boscoh.com to hide your butt-ugly web server name. Mine used to be http://newt.phys.unsw.edu/biophysics/~/bho. If you have your own domain name, when you change labs, you just copy over files, redirect the domain name, and no one will be the wiser. You will not get link-rot. Better yet, I suggest that you rent some server space. Dreamhost costs about $100 a year. It’s really nice to have your own site that you control, independent of which institution you’re at. And if you ever want to do something fancy, like write a Ruby on Rails app, you can.

6. Web-programming. It’s amazing what a little bit of HTML and CSS can do in terms of sharing your data. It doesn’t take much to learn, but it does take some learning. Face it, if you’re doing bioinformatics, you will either want to share your data or provide programs. You might think your data is so great that other researchers will beat down your door to get it. Wrong! If your data is not packaged in an attractive and easy-to-explore manner, people will give up.
I know I do. Learn to put up a decent, clean and usable webpage. Or better yet, learn a templating system and write a program to build a website. The better your web skills, the more data you can feed to the world in a form that people would want to use.

7. Learn how to release software. It may seem easy when your boss says, just send Professor Asshat your code, and he’ll use it. But it turns out to take a little more work than that. People forget that publishing a paper is only the beginning of the process of science dissemination. After publication, the hard work of selling your paper begins. In computational biology, that normally means getting people excited about the program, and putting a working program in their hands. If the program is not too intensive, 9 times out of 10, a good webapp will be the best solution. In a recent paper I wrote, I had a whole bunch of programs I could test my program against. I ended up choosing one because it had a beautiful webapp interface. And I chose another because it had a working binary for every major platform. I once wrote a little graphics program. Because I wanted it to work well on different platforms, I had to use wxWidgets and C++. Let me tell you that it’s a pain in the ass to maintain multiple platforms, especially on platforms other than the one you use every day. If I did it today, I would write it in Javascript, like this and this. That way, I avoid C++, I avoid GUI libraries, and I get cross-platform for free on Firefox. And it’s totally Web 2.0. Think very carefully about licensing. I release as much of my own software as I can under open source. Younger guys don’t know this, but in bioinformatics we’ve been very lucky: the whole field got into the habit of sharing data and software with other academics. This is unusual. Just ask an organic chemist about how much they pay for data and software.

8. Libraries and APIs.
The famous MIT electrical engineering introductory computer science course recently switched from Scheme to Python. One of the reasons they cited was that today’s programmers are more apt to spend their time unravelling someone’s library and writing code to work against it, rather than writing code from scratch. If you don’t want to be the computing equivalent of Grizzly Adams, I’d advise you to do the same. There is an art to unravelling the mysteries of a freshly downloaded library, but you’ll be doing it many times in your career if you want to become a productive researcher. Part and parcel of this is to learn to read crappy documentation, decipher example programs, and practice the hermeneutics of reference manuals. Hopefully after many years of this, you will learn the art of crafting your own beautiful library, with an appealing API, and documenting it with elegant, New Yorker-quality prose. So that I can use it.

There are holes inside us I sat on the floor I carried the music with me ~ Bucky Sinister
“All Blacked Out & Nowhere to Go”

“For centuries, scientists have attempted to identify and document analytical laws that underlie physical phenomena in nature…” and so goes the rather bombastic opening salvo from a recent Science article titled Distilling Free-Form Natural Laws from Experimental Data by Schmidt and Lipson. It’s the kind of work that follows the well-trodden path of the logical positivists who tried to subvert science into a branch of logic. Although the logical positivists took a near-fatal beating from Gödel’s theorem, there are some who want to keep the dream alive. Or in this case, it is the attempt to reduce the scientific process to a computable process.

According to my machine-learning life-line (Dr. Mark Reid), this article represents a huge advance. The article describes an algorithm that deduces analytical equations from the analysis of observations made on several mechanical systems. These guys were able to identify the subtle tweaks needed to let the system find invariants in a reasonable amount of time, a major breakthrough in machine learning. Yet the paper suggests that these methods can be applied to “all physical laws”, which rhetorically suggests the method can be widened to many different branches of science. This, I think, is a massive overstatement. Let me explain.

Many physics undergraduates cut their teeth on Goldstein’s Classical Mechanics, an exhaustive encyclopedia of mechanical systems that has served as the standard text of classical mechanics, and that slowly builds the formal machinery of mechanics from Newton’s equation to the abstract formalism of Lagrangians and Hamiltonians. Goldstein is not a pleasant read. Later on, if sufficiently motivated, they might crack open the Feynman Lectures on Physics (a rarity in the science literature in that the book is genuinely fun), where much of the messy guts of mechanics is exposed.
But only a few physics undergrads will ever venture onto Lev Landau’s slim volume Course of Theoretical Physics: Mechanics, where the formal properties of mechanics are properly explained in some 80 terse pages, so terse that I’ve had to read the book several times. It’s the kind of mechanics book where Newton’s three laws of motion are not even mentioned.

I bring up Landau’s “Mechanics” because not many people have studied it, and it is there that Landau points out that if you have any system that is deterministic with respect to coordinates and velocity, you will end up with a conserved Lagrangian, from which you can derive a conserved Hamiltonian. If you look at the systems studied in the Science paper, they are all mechanical systems, and the observed data are coordinates and velocities. If you assume the system is deterministic, then you can be sure that there must be a conserved Lagrangian or Hamiltonian based on the coordinates and velocities. The algorithms identified by Schmidt and Lipson will only work on deterministic mechanical systems, which would be obvious to anyone who’s studied Landau.

Unfortunately, there are not many systems that have such beautiful analytical properties, so it is hard to see how this method can be applied to other systems. For instance, could it work for the Schrödinger equation, the workhorse equation in everything from chemistry to semiconductor physics? The Schrödinger equation is deterministic in a very loose sense, and it is the wave function that is conserved, not the observable probabilities! In biology, we have an incredible amount of data on genomes, on genes, and interaction maps. Unfortunately, we do not have any equivalent Lagrangians for them.

In some ways, this article illustrates one of the points made by the great Canadian philosopher of science, Ian Hacking: that physics was the first science to be developed was no accident.
It was because the data for theorizing about planets are the easiest to measure in the natural world. This data came in the form of careful measurements of the motion of the stars and planets, made not originally for science, but for commercial purposes, in the need for accurate navigational charts. These precious measurements of planetary motions allowed Kepler and Brahe and Newton to theorize about planetary orbits and interplanetary forces. Fortunately for them, the forces that dictate planetary orbits, at least in a non-relativistic approximation, form beautiful deterministic systems that, as Landau could well appreciate, could be derived from the coordinates and velocities only.

What is it to be alike? When you’re talking about protein structures, you need to align the structures as best you can before you can talk about similarity. This is not a trivial matter. The standard approach is to use the Kabsch least-squares algorithm to find the optimal superposition between two sets of coordinates. But if you’ve had any experience with real structures, or molecular dynamics trajectories, you’ll know this is not a straightforward process. This is because some regions of the protein may vary much more than other regions. To get a better fit, you may need to apply “human intuition” to manually select a region that appears to be more stable, and then align the structures only to this region, letting the rest flop away. Manually selecting the stable region is, to say the least, somewhat arbitrary.

Except you don’t need to fudge it anymore! Douglas Theobald and Deborah Wuttke have provided a new way of aligning structures, THESEUS, that automagically separates stable regions from variable regions, and aligns to the more stable regions, instead of the everything-in-the-blender of the Kabsch least-squares approach.
It’s a beautiful algorithm, and here’s their example of the spectacularly better alignment of NMR models of the Kunitz domain 2sdf, where LS is the Kabsch least-squares approach and ML refers to the Maximum Likelihood approach of Theseus:

The killer application is in aligning crystal structures of different proteins in the same domain family. Typically, you first select a set of “conserved” residues and then do a Kabsch least-squares fit to just the “conserved” residues. This is normally an iterative process where you manually refine just which residues are conserved. Instead, Theseus hooks nicely into multiple-sequence alignment programs, and using the sequence alignment, it can produce a totally automated structural alignment of the structures that is every bit as good as, if not better than, a manual approach. As an example, I’ve been working on the PDZ domains, for instance 1x8s:

And using Theseus hooked to ClustalW (plz don’t hit me Paulo) for a bunch of PDZ domains (1be9, 1bfe, 1d5g, 1m5z, 1nf3, 1ry4, 1rzx, 1vj6, 1x8s, 2qkt-a, 2qku-a, 2qku-b, 2qku-c, 2qkv-a, 3pdz) generates this alignment (yellow is non-aligned residues, and green are non-protein ligands):
I never once had to apply “human intuition”. Aligning structures has never been this easy!

I like PLOS. I really do. Look, I’ve published two articles with them. I fully believe that they represent a big stepping stone to the future of science publishing. Yet, oh yet, I feel like their attempt at building community is kinda shoddy at the moment. The heart of any social media website (reddit, slashdot, 4chan etc.) is the commenting and rating system. The commenting and rating system of the PLOS website is really poor, and I can’t see how it will gain traction given the usability failings.
The comments are not integral to the article. There is a hermetic seal between the comments and the article: you need to make two clicks from the article page before you see even one comment. No one will ever discover comments by accident. Discoverability is inversely related to the number of links to the information.

The comment page is a mess. Whoever designed it must have used an off-the-shelf forum commenting system. All sorts of metadata are included, leaving the comment itself in the smallest and faintest font on the page. The comments are not actually shown on the first comment page, just the comment headers. If you’re pumping out hundreds of comments a day, this might be acceptable, but since it’s a dribble, it’s ridiculous that there is more screen space devoted to comment headers and comment metadata than to the comments. In top commenting websites, such as Reddit, you are shown all the comments, and there is no need for headers, which are not only a waste of time but another usability obstacle.

Even worse, if you want to post a message, you need to click again! That’s 3 clicks from the article, virtually burying the feature in web 2.0 terms. Most community websites have a text-entry HTML box at the bottom of the page, often using Javascript to give it a nice interactive experience. Since PLOS uses some nice Javascript in other parts of the site, it’s not the technology that’s the failing, but the lack of vision.

In short, it is prohibitively expensive to add comments, and when the comments are posted, they are displayed out of context and difficult to read. If I may be so bold as to make some suggestions, then
So, please PLOS, I think you’re great, but you could be greater still. You might even be the first open-access site that truly embraces social media.

From Seth Weintraub:
I recently helped someone make a bunch of images of membrane protein structures, so I thought I'd put them up here for you to enjoy. Considering that membrane proteins make up ~30% of the genome (Stevens & Arkin, 2000), the solution of membrane protein structures has lagged far behind that of cytosolic proteins. But now we're catching up:
You should read F*** My Life.