I recently attended my very first un-conference, Sci Bar Camp, held in the pleasant, genteel bourgeois neighborhood of Palo Alto, home of Stanford and Google. The etymology of "Sci Bar Camp" is a tad obscure. "Sci Bar" is the cheapskate, open-invite counterpart to the invite-only "Sci Foo", which is funded by Google and O'Reilly. The "Foo" in "Sci Foo" allegedly comes from Friends Of O'Reilly. The "Bar" in "Sci Bar" is a pun on the traditional programming placeholder names foo and bar. But I digress.
I'd read about these web 2.0 unconferences in blogs, and the reality was as fun as those self-congratulatory blog posts make it out to be, probably because the attendees were self-selecting. The Sci Bar Camp crowd was engaging, visionary, and even well-dressed. I got a chance to hash out many ideas, but in particular, I think I figured out something about the future of science publishing, a running obsession at Sci Bar Camp.
The old model is dying, and we are dying to know what will take its place. There were obvious things like open-access, author copyrights, and better web technology. But I think there are some even bigger changes lying ahead: the future federation of science publication, and a revolution in the definition of disciplinary boundaries.
What is this PLoS One that you speak of?
Perhaps the most interesting person I met was Peter Binfield, managing editor of PLoS One. From what I understood, PLoS One was set up to publish any article without judgement of fit or perceived impact. The only criteria were clarity, methodological soundness and novelty. The driving idea was to let citation after publication be the judge of impact, instead of predicting the success of the article before publication.
After talking to Peter, I now realize that PLoS is after bigger game: to drag scientific publishing kicking and screaming into the 21st century. Case in point: at SciBarCamp, Peter was whoring out a feature of PLoS that was going to be rolled out in the next few months: they were going to drop Impact Factors and switch to article metrics. They will be the first journal to do so. Although cries of hypocrisy have sounded over this volte-face (given that PLoS had loudly trumpeted that it wanted to feast at the table of Impact Factor with the big boys), I think that PLoS is actually fulfilling its mission of taking scientific publishing into the 21st century.
To get into the game, PLoS had to establish credibility, and playing up Impact Factors was a tactical move. In the last five years, it's clear they have managed this with PLoS Biology and PLoS Medicine, using a bog-standard open-access model. That set them up to launch PLoS ONE, which had a much more radical publishing model. Would this radical model work? Well, after two years of PLoS ONE, it's clear that it's at least workable – they have the financial and publishing numbers to prove it.
How well? "PLoS One accounts for 1% of PubMed". Let that sink in a little. In two short years, PLoS ONE has bitten off a huge chunk of scientific publishing. Has the growth plateaued? Or is this the beginning of something big?
Disciplinary boundaries
One of the gates that PLoS One has dismantled is the barrier of disciplinary distinctions. This is one of the factors that has fueled the growth of PLoS One. Anyone and everyone can publish there. If the growth of PLoS One reaches a tipping point, it may swallow whole a bunch of smaller journals. These smaller journals were created in the days of print, when there were physical restrictions on the dissemination process. A journal had a natural hard limit of paper pages, with the attendant problem of finding an audience. As new disciplines popped into existence, a group of scientists might come together and realize that they formed a potential new audience for a print journal. They would then hire an editor to serve them. This was the only way to disseminate information amongst themselves.
These limits no longer apply as printing costs have shrunk to zero. Distribution is as close as the URL bar of your browser. It is difficult to avoid the conclusion that everything may one day flow through the voracious mouth of PLoS One. Funnily enough, Peter pointed out that the infrastructure for this kind of journal already exists at BioMed Central. BMC's business model is to provide an open-access infrastructure for any group of interested scientists to start up their own journal, and BMC has become a hub of hundreds of itsy-bitsy journals. There is no reason they couldn't funnel everything through a single pipeline. The only thing stopping them, it seems, is the desire to protect the Journal Impact Factors of their better journals from being diluted by the poor factors of the crappy ones. What if, Peter mused, you merged the editorial boards into one behemoth? That could become a straight competitor to PLoS ONE. We may yet see a future Battle of the Online Publishing Giants.
If the disciplinary boundaries have dissolved, what will take their place? There are folks talking about a totally federated scientific publishing system. A recent blog post argued that we've effectively got a unified, federated world journal in the form of free citation databases and easy access to the archives of the past. If this is the case, it'd be nice if we could resolve the inefficiencies of running an archipelago of little journals and create a single giant mound of science. With PLoS making up 1% of PubMed in just two years, it's well poised to do the mounting.
Of what use is reviewing?
Bitching about the review process seems to be a common hobbyhorse amongst the SciBarCamp folk. News flash: folks also bitch about old age and being rejected by the opposite sex. The current system we have is that professional editors defer to the judgments of two blind reviewers with scientific credentials. By separating the general editing duties from the science reviewing, the system takes the full-time burden of editing away from the scientists.
What might replace the current review system? I can think of two possibilities: one based on the past and one based on the web. In the past, you had an editor who made all the decisions themselves. No ifs or buts. The editor's decision was final, like in, er, the rest of the magazine publishing industry. The greatness of a magazine like the New Yorker owes everything to the eccentric genius of its editor, David Remnick. Not everything good will make it in, but there will be many flashes of inspired brilliance due to the quirks of the editor.
The other model might be something like arXiv, where anybody can deposit something. But even arXiv has some nominal gates. If you really want to open the gate, then one could simply merge science publishing with blogging. As anyone who reads blogs knows, 99.97% of postings will be crap, with a redeeming sliver of brilliant stuff. I shudder to wade through the tidal wave of shit that this might bring.
In contrast, the current system of peer review is meant to pull everything to an average level of good-enough. It came out of the 1950s, coinciding with the massive growth of professional science. With the greater output of science articles, it made sense to create a system that could function more autonomously than an all-powerful technocratic editor. The accepted articles never get too bad, but brilliant articles oftentimes fall through the cracks. Still, there is value in keeping the bottom from dropping out altogether. But unless the actual cost of editing articles drops to zero and we dispense with full-time editors, I don't see peer review going away.
Article Metrics and Prestige
Prestige is a deeply human trait. The need for respect will probably never go away. In Aristotle's "Rhetoric", it was put forward that the strength of an argument depends as much on the character of the debater as on the logic of the argument. Character is also necessary in science. In scientific articles, whilst the most egregious claims may be easily tested, many experiments, calculations and measurements cannot be. The reputation of a scientist is often used to judge evidence, providing a level of believability for prima facie unbelievable claims.
In science, prestige works at various levels: on articles, on individuals, and on journals. In science publishing, for journals, prestige comes in the form of the Journal Impact Factor. Now of course, impact factors are imprecise – Nobel prizes have been given for work rejected by Science and Nature. But the point is that metrics are useful depending on who is using them.
Journal Impact Factors provided a useful metric for librarians deciding which journals to purchase. You might think the scientists would be heavily invested in helping librarians choose journals, but in my experience scientists are very lazy about helping other people do their jobs in the university. The reason is that most scientists don't contribute their own money to buy journals, and so they don't make any hard decisions and just tell the librarians to get everything. The librarians then resort to using Journal Impact Factors to decide.
However, as journals have gone online and open access, people have realized that the entire corpus of scientific content is easier to obtain through online browsing than via a walk to the library. I was physically made aware of this when I saw our lab's old copies of Nature/Science/Cell being dumped into the paper recycling bin. It's PDFs all the way.
Clearly, the Journal Impact Factor is no longer important for librarians, but what about using it as a proxy for evaluating the worth of a researcher? Journal Impact Factors themselves are really proxies for article citations, based on a mysterious formula designed by Thomson Scientific that purports to aggregate overall citations for a journal.
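The headline two-year version of the formula, at least, is publicly described: citations received in a given year to what the journal published in the previous two years, divided by the number of "citable items" from those two years (what counts as "citable" is where the mystery lives). A minimal sketch, with made-up numbers:

```python
def impact_factor(citations_to_prev_two_years, citable_items_prev_two_years):
    """Two-year Journal Impact Factor for a given year: citations in that
    year to the journal's previous two years of output, divided by the
    number of citable items published in those two years."""
    return citations_to_prev_two_years / citable_items_prev_two_years

# Hypothetical journal: 1200 citations in 2008 to its 2006-2007 articles,
# against 400 citable articles published in 2006-2007.
print(impact_factor(1200, 400))  # -> 3.0
```

The entire "impact" of a journal gets boiled down to that one division, which is exactly why per-article metrics look so much more honest.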
An individual is most often evaluated by job search committees for academic positions. Of course, a dedicated job search committee member could rummage through the citation databases and collect citations for individual researchers themselves. And now that PLoS One is planning to roll out article metrics, we may finally get to the point where article metrics are the only thing that matters.
Still, in a discussion about these things, there will always be that idiot (in the Dostoevskian sense) who argues that one really shouldn't use metrics at all. Just read. the. goddamn. paper. What kind of candyland do these guys live in? I mean, let's say you're a member of a search committee and you have 20 candidates. Say you look up and photocopy (just joking, I mean print out the PDF of) 5 papers for each. That's 100 papers. How much time do you have to read 100 papers from some vaguely related discipline? Maybe you have a time machine that slows down time. But even if you did read all the papers, it's still beside the point. You're not hiring the person for their past work; you are using it to predict their future output. This is clairvoyance, and using article metrics is about as good as anything else I can think of.
Who are you? I am Researcher ID
In the past, disciplines could be defined by the existence of specialized journals. The launch of a specialized journal demonstrated that a certain critical mass had been reached. When enough researchers jump on a bandwagon, they can direct energy and funds to create a print journal. If PLoS One is about to pull down the boundaries of scientific disciplines in publishing, the notion of scientific disciplines may need to mutate. Don't get me wrong, I think scientific disciplines are really important. Practitioners of certain techniques and theories need to talk to each other in a creole of specialized terms for advances to be made. But this is not the kind of talk for polite company and general science journals. Specialized journals are a good place to have this conversation. But if specialized journals are about to be swallowed by PLoS One, where can we pick up the threads of such conversations?
My guess is that scientific disciplines will be derived from the network of citations in papers and the network of author collaborations. Although the mathematics of networks has been pretty much worked out, we are waiting for one of the citation databases to open up its API so that the corpus of scientific literature can actually be mapped. This will probably be a Google Scholar API (once legal issues are resolved) rather than Thomson Scientific or Scopus. Once that API is open, we will be able to construct detailed maps of disciplinary related articles in a machine-automated way. Related papers and authors will then be as clear as cleanly drawn nodes and hubs in a graph.
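The core idea is simple enough to sketch. Here is a toy version, assuming hypothetical paper IDs: treat each citation as a link between two papers, and read off the densely connected clusters as "disciplines". (A real analysis on millions of papers would use proper community-detection algorithms; plain connected components are the crudest possible stand-in.)

```python
from collections import defaultdict, deque

# Hypothetical citation links: an edge means one paper cites the other.
citations = [
    ("genomics-1", "genomics-2"), ("genomics-2", "genomics-3"),
    ("strings-1", "strings-2"),
]

# Build an undirected adjacency map from the citation list.
graph = defaultdict(set)
for a, b in citations:
    graph[a].add(b)
    graph[b].add(a)

def connected_components(graph):
    """Group papers into clusters reachable through citation links."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, cluster = deque([start]), set()
        while queue:
            node = queue.popleft()
            if node in cluster:
                continue
            cluster.add(node)
            queue.extend(graph[node] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters

print(connected_components(graph))
# Two clusters emerge: the "genomics" papers and the "strings" papers,
# without anyone ever having declared those disciplines up front.
```

The point of the sketch is that the disciplinary labels fall out of the link structure itself, which is exactly what an open citation API would let us do at scale.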
Unfortunately, after talking to Duncan Hull, I realized that there is one big obstacle to constructing this hypothetical world-historical map of citation networks. That obstacle is the problem of names. If we are to grind through the world's literature, we'd better be able to recognize who is who. Citations are easier to recognize, since they carry lots of interlocking pieces of metadata. But to classify authors, you need to first identify the correct name, and in that, you have one of the hardest computational linguistics problems staring you in the face: name disambiguation. First names, second names and family names differ across cultures. Some names change upon marriage. And common names are shared by multiple researchers. These need to be disentangled in order to assign the correct articles to the correct authors.
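To see why this is hard, consider the obvious first move: normalize every byline to a family name plus a first initial. A minimal sketch (all names hypothetical) shows the trap: the same key that merges the variant spellings of one author also collides two different researchers.

```python
import re

def name_key(name):
    """Naive author key: (family name, first initial), lowercased.
    Handles both "John Smith" and "Smith, J." byline styles."""
    name = name.strip()
    if "," in name:                      # "Smith, J." -> family, given
        family, given = [p.strip() for p in name.split(",", 1)]
    else:                                # "John Smith" -> given(s), family
        parts = name.split()
        family, given = parts[-1], " ".join(parts[:-1])
    initial = re.sub(r"[^A-Za-z]", "", given)[:1].lower()
    return (family.lower(), initial)

print(name_key("John Smith"))   # -> ('smith', 'j')
print(name_key("Smith, J."))    # -> ('smith', 'j')  same author, merged: good
print(name_key("Jane Smith"))   # -> ('smith', 'j')  different author: a false merge
```

Every refinement (adding middle initials, affiliations, co-author overlap) chips away at the false merges while creating false splits, which is why a unique Researcher ID assigned at the source is so much more attractive than guessing from strings.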
Working this out may be the greatest challenge in constructing the future networked book of science. What will probably happen is that various large institutions will fund a name disambiguation project for the scientific archive. In the future, it will probably be solved by the use of a universal global Researcher ID. Already, the NIH assigns every researcher an ID. One day, the NIH, or the NSF, or some international agency will assign every publishing author an ID on first application. It might even be PLoS ONE that does, if it ends up dominating the publishing world with its federated system.
But imagine this: with the network constructed, it's entirely possible to discover new disciplines through the simple analysis of citation patterns to a long-forgotten paper. Obscure collaborations may reveal new connections to old ideas. The tree of knowledge may even branch backwards in time.
In Isaac Asimov's Foundation series, he imagines a future where a sorry band of scholars is exiled to the edge of the galaxy to compile the "Encyclopedia Galactica", a collection of the known knowledge of the dying Galactic Empire, much as the medieval Christian monks tried to save the knowledge of the Roman Empire. We are at the point where such a project is conceivable. Asimov imagined that the descendants of these sorry scholars evolved into a virile society of innovative scientists and politicians who ended up building the foundations of the Second Galactic Empire. May our scientific vision be even a hundredth as grandiose.