Archiving a PHP-based Blog to Markdown text files

12 Jun 2013 // programming

If you've followed this blog for a while, you may have noticed some changes, not just to the front-end but to the back-end as well. I have gone from a PHP-backed blog server to a static website.

I've been chugging along with the blog engine Textpattern for almost 6 years. During that time, I've gradually leveled up my web skills to the point that I now consider myself a sufficient web ninja, deftly wielding HTML and CSS like a pair of hardened nunchaku.

As such, it became painful for me every time I went to my blog to make an update through the constricted web interface of Textpattern. Don't get me wrong, I think Textpattern is perfectly adequate for the web ignoramus I was 6 years ago. But now it's a straitjacket. I have to squeeze my CSS designs into the Textpattern presentation language. Worse, the only way I can test any kind of redesign tweak is to code it straight into the live site. I feel really bad when I do that.

I was never happy with the lack of an automatic excerpt function. I have to do a bunch of cut-and-pasting to get the content into the Textile format. And once uploaded, the articles are cumbersome to edit. What once was a great convenience is now a road of a thousand speedhumps.

The pain of editing my blog texts stands in stark contrast to my main desktop editor, Sublime Text 2. Sublime is a powerful editor: I can easily navigate multiple files, search-and-replace with regular expressions across them, and retrieve files with a simple command. Editing my entire backlog of articles in Sublime Text 2 would be lovely.

Looking back, I also realise that I would have got more mileage if I had chosen WordPress instead of Textpattern. A couple of years back, I started to suffer from a horrendous amount of comment spam and I had to shut down the comments. Fortunately I could switch to Disqus. But a WordPress update would have given me spam filtering.

I lived with this frustration for a long time, whilst watching my blog output slowly dwindle. But I could not stomach the thought of changing. There would be so much I would have to learn. But in the last few months, I've learnt a whole new bunch of web technologies: overwhelming Django frameworks, jingly Jinja2 templates, awesome HAML templates and slinky SASS style sheets. And then I thought, this is what I want to be doing all the time.

How hard would it be to convert my Textpattern blog, written in timeless PHP/MySQL, into a static website that I would maintain with Python, Markdown, Jinja2, HAML, SASS, and rsync?

But first I would have to archive my blog. This is how I did it.

I decided to grab the archive in the time-honoured way - screen scraping. Now I could have done a database dump from Textpattern, but parsing database dumps didn't look like a whole lot of fun. Also I don't know how databases work. Besides, screen scraping is easy and fun! And I get to work with a delightful XML format.

The actual download was done with that Unix workhorse wget. In default mode, wget was way too powerful: its recursive fetching would grab everything from my website. I ended up writing a bash script to get wget to fetch only the blog directories of my site.
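
The original bash script isn't reproduced here, but the idea of restricting a recursive wget run can be sketched in Python. This is only an illustration: the URL and directory names are hypothetical placeholders, and the flags used (`--recursive`, `--no-parent`, `--include-directories`, `--directory-prefix`, `--wait`) are standard wget options rather than the ones the script necessarily used.

```python
import subprocess


def build_wget_command(base_url, blog_dirs, dest="archive"):
    """Return a wget argument list that only recurses into blog_dirs."""
    return [
        "wget",
        "--recursive",                                    # follow links
        "--no-parent",                                    # never climb above base_url
        "--include-directories=" + ",".join(blog_dirs),   # whitelist of directories
        "--directory-prefix=" + dest,                     # where the pages are saved
        "--wait=1",                                       # pause between requests
        base_url,
    ]


def fetch_blog(base_url, blog_dirs, dest="archive"):
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.check_call(build_wget_command(base_url, blog_dirs, dest))


# Hypothetical usage (not executed here):
#   fetch_blog("http://example.com/", ["/blog", "/article"])
```

Building the argument list separately from running it makes it easy to print and check the command before letting it loose on a live site.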

From the scraped web pages, I had to strip away all the HTML markup and sidebars to get at the content. I also had to extract metadata such as the publish date and categories. More importantly, I had to record the URL of each article, because I want to preserve the links, especially since I use Disqus to handle the comments on my blog. This was accomplished with Python scripts using the HTML parser BeautifulSoup.
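
To give a flavour of the extraction step, here is a minimal sketch. Note that it swaps BeautifulSoup for the standard library's `html.parser`, purely to stay dependency-free, and it targets Python 3. The container id `content` is a hypothetical placeholder, not Textpattern's actual markup, and the sketch assumes reasonably well-nested HTML.

```python
import html
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ArticleExtractor(HTMLParser):
    """Collects the <title> text and the inner HTML of one container element."""

    def __init__(self, container_id="content"):
        super().__init__(convert_charrefs=False)
        self.container_id = container_id
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._depth = 0  # greater than 0 while inside the container

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        if self._depth > 0:
            # Inside the container: keep the tag exactly as written.
            self.body_parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self._depth += 1
        elif dict(attrs).get("id") == self.container_id:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if self._depth > 0 and tag not in VOID_TAGS:
            self._depth -= 1
            if self._depth > 0:
                self.body_parts.append("</{0}>".format(tag))

    def _text(self, raw):
        if self._in_title:
            self.title_parts.append(raw)
        if self._depth > 0:
            self.body_parts.append(raw)

    def handle_data(self, data):
        self._text(data)

    def handle_entityref(self, name):
        self._text("&{0};".format(name))  # keep entities escaped in the body

    def handle_charref(self, name):
        self._text("&#{0};".format(name))


def extract_article(page_html, container_id="content"):
    """Return (title, body_html) for one scraped page."""
    parser = ArticleExtractor(container_id)
    parser.feed(page_html)
    parser.close()
    title = html.unescape("".join(parser.title_parts)).strip()
    return title, "".join(parser.body_parts).strip()
```

The sidebar and the rest of the page chrome are dropped simply because they sit outside the chosen container; the publish date and categories could be picked up the same way once their markup is known.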

But what to do with the content? It's in HTML, and that's kind of ugly and unreadable. Well, I used the truly marvelous document transformer pandoc to convert the HTML into Markdown. You need this tool if you do any kind of text munging: it offers an unbelievable number of interchangeable formats. pandoc is so good, it even kind of makes me want to learn Haskell.
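
Driving pandoc from a Python script can be as simple as piping the HTML through it. The sketch below assumes pandoc is installed and on the PATH; the function name is made up for illustration, and the exact options used for the original conversion are not known.

```python
import subprocess

# pandoc reads from stdin and writes to stdout when no file names are given.
PANDOC_CMD = ["pandoc", "--from=html", "--to=markdown"]


def html_to_markdown(html_text):
    """Convert an HTML fragment to Markdown by shelling out to pandoc."""
    result = subprocess.run(
        PANDOC_CMD,
        input=html_text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise if pandoc reports an error
    )
    return result.stdout.decode("utf-8")


# Hypothetical usage (requires pandoc, so not executed here):
#   markdown = html_to_markdown("<p>Hello <em>world</em></p>")
```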

So now I've saved all my blog texts as lovely Markdown text files, with all the relevant metadata - date, title, URL - in a YAML header. But there were still a few loose ends. Some posts had links to videos and embedded links. Alas pandoc cannot convert these - I had to manually put them back in. More importantly, there were unicode characters sprinkled all over my Markdown. You don't want this because it's unpredictable how they will show up further down the line.
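
For reference, writing a post with a YAML header might look like the sketch below. The field names follow the metadata listed above (date, title, URL), but the `---` delimiters and the exact layout are assumptions rather than the format actually used. `json.dumps` is used for quoting because a JSON string is generally accepted by YAML parsers as a double-quoted scalar.

```python
import json


def format_post(title, date, url, markdown_body):
    """Return the text of one archived post: YAML header, blank line, body."""
    header = [
        "---",
        "title: " + json.dumps(title),  # quoting keeps colons etc. safe
        "date: " + json.dumps(date),
        "url: " + json.dumps(url),
        "---",
    ]
    return "\n".join(header) + "\n\n" + markdown_body.rstrip("\n") + "\n"


# Example with made-up values:
print(format_post("Hello: world", "2013-06-12", "/blog/hello", "Some *text*."))
```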

Fixing the stray unicode turned out to be easy. I wrote a little unicode sanitizer which converts all non-ASCII characters in a unicode string into escaped HTML entities:

try:
  from html.entities import codepoint2name  # Python 3
except ImportError:
  from htmlentitydefs import codepoint2name  # Python 2

def unicode_to_entities(text):
  """
  Identifies unicode characters that can be converted to
  HTML-safe html-entities. Also translates smart single and
  double quotes into normal double and single quotes, and 
  turns ellipses into three full-stops.
  """
  new_lines = []
  for line in text.splitlines():
    pieces = []
    for ch in line:
      codepoint = ord(ch)
      if codepoint > 127:  # anything outside the ASCII range
        if codepoint in codepoint2name:
          html = '&' + codepoint2name[codepoint] + ';'
          html = clean_quotes(html)
          pieces.append(html)
        else:
          html = '&#{0};'.format(codepoint)
          pieces.append(html)
      else:
        pieces.append(ch)
    new_lines.append(''.join(pieces))
  return "\n".join(new_lines)

Also, I didn't like how Textpattern would add fancy quotes and ellipses to my text, so I got rid of them:

def clean_quotes(html):
  if html == '&ldquo;' or html == '&rdquo;':
    return '"'
  if html in ['&lsquo;', '&rsquo;', '&sbquo;', '&prime;']:
    return '\''
  if html in ['&hellip;']:
    return '...'
  return html
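
As an aside, if numeric references are acceptable in place of named entities such as `&eacute;`, Python 3's built-in `xmlcharrefreplace` error handler does most of the work. This is a shorter alternative sketch, not the code used for the archive, and the punctuation table is only an illustrative subset.

```python
# Smart punctuation mapped to plain ASCII before escaping (illustrative subset).
PLAIN_PUNCTUATION = {
    0x201C: '"',    # left double quote
    0x201D: '"',    # right double quote
    0x2018: "'",    # left single quote
    0x2019: "'",    # right single quote
    0x2026: "...",  # horizontal ellipsis
}


def to_ascii_entities(text):
    """Replace smart punctuation, then escape remaining non-ASCII characters."""
    plain = text.translate(PLAIN_PUNCTUATION)
    # xmlcharrefreplace turns each unencodable character into a &#NNN; reference.
    return plain.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_entities("\u201cCaf\u00e9\u201d\u2026"))  # prints "Caf&#233;"...
```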

After all this, I had a clean copy of my blog, in beautiful text-based markdown files. Searchable, editable, and ready to go. Now I was ready to try different static web generators, but that's a story for another post.