Conversion to Movable Type - Part 1

09 Jul 2011

The original Gillius.org site was entirely static content. To move to Movable Type, I first needed to convert the original site's content in a way that stayed compatible with the old site in looks, content, and even the URLs themselves. I chose MT partly because it appeared to be easier to import into. I also looked at Drupal and WordPress.

In this first part, I will cover conversion of the blog posts, conversion of the style, and parsing the original static HTML pages into a format that can be uploaded into MT. Read on for the details.

Why Movable Type?

Ultimately, I chose MT because it allowed for "static publishing" -- meaning it would generate static HTML pages and put them on the site, while still providing hooks for comments/trackbacks/etc. so that I can update the site easily. My host is not very fast at serving PHP and other dynamic content, and I've always enjoyed my site's very low overhead and fast loading, even on mobile devices. I'm not trying to be the next media mogul here.

Another requirement was a remotely accessible interface (web service). MT supports XMLRPC, as do other solutions -- there are even "standard" interfaces like the MetaWeblog API, and WordPress has its own API, some of whose calls MT also supports. The biggest problem was that I had a horrible time trying to find actual documentation for MT. Ultimately, to figure out the details of some items, I ended up diving into the source code. Even though I'm not a Perl wizard, I was able to glean enough to see how some of the calls worked. That came in handy for working around or fixing some of the quirks/bugs/limitations I encountered when trying to import the site. This is one situation where it's really handy to work with an open-source project.

I noticed two main versions of MT, MT 4 and MT 5. I chose MT 5 at first simply because it was the latest, but then noticed that, at least at this time, the vast majority of users are still on MT 4. I'm not completely sure why. My understanding is that MT 5 supports the concept of a "website" in addition to a "blog", which was important for my site -- it's primarily a website and secondarily a blog, although I wanted to move towards the latter over time.

I ended up writing a custom piece of software in Java to convert my site (except for the blog part, which was easy enough to do in pure SQL).

Converting the Blog

The blog was by far the easiest part to convert. The original blog was a simple PHP script that I wrote against a MySQL database. To move this to Movable Type, I just performed a SQL SELECT to put the data into the Movable Type import/export format as documented here. The only fields in my news items were title, date, and content (as a string of HTML), which made the export/import straightforward. I was able to do the export with a single statement:

SELECT CONCAT('TITLE: ', `title`, '\nDATE: ', DATE_FORMAT( `newsDate`, '%m/%d/%Y %H:%i:%s' ), '\n-----\nBODY:\n', `content`, '\n-----\n--------' )
FROM `News`
ORDER BY `newsDate` ASC

The output then looks like the following:

--------
TITLE: Super IsoBomb 3D Announced
DATE: 09/15/2003 00:00:00
-----
BODY:
A continuation to the Super IsoBomb game...

The News table is specific to my site, but if you had a custom system or even another database-based blog, the SQL may be similar. You just need a view that consists of the above items, but more importantly the content needs to be an HTML fragment and not in some markup language.
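If the source data does not live in a database that can do the string concatenation itself, the same record layout can be produced in code. The following is a minimal Java sketch that mirrors the SELECT above. The class and method names are hypothetical, and it covers only the three fields used here (title, date, body), not the full MT import format.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class MtImportFormat {
    // Same layout as DATE_FORMAT(..., '%m/%d/%Y %H:%i:%s') in the SQL above.
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss");

    // Builds one entry; htmlBody must already be an HTML fragment.
    static String formatEntry(String title, LocalDateTime date, String htmlBody) {
        return "TITLE: " + title + "\n"
                + "DATE: " + DATE.format(date) + "\n"
                + "-----\n"
                + "BODY:\n"
                + htmlBody + "\n"
                + "-----\n"
                + "--------\n";
    }

    public static void main(String[] args) {
        System.out.print(formatEntry(
                "Super IsoBomb 3D Announced",
                LocalDateTime.of(2003, 9, 15, 0, 0, 0),
                "A continuation to the Super IsoBomb game..."));
    }
}
```

Concatenating the result of formatEntry for each item, oldest first, should give a file with the same shape as the SQL output.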

Converting the Style

My static HTML site was originally based on Dreamweaver-style templates, since I worked with Dreamweaver in the late 90s. I used only the basic template functionality: a template is a page you can view on its own, with markers that indicate where each page's content should be placed:

<html><!-- #BeginTemplate "/Templates/main.dwt" --><!-- DW6 -->
<head>
<!-- #BeginEditable "doctitle" -->
<title>Contact Gillius</title>
<!-- #EndEditable -->
...
</head>
<body>
... template content ...
<div class="contentArea"><!-- #BeginEditable "content" -->
<!-- #EndEditable --> </div>
</body>
<!-- #EndTemplate --></html>

Since everything generated was just plain HTML, after I lost access to Dreamweaver I could still edit it by hand (using gVim) and upload it by SCP. You can also see how easy it would be to replace the template content with regular expressions -- just select everything between the unique tags and replace. Even so, it was a pain, as you can see from the very limited updates to the site in the past years. This was my motivation to move to a content management system.
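To illustrate the "select everything between the unique tags and replace" idea, here is a minimal Java sketch. The class and method names are hypothetical, and this is not the script I actually used on the site. It assumes the two marker strings appear literally in the page, and it replaces only the first matching span.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateRegions {
    // Replaces whatever sits between the first occurrence of start and the
    // next occurrence of end, keeping both markers in place.
    static String replaceBetween(String html, String start, String end,
                                 String replacement) {
        Pattern p = Pattern.compile(
                "(?s)(" + Pattern.quote(start) + ").*?(" + Pattern.quote(end) + ")");
        // quoteReplacement keeps any '$' or '\' in the new text literal.
        return p.matcher(html).replaceFirst(
                "$1" + Matcher.quoteReplacement(replacement) + "$2");
    }

    public static void main(String[] args) {
        String page = "<!-- #EndEditable --><p>old header</p>"
                + "<!-- #BeginEditable \"content\" -->body<!-- #EndEditable -->";
        System.out.println(replaceBetween(page,
                "<!-- #EndEditable -->",
                "<!-- #BeginEditable \"content\" -->",
                "<p>new header</p>"));
    }
}
```

If the markers are not found, the page comes back unchanged, so a batch run over the whole site would leave non-template pages alone.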

Since my pages were structured this way, I could copy and paste the shared portions of the template into MT's template design system, and bring over my CSS styles. I had to tweak some of MT's built-in widget HTML to add/change styles to make it easier to fit into my existing CSS. I thought about a site redesign, but decided to keep things easy for myself and tackle only one problem at a time and try to convert the site verbatim before I redesign when I have the time and motivation.

Exporting the Pages

Ultimately, for the export process, I would need to get the pages into a similar format in memory to export via a series of XMLRPC calls. Pages are basically like blog posts; for each one you need the following:

Location of the page (aka the "permalink")
Date the page was last updated
Title of the page
Content of the page as an HTML fragment

Since I have the web site mirrored on my disk and in version control, it was easy to get the first two items just by traversing the filesystem and looking at each file's name and date. Getting the title is possible since it's in the HTML head, and getting the content was easy because it's simply everything between the "content" editable tags. Therefore, I was able to use the following Java patterns:

    // Captures the text of the <title> element in the page head.
    public static final Pattern titlePattern = Pattern.compile(
            "(?s)(?i)<html.*?>.*<head>.*<title>(.*)</title>" );

    // Tried in order: the template's "content" region first, then the whole <body>.
    public static final List<Pattern> contentPatterns = asList(
            Pattern.compile( "(?s)<!-- #BeginEditable \"content\" -->(.*)<!-- #EndEditable -->" ),
            Pattern.compile( "(?s)(?i)<body.*?>(.*)</body>" )
    );

The second pattern in the contentPatterns list is a fallback in case the first one fails to select any content. Not all of the pages on my site fit into the template, and for those pages I simply wanted to pick out all of the HTML content in the body. There were a few pages I excluded from the export, such as the GNE tutorials, that don't use my site template at all since they are meant to be distributed standalone. As with any export process, I also found a few "exceptions" to the rule that I just had to smooth over by hand. Now that I had my pages broken up into the 4 pieces, I was ready to upload them with XMLRPC.
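One way the patterns could be applied in order, with the body pattern as the fallback, is sketched below. The regular expressions are the ones shown above; the class name, the helper methods, the use of Optional, and the trimming of the captured text are assumptions made for this illustration rather than my converter's actual code.

```java
import static java.util.Arrays.asList;

import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageExtractor {
    static final Pattern titlePattern = Pattern.compile(
            "(?s)(?i)<html.*?>.*<head>.*<title>(.*)</title>");

    static final List<Pattern> contentPatterns = asList(
            Pattern.compile("(?s)<!-- #BeginEditable \"content\" -->(.*)<!-- #EndEditable -->"),
            Pattern.compile("(?s)(?i)<body.*?>(.*)</body>"));

    // Returns capture group 1 (trimmed) if the pattern is found in the page.
    static Optional<String> firstGroup(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    static Optional<String> extractTitle(String html) {
        return firstGroup(titlePattern, html);
    }

    // Tries each content pattern in list order and stops at the first match.
    static Optional<String> extractContent(String html) {
        for (Pattern pattern : contentPatterns) {
            Optional<String> content = firstGroup(pattern, html);
            if (content.isPresent()) {
                return content;
            }
        }
        return Optional.empty();
    }
}
```

A page for which both methods return an empty Optional is a candidate for the "smooth over by hand" pile.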

The second part covers XMLRPC and uploading the static HTML content into Movable Type.

The third part covers the asset (images, zip files, etc.) conversion.