How to export a blogspot blog to HTML/GitHub with C#

A LINQ-to-XML Example 01 Jul 2014

I finally did it: I bought LINQPad’s Code Completion so I could write C# scripts easily. Now, sure, I could have used C# script for free, but… wait, why didn’t I do that?

Anyway, in my previous post I explained in detail how to set up a blog on GitHub. Now it’s time to convert my old blogger blog to a shiny new GitHub format.

To export your blog from Blogger, log into Blogger, go to your blog control panel, go to the __Settings

Other__ tab, and click “Export Blog”.

You get an XML file that is basically in Atom format. It’s hard to understand because it doesn’t have any line breaks (apart from the line breaks in your blog template, which just serve to distract you). It’s a <feed> root element containing a bunch of <entry> elements, some of which contain your posts and others of which contain metadata.

Here’s the code I came up with to export to GitHub. Just paste this into LINQPad or whatever, change the filepath to point to your xml file, and run it!

void Main()
{
	string filepath = @"C:\Downloads\Blog.xml";
	string text = File.ReadAllText(filepath);
	XDocument doc = XDocument.Parse(text);
	
	// Use XNamespaces to deal with those pesky "xmlns" attributes.
	// The underscore represents the default namespace.
	var _ = XNamespace.Get("http://www.w3.org/2005/Atom");
	var app = XNamespace.Get("http://purl.org/atom/app#");
	
	var posts = doc.Root.Elements(_+"entry")
		// An <entry> is either a post, or some bit of metadata no one cares about.
		// Exclude entries that don't have a child like <category term="...#post"/>
		.Where(entry => entry.Element(_+"category").Attribute("term").ToString().Contains("#post"))
		// Exclude any entries with an <app:draft> element except <app:draft>no</app:draft>
		.Where(entry => !entry.Descendants(app+"draft").Any(draft => draft.Value != "no"));
	
	var outfolder = Path.Combine(Path.GetDirectoryName(filepath), Path.GetFileNameWithoutExtension(filepath));
	Directory.CreateDirectory(outfolder);
	
	foreach (var entry in posts)
	{
	   // Extract data from XML
		DateTime published = DateTime.Parse(entry.Element(_+"published").Value);
		DateTime updated = DateTime.Parse(entry.Element(_+"updated").Value);
		string title = entry.Element(_+"title").Value;
		string content = entry.Element(_+"content").Value;
		string type = entry.Element(_+"content").Attribute("type").Value ?? "html";
		XElement empty = new XElement("empty");
		XAttribute emptA = new XAttribute("empty","");
		string originalLink = ((entry.Elements(_+"link")
			.FirstOrDefault(e => e.Attribute("rel").Value == "alternate") ?? empty)
			.Attribute("href") ?? emptA).Value;
		string outFileName = string.Format("{0:yyyy-MM-dd}-{1}.{2}", published,
		       Path.GetFileNameWithoutExtension(originalLink), type);
		var outPath = Path.Combine(outfolder, outFileName);
		
		if (content.Count(c => c == '\n') <= 3)
			content = AddLineBreaks(content); // optional
		
		// Write output file (partial HTML for Jekyll)
		using (StreamWriter output = File.CreateText(outPath)) {
			output.WriteLine("---");
			output.WriteLine("title: \"{0}\"", title);
			output.WriteLine("layout: post");
			output.WriteLine("# Pulled from Blogger. Last updated there on: {0:yyyy-MM-dd}", updated);
			output.WriteLine("---");
			if (originalLink != "")
				output.WriteLine("<small><p><i>This post was imported from "+
				 "<a href='{0}'>blogspot</a>.</i></p></small>", originalLink);
			output.WriteLine(""); // Disable Jekyll/Liquid
			output.Write(content);
			output.WriteLine("");
		}
	}
}

It will create a folder named after the xml file, and inside that folder it will create an html file for each post, like this:

2007-09-03-hello-no-one.html
2011-07-05-why-wpf-sucks.html
2012-06-07-smart-tabs.html
2013-05-28-onward.html

These filenames are in the correct format for Jekyll, so if you’re moving to GitHub, just move all these files to your /_posts folder, commit, and you’re done! If you want “proper” HTML files, modify the code above to produce proper code like <html><head>...</head>... instead of Jekyll front-matter.

Oh, by the way, Blogger’s exported HTML contains no line breaks at all in your posts. So I wrote this little method to add some line breaks at appropriate spots:

string AddLineBreaks(string content)
{
	var sb = new StringBuilder(content.Length + 100);
	
	bool pre = false, fail;
	for (UString rest = content; !rest.IsEmpty;) {
		if (rest.StartsWith("<pre")) pre = true;
		if (rest.StartsWith("</pre")) pre = false;
		bool s;
		if ((s = rest.StartsWith("<br />")) || rest.StartsWith("<br/>")) {
			sb.Append(pre ? "\n" : "<br/>\n");
			rest = rest.Substring(s ? 6 : 5);
			continue;
		}
		if (rest.StartsWith("<li>") || rest.StartsWith("<p>") || rest.StartsWith("<tr>") || rest.StartsWith("<pre>") || rest.StartsWith("<blockquote>") || rest.StartsWith("<img"))
			sb.Append('\n');
		if (rest.StartsWith("</ul>") || rest.StartsWith("</ol>") || rest.StartsWith("</blockquote>"))
			sb.Append('\n');
		char c = (char)rest.PopFront(out fail);
		if (!fail) sb.Append(c);
	}
	
	return sb.ToString();
}

This relies on UString in my Loyc.Essentials.dll library though (it’s a kind of string slice).

The code was good enough for me, and hopefully it will be good enough for you… but I don’t know if images work (certainly Blogger doesn’t include images in the export file). Note: by default the HTML in blogspot has an “implicit” line breaks feature in which newlines are converted to <br/> for you. I turned off that feature because it often screws up formatting of nontrivial posts; if you left that option on, I’m not sure if the HTML that blogger gives to you in the XML file includes those auto-inserted <br/>s.

Published on CodeProject

Loyc

How to export a blogspot blog to HTML/GitHub with C#

Possibly Related Posts

Comments