The wonderful world of Character Sets

January 19, 2007 at 3:53 PM

Avast yer browsing! I ran into a funny story about character set mishaps on Language Log today. If only I were having problems with unintentionally affiliating my work with piracy--alas, no. Hopefully, you're using a browser that actually displays Unicode correctly. If not, you'll be perplexed by the icon below. (What icon? Ruh roh... I'd suggest emailing the makers of your wonderful browser and telling them to read this article on Unicode.)

What I am running into in my capstone project is this: Thus far, I've imported roughly 10,000 of Project Gutenberg's E-Texts. That's a lot o' document, if you ask me. Many of these texts are available in multiple encodings, and multiple character sets. For simplicity's sake, I'm only grabbing texts that are available in raw text--no zips, no HTML, et cetera.

Despite my filtering by encoding type and my best efforts with iconv, many of the text streams throw encoding errors when I try to insert them into the database. Most are available in an alternate character set, but (so far) 7.39% are only available in what I will term an "offensive" format.
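To make the failure concrete, here is a minimal Python sketch of the kind of strict check that separates clean texts from "offensive" ones. It is an illustration, not the importer described above (which leaned on iconv); the function name `decode_strict` and the list of candidate encodings are assumptions.

```python
def decode_strict(raw, candidates=("ascii", "utf-8")):
    """Return (text, encoding) for the first candidate that decodes
    the bytes without error, or None if every candidate fails."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the byte stream; try the next.
            continue
    return None


if __name__ == "__main__":
    print(decode_strict(b"plain text"))    # decodes as ASCII
    print(decode_strict(b"caf\xc3\xa9"))   # valid UTF-8
    print(decode_strict(b"caf\xe9"))       # a lone 0xE9 byte: neither ASCII nor UTF-8
```

Note that adding "latin-1" to the candidates would make every text "succeed", since that codec maps all 256 byte values to characters. That hides mislabeled files rather than fixing them, which is why it is left out here.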

Were I more concerned with writing a perfect importation program, I would deal with the offending characters in some manner and salvage the rest of the text. But I'm happy with the number of texts I have so far. An important aspect of software engineering is realizing when to trade the theoretical ideal for a practical implementation--I'm going to be swamped implementing the features that do make the final cut in this project, and missing <8% of the available seed texts is hardly a rational concern.
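For the curious, the "salvage" route I'm skipping could look roughly like the sketch below: decode leniently, swap each undecodable byte sequence for the Unicode replacement character, and count the damage. The function name `salvage` and the choice of UTF-8 as the target are assumptions for illustration, not something this project does.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


def salvage(raw, encoding="utf-8"):
    """Decode raw bytes, substituting U+FFFD for anything undecodable.

    Returns (text, lost), where lost is the number of replacement
    characters in the result. If the source legitimately contains
    U+FFFD, this count overstates the damage.
    """
    text = raw.decode(encoding, errors="replace")
    return text, text.count(REPLACEMENT)


if __name__ == "__main__":
    text, lost = salvage(b"caf\xe9 au lait")
    print(repr(text), lost)
```

The trade-off is that the database ends up holding text with holes in it, so a threshold on `lost` would still be needed to decide which documents are worth keeping.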

Update: I've finished processing the catalog. Here are the final figures...

 potential | processed | errors | valid |      percent_err
-----------+-----------+--------+-------+------------------------
     19573 |     18718 |   2646 | 16072 | 0.14136125654450261780

What this means is that out of 19,573 uniquely titled works, I attempted to process 18,718. The difference between those two numbers lies in available format: I only accepted raw text. Out of those 18,718 documents, 16,072 were entered successfully; the other 2,646 (~14%) were lost due to character set incompatibility. Frankly, 16,072 documents is more than enough for my purposes, and so I'm moving on.
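The arithmetic behind those figures is easy to check. The short Python sketch below recomputes the derived numbers from the three raw counts; the variable names are mine, chosen for illustration.

```python
potential = 19573   # uniquely titled works in the catalog
processed = 18718   # works available as raw text
errors = 2646       # works rejected for encoding errors

skipped = potential - processed   # never attempted: no raw-text version
valid = processed - errors        # successfully inserted
error_rate = errors / processed   # fraction of attempted works that failed

if __name__ == "__main__":
    print(f"skipped:    {skipped}")
    print(f"valid:      {valid}")
    print(f"error rate: {error_rate:.2%}")
```

The error rate is taken over the works I attempted, not over the whole catalog, which is why the denominator is 18,718 rather than 19,573.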