Programming N-Grams in D

January 2, 2007 at 12:02 PM

I'm looking into using the D programming language for part of my capstone project this coming quarter. It's a really attractive-looking language, supposedly taking the speed of C and the object-oriented structure of Java to a new level.

Part of my capstone project may involve automatic classification of texts using a computational linguistics model known as the N-gram. N-gram analysis involves counting all occurrences of each ordered tuple of N consecutive words. While a 1-gram count is a simple word frequency count, a 5-gram count tells you how often each five-word phrase occurs in a document. A basic explanation of how N-grams can be used for classification can be found in a paper by Cavnar & Trenkle.
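
To make the counting concrete, here is a minimal sketch in Python (not the D program this post is planning). The function name `count_ngrams` and the sample sentence are made up for illustration, and it assumes words are simply whitespace-separated and lowercased, which a real tokenizer would handle more carefully.

```python
from collections import Counter


def count_ngrams(text, n):
    """Count every run of n consecutive words in text (hypothetical helper)."""
    # Assumption: naive tokenization by lowercasing and splitting on whitespace.
    words = text.lower().split()
    # Slide a window of width n across the word list; each window is one n-gram.
    windows = (tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(windows)


sample = "the cat sat on the mat and the cat slept"
unigrams = count_ngrams(sample, 1)  # plain word frequencies
bigrams = count_ngrams(sample, 2)   # frequencies of two-word phrases

print(unigrams[("the",)])       # 3
print(bigrams[("the", "cat")])  # 2
```

A text with fewer than n words simply yields an empty counter. Keeping every distinct tuple in memory is fine for one sentence, but it is exactly the part that gets expensive at N=10 over thousands of books.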

D enters the scene because it purports to combine speed with syntactic (and structural) sugar. Extracting all of the N-grams up to N=10 from 4,000 Project Gutenberg E-texts is an interesting problem. Writing the program in Python or Ruby would alleviate many implementation headaches, but their relative inefficiency would make it nightmarishly slow. Coding the program in C could prove to be another kind of nightmare: there's nothing more frustrating than debugging layer upon layer of memory management problems.

A few features of D make me extremely excited to try it out:

Garbage collection
Dynamic closures
Array slicing
OOP goodies (interfaces, inheritance, overloading)
Contract programming
Unit testing

And those are compiler-level features! No stumbling through JUnit or writing Ruby extensions here. Hopefully, it'll live up to my expectations.


Quod erat faciendum

A technical blog by Brendan Ribera
