Solving cryptograms in Clojure

December 31, 2009 at 10:33 AM

While working on Project Euler, I discovered Peter Norvig's elegant Sudoku solver implementation. Although the associated TDD discussion brought me much humor, it was the efficacy and flexibility of constraint propagation and search that really stuck with me. As I implemented this solver in Clojure, I had a small epiphany: one could also model cryptograms as a constraint problem and find a solution in a nearly identical manner. Here is my account of implementing such a solution.

Formulation

The problem is rather different than Sudoku. In some ways, it's simpler: we don't need to figure out which rows, columns, and boxes are affected by each state change. That eliminates a lot of bookkeeping, which is fortunate; figuring out how to compare a set of partially decrypted words to see whether they match an encrypted word and the current puzzle state is enough of a hassle.

I want to list the possible solutions for an encrypted word, and then use search and constraint propagation to eliminate these possibilities. To find the solution list, I'll convert each word into what I term its base form. This is done by replacing each character with a representation of its order of appearance. For example, both "seas" and "that" become "ABCA". Using integers naturally limits this technique to words with 10 or fewer distinct characters, so I have opted to use upper-case characters instead. To make a comprehensive list, I group all words in /usr/share/dict/words by their base form to create a dictionary keyed on base forms. We can then look up candidate solutions using the base form for any encrypted word. This solution is similar to the one required by the ever popular anagram interview question. It quite naturally introduces a limitation to our solver; namely, in order to decrypt a word it must appear in the computer's word list. I find this list on a standard *NIX distribution to be quite sufficient.

In Clojure, a single entry in the base form dictionary looks something like this:

{"ABBABC" ["beebes" "inning" "peeped" "peeper" "teeter"]}

When an encrypted word has only one candidate on its list, we can consider it solved and propagate that choice. If we encounter a state where an encrypted word has an empty candidate list, we know we've propagated impossibly conflicting constraints. Unlike the Sudoku problem, simply maintaining a list of potential valid states isn't enough. We really need to store a list of all applied decryption rules, and use them as follows.

Rules and eliminating candidates

Say we have the following encrypted word/candidate list with some yet-to-be-applied rules:

;;; Candidates
("vkkvkl" ["beebes" "inning" "peeped" "peeper" "teeter"])

;;; New rules
{\v \p}

The new rules say we should translate every encrypted "v" to a "p". We can integrate this into the base form and generate a partially decrypted base form for the encrypted word. We can then use that to further eliminate candidates.

;;; New base form for "vkkvkl", given our rules
"pBBpBC"

;;; New hybrid base forms for existing candidates
["ABBABC" "ABBABC" "pBBpBC" "pBBpBC" "ABBABC"])

;;; New candidates
("vkkvkl" ["peeped" "peeper"])

Propagation

Propagating the constraints is straightforward. We follow four simple rules:

  1. If any encrypted word has no candidates, we are in an inconsistent state. Return nil.
  2. Otherwise, if all words have only one candidate, return. We've found a solution!
  3. Otherwise, if any encrypted word has only one candidate, accept that candidate as a solution. Update the existing rules in light of this solution and apply them to all the other candidates recursively.
  4. Otherwise, return what we have so far — all words still have more than one candidate, so we have deduced all that we can. We'll fall back on search to progress further.

Search

Now that we have the basic tools for modeling and enforcing the constraints, we can easily write a function to fully apply given rules to a set of candidates. However, as seen above in point 4, we can encounter a state where we've eliminated some candidates but have not found a solution. Indeed, this is the initial problem state. In order to begin (and later to progress), we need to choose an encrypted word and one of its candidates and run with that pair. If it's an invalid pairing, we will reach an inconsistent state and backtrack to try the next pairing. This is a basic depth-first search. Our search will return the first valid result (where "valid" means all encrypted words are solved). We can optimize the search by always starting with first candidate of whichever unsolved encrypted word has the fewest number of candidates — this trims the search tree fairly well. We can also ensure that the set of decryption rules is actually changed by a given candidate before attempting to search further.

Concluding thoughts

Cryptograms often have many, many solutions. The first solution found by search is often not the most correct one — especially in terms of human language. There are some things that could be done in the search function to improve the chance of correctness: e.g. picking the next candidate by word/letter frequency. A more advanced system could use some NLP techniques to determine whether the decrypted solution passed as human language. All of these are moot, considering that the benefit provided wouldn't outweigh complexity they would add to this simple algorithm.

Caveat developer

I'm providing my implementation as a reference. Given that I'm a novice Clojurer, it's liable to be hideous and non-idiomatic. I found working in Clojure to be extremely satisfying, both for this project and for the Sudoku one. My solution is a mere 132 lines (whitespace/comments included), so this is obviously a very expressive language. I've been very impressed by flexibility of its persistent data structures and its lazy evaluation. These make it a very powerful language, and it's definitely a tool I want to sharpen.

cryptogram.clj