Developing With Algebras - Edit Distance

In a recent interview, I saw the remnants of a familiar algorithm scribbled over a whiteboard. It was one that asked: given a source string and an output string, how do you go from source to output?

I started to brainstorm about it on the spot. This was a problem that stumped me in the past. Could I solve it better now with new tools I’ve picked up over the past year?

This post is all about the process I went through towards solving that problem.

Early Design

Given just the problem statement, I knew I needed this much, at least:

f :: String -> String -> [?]

Next, I considered what actions an editing algorithm might do:

data Edit
  = Add Char
  | Rm Char
  | Replace Char Char

This became the foundation for the rest of my solution. This was my edit algebra, and it helped me consider all cases in turn.

In the past, I might have scrambled about, thinking, “There’s probably a dynamic programming solution hiding in there somewhere…”. There probably is! However, this doesn’t tell me how to solve the problem. It only tells me that there’s a way to get to a faster solution once I have a solution.

Next Steps: Parsing

So now I had more clarity - I knew I needed to get from the source string given the target string. The steps to get from one to the other are made up of the terms of my Edit algebra:

parse :: String -> String -> [Edit]

I opted to name this function parse since it helped me connect solving this problem with my previous experiences in parsing and evaluating mini-languages.

Let’s take a look at the solution:

parse :: String -> String -> [Edit]
parse = go
  where go (s:ss) (t:ts) = Replace s t : go ss ts
        go [] (t:ts) = Add t : go [] ts
        go (s:ss) [] = Rm s : go ss []
        go [] [] = []

parse works by making a linear pass over each string given to us. The case are as follows:

GHC’s case analysis tells me this is complete, which at least reassures me that I’ve considered all cases. I note now that this was not my first attempt. You’re seeing the pretty version at the end of the tunnel. I mucked about with the lengths of the source and target on my first pass in emitting instructions, and the outcome was not pretty.

Let’s see parse in action for further reassurance:

> parse "cat" "doge"
[Replace 'c' 'd',Replace 'a' 'o',Replace 't' 'g',Add 'e']

Nice! We can go from “cat” to “doge”. How about…

> parse "" "doge"
[Add 'd',Add 'o',Add 'g',Add 'e']

All paths lead to “doge”. Now we know. One more case:

> parse "doge" ""
[Rm 'd',Rm 'o',Rm 'g',Rm 'e']

Great! This is by no means thorough testing, but it gave me confidence that things appeared to work.

Co-Parsing: From Parsing to Transform

At this point, I had enough of an implementation to calculate the edit distance. It’s just the length of the instructions generated:

> length $ parse "cat" "doge"
4

The next thing I wanted to do was to verify that my instruction generator worked as expected. Given a set of Edit instructions and the source string, could I generate the target string?

edit :: String -> [Edit] -> Maybe String

This one took more time for me to solve. Not only was there the happy path evaluation to consider, but… what if I was given a source string that didn’t match up with the instructions?

So given a source string and a list of instructions, I’d like to output a string if the instructions match up with the source string. Here’s one solution:

edit xs' is' = sequence $ go xs' is'
  where go :: String -> [Edit] -> [Maybe Char]
        -- handling Replace edits
        go (x:xs) (Replace i1 i2:is) =
          if i1 == x then Just i2 : go xs is else [Nothing]
        go [] (Replace _ _:_) = [Nothing]

        -- handling Add edits
        go [] (Add i:is) = Just i : go [] is
        go (_:_) (Add _:_) = [Nothing]

        -- handling Rm edits
        go (x:xs) (Rm i:is) = if i == x then go xs is else [Nothing]
        go [] (Rm _:_) = [Nothing]

        -- handling termination
        go (_:_) [] = [Nothing]
        go [] [] = []

There’s a lot to consider in there, so we’ll unpack it bit by bit.

First, let’s consider Replace instructions, because those are the first emitted as a result of the form of the parse implementation. There’s three cases to consider:

  1. We have a Char x, and x matches what Replace expects
  2. We have a Char x, and x does not match what Replace expects
  3. We’re out of characters, and we encounter a Replace instruction

These three cases explain the first two lines of go.

Next, let’s consider Add:

  1. We’re out of characters, and we encounter an Add
  2. There are characters remaining, and we encounter an Add

An invariant on parse is that Add instructions are only emitted if the source string is exhausted and the target string has more characters remaining. As a result, if we encounter an Add along with characters in the given string, we’ve encountered a string that does match up with the given edits.

Next, we consider Rm:

  1. We have a Char x, and that x matches the Rm
  2. We have a Char x, and that x does not match the Rm
  3. We have no more characters, and we encounter an Rm

This one depends on similar invariants on parse. If we have no more characters, and we encounter a Rm, this string does not correspond to the given edits.

Finally, we must consider termination. The only happy termination is that we simultaneously run out of source string characters and edits. This still leaves the case where we have source characters remaining but no more instructions. This is another mismatch error, so I return [Nothing].

That’s the form of the algorithm! Next up, how do we verify that this behaves as expected?

Specifying Invariants

We know of at least one invariant already between parse and edit: given that we parsed x to get to y, if we use those Edits on x, we should be able to get back to y.

edit x (parse x y) == Just y

On the flip of the coin, if we’re given a string that isn’t x at the edit step, we should get back Nothing:

x /= z ==> edit z (parse x y) == Nothing

Very conveniently, the above snippets are executable QuickCheck properties! The (==>) operator instructs quickcheck to only consider cases where the given property is true. In this case, I assert that x must not equal z.

Let’s take them for a spin:

> import Test.QuickCheck
> let prop_one x y = edit x (parse x y) == Just y
> let prop_two x y z = x /= z ==> edit z (parse x y) == Nothing
> quickCheck prop_one
+++ OK, passed 100 tests.
> quickCheck prop_two
+++ OK, passed 100 tests.

Nice! I’ve laid out some invariants between edit and parse. In doing so, I’ve also made it clear that the behavior of edit is very closely coupled with that of parse.

Generalizing Things

After implementing parse and edit, I looked over what I’d written a few times. In doing so, I noticed that what I’d written so far only made two hard requirements on the data in operates on:

  1. The type must be wrapped in a List, [a]
  2. For edit, the type a must be comparable using ==.

This allowed me to generalize Edit, parse, edit, and go above to:

data Edit a
  = Add a
  | Rm a
  | Replace a a deriving Show

parse :: [a] -> [a] -> [Edit a]
edit :: Eq a => [a] -> [Edit a] -> Maybe [a]
go :: Eq a => [a] -> [Edit a] -> [Maybe a]

The Edit algebra is more general now and allows us to work on more types of data. Reading around, I noticed that this sort of algorithm is used in computational biology with things like DNA. Let’s give it a try!

data DNA = A | C | G | T deriving (Eq, Show)

Now, with parse and edit:

> parse [A,A,G,G,T] [C,A,A,A,A,A,T]
[Replace A C,Replace A A,Replace G A,Replace G A,Replace T A,Add A,Add T]
> edit [A,A,G,G,T] $ parse [A,A,G,G,T] [C,A,A,A,A,A,T]
Just [C,A,A,A,A,A,T]

Yay! I have a very weak background in biology, so I couldn’t tell you if either of the DNA fragments I gave above make any sense. parse and edit seem to handle these fragments well enough.

Next Steps and Closing

So that was fun. It was empowering to realize that being able to define an algebra gave me a powerful new tool for crafting solutions to problems. It sort of leaves me feeling that once I understand a domain well enough, I can solve any problem. This was not a feeling I felt quite so accutely in my years of working with C, C++, Python, and others.

There’s still more to be done, though. In writing this post, I realized that the current algorithm doesn’t yield the minimum edit distance. Consider the following case:

> parse "cot" "cat"
[Replace 'c' 'c',Replace 'o' 'a',Replace 't' 't']

For words that happen to line up exactly, we’ll get Replace instructions where the source and target are the same. One possible fix is to extend the Edit algebra as follows:

data Edit a
  = Add a
  | Rm a
  | Replace a a
  | Keep a deriving (Eq, Show)

This would involve adding the appropriate cases to edit and parse, and defining a distance function akin to:

let isKeep t = case t of {Keep _ -> True; _ -> False}
let distance x y = length . filter (not . isKeep) $ parse x y

Making it optimal - this is where it’s time to turn to classic algorithms. To make it at all, though, I’m happy to have gained new tools!

All code used in this post is available here.