Developing With Algebras - Edit Distance

In a recent interview, I saw the remnants of a familiar algorithm scribbled over a whiteboard. It was one that asked: given a source string and an output string, how do you go from source to output?

I started to brainstorm about it on the spot. This was a problem that stumped me in the past. Could I solve it better now with new tools I’ve picked up over the past year?

This post is all about the process I went through towards solving that problem.

Early Design

Given just the problem statement, I knew I needed this much, at least:

``f :: String -> String -> [?]``

Next, I considered what actions an editing algorithm might do:

``````data Edit
| Rm Char
| Replace Char Char``````

This became the foundation for the rest of my solution. This was my edit algebra, and it helped me consider all cases in turn.

In the past, I might have scrambled about, thinking, “There’s probably a dynamic programming solution hiding in there somewhere…”. There probably is! However, this doesn’t tell me how to solve the problem. It only tells me that there’s a way to get to a faster solution once I have a solution.

Next Steps: Parsing

So now I had more clarity - I knew I needed to get from the source string given the target string. The steps to get from one to the other are made up of the terms of my `Edit` algebra:

``parse :: String -> String -> ``

I opted to name this function `parse` since it helped me connect solving this problem with my previous experiences in parsing and evaluating mini-languages.

Let’s take a look at the solution:

``````parse :: String -> String -> 
parse = go
where go (s:ss) (t:ts) = Replace s t : go ss ts
go [] (t:ts) = Add t : go [] ts
go (s:ss) [] = Rm s : go ss []
go [] [] = []``````

`parse` works by making a linear pass over each string given to us. The case are as follows:

• If both the source and the target have characters remaining, we can `Replace` the character in the source with a character in the target
• If the source is out of characters, we just `Add` the characters in the target
• If the target is out of characters, we `Rm` characters from the source
• If they’re both empty, we’re done

GHC’s case analysis tells me this is complete, which at least reassures me that I’ve considered all cases. I note now that this was not my first attempt. You’re seeing the pretty version at the end of the tunnel. I mucked about with the lengths of the source and target on my first pass in emitting instructions, and the outcome was not pretty.

Let’s see `parse` in action for further reassurance:

``````> parse "cat" "doge"
[Replace 'c' 'd',Replace 'a' 'o',Replace 't' 'g',Add 'e']``````

Nice! We can go from “cat” to “doge”. How about…

``````> parse "" "doge"

All paths lead to “doge”. Now we know. One more case:

``````> parse "doge" ""
[Rm 'd',Rm 'o',Rm 'g',Rm 'e']``````

Great! This is by no means thorough testing, but it gave me confidence that things appeared to work.

Co-Parsing: From Parsing to Transform

At this point, I had enough of an implementation to calculate the edit distance. It’s just the length of the instructions generated:

``````> length \$ parse "cat" "doge"
4``````

The next thing I wanted to do was to verify that my instruction generator worked as expected. Given a set of Edit instructions and the source string, could I generate the target string?

``edit :: String ->  -> Maybe String``

This one took more time for me to solve. Not only was there the happy path evaluation to consider, but… what if I was given a source string that didn’t match up with the instructions?

So given a source string and a list of instructions, I’d like to output a string if the instructions match up with the source string. Here’s one solution:

``````edit xs' is' = sequence \$ go xs' is'
where go :: String ->  -> [Maybe Char]
-- handling Replace edits
go (x:xs) (Replace i1 i2:is) =
if i1 == x then Just i2 : go xs is else [Nothing]
go [] (Replace _ _:_) = [Nothing]

go [] (Add i:is) = Just i : go [] is
go (_:_) (Add _:_) = [Nothing]

-- handling Rm edits
go (x:xs) (Rm i:is) = if i == x then go xs is else [Nothing]
go [] (Rm _:_) = [Nothing]

-- handling termination
go (_:_) [] = [Nothing]
go [] [] = []``````

There’s a lot to consider in there, so we’ll unpack it bit by bit.

First, let’s consider `Replace` instructions, because those are the first emitted as a result of the form of the `parse` implementation. There’s three cases to consider:

1. We have a Char x, and x matches what `Replace` expects
2. We have a Char x, and x does not match what `Replace` expects
3. We’re out of characters, and we encounter a `Replace` instruction

These three cases explain the first two lines of `go`.

Next, let’s consider `Add`:

1. We’re out of characters, and we encounter an `Add`
2. There are characters remaining, and we encounter an `Add`

An invariant on `parse` is that `Add` instructions are only emitted if the source string is exhausted and the target string has more characters remaining. As a result, if we encounter an `Add` along with characters in the given string, we’ve encountered a string that does match up with the given edits.

Next, we consider `Rm`:

1. We have a Char x, and that x matches the `Rm`
2. We have a Char x, and that x does not match the `Rm`
3. We have no more characters, and we encounter an `Rm`

This one depends on similar invariants on `parse`. If we have no more characters, and we encounter a `Rm`, this string does not correspond to the given edits.

Finally, we must consider termination. The only happy termination is that we simultaneously run out of source string characters and edits. This still leaves the case where we have source characters remaining but no more instructions. This is another mismatch error, so I return `[Nothing]`.

That’s the form of the algorithm! Next up, how do we verify that this behaves as expected?

Specifying Invariants

We know of at least one invariant already between `parse` and `edit`: given that we parsed x to get to y, if we use those Edits on x, we should be able to get back to y.

``edit x (parse x y) == Just y``

On the flip of the coin, if we’re given a string that isn’t x at the edit step, we should get back `Nothing`:

``x /= z ==> edit z (parse x y) == Nothing``

Very conveniently, the above snippets are executable QuickCheck properties! The `(==>)` operator instructs quickcheck to only consider cases where the given property is true. In this case, I assert that x must not equal z.

Let’s take them for a spin:

``````> import Test.QuickCheck
> let prop_one x y = edit x (parse x y) == Just y
> let prop_two x y z = x /= z ==> edit z (parse x y) == Nothing
> quickCheck prop_one
+++ OK, passed 100 tests.
> quickCheck prop_two
+++ OK, passed 100 tests.``````

Nice! I’ve laid out some invariants between `edit` and `parse`. In doing so, I’ve also made it clear that the behavior of `edit` is very closely coupled with that of `parse`.

Generalizing Things

After implementing `parse` and `edit`, I looked over what I’d written a few times. In doing so, I noticed that what I’d written so far only made two hard requirements on the data in operates on:

1. The type must be wrapped in a List, `[a]`
2. For `edit`, the type `a` must be comparable using `==`.

This allowed me to generalize `Edit`, `parse`, `edit`, and `go` above to:

``````data Edit a
| Rm a
| Replace a a deriving Show

parse :: [a] -> [a] -> [Edit a]
edit :: Eq a => [a] -> [Edit a] -> Maybe [a]
go :: Eq a => [a] -> [Edit a] -> [Maybe a]``````

The `Edit` algebra is more general now and allows us to work on more types of data. Reading around, I noticed that this sort of algorithm is used in computational biology with things like DNA. Let’s give it a try!

``data DNA = A | C | G | T deriving (Eq, Show)``

Now, with `parse` and `edit`:

``````> parse [A,A,G,G,T] [C,A,A,A,A,A,T]
[Replace A C,Replace A A,Replace G A,Replace G A,Replace T A,Add A,Add T]
> edit [A,A,G,G,T] \$ parse [A,A,G,G,T] [C,A,A,A,A,A,T]
Just [C,A,A,A,A,A,T]``````

Yay! I have a very weak background in biology, so I couldn’t tell you if either of the DNA fragments I gave above make any sense. `parse` and `edit` seem to handle these fragments well enough.

Next Steps and Closing

So that was fun. It was empowering to realize that being able to define an algebra gave me a powerful new tool for crafting solutions to problems. It sort of leaves me feeling that once I understand a domain well enough, I can solve any problem. This was not a feeling I felt quite so accutely in my years of working with C, C++, Python, and others.

There’s still more to be done, though. In writing this post, I realized that the current algorithm doesn’t yield the minimum edit distance. Consider the following case:

``````> parse "cot" "cat"
[Replace 'c' 'c',Replace 'o' 'a',Replace 't' 't']``````

For words that happen to line up exactly, we’ll get `Replace` instructions where the source and target are the same. One possible fix is to extend the `Edit` algebra as follows:

``````data Edit a
This would involve adding the appropriate cases to `edit` and `parse`, and defining a `distance` function akin to:
``````let isKeep t = case t of {Keep _ -> True; _ -> False}