In a recent interview, I saw the remnants of a familiar algorithm scribbled over a whiteboard. It was one that asked: given a source string and an output string, how do you go from source to output?
I started to brainstorm about it on the spot. This was a problem that stumped me in the past. Could I solve it better now with new tools I’ve picked up over the past year?
This post is all about the process I went through towards solving that problem.
Given just the problem statement, I knew I needed this much, at least:
Next, I considered what actions an editing algorithm might do:
This became the foundation for the rest of my solution. This was my edit algebra, and it helped me consider all cases in turn.
In the past, I might have scrambled about, thinking, “There’s probably a dynamic programming solution hiding in there somewhere…”. There probably is! However, this doesn’t tell me how to solve the problem. It only tells me that there’s a way to get to a faster solution once I have a solution.
So now I had more clarity - I knew I needed to get from the source string given the target string. The steps to get from one to the other are made up of the terms of my
I opted to name this function
parse since it helped me connect solving this problem with my previous experiences in parsing and evaluating mini-languages.
Let’s take a look at the solution:
parse works by making a linear pass over each string given to us. The case are as follows:
Replacethe character in the source with a character in the target
Addthe characters in the target
Rmcharacters from the source
GHC’s case analysis tells me this is complete, which at least reassures me that I’ve considered all cases. I note now that this was not my first attempt. You’re seeing the pretty version at the end of the tunnel. I mucked about with the lengths of the source and target on my first pass in emitting instructions, and the outcome was not pretty.
parse in action for further reassurance:
Nice! We can go from “cat” to “doge”. How about…
All paths lead to “doge”. Now we know. One more case:
Great! This is by no means thorough testing, but it gave me confidence that things appeared to work.
At this point, I had enough of an implementation to calculate the edit distance. It’s just the length of the instructions generated:
The next thing I wanted to do was to verify that my instruction generator worked as expected. Given a set of Edit instructions and the source string, could I generate the target string?
This one took more time for me to solve. Not only was there the happy path evaluation to consider, but… what if I was given a source string that didn’t match up with the instructions?
So given a source string and a list of instructions, I’d like to output a string if the instructions match up with the source string. Here’s one solution:
edit xs' is' = sequence $ go xs' is' where go :: String -> [Edit] -> [Maybe Char] -- handling Replace edits go (x:xs) (Replace i1 i2:is) = if i1 == x then Just i2 : go xs is else [Nothing] go  (Replace _ _:_) = [Nothing] -- handling Add edits go  (Add i:is) = Just i : go  is go (_:_) (Add _:_) = [Nothing] -- handling Rm edits go (x:xs) (Rm i:is) = if i == x then go xs is else [Nothing] go  (Rm _:_) = [Nothing] -- handling termination go (_:_)  = [Nothing] go   = 
There’s a lot to consider in there, so we’ll unpack it bit by bit.
First, let’s consider
Replace instructions, because those are the first emitted as a result of the form of the
parse implementation. There’s three cases to consider:
These three cases explain the first two lines of
Next, let’s consider
An invariant on
parse is that
Add instructions are only emitted if the source string is exhausted and the target string has more characters remaining. As a result, if we encounter an
Add along with characters in the given string, we’ve encountered a string that does match up with the given edits.
Next, we consider
This one depends on similar invariants on
parse. If we have no more characters, and we encounter a
Rm, this string does not correspond to the given edits.
Finally, we must consider termination. The only happy termination is that we simultaneously run out of source string characters and edits. This still leaves the case where we have source characters remaining but no more instructions. This is another mismatch error, so I return
That’s the form of the algorithm! Next up, how do we verify that this behaves as expected?
We know of at least one invariant already between
edit: given that we parsed x to get to y, if we use those Edits on x, we should be able to get back to y.
On the flip of the coin, if we’re given a string that isn’t x at the edit step, we should get back
Very conveniently, the above snippets are executable QuickCheck properties! The
(==>) operator instructs quickcheck to only consider cases where the given property is true. In this case, I assert that x must not equal z.
Let’s take them for a spin:
Nice! I’ve laid out some invariants between
parse. In doing so, I’ve also made it clear that the behavior of
edit is very closely coupled with that of
edit, I looked over what I’d written a few times. In doing so, I noticed that what I’d written so far only made two hard requirements on the data in operates on:
edit, the type
amust be comparable using
This allowed me to generalize
go above to:
Edit algebra is more general now and allows us to work on more types of data. Reading around, I noticed that this sort of algorithm is used in computational biology with things like DNA. Let’s give it a try!
Yay! I have a very weak background in biology, so I couldn’t tell you if either of the DNA fragments I gave above make any sense.
edit seem to handle these fragments well enough.
So that was fun. It was empowering to realize that being able to define an algebra gave me a powerful new tool for crafting solutions to problems. It sort of leaves me feeling that once I understand a domain well enough, I can solve any problem. This was not a feeling I felt quite so accutely in my years of working with C, C++, Python, and others.
There’s still more to be done, though. In writing this post, I realized that the current algorithm doesn’t yield the minimum edit distance. Consider the following case:
For words that happen to line up exactly, we’ll get
Replace instructions where the source and target are the same. One possible fix is to extend the
Edit algebra as follows:
This would involve adding the appropriate cases to
parse, and defining a
distance function akin to:
Making it optimal - this is where it’s time to turn to classic algorithms. To make it at all, though, I’m happy to have gained new tools!
All code used in this post is available here.