A notch above a monkey » Speeding up Levenshtein

Speeding up Levenshtein

Simon proposes using a dictionary to match mistyped URLs with real ones. I’d probably like the idea better if I actually understood it. Still using Levenshtein can be a bit easier than using a spell checker responsively.

Easier, but slow. I used existing implementation by Magnus Lie Hetland and made a test with 245 blog titles using a simple script. 100 iterations on aging powerbook produced:

1.766s, 29.152s, 9.399s (min, max, avg)

Average time to calculate distance between randomly chosen title and rest of them is 9.4 seconds, which is far too much to be useful. And there’s not even 250 of them.

There are two obvious ways to speed things up. Since minimum distance is at least as much as a difference in length of both strings, there’s no point in calculating it when difference already exceeds a limit we chose.

The other trick took into an account that Levenshtein’s algorithm of two strings of comparable length has complexity of O(n2) and that my blog titles are form quite sparse space. If strings are longer than a certain limit (I arbitrarily chose 10 letters), then first calculate edit distance on a small sparse substring and then calculate the real distance only if first one was close enough.

When I made 1000 iterations of randomly chosen title using only first test, I got following results:

0.003s, 0.284s, 0.167s (min, max, avg)

However, if I used both checks, the same 1000 iteration test got me:

0.003s, 0.162s, 0.027s (min, max, avg)

So, two simple checks can speed up search up to 350 times. Not bad, but I’m not happy. This is certainly fast enough for a personal website or sites where site structure effectively divides searching to relatively small sets of possible hits, but there are plenty of sites where it would be too slow.

I tried to simplify searching using distance to map strings to coordinate system, which was at least hopelessly naive if not downright dumb. A much better approach would be to read more, which is what I’m doing now.

By the way, I’ve also tried to speed it up using Pyrex, but got pretty much same results, which means I either don’t know how to use Pyrex or Python optimizes well. Or both.

Comments (13)

  1. Have you tries using Levenshtein automata? The allow you to find all the words in a dictionary that are k changes or less from a source work. I used them to write a spell checker once.

    Comment by Ben — #
  2. $ python test_distance.py
    Time to compare 245 titles: 0.638s, 5.775s, 3.440s (min, max, avg)
    $ jed test_distance.py
    $ cat test_distance.py | head -3
    #!/usr/bin/env python
    import psyco
    $ python test_distance.py
    Time to compare 245 titles: 0.403s, 1.811s, 0.969s (min, max, avg)
    $ python -c ‘print 3.44/0.969′

    That’s a 3.5 times improvement just using psyco (python 2.4)

  3. Thanks to all. I’ll definitely take a look at both algorithms and thanks to Peter for reminding me of psyco.

    Comment by markos #
  4. Udi Manber has a version of Levenshtein which is faster than quadratic but which counts letter transpositions not as one error, as Levenshtein does, but as two (one deletion, one insertion). This is still good for typo correction. The algorithm is from 1991 or 1992, but I don’t remember off the top of my head which of his papers from that time it’s in.

    Comment by david — #
  5. Is difflib.get_close_matches useful?

    Comment by bearophile — #
  6. Markos,

    I’ve been trying to implement the Levenshtein algorithm for us and ran across your posting to help improve the performance of it. I’m not sure I understand what you mean when you say in your second point

    first calculate edit distance on a small sparse substring and then calculate the real distance only if first one was close enough

    as far as the first point when you are exceeding maximum distance requirement and you disregard further calculations, how can you determine that you have already exceeded this without calculating it?

    I guess not being a python guru hurts me in this, but you seem to have some good advice. Thanks and let me know when you can.

    Comment by Josh — #
  7. Since Levenshtein’s distance is at least as much as difference between length of two strings, we can start with two strings that have approximately the same length of n.

    Levenshtein’s algorithm in this case takes roughly n*n operations to calculate the distance. So calculating the distance of two 6 letter strings (36 operations) is around 10 times faster than performing the same with 20 letter strings (400 operations).

    What I propose is to calculate partial distance when strings length exceed a certain threshold.

    Let’s take two titles from my blog as an example. “HyperScope and my blog’s evolution” and “IE7 will remain to be MIME challenged” have approximately the same length (34 and 37 each), so they would probably pass the first test.

    Calculating real distance would take 1258 iterations, but if I take just first 6 letters (“HyperS” and “IE7 wi”) and calculate the Levenshtein’s distance of those substrings (which in this case is 6), it takes me only 36 operations to get the idea that these two strings are most likely not the same.

    If I got something like 1 or 2, then I couldn’t judge from that number alone and I would need to calculate the real distance. In that case my first, partial, calculation would just slow me down a bit.

    That’s why in my script I calculate partial results only for strings that are at least 10 letters long. Since 10 letter long strings take around twice as much time to calculate the L distance, I get to the break even point if only every second comparison of two long strings can be shortened with partial calculation.

    If strings are on average sufficiently different (that is the space they occupy is mathematically sparse) as they are in my case, the number of useful partial calculations far exceeds the number of those which are not. And longer the strings are, less often do partial calculations need to be definite. That is they are cheaper for longer strings.

    “My” algorithm is a bit fuzzy, since you have to make all sorts of decisions. You have to decide how close is close enough and at which lengths you get the best trade-off between certainty in result and price of calculations. You also need to know a bit about your strings.

    If all strings started with “Report:”, then it wouldn’t make any sense to use first six letters for partial calculations. You need to know where there is enough variations to make it worthwhile. In my case I can rely as much on my taste that I won’t start all titles the same way.

    Comment by markos #
  8. BUMP!
    I stumbled upon this site while looking for difflib info =) However my 2 cents worth: i also needed to search up similar strings, for much the similar reason as you. My solution consisted of processing ALL the candidate strings at the same time, in parallel. I did this by constructing a TRIE from the string list and during the search throwing away subtrees as soon as they were found to exceed the cutoff threshold.

    I believe my solution kind of subsumes yours, in the sense that it is more general and more or less independent of the string list size. The trouble here was that the static TRIE itself takes up *a lot* of memory, especially in python, so i had to optimize by merging non-branching node chains etc. In the end, the code was not nearly as elegant as my first 20 line attempt. Depending on the string list size (up to let’s say 1 million strings), the naive TRIE might well be enough.

    Oh and by the way, the answer to bearophile’s difflib question is: it’s very slow and highly unsuitable here =) Just look at the source code…

    Comment by Radim — #
  9. Hi,

    I have been playing with BK-Trees and VP-Trees as well. Here’s my implementation: http://well-adjusted.de/~jrschulz/mspace/mspace-pysrc.html

    Its levenshtein implementation is probably not very efficient, but it’s easy to use the C implementation someone mentioned above instead of the built-in function. And if you have Psyco installed, it will be used automatically.

    Comment by Jochen #

Sorry, the comment form is closed at this time.