A notch above a monkey

Speeding up Levenshtein

Simon proposes using a dictionary to match mistyped URLs with real ones. I’d probably like the idea better if I actually understood it. Still, using Levenshtein can be a bit easier than getting a spell checker to do the job.

Easier, but slow. I used an existing implementation by Magnus Lie Hetland and ran a test with 245 blog titles using a simple script. 100 iterations on an aging PowerBook produced:
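For reference, the algorithm in question is the textbook dynamic-programming edit distance. A minimal Python sketch of it (not Hetland’s exact code) looks roughly like this:

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
        if not a:
            return len(b)
        if not b:
            return len(a)
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                current.append(min(previous[j] + 1,         # deletion
                                   current[j - 1] + 1,      # insertion
                                   previous[j - 1] + cost)) # substitution
            previous = current
        return previous[-1]

The benchmark then simply measures how long it takes to compute the distance between one randomly chosen title and all the others.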

1.766s, 29.152s, 9.399s (min, max, avg)

The average time to calculate the distance between a randomly chosen title and the rest of them is 9.4 seconds, which is far too slow to be useful. And there aren’t even 250 of them.

There are two obvious ways to speed things up. Since the minimum distance is at least the difference in length between the two strings, there’s no point in calculating it when that difference already exceeds the limit we chose.

The other trick takes into account that the Levenshtein algorithm on two strings of comparable length has complexity O(n²) and that my blog titles form quite a sparse space. If the strings are longer than a certain limit (I arbitrarily chose 10 letters), first calculate the edit distance on a short substring, and calculate the real distance only if the first one was close enough.
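A minimal sketch of both checks wrapped around the function above; the distance limit of 3 is a made-up example (only the 10-letter substring comes from the text), and the prefix check is a heuristic that can occasionally reject a genuine match:

    def close_enough(a, b, limit=3, prefix_len=10):
        """True if the edit distance between a and b is within limit,
        skipping the expensive calculation whenever possible."""
        # Check 1: the distance is at least the difference in lengths,
        # so a large length difference rules the pair out immediately.
        if abs(len(a) - len(b)) > limit:
            return False
        # Check 2 (heuristic): for longer strings, compare short prefixes
        # first and only fall through if they look similar enough.
        if len(a) > prefix_len and len(b) > prefix_len:
            if levenshtein(a[:prefix_len], b[:prefix_len]) > limit:
                return False
        return levenshtein(a, b) <= limit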

When I ran 1000 iterations with randomly chosen titles using only the first check, I got the following results:

0.003s, 0.284s, 0.167s (min, max, avg)

However, when I used both checks, the same 1000-iteration test gave me:

0.003s, 0.162s, 0.027s (min, max, avg)

So two simple checks can speed up the search by up to 350 times. Not bad, but I’m not happy. This is certainly fast enough for a personal website, or for sites where the site structure effectively divides the search into relatively small sets of possible hits, but there are plenty of sites where it would still be too slow.

I tried to simplify the search by using distances to map strings onto a coordinate system, which was hopelessly naive if not downright dumb. A much better approach would be to read more, which is what I’m doing now.

By the way, I’ve also tried to speed it up using Pyrex, but got pretty much the same results, which means either I don’t know how to use Pyrex or Python already optimizes this well. Or both.

Handling 404

This blog doesn’t use descriptive URLs, and it’s not the only time I’ve screwed this up for no good reason. In this case it was mainly laziness and a smug i-don’t-care-if-people-find-me attitude, but I hadn’t really realized how stupid this decision was until I started thinking about the problem of missing pages (error 404).

It always bugged me how useless most 404 pages are. Sure, I could notify the webmaster about a broken link, but that won’t help me find what I’m looking for. Can’t we do better than that?

As it happens we often can. There’s an unlimited number of ways in which visitors can fail to reach their destination, but the majority of them probably fall into four categories:

  • content moved elsewhere
  • broken links
  • typos
  • bad memory recall

Content moved elsewhere

Many believe that pages should be permanent and links shouldn’t break over time. Yet sometimes we either have little control over this or there are good reasons for breakage. We should still try to mitigate such situations by guiding visitors to the correct new location when possible.

This is done with HTTP redirects. If content was moved only temporarily, the response should have a status code of 302 and contain a Location header pointing to the current address. If the move was permanent, the same thing is achieved with code 301.
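As an illustration, a minimal WSGI-style sketch of issuing such redirects; the mapping and paths are made up:

    # Hypothetical table of addresses that moved for good.
    MOVED_PERMANENTLY = {
        '/old/levenshtein': '/articles/speeding-up-levenshtein',
    }

    def app(environ, start_response):
        path = environ.get('PATH_INFO', '/')
        if path in MOVED_PERMANENTLY:
            # 301 for a permanent move; use '302 Found' instead if the
            # content is only away temporarily.
            start_response('301 Moved Permanently',
                           [('Location', MOVED_PERMANENTLY[path])])
            return [b'']
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return [b'Not here']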

Missing pages amount to around 0.6% of hits on this website. They would be around 8% if redirects weren’t used.

Broken links

Broken links appear when email clients wrap a URL longer than the maximum line length, or from a botched copy operation. This usually means the address is cut off, and the part we have is incomplete but otherwise correct. Handling such links can range from trivial to impossible. Let’s look at one Wikipedia link as an example:

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

Let’s say the link was broken near the end and lem was missing from problem . The resulting address is not complete, but there probably aren’t many Wikipedia articles whose address starts with such a string, so it would be reasonable to expect Wikipedia to find the right article. Alas, it doesn’t.

On the other side of the spectrum are impossible or almost impossible guesses. If the address was broken anywhere before Longest , we could learn nothing from it about the visitor’s expectations. This would look like a good place to give up.

However, if we are lucky, our visitors came from one of the popular search engines, which means the referrer header includes the search string that led to our page. We can extract those keywords, use them to hopefully find that page or, failing that, offer a list of related pages. Not perfect, but it beats a plain “Not here” sign.
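A rough sketch of digging keywords out of such a referrer; it assumes the search terms arrive in a q parameter, as they do for Google, with p thrown in for Yahoo:

    from urllib.parse import parse_qs, urlparse

    def keywords_from_referrer(referrer):
        """Extract search keywords from a search engine referrer, if any."""
        query = parse_qs(urlparse(referrer).query)
        terms = query.get('q') or query.get('p')  # 'q' for Google, 'p' for Yahoo
        return terms[0].split() if terms else []

    # keywords_from_referrer('http://www.google.com/search?q=longest+common+subsequence')
    # -> ['longest', 'common', 'subsequence']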

Typos

Typos are what happens when people use keyboards. I can’t live without one, but I still recognize that my fingers and my brain are not always of the same mind about how to spell a word. Letters get added, dropped or just switch places, all of which is a problem for a program that usually looks for that one perfect match.

There’s help. Edit distance algorithms, like Levenshtein’s, can be used to measure how similar two strings really are. Matching then becomes finding, among all existing pages, the one closest to the requested address.
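As a sketch of that lookup, Python’s standard difflib will already pick the nearest candidates, though it ranks by a similarity ratio rather than strict Levenshtein distance; the list of paths here is made up:

    import difflib

    KNOWN_PATHS = [
        '/wiki/Longest_common_subsequence_problem',
        '/wiki/Longest_increasing_subsequence',
        '/wiki/Edit_distance',
    ]

    def closest_page(requested, paths=KNOWN_PATHS):
        """Return the most similar known path, or None if nothing is close."""
        matches = difflib.get_close_matches(requested, paths, n=1, cutoff=0.8)
        return matches[0] if matches else None

    # closest_page('/wiki/Longest_comon_subsequence_problem')
    # -> '/wiki/Longest_common_subsequence_problem'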

The downside is that the algorithm is fairly computationally expensive, and it might take a while to find a match. I’ll explore this problem in tomorrow’s article.

Bad memory requests

The main purpose of descriptive URLs is the same as that of catchy domain names: make the address easy to remember for later use, when bookmarks or the browser’s autocomplete can’t help.

But memory is notoriously unreliable, and it doesn’t work any better with web addresses. So the addresses used may differ enough from the right one that they don’t get caught by edit distance algorithms. As an example, someone might try to access the aforementioned Wikipedia article with:

http://en.wikipedia.org/wiki/Longest_subsequence_problem

The calculated distance between this address and the real one is 7, which is probably more than any realistic limit for matching. We can still look at the referrer for clues, but we can also use the requested address itself. In this case, we can extract the keywords longest , subsequence and problem and use them to search for possible hits. Wikipedia doesn’t do this either, but neither do I, so I shouldn’t complain.
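Pulling those keywords out of the requested address is simple enough; a sketch:

    import re

    def keywords_from_path(path):
        """Split the last segment of a requested path into keywords."""
        slug = path.rstrip('/').rsplit('/', 1)[-1]
        return [w.lower() for w in re.split(r'[^A-Za-z0-9]+', slug) if w]

    # keywords_from_path('/wiki/Longest_subsequence_problem')
    # -> ['longest', 'subsequence', 'problem']
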
Time to wrap it up

I believe these approaches make a good case for logical and descriptive addresses, since most of them don’t work (well) otherwise. If someone requests an article with the nonexistent ID 145, it’s impossible to tell which of the existing ones with IDs 155, 245 or 149 they really wanted.

Still, sometimes descriptive addresses are not an option. I’d love to hear ideas or practical experience on how to handle such cases.

Preventing download of javascript on mobile web

Every year I spend a few weeks somewhere where the internet connection is either slow or metered and expensive. Usually it’s both, which makes me rather twitchy when it comes to big web page sizes. Yet I’m also certain that my new web home, however it may turn out in the end, will have a significant chunk of it written in Javascript. None of it necessary for the site to work, but more than I’d like to download over a flimsy mobile phone connection.

The problem isn’t really old mobile browsers that don’t support Javascript; they won’t download any of it anyway. The question is how to prevent eager browsers, which would, from downloading this stuff when you don’t want them to. My first, rather primitive attempt was this demonstration, which works in Opera 9, Safari and Firefox, but most certainly doesn’t work in all browsers.

What it does is check the font size of a title and, based on its value, work out which media style sheet was applied. If it was mobile.css , which is used only when the media is set to handheld , then the page was probably loaded from a mobile environment, so it includes the mobile version of the Javascript or, if I wanted, none at all.

There are several problems with this approach. The first is that it doesn’t recognize notebook users connecting over a cellphone. It can’t, really, since the browser environment is literally the same, unless it tried to measure latency and bandwidth of page elements and guess from those results, which is neither easy nor reliable. 3G networks can be rather fast and have better latency than a wired connection from somewhere like Tanzania.

But let’s say we don’t care about this case. We can always turn Javascript off in Firefox if it’s important enough to us, which leads us to the next problem. Support for the handheld media type is still rather spotty. If a browser doesn’t support it, it may load the wrong style sheet, if any at all, and the wrong CSS value sends the Javascript down the wrong path. However, handheld media support is getting better. Since I decided from the start that my personal page is a good place where the leading edge can also be a bit of a bleeding one, that is good enough. On a different site this probably wouldn’t be true.

So it sort of works in principle, but it is crude and error prone. The Javascript check is not self-contained, and I could easily break it by changing the title’s font size in the CSS files while forgetting to make the corresponding correction in the code. It would be much better to learn from the browser which media types’ style sheets were actually applied and act accordingly. Now that would be grand.

There are two ways you could go about this. My first go was finding style nodes in the DOM and looking at their disabled property, which is commonly used in style sheet switchers for turning sheets on and off. It doesn’t work, since style sheets with the ‘wrong’ media type are ignored in Firefox, not disabled; their disabled value is still set to false .

A proper way of doing it would be to use the DOM Style Sheets methods. The basic idea is to compare actual values as set in the page with values read from the style sheets, and to work out which ones were used. While not exactly trivial, with a little forethought it can be made to work fine in Opera and Firefox. It can also work in Explorer using its own methods, but it’s a pain to make it work in Safari, which in this case is not only an incompetent liar, it’s also lying in some weird accent. If you’d like to learn more about it, there’s a great page describing the current state, but the gist of it is that you can’t rely on the methods being there and, even when they are in some form or another, they might not work reliably (Safari) or might produce weird results (yup, Safari again).

Well, it could still work, but it would require an unreasonable amount of code. Hence I made a different tradeoff, which can be seen here. A bit less automation for a lot less code. It keeps the idea of the first demo with a small twist. The first rule of every media style sheet targets the same element, which gets created by the Javascript if necessary; the script then compares that element’s font size value with those found in the style sheets and returns the media types of all sheets where this value was found.

The end result certainly isn’t perfect or pretty, but it keeps the amount of bookkeeping to a minimum and limits it to the CSS files. Even using a “safe” property like font size can lead to problems (e.g. 0.9em is sometimes reported as 0.90em), but nothing difficult to overcome.

It might not be of production quality, but it should work as a good starting point for further exploration.