In brief: what this article covers
Levenshtein distance is the minimum number of insertions, deletions, and substitutions of characters needed to transform string s into string t.
For example, turning kitten → sitting requires 3 edits: k→s (substitution), e→i (substitution), ∅→g (insertion). Therefore, d("kitten", "sitting") = 3.
Why is this useful?
Have you ever wondered how your computer—or the Google search bar—figures out what you meant and fixes typos? This capability appeared early in the development of NLP. It’s made possible by this straightforward approach (yes, it looks daunting at first, but trust me, it gets easier).
A brief historical note
The metric was proposed by the Soviet mathematician Vladimir Levenshtein in 1965 while studying binary (0–1) sequences. He derived a recursive formula for measuring the similarity of strings/sequences; it is exactly the recurrence given in the pseudocode section below.
Later, in 1974, while tackling the string-to-string correction problem (String-to-string correction problem), Robert A. Wagner and Michael J. Fischer developed the eponymous Wagner–Fischer algorithm, which we’ll discuss in more detail below. This algorithm made computing Levenshtein distance efficient and became the basis for most text autocorrection systems (typically in more optimized forms).
Algorithm pseudocode
Let d(i, j) be the edit distance between the first i characters of s and the first j characters of t.
Boundary conditions:
d(0, j) = j
d(i, 0) = i
Recurrence for i > 0, j > 0:
d(i, j) = min(
    d(i-1, j) + 1,                 # deletion
    d(i, j-1) + 1,                 # insertion
    d(i-1, j-1) + [s[i] != t[j]]   # substitution (0 if characters are equal)
)
This is the classic Wagner–Fischer algorithm. Time complexity is O(mn), where m = len(s) and n = len(t).
Minimal Python implementation
def levenshtein(s: str, t: str) -> int:
    if s == t:
        return 0
    if len(s) < len(t):
        s, t = t, s  # ensure s is not shorter than t, so rows have length min(m, n) + 1
    # prev[j] = distance between the empty prefix of s and the first j characters of t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]  # distance between the first i characters of s and the empty prefix of t
        for j, ct in enumerate(t, 1):
            ins = curr[j - 1] + 1              # insertion
            delete = prev[j] + 1               # deletion
            sub = prev[j - 1] + (cs != ct)     # substitution (free if characters match)
            curr.append(min(ins, delete, sub))
        prev = curr
    return prev[-1]
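A quick sanity check on the example from the beginning of the article (the calls below simply exercise the levenshtein function defined above):

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("", "abc"))            # 3: three insertions
print(levenshtein("flaw", "lawn"))       # 2: delete 'f', insert 'n'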
Why this way: there is no need to store the whole matrix; two rows (prev and curr) suffice. This reduces memory from O(mn) to O(min(m, n)).
Intuition
We build an (m+1) × (n+1) matrix where each cell stores the “best cost” of transforming the corresponding string prefixes. The answer for the full strings sits in the bottom-right cell. Because each cell takes the minimum over transitions (insertion, deletion, substitution) from subproblems that are already optimal, the final cell holds a global optimum; this is the standard dynamic-programming argument.
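To make the intuition concrete, here is a small sketch (the helper name levenshtein_matrix is mine, used only for illustration) that builds the full table for the kitten/sitting example as a direct transcription of the recurrence above and prints it; the answer appears in the bottom-right cell.

def levenshtein_matrix(s: str, t: str) -> list[list[int]]:
    """Build the full (len(s)+1) x (len(t)+1) table of prefix distances."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # boundary: i deletions
    for j in range(n + 1):
        d[0][j] = j                      # boundary: j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                          # deletion
                d[i][j - 1] + 1,                          # insertion
                d[i - 1][j - 1] + (s[i - 1] != t[j - 1])  # substitution
            )
    return d

table = levenshtein_matrix("kitten", "sitting")
for row in table:
    print(row)
print("distance:", table[-1][-1])  # 3, in the bottom-right cell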
To get a crude syntactic similarity score, you can use, for example, sim = 1 − d / max(len(s), len(t)), where sim ∈ [0, 1]. For instance, by this measure “kitty” comes out closer to “cat” than to “copycat”, even though “copycat” contains “cat” as a substring.
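A minimal sketch of this score (similarity is a hypothetical helper built on the levenshtein function from above); the printed values follow from d("kitty", "cat") = 4 and d("kitty", "copycat") = 7:

def similarity(s: str, t: str) -> float:
    """Normalize edit distance into a [0, 1] similarity score."""
    if not s and not t:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein(s, t) / max(len(s), len(t))

print(similarity("kitty", "cat"))       # 0.2    (distance 4 over length 5)
print(similarity("kitty", "copycat"))   # 0.0    (distance 7 over length 7)
print(similarity("kitten", "sitting"))  # ~0.571 (distance 3 over length 7)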
At the same time, the basic algorithm has drawbacks you will notice as soon as you compare a few words of your own:
- it works purely at the character level, reading strings left to right; every insertion or deletion costs a full edit, so differences in length are penalized heavily (each one pushes the alignment a step off the main diagonal of the matrix), and the metric knows nothing about meaning, only surface form.
For this reason, “kitty” ends up closer to “mitty” (one substitution) than to “cat” (four edits), even if your intuition says otherwise.
Don’t be discouraged: these shortcomings are specific to the basic version. More specialized variants address them.
What algorithm variants exist?
- Damerau–Levenshtein: adds transposition of adjacent characters (ab ↔ ba). Useful for common typos; a sketch of the extra recurrence case is given after this list.
- Jaro / Jaro–Winkler: more robust to permutations and “jumbled” characters; often better for matching names.
- Hamming distance: if strings have the same length and only substitutions matter.
- Token-level distance: edit distance over words (not characters) for comparing phrases and sentences.
- Weighted Levenshtein: different operation costs (e.g., i↔j cheaper than i↔x), aware of keyboard layout and error frequency.
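For illustration, here is a minimal sketch of the restricted Damerau–Levenshtein variant (also known as optimal string alignment): the same DP as before, plus one extra case that counts a swap of two adjacent characters as a single edit. The function name osa_distance is mine, not taken from any particular library.

def osa_distance(s: str, t: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = s[i - 1] != t[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution
            )
            # the extra case: two adjacent characters swapped count as one edit
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(osa_distance("form", "from"))  # 1: a single transposition
print(levenshtein("form", "from"))   # 2 with the plain version defined earlier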
When should you use it?
- Autocorrection and suggestions: finding the nearest dictionary words to a typo (e.g., in a search bar); see the sketch after this list.
- Fuzzy joins / deduplication: merging records from different sources (names, addresses, brands) despite spelling differences (e.g., given a complete brand list, infer a product’s brand from its description).
- Catalog search: tolerant search when users type “as they hear it” (“google, find the song that goes ‘oppan gangnam style’”).
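A toy sketch of the autocorrection scenario from the first bullet: given a small dictionary, suggest the closest entries to what the user typed. It reuses the levenshtein function from earlier; the dictionary and the distance threshold are made up for illustration, and real systems add candidate pruning, word frequencies, and context on top.

def suggest(word: str, dictionary: list[str], max_distance: int = 2) -> list[str]:
    """Return dictionary words within max_distance edits of `word`, closest first."""
    scored = [(levenshtein(word, candidate), candidate) for candidate in dictionary]
    return [candidate for dist, candidate in sorted(scored) if dist <= max_distance]

# a toy dictionary, purely for illustration
words = ["kitten", "sitting", "mitten", "smitten", "bitten", "written"]
print(suggest("mitten", words))  # ['mitten', 'bitten', 'kitten', 'smitten', 'written']
print(suggest("sittng", words))  # ['sitting']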
When should you not use it?
As you’ve probably gathered, Levenshtein is suitable only for primitive matching, and it will definitely not fit if:
- Semantics matter more than surface form.
- Large internal permutations are critical: the basic version does not “understand” them (prefer Jaro–Winkler or token-level methods).
- Strings are very long and there are many deletions.
Afterword
Honestly, I never gave much thought to something as basic as autocompletion, which is underpinned by surprisingly intuitive and approachable math. I hope this short walkthrough of an algorithm that quietly helps you every day (not you, T9) piqued your interest too, broadened your horizons a bit, and lifted the veil on what powers your browser’s search box or your word processor.