Something Awesome Every Day: Edit Distance Edition

It is outrageous how much fun stuff there is to learn at Adaptive. Lots of it is just brand new (thank you Adaptive scientists for putting up with me at Journal Club!). But my most exciting moments come when I find that ideas I’ve used for years take on awesome new depth and relevance in the context of immunology.

Take edit distance. This is a super-core concept for things like spell check and the search recommendations you see all the time at Google:

carly

The “edit distance” between two strings is defined as the number of changes required to transform from one to the next, where a “change” means inserting, deleting or changing one letter. For example, the distance from SEAN to PEANUT is 3:

  • Change the leading S to a P = PEAN
  • Insert a U at the end = PEANU
  • Insert a T at the end = PEANUT

The smaller the edit distance, the more similar two strings are. Most spelling mistakes involve an edit distance of just 1-2, so it’s very helpful when building a list of possible replacements.

The real world is a tough place, though — actually computing edit distance is pretty expensive. The most popular algorithm is something called Levenshtein Distance, which basically runs in time proportional to the product of the length of each string, i.e., O(m*n). Yes, there are optimizations, blah blah, not important right now.

For some applications, there are cheaper ways to look at edit distance. When your strings are the same length, the most common one of these is called Hamming Distance. It’s really simple: ignore insertions and deletions, and just count the number of character positions where the two strings are different. For example, the Hamming Distance between SEAN and BEAN is 1 (swap in a B for an S). This runs with one pass through the strings, so it’s much faster (and simpler to code) than Levenshtein.

Ah, but even when your strings are the same length, you’ve given up something by choosing Hamming. Take the two strings OTHER and THERE. Levenshtein computes a distance of 2 (delete the O and insert an E at the end), but Hamming sees that every character position is different and so reports 5. What a bummer.

In my pre-Adaptive world, this was a simple tradeoff: Levenshtein is more precise but slower, Hamming is an approximation but faster. Check, thank you very much, moving on.

Now fast forward to Adaptive. Our world is all about strings too, but rather than words, our strings define DNA or protein sequences. As we work on the next generation of our analysis tools, one feature we want to implement is “search by similarity” — find sequences that have an edit distance <= X from some exemplar, in particular for B-Cells. The actions tracked by edit distance (deletion, insertion, or change) mirror biologic mutations that can occur during transcription and so on. So it can be useful to group sequences that likely were the same prior to some mutation event. This comes up a lot in bioinformatics — most of the hard work comes down to adjusting for errors in measurement and mutations in reality, and figuring out which is which.

Anyway, I was sitting in front of my editor late at night a week or so trying to decide which algorithm to use for this search feature. “Obviously” I would prefer Levenshtein, but wasn’t sure if it was worth the extra compute cost in this particular immunologic context.

So I did was any self-respecting newbie does … I punted the question to a friend who understands all this way better than me. The answer I got back was not what I suspected, and was actually way way cooler than my simple view of the speed tradeoff:

“It depends whether you’re comparing B-Cells within or between individuals.”

Remember that CDR3 sequences are formed through a process called VDJ recombination, where gene segments are randomly selected and pasted together with some more random insertions and deletions at the join points.

After this initial step, B-Cells “in the wild” undergo a secondary process called “Somatic Hypermutation” that allows them to fine tune their response through additional spot mutations — flipping nucleotides along the sequence in an attempt to better match particular antigens. This mechanism creates a “family” of similar receptors, each of which originated from a common ancestor.

Because the “edits” caused by SMH are nucleotide flips and not insertions or deletions — it turns out that Hamming distance is actually not just faster but is actually better for identifying these families. From an SMH perspective, our earlier examples OTHER and THERE are very different beasts!

On the other hand — when you’re comparing between individuals, similarity is defined less by matches at specific positions, and more about common (or “conserved”) patterns found across sequences. This is because a particular pattern of nucleotides tends to create a similar receptor “shape,” which may tend to bind to the same antigen. Said another way, finding the pattern “THE” in both OTHER and THERE is the important part, not whether or not it appears at exactly the same position in both receptors. Hello, Levenshtein.

So at the end of the day — Hamming is the best choice for finding similarity within an individual’s B-cell repertoire; Levenshtein is best when looking across a population. Not such a simple tradeoff after all — we’ll be providing both options to our r.esearchers next release.

It is just insanely awesome to be able to take these tools that I felt like I knew so well, and find so much more complexity when applying them to this new domain. SO FREAKING COOL. The choice we’ve made at Adaptive to really invest in both computing and chemistry is paying off … I can hardly contain myself I’m so psyched about it.

Wooooooooo hoo!