One Test, Every Disease

The old blog still looks pretty ok after two years of inactivity! I ought to have remembered how the startup game goes … raise my head after a few crunch cycles and I’ve got a lot less hair and every article on CNN is about some angry old guy tweeting stuff. Huh.

Anyways, here at Adaptive we’ve been busy creating more awesome. My blog hiatus has been all about scaling up the clinical business around our new clonoSEQ assay. Beyond getting a completely new chemistry validated and into production, we’ve automated reporting and report delivery, created whole new products to support clinical trials, implemented new robots and sample tracking systems, and much much more. Progress every day.

But today is really special, which is what brought me back here to share.

We’re creating a Universal Diagnostic.

We’ve just announced a significant collaboration between Adaptive and Microsoft. Our shared goal is to create a “universal” diagnostic test — a simple blood draw that tells a story about every condition and every disease impacting your health. And to be clear, this is not speculative — we have a plan and every reason to believe that we will nail it. I am super, super psyched.

As we’ve discussed previously, Adaptive’s core technology allows us to describe the millions of T-Cell (and B-Cell) receptors that make up a person’s immune repertoire. Each of these receptors is a specific match for a unique “antigen” (a protein from an attacker such as cancer or Ebola). T-Cells circulate in blood and lymph; when they happen to bump into their soulmate antigen, they go into action — making copies of themselves and killing bad guys Rambo-style.

Cool thing is, this happens even when you have very tiny trace amounts of antigen in your system — so if we could match receptors to the antigens they are fighting, we could recognize and diagnose the corresponding condition earlier and more accurately than most current methods.

Even better, this is a general-purpose mechanism — the receptors are unique, but the actions of binding and responding are the same for every antigen. Since our test can read all the receptors at once, we’d be able to see everything that you are responding to with just one test (and as a bonus, things you have memory for).

Sounds pretty simple: Get the list of receptors, see which ones are activated, match them up to antigens, here’s your universal health picture, thank you very much and have a nice day. Woot!

Wait it’s still hard.

Two problems. First, nobody has been able to do this matching of receptor to antigen at any kind of scale. We’ve been able to quantify the receptor sequences here at Adaptive for years, but it’s completely non-obvious what any given receptor will bind to.

Perhaps not surprisingly, we’ve been working hard on this problem, and at this point can do it really well. Our MIRA assay follows the classic Adaptive chemistry + computation playbook, using combinatorics to multiplex a bunch of binding experiments onto one plate — resulting in thousands of binding pairs every time we do a run. So good news on that front.

The second problem, though, is that there are simply way, way too many possible TCRs for us to ever exhaustively map the entire space using MIRA. We’re talking like 10^16 possible receptors, and millions of potential antigens. And to make matters worse, people evolve different receptors to do the same thing — my TCRs for defending against chicken pox likely look completely different from yours, even though they respond to the same antigens.

Here’s where Microsoft and machine learning enter the game. By using our MIRA-derived data as a training set, we believe that we can train computer models to predict the binding properties of receptors we’ve never seen before. There’s actually a bunch of early evidence that this will work — we’ve done some work with CMV, and a couple of great small-scale examples showed up in Nature earlier this year (here and here).

Machine Learning is everywhere these days — and for good reason. Anywhere there are rules and patterns hiding in data, new ML techniques are proving exceptionally good at figuring them out. T-Cell binding is a physical phenomenon — gene sequences translate to protein sequences translate to physical structures translate to binding — so it’s a perfect application of these new tools.

Microsoft Research has been a pioneer in ML and artificial intelligence for years — it is incredibly exciting to be collaborating with them in a way that is so completely synergistic. There is not a doubt in my mind that over the next few years, Adaptive MIRA + Microsoft ML will deliver a Universal Diagnostic to the world.

Do that, and we really have fundamentally changed medicine together.

Time to play.

(Adaptive) Adaptive Biotechnologies Announces Partnership with Microsoft to Decode the Human Immune System to Improve the Diagnosis of Disease

(Microsoft) Microsoft and Adaptive Biotechnologies announce partnership using AI to decode immune system; diagnose, treat disease

(NewCo) All We Have Yet To Understand

(Geekwire) Microsoft partners with Adaptive Biotech on AI-driven blood test to diagnose dozens of diseases at once



Immunology, meet the Internet

Whenever software friends visit Adaptive, I make a point of showing them a multichannel pipette. It’s actually a pretty cool piece of technology, but from a design perspective it just feels wrong. Like, if one pipette is good, then a ton of them must be even better, right? Maybe not. At some point, incrementing old solutions isn’t enough. That’s why robots are taking over the job of liquid handling in the lab: they’re just fundamentally better at it than people, no matter how many tips you cram onto the end of a manual pipette.

Software in biotech is in the same place as the pipette, but even more so. Celebrity gossip surfing habits are mined in real time using the latest cloud-based “big data” technologies, but your genome is still largely analyzed by some old Perl code on a Commodore 64 that requires a statistics module written in 1925 by some woman that threw away the source (I will cop to a bit of hyperbole here, but far, far less than you want to know).

There is simply no excuse for this. Biology is hard enough. Researchers should not have to learn to program just to understand their experimental data. They should have flexible, real-time, visualization-rich tools at their fingertips. Tools that highlight patterns and provide answers, leading immediately to the next question, the next discovery, and the next answer.

We get it. Which is why we’re not just creating unbelievable assays (although we are doing that too) — we’re creating end-to-end solutions that include the tools our customers and partners need to figure out what those assays are saying. We just released the latest version of these tools, and I believe they are game changing.

I am seriously excited about this. The marketing team made a pretty fantastic video showcasing the brand new immunoSEQ Analyzer 3.0 experience; if you’re already sick of reading, feel free to head over and check it out. Woot!

Still here? OK then — a bit more about what we’ve built.

Supporting Experimental Design

Research runs on experiments and publications. The design of these experiments is critical — what are the cohorts, what is being measured, and how will results be compared and evaluated. The Analyzer is built on top of these same concepts, so you see data in the right context, without having to do a bunch of manual pre-processing.

In the Analyzer, samples can be easily tagged with metadata to segment and correlate observations using experiment variables. Researchers can draw from our rich library of tags, or create their own as needed. This metadata travels with the samples, always available to help guide analysis.

The magic here is subtle, but super-powerful. Metadata defines “comparison groups” (control vs. protocol, pre- vs. post-treatment, one demographic vs. another, responders vs. non-responders, etc.), and whatever groupings are relevant can be used to drive visualizations.

And once conclusions are drawn and work is published, the Analyzer provides, free for our customers, a repository where data can be publicly shared interactively. Figures can be saved with the data, which remain completely dynamic for readers to explore. Readers can even incorporate published data into their own work, enabling new research to build easily on what has come before. The potential of these “Published Projects” is ridiculous, and will be the subject of its own post soon.

Designed for Immunology

We’ve learned which metrics matter – so we make them available right out of the gate. We can segment productive and non-productive receptors, characterize sample clonality, report CDR3 length and the genes and alleles that contribute to a particular rearrangement, pinpoint somatic hypermutations in B-cells, measure the density of receptors within tissue, and much more. All of these values are delivered with clear, online documentation as to how they were derived.

In the same way, the visualizations that we’ve built are custom-designed for immune research. For example, Combined Rearrangements and Scatter Plots illuminate the “overlap” between cohorts: not just the amount of overlap, but the specific sequences that are shared.

The really cool thing here is, as we learn more, we just keep driving that experience back into the Analyzer. We’ll be doing more and more of this, and because the tools are delivered online, they upgrade magically without any downtime or local IT hassles.

Extra Power with Advanced Tools

Of course, science is about new questions and discoveries. So no matter how much we bake in, there will always be a need to reach just a bit farther. Traditionally in biotech, this is a “cliff” — download huge data files and start writing code to process them. More often than not, this barrier means it just doesn’t happen. We’ve worked hard to add capabilities that minimize this challenge, and frankly I think we’ve nailed it.

First, check out the Pivot Table view; it’s a point-and-click interface for creating aggregations and cross-tabs. For example, if you want to see how many unique rearrangements were called against each V Gene, just add the V Gene column on the left, pick Count in the values section in the middle, and go. Want to segment that further by V allele? Just add it on the left. Want to cross-tabulate those against J Genes? Just add that at the top of the chart using the “+” button and you have a complete two-dimensional table of counts by V and J. Bam. Once you get the hang of it, it’s really fun to play with data in this environment.
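If you like to see the idea in code, here is a rough Python sketch of what a count-by-V-gene aggregation and a V-by-J cross-tab boil down to. This is not how the Analyzer is implemented; the row layout, the `v_gene` and `j_gene` field names, and the gene labels are all made up for illustration.

```python
from collections import Counter

# Hypothetical rearrangement rows; field names and gene labels are
# illustrative assumptions, not the Analyzer's actual schema.
rearrangements = [
    {"amino_acid": "CASSLGQF", "v_gene": "V05", "j_gene": "J01"},
    {"amino_acid": "CASSPGTF", "v_gene": "V05", "j_gene": "J02"},
    {"amino_acid": "CASRDGYF", "v_gene": "V05", "j_gene": "J01"},
    {"amino_acid": "CATWDLKF", "v_gene": "V07", "j_gene": "J02"},
]


def count_by_v(rows):
    """Count rearrangements per V gene (one pivot row per V gene)."""
    return Counter(row["v_gene"] for row in rows)


def cross_tab(rows):
    """Build a two-dimensional table of counts: V gene by J gene."""
    table = {}
    for row in rows:
        table.setdefault(row["v_gene"], Counter())[row["j_gene"]] += 1
    return table


if __name__ == "__main__":
    print(count_by_v(rearrangements))
    for v_gene, j_counts in sorted(cross_tab(rearrangements).items()):
        print(v_gene, dict(j_counts))
```

The point of the Pivot Table view is that nobody has to write even this much; the sketch just shows what is being computed when you drag a column onto the left or the top.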

Then there’s my favorite — the Advanced Query view. Advanced Query exposes immunoSEQ data as two “tables” of rows and columns, one for samples and one for rearrangements / sequences. A robust implementation of the SQL query language sits on top of these tables thanks to Apache Spark … and you can do amazing things with it.

Here’s a simple one; find all amino acid sequences with a Levenshtein edit distance of less than 3 from a particular sequence:

select amino_acid, templates, reads from sequences
where Levenshtein(amino_acid, 'CATWDLKKLF') < 3

Kaching! This really is a supercharged interface. SQL syntax can also be a bit of a bear, but there is a ton of documentation out there, and it is way better than cracking open that Perl book and starting from scratch!

It all starts with Spark

I could (and often do) run on forever about all the work the team has put into making the new Analyzer, hands down, the world’s best immunology research tool. But I’m going to wrap up today with just one more tease, this time about the backend that drives all that great experience.

All this great functionality comes at a price. The tools require massive amounts of computation, far beyond what we could reasonably provide with traditional database-backed systems. And while there are tons of great large-scale processing technologies out there, most of them are built for long-running offline computation; they’re just not made for real-time, highly iterative experiences like the Analyzer.

Enter Apache Spark. Spark is a pretty remarkable new technology that attempts to solve exactly this challenge: provide “big data” capabilities in support of a real-time environment. And you know, it just works great. We use a combination of Spark SQL and hand-coded map/reduce style jobs, since Spark enables many ways of working with data. It is truly amazing technology, and will also be the subject of a more detailed post soon.


There is so much new stuff in this release that it’s hard to start teasing out clear and concise messages. I am incredibly proud of the team that built it. We’re a great mix of long-time biotech folks and long-time software folks — it’s a great recipe for awesome.

And I promise to write much more about all of these bits and pieces over the coming weeks. In the meantime, if you’ve got any questions about the new release, just drop me a note here. Releasing stuff is great, but it only really matters when customers use it to do good work — and we are super-motivated to help you make that happen.



pairSEQ breaks down another barrier to fixing, well, everything.

There are a ton of things that make Adaptive a special place — but for me, the real magic is our team’s unique ability to blur the line between biology and computation. A year ago when I was trying to decide if I should join the company, it was Harlan describing pairSEQ that tipped me over the edge. I’ve wanted to write about it here ever since, and now that we’ve published the paper I can finally do it. Woo hoo!

pairSEQ is one of the key advancements driving our expansion into therapeutics. And that’s great, because it puts us one step closer to actually fixing immune-related diseases. But it’s also just a mind-blowing festival of geeky awesome brain crack. That’s what I’m focused on today. 😉

Single-chain immunosequencing is super-useful…

If you’ve been paying attention, you’ll remember that our core technology uses next-generation sequencing to determine the “thumbprints” for millions of adaptive immune cells in a sample of blood, bone marrow or other tissue. These thumbprints, and the way they group together, tell us some incredible things about the state of the immune system. We can use them to track the progression and recurrence of certain blood cancers, predict the likelihood that immunotherapy will be effective, and much more. It’s cool stuff.

But it’s also only part of the story. T-Cells and B-Cells are “heterodimers” … a word that kind of sounds dirty but really just means that they’re composed of a pair of distinct protein chains. T-Cells have “alpha” and “beta” (or “delta” and “gamma”) chains; B-Cells have “heavy” and “light”.

Our core immunosequencing process measures the gene sequences for these chains independently. That is, we can tell you which alpha clones are in a sample, and which beta clones are in the same sample, but we historically haven’t been able to tell you which alphas were paired with which betas.

For diagnostic purposes, this isn’t a big deal. The TCR Beta sequence is incredibly diverse, and its CDR3 “thumbprint” is more than enough to use as a marker for disease (and similar for heavy-chain sequences in B-Cells). This is our bread and butter and frankly we’re rocking the world with it.

… but sometimes you just gotta find the pairs.

Still, as useful as the individual chains are, they only get you so far. It turns out that the “shape” of a T-Cell receptor is determined by the alpha and beta chains together — and it’s that unique shape that enables the cell to precisely target one specific antigen.

This targeting is the basis of some of the most exciting work in immunotherapy: identify a T-Cell that attacks a particular bad guy, then copy that receptor shape to create a therapeutic (the idea behind CAR T-Cell Therapies). Simple in concept, but until you can identify paired chains, basically impossible.

Past attempts focused on the physical biology of single cells. For example: extract the genes from a single cell, paste them together using bridge PCR, and then sequence them as a unit. It works — but only one cell at a time. A more recent approach tries to automate the process by isolating single cells in tiny droplets of oil. This seems to work better, and has identified thousands of pairs, but is cumbersome to manage at scale.

Hooray for math!

This is where the Adaptive magic, combining biology and computation, makes the difference. Harlan and his team realized that we didn’t need to isolate the individual cells at all. Because the sequences are so highly diverse, we can instead use probability and combinatorics to do the hard work for us. Here’s how:

  1. Take a sample and distribute it randomly across N (we used 96) wells.
  2. Amplify the alpha and beta chains within each well, just as we’d do for traditional immunosequencing.
  3. Use standard barcode adapters to tag each chain with a unique identifier corresponding to the well it was placed in.
  4. Mix the whole soup back together and run it through the sequencer.

Now, say we’ve found alpha sequence A in wells A1, B5 and E3. We then find beta sequence B in the same wells. Because we know the number of wells and the total number of cells we started with, we can say with X% confidence that these chains must have come from a pair. Want to be more than X% sure? Just add more wells.
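To make the intuition concrete, here is a toy Python sketch of the co-occurrence argument. It is emphatically not the model from the paper: it assumes the two chains land in wells independently if they are unpaired, uses a plain hypergeometric tail, and ignores PCR dropout and the fact that we test huge numbers of candidate pairs at once. The function name and inputs are my own invention for illustration.

```python
from math import comb


def co_occurrence_p_value(n_wells, alpha_wells, beta_wells):
    """Chance that two unrelated chains share at least as many wells as observed.

    Simplified null model (an assumption of this sketch): the beta chain's
    wells are a uniformly random subset of the plate, independent of the
    alpha chain's wells, so the overlap follows a hypergeometric distribution.
    """
    a = len(alpha_wells)
    b = len(beta_wells)
    shared = len(alpha_wells & beta_wells)
    # Sum the hypergeometric probabilities for overlaps of `shared` or more.
    tail = sum(
        comb(a, i) * comb(n_wells - a, b - i)
        for i in range(shared, min(a, b) + 1)
    )
    return tail / comb(n_wells, b)


if __name__ == "__main__":
    alpha = {"A1", "B5", "E3"}
    beta = {"A1", "B5", "E3"}
    # With 96 wells this is 1 / C(96, 3) = 1 / 142880, roughly 7e-6.
    print(co_occurrence_p_value(96, alpha, beta))
    # More wells make the same coincidence even less likely.
    print(co_occurrence_p_value(192, alpha, beta))
```

Even in this stripped-down version you can see why the trick works: three shared wells out of 96 is already a very unlikely coincidence, and adding wells only makes it less likely.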

Of course, the math is a bit more complicated than that, because there are a bunch of confounding factors. Like, even though sequence B may have actually been present in a well, our PCR process may have missed it. So the paper is, like any good scientific piece, full of impenetrable equation porn.

But the basics are pretty simple — and incredibly effective. Our first run identified more than an order of magnitude more pairs than previous known methods, and did so using standard lab equipment and consumables. Therapeutics here we come.

This is why bringing both biology and computation to the party makes such a difference. We simply have double the weaponry at our disposal to attack hard problems. And dang if we don’t use those weapons really well. I’m super-proud to be a part of it.

Yeah, just another day at the coolest company around.

Noodling on samples, the real currency of the life sciences.

While it’s obvious in retrospect, until recently I didn’t really understand the one thing that fundamentally makes it difficult to move the life sciences forward. It comes down to exactly one word: samples.

We have no credible virtual models of the human body, so the only way we can figure things out is to measure “stuff” in real people. Stuff that usually requires cutting or poking into their bodies. Often many times, over the course of many years. After giving them drugs or other agents that we’re not really sure about. And not just any people, but those that match certain disease or other criteria that may be pretty rare. And not just a couple of people, but enough to be able to draw statistically-relevant conclusions.

Looked at in this way … it’s amazing we learn anything, actually.

We’ve tried to create mathematical models of biology. We’ll get there someday, but so far we’ve struggled. Which is why, despite all the ethical questions around humanely managing the practice, we do so much of our experimentation using animals — preferably animals like mice that reproduce quickly and can be engineered to approximate the conditions that affect humans.

But the gold standard, and the one required by civilized society before calling any new drug or intervention “good” … is human studies. And that is just a long, long, expensive road.

This reality has a surprising impact on just about every facet of our work. For example, over the last week or so I’ve been helping out a couple of our scientists set up a public data set we’ve created. It’s pretty awesome — immune sequencing data that we’ve created at significant expense — and we’re giving it away for free. (Of course, it’s not completely altruistic; we want folks to see for themselves the value of the assays we’ve developed and sell, but still.)

The route to publicizing data like this is to publish it in a “respectable” journal. To do that, you have to get through a gauntlet of peer review that passes judgment on its overall scientific value. And human nature being what it is, this is an EXTREMELY political process. It’s a perfect vehicle for cranky scientists to show just how much smarter they are than everyone else. Super annoying.

There must be a better way, I thought. Certainly in a connected world like this we have plenty of options to make people aware of the data. But our CSO pushed back on me — how would this really work? Folks need some confidence beyond “we say so” to believe that the data we share is valid and trustworthy, because all science is building one discovery atop another.

And here’s the rub: it is so expensive and time-consuming to run life science experiments that there is simply no practical way to validate them all in the real world. So we’ve adopted peer review, an intentionally political process, as a proxy to, hopefully, ensure that at least we aren’t actively lying and can back up our processes and methods.

Fields backed by well-understood math don’t have this problem. In theoretical physics, anyone (ok not anyone, but enough) with a computer or maybe even a pencil can validate published results with minimal investment. And they do, so bogus research gets noticed pretty quickly. Not so for us.

It’s frustrating when I can’t come up with a better solution for something that seems so obviously broken. But I guess that’s one more reason that the work we’re doing now is important … it gets us closer to understanding the systems, and with enough understanding we will begin to create the mathematical models that will accelerate progress and eliminate human ego from the game.

Wait, would that be the singularity?


Vacation perspective and why I can’t wait to get back to work…

I just got back from a pretty incredible two-week trip to Ireland, my son’s choice for a high school graduation experience. I don’t often get away from work for this long; when I do it always strikes me how long it takes to reset my brain, and also how worthwhile it is to do that occasionally.

We pretty much circled the country by car, starting in Dublin, heading down through Cork to the Sheep’s Head Peninsula, up past the Cliffs of Moher, through Galway and Connemara before heading back out from Shannon. I learned a ton about the creation of the Irish republic, shot clay pigeons and archery targets, hiked over mountains and through more than a few herds of sheep, toured the Guinness factory and Arsenal stadium (ok that was in London), learned about hurling and the GAA, watched Connor kiss the Blarney stone (did not partake myself), walked through passage tombs created long before the pyramids, and ate a truly ridiculous number of full Irish breakfasts.

Travelling reminds me that the world is a big place — and that there are literally thousands of interesting and useful ways to spend the eighty or so years we get here. But as much as I love to explore, at the end of the day there’s nothing better for me to do right now than contribute to the work we’re doing at Adaptive.

On the very last day of our trip, my wife, who is on hard-core immuno-suppressants for Lupus, picked up a bug — so after a very (very) challenging travel day home, we headed directly from SeaTac to the hospital. She’s on the mend and the folks at Overlake are taking good care of her, but what a bummer. Yet at the same time, it’s a powerful reminder of why I’m motivated to show back up at work tomorrow morning and dig back in.

Most people don’t understand that our best defense against auto-immune conditions like Lupus is to literally turn off the immune system. That’s right — we know so little about what is going on, that all we can really do is use chemo drugs like Methotrexate or transplant drugs like Cellcept to tell our bodies to stop fighting everything, good or bad.

Obviously, this leaves folks in a pretty vulnerable state. Any minor virus or other baddie has pretty much a free ticket to party. So life for auto-immune patients becomes a balancing game … turn off immunity until you get sick, then try to turn it back on so you can get better, hopefully enough to kill the real bad guys without inflicting too much damage on your own body.

It’s kind of ridiculous really — how little we know and how crude our tools are. I know I’ve said that before, but it is incredibly striking. Break through the complicated jargon we use to classify things, and again and again the reality is … we just don’t know that much.

But the reason humans are awesome is that we just keep swimming, knocking down questions one after another until we get to solutions. And while usually this process is just a grind, once in a while you make a big leap, and the tests we do at Adaptive are helping to enable and accelerate some seriously huge leaps in our understanding of the immune system. And I get to be a part of that.

So vacation was great, but tomorrow bright and early, I’ll head on over the bridge and get back to work. The next release of the immunoSEQ Analyzer is going to be awesome … can’t wait to put it in the hands of our research partners.


Something Awesome Every Day: Edit Distance Edition

It is outrageous how much fun stuff there is to learn at Adaptive. Lots of it is just brand new (thank you Adaptive scientists for putting up with me at Journal Club!). But my most exciting moments come when I find that ideas I’ve used for years take on awesome new depth and relevance in the context of immunology.

Take edit distance. This is a super-core concept for things like spell check and the search recommendations you see all the time at Google:


The “edit distance” between two strings is defined as the number of changes required to transform from one to the next, where a “change” means inserting, deleting or changing one letter. For example, the distance from SEAN to PEANUT is 3:

  • Change the leading S to a P = PEAN
  • Insert a U at the end = PEANU
  • Insert a T at the end = PEANUT

The smaller the edit distance, the more similar two strings are. Most spelling mistakes involve an edit distance of just 1-2, so it’s very helpful when building a list of possible replacements.

The real world is a tough place, though — actually computing edit distance is pretty expensive. The most popular algorithm is something called Levenshtein Distance, which basically runs in time proportional to the product of the length of each string, i.e., O(m*n). Yes, there are optimizations, blah blah, not important right now.
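For the curious, here is a minimal Python sketch of the textbook dynamic-programming version of Levenshtein distance. It is the plain O(m*n) algorithm with none of the optimizations, and it is not the implementation behind any particular product; it just makes the cost visible as two nested loops.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-letter inserts, deletes and changes."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 letters of a
    # and the first j letters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            change_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                # delete char_a
                current[j - 1] + 1,             # insert char_b
                previous[j - 1] + change_cost,  # change (or keep) the letter
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("SEAN", "PEANUT"))  # 3, matching the example above
```

Every letter of one string gets compared against every letter of the other, which is exactly where the product of the two lengths comes from.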

For some applications, there are cheaper ways to look at edit distance. When your strings are the same length, the most common one of these is called Hamming Distance. It’s really simple: ignore insertions and deletions, and just count the number of character positions where the two strings are different. For example, the Hamming Distance between SEAN and BEAN is 1 (swap in a B for an S). This runs with one pass through the strings, so it’s much faster (and simpler to code) than Levenshtein.

Ah, but even when your strings are the same length, you’ve given up something by choosing Hamming. Take the two strings OTHER and THERE. Levenshtein computes a distance of 2 (delete the O and insert an E at the end), but Hamming sees that every character position is different and so reports 5. What a bummer.
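Hamming distance is short enough to show in full. A minimal Python sketch, assuming we simply refuse strings of different lengths:

```python
def hamming(a: str, b: str) -> int:
    """Count the positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    # One pass over the paired-up letters; True counts as 1 in the sum.
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


if __name__ == "__main__":
    print(hamming("SEAN", "BEAN"))    # 1
    print(hamming("OTHER", "THERE"))  # 5
```

One loop, no table, and nothing to tune, which is why it is so tempting whenever the strings happen to line up.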

In my pre-Adaptive world, this was a simple tradeoff: Levenshtein is more precise but slower, Hamming is an approximation but faster. Check, thank you very much, moving on.

Now fast forward to Adaptive. Our world is all about strings too, but rather than words, our strings define DNA or protein sequences. As we work on the next generation of our analysis tools, one feature we want to implement is “search by similarity” — find sequences that have an edit distance <= X from some exemplar, in particular for B-Cells. The actions tracked by edit distance (deletion, insertion, or change) mirror biologic mutations that can occur during transcription and so on. So it can be useful to group sequences that likely were the same prior to some mutation event. This comes up a lot in bioinformatics — most of the hard work comes down to adjusting for errors in measurement and mutations in reality, and figuring out which is which.

Anyway, I was sitting in front of my editor late at night a week or so ago trying to decide which algorithm to use for this search feature. “Obviously” I would prefer Levenshtein, but wasn’t sure if it was worth the extra compute cost in this particular immunologic context.

So I did what any self-respecting newbie does … I punted the question to a friend who understands all this way better than me. The answer I got back was not what I expected, and was actually way way cooler than my simple view of the speed tradeoff:

“It depends whether you’re comparing B-Cells within or between individuals.”

Remember that CDR3 sequences are formed through a process called VDJ recombination, where gene segments are randomly selected and pasted together with some more random insertions and deletions at the join points.

After this initial step, B-Cells “in the wild” undergo a secondary process called “Somatic Hypermutation” that allows them to fine tune their response through additional spot mutations — flipping nucleotides along the sequence in an attempt to better match particular antigens. This mechanism creates a “family” of similar receptors, each of which originated from a common ancestor.

Because the “edits” caused by SHM are nucleotide flips and not insertions or deletions, it turns out that Hamming distance is not just faster but actually better for identifying these families. From an SHM perspective, our earlier examples OTHER and THERE are very different beasts!
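Here is one way the family idea could be sketched in Python: link any two same-length sequences that are within a small Hamming distance, then treat each connected cluster as a candidate family. To be clear, this is my own toy illustration (simple single-linkage grouping with a made-up threshold and made-up sequences), not the method used in our tools.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count differing positions; callers must pass equal-length strings."""
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


def group_families(sequences, max_distance=1):
    """Cluster sequences by chaining together close Hamming neighbors.

    Assumption of this sketch: two sequences belong together if they have
    the same length and differ in at most `max_distance` positions, and
    membership is transitive (single linkage).
    """
    unique = list(dict.fromkeys(sequences))  # drop duplicates, keep order
    parent = {seq: seq for seq in unique}

    def find(seq):
        # Follow parent links to the cluster representative.
        while parent[seq] != seq:
            parent[seq] = parent[parent[seq]]
            seq = parent[seq]
        return seq

    for a, b in combinations(unique, 2):
        if len(a) == len(b) and hamming(a, b) <= max_distance:
            parent[find(a)] = find(b)

    families = {}
    for seq in unique:
        families.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in families.values())


if __name__ == "__main__":
    # Hypothetical nucleotide strings, invented for this example.
    reads = ["ACGTAC", "ACGTAA", "ACGGAA", "TTTTTT", "ACG"]
    print(group_families(reads))
```

Note that ACGTAC and ACGGAA end up in the same group even though they differ in two positions, because ACGTAA sits one flip away from each of them. That chaining is the code-level version of a common ancestor with a trail of spot mutations.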

On the other hand, when you’re comparing between individuals, similarity is defined less by matches at specific positions, and more by common (or “conserved”) patterns found across sequences. This is because a particular pattern of nucleotides tends to create a similar receptor “shape,” which may tend to bind to the same antigen. Said another way, finding the pattern “THE” in both OTHER and THERE is the important part, not whether or not it appears at exactly the same position in both receptors. Hello, Levenshtein.

So at the end of the day, Hamming is the best choice for finding similarity within an individual’s B-cell repertoire; Levenshtein is best when looking across a population. Not such a simple tradeoff after all. We’ll be providing both options to our researchers in the next release.

It is just insanely awesome to be able to take these tools that I felt like I knew so well, and find so much more complexity when applying them to this new domain. SO FREAKING COOL. The choice we’ve made at Adaptive to really invest in both computing and chemistry is paying off … I can hardly contain myself I’m so psyched about it.

Wooooooooo hoo!

API mania misses the point on healthcare interoperability

I recently decided to stop participating in the ONC “Architecture, Services and APIs” workgroup. I have an incredible amount of respect for the people running the group, but I also believe that the momentum at ONC right now is just wrong. My plan was to just go quietly into the night, but looking through the recently-released Meaningful Use 3 requirements convinced me that I really ought to put this out there.

This is too long and probably super-boring to 99% of humans, so if you keep reading, that’s on you.

The Connectathon Problem

When I first testified at a FACA meeting in 2009 (six years ago!) — I talked about the “Connectathon” problem. Interop is a tough nut for HIT vendors, because it’s obviously important to individual health outcomes, but it’s also a net loser from a competitive and commercial point of view.

Industry solved this dilemma by creating a false measure of success — the Connectathon. They defined some technical specs for sharing data (at the time, IHE profiles) and got together in a room to show that the technology “worked”. And it did — in this completely artificial environment. But the technologies were in no way appropriate to actually deploy in the real world, so things fell into place nicely — success at the Connectathon allowed EHR vendors to claim high ground on interoperability, but without any risk that it would actually be implemented for real.

So what about these technologies made them unsuitable for real use? One issue was technical complexity for sure. But what became clear over time was that the two killer issues were discoverability and authentication/authorization — problems that made interoperability unworkable at scale.

“Discoverability” helps machines find each other on the Internet. If Clinic A wants to request data from Clinic B, it needs to know the “address” for Clinic B. The telephone is pretty good at this for voice interactions; if I know your number I can pick up almost any phone and find you. And we’ve created many ways for us to share and remember phone numbers (contact lists, databases, searching on the web, etc.). The same wasn’t true for old-school healthcare interop solutions.

“Authentication and authorization” help machines decide if they should talk to each other. There are pretty complicated rules for this in healthcare, starting from who has a right to ask about a patient and how to communicate who that patient actually “is,” but going much deeper than that. Person-to-person, we typically work these out ok. But old-school interoperability just punted on the problem. In the Connectathon it was all magically (and painstakingly) set up ahead of time.

Direct tries to help

The Direct Project was an attempt to simplify interoperability and create something that could actually scale. In particular, we tried to build in at the very core solutions for discoverability and auth. The hope was that if we could tackle these, people would find more and more reasons to leverage the basic technologies, and we’d see a network effect start to take hold.

We addressed discoverability by latching onto the one technology that works for this on the Internet today: email addressing. Without creating anything new at all, machines can take a Direct address and find its home. And because it’s just an email address, we can use all the same ways we keep track of these today — they’re easy to add to patient or provider records in EHRs, we can post them on the internet, we can use directories like LDAP, etc. … all ready to go today.

Auth was tougher, because neither email nor the domain naming system which underlies its addressing were created with security in mind. And if every single person with a Direct address had to decide for themselves if they should trust every other Direct address, things wouldn’t scale very well!

We decided to address this by using the natural groupings offered by HIPAA. Direct users don’t have to decide if they trust an individual, they decide if they trust a group of individuals (which can be a small group like a clinic, or a large group like a member of DirectTrust or NATE). The technology innately understands the dynamics of these groups so that the problem can become manageable.

I think we did a pretty good job. And there may still be just enough oomph in this approach to get it over the hump — where virtually zero messages have been exchanged at scale using old-school methods, millions move every month via Direct.

Even so, it hasn’t had nearly the impact we’ve hoped. Why? Well, technology can only do so much. The hard work of building it into clinical workflows, and of getting people on board with the trust associations, takes time and commitment. Those aren’t fun problems — they’re a slog in the best of worlds. People like to look for the next shiny thing, and that’s where the momentum is moving right now.

I’m still hopeful, but it’s tough to watch.

Why APIs aren’t the answer

That next new shiny thing is healthcare “APIs.” An API (or “application programming interface”) is just a fancy way of describing how machines can talk to each other. In fact, Direct is a “Messaging API.” But in hipster parlance, an “API” usually means using web protocols like HTTP and JSON to talk between machines. And often implied is that the consumer of an API is a mobile app, like the ones that run on your iPhone.

APIs can make some cool things happen. Amazon revolutionized affiliate marketing and enabled the creation of some incredible tools by opening up their catalog with an API. Government has done the same thanks to the early work of Todd Park and friends. The Google Maps API has spawned thousands of awesome new mashups. And so on.

And they have their place in healthcare too. I won’t go deeply into that now, but I am a 100% fan of using APIs to extend EHRs within the clinic, adding functionality and all kinds of great stuff. The SMART project is at the forefront here, although it seems that Epic may have just given them the stiff arm with their new stuff.

BUT. Acknowledging all that awesome — they simply won’t help for healthcare interop. Why? Because they solve neither discoverability nor auth. We’re just back to the old days.

APIs work really well when they are delivered either by a single big instance (Amazon, Facebook, Open Government), or users only need to talk to one instance (your email server, or your instance, or your internal EHR). When LOTS of machines need to talk to LOTS of different instances — things break down very quickly. And unfortunately, that’s healthcare.

Let’s go back to the simple case of Clinic A asking Clinic B for some data on a patient. In the API world, Clinic A first needs to configure their system to know where Clinic B “is” on the Internet. Clinic B then needs to grant Clinic A access to make calls into their system — say Clinic A gets a password to do this. They need to save that password somewhere. OK, good enough.

This now has to be repeated dozens or maybe hundreds of times, each time Clinic A wants to get data from some new clinic’s API. To keep things secure, each clinic probably expires its passwords periodically. Oops, things stop working until IT departments get involved. And maybe Clinic B wants to upgrade their software, and the API changes a bit. Dang, that connection to Clinic A is broken.
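
Here’s a back-of-the-envelope sketch of why this spirals, in Python. The function names and clinic counts are made up for illustration. The only assumption on the API side is that every clinic pulling data from another clinic needs its own endpoint setup and credential for it; the second function is a deliberately rough stand-in for the group-based trust idea behind Direct, not a model of how Direct actually works.

```python
def pairwise_connections(clinics: int) -> int:
    """Endpoint-plus-credential setups if every clinic pulls from every other one."""
    return clinics * (clinics - 1)


def group_trust_decisions(clinics: int, trust_groups: int) -> int:
    """Trust decisions if each clinic only decides about a few groups."""
    return clinics * trust_groups


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "clinics:",
              pairwise_connections(n), "point-to-point setups vs.",
              group_trust_decisions(n, 3), "group trust decisions")
```

The point-to-point number grows with the square of the number of clinics, and every one of those setups is something that can expire or break on an upgrade.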

This stuff just spirals and grows into a huge mess. And the API crowd is explicitly ignoring it. As in, ignoring it on purpose by saying “that’s out of scope.”

Of course, all of these problems are “solvable,” but THEY ARE THE PROBLEMS TO BE SOLVED. Not data formats or query patterns.

That was why we spent so much effort on the innards of Direct. C/CDA and other data formats may be stupid and overly complex, but that ultimately doesn’t matter. The hard problem is solving discoverability and auth in the context of an incredibly diverse and distributed ecosystem.

We have a path forward on this. I wish we’d have the discipline to stick to it.


As an interesting aside, the “Connectathon” problem rears its head all over the place … people are increasingly “dishonest” about the state of problems and ideas, e.g.:

  • How many news articles do you see that talk about a “cure for Alzheimer’s” … but when you look closer they just have discovered an interesting mechanism that maybe in fifteen years after a ton more work could be useful?
  • How many awesome new gadgets do you see talked about on the web as if they exist … but when you try to buy it you find they’re just a Kickstarter trying to raise money for an idea they haven’t even prototyped yet?
  • …. And on and on … we are increasingly awash in a world that pretends neat ideas are finished solutions. And increasingly I’m a crabby old guy, kind of pissed off about it. 😉

Natural “vaccination”, or, another thing we seem to have gotten wrong.

My daughter just turned twenty-one a few months ago. Watching her plow through a plate of Queso Espanoles (with Sangria) at one of her favorite spots near school, you’d never believe that as a toddler she couldn’t get near anything even resembling dairy. At about two years old, a single piece of buttered popcorn found between the sofa cushions just about closed up her throat.

There have always been people with allergies. But it’s clear to even the most casual observer that things are insane today compared to the past. Sorting out the nut, dairy, gluten, strawberry, etc. allergies is just an expected part of kid gatherings these days. So WTF is going on?

When our kids were little, the accepted recommendation was to strictly avoid common allergens until at least age six. We thought that exposure to this stuff too early was the, or at least one, cause of allergies. And boy did we listen — after all, I still can’t believe the world trusted me to raise an actual human — I wasn’t going to screw it up.

Whoops. Or at least, maybe whoops.

The scintillatingly-titled Randomized Trial of Peanut Consumption in Infants at Risk for Peanut Allergy, published last Monday in the New England Journal of Medicine, seems pretty clearly to show that we were exactly, 100% wrong with this approach.

Let’s make things really simple. Basically, one group of kids got peanuts in their diet, and the other did not. The peanut kids developed allergies 3.2% of the time. The non-peanut kids? 17.2%.

Holy crap, are you kidding me? By avoiding peanuts, risk of allergy increased by 14 percentage points, to more than five times the rate in the peanut group. THAT IS A SCARY-HUGE DIFFERENCE.

Of course, there are lots of ways to pick at studies, and you can do that here too. But even if you seriously handicap things (say, assume that every kid that fell away from the study had the “wrong” reaction: all peanut kids got allergies, all non-peanut kids did not), the numbers are still impressive at 4.8% to 16.8%. You may get some bias out of the lack of diversity in the kids, or because the families knew which cohort they were in. But the difference is so significant, it’s hard to imagine any of those would flip the results completely.
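
If you want to check the arithmetic yourself, here’s a tiny Python sketch. It only uses the percentages quoted above; the function and key names are mine, and this is plain arithmetic on the headline rates, not a re-analysis of the trial.

```python
def compare_risks(exposed_pct: float, avoided_pct: float) -> dict:
    """Compare allergy rates (in percent) between peanut eaters and avoiders."""
    return {
        # gap between the two rates, in percentage points
        "absolute_difference_points": avoided_pct - exposed_pct,
        # how many times higher the rate is among avoiders
        "risk_ratio": avoided_pct / exposed_pct,
        # relative drop in risk for kids who ate peanuts, in percent
        "relative_reduction_pct": (1 - exposed_pct / avoided_pct) * 100,
    }


if __name__ == "__main__":
    for label, exposed, avoided in [("as reported", 3.2, 17.2),
                                    ("worst case", 4.8, 16.8)]:
        r = compare_risks(exposed, avoided)
        print(f"{label}: {r['absolute_difference_points']:.1f} points, "
              f"{r['risk_ratio']:.1f}x, "
              f"{r['relative_reduction_pct']:.0f}% relative reduction")
```

Even in the handicapped scenario, the avoiders end up with about three and a half times the allergy rate of the peanut kids.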

In retrospect, this makes sense. By sensitizing the immune system to various substances at low doses, you give it a chance to learn. Cells that freak out in the presence of these pseudo-antigens will be suppressed by normal selection processes, so they aren’t given a chance to expand in the first place.

The longer I’m around this stuff, the more I believe that adaptive immunity really is the key to almost everything. Hopefully these results will be duplicated and we can reverse what is otherwise a pretty unsettling trend.

As for our daughter … sorry Alex, my bad.

PS. All this reminds me of that awesome Canadian professor who postulated that eating boogers could help boost the immune system. Yum!

R + immunoSEQ = Awesome!

We are really proud of the immunoSEQ Analyzer. It’s a great tool with a ton of charts and visualizations specifically targeted at immunosequencing. And since the field is new to many folks, it’s great to have an environment where we can share the approaches and techniques we’ve developed over the years. If you’re an Adaptive customer and you’re not using it, you’re missing out.

But at the same time, there are a zillion analysis and visualization tools out there, and each of them has their own strengths. Our goal is to make sure that immunoSEQ data shines in all of those environments, because better tools == faster discoveries == better treatments for real people.

I’m really excited today to show off our latest such integration. “rSEQ” is an installable package that makes it super-easy to work with immunoSEQ data with R. You can read all the gory details in this technote (free Analyzer login required), and I’ll walk through a simple example here so you can get an idea of just how awesome it is.

You’ll need the popular devtools package in order to install rSEQ. The technote provides some details on this, but in its simplest form all you need to do is:

  • install.packages("devtools")
  • library(devtools)
  • install_url("")

Now you’re ready to create an authenticated session to immunoSEQ. For this, you’ll use the same credentials you use to log into the web site; as in:

  • library(rSEQ)
  • rseq = rSEQ_init('','yourpassword')

For this demo, though, we’ll use the public immunoSEQ demo account — so even if you don’t have your own immunoSEQ account, you can follow along:

  • library(rSEQ)
  • rseq = rSEQ_initDemo()

We hold onto that “rseq” variable because it’s the token that we’ll use in future calls into immunoSEQ. Now let’s look at a few calls:

  • w = rSEQ_workspaces(rseq)
  • s = rSEQ_samples(rseq, w[1,"Workspace.ID"])

Snipping from my RStudio environment window, you see that w and s both contain data frames. w lists all the workspaces I have access to (just one for the demo account), and s is the list of all samples in that workspace (the rSEQ_samples call takes a workspace ID, which I’ve cut out of the workspace data frame above).


Once I have samples with results, I can fetch the actual sequence data for those samples, and pretty easily do some neat things. One of our most common real-world use cases is tracking “minimal residual disease” in blood cancers: in short, identifying the dominant clones in a pre-treatment sample, then watching post-treatment to make sure that clone is (mostly) gone. Our demo workspace includes a few MRD cases, so let’s build a really simple clone tracker that identifies those dominant clones and then shows before and after values.

To be sure, this is a simplistic and only marginally-useful example — it only includes one “after” sample, and uses a very naïve threshold to decide what is “dominant.” But nonetheless, it’s a pretty sweet example of what you can do with immunoSEQ data in R with a VERY small amount of code.

First, let’s download the sequence data for a pair of before and after sequences. I’ll cut out sample IDs by name to do this:

  • seqDiag = rSEQ_sequences(rseq, s[s$Name=="TCRG_MRD_Pre_Case1","Sample.ID"])
  • seqMRD = rSEQ_sequences(rseq, s[s$Name=="TCRG_MRD_Day29_Case1","Sample.ID"])

These guys are downloading a lot of data, so be patient! (Fun exercise for the reader — why does seqMRD take so much longer to download than seqDiag? I’m not telling, you’ll have to figure that out for yourself.)

Go ahead and take a look inside these guys … there is a TON of useful information in there, ranging from the nucleotide and amino acid strings, to V and J gene usage, N insertions, and much more. These are the same columns that our “Sample Export” feature returns.

Right away, you can do some neat things; try a simple histogram of CDR3 lengths for the Diagnostic sample:

  • hist(seqDiag[,"cdr3Length"])

Pretty neat. But back to MRD. What we’re interested in here is just finding the clones that make up more than 5% of the repertoire pre-treatment. Paste this function into your R session:

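  cloneTracker = function(seqDiag, seqMRD, thresholdPct) {
    # keep only the diagnostic clones above the threshold (frequencyCount is a percentage)
    diags = seqDiag[seqDiag$frequencyCount.... > thresholdPct,]
    # join those clones to the MRD sample on the nucleotide sequence
    track = merge(diags, seqMRD, by = "nucleotide")
    track = track[,c("nucleotide","frequencyCount.....x","frequencyCount.....y")]
    # one column per clone, one row per time point
    ct = t(data.matrix(track[,2:3]))
    colnames(ct) = track[,1]
    rownames(ct) = c("Diag","MRD")
    attr(ct, "thresholdPct") = thresholdPct
    return(ct)
  }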

Then call it, passing in your sequences and a threshold of 5%:

  • ct = cloneTracker(seqDiag, seqMRD, 5)

This function first selects out the relevant clones from the diagnostic sample, joins those rows with the MRD sample and then massages them into a simple table that shows before and after frequencies for all identified clones (two in this case):


Cool! Even better is showing these guys in a simple time series plot, using this function:

  plotCloneTracker = function(ct) {
    # one line per clone, running from the Diag value to the MRD value
    plot.ts(ct, plot.type="single", col=2:(ncol(ct)+1), axes=FALSE,
      ylab="Frequency Count (%)", xlab="",
      main=paste("Simple Clone Tracker (",ncol(ct)," clones)",sep=""))
    axis(1, at=1:2, lab=c("Diag","MRD"))
    # horizontal reference line at the threshold
    abline(attr(ct, "thresholdPct"), 0, col=1)
  }

And of course, call it:

  • plotCloneTracker(ct)


FANCY! And in case it’s not obvious, I’m no R virtuoso by a longshot. I am really, really excited to see what our research customers can do with this stuff. Have a great function or plot? Let me know and I would love to share it here.

So. Much. Fun. And just wait until you see what’s coming next. Until then!

Flu Season, ugh ….

What better time to ponder the flu than during a winter plane trip, sitting next to a hipster 20-year-old you’ve never met coughing and sleeping on your shoulder? At least I had an aisle seat. And New Year’s with my new nephew and family was super-fun. And I have Purell — lots of Purell.

Anyways, it is that time of year, and according to the CDC it’s starting to look pretty ugly here in the USA. The key challenge seems to be that the annual vaccine isn’t working as well as it typically does. This isn’t a reason not to get a shot (it still appears at least 40-50% effective), but it is interesting. We create the Northern Hemisphere’s vaccine cocktail each year by analyzing what happened during our summer in the Southern Hemisphere. Unfortunately, one “flavor” of the flu has mutated too much since then — antibodies generated from H3N2 in the vaccine can’t fight off the strain that’s now in the wild.


Our ongoing war with Influenza is pretty amazing, actually. There’s new drama every year as public health teams try to keep ahead of a diverse and rapidly-changing enemy. My first exposure to the cycle was during the “swine flu” freakout in 2009-2010, when we worked with Emory to build tools to help people assess the best course of action for their families. But only now through my work at Adaptive am I starting to really grok the mechanics of what is going on.

Not surprisingly, it’s all about receptors again. A bunch of spiky receptors stick out from each flu “virion” (the cool name for a unit of virus). About 80% of them are hemagglutinin (“HA”), which has the job of binding to and breaking into the host cell where it can reproduce. The remaining 20% are neuraminidase (“NA”), which helps it break back out of the cell to expand the infection. ***

There’s pretty much a perfect storm of factors making the flu so successful:

  • HA binds to cells that express sialic acid, which is a whole ton of our cells, especially in the respiratory system. So no matter how it enters your body, it’s likely to find a match.
  • As an RNA virus, flu mutates more quickly than our DNA can — so it has an advantage in the evolutionary arms race (this mutation-based evolution is called “antigenic drift”).
  • A host that contracts two flu strains at once can easily end up producing a brand new combination of H and N thanks to genetic reassortment… this “antigenic shift” is much more dramatic than drift and typically is the cause of pandemic events.
  • The mechanisms behind the flu work in other mammals too, and while it’s unusual for viruses to “jump” between species, it does happen — so we’ve got a bunch of helpers transmitting influenza strains around the world.

Because there are so many different strains of flu (with more being created all the time), it’s hard to create a vaccine that can hit them all. You have to create the right cocktail, and you have to predict it pretty far in advance so that we can make enough to cover the population.

So what about Tamiflu? More evidence that science is pretty cool, and at the same time that we don’t know that much. Vaccination aside, we haven’t yet been successful at stopping these viruses once they get inside the body. But remember the “NA” side of the equation! Neuraminidase enables manufactured flu virions to escape the host cell they were created in, so they can move around the body and infect more cells. Tamiflu and its ilk inhibit the action of neuraminidase, which limits just how much the infections can grow before our adaptive immune system finally catches up, shortening the duration and reducing the severity of symptoms. Cool stuff. It feels a little lame to be running along behind the virus trying to slow it down, but I’ll take it over nothing! The WHO tracks effectiveness of these NA inhibitors with various strains so that we can use them optimally.

I’m not sure yet if learning these details makes me more or less of a germophobe! But the complexity of the system is amazing, and I’m proud to be going into 2015 fighting for the good guys. Bottom line — get your shot, wash your hands, and stay home if you get sick. Always comes back to the basics.

Take care!

*** As an aside, the “H” and “N” here are why we call strains “H1N1”, “H3N2”, and so on. There’s something like eighteen basic classes of hemagglutinin (three that impact humans), and nine for neuraminidase. But like all things in immunology, it’s not that simple. Even with the same H and N, strains can act very differently. For example, our current “2009” version of H1N1 is fundamentally different from the H1N1 that had come before, so nobody was ready for it. There’s a whole international classification convention for this stuff.