R + immunoSEQ = Awesome!

We are really proud of the immunoSEQ Analyzer — it’s a great tool with a ton of charts and visualizations specifically targeted at immunosequencing. And since the field is new to many folks, it’s great to have an environment where we can share the approaches and techniques we’ve developed over the years. If you’re an Adaptive customer and you’re not using it — you’re missing out.

But at the same time, there are a zillion analysis and visualization tools out there, and each of them has its own strengths. Our goal is to make sure that immunoSEQ data shines in all of those environments, because better tools == faster discoveries == better treatments for real people.

I’m really excited today to show off our latest such integration. “rSEQ” is an installable package that makes it super-easy to work with immunoSEQ data in R. You can read all the gory details in this technote (free Analyzer login required), and I’ll walk through a simple example here so you can get an idea of just how awesome it is.

You’ll need the popular devtools package in order to install rSEQ. The technote provides some details on this, but in its simplest form all you need to do is:

  • install.packages("devtools")
  • library(devtools)
  • install_url("https://clients.adaptivebiotech.com/assets/misc/rSEQ.zip")


Now you’re ready to create an authenticated session to immunoSEQ. For this, you’ll use the same credentials you use to log into the web site, like so:

  • library(rSEQ)
  • rseq = rSEQ_init('you@example.com','yourpassword')


For this demo, though, we’ll use the public immunoSEQ demo account — so even if you don’t have your own immunoSEQ account, you can follow along:

  • library(rSEQ)
  • rseq = rSEQ_initDemo()


We hold onto that “rseq” variable because it’s the token that we’ll use in future calls into immunoSEQ. Now let’s look at a few calls:

  • w = rSEQ_workspaces(rseq)
  • s = rSEQ_samples(rseq, w[1,"Workspace.ID"])


Snipping from my RStudio environment window, you see that w and s both contain data frames. w lists all the workspaces I have access to (just one for the demo account), and s is the list of all samples in that workspace (the rSEQ_samples call takes a workspace ID, which I’ve cut out of the workspace data frame above).

[Screenshot: RStudio environment pane showing the w and s data frames]
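
If you want to poke at these outside of RStudio, a couple of base R one-liners will do the trick (just a quick sketch; it only uses columns that already show up in the calls in this post):

  dim(w)                              # workspaces x columns (one row for the demo account)
  head(w)                             # includes the Workspace.ID used above
  dim(s)                              # every sample in that workspace
  head(s[, c("Sample.ID", "Name")])   # the two columns we'll use again below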

Once I have samples with results, I can fetch the actual sequence data for those samples, and pretty easily do some neat things. One of our most common real-world use cases is tracking “minimal residual disease” in blood cancers — in short, identifying the dominant clones in a pre-treatment sample, then watching post-treatment to make sure that clone is (mostly) gone. Our demo workspace includes a few MRD cases, so let’s build a really simple clone tracker that identifies those dominant clones and then shows before and after values.

To be sure, this is a simplistic and only marginally-useful example — it only includes one “after” sample, and uses a very naïve threshold to decide what is “dominant.” But nonetheless, it’s a pretty sweet example of what you can do with immunoSEQ data in R with a VERY small amount of code.

First, let’s download the sequence data for a pair of before and after samples. I’ll look up their sample IDs by name to do this:

  • seqDiag = rSEQ_sequences(rseq, s[s$Name=="TCRG_MRD_Pre_Case1","Sample.ID"])
  • seqMRD = rSEQ_sequences(rseq, s[s$Name=="TCRG_MRD_Day29_Case1","Sample.ID"])


These guys are downloading a lot of data, so be patient! (Fun exercise for the reader — why does seqMRD take so much longer to download than seqDiag? I’m not telling, you’ll have to figure that out for yourself.)

Go ahead and take a look inside these guys … there is a TON of useful information in there, ranging from the nucleotide and amino acid strings, to V and J gene usage, N insertions, and much more. These are the same columns that our “Sample Export” feature returns.
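
If you want to see exactly what came down, a couple more one-liners (again just a sketch, using base R and the seqDiag frame from above):

  dim(seqDiag)               # rows are sequences, columns are the export fields
  colnames(seqDiag)          # nucleotide, amino acid, V/J genes, and so on
  head(seqDiag$nucleotide)   # a few of the raw nucleotide strings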

Right away, you can do some neat things; try a quick histogram of CDR3 lengths for the Diagnostic sample:

  • hist(seqDiag[,"cdr3Length"])


Pretty neat. But back to MRD. What we’re interested in here is finding the clones that make up more than 5% of the repertoire pre-treatment. Paste this function into your R session:

  cloneTracker = function(seqDiag, seqMRD, thresholdPct) {
    # pick out the diagnostic clones above the threshold; "frequencyCount...."
    # is the frequency (%) column as it comes through after R cleans up the
    # exported header name
    diags = seqDiag[seqDiag$frequencyCount.... >
      thresholdPct,c("nucleotide","frequencyCount....")]
    # join on the nucleotide sequence to find those same clones in the MRD sample
    track = merge(diags, seqMRD, by = "nucleotide")
    track = track[,c("nucleotide","frequencyCount.....x","frequencyCount.....y")]
    # reshape into a 2-row matrix: one row per timepoint, one column per clone
    ct = t(data.matrix(track[,2:3]))
    colnames(ct) = track[,1]
    rownames(ct) = c("Diag","MRD")
    # stash the threshold so the plotting function can draw it later
    attr(ct, "thresholdPct") = thresholdPct
    return(ct)
  }

Then call it, passing in your sequences and a threshold of 5%:

  • ct = cloneTracker(seqDiag, seqMRD, 5)


This function first selects out the relevant clones from the diagnostic sample, joins those rows with the MRD sample and then massages them into a simple table that shows before and after frequencies for all identified clones (two in this case):

[Table: before and after frequencies for the tracked clones]

Cool! Even better is showing these guys in a simple time series plot, using this function:


  plotCloneTracker = function(ct) {
    # one line per clone (the columns of ct), drawn on a single set of axes
    plot.ts(ct, plot.type="single", col=2:(ncol(ct)+1), axes=FALSE,
      ylab="Frequency Count (%)", xlab="",
      main=paste("Simple Clone Tracker (",ncol(ct)," clones)",sep=""))
    box()
    axis(2)
    axis(1, at=1:2, lab=c("Diag","MRD"))
    # horizontal line marking the threshold we used to pick the clones
    abline(attr(ct, "thresholdPct"), 0, col=1)
  }

And of course, call it:

  • plotCloneTracker(ct)


[Plot: simple clone tracker showing Diag and MRD frequencies for each clone]

FANCY! And in case it’s not obvious, I’m no R virtuoso by a long shot. I am really, really excited to see what our research customers can do with this stuff. Have a great function or plot? Let me know and I would love to share it here.

So. Much. Fun. And just wait until you see what’s coming next. Until then!

Flu Season, ugh ….

What better time to ponder the flu than during a winter plane trip, sitting next to a hipster 20-year-old you’ve never met coughing and sleeping on your shoulder? At least I had an aisle seat. And New Year’s with my new nephew and family was super-fun. And I have Purell — lots of Purell.

Anyways, it is that time of year, and according to the CDC it’s starting to look pretty ugly here in the USA. The key challenge seems to be that the annual vaccine isn’t working as well as it typically does. This isn’t a reason not to get a shot (it still appears at least 40-50% effective), but it is interesting. We create the Northern Hemisphere’s vaccine cocktail each year by analyzing what happened during our summer in the Southern Hemisphere. Unfortunately, one “flavor” of the flu has mutated too much since then — antibodies generated from H3N2 in the vaccine can’t fight off the strain that’s now in the wild.


Our ongoing war with Influenza is pretty amazing, actually. There’s new drama every year as public health teams try to keep ahead of a diverse and rapidly-changing enemy. My first exposure to the cycle was during the “swine flu” freakout in 2009-2010, when we worked with Emory to build tools to help people assess the best course of action for their families. But only now through my work at Adaptive am I starting to really grok the mechanics of what is going on.

Not surprisingly, it’s all about receptors again. A bunch of spiky receptors stick out from each flu “virion” (the cool name for a unit of virus). About 80% of them are hemagglutinin (“HA”), which has the job of binding to and breaking into the host cell where the virus can reproduce. The remaining 20% are neuraminidase (“NA”), which helps it break back out of the cell to expand the infection. ***

There’s pretty much a perfect storm of factors making the flu so successful:

  • HA binds to cells that express sialic acid — which is a whole ton of our cells, especially in the respiratory system. So no matter how it enters your body, it’s likely to find a match.
  • As an RNA virus, the flu mutates much more quickly than our DNA-based cells do — so it has an advantage in the evolutionary arms race (this mutation-based evolution is called “antigenic drift”).
  • A host that contracts two flu strains at once can easily end up producing a brand new combination of H and N thanks to genetic reassortment… this “antigenic shift” is much more dramatic than drift and typically is the cause of pandemic events.
  • The mechanisms behind the flu work in other mammals too, and while it’s unusual for viruses to “jump” between species, it does happen — so we’ve got a bunch of helpers transmitting influenza strains around the world.

Because there are so many different strains of flu (with more being created all the time), it’s hard to create a vaccine that can hit them all. You have to create the right cocktail, and you have to predict it pretty far in advance so that we can make enough to cover the population.

So what about Tamiflu? More evidence that science is pretty cool, and at the same time that we don’t know that much. Vaccination aside, we haven’t yet been successful at stopping these viruses once they get inside the body. But remember the “NA” side of the equation! Neuraminidase enables newly manufactured flu virions to escape the host cell they were created in, so they can move around the body and infect more cells. Tamiflu and its ilk inhibit the action of neuraminidase, which limits just how far the infection can grow before our adaptive immune system finally catches up — shortening the duration and reducing the severity of symptoms. Cool stuff. It feels a little lame to be running along behind the virus trying to slow it down, but I’ll take it over nothing! The WHO tracks effectiveness of these NA inhibitors with various strains so that we can use them optimally.

I’m not sure yet if learning these details makes me more or less of a germophobe! But the complexity of the system is amazing, and I’m proud to be going into 2015 fighting for the good guys. Bottom line — get your shot, wash your hands, and stay home if you get sick. Always comes back to the basics.

Take care!

*** As an aside, the “H” and “N” here are why we call strains “H1N1”, “H3N2”, and so on. There’s something like eighteen basic classes of hemagglutinin (three that impact humans), and nine for neuraminidase. But like all things in immunology, it’s not that simple. Even with the same H and N, strains can act very differently — for example, our current “2009” version of H1N1 is fundamentally different from the H1N1 that had come before, so nobody was ready for it. There’s a whole international classification convention for this stuff.

The Amazing that is PD-1 Blockade

The office is clearly closing up shop for the holiday; this’ll be my last task before heading out myself for a few days. I actually really like these quiet days at work; I can often pick up a few of the long-running side projects that are so hard to squeeze in when everybody’s here. This week it’s been some integration work with Illumina’s BaseSpace service, which is honestly pretty sweet.

[Image: Nature cover, Vol. 515, No. 7528]

End of the year holiday time is also always good for a bit of reflection, and I’ve been appreciating the opportunity I’ve had to start working here with the folks at Adaptive. In particular, a paper was just published by a customer that has me shaking my head at how incredible this space is.

PD-1 Blockade drugs like Nivolumab are the current face of immunology — these incredible therapies seem to be showing more success against cancers than anything we’ve tried in years. Rather than just trying to knock out tumor cells with radiation or toxic chemicals, PD-1 blockades unleash our own immune systems to do the job. Here’s the deal:

Our T- and B-Cells express a particular protein called “programmed cell death 1.” Like other receptors, PD-1 is anchored in the cell kind of like a blade of grass sits in the ground, with part of it inside the cell membrane, and the rest dangling outside.

PD-1’s job is to slow down our immune response. It waits for special proteins (PD Ligands) to float by and bind to its receptor end, which can result in one of two behaviors. Normal immune cells just commit suicide; regulatory T-Cells do the opposite and get busy. The net effect is that when PD-1 is activated, your immune system starts to go quiet.

This is normally a good thing — for example, PD-1 keeps our immune system from attacking itself. But many types of cancer have “figured out” the game; they artificially accelerate the production of PD ligands themselves! This is amazing to think about — the cancers have actually evolved to suppress our own immune system so that we can’t fight back.

Once you see how this works, PD-1 blockade therapies seem pretty obvious — create a drug that stops PD-1 from binding with its ligands, and the immune system is freed up to go nuts. And, holy crap, this actually works!

But it’s also expensive, and it doesn’t always work. That’s where Adaptive comes in: we’ve now shown that the clonality* of a person’s immune repertoire can predict their response to PD-1 blockade. This makes sense if you think about it — all the drug is doing is opening the gates; there has to be an immune response ready to fight in the first place.

Well, it turns out that we basically invented the technology that can measure immune system clonality (amongst other things) using next-generation sequencing. Anybody else see value in a quick lab test that predicts the effectiveness of a miracle drug costing hundreds of thousands of dollars?

Me too. WOOOOOO HOOO!

* When your immune system is gearing up to fight a particular bad guy, it creates millions of copies or “clones” of the specific sequences that recognize just that antigen. This is quite different from when you’re healthy — your immune system then is much more diverse, with smaller amounts of lots of different sequences, all waiting for invaders to show up.

Context is king!

Our marketing materials often quote the fact that our database contains billions of unique T- and B-Cell sequences. Seriously, that is an insane number. But in isolation, it can also be misleading. After all, your body pretty much creates sequences at random in its attempt to find ones that work — and a billion rolls of a die don’t tell you very much.

The magic happens when you add context to these sequences — tracking diagnoses and outcomes, demographics, medication use, other markers, really anything that can help complete the whole picture. This stuff combined with our sequence data is what’s (really) changing the world.

In fact, one of the toughest things about good science is picking the right attributes — traditionally they’re expensive and complex to track, and lots of them don’t even matter. This is the reason we try to help our customers with Analyzer features like projects and tagging. And over the next few years, we’ll continue to add more and more features that make it easier and quicker to figure out exactly what matters most in immunology.

Zooming out, though — what if it wasn’t so hard to collect and track this context? What if we could just track “everything” about our subjects and then use statistics to automatically figure out which attributes matter? Yes, this is the promise of big data we all love to talk about, but there are many more barriers to making it real than just more and bigger computers.

Companies like Patients Like Me and 23andMe have taken a really novel approach to this challenge. What if we enlist the subjects themselves to contribute data over time, from lots of different sources? What if researchers could re-contact those subjects to ask them new questions along the way? And what if the subjects gave consent for lots of different folks to use their information in flexible and informal ways, freeing up at least a little work from the slow-moving IRB process?

The tradeoff is a fascinating one — are we better off using small bits of highly reliable and curated data, or tons and tons of data that we know is noisy? Well, actually it’s not a tradeoff at all. So long as you know what you’re working with, both can be incredibly productive ways to increase our understanding of the world.

We think about this a lot, and are making real investments to help us all better understand the adaptive immune system. Thanks for joining us on this really, really fun ride.

The definition of insanity…. works!

Not everything we do here is about fancy biology; sometimes it’s about fancy web engineering. Late last week was a good example — my favorite bug since starting at Adaptive. Fair warning, this post ranks pretty high on the geek scale.

Nothing hurts my stomach more than knowing my systems are misbehaving in some way I can’t explain. I just don’t get how folks can sit by and just ignore this; it’s way too much of a threat to my ego. Screw you, Skynet — I tell YOU what to do!

Anyways, here’s the setup. Quite frequently — not enough to reproduce it in a debugger, but often enough that we were getting a steady stream of user complaints — our web servers were sending garbled responses. This manifested in a bunch of different ways. Sometimes the browser would just render a bunch of un-interpreted HTML. Other times it would screw up AJAX logic and just make the pages act wonky. It wasn’t clear at first that these were all the same thing — it just felt like the site was on fire, and we had no obvious leads to work from. But we had just propped new code before this started happening, so of course that was the obvious target.

If you want to get good at debugging, especially in distributed systems, here is the #1 thing you have to remember: KEEP LOOKING. Our local hardware store has one of those big signs they put pithy statements on, and one of their favorites is “The definition of insanity is doing the same thing and expecting a different result.” At least inasmuch as it applies to debugging, this is crap.

Again and again, it’s been made clear to me that good debuggers are the ones that keep looking at the data over, and over, and over, until the patterns finally pop out. Most people peter out and say “that’s impossible” or “there’s nothing to see here” … and that is simply WRONG. The pattern is always hiding in there somewhere, and if you keep looking you will find it.

In this case, I looked at the same logs dozens of times, and followed a bunch of dead ends, before the pattern finally peeked out. Not exactly at the same time, but really close to it, we were always seeing “HEAD” requests to the server right around the calls that would fail. I ignored these for hours because they shouldn’t have made any difference. But…..

OK, here’s where things get super-nerdy. Starting way at the beginning … your web browser talks to web servers using something called “HTTP” or Hypertext Transfer Protocol. In a nutshell, the first version of HTTP worked like this:

  1. The browser opens up a connection to the server computer. This is like dialing a phone and having the server answer.
  2. The browser sends a message over the connection that says “I’d like your homepage, please.”
  3. The server sends the HTML code that represents the site’s homepage and then hangs up the connection.

This worked great, except that step #1 was kind of slow — typically a browser will need to request not just one but many different pages and resources from the server, so “redialing” over and over was wasteful. So the protocol was updated with something called “keep-alive”, in which case the connection is kept open and used for multiple requests.

But this presented a small problem. The only way the browser knew the page was “done” was by noticing that the server had hung up the connection. If that connection stays open, how does the client figure this out? Very simply — in this new version, the server tells the browser how much data it’s going to send (there’s a wire-level sketch after the list):

  1. The browser opens up a connection to the server computer.
  2. The browser asks for page #1.
  3. The server says “ok, this page is 4,000 bytes long. Here you go.” And then sends the data.
  4. The browser reads out those 4,000 bytes and then, using the same connection, asks for page #2.
  5. The server says “ok, this one is 2,000 bytes long. Here you go.” And so on.
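
For the curious, here’s roughly what that looks like on the wire (a simplified, invented illustration; the paths and byte counts are made up):

  GET /page1 HTTP/1.1
  Host: example.com
  Connection: keep-alive

  HTTP/1.1 200 OK
  Content-Length: 4000

  ...4,000 bytes of HTML...

  GET /page2 HTTP/1.1
  Host: example.com
  Connection: keep-alive

  HTTP/1.1 200 OK
  Content-Length: 2000

  ...2,000 bytes of HTML...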

This is way more efficient. OK, so file that one away for a moment.

Another feature of HTTP is that the browser can ask for data in a few ways. The most common is “GET”, which just asks the server to send the data for the page, thank you very much. But sometimes the browser doesn’t need the actual data for a page; it just needs to see if it’s still there and check if it’s changed since the last time it looked. For this, it can make a “HEAD” request. The HEAD request works like this (there’s a simplified wire sketch right after the list):

  1. The browser opens up a connection to the server computer, like normal.
  2. The browser makes a “HEAD” request for page #1.
  3. The server says “ok, this page is 4,000 bytes long, and it last changed on 12/1/2014.” But it doesn’t send the actual data … just general information like the size of the page.
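
In wire terms (again, a simplified and invented illustration), a well-behaved HEAD exchange looks like this; note there is no body after the response headers:

  HEAD /page1 HTTP/1.1
  Host: example.com
  Connection: keep-alive

  HTTP/1.1 200 OK
  Content-Length: 4000
  Last-Modified: Mon, 01 Dec 2014 08:00:00 GMT

  (no body follows; the connection just goes quiet until the next request)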

These two concepts — “keep-alive” and “HEAD vs. GET” — were the key to this bug.

Last setup: our app is built on an open-source technology called the “Play Framework.” Play helps us match up browser requests to code, makes it easier to build pages, blah blah … not very important here. But what *is* important is that we don’t expose the Play application directly to browsers. We use a common technique called “proxying” that isolates the system a bit from the Internet. We do this with another open-source tool called the Apache web server. So our setup looks like this:

  1. Browser makes an HTTP request to Apache.
  2. Apache “relays” this request to Play.
  3. Play responds to Apache.
  4. Apache sends the response back to the browser.

[Diagram: the browser / Apache / Play request flow described above]

The key here is that those connections between Apache and Play just use plain old HTTP. And they use keep-alives, so that many different browser requests can “reuse” the same proxy connection between Apache and Play.

Back to those HEAD requests. When a browser makes one, Apache dutifully relays it to Play. And FINALLY, here is the bug: Play was answering “ok, this page is 4,000 bytes long, and it last changed on 12/1/2014.” BUT IT WAS ALSO SENDING THE PAGE DATA, even though this was a HEAD request. This is a violation of the HTTP protocol! So after Apache read off the first part, it just stopped reading, which left all the other stuff waiting, unread, in the connection buffer.

But remember, because of keep-alive, that connection is still open. So the NEXT time that a browser asks for a page, Apache again dutifully relays it to Play over that connection, and then tries to read the response. But because it never read out the contents from the first request, all it sees is what now looks like a bunch of garbage!
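
Putting the two together, the broken sequence looked something like this (illustrative only; the real traffic was our app’s pages and the details here are made up):

  HEAD /page1 HTTP/1.1             <- relayed by Apache to Play

  HTTP/1.1 200 OK
  Content-Length: 4000
  <html> ...the 4,000 bytes Play should NOT have sent...   <- left unread in the buffer

  GET /page2 HTTP/1.1              <- next request reuses the same connection

  <html> ...                       <- Apache starts reading here expecting
                                      "HTTP/1.1 200 OK" and sees garbage instead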

From here on out things can go a bunch of different ways, depending on the specific garbage that is sent back. But it doesn’t really matter, the damage is done. Until that connection gets reset, every browser request that uses it ends up being wonked up.

And guess what? This bug has been sitting in our code since the site launched, long before I even started working at Adaptive. But it was never really exposed, because HEAD requests are generally pretty rare. As it turns out, our operations team had (ironically) just turned on a new monitoring tool that, quite legitimately, used HEAD as one of its ways to see if the site was working properly. So the bug had nothing to do with that code prop. It was classic Heisenberg.

DAMN, SON. That was a long way to go for a stupid little bug.

But there was a point, and it’s worth saying again: KEEP LOOKING. Look at the logs, again. Try running the same request, again. Look at source, again. Look at network traces, again. Look at the code, again. It is the only way to break some of these logjams. Eventually, you will pick out the pattern.

If you’re good at this — I will hire you in a millisecond. You’re gold.

We made a thing!

[Photo: the kit box]

I’ve had a lot of jobs, but they’ve always been about building software or services — virtual stuff. Drugstore.com was fun because we shipped actual things, and visiting the warehouse was like a super-cool playground of awesome machines (pick-to-light was my absolute fav). But I’ve never actually been a part of making real, physical things for sale — until last week!

Check out our announcement of the immunoSEQ (TM!) hsTCRB kit. “hsTCRB” is apparently obvious secret code for “human T-cell receptor, beta chain,” i.e., the first of many versions of the assay that we’ll be selling in this form for research use.

This is cool because it basically explodes the volume of tests we can do. Traditionally, we’ve run a service business — folks send us physical samples (blood, tissue, etc.) and our lab deals with everything from there — DNA extraction and concentration, both amplification steps and sequencing. Only then does my team jump in, run the data through our processing pipeline and deliver results and visualizations through the immunoSEQ Analyzer.

Running a lab is a big deal — it takes equipment, sequencers, reagents, and perhaps most of all lots of people. I love our lab team and we’ll need them and more of them forever, but we simply wouldn’t be able to scale the physical processes fast and cost-effectively enough to achieve our goals with an exclusively service-based business. Beyond that, lots of institutions just want to do their own chemistry, for reasons ranging from economics to privacy and environmental control.

The kit (mostly) frees us from these physical limitations and lets us scale up digitally, something I know pretty well. All those steps before the data pipeline can be done by our customers; then they use a little utility tool to send the raw sequencer output my way and we’re off to the races. Especially as we transition our pipeline up into the cloud (“on the line,” Steve!) … this gives us near infinite ability to accommodate new customers. Pretty sweet!

(You know what the limiting factor becomes? It’s the dirty secret of big data — bandwidth. Our first kit works exclusively with the Illumina MiSeq, which is a cool but miniature sequencer that generates about 1-4GB of data per run. Internally we mostly use HiSeqs, which generate 250GB of base call data alone. This stuff takes a long time to move! So much so that even on our internal networks we consider data transfer time when we’re forecasting how much we can process. Crazy.)

Anyways. Some fun issues:

  • Hey, the label printed about half a centimeter askew, and now you can’t read the unique number that is the KEY TO THE WHOLE PROCESS.
  • Wait, our customers aren’t all in Seattle. Does relative primer efficiency change at different ambient temperatures? Humidity? Altitude? Better do some tests ….. a LOT of tests.
  • Well, this is an interesting race between customs processing time and dry ice melt time…..
  • Wow, these screenshots we printed into the manual LAST YEAR don’t look right anymore…
  • I’m not actually sure if our supplier has a shelf big enough for this stuff.
  • Expiration date? Hmm. More tests.

And now, finally, our brand new sales team can get out there and sell the heck out of this thing. I’m thinking stocking stuffers for your whole family. Interested? Hit me up!

Big data for Thanksgiving

Got up late, went for a run, helped get things ready for Thanksgiving, ate a TON, and watched a bit of football, all with the entire Nolan West clan in attendance*. Not a bad way to spend my favorite holiday!

Before I doze off again, I wanted to share a quick story from yesterday at work that illustrates just how awesome Adaptive is, and why I am thankful to be part of it. “Big data” at Adaptive isn’t just empty marketing; it’s a tool we use to help real people in incredibly concrete ways.

One of my goals for early in 2015 is to help scale up clonoSEQ, our diagnostic test that helps detect relapse in blood cancer patients. As part of planning this, I checked in with our Director of Translational Medicine to understand the work he does to develop the narrative “interpretations” that accompany the numeric results of these tests.

As with many types of tests, we use thresholds to understand what is meaningful vs. not. In a very rough way, if a particular clone makes up more than 5% of a patient’s immune repertoire, we believe it’s a diagnostic marker for their cancer. When numbers are way higher or lower than this, life is easy. But what about when they’re bouncing right around the cutoff?

We know that there are some sequences that show up in many different people at relatively higher concentrations, just because they’re “easier” for the body to make (less mutation, etc.). Further, repertoires can fluctuate for lots of common reasons that we never even notice. So when we see these borderline sequences, they very well could be signal OR noise.

We use lots of tools to make these distinctions. But one that is incredibly cool is — effectively in real time, we can search the billions of sequences we have seen in the real world to understand if a borderline sequence has been seen in other patients. If it has, odds are high that the sequence is unrelated to their cancer.

Think about that for a second. We’re creating both the technology and the data sets to determine not just what ONE immune system looks like, but thousands and millions. Armed with this information we can start to detect population-level patterns and understand mechanisms that nobody has ever had the remotest chance of seeing before. And that means better diagnostics and treatments in the real world.

So this year, I’m thankful to be a part of a new company that is helping real people in amazing ways.

And of course, for my wicked awesome new nephew Asher!

Hope you all have a great holiday.

* Hmm, now that Ben, Kelly and Asher are in Boulder, does that mean “Nolan West” has new citizens? Still strictly on the wrong side of the Divide, but it’s pretty close. Will have to think on this.