Not everything we do here is about fancy biology; sometimes it’s about fancy web engineering. Late last week was a good example — my favorite bug since starting at Adaptive. Fair warning, this post ranks pretty high on the geek scale.
Nothing hurts my stomach more than knowing my systems are misbehaving in some way I can’t explain. I just don’t get how folks can sit by and just ignore this; it’s way too much of a threat to my ego. Screw you, Skynet — I tell YOU what to do!
Anyways, here’s the setup. Quite frequently — not enough to reproduce it in a debugger, but often enough that we were getting a steady stream of user complaints — our web servers were sending garbled responses. This manifested in a bunch of different ways. Sometimes the browser would just render a bunch of un-interpreted HTML. Other times it would screw up AJAX logic and just make the pages act wonky. It wasn’t clear at first that these were all the same things — it just felt like the site was on fire, and we had no obvious leads to work from. But we had just propped new code before this started happening, so of course that was the obvious target.
If you want to get good at debugging, especially in distributed systems, here is the #1 thing you have to remember: KEEP LOOKING. Our local hardware store has one of those big signs they put pithy statements on, and one of their favorites is “The definition of insanity is doing the same thing and expecting a different result.” At least inasmuch as it applies to debugging, this is crap.
Again and again, it’s been made clear to me that good debuggers are the ones that keep looking at the data over, and over, and over, until the patterns finally pop out. Most people peter out and say “that’s impossible” or “there’s nothing to see here” … and that is simply WRONG. The pattern is always hiding in there somewhere, and if you keep looking you will find it.
In this case, I looked at the same logs dozens of times, and followed a bunch of dead ends, before the pattern finally peeked out. Not exactly at the same time, but really close to it, we were always seeing “HEAD” requests to the server right around the calls that would fail. I ignored these for hours because they shouldn’t have made any difference. But…..
OK, here’s where things get super-nerdy. Starting way at the beginning … your web browser talks to web servers using something called “HTTP” or Hypertext Transfer Protocol. In a nutshell, the first version of HTTP worked like this:
- The browser opens up a connection to the server computer. This is like dialing a phone and having the server answer.
- The browser sends a message over the connection that says “I’d like your homepage, please.”
- The server sends the HTML code that represents the site’s homepage and then hangs up the connection.
This worked great, except that step #1 was kind of slow — typically a browser will need to request not just one but many different pages and resources from the server, so “redialing” over and over was wasteful. So the protocol was updated with something called “keep-alive”, in which case the connection is kept open and used for multiple requests.
But this presented a small problem. The only way the browser knew the page was “done” was by noticing that the server had hung up the connection. If that connection stays open, how does the client figure this out? Very simply — in this new version, the server tells the browser how much data it’s going to send:
- The browser opens up a connection to the server computer.
- The browser asks for page #1.
- The servers says “ok, this page is 4,000 bytes long. Here you go.” And then sends the data.
- The browser reads out those 4,000 bytes and then using the same connection asks for page #2.
- The server says “ok, this one is 2,000 bytes long. Here you go.” And so on.
This is way more efficient. OK, so file that one away for a moment.
Another feature of HTTP is that the browser can ask for data in a few ways. The most common is “GET”, which just asks the server to send the data for the page, thank you very much. But sometimes the browser doesn’t need the actually data for a page, it just needs to see if it’s still there and check if it’s changed since the last time it looked. For this, it can make a “HEAD” request. The HEAD request works like this:
- The browser opens up a connection to the server computer, like normal.
- The browser makes a “HEAD” request for page #1.
- The server says “ok, this page is 4,000 bytes long, and it last changed on 12/1/2014.” But it doesn’t send the actual data … just general information like the size of the page.
These two concepts — “keep-alive” and “HEAD vs. GET” — were the key to this bug.
Last setup: our app is built on an open-source technology called the “Play Framework.” Play helps us match up browser requests to code, makes it easier to build pages, blah blah … not very important here. But what *is* important is that we don’t expose the Play application directly to browsers. We use a common technique called “proxying” that isolates the system a bit from the Internet. We do this with another open-source tool called the Apache web server. So our setup looks like this:
- Browser makes an HTTP request to Apache.
- Apache “relays” this request to Play.
- Play responds to Apache.
- Apache sends the response back to the browser.
The key here is that those connections between Apache and Play just use plain old HTTP. And they use keep-alives, so that many different browser requests can “reuse” the same proxy connection between Apache and Play.
Back to those HEAD requests. When a browser makes one, Apache dutifully relays it to Play. And FINALLY, here is the bug: Play was answering “ok, this page is 4,000 bytes long, and it last changed on 12/1/2014.” BUT IT WAS ALSO SENDING THE PAGE DATA, even though this was a HEAD request. This is a violation of the HTTP protocol! So after Apache read off the first part, it just stopped reading, which left all the other stuff waiting, unread, in the connection buffer.
But remember, because of keep-alive, that connection is still open. So the NEXT time that a browser asks for a page, Apache again dutifully relays it to Play over that connection, and then tries to read the response. But because it never read out the contents from the first request, all it sees is what now looks like a bunch of garbage!
From here on out things can go a bunch of different ways, depending on the specific garbage that is sent back. But it doesn’t really matter, the damage is done. Until that connection gets reset, every browser request that uses it ends up being wonked up.
And guess what? This bug has been sitting in our code since the site launched, long before I even started working at Adaptive. But it was never really exposed, because HEAD requests are generally pretty rare. As it turns out, our operations team had (ironically) just turned on a new monitoring tool that, quite legitimately, used HEAD as one of its ways to see if the site was working properly. So the bug had nothing to do with that code prop. It was classic Heisenberg.
DAMN, SON. That was a long way to go for a stupid little bug.
But there was a point, and it’s worth saying again: KEEP LOOKING. Look at the logs, again. Try running the same request, again. Look at source, again. Look at network traces, again. Look at the code, again. It is the only way to break some of these logjams. Eventually, you will pick out the pattern.
If you’re good at this — I will hire you in a millisecond. You’re gold.