Know your latency: a simple hack using Graphite and Memcache

“You know,” said Arthur, “it’s at times like this, when I’m trapped in a Vogon airlock with a man from Betelgeuse, and about to die of asphyxiation in deep space that I really wish I’d listened to what my mother told me when I was young.”

“Why, what did she tell you?”

“I don’t know, I didn’t listen.”

- The Hitchhiker’s Guide to the Galaxy

Application profiling is good

Figure: Total page latency, measured in milliseconds, from the initial HTTP GET to the window.onload event. Mobile browsers suffer the worst slowdowns due to slower processors and higher network latency.

At Thumbtack, we have a healthy obsession with tracking. We track everything. Hardware usage, network throughput, database queries, logins, signups, template rendering times, email send volume, and much more. During every deploy we closely monitor a wide range of metrics to ensure that the numbers are looking good across the board. When you’re working on a large web application, it’s impossible to have complete test coverage, and we rely on these metrics to help us quickly catch and diagnose problems.

To make all this possible, we first had to build a monitoring system. For this, we took inspiration from Etsy’s statsd tool. We built a similar tracking system called Tycho, written in Python and fully distributable across all our servers in various datacenters, EC2, etc. Like Etsy, we added a front-end to our tracking system that is based on Graphite. Ours is called Observatory, and is a simple wrapper around Graphite that serves a collection of dashboards.
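For readers who haven't used a statsd-style tracker, here is a minimal sketch of what a client does: format a timing as a `name:value|ms` line and send it over UDP, fire-and-forget. It shows the public StatsD line format, not Tycho's actual protocol, which this post does not describe; the host, port, and metric name are assumptions.

```python
import socket


def format_timing(metric: str, millis: float) -> bytes:
    """Build a StatsD-style timing line, e.g. b"homepage.server_time:120|ms"."""
    return f"{metric}:{round(millis)}|ms".encode("ascii")


def send_timing(metric: str, millis: float,
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one timing over UDP; delivery is best-effort by design."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(format_timing(metric, millis), (host, port))
        except OSError:
            pass  # never let metrics break the request being measured


if __name__ == "__main__":
    send_timing("homepage.server_time", 120.4)
```

UDP fits here because a dropped packet only costs one data point, while a blocking call would slow down the very request being timed.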

Profiling the whole application, Javascript included

Now, the interesting part. At Thumbtack, we’ve long tracked our server response times. But what we really wanted to know was how long it took a user to go from an initial request to having a fully rendered page, domReady and all necessary Javascript loaded up from the CDN.

Our answer is to combine Memcached and some jQuery callbacks to track total application responsiveness.

There are three simple steps.

1. Store a unique key + timestamp in Memcache when the server starts responding to a new GET request.
2. When the page is fully rendered, issue an AJAX request containing the same unique key back to the server.
3. Look up the initial timestamp based on the key, and track the total time from GET to page load.

Here’s some pseudo-code.

start_time = now()
request_id = unique_id()
template = new Template('homepage.html')
template.set('request_id', request_id)
http.respond(template.render())
end_time = now()

// now, persist the data so we can grab it later
// assumes a reasonable TTL - 5 minutes
memcache = new Memcache()
memcache.save(request_id, {
     page_type: 'homepage',
     start: start_time,
})

// do any traditional server-side tracking
tracking = new Tracking()
tracking.track('homepage', 'server_time', end_time - start_time)
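
The pseudo-code above can be made concrete. The following Python sketch is one possible version of the first step, assuming an in-memory `TTLStore` as a stand-in for Memcache so that it runs without a server. `TTLStore` and `begin_request` are hypothetical names for illustration, not Thumbtack's code.

```python
import time
import uuid


class TTLStore:
    """In-memory stand-in for Memcache: entries vanish after ttl seconds."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._items = {}

    def save(self, key, value, ttl=300):
        self._items[key] = (self._clock() + ttl, value)

    def fetch(self, key):
        expires_at, value = self._items.get(key, (0, None))
        if self._clock() >= expires_at:
            self._items.pop(key, None)  # expired or never stored
            return None
        return value


def begin_request(store, page_type, clock=time.time):
    """Record when the server started handling a GET; return the key to embed in the page."""
    request_id = uuid.uuid4().hex  # random, unique per page view
    store.save(request_id, {"page_type": page_type, "start": clock()})
    return request_id
```

With a real Memcache client, `save` would become a set with a 300-second expiry; the rest of the flow stays the same.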

Now, in the homepage.html template, we’ll trigger an AJAX call once the page is loaded. Note that you probably don’t need to do this on domReady (though tracking that would certainly be interesting), but on a later event that indicates the page is fully loaded and the user can start interacting with it.

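<script>
// request_id must be rendered into the page by the template; the {{ }} syntax is a placeholder
var request_id = '{{ request_id }}';
$(window).on('load', function() { // replace with something more appropriate to your application
    $.post('/responsiveness', {request_id: request_id});
});
</script>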

The /responsiveness endpoint looks up the timing information we saved previously, then tracks the new timing information:

memcache = new Memcache()
tracking = new Tracking()
data = memcache.fetch(request_id)
// the key may have expired or been evicted; tracking is best-effort, so skip on a miss
if (data) {
    tracking.track(data.page_type, 'total_time_to_window_onload', now() - data.start)
}
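
Because the request ID comes back from the browser, the endpoint should treat it as untrusted and tolerate cache misses. Here is a rough Python sketch of the lookup, assuming a plain dict in place of Memcache and a hypothetical `MAX_PLAUSIBLE_MS` cutoff for discarding outliers such as tabs left loading in the background. Neither is part of the original setup.

```python
import time

MAX_PLAUSIBLE_MS = 5 * 60 * 1000  # assumption: matches the 5-minute TTL


def finish_request(store, request_id, now=None):
    """Return (page_type, elapsed_ms) for a known request ID, else None."""
    now = time.time() if now is None else now
    # pop so a replayed POST is counted once (get + delete with a real client)
    data = store.pop(request_id, None)
    if data is None:
        return None  # expired, evicted, or forged ID: drop silently
    elapsed_ms = (now - data["start"]) * 1000
    if not 0 <= elapsed_ms <= MAX_PLAUSIBLE_MS:
        return None  # clock skew or an implausibly slow load
    return data["page_type"], round(elapsed_ms)


store = {"abc123": {"page_type": "homepage", "start": 1000.0}}
print(finish_request(store, "abc123", now=1001.5))  # ('homepage', 1500)
print(finish_request(store, "abc123", now=1002.0))  # None: already consumed
```

The caller would pass a non-None result on to the tracker as the `total_time_to_window_onload` metric.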

And there you go: a simple hack to track total responsiveness for your application.

Why Memcache? We chose Memcache instead of another data storage option for several reasons. We needed temporary persistence with a short TTL (a few minutes). Transient failures are acceptable since responsiveness tracking is purely best-effort. The amount of data we need to store is small: a key (a unique ID for each page view) and a value (the page type and a timestamp). A relational database or NoSQL store would be overkill for this feature, and neither supports temporary storage as easily or as efficiently. Memcache was the much better solution for all these reasons.

Conclusion

What have we learned at Thumbtack? We’re learning to tune our Javascript to better perform in different browsers. We’ve learned that mobile browsers are slower at executing Javascript than we’d realized, and we’re working on solutions for that. Sometimes we find bugs.

Figure: Tracking response times proves that caching is king.

We’ve learned the landscape of page performance across the site, and have identified places with particularly weak performance. For example, the “welcome” page for new service providers is one of the slowest performers, even though it is one of our “simpler” pages.
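
One way to surface slow pages like this is to summarize the collected timings per page type. Below is a minimal sketch using Python's `statistics` module. The function names are hypothetical, and the sample numbers are made up purely for illustration; they are not Thumbtack's measurements.

```python
from collections import defaultdict
from statistics import median, quantiles


def summarize(samples):
    """Map page_type -> (median_ms, p95_ms) from (page_type, ms) pairs."""
    by_page = defaultdict(list)
    for page_type, millis in samples:
        by_page[page_type].append(millis)
    summary = {}
    for page_type, values in by_page.items():
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile
        p95 = quantiles(values, n=20)[-1] if len(values) > 1 else values[0]
        summary[page_type] = (median(values), p95)
    return summary


def slowest_first(summary):
    """Page types ordered by p95 latency, slowest first."""
    return sorted(summary, key=lambda page: summary[page][1], reverse=True)


# Hypothetical timings in milliseconds
samples = [("homepage", 900), ("homepage", 1100),
           ("welcome", 2400), ("welcome", 3100)]
print(slowest_first(summarize(samples)))
```

Ranking by a high percentile rather than the mean keeps a handful of very slow loads, for example on mobile, from being averaged away.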

As Thumbtack’s Javascript codebase grows, tracking JS responsiveness helps us understand when new client-side code disrupts the user experience. For example, we recently updated some of our maps to use the Google Static Maps API rather than the Javascript API; we tested this and discovered a substantial speed improvement.

This approach has some nice perks:

- Integrates server-side and client-side processing times to give a more accurate picture of the full user experience.
- Easily expandable to track a variety of additional metrics, especially on the client side.
- Lightweight, with minimal interference with existing code on both the server and the client.

Want to work on problems like client-side Javascript performance? Thumbtack is hiring product engineers to help build amazing user interfaces.