The key to performance for high-traffic websites is caching. Facebook is famous for being a prolific memcached user, with 28 terabytes of memcached servers as of December 2008. Part of why the Yale Daily News was able to survive massive traffic spikes during the Annie Le coverage was our judicious use of caching. Read on to learn more about the caching strategies employed by the Courant News platform.
Full-Page Caching
The simplest caching strategy is to cache the output of entire pages and serve them directly to users before they hit your real code.
The recommended deployment strategy for Courant News employs an nginx proxy server in front of Apache running mod_wsgi. Fortunately, nginx has a module for memcached integration which can be used to return objects directly from memcached without hitting Apache (and all the relatively heavy Python code) at all.
Our nginx config file might contain the following:
location / {
    if ($request_method = POST) {
        proxy_pass http://backend;
        break;
    }
    if ($http_cookie ~* "sessionid=.{32}") {
        proxy_pass http://backend;
        break;
    }
    add_header Memcached True;
    default_type "text/html; charset=utf-8";
    set $memcached_key "/ydn-$uri";
    memcached_pass localhost:11211;
    error_page 404 502 = /django;
}

location /django {
    proxy_pass http://backend;
    include /etc/nginx/proxy.conf;
    break;
}
This checks memcached for a full-page cache entry when the request is not a POST (form submission) and the user is not logged in (no session cookie is present). If the entry is not in the cache, nginx simply passes the request along to Apache.
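The nginx memcached module only reads from memcached, so the application has to write the rendered pages there itself, under the same key that nginx looks up. Here is a minimal sketch of that write side. The names maybe_cache_page, page_cache_key, and FakeMemcached are hypothetical and not taken from the Courant News code, and the 600-second timeout is an arbitrary assumption.

```python
# Must match `set $memcached_key "/ydn-$uri"` in the nginx config above.
PAGE_KEY_PREFIX = "/ydn-"


class FakeMemcached:
    """In-memory stand-in for a memcached client (assumption for this sketch).

    Timeouts are recorded but not enforced.
    """

    def __init__(self):
        self.store = {}

    def set(self, key, value, timeout):
        self.store[key] = (value, timeout)


def page_cache_key(path):
    # nginx's $uri is the request path without the query string,
    # so the key is built from the path alone.
    return PAGE_KEY_PREFIX + path


def maybe_cache_page(cache, method, path, has_session, status, html, timeout=600):
    """Store a rendered page only if nginx would be allowed to serve it.

    Mirrors the nginx rules: skip logged-in users (session cookie),
    and here also skip anything that is not a successful GET.
    Returns True if the page was stored.
    """
    if method != "GET" or has_session or status != 200:
        return False
    cache.set(page_cache_key(path), html, timeout)
    return True
```

In a real deployment this check would probably live in a middleware that runs after the response has been rendered, with a real memcached client in place of the stand-in.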
When the YDN got Drudge’d, all the users were anonymous users, and nginx could serve them the cached pages in its sleep. The site remained responsive for the logged-in staff members who had to update content and moderate comments.
Full-page caching is great, but most sites customize their pages for visitors with site accounts, and those visitors cannot be served a shared full-page cache.
Template Caching
If nginx doesn’t serve up a full-page cache, the request gets passed on to the actual code powering the site. The final HTML is generated by processing template files and then returned to the visitor’s browser.
Because processing the templates often involves a fair amount of logic and database querying, server load would be high if the final HTML had to be regenerated for every visitor. So we can cache the fragments of our templates that we know involve a lot of work and avoid that overhead.
Django has a template tag to handle this, and it does a pretty good job. The tag is given an expiration time, a name, and a set of extra variables that the cache should vary on. Courant News’ caching system is based on Django’s, but includes two extra pieces of functionality: anti-dogpiling and automatic cache invalidation.
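To make "vary on" concrete: each combination of fragment name and vary-on values needs its own cache entry, which comes down to building a distinct key for each combination. The helper below is an illustrative sketch of that idea only. The name fragment_cache_key and the key layout are assumptions, not taken from the Courant News source.

```python
import hashlib


def fragment_cache_key(name, vary_on=()):
    """Build one cache key per (fragment name, vary-on values) combination.

    The vary-on values are hashed so that arbitrary values (ids, usernames,
    language codes) always yield a short key that is safe for memcached.
    """
    digest = hashlib.md5()
    for value in vary_on:
        digest.update(str(value).encode("utf-8"))
        digest.update(b":")  # separator so ("1", "23") differs from ("12", "3")
    return "template.cache.%s.%s" % (name, digest.hexdigest())
```

With this scheme, a fragment named article_page that varies on article.id gets a separate cache entry for every article, while sharing one expiration time.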
Anti-Dogpiling
Caching can help performance greatly, but the cache itself has to be created before it can be used. Generally, one wants a cache to be created the first time it is viewed.
However, page views are not instantaneous; it may take several tens or hundreds of milliseconds to process a page and create a cached version of it. During this time, many more visitors will hit the same page, not find a cached entry, and want to create the cache themselves. All of a sudden, you have dozens of people generating the same cache, until eventually one of them finishes and subsequent visitors can use the cached version.
This can cause undue burden on the server and lead to performance troubles if a cache expires in the middle of a large traffic spike. Often we are simply invalidating an existing cache instead of creating a new one from scratch. In that case, we can do soft-expiration of the cache.
Normally when a cache is invalidated, it is simply deleted altogether, and future requests for it will cause it to be regenerated. Instead, we can have only the first person after a cache expiration generate a new version, while giving subsequent visitors the old cache until the new one is done being generated.
This is what we have implemented in Courant News, originally based on this snippet. It essentially uses an additional cache key to create a second, earlier, soft expiration time, and uses some logic to determine whether a given user should generate the new cache or be served the stale copy.
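The idea can be sketched in a few lines. This is not the Courant News implementation or the snippet it was based on, just a minimal illustration under stated assumptions: InMemoryCache is a hypothetical stand-in for a memcached client, soft_set and soft_get are made-up names, and the 60-second grace period is arbitrary.

```python
import time


class InMemoryCache:
    """Tiny stand-in for a memcached client with get/set/add and timeouts."""

    def __init__(self, clock=time.time):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key, value, timeout):
        self._data[key] = (value, self._clock() + timeout)

    def add(self, key, value, timeout):
        """Store only if the key is absent; return True if it was stored."""
        if self.get(key) is not None:
            return False
        self.set(key, value, timeout)
        return True


GRACE = 60  # seconds a stale copy stays servable after its soft expiration


def soft_set(cache, key, value, timeout):
    # The value outlives its "fresh" marker by GRACE seconds.
    cache.set(key, value, timeout + GRACE)
    cache.set(key + ":fresh", 1, timeout)


def soft_get(cache, key):
    """Return (value, must_regenerate)."""
    value = cache.get(key)
    if value is None:
        return None, True  # hard miss: nothing stale to serve
    if cache.get(key + ":fresh") is not None:
        return value, False  # still fresh
    # Soft-expired: only the caller that wins add() regenerates;
    # everyone else keeps receiving the stale copy in the meantime.
    if cache.add(key + ":fresh", 1, GRACE):
        return value, True
    return value, False
```

A caller that gets must_regenerate back as True rebuilds the content and calls soft_set again. With a real memcached server, add is atomic, which is what lets a single request win the right to regenerate.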
Cache Invalidation
Most sites don’t stay the same forever, so at some point it is necessary to clear the caches so that new content or modifications can appear on the site.
There’s a famous quote by Phil Karlton that “[t]here are only two hard things in Computer Science: cache invalidation and naming things.”
Because there may be links to a given article on dozens of pages on your site, there is no easy way to know which of those pages need to be regenerated when that article’s headline changes or a new article should replace it.
One solution is to simply set a low cache duration, so that the cache expires in a short amount of time (e.g., 30 minutes or an hour). However, what if you have breaking news that needs to appear immediately? Or what if you have an embarrassing typo in an article that you need to fix ASAP?
A really simple answer is to have the ability to flush the entire site cache. This indiscriminately deletes all entries in the cache, which can cause a sudden spike in server usage as all your visitors cause the caches to be regenerated.
This is what we’ve been using at the YDN for the past few years. It is workable, but it is overkill when only a handful of pages actually changed.
Automated Template Fragment Cache Invalidation
Ideally, the system would magically know which pages contain which content objects and, whenever one of those objects changed, invalidate all of the affected caches. This is an extraordinarily difficult problem to solve for all cases, but we have attempted to solve the most common cases, and leave full cache clearing as a fallback for the unusual cases.
Our implementation hinges on the concept of tracking the “dependencies” of each template fragment. Because it is impossible to algorithmically inspect a template’s output and determine which content objects were used to generate it, we instead rely on the template author to provide some explicit help.
If a template cache block depends on only a single object, template authors can use an extra ‘for’ clause on the end of the cache tag:
{% cache 3600 article_page article.id for article %} ... {% endcache %}
This tells the caching system that, if the object called ‘article’ changes, it should invalidate this cache block.
However, often a block will contain many objects, such as a section listing or most popular box. In those cases, additional ‘cache_dep’ template tags can be used inside the ‘cache’ tag to register these additional objects.
{% cache 3600 section_age section.id for section %}
<ul>
{% for article in section.articles.all %}
    {% cache_dep article %}
    <li>{{ article.heading }}</li>
{% endfor %}
</ul>
{% endcache %}
In this case, if ‘section’ or any of its articles is changed, this template fragment cache will be invalidated and regenerated.
While this adds some extra clutter to templates, we feel it is a reasonable trade-off if performance is of importance to a site.
In any case, the caching system will listen for post_save signals from objects and then look them up in a special database table. If we find any template fragments depending on this object, we invalidate them, and we invalidate the full-page cache for that fragment’s page.
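The lookup-and-invalidate step can be sketched as follows. This is a simplified illustration, not the Courant News code: DependencyRegistry is a hypothetical in-memory stand-in for the database table, a plain dict stands in for the cache, and the key names in the usage note are made up.

```python
from collections import defaultdict


class DependencyRegistry:
    """In-memory stand-in for the table mapping objects to dependent fragments."""

    def __init__(self):
        self._fragments_by_object = defaultdict(set)

    def register(self, model_name, object_id, fragment_key, page_key):
        # Called while rendering, once per 'for' clause or cache_dep tag.
        self._fragments_by_object[(model_name, object_id)].add(
            (fragment_key, page_key)
        )

    def dependents(self, model_name, object_id):
        return set(self._fragments_by_object.get((model_name, object_id), ()))


def invalidate_for(registry, cache, model_name, object_id):
    """Delete every fragment that depends on the saved object, plus the
    full-page cache entry of the page containing that fragment.

    Returns the sorted list of cache keys that were actually removed.
    """
    removed = []
    for fragment_key, page_key in registry.dependents(model_name, object_id):
        for key in (fragment_key, page_key):
            if cache.pop(key, None) is not None:
                removed.append(key)
    return sorted(removed)
```

For example, after registering a section fragment as dependent on both ("section", 1) and ("article", 7), calling invalidate_for(registry, cache, "article", 7) removes that fragment and its page entry while leaving unrelated keys alone. In a Django project, a function like this would presumably be called from a receiver connected to the post_save signal.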
We have also included an admin action that allows editors to manually tell an object to expire its caches.
Conclusion
If there is anything I’ve learned from my time at the YDN over the past 2+ years, it’s that caching is one of the most important aspects of designing a high-traffic website. We’ve taken our years of experience optimizing caching for a news site and implemented a system for Courant News that will allow other sites using the platform to take advantage of that experience and survive large traffic spikes with aplomb.
In the future I’ll write more about our recommended deployment environment for Courant News and how that can influence site performance. If you are curious, you can find the Courant News caching code on our code site.