Courant News: Caching


11.07.09 Posted in Courant News by Max

The key to performance for high-traffic websites is caching. Facebook is famous for being a prolific memcached user, with 28 terabytes of memory in its memcached servers as of December 2008. Part of why the Yale Daily News was able to survive massive traffic spikes during the Annie Le coverage was our judicious use of caching. Read on to learn more about the caching strategies employed by the Courant News platform.

Full-page caching

The simplest caching strategy is to cache the output of entire pages and serve them directly to users before their requests ever reach your application code.

The rec­om­mended deploy­ment strat­egy for Courant News employs an nginx proxy server in front of Apache run­ning mod_wsgi. For­tu­nately, nginx has a mod­ule for mem­cached inte­gra­tion which can be used to return objects directly from mem­cached with­out hit­ting Apache (and all the rel­a­tively heavy Python code) at all.

Our nginx con­fig file might con­tain the following:

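location /
{
  # Form submissions must always reach the application.
  if ($request_method = POST) {
    proxy_pass http://backend;
    break;
  }

  # Logged-in users (those with a session cookie) bypass the cache.
  if ($http_cookie ~* "sessionid=.{32}") {
    proxy_pass http://backend;
    break;
  }

  # Everyone else: look the page up in memcached under "/ydn-<uri>".
  add_header Memcached True;
  default_type "text/html; charset=utf-8";
  set $memcached_key "/ydn-$uri";
  memcached_pass localhost:11211;
  # On a cache miss (404) or a memcached failure (502), fall back to Apache.
  error_page 404 502 = /django;
}

location /django {
  proxy_pass http://backend;
  include /etc/nginx/proxy.conf;
  break;
}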

This checks memcached for a full-page cache entry whenever the request is not a POST (a form submission) and the user is not logged in (that is, the request carries no session cookie). If nginx can’t find the entry in the cache, it simply passes the request along to Apache.
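The nginx side only reads from memcached; something on the application side has to write each rendered page under the same key. The post does not show that code, so the following is only a minimal sketch of the idea in Python. The `/ydn-` prefix and the `sessionid` pattern come from the config above; `DictCache`, `page_cache_key`, `is_cacheable`, `store_page`, and the 300-second timeout are hypothetical names and values chosen for illustration.

```python
import re

# Must match `set $memcached_key "/ydn-$uri";` in the nginx config.
KEY_PREFIX = "/ydn-"
# The same test nginx applies to decide that a visitor is logged in.
SESSION_COOKIE = re.compile(r"sessionid=.{32}", re.IGNORECASE)


class DictCache:
    """Hypothetical stand-in for a memcached client; only set() is needed."""

    def __init__(self):
        self.data = {}

    def set(self, key, value, timeout):
        self.data[key] = (value, timeout)


def page_cache_key(path):
    """Build the key nginx will look up for a request path (no query string)."""
    return KEY_PREFIX + path


def is_cacheable(method, cookie_header, status_code):
    """Cache only successful GET responses rendered for anonymous visitors."""
    if method != "GET" or status_code != 200:
        return False
    return SESSION_COOKIE.search(cookie_header or "") is None


def store_page(cache, method, path, cookie_header, status_code, body, timeout=300):
    """Store the rendered page if it is safe to share; return True if stored."""
    if not is_cacheable(method, cookie_header, status_code):
        return False
    cache.set(page_cache_key(path), body, timeout)
    return True


cache = DictCache()
# An anonymous page view is stored under "/ydn-/news/2009/".
stored = store_page(cache, "GET", "/news/2009/", "", 200, "<html>...</html>")
# A logged-in view of the same page is never written to the shared cache.
skipped = store_page(cache, "GET", "/news/2009/", "sessionid=" + "a" * 32,
                     200, "<html>staff view</html>")
```

In a Django project this logic would most likely live in a response middleware. Because nginx hands the memcached value to the browser as-is, the body would need to be stored as the raw HTML rather than as a serialized Python object.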

When the YDN got Drudge’d, all the users were anony­mous users, and nginx could serve them the cached pages in its sleep. The site remained respon­sive for the logged-in staff mem­bers who had to update con­tent and mod­er­ate comments.

Full-page caching is great, but most sites customize their pages for visitors with accounts, so logged-in users need a different strategy.

Tem­plate Caching

If nginx doesn’t serve up a full-page cache, the request gets passed on to the actual code pow­er­ing the site. The final HTML is gen­er­ated by pro­cess­ing tem­plate files and then returned to the visitor’s browser.

Because there is often a fair amount of logic and database querying involved in processing the templates, the server load would be high if every request had to regenerate the final HTML from scratch. So we can cache the fragments of our templates that we know involve a lot of work and avoid that overhead.

Django has a tem­plate tag to han­dle this, and it does a pretty good job. The tag is given an expi­ra­tion time, a name, and a set of extra vari­ables that the cache should vary on. Courant News’ caching sys­tem is based on Django’s, but includes two extra pieces of func­tion­al­ity: anti-dogpiling and auto­matic cache invalidation.

Anti-Dogpiling

Caching can help per­for­mance greatly, but the cache itself has to be cre­ated before it can be used. Gen­er­ally, one wants a cache to be cre­ated the first time it is viewed.

How­ever, page views are not instan­ta­neous; it may take sev­eral tens or hun­dreds of mil­lisec­onds to process a page and cre­ate a cached ver­sion of it. Dur­ing this time, many more vis­i­tors will hit the same page, not find a cached entry, and want to cre­ate the cache them­selves. All of a sud­den, you have dozens of peo­ple gen­er­at­ing the same cache, until even­tu­ally one of them fin­ishes and sub­se­quent vis­i­tors can use the cached version.

This can cause undue bur­den on the server and lead to per­for­mance trou­bles if a cache expires in the mid­dle of a large traf­fic spike. Often we are sim­ply inval­i­dat­ing an exist­ing cache instead of cre­at­ing a new one from scratch. In that case, we can do soft-expiration of the cache.

Nor­mally when a cache is inval­i­dated, it is sim­ply deleted alto­gether, and future requests for it will cause it to be regen­er­ated. Instead, we can have only the first per­son after a cache expi­ra­tion gen­er­ate a new ver­sion, while giv­ing sub­se­quent vis­i­tors the old cache until the new one is done being generated.

This is what we have imple­mented in Courant News, orig­i­nally based on this snip­pet. It essen­tially uses an addi­tional cache key to cre­ate a sec­ond, ear­lier, soft expi­ra­tion time, and uses some logic to deter­mine whether a given user should gen­er­ate the new cache or be served the stale copy.
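The linked snippet is not reproduced here, so the following is only a sketch of soft expiration under stated assumptions, not the Courant News implementation. It keeps the data under one key with a longer hard expiry and uses a second key, `key + ":fresh"`, as the earlier soft expiry. `MemoryCache`, `get_or_regenerate`, `grace`, and `regen_window` are hypothetical names; the in-memory cache stands in for memcached, whose `add` operation stores a value only if the key does not already exist.

```python
import time

_MISSING = object()


class MemoryCache:
    """Tiny in-process stand-in for a memcached client (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, absolute expiry time)

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, float("-inf")))
        return value if expires_at > self._clock() else default

    def set(self, key, value, timeout):
        self._data[key] = (value, self._clock() + timeout)

    def add(self, key, value, timeout):
        """Store only if the key is absent or expired; return True if stored."""
        if self.get(key, _MISSING) is not _MISSING:
            return False
        self.set(key, value, timeout)
        return True


def get_or_regenerate(cache, key, generate, timeout, grace=60, regen_window=30):
    """Return a cached value, letting one caller rebuild it after soft expiry.

    The data lives under `key` for `timeout + grace` seconds (hard expiry).
    A second key, `key + ":fresh"`, lives for only `timeout` seconds; once it
    disappears the data counts as stale but can still be served.
    """
    fresh_key = key + ":fresh"
    value = cache.get(key, _MISSING)
    if value is not _MISSING:
        if cache.get(fresh_key) is not None:
            return value  # still fresh, or someone is already rebuilding it
        # Soft-expired: only the caller whose add() succeeds regenerates;
        # everyone else keeps getting the stale copy for `regen_window` seconds.
        if not cache.add(fresh_key, 1, regen_window):
            return value
    new_value = generate()
    cache.set(key, new_value, timeout + grace)
    cache.set(fresh_key, 1, timeout)
    return new_value
```

A caller would wrap the expensive rendering step, for example `get_or_regenerate(cache, "front_page", render_front_page, timeout=3600)`, where `render_front_page` is whatever function builds the fragment. Note that when no stale copy exists at all (a cold cache), every concurrent caller still regenerates; soft expiration only helps when an older version is available to serve.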

Cache Inval­i­da­tion

Most sites don’t stay the same for­ever, so at some point it is nec­es­sary to clear the caches so that new con­tent or mod­i­fi­ca­tions can appear on the site.

There’s a famous quote by Phil Karl­ton that “[t]here are only two hard things in Com­puter Sci­ence: cache inval­i­da­tion and nam­ing things.”

Because there may be links to a given arti­cle on dozens of pages on your site, there is no easy way to know which of those pages need to be regen­er­ated when that article’s head­line changes or a new arti­cle should replace it.

One solu­tion is to sim­ply set a low cache dura­tion, so that the cache expires in a short amount of time (e.g., 30 min­utes or an hour). How­ever, what if you have break­ing news that needs to appear imme­di­ately? Or what if you have an embar­rass­ing typo in an arti­cle that you need to fix ASAP?

A really sim­ple answer is to have the abil­ity to flush the entire site cache. This indis­crim­i­nately deletes all entries in the cache, which can cause a sud­den spike in server usage as all your vis­i­tors cause the caches to be regenerated.

This is what we’ve been using at the YDN for the past few years. It is workable, but overkill.

Auto­mated Tem­plate Frag­ment Cache Invalidation

Ideally, the system would magically know which pages contain which content objects and, whenever one of those objects changed, invalidate all of the affected caches. This is an extraordinarily difficult problem to solve for all cases, but we have attempted to solve the most common ones and leave full cache clearing as a fallback for the unusual ones.

Our imple­men­ta­tion hinges on the con­cept of track­ing the “depen­den­cies” of each tem­plate frag­ment. Because it is impos­si­ble to algo­rith­mi­cally inspect a template’s out­put and deter­mine which con­tent objects were used to gen­er­ate it, we instead rely on the tem­plate author to pro­vide some explicit help.

If a tem­plate cache block depends on only a sin­gle object, tem­plate authors can use an extra ‘for’ clause on the end of the cache tag:

{% cache 3600 article_page article.id for article %}
...
{% endcache %}

This tells the caching sys­tem that, if the object called ‘arti­cle’ changes, it should inval­i­date this cache block.

How­ever, often a block will con­tain many objects, such as a sec­tion list­ing or most pop­u­lar box. In those cases, addi­tional ‘cache_dep’ tem­plate tags can be used inside the ‘cache’ tag to reg­is­ter these addi­tional objects.

{% cache 3600 section_page section.id for section %}
  <ul>
    {% for article in section.articles.all %}
      {% cache_dep article %}
      <li>{{ article.heading }}</li>
    {% endfor %}
  </ul>
{% endcache %}

In this case, if ‘section’ or any of its articles changes, this template fragment cache will be invalidated and regenerated.

While this adds some extra clutter to templates, we feel it is a reasonable trade-off when performance matters to a site.

In either case, the caching system listens for post_save signals from objects and looks each saved object up in a special database table. If any template fragments depend on that object, we invalidate them, along with the full-page cache for each fragment’s page.
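The dependency lookup described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the Courant News code: a dict replaces the database table, a plain dict replaces memcached, and `DependencyRegistry`, `Article`, and the cache keys `article_page.1` and `section_page.7` are hypothetical. The `handle_post_save` method is shaped like a Django `post_save` receiver (`sender`, `instance`, `**kwargs`) but is called by hand here rather than connected to a signal.

```python
from collections import defaultdict


class DependencyRegistry:
    """Maps content objects to the cache keys that were rendered from them.

    The post describes a database table for this mapping; a dict is used
    here only to keep the sketch self-contained.
    """

    def __init__(self, cache):
        self.cache = cache  # any dict-like store of rendered fragments
        self._keys_by_object = defaultdict(set)

    @staticmethod
    def _identity(obj):
        # Identify an object by its type name and primary key.
        return (type(obj).__name__, obj.pk)

    def register(self, cache_key, obj):
        """Record that `cache_key` must be dropped when `obj` changes."""
        self._keys_by_object[self._identity(obj)].add(cache_key)

    def invalidate_for(self, obj):
        """Delete every cached fragment that depends on `obj`."""
        keys = self._keys_by_object.pop(self._identity(obj), set())
        for key in keys:
            self.cache.pop(key, None)
        return sorted(keys)

    def handle_post_save(self, sender, instance, **kwargs):
        """Receiver-style entry point: invalidate on every save."""
        self.invalidate_for(instance)


class Article:
    """Hypothetical model stand-in; only `pk` matters to the registry."""

    def __init__(self, pk):
        self.pk = pk


fragments = {"article_page.1": "<h1>...</h1>", "section_page.7": "<ul>...</ul>"}
registry = DependencyRegistry(fragments)
article = Article(pk=1)
registry.register("article_page.1", article)  # like `for article` on the tag
registry.register("section_page.7", article)  # like `{% cache_dep article %}`
registry.handle_post_save(sender=Article, instance=article)  # article saved
```

After the simulated save, both fragments that depended on the article are gone from `fragments` and will be rebuilt on the next request. Dependencies are re-registered whenever a fragment is regenerated, which is why the sketch can simply discard them on invalidation.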

We have also included an admin action that allows editors to manually tell an object to expire its caches.

Con­clu­sion

If there is anything I’ve learned from my 2+ years at the YDN, it’s that caching is one of the most important aspects of designing a high-traffic website. We’ve taken our years of experience optimizing caching for a news site and implemented a system for Courant News that will allow other sites using the platform to take advantage of that experience and ensure their sites can survive large traffic spikes with aplomb.

In the future I’ll write more about our rec­om­mended deploy­ment envi­ron­ment for Courant News and how that can influ­ence site per­for­mance. If you are curi­ous, you can find the Courant News caching code on our code site.


