Clockwork websites live on a cluster of redundant, load-balanced servers. On any given day, this cluster handles hundreds of requests per second, or millions per day. This is boring; it’s what we do all day, every day. I want to talk about a different kind of day.
One of our clients is the Minnesota State Lottery. Every Wednesday and Saturday evening there is a Powerball drawing that drives a few thousand people to visit http://www.mnlottery.com/ to see if they’ll be going to work on Thursday morning. Every so often, however, the Powerball jackpot climbs to several hundred million dollars. After these unusually large drawings, the Minnesota State Lottery’s website will see hundreds of thousands of unique visitors, about 15% of whom visit between the drawing time (10:00 pm CST) and midnight. During this two-hour window, the spike in traffic sends roughly 10 times our average requests per second into our web cluster. Here’s what that looks like:
What we do to prepare
As most things around here do, it all starts with people. In this case, our Support and SysAdmin teams, who continually monitor prize amounts. When Powerball prizes reach the point at which we know extra traffic will result, the team begins addressing several possible choke points to ensure things go smoothly — not only for the Lottery’s visitors, but also for all our other clients who may see similar spikes in traffic. Here’s what a day in the life of a huge Powerball drawing looks like:
Raise the Bandwidth Cap
The network connection to our Internet Service Provider (ISP) has a bandwidth cap that is sufficient on any normal, or even busier-than-normal, day. But, again, this is a different kind of day. Knowing we’ll see an exceptional peak in traffic, we arrange with the ISP to raise our limit for the 24 hours surrounding the drawing.
Shed Idle Sessions
“Sessions” are a firewall resource that keeps track of everyone currently connected to our websites. During high-traffic events, we tweak the session-management options so the firewall gets rid of idle connections more quickly. There is a slight risk that this could produce an error message for people on extremely slow connections, which is why we don’t run in this mode normally, but it lets our firewall manage a much larger number of connections before its CPU maxes out.
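On a Linux-based firewall, this kind of idle-session shedding can be approximated by shortening the kernel’s connection-tracking timeouts. A minimal sketch, assuming a netfilter/conntrack firewall — the values below are illustrative, not our production settings:

```
# /etc/sysctl.d/99-event-tuning.conf -- illustrative values only

# Drop established-but-idle TCP sessions after 5 minutes
# (the kernel default is 5 days):
net.netfilter.nf_conntrack_tcp_timeout_established = 300

# Reclaim half-open and closing connections faster:
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 30
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 30
```

Shorter timeouts mean the connection-tracking table holds fewer stale entries, so the same hardware can track far more active visitors; the trade-off, as noted above, is that very slow clients may be disconnected early.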
Cache Aggressively
Now we get to the final choke point: the ability of our web, file, and database servers to generate web content and deliver it back to the browser. To optimize this process, we use an administrator’s secret weapon: caching. For several years now we’ve been using Varnish, a near-miraculous piece of software that can dramatically improve web performance.
Every request to our web servers passes through Varnish first. Varnish checks whether it has a saved (or “cached”) copy of that page and, if so, immediately sends the saved copy back to the browser without bothering our web servers. When a large Powerball drawing is coming up, we set up event-specific Varnish rules that tell it to save everything it possibly can, including pieces we normally treat as uncacheable.
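Rules like these are written in Varnish’s configuration language, VCL. Here is a minimal sketch of what an “event mode” might look like, assuming Varnish 4 or later; the hostname and TTL are illustrative, not our actual rules:

```vcl
vcl 4.0;

sub vcl_recv {
    # Event mode (illustrative): strip cookies from lottery pages so
    # that pages Varnish would normally treat as personalized -- and
    # therefore uncacheable -- become cacheable.
    if (req.http.host == "www.mnlottery.com") {
        unset req.http.Cookie;
    }
}

sub vcl_backend_response {
    if (bereq.http.host == "www.mnlottery.com") {
        unset beresp.http.Set-Cookie;
        # Even a short TTL collapses thousands of identical requests
        # during the spike into a single backend fetch.
        set beresp.ttl = 30s;
    }
}
```

The key idea is that during the event, freshness matters less than survival: a page that is up to 30 seconds stale is a fine trade for keeping the web servers responsive.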
Here is the previous graph, showing the unique visits hitting the Varnish server, stacked over a graph (from the exact same time frame) showing the traffic that actually reached our web servers.
If all of the requests hitting Varnish had made it through to our servers, the servers would quickly have run out of memory and CPU and slowed down dramatically. In addition to event-specific Varnish rules, our software performs similar caching internally, providing even greater speed improvements.
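The idea behind that internal, application-level caching can be sketched in a few lines of Python. This is a generic illustration of the technique, not Clockwork’s actual software; the function names are hypothetical:

```python
import time
from functools import wraps

def ttl_cache(seconds):
    """Cache a function's results for a limited time, so repeated
    identical requests skip the expensive work (e.g. a DB query)."""
    def decorator(fn):
        store = {}  # key -> (expires_at, value)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]          # cache hit: skip the work
            value = fn(*args)          # cache miss: do the work once
            store[args] = (now + seconds, value)
            return value
        return wrapper
    return decorator

calls = 0

@ttl_cache(seconds=10)
def render_winning_numbers(draw_id):
    global calls
    calls += 1                         # stands in for a slow DB query
    return f"numbers for draw {draw_id}"

render_winning_numbers(42)
render_winning_numbers(42)             # served from the cache
print(calls)                           # the expensive work ran only once
```

Just as with Varnish, the win comes from answering many identical requests with one computation; the cost is that a result can be up to `seconds` stale.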
What this means for you
Ideally, this is all invisible to our clients. Our goal is for these spikes in traffic to go entirely undetected by you and your site’s visitors. Faster page loads mean less CPU and memory usage on the web servers and, more importantly, increased capacity to handle all traffic, from regular usage to surges.
We have yet to discover the upper limit of traffic we can handle with these measures in place. We’re ready for the challenge.