Limiting The Digg Effect

Published 2006-05-27. Tagged: digg networking webdev

For those who don’t already know, Digg is a community-driven tech-news site. Its users post links to articles, and if enough people find an article helpful or amusing, they "digg" it. Once the article has enough diggs, it is moved from the mildly hidden back end of the site to the front page, for anyone going to Digg.com to see.

What this means is that all of the users of Digg (in the millions) see a link to the article. Even people who are not members of Digg see it (you don’t have to be a member to do anything except vote articles up or down). In turn, the site that the link points to gets a massive burst of traffic. Sometimes the increase is enough to knock a server offline completely; sometimes it is enough to knock out the services that feed the server (e.g. databases). This is known as the "Digg effect". If the wave of users comes from Slashdot, it’s the "Slashdot effect". Have a read of the Wikipedia entry on the /. effect.

If you have a popular site or you’re considering writing an article which you think might get popular on Digg, Slashdot or any of the other community-type sites, you’re going to want to take certain precautions before you deploy your site/article.

  1. Being front-paged means more users
  2. Your server has to do the same "work" for each of those users
  3. Each user is going to be pulling roughly the same bandwidth from your server

The factors that take a site off line are all there. If the site has too many users at once, the web server may shut down automatically or refuse new concurrent requests over a certain number. If there are too many requests to process (e.g. building the page each time), there will either be a lot of lag getting the page out, or the server will crumple and people won’t see the page at all. If your hosting limits your bandwidth allocation and your article pushes you over that limit, your host will either take the site down as soon as it hits the limit or (arguably worse in some circumstances) bill you an extortionate rate for the extra bandwidth.

You cannot limit how many users visit your site, but you can limit how much work the server has to do for each request and how much bandwidth each page view takes up. To do this successfully, you need to know your enemy.

Step 1 — Get a decent host.

Web hosts are a dime a dozen on the internet these days. Many will promise you unlimited this or unlimited that, but at the end of the day they still need to pay the bandwidth bill, and if your site is drawing more than they account for, they will shut it down. The same goes for how much stress your site is putting on the server. If you’re on one server with 30,000 other sites and your site starts using 30+% of the CPU time, they’re going to notice and shut you down.

It’s very important to have a host who understands your needs and isn’t going to cram you onto a server with a billion other sites. It’s also important that you’re not running off a 486 DX2. When you go to sign up for hosting, take a minute to ask them the following questions:

  • What are the specs of each server?
  • How many people will you be sharing it with?
  • What happens if you suddenly get popular? Will they shut you down?
  • What happens if you go over on bandwidth? Will they shut you off or charge you? If the latter, how much?

This site is hosted with Hostek. They’re an absolutely fabulous Windows host and I will not be changing from them in a hurry. Top servers, and fantastic support.

Step 2 — Reducing page load.

Every person that hits your page is going to be requesting the same stuff. If that means you have a list of whatever being pulled from a database, this is going to happen for each of the people visiting your site. If you get front-paged, it’s doing this stuff thousands of times per second and if the server can’t keep up, BANG, you’re off the net. Your host can also see your site using too much of the server’s resources and shut you down… Same effect.

It’s therefore important to make sure your pages are optimised. I’m not talking about scaling gifs down or removing whitespace from CSS files. Remove things which don’t have a justified place for your visitors. If they’re coming to see one article, they don’t necessarily need to see an AJAX stream of photos from Flickr or an auto-refreshing chatbox.

If you haven’t yet created your site, here’s something else to consider: The language your site is created in matters when you get popular. Some server-side languages use more resources than others and can effectively mean the difference between your site staying up or going tits-up on you.

PHP and Ruby are both incredibly popular languages because they are so easy to develop in. What they gain in ease, they lose in performance. Both tend to use far more CPU time and RAM for each page request than the compiled alternatives. The better performers are compiled languages (or ones that at least compile on the fly, so subsequent requests are served from the compiled version) such as ASP.NET, Java/JSP… You know… The unpopular ones. The hosting does cost more, but it’s a better experience.

Step 3 — Cache server-intensive tasks.

Leading on from step 2, if your page does things which are essential (or not, but you want to keep them anyway) and which put a fair amount of stress on the server or on the services your page uses (database servers, Flickr, etc.), you will definitely want to look into caching some, if not all, of the output to a static file.

You’ve got 2 options for this sort of development.

  1. Dynamically cache page elements and pull them in from the cache, so some parts can still stay dynamic, or:
  2. Cache the whole output as one static file.

If your page doesn’t do anything different on each page load, seriously consider doing the second one. If you need to do tasks on each load, consider pushing the data into memory on the HTTP server so all the requests are served from that, rather than pulling it from the database on every request.
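As a rough illustration of the second option, here is a minimal Python sketch of whole-page caching. It is not how ThePCSpy does it; `render_page`, `cached_page`, the cache path and the maximum age are all names and values made up for the example. The idea is simply that the expensive work runs once, the result is written to a static file, and every later request within the time window reads that file instead.

```python
import os
import tempfile
import time


def render_page():
    """Hypothetical stand-in for the expensive work (database queries, templating)."""
    return "<html><body>Generated at %f</body></html>" % time.time()


def cached_page(cache_path, max_age_seconds, render=render_page):
    """Return the page body, re-rendering only when the cached file is missing or stale."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path, encoding="utf-8") as f:
                return f.read()
    except OSError:
        pass  # no cache file yet (or it is unreadable): fall through and rebuild

    html = render()
    # Write to a temporary file first, then swap it into place, so a
    # concurrent reader never sees a half-written page.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(html)
    os.replace(tmp_path, cache_path)
    return html
```

Calling `cached_page("article.html", 300)` would rebuild the page at most once every five minutes, however many visitors arrive in between. If the web server can serve the cached file directly, the application does not even need to run for most requests.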

Here is how I do it on ThePCSpy.com. The page is still dynamic, but the page chooses where it gets its data from as it needs to.

Green represents fully static elements. They do not touch a database ever.

Blue elements are ones that are loaded from the database into the web server’s application memory space when the site first loads. This makes them quick to serve over and over and over again.

Red elements are pulled from the database the first time they are requested after a server restart. Once they have been loaded that first time, they’re put in the application’s cache (a file on the same server) so later requests can be served from there. The processing that goes into building these red elements is also skipped, because it is the finished output that gets cached; that cuts down on the time and work needed to get the data to each user.
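The lazy, load-once behaviour of the red elements could be sketched in Python roughly like this. Treat it as an assumption-laden illustration rather than the site's actual code: `FragmentCache`, the `.html` file naming and the `build` callback are invented for the example, keys are assumed to be safe to use as file names, and unlike the description above this version does not clear its files on a restart.

```python
import os


class FragmentCache:
    """Look in memory first, then in a file on disk, and only then do the real work."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.memory = {}
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, key, build):
        # 1. Already in this process's memory: the cheapest case.
        if key in self.memory:
            return self.memory[key]

        # 2. Built earlier and saved to disk: no database trip needed.
        path = os.path.join(self.cache_dir, key + ".html")
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                html = f.read()
        else:
            # 3. First request: do the expensive work once and save the output.
            html = build()
            with open(path, "w", encoding="utf-8") as f:
                f.write(html)

        self.memory[key] = html
        return html
```

Here `build` is whatever function queries the database and renders the element. It only runs on the very first request for a given key; every request after that gets the stored output.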

If you have a site which pulls everything from a database, like WordPress, consider getting a plugin which will do the caching for you. The process can be seamless if you know what you’re doing and will save your server a lot of trips to the database.

Step 4 — Reducing bandwidth damage.

If your server can keep up with all the requests and everything is "stable", you’ve just got one final hurdle to overcome, and that’s the bandwidth all these users are going to be using. When my article for KittenAuth was on Digg, I was bleeding out around 20 GB a day for a week.
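To get a feel for how a figure like that adds up, a back-of-the-envelope calculation is enough. The visitor count and page weights below are hypothetical numbers chosen for illustration, not measurements from this site.

```python
def daily_transfer_gb(page_views_per_day, kilobytes_per_view):
    """Estimate daily outbound transfer in gigabytes (1 GB = 1024 * 1024 KB)."""
    return page_views_per_day * kilobytes_per_view / (1024 * 1024)


# Hypothetical figures: 100,000 page views a day, at 200 KB per view with
# images served from the main host, or 40 KB once the images live elsewhere.
heavy = daily_transfer_gb(100_000, 200)
light = daily_transfer_gb(100_000, 40)

print(f"{heavy:.1f} GB/day with images served locally")  # prints 19.1 GB/day ...
print(f"{light:.1f} GB/day with images moved elsewhere")  # prints 3.8 GB/day ...
```

Under those assumptions, trimming what each page view costs does far more for the bill than anything you could do about the number of visitors.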

The first thing you should do if you’re caught off guard and your logs are going crazy is tell your host. If they don’t know what’s going on they will cut you off faster than you can say "woo lots of traffic to click my adverts". They might even be able to suggest what’s using the most bandwidth on your pages. If not…

Move all your images offsite to a high-bandwidth host. All the images for ThePCSpy are hosted with DreamHost (who I rate extremely well). Amongst other things, you’ve got 1 terabyte of data to burn each month.

Again, if you’re taken by surprise, just upload everything to a free image host and link out there. They will start limiting you after a certain bandwidth use, but you can always just switch provider. If you have friends with hosting, perhaps they could help you out too.

By shipping the images off to other servers you’re also cutting down on the number of requests the main server has to deal with, meaning it’ll stay up a lot longer.
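If your pages are generated from templates, pointing the images at another host can be a one-off change. If not, a quick rewrite of the HTML on the way out does the job. The following Python is only a sketch: `images.example.com` is a placeholder host, `offload_images` is a made-up helper, and a regular expression like this only copes with simple, double-quoted `src` attributes.

```python
import re

IMAGE_HOST = "http://images.example.com"  # hypothetical high-bandwidth host

# Matches <img ... src="/path"> where the path is site-relative
# (starts with one slash, not "//" and not a full URL).
_IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")(/(?!/)[^"]*)(")', re.IGNORECASE)


def offload_images(html, image_host=IMAGE_HOST):
    """Point site-relative <img> sources at a separate image host."""
    return _IMG_SRC.sub(
        lambda m: m.group(1) + image_host + m.group(2) + m.group(3), html
    )


page = '<p>Look:</p><img src="/images/cat.jpg" alt="cat">'
print(offload_images(page))
# prints: <p>Look:</p><img src="http://images.example.com/images/cat.jpg" alt="cat">
```

Images that already use a full URL are left alone, so it is safe to run over pages that mix local and external images.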

If you can’t do any of these…

If you do not have the scope to do any of these steps, you can always just go with it. It might be a good test of how decent your hosting really is, but I’m certainly not advising you to do such a thing as it’s quite childish.

If you’re submitting the link yourself, you might want to look at the Coral Cache. It’s not very fast but it will proxy all your traffic for you. Only a tiny percentage of the traffic will get through to your actual server, so you avoid most of the digg-effect pain. There are other services like this which may help you.

Finally…

If you’re reading this on the site, my effort to protect myself has worked. If you’re reading it from a cache, perhaps it got more traffic than I anticipated or I haven’t optimised enough. It’s a hard thing to call before knowing how much traffic something will get.

If nothing else, just take the tips to split the load over more than one server and cache from the database as these are the 2 things that knock dugg sites off the internet faster than any others.

About Oli: I’m a Django and Python programmer, occasional designer, Ubuntu member, Ask Ubuntu moderator and technical blogger. I occasionally like to rant about subjects I should probably learn more about but I usually mean well.
