Gmail Overload Controls And Service Availability at 99.9%
I work at a company that develops software for wireless carriers, and all our software is expected to provide service availability of 99.999% (five nines). That means just about 5 minutes of outage in an entire year. It's a really big deal if that target is missed, and missing it results in hefty penalties. As consumers of a telephone service or a cell phone, we expect the phone to work all the time. You never know when you will need it.
On the web, though, it's a different story. For many services we are quite fine with outages now and then. Unless you are a business whose revenue depends on having an internet portal up and running all the time (like Dell.com or Amazon.com), an outage normally doesn’t affect anything. We regularly see ‘Maintenance or Scheduled Outage’ notifications on most web applications, and we thank them for the notice.
According to Gmail’s official blog, Gmail's service availability is 99.9%, which allows roughly a minute and a half of outage per day, about 43 minutes per month, or about 9 hours per year.
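To make the arithmetic behind these availability figures concrete, here is a small sketch. The function name and the 30-day month and 365-day year are my own assumptions for illustration, not anything from Gmail's blog.

```python
def allowed_downtime_minutes(availability_percent, period_hours):
    """Minutes of outage permitted in a period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_hours * 60


# Assumed period lengths: a 30-day month and a 365-day year.
PERIODS = {
    "day": 24,
    "month": 30 * 24,
    "year": 365 * 24,
}

if __name__ == "__main__":
    for pct in (99.9, 99.999):
        for name, hours in PERIODS.items():
            minutes = allowed_downtime_minutes(pct, hours)
            print(f"{pct}% availability: {minutes:.2f} minutes per {name}")
```

At 99.9% this works out to about 1.44 minutes a day and about 8.76 hours a year, while five nines leaves only about 5.26 minutes for the whole year.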
Gmail's web interface was down yesterday for about 100 minutes due to overload issues. I am pretty sure most of us were not even aware of it. But Google appears to be treating it as a big deal and is trying to make sure it won’t happen again.
Here is what happened:
This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem — we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.
While Google fixed the issue by bringing more servers online, one item in their action plan, a change to how overload is handled, intrigued me.
“if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load”
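To see why this change matters, here is a toy model of the two policies. It is only my sketch of the idea, not Google's actual router logic: the capacities, the load figure, and the assumption that traffic is split evenly across the routers still accepting it are all made up for illustration.

```python
def simulate_shedding(capacities, total_load):
    """Overloaded routers refuse traffic; their load shifts to the rest.

    Returns (routers still serving, number of rounds in which routers dropped out).
    """
    active = list(capacities)
    rounds = 0
    while active:
        share = total_load / len(active)  # assumed: even split across active routers
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # nobody is overloaded, the system is stable
        active = survivors
        rounds += 1
    return len(active), rounds


def simulate_slowdown(capacities, total_load):
    """Every router keeps accepting traffic and simply gets slower.

    Returns the worst slowdown factor (1.0 means no slowdown).
    """
    share = total_load / len(capacities)
    return max(max(1.0, share / c) for c in capacities)


if __name__ == "__main__":
    # Hypothetical requests-per-second capacities for ten routers.
    capacities = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]
    total_load = 1020

    print("shedding:", simulate_shedding(capacities, total_load))
    print("slowdown:", simulate_slowdown(capacities, total_load))
```

In this made-up setup only one router is slightly over its limit at the start, yet under the shedding policy each dropout pushes more load onto the rest and all ten routers are out within three rounds. Under the slowdown policy every router stays in service and the worst one is only about 2% slower.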
In most telecom software, the approach we take during overload is to continue all calls that are already set up without any impact (no degradation of voice quality, for example) and to refuse to set up new calls until the overload has subsided. In most cases, the caller hears an announcement like ‘Network busy. Try again after some time.’
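The telecom approach can be sketched roughly as follows. This is a minimal illustration under my own assumptions: the class name, the simple count-based overload threshold, and the method names are hypothetical and not taken from any real switch software.

```python
class CallAdmissionController:
    """Protects calls in progress; refuses new calls while overloaded."""

    def __init__(self, max_active_calls):
        # Assumed overload rule: a fixed ceiling on simultaneous calls.
        self.max_active_calls = max_active_calls
        self.active_calls = set()

    def setup_call(self, call_id):
        """Try to admit a new call. Returns True if accepted."""
        if len(self.active_calls) >= self.max_active_calls:
            # The caller would hear a 'network busy' announcement here.
            return False
        self.active_calls.add(call_id)
        return True

    def is_active(self, call_id):
        """Established calls stay up regardless of overload."""
        return call_id in self.active_calls

    def end_call(self, call_id):
        """Release the call, freeing capacity for new setups."""
        self.active_calls.discard(call_id)


if __name__ == "__main__":
    controller = CallAdmissionController(max_active_calls=2)
    print(controller.setup_call("alice"))  # accepted
    print(controller.setup_call("bob"))    # accepted
    print(controller.setup_call("carol"))  # refused: network busy
    controller.end_call("alice")
    print(controller.setup_call("carol"))  # accepted once capacity frees up
```

The point of the design is that overload only ever affects callers who have not been admitted yet; nobody who is already talking notices anything.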
I am not sure whether every user of the web interface was affected during the Gmail outage, or only the handful of users who were trying to log in at that time. If every user was affected, I think Google should fix the problem in a way that leaves already logged-in users untouched and refuses all new logins. That way, only a handful of users would be affected.
It appears that Google wants to give everybody delayed responses instead of refusing service to some. Isn’t refusing some users with a proper message better than making everybody feel that Gmail is slow? For people like me, the perception of slow response is a much bigger deal than an outage of a few minutes. I don’t want to spend 10 minutes trying to access and send an email. I am quite fine with doing that 10 minutes later.