
Gmail Overload Controls And Service Availability at 99.9%

I work at a company that develops software for wireless carriers, and all our software is expected to provide 99.999% service availability (five nines). That means only about 5 minutes of outage in an entire year. It is a really big deal if that target is missed, and it results in hefty penalties. As consumers of a telephone service or a cell phone, we expect the phone to work all the time. You never know when you will need it.

On the web, though, it’s a different story. For many services we are quite fine with an outage now and then. Unless you are a business customer whose business depends on having an internet portal up and running all the time, an outage normally doesn’t affect anything. We regularly see ‘Maintenance’ or ‘Scheduled Outage’ notifications on most web applications, and we thank them for the notice.

According to Gmail’s official blog, Gmail’s service availability is 99.9%, which means about a minute and a half of outage per day, about 45 minutes per month, or about 9 hours per year.
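The arithmetic behind these availability figures is easy to check. Here is a small sketch; the function name `downtime_budget` and the 30-day month and 365-day year are my own assumptions for illustration, not anything from Gmail's blog.

```python
def downtime_budget(availability_percent):
    """Return the allowed downtime, in minutes, for a given availability target.

    Assumes a 30-day month and a 365-day year (illustrative choices).
    """
    down_fraction = 1.0 - availability_percent / 100.0
    minutes_per_day = 24 * 60
    return {
        "per_day": down_fraction * minutes_per_day,
        "per_month": down_fraction * minutes_per_day * 30,
        "per_year": down_fraction * minutes_per_day * 365,
    }


if __name__ == "__main__":
    # Compare the web target (three nines) with the telecom target (five nines).
    for target in (99.9, 99.999):
        budget = downtime_budget(target)
        print(
            f"{target}%: {budget['per_day']:.2f} min/day, "
            f"{budget['per_month']:.1f} min/month, "
            f"{budget['per_year'] / 60:.2f} h/year"
        )
```

For 99.9% this works out to about 1.44 minutes per day, 43.2 minutes per month, and 8.76 hours per year. For 99.999% it is about 5.26 minutes per year.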

Gmail’s web interface was down yesterday for about 100 minutes due to overload issues. I am pretty sure most of us were not even aware of it. But Google appears to be making a big deal of it and is trying to make sure it won’t happen again.

Here is what happened:

This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem — we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.

However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.
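To see how a few overloaded routers can take down nearly all of them, here is a toy simulation. It is only a sketch under my own assumptions, not Google's actual routing logic: traffic is split evenly across the routers that are still accepting it, each router has a fixed capacity, and a router whose share exceeds its capacity stops accepting traffic. The name `simulate_cascade` and the capacity numbers are made up for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of routers still accepting traffic after each round.

    Toy model: load is split evenly across active routers, and any router
    whose share exceeds its capacity drops out and sheds its load onto the rest.
    """
    active = list(capacities)
    history = [len(active)]
    while active:
        share = total_load / len(active)
        survivors = [cap for cap in active if cap >= share]
        if len(survivors) == len(active):
            break  # Every remaining router can carry its share: stable.
        active = survivors
        history.append(len(active))
    return history


if __name__ == "__main__":
    # Ten hypothetical routers with slightly different capacities.
    capacities = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]
    print(simulate_cascade(capacities, 1000))  # load that just fits
    print(simulate_cascade(capacities, 1050))  # 5% more load
```

In this model a load of 1000 is stable, but at 1050 the weakest router drops out, its load pushes three more over the edge, and in the next round none are left. The point is only that shedding load onto neighbours can turn a small overload into a total one.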

While Google has fixed the issue by bringing more servers online, one item in their action plan, a change to the overload control, intrigued me:

“if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load”

In most telecom software applications, the approach we take in times of overload is to continue all the calls that are already set up without any impact (for example, no degradation of voice quality) and to refuse to set up any new calls until the overload subsides. In most cases, you may hear an announcement like ‘network busy. try after some time’.
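The telecom approach can be sketched in a few lines. This is a simplified illustration, not code from any real switch: the class name `CallAdmissionController`, the fixed `max_active_calls` threshold, and the return strings are all assumptions of mine.

```python
class CallAdmissionController:
    """Toy admission control: protect calls in progress, refuse new ones on overload."""

    BUSY_MESSAGE = "network busy. try after some time"

    def __init__(self, max_active_calls):
        self.max_active_calls = max_active_calls  # hypothetical overload threshold
        self.active_calls = set()

    def request_call(self, call_id):
        """Admit a new call only if there is room; never touch existing calls."""
        if len(self.active_calls) >= self.max_active_calls:
            return self.BUSY_MESSAGE
        self.active_calls.add(call_id)
        return "connected"

    def end_call(self, call_id):
        """Release a call's slot so a later caller can be admitted."""
        self.active_calls.discard(call_id)


if __name__ == "__main__":
    switch = CallAdmissionController(max_active_calls=2)
    print(switch.request_call("alice"))
    print(switch.request_call("bob"))
    print(switch.request_call("carol"))  # refused; alice and bob are unaffected
    switch.end_call("alice")
    print(switch.request_call("carol"))  # admitted once a slot is free
```

The callers already connected never notice the overload. Only the new caller hears the busy announcement and retries later.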

I am not sure whether every user of the web interface was affected during the Gmail outage, or only the handful of users who were trying to log in at that time. If every user was affected, I think Google should fix the problem in a way that leaves already logged-in users unaffected and refuses all new logins. That way, only a handful of users would be affected.

It appears Google wants to affect everybody with delayed responses instead of refusing service to some. Isn’t refusing some users with a proper message better than making everybody feel that Gmail is slow? For people like me, the perception of slow response is a much bigger deal than an outage of a few minutes. I don’t want to spend 10 minutes accessing and sending an email. I am quite fine with doing that 10 minutes later.
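To put rough numbers on the two choices, here is a back-of-the-envelope sketch. It assumes response time stretches in proportion to the overload, which is a simplification of mine (real queues tend to degrade faster near saturation). The function names and the figures are hypothetical.

```python
def slow_everyone(load, capacity, base_seconds):
    """Serve every request; response time stretches by the overload factor."""
    factor = max(1.0, load / capacity)
    return {"served": load, "refused": 0, "response_seconds": base_seconds * factor}


def refuse_excess(load, capacity, base_seconds):
    """Serve up to capacity at normal speed; refuse the rest with a busy message."""
    served = min(load, capacity)
    return {"served": served, "refused": load - served, "response_seconds": base_seconds}


if __name__ == "__main__":
    # Hypothetical figures: 1200 requests against capacity for 1000, 2 s normal response.
    print(slow_everyone(1200, 1000, 2.0))
    print(refuse_excess(1200, 1000, 2.0))
```

With these made-up figures, slowing everyone down means all 1200 users wait longer, while refusing the excess means 1000 users see normal speed and 200 are asked to come back later.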
