Why was Google down?
Google was down for about 45 minutes on Monday, 14 December 2020.
The outage was worldwide. But what went wrong?
It was not because some software developer missed a semicolon and the bad code was deployed to a server. No, that's not what happened. There is a funny story at the end.
The issue was not with YouTube, Gmail, or Google Drive themselves, and not all Google services were down. Users were not able to use the services that require user authentication. Some were still able to use those services in incognito mode, along with other services that do not require user authentication.
What was the problem, then?
The issue was with Google's central identity management system (CIMS).
What is CIMS?
Google's CIMS manages the authentication of users who want to access its services. Think of it as the security guard in your building who recognizes you and lets you go inside whichever building you want. But if he fails to recognize you, he won't let you in.
The same thing happened with Google's CIMS. When users tried to visit any of the services, CIMS was not able to authenticate them, so they could not access the services and got errors instead.
Before accessing any of the services, users have to pass through the CIMS barrier.
In system design, we call such a component a single point of failure (SPOF).
Single Points Of Failure
A SPOF is a point where the entire system can crash if that one single point fails.
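To make the idea concrete, here is a tiny sketch of a single point of failure. Everything in it is made up for illustration (the `IdentityService` class, the `open_service` function, and the service names); it is not how Google's systems are built, just three pretend services that all ask the same identity service before letting a user in.

```python
class IdentityService:
    """Hypothetical stand-in for a central identity system."""

    def __init__(self):
        self.healthy = True

    def authenticate(self, user):
        # When the shared component is unhealthy, every caller gets an error.
        if not self.healthy:
            raise RuntimeError("identity service unavailable")
        return True


def open_service(name, user, identity):
    # Every service asks the same identity service before serving the user.
    try:
        identity.authenticate(user)
    except RuntimeError:
        return f"{name}: error"
    return f"{name}: ok for {user}"


SERVICES = ["mail", "video", "drive"]  # illustrative names only

identity = IdentityService()
before = [open_service(s, "alice", identity) for s in SERVICES]

identity.healthy = False  # the single point fails
after = [open_service(s, "alice", identity) for s in SERVICES]

print(before)
print(after)
```

Nothing is wrong with the three services themselves, yet all of them return errors once the shared piece is unhealthy. That is what makes it a single point of failure.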
But why did such an important system go down?
It was because of the automated quota management system (AQMS), which reduced the capacity of CIMS and eventually made CIMS go down. Why did AQMS reduce the capacity of CIMS? As of now, there is no answer from Google about that.
According to Google:
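Google has not said how the quota system works internally, so the following is only a toy model of the general idea: if the capacity granted to a service drops far below the demand it receives, most requests start failing. The function names and all the numbers are assumptions picked for illustration.

```python
def serve(requests, capacity):
    """Return (served, rejected) request counts for one time window."""
    served = min(requests, capacity)
    return served, requests - served


def error_rate(requests, capacity):
    """Fraction of requests rejected because capacity ran out."""
    if requests == 0:
        return 0.0
    _, rejected = serve(requests, capacity)
    return rejected / requests


demand = 1000  # hypothetical requests per window

normal = error_rate(demand, capacity=1200)   # quota comfortably above demand
reduced = error_rate(demand, capacity=100)   # quota cut far below demand

print(f"normal quota:  {normal:.0%} errors")
print(f"reduced quota: {reduced:.0%} errors")
```

The demand never changed in this sketch. Only the allowed capacity did, and that alone is enough to turn a healthy service into one that mostly returns errors.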
The root cause was an issue in our automated quota management system which reduced capacity for Google’s central identity management system, causing it to return errors globally. As a result, we couldn’t verify that user requests were authenticated and served errors to our users.
Many were able to access services through incognito mode. How was that possible?
It was because in incognito mode you are not logged in, so there is nothing to authenticate. CIMS played no part here, and users were able to access the Google services that work without signing in. Going back to the security guard story: the guard is drunk and fails to recognize you, so you go around the back, jump the wall, and get inside the building.
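Here is a rough sketch of why the signed-out path kept working. It assumes a request handler that only contacts the identity system when the request carries a session token. `IDENTITY_UP`, `identity_check`, and `handle_request` are hypothetical names invented for this example, not Google's.

```python
IDENTITY_UP = False  # assumption: the identity system is down, as in the outage


def identity_check(token):
    # Hypothetical call to the central identity system.
    if not IDENTITY_UP:
        raise RuntimeError("identity system unavailable")
    return True


def handle_request(page, session_token=None):
    # No token (for example, an incognito window): nothing to verify,
    # so the identity system is never contacted.
    if session_token is None:
        return f"{page}: served anonymously"
    try:
        identity_check(session_token)
    except RuntimeError:
        return f"{page}: error, could not verify sign-in"
    return f"{page}: served for signed-in user"


signed_in = handle_request("video", session_token="abc123")
incognito = handle_request("video")

print(signed_in)
print(incognito)
```

The signed-in request fails because it has a token that cannot be verified, while the anonymous request never touches the broken component at all.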
This is not the first time Google went down; it became a big outage because it was global. This happens at big tech companies: AWS went down a couple of days back, and Microsoft Azure has gone down too.
This is software written by humans and running on machines. Humans have a tendency to make mistakes, so it is bound to have errors.
Fun Story:
There was an incident where a guy had connected all his bulbs, his refrigerator, and all his other smart devices to Google Home. When the outage happened, he was not able to control his light bulbs or anything else. Google Home relies on Google Assistant to control the devices when you say something, and because Google was down, the Assistant could not communicate with the servers and kept getting errors. So this guy could not control his bulbs or smart devices because Google Home went down.
Lesson:
So we should not make our lives depend so much on these applications unless we are okay with these outages.
So next time a website goes down, don't blame software engineers for missing a semicolon ;