Earlier today, Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage in over four years. Experts at Strategic Warfare Group believe this was not a hack, but simply badly orchestrated maintenance.
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
“We’re focused on working to resolve the issue as soon as possible, but can confirm that the issue is not related to a DDoS attack.” (Facebook, @facebook, March 13, 2019)
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
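The check-and-replace behaviour described above can be illustrated with a minimal sketch. This is not Facebook's code: `CountingStore`, `get_config`, the `"limit"` key, and the "value must be positive" rule are all hypothetical names and assumptions chosen for illustration.

```python
from typing import Callable, Dict


class CountingStore:
    """Hypothetical persistent store that counts how often it is read."""

    def __init__(self, data: Dict[str, int]) -> None:
        self.data = dict(data)
        self.reads = 0

    def read(self, key: str) -> int:
        self.reads += 1
        return self.data[key]


def get_config(
    key: str,
    cache: Dict[str, int],
    store: CountingStore,
    is_valid: Callable[[int], bool],
) -> int:
    """Return the cached value, refreshing from the store if it looks invalid."""
    value = cache.get(key)
    if value is not None and is_valid(value):
        return value
    # Missing or invalid cache entry: replace it with the persistent copy.
    fresh = store.read(key)
    cache[key] = fresh
    return fresh


def is_positive(value: int) -> bool:
    # Assumed validity rule, for illustration only.
    return value > 0


# Transient problem: bad cache, good store. One read repairs the cache.
good_store = CountingStore({"limit": 10})
cache = {"limit": -1}
for _ in range(5):
    get_config("limit", cache, good_store, is_positive)
transient_reads = good_store.reads

# Persistent problem: the store itself holds the invalid value,
# so every single lookup falls through to the store.
bad_store = CountingStore({"limit": -1})
cache = {}
for _ in range(5):
    get_config("limit", cache, bad_store, is_positive)
persistent_reads = bad_store.reads
```

In this sketch the transient case costs one store read in total, while the persistent case costs one store read per lookup, because the "repair" keeps writing the same invalid value back into the cache.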
Facebook management made a change to the persistent copy of a configuration value, and the new value was interpreted as invalid. This meant that every user request saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a request got an error while querying one of the databases, the error was interpreted as an invalid value and the corresponding cache key was deleted. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were generating even more requests to themselves. The system had entered a feedback loop that didn’t allow the databases to recover.
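A toy simulation can make the feedback loop concrete. It is a deliberately simplified model, not a reconstruction of Facebook's system: it assumes a single shared cache key, a fixed number of requests per tick, and a database that serves at most `db_capacity` queries per tick and fails the rest. The function name and all parameters are hypothetical.

```python
from typing import List


def simulate(
    requests_per_tick: int,
    db_capacity: int,
    ticks: int,
    delete_on_error: bool,
) -> List[int]:
    """Return the number of database queries issued in each tick."""
    cache_valid = False  # the shared cache key starts out missing
    queries_per_tick = []
    for _ in range(ticks):
        if cache_valid:
            # Every request is served from the cache.
            queries_per_tick.append(0)
            continue
        queries = requests_per_tick
        served = min(queries, db_capacity)
        failed = queries - served
        if served > 0:
            # A successful query repopulates the cache key.
            cache_valid = True
        if failed > 0 and delete_on_error:
            # Flawed handling: an error is treated as an invalid value,
            # so the key that was just repopulated is deleted again.
            cache_valid = False
        queries_per_tick.append(queries)
    return queries_per_tick


overloaded = simulate(1000, 100, 5, delete_on_error=True)
tolerant = simulate(1000, 100, 5, delete_on_error=False)
light_load = simulate(50, 100, 5, delete_on_error=True)
```

Under these assumptions, `overloaded` stays at 1000 queries on every tick, `tolerant` drops to zero after the first tick, and `light_load` also drops to zero because no query fails. In this toy model the flawed handling only stops generating load once incoming traffic falls below what the database can serve.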
The way to stop the feedback cycle was to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, Facebook slowly allowed more people back onto the site.
This got the site back up, and Facebook management has turned off the system that attempts to correct configuration values. They are exploring new designs for this configuration system, following the design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
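The source does not say which designs are under consideration. As a general illustration only, two commonly used patterns for avoiding this kind of loop are to treat a store error differently from an invalid value (keeping the cached entry rather than deleting it) and to space out retries with exponential backoff and random jitter. The sketch below assumes hypothetical names throughout (`StoreUnavailable`, `refresh`, `backoff_delays`).

```python
import random
from typing import Callable, Dict, List, Optional


class StoreUnavailable(Exception):
    """Hypothetical error raised when the persistent store cannot answer."""


def refresh(
    key: str,
    cache: Dict[str, int],
    read_store: Callable[[str], int],
    is_valid: Callable[[int], bool],
) -> Optional[int]:
    """Try to refresh a cache entry without ever deleting it on failure."""
    try:
        fresh = read_store(key)
    except StoreUnavailable:
        # An error says nothing about the value: keep what is cached.
        return cache.get(key)
    if is_valid(fresh):
        cache[key] = fresh
    # An invalid persistent value is not copied over the cached one.
    return cache.get(key)


def backoff_delays(
    base: float, cap: float, attempts: int, rng: random.Random
) -> List[float]:
    """Exponential backoff with full jitter: a random delay up to a growing cap."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


def failing_store(key: str) -> int:
    raise StoreUnavailable(key)


cache = {"limit": 10}
after_error = refresh("limit", cache, failing_store, lambda v: v > 0)
after_invalid = refresh("limit", cache, lambda k: -1, lambda v: v > 0)
after_valid = refresh("limit", cache, lambda k: 20, lambda v: v > 0)
delays = backoff_delays(0.1, 5.0, 8, random.Random(0))
```

The design choice here is that failures never make the cache worse: a request that cannot reach the store falls back to the last cached value, and the jittered delays keep retries from arriving at the store all at once.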