Facebook’s Gut Punch

Looking at the internal impact of a total network failure

In case you hadn’t heard, Facebook went down. And of course, when someone says, “Facebook went down,” it covers much more than facebook.com. Instagram, Oculus, and WhatsApp, all part of the global Facebook system, tend to suffer the same fate as their parent company when problems arise, especially when they involve core services.


But while everyone is talking about the service being down for customers and consumers, we at Open Security thought it might be interesting to discuss what goes on inside a business like Facebook during an outage like the one it experienced yesterday. News sources have speculated that Facebook was losing an estimated $160 million per hour across its various revenue-generating services. Still, a total outage at one of the largest enterprises in the world brings uncounted costs and business risks that go well beyond lost revenue, and on closer inspection they may be even more interesting than the money.

Many may not realize that Facebook relies heavily on its own products to run its business operations. Between internal chat applications, a special employee-only version of Facebook itself, and other services, the company needs itself in order to function. This is not out of the ordinary for a company the size of Facebook, but it does present an incredible amount of risk if any single point of failure exists.

During the outage yesterday, several critical services failed at the same time. According to Facebook employees whom the Open Security team spoke to across several campuses, failures occurred everywhere, from Security Operations Centers to identity verification systems. Any one of these failures would cause problems in a typical working environment. When several (or all) services fail simultaneously, it is a recipe for disaster.

The following sections will dive into these failures from a security perspective and explain how several domains of information security were affected by this event.

Internal Communications and Tools

When internal applications fail the way they did during the outage, the day-to-day operations of a company like Facebook grind to a halt. But what happens when every policy, procedure, notification system, and communication method depends on the same critical assets? Unfortunately, the answer is that nothing happens, and for a long time.

Sources within the Facebook campuses spoke about “gizmos” and “wayfinders” going down, which means vendors, guests, contractors, or other visitors had no way to sign in or figure out where they were going. In a major outage, a company like Facebook may depend on specialists who don’t typically work in the buildings they need to get to in order to fix the problem. In this particular case, these specialists could not sign in, find the right place to go, ask relevant managers over chat, or even get on the WiFi to use a different application for communication. If they did get to the right place, there was no way to verify they were who they claimed to be or even give them access to the tools they needed to do their jobs.

The business risks from these cascading failures can’t be overstated. Failures that generate other failures make root cause analysis difficult, and they hamper the people and processes responsible for responding to and fixing problems. In this case, people couldn’t communicate, reference disaster recovery plans, or even access the areas they needed to reach in order to fix the outage. The vast majority of the difficulties in fixing the network came from the recovery plan’s reliance on the same systems that were experiencing downtime.
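The recovery-plan problem described above can be checked for ahead of time by walking a dependency map and asking which recovery tools would disappear along with the system they are meant to restore. The following is a minimal sketch only: the system names and the dependency map are hypothetical illustrations, not a description of Facebook's actual architecture.

```python
from collections import deque

# Hypothetical dependency map: each system -> the systems it needs to work.
# All names here are illustrative assumptions, not a real inventory.
DEPENDS_ON = {
    "internal_chat": ["core_network"],
    "identity_service": ["core_network"],
    "badge_readers": ["identity_service"],
    "guest_sign_in": ["identity_service"],
    "dr_runbook_wiki": ["core_network"],
    "printed_runbook": [],
    "radio_handsets": [],
}

# Hypothetical list of tools the recovery plan expects responders to use.
RECOVERY_TOOLS = [
    "internal_chat",
    "badge_readers",
    "dr_runbook_wiki",
    "printed_runbook",
    "radio_handsets",
]


def affected_by(failed, depends_on):
    """Return every system that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a failed system to its dependents.
    dependents = {}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(system)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system not in seen:
                seen.add(system)
                queue.append(system)
    return seen


down = affected_by("core_network", DEPENDS_ON)
unusable = sorted(tool for tool in RECOVERY_TOOLS if tool in down)
usable = sorted(tool for tool in RECOVERY_TOOLS if tool not in down)

print("Unusable during the outage:", unusable)
print("Still usable:", usable)
```

In this toy map, the chat system, badge readers, and runbook wiki all fail with the core network, leaving only the out-of-band options. A recovery plan whose tools all land in the first list is the kind of plan that cannot be executed during the failure it was written for.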

Physical Security

For Facebook employees, one of the most noticeable failures during the outage was that of the security door and badging system. This system is what lets employees through the doors to the buildings and rooms they work in every day. For some, this meant a day working from home (not that they could do that either). For the vast majority, who could not be contacted because of the outage, it meant driving to their workplace either to sit in front of a blank computer screen or to be turned away when their identity could not be verified.

Some employees, however, needed to be in the building and had plenty of work to do.

Security guards who protect everything from the front entrance to the main server rooms depend on systems such as badging and door locks to do their daily jobs. When these systems fail, physical security becomes a much more difficult task, especially when specialists and incident response teams need privileged access as quickly as possible.

Employees reported a roughshod application of “improv security,” whereby people were posted at doors that seemed important to their managers at the time, with very little information or instruction on what to do once they got there. Where special teams were deployed to server rooms or other critical areas, security teams were sometimes told to let them in regardless of standard security procedures.

Information Security

Ultimately, Facebook is an information company. It is widely known that its business model is to soak up data from its users and sell it for advertising or other uses. Therefore, the most important thing for its continued operation is the ability to gather and then protect as much information as possible. An outage like the one described here affects both gathering and protection, but protection matters most from a security perspective.

When every single networked system fails, security appliances and programs fail too. Of course, when the network is down, one may think that an attack is also highly unlikely. After all, how could an attacker operate on a network that isn’t working? That may be true, but in an environment as chaotic as Facebook during a network outage, there are other ways to steal, modify, or delete data. And when cameras, security doors, and all standard procedures are broken or suspended, anyone can do just about anything.

The Aftermath

Consider that physical security protocols were abandoned in order to allow specialists into server rooms without any real verification that they were who they claimed to be. Consider that every type of software meant to scan, secure, or monitor critical systems was also out of commission. Consider that even in a perfect environment, it is still possible for a well-equipped attacker to accomplish at least some nefarious goal against an internet-connected network, never mind an insider with access they typically use but might not often abuse. Is Facebook considering these potential risks in the aftermath of the outage? If it is, what could it do about them at this point? More importantly, what can everyone else learn from this?

The biggest lesson to be learned from an event like the one experienced at Facebook and friends is that anyone can become a victim of a major network failure for almost any reason. Whether it’s a direct and targeted attack or an intern pushing the wrong configuration file to the production server, the outcome is the same, and the plan to fix the issue will be tested.

At the end of the day, cyber security operations are meant to enable the business to do its job efficiently and safely regardless of the threat. So have a plan for every piece of your network failing, both individually and in tandem with others, regardless of how that failure comes to pass. Invest in the resilience of your network because your entire business depends on it.