The mega Facebook outage won't be the last of its kind - here's why

Facebook has said that a global outage that recently took its services and internal communications tools offline for several hours was due to a "faulty configuration change" to its routers. 

Although all affected apps are now back online, the outage has left many wondering what happened, whether it could have been avoided and whether something similar could happen again soon.

The company revealed that a misconfiguration within its BGP routing design was allowed to propagate across its routing fabric internally (iBGP) and then externally (eBGP). 

Ronan David, VP Business Development and Marketing, EfficientIP, told TechRadar Pro that while global DNS servers were able to provide resolution to requests for Facebook domains, the public IPs provided in the DNS responses could not be used to route the ensuing external client traffic into Facebook systems.

The problem was exacerbated because Facebook's internal DNS architecture was itself affected by the BGP misconfiguration.

BGP did it 

BGP (Border Gateway Protocol) is the protocol used to route traffic between the independent networks that make up the public internet. Interior routing protocols such as RIP and OSPF operate within a single network rather than between networks.

BGP is responsible for selecting the best available routes to communicate data from a source to a specific destination.
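As a rough illustration of what "selecting the best available route" means, the sketch below ranks candidate routes to the same destination by two of the criteria BGP uses: higher local preference first, then shorter AS path. It is a deliberately simplified model, not a BGP implementation; real routers apply further tie-breakers, and the `Route` class, neighbor labels and prefixes here are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    """A hypothetical candidate route to one destination prefix."""
    prefix: str                # destination network (documentation range)
    next_hop: str              # illustrative neighbor label, not a real peer
    local_pref: int            # higher is preferred
    as_path: Tuple[int, ...]   # networks the route crosses; shorter is preferred


def best_route(candidates: List[Route]) -> Optional[Route]:
    """Pick the route with the highest local preference, then the shortest AS path."""
    if not candidates:
        return None  # no advertisement at all: the destination is unreachable
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))


routes = [
    Route("203.0.113.0/24", "peer-a", 100, (64500, 64501, 64502)),
    Route("203.0.113.0/24", "peer-b", 100, (64510, 64502)),
    Route("203.0.113.0/24", "peer-c", 90, (64502,)),
]

chosen = best_route(routes)
print(chosen.next_hop)  # peer-b: same preference as peer-a, but a shorter path
```

The empty-list case is the one that matters for this outage: once every advertisement for a prefix is withdrawn, there is nothing left to choose from.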

“Due to Facebook's continued improvement in reducing their attack surface the issue was further compounded by an inability to access their internal management network (OOB - Out-of-Band), significantly delaying the time to resolve the issue due to not being able to access their own network and fix the configuration; a bit like forgetting your root or admin password and irreversibly losing access to your workstation, though at global internet scale,” added David.

Facebook's authoritative name servers are advertised to the rest of the internet via BGP.

David explained that, to ensure reliable operation, Facebook's DNS servers withdraw their BGP advertisements if they cannot reach the company's data centers.

“In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements,” he explained.

“The end result was that their DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find their servers.”
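The behavior David describes can be modeled in a few lines. The sketch below is an assumption-laden toy, not Facebook's actual tooling: each hypothetical `DnsSite` advertises its prefix only while a health check says it can reach the data centers, and a shared `backbone` flag stands in for the real backbone.

```python
from typing import Callable, List, Set


class DnsSite:
    """A hypothetical DNS location that advertises its prefix only while healthy."""

    def __init__(self, name: str, prefix: str, can_reach_datacenters: Callable[[], bool]):
        self.name = name
        self.prefix = prefix
        self.can_reach_datacenters = can_reach_datacenters

    def reconcile(self, advertised: Set[str]) -> None:
        """Advertise the prefix when healthy, withdraw it otherwise."""
        if self.can_reach_datacenters():
            advertised.add(self.prefix)
        else:
            advertised.discard(self.prefix)


# One shared flag stands in for the backbone that links every site to the data centers.
backbone = {"up": True}
sites: List[DnsSite] = [
    DnsSite("site-1", "192.0.2.0/24", lambda: backbone["up"]),
    DnsSite("site-2", "198.51.100.0/24", lambda: backbone["up"]),
]
advertised: Set[str] = set()


def reconcile_all() -> None:
    for site in sites:
        site.reconcile(advertised)


reconcile_all()
healthy_count = len(advertised)   # both sites advertise while the backbone is up

backbone["up"] = False            # the backbone is taken out of operation
reconcile_all()
outage_count = len(advertised)    # every site withdraws, although none has crashed

print(healthy_count, outage_count)
```

The safeguard is sensible for a single failing site. The weakness shown here is that every site depends on the same backbone, so one failure makes all of them withdraw together.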

DNS is the internet’s equivalent of the contact list on a phone. The Domain Name System (DNS) translates the hostname in a URL into a numeric IP address through name resolution, which tells a browser where to connect.

Lavell Juan, CEO of vertically integrated social network company Brag House, told TechRadar Pro that the foundation of any good social network is usability and a scalable infrastructure.

“Ensuring the design and frontend development engages the user and makes all the functionality easily accessible is just as important as having an infrastructure that can grow with and support the user base. From there, it's all about finding the right languages and frameworks to best create your end product,” he said.

“The most common misconfigurations involve software and data servers. Software can be outdated or missing a key security patch, while servers could require an upgrade or be incorrectly sized. The best way to avoid these issues is through proper documentation and automating processes to reduce manual work.”

Preventing further outages 

Cloud-scale network fabrics lean heavily on automation, both to enable scale and to remove human error, yet there is still a human component to the overall process.

David explained that 'guardrails', which ensure that critical infrastructure decisions are controlled and validated before being deployed, are vital to the stability and continuity of services at internet scale.

“Guardrails apply not only to the cloud service providers' management of infrastructure but also to the businesses that build upon these platforms,” he said.

“Owners of websites need to be careful about cloud vendor lock-in and design-in the ability to migrate their business assets and processes to other competing cloud platforms, which in turn puts pressure on these cloud service providers to provide the best possible service or lose their clients.”

Juan pointed the finger at human error causing the majority of interruptions and suggested that testing should be an integral part of the development process and should catch the vast majority of these misconfigurations before they get pushed to production.
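To make the idea of a guardrail concrete, here is a minimal sketch of a pre-deployment check on a routing change. It assumes a change can be summarized as the set of prefixes advertised before and after. The function name, the 25% threshold and the notion of "protected" prefixes are illustrative choices for this example, not a description of any vendor's or Facebook's tooling.

```python
from typing import List, Set


def validate_change(
    current: Set[str],
    proposed: Set[str],
    protected: Set[str],
    max_withdraw_fraction: float = 0.25,  # illustrative threshold, not a standard
) -> List[str]:
    """Return the reasons to block a change; an empty list means it may proceed."""
    problems: List[str] = []
    withdrawn = current - proposed

    # Rule 1: prefixes marked as critical (for example, DNS) must never be withdrawn.
    lost_protected = withdrawn & protected
    if lost_protected:
        problems.append("withdraws protected prefixes: " + ", ".join(sorted(lost_protected)))

    # Rule 2: a single change must not withdraw a large share of what is advertised.
    if current and len(withdrawn) / len(current) > max_withdraw_fraction:
        problems.append(f"withdraws {len(withdrawn)} of {len(current)} advertised prefixes")

    return problems


current = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/25", "203.0.113.128/25"}
protected = {"192.0.2.0/24"}  # hypothetical: the prefix serving DNS

small_change = current - {"203.0.113.128/25"}  # retire one non-critical prefix
drastic_change: Set[str] = set()               # withdraw everything

print(validate_change(current, small_change, protected))    # []
print(validate_change(current, drastic_change, protected))  # two reasons to block
```

A check like this would run in the deployment pipeline, so that a change is rejected or escalated for human review before it reaches production routers.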

While gatekeeper may be too strong a term, Amazon, Facebook, Apple and Google have become custodians of access to some of the largest marketplaces today. Alongside the acknowledged fear of vendor lock-in, there is also the fear of marketplace lock-out.

“All of these businesses apply the tactics and economics of platform strategy - each provides technologies such as identity and authentication for their user base, enabling their users to access apps within and across internet ecosystems; without access, how will the businesses built upon those platforms reach their customers?” said David.

“Companies have little choice but to ensure they are integrated into these ecosystems and are already, in many cases, entirely dependent on them for their own success. Multi-provider strategies are key, though within the oligopoly of Facebook, Apple, Amazon and Google, technology alone will not be a panacea for mitigation of these risks.”

This does not mean that there are no options available, although these options are more likely to be viable for larger businesses. The world has already seen Dropbox migrate most of its storage away from Amazon to its own infrastructure, while Netflix runs its own content delivery network alongside Amazon's cloud.

The key takeaway, David says, is that much of the know-how and technology has been commoditized and is available to businesses. What is still lacking is a highly skilled workforce able to take advantage of it.

“Investing in training and organic growth to ensure companies can compete at the same levels of technological maturity must be prioritized as a competitive strategy for tomorrow's businesses,” he concluded.

So in short, can big outages like that of Facebook and WhatsApp happen again? Yes. But because this outage stemmed from a faulty configuration change, the kind of human error that validation can catch, such disruptions can be minimized through regular testing throughout the development process.

Abigail Opiah
B2B Editor - Web hosting & Website builders

Abigail is a B2B Editor who specializes in web hosting and website builder news, features and reviews at TechRadar Pro. She has been a B2B journalist for more than five years, covering a wide range of topics in the technology sector from colocation and cloud to data centers and telecommunications. As a B2B web hosting and website builder editor, Abigail also writes how-to guides and deals for the sector, keeping up to date with the latest trends in the hosting industry. Abigail is also extremely keen on commissioning contributed content from experts in the web hosting and website builder field.
