How Microsoft is stamping out Azure failures and improving reliability

Ideally, the system would also automatically roll back any update that causes problems. This already happens on Bing, but it isn't always possible. "We have some metrics that allow us to tell if a build is doing well or not and to go backwards if we need to. But that's not always something we can do. For example, if we've made an update to a database schema, we can't roll it back because you'd lose data. If we've put out a feature that allows a customer to create something, we can't go back because they'd lose what they've built. So we have to do what we call roll forward – fix the bug and get it out."
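The decision described in that quote can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Microsoft's deployment tooling: the `Update` class, its fields and the `remediation` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Update:
    """Hypothetical description of a deployed build (illustrative only)."""
    name: str
    healthy: bool                        # what the post-deployment metrics say
    changes_schema: bool = False         # the build altered a database schema
    creates_customer_data: bool = False  # customers may have built things with it


def remediation(update: Update) -> str:
    """Pick an action for a build, following the reasoning in the quote."""
    if update.healthy:
        return "keep"
    # Reverting would destroy data or customer work, so fix the bug and ship again.
    if update.changes_schema or update.creates_customer_data:
        return "roll_forward"
    # Nothing would be lost, so going back to the previous build is safe.
    return "roll_back"


if __name__ == "__main__":
    for build in (
        Update("ui-tweak", healthy=False),
        Update("schema-v2", healthy=False, changes_schema=True),
    ):
        print(build.name, "->", remediation(build))
```

The point of the sketch is that rollback is the default only when nothing irreversible has happened; a schema change or customer-created content pushes the fix in the other direction.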

Many Azure failures happen without anyone ever noticing, he points out. "The system is designed to auto recover from failure as much as possible. For example, with a service like AzureDB or Azure Storage there are machine failures all the time in the clusters because we're running at such large scale. We lose 2-3% of our servers every year but it's completely transparent to anybody running on those services because of the way they're designed. We monitor it to see if there's a problem causing those failures, but a server failing is invisible to customers."
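To put the 2-3% figure in perspective, here is a rough back-of-the-envelope sketch. The fleet size used below is purely hypothetical, since no server count is given here, and the calculation assumes failures are spread evenly across the year, which is a simplification.

```python
def expected_daily_failures(fleet_size: int, annual_failure_rate: float) -> float:
    """Average number of servers lost per day, assuming an even spread over 365 days."""
    return fleet_size * annual_failure_rate / 365


if __name__ == "__main__":
    hypothetical_fleet = 100_000  # assumption for illustration, not a Microsoft figure
    for rate in (0.02, 0.03):     # the 2-3% annual loss quoted above
        per_day = expected_daily_failures(hypothetical_fleet, rate)
        print(f"{rate:.0%} a year: about {per_day:.1f} servers a day")
```

At any fleet size in that range, losing a server is a routine, everyday event rather than an exception, which is why the services have to absorb it without customers noticing.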

Lessons in Azure

One lesson that Microsoft has learned from the most recent Azure outages is that it needs a system outside Azure where customers can check on the status of the service. "The dashboard is a higher level piece of the platform and it depends on parts of the system above Azure Storage and the core services. It needs to be higher level to be public facing, but it totally makes sense that even if Azure is down that we need to communicate with customers. So we've come up with a system to fail that dashboard over outside Azure if it runs into problems because it's depending on another part of Azure."
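The idea of failing the dashboard over to somewhere outside Azure can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of Microsoft's system: the host names, the `choose_status_host` function and the three-strikes threshold are all hypothetical, and the health probe is passed in as a plain function so the example runs without any network access.

```python
from typing import Callable


def choose_status_host(
    primary: str,
    external: str,
    is_healthy: Callable[[str], bool],
    attempts: int = 3,
) -> str:
    """Return the host that should serve the status page.

    The primary (hosted on the platform) is used unless every one of
    `attempts` consecutive probes fails; then the external copy takes over.
    """
    for _ in range(attempts):
        if is_healthy(primary):
            return primary
    return external


if __name__ == "__main__":
    # Hypothetical host names; the probe here simply simulates an outage.
    primary_host = "status.platform.example"
    external_host = "status.elsewhere.example"
    platform_is_down = lambda host: False
    print(choose_status_host(primary_host, external_host, platform_is_down))
```

Requiring several failed probes before switching is one way to avoid flipping to the external copy because of a single dropped check.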

But Microsoft isn't going to try that with any other Azure features, he says. "It's just a communication interface – it's not a service. We can't fail over an Azure service out of Azure, because we'd need another Azure to run it on!"

Contributor

Mary started her career at Future Publishing, saw the AOL meltdown first-hand the first time around when she ran the AOL UK computing channel, and she's been a freelance tech writer for over a decade. She's used every version of Windows and Office released, and every smartphone too, but she's still looking for the perfect tablet. Yes, she really does have USB earrings.