The true cost of outages and why monitoring AI dependencies is crucial


In the digital age, where businesses and consumers alike thrive on seamless connectivity and uninterrupted service, recent major outages have sounded the alarm. From ChatGPT's blackouts to other tech giants grappling with unanticipated downtime, these disruptions can be staggeringly expensive, and their repercussions extend beyond monetary loss. According to Dun & Bradstreet, 59% of Fortune 500 companies endure at least 1.6 hours of downtime each week, at an average weekly cost ranging from $643,200 to $1,056,000.
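To put the Dun & Bradstreet figures into more familiar units, the short Python sketch below converts them into an implied hourly cost and an annual total. It rests on simplifications the source does not state: that the 1.6 hours and the weekly cost recur in every one of 52 weeks, and that cost scales evenly with each hour of downtime. The function names are illustrative only.

```python
# Back-of-the-envelope sketch. Assumes the weekly downtime and cost figures
# quoted above repeat unchanged for 52 weeks and that cost scales evenly per hour.
WEEKS_PER_YEAR = 52
WEEKLY_DOWNTIME_HOURS = 1.6
WEEKLY_COST_RANGE = (643_200, 1_056_000)  # US dollars, per Dun & Bradstreet


def implied_hourly_cost(weekly_cost: float, weekly_hours: float) -> float:
    """Cost of one hour of downtime implied by the weekly figures."""
    return weekly_cost / weekly_hours


def annualized_cost(weekly_cost: float, weeks: int = WEEKS_PER_YEAR) -> float:
    """Total cost if the weekly figure recurs every week of the year."""
    return weekly_cost * weeks


for weekly_cost in WEEKLY_COST_RANGE:
    hourly = implied_hourly_cost(weekly_cost, WEEKLY_DOWNTIME_HOURS)
    yearly = annualized_cost(weekly_cost)
    print(f"weekly ${weekly_cost:,}: ~${hourly:,.0f} per hour, ~${yearly:,} per year")
```

Under those assumptions, the quoted range works out to roughly $402,000 to $660,000 per hour of downtime, or about $33.4 million to $54.9 million a year.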

Businesses have also seen their reputations take a hit from these costly moments. Beyond the immediate losses lies a new concern: how can businesses effectively shield themselves against the steep impact of future outages? Downtime, the period when systems are either inaccessible or not functioning optimally, severely disrupts user access to online services, halts employee productivity and can prevent customers from engaging with an organization.

As the internet is an intricate web of interconnected systems, networks and applications, these disruptions can quickly escalate, significantly damaging an organization's reputation. The statistics paint a grim picture. Forrester’s 2023 Opportunity Snapshot found that:

1/ 37% of respondents estimated that their companies lost between $100,000 and $499,000 due to internet disruptions, and 39% estimated losses of $500,000 to $999,999.

2/ Disruptions also damage companies internally by increasing employee churn (55%) and reducing workforce productivity (49%). 

3/ Without adequate visibility, companies are experiencing 76 disruptions per month on average. 

4/ 75% of respondents said Internet Performance Monitoring (IPM) would have a significant or large positive impact on their business.

The U.S. AI market has an estimated value of between $87.18 billion and $167.3 billion, and its growth is causing the digital landscape to evolve at breakneck speed. The growing dependence on AI-driven applications is shining a spotlight on the need for proactive monitoring against downtime. The February 14th ChatGPT outage affected not only the ChatGPT service itself but also customers running GPT-based chatbots through OpenAI's API. Monitoring AI dependencies will be critical for all businesses, from startups to enterprises.
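At its simplest, monitoring an AI dependency means running a synthetic check that calls the dependency on a schedule, times the response and flags errors or slow answers, so that an upstream outage shows up on your own dashboards rather than in customer complaints. The following is a minimal Python sketch using only the standard library; the `probe` and `http_fetch` helpers, the health-check URL and the latency budget are hypothetical placeholders, not part of any vendor's product or API.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_s: float
    detail: str


def probe(name: str, fetch: Callable[[], int], latency_budget_s: float) -> ProbeResult:
    """Run one synthetic check: call `fetch`, time it and classify the outcome."""
    start = time.monotonic()
    try:
        status = fetch()
    except Exception as exc:  # timeouts, DNS failures, HTTP errors raised by urllib...
        return ProbeResult(name, False, time.monotonic() - start, f"error: {exc}")
    latency = time.monotonic() - start
    if not 200 <= status < 300:
        return ProbeResult(name, False, latency, f"HTTP {status}")
    if latency > latency_budget_s:
        # The dependency answered, but too slowly to count as healthy.
        return ProbeResult(name, False, latency, "slow")
    return ProbeResult(name, True, latency, "ok")


def http_fetch(url: str, timeout_s: float) -> Callable[[], int]:
    """Build a fetch function that sends a GET request and returns the HTTP status."""
    def fetch() -> int:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    return fetch
```

For example, `probe("llm-api", http_fetch("https://ai.example.com/health", timeout_s=5.0), latency_budget_s=2.0)` would check a hypothetical health endpoint. Keeping the fetch function injectable means the classification logic can be tested without network access; in practice such checks would run on a schedule and feed an alerting system.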

Mehdi Daoudi

Co-founder and CEO of Catchpoint.

Case in point

In December 2023, Adobe's extensive customer base was hit by a series of outages in Adobe Experience Cloud lasting 18 hours. AI was not the cause of this incident, but many businesses are starting to rely more heavily on the technology, and the outage shows what could happen once AI is more deeply embedded in such platforms. More broadly, the Adobe Experience Cloud outage highlights the vulnerabilities inherent in relying on third-party services within digital infrastructure. The incident, caused by a failure in Adobe's cloud infrastructure, disrupted critical functions across multiple platforms.

According to Adobe, the outage affected Data Collection (Segment Publishing), Data Processing (Cross-Device Analytics, Analytics Data Processing) and Reporting Applications (Analysis Workspace, Legacy Report Builder, Data Connectors, Data Feeds, Data Warehouse, Web Services API). During the outage, users experienced delays and sluggish performance across these services. Postmortem investigations traced the root cause to issues within Adobe's cloud infrastructure, which led to latency spikes and prolonged loading times for users.

The failure within Adobe's infrastructure had far-reaching consequences for businesses and users who depend on Adobe services for day-to-day operations. On top of that, Adobe risked incurring Service Level Agreement (SLA) violations for millions of customers. An SLA is a commitment to a defined level of service; in customer support, for example, it sets a definite timeframe in which tickets must be answered or chats and calls picked up. If they are not handled within that timeframe, an SLA violation occurs, and payouts often follow. Customer loyalty may also be tested.
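To illustrate the ticket-based SLA described above, the Python sketch below counts violations against a response window. The four-hour window, the ticket tuple layout and the `sla_violations` function are hypothetical; real SLA terms vary by contract and are often measured differently (for example, in business hours only).

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple

# (ticket_id, opened_at, first_response_at); first_response_at is None while unanswered.
Ticket = Tuple[str, datetime, Optional[datetime]]


def sla_violations(tickets: Iterable[Ticket], sla: timedelta, now: datetime) -> List[str]:
    """Return the IDs of tickets whose wait for a first response exceeded `sla`."""
    violated = []
    for ticket_id, opened_at, responded_at in tickets:
        # Unanswered tickets keep accruing wait time until `now`.
        waited = (responded_at or now) - opened_at
        if waited > sla:
            violated.append(ticket_id)
    return violated


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    tickets = [
        ("T1", t0, t0 + timedelta(hours=1)),  # answered after 1 hour
        ("T2", t0, t0 + timedelta(hours=5)),  # answered after 5 hours
        ("T3", t0, None),                     # still waiting
    ]
    # Hypothetical 4-hour first-response SLA, evaluated 6 hours after opening.
    print(sla_violations(tickets, timedelta(hours=4), now=t0 + timedelta(hours=6)))  # ['T2', 'T3']
```

During an outage, unanswered tickets keep accruing wait time, which is why violations can pile up quickly across a large customer base.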

The Adobe outage was more than a disruption: it was a wake-up call for businesses using Adobe's services to reevaluate their wider approach to digital resilience. The sheer number of services affected is a reminder that businesses should always have contingency plans in place and take proactive measures to safeguard against future disruptions.

So, how can businesses better navigate these risks and build a robust path to Internet resilience? It requires a fundamental shift, one that prioritizes real-time visibility into application performance so that potential bottlenecks and other pain points are identified before they snowball into full-blown crises. By monitoring AI (and other) dependencies with laser-like precision, organizations can preemptively address vulnerabilities, fortify their digital infrastructure and mitigate the fallout of unforeseen disruptions.
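One simple way to turn real-time timing data into a bottleneck signal is to compare each stage of a request against a latency budget and rank the overruns. The stage names, budgets and `find_bottlenecks` helper below are hypothetical and chosen only to illustrate the idea in Python.

```python
from typing import Dict, List, Tuple


def find_bottlenecks(timings_ms: Dict[str, float], budgets_ms: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (stage, overrun_ms) for every stage that exceeded its latency budget,
    largest overrun first. Stages without a recorded timing are skipped."""
    overruns = [
        (stage, timings_ms[stage] - budget)
        for stage, budget in budgets_ms.items()
        if stage in timings_ms and timings_ms[stage] > budget
    ]
    return sorted(overruns, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical timings for one request through an AI-backed feature.
    timings = {"frontend": 120, "api_gateway": 40, "llm_inference": 2600, "database": 90}
    budgets = {"frontend": 200, "api_gateway": 50, "llm_inference": 1500, "database": 60}
    print(find_bottlenecks(timings, budgets))  # [('llm_inference', 1100), ('database', 30)]
```

In this made-up example the AI inference step surfaces first, pointing the team at the dependency rather than at the frontend.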

Guarding against downtime in the age of AI

It's undeniable that in today's fiercely competitive landscape, even the briefest interruption of service poses a major risk to consumer confidence and brand trust. To counter these risks, organizations must take a proactive approach to performance monitoring, particularly for AI-driven applications, which are rapidly becoming part of everyday business. Unlike traditional applications, AI-driven systems often operate autonomously, making split-second decisions based on vast amounts of data.

Any disruption to these systems can trigger a cascade of errors and delays, breaking down user interactions and, ultimately, eroding trust in the brand. Real-time visibility into application performance enables businesses to detect anomalies swiftly, optimize functionality and keep user interactions seamless. The ability to identify and address issues promptly as they arise empowers IT teams to maintain operational continuity and limit potential damage.

Predictive analytics and AI-driven anomaly detection play pivotal roles in identifying potential issues before they disrupt end-user experiences. As reliance on AI technologies continues to grow, uninterrupted service will become an ever more critical business imperative. Yet achieving early detection can prove challenging.
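Production anomaly detection is usually more sophisticated, but a rolling z-score shows the basic idea behind early detection: learn what normal looks like from recent samples and flag values that stray far from it. The sketch below is a plain statistical stand-in, not the AI-driven detection described above; the window size, threshold and `RollingZScoreDetector` class are illustrative assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flag a sample as anomalous when it lies more than `threshold` population
    standard deviations from the mean of the previous `window` samples."""

    def __init__(self, window: int = 30, threshold: float = 3.0) -> None:
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        anomalous = False
        # Only judge once a full window of history is available.
        if len(self.history) == self.history.maxlen:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            # A perfectly flat history has sigma == 0; skip it to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        # Every sample, anomalous or not, joins the window (the oldest one drops out).
        self.history.append(value)
        return anomalous


if __name__ == "__main__":
    detector = RollingZScoreDetector(window=5, threshold=3.0)
    latencies_ms = [100, 102, 98, 101, 99, 100, 400]  # hypothetical response times
    print([detector.observe(v) for v in latencies_ms])
    # [False, False, False, False, False, False, True]
```

Fed a stream of, say, per-minute response times from an AI API, it stays quiet through normal jitter and flags a sudden latency spike. Because flagged samples still enter the history, a sustained shift eventually becomes the new baseline; whether that is desirable depends on the alerting policy.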

Many enterprises still rely on basic uptime monitoring, often limited to their homepage, which leaves them blind to intermittent or partial site failures when an AI-dependent service fails. To defend against AI-induced downtime, organizations should implement holistic monitoring strategies such as Internet Performance Monitoring (IPM) that span the entire spectrum of AI-driven applications, from frontend interfaces to backend data processing pipelines.
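The gap between homepage-only checks and holistic monitoring can be shown in a few lines of Python: the same site looks healthy or partially broken depending on which checks are run. The check names and the `classify_health` helper are hypothetical.

```python
from typing import Dict, List, Tuple


def classify_health(results: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Summarise named check results as 'up', 'partial' or 'down', plus the failing checks."""
    if not results:
        raise ValueError("no checks were run")
    failing = sorted(name for name, ok in results.items() if not ok)
    if not failing:
        status = "up"
    elif len(failing) == len(results):
        status = "down"
    else:
        status = "partial"
    return status, failing


if __name__ == "__main__":
    # Homepage-only monitoring sees nothing wrong...
    homepage_only = {"homepage": True}
    # ...while a check on the AI-backed feature reveals a partial failure.
    full_journey = {"homepage": True, "checkout": True, "ai_chat_assistant": False}
    print(classify_health(homepage_only))  # ('up', [])
    print(classify_health(full_journey))   # ('partial', ['ai_chat_assistant'])
```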

By proactively monitoring AI dependencies and deploying robust performance management frameworks, businesses can mitigate the risk of costly downtime and sustain operational continuity in an increasingly AI-driven landscape. Consider this a call to action: think ahead, anticipate these challenges and equip operations teams to manage them.


This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro
