
Rethinking cloud availability

Cloud architectures that fail smarter and recover faster

2026.02.27 | Richard Simon

Turning risk into readiness

Outages are inevitable in a connected world where digital dependencies run deep. True availability is not defined by the absence of disruption, but by an organization’s ability to absorb impact, recover quickly, and continue delivering value. This article explores how preparedness, architectural foresight, and a culture of readiness can turn uncertainty into lasting continuity.


The fragility of the cloud

I was going to call it “The hidden fragility of the cloud”, but there’s nothing hidden about a major cloud outage – it’s out there for all to see! In certain quarters, as my friend Paul Bevan, formerly Head of Infrastructure Research at Bloor Research and now semi-retired, often points out, organizations frequently move their workloads to the cloud and assume the process is complete. They treat it as though the responsibility is now external and no longer requires their attention.

Clearly, there’s a great deal more to migration into cloud, especially public cloud, than that – the shared responsibility model being the one thing that shatters that somewhat naïve thinking!

Cloud outages are not uncommon in today’s hyperconnected world of complex, interdependent services. Despite the sophisticated design of hyperscaler platforms, disruptions can occur for many reasons, ranging from software updates, configuration errors, and network issues to power interruptions, human error, and cascading failures across dependent services.

As organizations increasingly rely on cloud environments for mission-critical workloads, even a short downtime can have a long-lasting ripple effect, impacting availability, customer trust, public perception, and revenue continuity. These incidents highlight a crucial truth: business service resilience cannot be outsourced; it must be architected.

When disruption occurs

When a major cloud outage occurs, the symptoms are felt instantly across industries. Applications slow down, application programming interfaces (APIs) become unresponsive, data processing halts, and customer-facing services go offline, with websites displaying unfriendly error messages.

The immediate business impact includes lost transactional data, reduced productivity, and reputational damage. In government, healthcare, e-commerce, and especially finance, even a brief period of downtime – whether minutes or seconds – can cause revenue loss, compliance breaches, missed service level agreements (SLAs), and potential data integrity issues.

However, the true cost is often hidden in recovery time, customer churn, and the operational effort required to restore normalcy.

Why traditional safeguards aren’t enough

To counter these risks, organizations have traditionally adopted a range of mitigation strategies such as multi-Availability-Zone (multi-AZ) deployments, data replication, disaster recovery (DR) setups, and even multi-cloud strategies.

While these approaches help to a degree, they are not foolproof. Multi-cloud, for example, may introduce latency, governance complexity, and skills gaps, as well as operational overheads. Similarly, a backup or DR plan may only protect data; it does not necessarily guarantee uptime or seamless failover and failback capability. As a result, current recovery plans may be too slow and cumbersome for some of today’s high-performing, data-streaming, and event-based architectures.

In other words, redundancy alone does not ensure availability. What matters is how well these systems are architected, automated, and periodically tested under failure conditions.
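To make that concrete, here is a minimal sketch of what “architected and automated” can mean: failover driven by health checks rather than manual intervention. The region names and the probe function are illustrative assumptions for this sketch, not any specific provider’s API:

```python
def probe(region_healthy):
    """Stand-in for an HTTP health check against a region's endpoint."""
    return region_healthy

def choose_endpoint(regions):
    """Return the first healthy region; raise if none are available."""
    for name, healthy in regions:
        if probe(healthy):
            return name
    raise RuntimeError("no healthy region - page the on-call engineer")

# The primary region is down; traffic is routed to the secondary.
regions = [("eu-primary", False), ("eu-secondary", True)]
print(choose_endpoint(regions))   # prints "eu-secondary"
```

The point of the sketch is the last paragraph’s argument in miniature: having the second region (redundancy) achieves nothing unless something automated actually checks health and redirects traffic – and unless that path is exercised regularly under failure conditions.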

Building confidence through chaos

High availability (HA) is often the cornerstone of resilient cloud architecture. It focuses on minimizing downtime by designing systems that continue operating even when one or more components fail.
But to move from theory to reality, enterprises need to go a step further – by embracing techniques such as chaos engineering. This practice involves intentionally simulating failures in production environments to observe how systems behave when things go wrong. By proactively identifying weaknesses before real incidents occur, organizations can strengthen their architecture and response mechanisms.

Netflix, one of the pioneers of chaos engineering, famously used this approach to ensure its streaming platform remains available despite frequent infrastructure changes. Similar practices can help enterprises gain confidence in their cloud designs, regardless of the provider.
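A toy illustration of the principle (not any particular chaos tool – the function names, fallback value, and failure rate are assumptions for the sketch): wrap a dependency so it fails on demand, then verify the caller degrades gracefully instead of crashing.

```python
import random

def flaky(call, failure_rate=0.3):
    """Wrap a dependency call so it fails randomly, simulating an outage."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected fault: dependency unavailable")
        return call(*args, **kwargs)
    return wrapped

def fetch_recommendations(user_id):
    return ["item-1", "item-2"]          # pretend remote service

def fetch_with_fallback(user_id, primary):
    try:
        return primary(user_id)
    except ConnectionError:
        return ["popular-item"]          # cached, degraded fallback

# Chaos experiment: force every call to fail and confirm graceful degradation.
chaotic = flaky(fetch_recommendations, failure_rate=1.0)
result = fetch_with_fallback("u42", chaotic)
assert result == ["popular-item"]        # service survives the injected fault
```

Real chaos engineering runs experiments like this continuously, against production traffic, with a hypothesis and a blast-radius limit – but the shape is the same: inject the failure deliberately, before an outage does it for you.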

Balancing availability against cost and complexity

Building a truly HA-capable cloud architecture (that is tested for smooth failover and failback capability) comes with a trade-off: increased cost and operational complexity. Redundant resources, multi-region deployments, and automated failovers all require additional investment.

I recently commented on someone’s online post arguing that cloud services should include “resiliency” in their designs. Well, first of all, resilience is not the same as HA – look up the definitions for yourself and do not mistake or conflate the two! Second, I painted the following scenario in a comment to the original post:

  • CEO calls in the CIO and CTO about a recent outage
  • The CIO and CTO explain to the CEO how much it will cost to introduce HA
  • The CEO goes to the CFO – the CFO says, “forget it”

I have heard this and similar conversations (in most cases with the same outcome) taking place in many organizations, and the one thing agreed upon unanimously is that cost is a major deciding factor!

Many organizations face pressure to optimize cloud spend, which often leads to resilience features being deprioritized or never being implemented due to budget constraints. Furthermore, cloud-native architectures introduce multiple layers of dependencies – microservices, containers, and APIs – that need coordinated availability strategies.

Balancing cost efficiency with reliability becomes the key challenge. The solution lies in aligning business priorities with technical design. Not every workload needs 99.999 percent uptime, but mission-critical systems certainly do. So why not explore a minimal HA implementation, elsewhere in the cloud, that allows the business to keep its most critical services running in the failed-over state?
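Those uptime percentages translate into very different downtime budgets, which is why the tiering decision matters so much for cost. A quick calculation of the yearly downtime each common “nines” tier allows:

```python
# Downtime allowed per year for common availability tiers ("nines").
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability_pct):
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_minutes_per_year(pct):.1f} min/year")
# 99.0%   uptime -> 5256.0 min/year  (~3.7 days)
# 99.999% uptime -> 5.3 min/year
```

Going from three nines to five nines shrinks the budget from roughly nine hours a year to about five minutes – and every extra nine typically costs disproportionately more to engineer, which is exactly the CFO conversation described above.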

The blueprint for HA architecture

  1. Embrace failure: Accept that failures will happen, regardless of whether you are in the data center or public cloud – it’s a fact of doing business using any technology! Architect systems assuming individual components will fail. Use decoupled designs, cell-based architecture, asynchronous replication, battle-hardened failover/failback, and automated recovery mechanisms.
  2. Prioritize workloads: Not all applications require the same level of resilience. Classify workloads based on business criticality and invest proportionally.
  3. Implement observability: Real-time visibility and pattern detection (with AIOps) across systems is crucial. Monitor performance metrics, dependencies, and user experience continuously to detect early signs of degradation.
  4. Test regularly: Conduct controlled failure simulations or game days to validate your recovery procedures. Document learnings and update architecture accordingly.
  5. Automate recovery: Manual interventions delay restoration. Use Infrastructure as Code (IaC) and self-healing mechanisms for faster recovery.
  6. Balance multi-cloud: Adopt multi-cloud selectively. For some workloads, diversity across providers may enhance uptime, but for others, it can add unnecessary complexity and cost.
  7. Review SLAs and shared responsibility: Understand the shared responsibility model of your cloud provider. Ensure clarity on what aspects of availability and security are covered by the provider and what remains your responsibility.
  8. You think you’re finished? Think again! Check what sits above you in the cloud hierarchy. Chances are that, while you may have good failover plans locally, a failure higher up the stack – outside your control – may put a spanner in those plans.
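Point 5 above, automated recovery, can start as simply as retrying transient failures with backoff before a human is ever involved. A generic sketch (not tied to any IaC tool or provider SDK; the flaky function is invented for illustration):

```python
import time

def with_retries(action, attempts=4, base_delay=0.1):
    """Retry a flaky action with exponential backoff; re-raise if all fail."""
    for i in range(attempts):
        try:
            return action()
        except ConnectionError:
            if i == attempts - 1:
                raise                       # exhausted - escalate to a human
            time.sleep(base_delay * (2 ** i))   # 0.1s, 0.2s, 0.4s ...

calls = {"n": 0}
def sometimes_fails():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "recovered"

print(with_retries(sometimes_fails))   # prints "recovered" on the 3rd try
```

Production self-healing layers more on top of this – health probes, replacement of failed instances via IaC, circuit breakers to stop retry storms – but the design choice is the same: make the automatic path the default and the pager the last resort.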

Turning outages into opportunity

Cloud resilience is not a milestone, it is a culture of preparedness. As digital ecosystems evolve, organizations must think beyond backups and redundancies. Building availability means combining sound architecture with disciplined testing, visibility, and governance.

Every outage, regardless of the platform, offers an opportunity to revisit assumptions and improve design. The question is not if an outage will occur, but how ready your organization is when it does.

Partnering for cloud confidence

At T-Systems, we partner with organizations across industries to design and implement highly available, secure, sovereign, and future-ready cloud architectures. Our approach blends deep cloud engineering expertise with proven frameworks for high availability, observability, and automated recovery.

As a trusted partner to leading hyperscalers including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), we help enterprises harness the best of each ecosystem, navigating complexity, optimizing costs, and strengthening reliability.

From strategic design and architecture assessments to migration, optimization, and ongoing governance, T-Systems enables enterprises to build the foundation for uninterrupted digital operations and sustainable transformation.

Maintaining uptime is not about avoiding failure; it’s about being ready for it.

About the author

Richard Simon

CTO, Cloud Professional Services, T-Systems
