Why You Should Not Trust the 11 Nines of Durability Claims
TL;DR
Ignore the eleven (and sometimes sixteen) nines of durability claims and secure your data as if your business depends on it—because it does.
What Is The Claim?
Amazon's S3 FAQ states that the service delivers 99.999999999% durability—known as "11 nines"—meaning a single object is expected to be lost only once every 100 billion years, or one object every 10,000 years if you store 10 million objects. This is a tall claim, but it is not unique to Amazon: Google (documentation) and Microsoft Azure (documentation) echo similar figures, even extending to 16 nines for geo-replicated storage.
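To see what those figures actually assert, here is a back-of-the-envelope sketch in Python. It treats the quoted durability as an annual per-object survival probability (which is how AWS frames it) and reproduces the "one loss per 10,000 years" arithmetic from Amazon's own 10-million-object example:

```python
# Back-of-the-envelope arithmetic behind the "11 nines" figure.
# Assumption: the quoted durability is an annual per-object survival
# probability, so P(loss per object per year) = 1 - 0.99999999999.

durability = 0.99999999999                     # 11 nines
p_loss_per_object_per_year = 1 - durability    # ~1e-11

objects_stored = 10_000_000                    # the 10 million objects in Amazon's example
expected_losses_per_year = objects_stored * p_loss_per_object_per_year

print(f"P(loss) per object per year: {p_loss_per_object_per_year:.0e}")     # ~1e-11
print(f"Expected losses per year:    {expected_losses_per_year:.0e}")       # ~1e-04
print(f"Years per expected loss:     {1 / expected_losses_per_year:,.0f}")  # ~10,000
```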
Some experts repeat these numbers without fully grasping their theoretical basis, creating a dangerous disconnect between marketing claims and operational realities.
Why Is It A Problem?
When decision-makers cut investments based on these marketing buzzwords, they risk designing architectures that could lead to catastrophic business failures. Inexperienced architects, in particular, tend to trust the marketing without digging deeper, even though the detailed documentation tells a more cautious story.
Note: Durability refers to the long-term preservation of data bits, while availability measures the short-term system uptime. Durability is a theoretical construct based on mathematical modeling, not an empirically verifiable outcome.
What Cloud Vendors Actually Recommend
Despite the impressive marketing claims, technical documentation from the providers themselves suggests a more cautious approach:
- Google Cloud advises maintaining secondary or even air-gapped backups. (Google Cloud Blog)
- AWS recommends cross-account and cross-region backups. (AWS Backup Documentation)
- Microsoft Azure promotes Geo-Redundant Storage and custom recovery plans.
This raises an important question: Why are these extra precautions necessary if the system is claimed to be so reliable?
Why the Claim Cannot Be Proven
Amazon asserts that if you store 10 million objects, you should expect to lose one every 10,000 years—an 11-nines durability claim. This is not about availability, which can be measured over short time spans; it concerns how reliably data is preserved over the long term. Measuring that is hard because you would need to observe the system for thousands of years to validate the claim. Until then, it remains a theoretical construct.
It’s Just A Theoretical Construct
11-nines durability is not something anyone can empirically verify. Observing such durability over even a decade would require storing hundreds of billions of objects, tracking each one, and proving that none were ever silently corrupted, lost, or replaced. Even if this were possible, it assumes that future conditions will remain as ideal as the past. In reality, durability figures come from mathematical models based on assumptions about data replication, integrity checks, and hardware replacement speeds—assumptions that often do not hold true in real-world scenarios.
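To show where such numbers come from, here is a deliberately simplified model in Python. It is a sketch, not any provider's actual math: it assumes an object is lost only if every replica fails within one repair window, and the AFR, repair time, and replication factor are illustrative assumptions.

```python
import math

# A deliberately simplified durability model: an object is lost only if all
# of its replicas fail within the same repair window. Every number below is
# an illustrative assumption, not a published parameter of any cloud provider.

annual_failure_rate = 0.02      # assumed 2% AFR per drive
repair_window_days = 1.0        # assumed time to re-replicate after a failure
replicas = 3                    # assumed replication factor

# Probability that a single replica fails during one repair window.
p_fail_in_window = annual_failure_rate * (repair_window_days / 365.0)

# Probability that all replicas fail within the same window. This assumes
# independent failures; correlated events (firmware bugs, operator error,
# a datacenter fire) routinely violate that assumption in practice.
p_lost_per_window = p_fail_in_window ** replicas

windows_per_year = 365.0 / repair_window_days
p_lost_per_year = p_lost_per_window * windows_per_year

nines = -math.log10(p_lost_per_year)
print(f"Modeled annual loss probability: {p_lost_per_year:.2e}")   # ~6e-11
print(f"Roughly {nines:.1f} nines of durability, purely from the assumptions above")
```

Stretch the repair window to a week, or let failures correlate, and the nines drop sharply; the figure is a property of the model's inputs, not an observation of the system.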
Real-World Infrastructure Has Real Problems
Real-world storage systems face constant risks. Consider these issues:
- Hard Drives Fail:
Industry data shows HDDs have an annualized failure rate (AFR) of 0.5% to over 2% depending on model and age. Seagate lists an AFR of 0.73% for some models (Seagate). A 2007 Google study reported AFRs up to 8.6% by year three (Wikipedia). Backblaze reported a 0.89% lifetime AFR for SSDs (ExtremeTech).
- Bit Rot Happens:
Silent corruption from cosmic rays, electrical interference, or magnetic decay can occur over time. CERN estimates that a cosmic-ray-induced RAM error occurs monthly in consumer-grade hardware. Filesystems like ZFS use checksumming and scrubbing to detect it, but even these measures cannot eliminate all risks (a minimal sketch of this kind of integrity check follows this list).
- Software and Firmware Bugs:
For example, Linux’s ext4 filesystem had a bug in 2012 that caused metadata loss, and a 2009 Seagate firmware bug rendered thousands of drives inoperable until patched.
- Human Error:
Uptime Institute’s 2022 report indicates that human mistakes cause about 40% of data center outages. GitLab's 2017 incident, where production data was deleted due to backup mismanagement, is a well-known example.
These issues are not rare anomalies—they are persistent risks in real environments. Ignoring them in durability claims leads to a dangerous disconnect between marketing and operational reality.
The Cloud Isn’t Immune
Despite their scale, engineering excellence, and redundancy, cloud providers are not immune to failures:
- In April 2011, AWS experienced a major outage in US-East-1 due to a failure in the EBS control plane, which cascaded into EC2 and S3 disruption for major sites like Reddit and Quora (AWS Incident Report).
- In December 2021, a replication rule misconfiguration during an AWS internal update led to permanent deletion of S3 objects for several users (The Register).
- In March 2020, AWS S3 experienced disruptions in Northern Virginia due to capacity scaling delays, affecting multiple downstream services (AWS Post-Mortem).
Other major cloud vendors have faced similar failures:
- Google Cloud suffered a widespread multi-service outage in March 2022 due to a misconfigured quota enforcement system that incorrectly throttled backend traffic (Google Incident Summary).
- Microsoft Azure experienced a global outage in March 2021 due to a key rotation error in its identity service, impacting access to Azure Storage, Microsoft 365, and more (Azure RCA).
These incidents show that, no matter how advanced it is, cloud infrastructure can still fail through cascading bugs, misconfigurations, and other complexities.
Durability should be auditable, testable, and independently observable. If a claim cannot be measured or verified by external observation, it is not reliable enough to base critical systems on. Trusting such unprovable assurances is a gamble no responsible architect or decision-maker should take.
What Should Be the Course of Action?
No amount of marketing, modeling, or replication eliminates the need for sound, verifiable engineering practices. Durability claims may sound convincing, but real-world protection comes from disciplined data management. Here’s how to approach it:
- Start with Business Impact Analysis:
Not all data is equal—some data is irreplaceable, while other datasets can be reconstructed. Classify your data and tailor your protection efforts accordingly.
- Follow the 3-2-1 Backup Strategy:
This proven method—three copies, two media types, and one offsite backup—is still the most effective way to ensure data availability even in the face of disaster (a minimal sketch of the offsite step appears after this list).
- Introduce Physical and Logical Air Gaps:
Segment critical backups across isolated accounts, regions, or infrastructure. Offline or air-gapped backups protect against ransomware, human error, and cascading software bugs.
- Test Like It’s Production:
Backups are useless if you can't restore them. Conduct regular recovery drills, simulate failures, and validate backup integrity through checksum comparisons or automated verification systems (a minimal restore-drill sketch also follows this list).
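To make the offsite leg of 3-2-1 concrete, here is a minimal boto3 sketch, as referenced in the list above. The bucket names and region are hypothetical, and in practice a managed feature such as S3 replication rules or AWS Backup plans would usually handle this; the script only illustrates the principle of keeping a copy under a separate region or account:

```python
import boto3

# Minimal sketch of the "one copy somewhere else" part of 3-2-1: copy every
# object from a primary bucket to a bucket in another region (or account).
# Bucket names and the region are hypothetical; error handling is omitted,
# and objects larger than 5 GB would need a multipart copy instead.

SOURCE_BUCKET = "prod-backups"            # hypothetical primary bucket
DEST_BUCKET = "prod-backups-offsite"      # hypothetical offsite bucket
DEST_REGION = "eu-west-1"                 # hypothetical secondary region

def copy_offsite() -> int:
    src = boto3.client("s3")
    dst = boto3.client("s3", region_name=DEST_REGION)
    copied = 0
    for page in src.get_paginator("list_objects_v2").paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get("Contents", []):
            dst.copy_object(
                Bucket=DEST_BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            )
            copied += 1
    return copied

if __name__ == "__main__":
    print(f"Copied {copy_offsite()} objects to the offsite bucket")
```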
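And to make the restore drill concrete, a companion sketch: it pulls each object back from the (hypothetical) offsite bucket and checks its SHA-256 against a manifest of expected digests, the kind of automated verification the last item above calls for. A real drill would also measure restore time and sample large datasets rather than fetching everything:

```python
import hashlib
import json
import pathlib

import boto3

# Minimal restore drill: fetch each backed-up object and verify its SHA-256
# against a previously recorded manifest mapping object keys to digests.
# Bucket name and manifest path are hypothetical.

BACKUP_BUCKET = "prod-backups-offsite"    # hypothetical offsite bucket
MANIFEST = "backup-manifest.json"         # hypothetical {key: sha256} manifest

def restore_drill() -> list[str]:
    s3 = boto3.client("s3")
    expected = json.loads(pathlib.Path(MANIFEST).read_text())
    mismatches = []
    for key, digest in expected.items():
        body = s3.get_object(Bucket=BACKUP_BUCKET, Key=key)["Body"].read()
        if hashlib.sha256(body).hexdigest() != digest:
            mismatches.append(key)
    return mismatches

if __name__ == "__main__":
    bad = restore_drill()
    print("Restore drill:", "all objects verified" if not bad else f"{len(bad)} mismatches: {bad}")
```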
Redundancy Is Not Resilience
High availability features and replication won’t save your data from systemic bugs or operator error. Only operational discipline and measurable protections can.
Final Thoughts
I’m not saying S3 is unreliable. Far from it—I use it and trust it for many workloads. However, I do not believe in the 11-nines claim as a sole measure of durability. It remains a useful aspirational target, but without empirical verification, it is simply a mathematical model rather than a proven guarantee.
If something sounds too good to be true, it probably is.