Commercial Cloud Outages Are a Wake-Up Call



On Nov. 26, Amazon Web Services, the world’s largest cloud service provider, experienced a major outage in its US-EAST-1 region due to a “relatively small addition of capacity” to the Amazon Kinesis real-time data processing service. Just over two weeks later, Google Cloud Platform (GCP) suffered a major failure in its quota management system, severely reducing the capacity of its authentication system. The AWS outage caused services from major organizations such as Adobe, Autodesk, Fidelity, New York City’s Metropolitan Transportation Authority and the Washington Post to go down without warning. The GCP failure prevented users from logging in to their Gmail and Google Cloud applications, interrupting work at organizations that use Google for their core office utilities.

These failures should warn organizations to be vigilant about cloud computing’s pitfalls. More urgently, they should serve as a wake-up call for the U.S. government to reevaluate how it manages the risk from cloud services and to consider cloud interoperability as a way to bolster national and economic security. Organizations are increasingly adopting cloud computing for its convenience and the revolutionary technologies it unlocks. Critical infrastructure operators, from energy to health care, are making the cloud a key next step in their development. And because cloud computing is an industry with extremely high barriers to entry, operators are likely to default to AWS, Microsoft Azure, Google Cloud, IBM or Oracle, making any disruption in those services a potential threat to national and economic security.

The recent AWS and GCP outages follow a string of major cloud computing failures at the three largest providers. In April 2011, AWS’ Elastic Block Store, a widely used storage service, went down during a similarly routine capacity upgrade, leading to cascading disruptions in the US-EAST-1 region. In February 2013, Microsoft’s Azure cloud service experienced a global outage after certificates securing customer data expired. In August 2015, Google Cloud’s data center in Belgium suffered minor data losses after lightning strikes on the power grid knocked out its primary power supply. And in February 2017, AWS’ Simple Storage Service (S3), which hosts entire websites and applications, experienced a four-hour disruption in the US-EAST-1 region while engineers were debugging slower-than-usual performance.

These three internet giants’ positions as industry leaders do not exempt them from failure; if anything, they are pioneers in the current era of complex distributed computing systems, which makes failures due to internal mistakes or acts of God to some degree unavoidable. Despite cloud vendors’ promises of cost savings and the commercial pressure to adopt the cloud, these recent failures should remind organizations to assess whether they can tolerate these “growing pains,” especially with respect to their critical functions.

The economic impact of these failures has grown over time. Earlier incidents, such as the 2013 Azure outage, caused relatively minor disruptions, with the most apparent problems occurring in the Xbox Music and Video platforms. A more serious impact occurred in 2016, when a power disruption at a Verizon data center containing JetBlue Airways databases caused JetBlue flights to be grounded for hours. The 2017 S3 outage had consequences across different industries, affecting popular services like GitHub, Quora, Expedia and many mobile apps. And in this latest AWS outage, services from critical sectors such as finance (Fidelity, Coinbase) and transportation (NYC MTA) were meaningfully impacted.

These realities raise important issues for regulators. Steps such as requiring organizations to disperse their computing infrastructure across several cloud providers and requiring providers to increase interoperability aren’t just prudent, they’re necessary. Interoperability could take the form of common architectures or a software middleware layer that helps organizations move their workloads between cloud vendors easily, greatly reducing the probability that all of an organization’s systems fail at once. Instituting some of these changes would keep problems “localized” and prevent industry-wide failures due to overreliance on any one provider.

Cloud vendors should also be evaluated against more frequently assessed and risk-sensitive regulatory models. Compliance with an annual audit does not measure the current state of cloud providers or their true ability to respond in a crisis. As cloud computing’s versatility gives cloud providers the potential to become the easily identifiable “central nodes” of America’s economy, these policies are needed to ensure resilience. In all likelihood, the oligopoly in place in the United States and some other major markets, in which organizations rely on Google, Amazon and Microsoft before “everyone else,” is here for the foreseeable future. The economics of cloud computing depend on achieving the economies of scale needed to construct massive data centers and globe-spanning infrastructure, so the government has a valuable and urgent role to play.

Cloud computing has become more than a specialized tool; it is quickly permeating all corners of society and will soon be core to its functioning. We must accept that cloud computing is possible only because companies like Amazon, Microsoft and Google have the resources to invest in the amount of infrastructure required. But like all else, they are not impervious to failure. The key to mitigating this risk is pursuing better inter-cloud interoperability early and implementing more frequent assessments of cloud providers’ ability to adapt to failure and secure their infrastructure. This will undoubtedly entail numerous disagreements and bring complex engineering issues into the public policy sphere, but there could hardly be a better time to act than now, before the next crisis is too urgent or catastrophic to easily overcome.

Tianjiu Zuo is a research assistant at the Atlantic Council’s Cyber Statecraft Initiative. He is an undergraduate at Duke University and researches various technology issues at Duke’s Sanford School of Public Policy.

Editor's note: This piece was written prior to the Microsoft Azure outage on March 15, 2021.