There were six hours during the night of April 10, 2014, when the entire population of Washington State had no 911 service. People who called for help got a busy signal. One Seattle woman dialed 911 at least 37 times while a stranger was trying to break into her house. When he finally crawled into her living room through a window, she picked up a kitchen knife. The man fled.
The 911 outage, at the time the largest ever reported, was traced to software running on a server in Englewood, Colorado. Operated by a systems provider named Intrado, the server kept a running counter of how many calls it had routed to 911 dispatchers around the country. Intrado programmers had set a threshold for how high the counter could go. They picked a number in the millions.
Shortly before midnight on April 10, the counter exceeded that number, resulting in chaos. Because the counter was used to generating a unique identifier for each call, new calls were rejected. And because the programmers hadn’t anticipated the problem, they hadn’t created alarms to call attention to it. Nobody knew what was happening. Dispatch centers in Washington, California, Florida, the Carolinas, and Minnesota, serving 11 million Americans, struggled to make sense of reports that callers were getting busy signals. It took until morning to realize that Intrado’s software in Englewood was responsible, and that the fix was to change a single number.
Not long ago, emergency calls were handled locally. Outages were small and easily diagnosed and fixed. The rise of cellphones and the promise of new capabilities—what if you could text 911? or send videos to the dispatcher?—drove the development of a more complex system that relied on the internet. For the first time, there could be such a thing as a national 911 outage. There have now been four in as many years.
It’s been said that software is “eating the world.” More and more, critical systems that were once controlled mechanically, or by people, are coming to depend on code. This was perhaps never clearer than in the summer of 2015, when on a single day, United Airlines grounded its fleet because of a problem with its departure-management system; trading was suspended on the New York Stock Exchange after an upgrade; the front page of The Wall Street Journal’s website crashed; and Seattle’s 911 system went down again, this time because a different router failed. The simultaneous failure of so many software systems smelled at first of a coordinated cyberattack. Almost more frightening was the realization, late in the day, that it was just a coincidence.
“When we had electromechanical systems, we used to be able to test them exhaustively,” says Nancy Leveson, a professor of aeronautics and astronautics at the Massachusetts Institute of Technology who has been studying software safety for 35 years. She became known for her report on the Therac-25, a radiation-therapy machine that killed six patients because of a software error. “We used to be able to think through all the things it could do, all the states it could get into.” The electromechanical interlockings that controlled train movements at railroad crossings, for instance, only had so many configurations; a few sheets of paper could describe the whole system, and you could run physical trains against each configuration to see how it would behave. Once you’d built and tested it, you knew exactly what you were dealing with.
Software is different. Just by editing the text in a file somewhere, the same hunk of silicon can become an autopilot or an inventory-control system. This flexibility is software’s miracle, and its curse. Because it can be changed cheaply, software is constantly changed; and because it’s unmoored from anything physical—a program that is a thousand times more complex than another takes up the same actual space—it tends to grow without bound. “The problem,” Leveson wrote in a book, “is that we are attempting to build systems that are beyond our ability to intellectually manage.”
Our standard framework for thinking about engineering failures—reflected, for instance, in regulations for medical devices—was developed shortly after World War II, before the advent of software, for electromechanical systems. The idea was that you make something reliable by making its parts reliable (say, you build your engine to withstand 40,000 takeoff-and-landing cycles) and by planning for the breakdown of those parts (you have two engines). But software doesn’t break. Intrado’s faulty threshold is not like the faulty rivet that leads to the crash of an airliner. The software did exactly what it was told to do. In fact it did it perfectly. The reason it failed is that it was told to do the wrong thing. Software failures are failures of understanding, and of imagination. Intrado actually had a backup router, which, had it been switched to automatically, would have restored 911 service almost immediately. But, as described in a report to the FCC, “the situation occurred at a point in the application logic that was not designed to perform any automated corrective actions.”
This is the trouble with making things out of code, as opposed to something physical. “The complexity,” as Leveson puts it, “is invisible to the eye.”
Click here to read the rest of the story on The Atlantic.