Some of the solution was technical, but many of the big changes were made in the culture of the work environment.
For weeks after its launch, HealthCare.gov could barely be accessed, let alone used to enroll Americans in health insurance exchanges. Its failure and rapid rescue, as detailed by Steven Brill in Time magazine, show what a difference it makes when engineers are managed well.
On Oct. 17, a new engineering team (made up of external volunteers, temporary subcontractors, and engineers from contractor QSSI) started to form, and the turnaround was remarkable. Mike Abbott, a Kleiner Perkins partner who was previously key in making Twitter reliable, was a volunteer present for the first critical week and participated in two or more conference calls a day through December. It made a huge difference to have people on board who had spent their careers dealing with problems of this scale.
Some of the site’s six-week turnaround was due to basic technology fixes, like adding a cache to speed up information requests. But many of the big changes were made in the culture of the work environment.
The team had a few rules posted on a wall outside their central operations center that guided their daily stand-up meetings and approach, according to Mikey Dickerson, the Google engineer who led the rescue squad:
Rule 1: “The war room and the meetings are for solving problems. There are plenty of other venues where people devote their creative energies to shifting blame.”
Rule 2: “The ones who should be doing the talking are the people who know the most about an issue, not the ones with the highest rank. If anyone finds themselves sitting passively while managers and executives talk over them with less accurate information, we have gone off the rails, and I would like to know about it.”
Rule 3: “We need to stay focused on the most urgent issues, like things that will hurt us in the next 24-48 hours.”
Six weeks after the turnaround was initiated, the site was up 95% of the time, compared with just 43% in early November. It was eventually able to handle a massive traffic spike leading up to an important December deadline.
Many of the engineers who helped fix the site were private contractors who had helped create the disaster. All they needed was better management.
The problem, according to Brill, wasn’t with individual engineers, but rather with the fragmented contracting companies that managed them. Silicon Valley gets a lot of flak for its occasionally inflated self-regard, but its people definitely know how to build big websites.