San Jose, CA. In a keynote address at the EE Live event held here this week, Michael Barr, CTO and cofounder of the Barr Group, put the focus on “killer apps,” citing examples of software problems that over the past three decades have taken the lives of 30 people and left more than 100 injured. He addressed the past, present, and future of software safety issues, covering applications ranging from missile defense to radiation therapy.
He began with commentary on the failure of a Patriot Missile Defense System in the Gulf War of 1991, a failure to intercept a Scud missile that left 28 U.S. soldiers dead and 98 wounded. The problem, Barr said, stemmed from the system's inability to represent one-tenth of a second exactly in its 24-bit binary format. The tiny representation error accumulated from the moment of each reboot, so that after one hour of operation the system would misjudge the location of an incoming missile by seven meters.
Although the problem had already been identified, and although, in a tragic irony, a fix was scheduled to arrive the next day, officials ducked responsibility, saying that the Scud had appeared to break apart in flight and that the anomaly had never appeared in thousands of hours of test. Barr asked, rhetorically, “What did those thousands of hours of test look like?” They probably involved frequent reboots that zeroed out the accumulating error, and they involved slower-moving targets.
Barr next turned his attention to the Therac-25 radiation therapy machine, a “radiation by wire” system, he said, that in several cases in the 1980s delivered hundreds of times the prescribed radiation dose, due in part to a race condition involving shared global variables. After one incident, the manufacturer asserted that the system could not have been at fault because no similar incidents of patient harm had been reported, Barr said. The manufacturer also claimed the system had undergone 2,700 hours of testing, Barr said, but under questioning clarified that customers had used the systems for 2,700 hours without complaint; in effect, the customers were the testers. Further, Barr said, the manufacturer had omitted software from parts of its fault tree on the assumption that “software doesn't degrade.”
The missile defense and radiation therapy examples represent the past, he said. The present might be represented by a bug he believes he has identified in the pedometer watch he wears, although that bug has no safety implications. Nevertheless, he said, today's devices remain subject to low-probability adverse events:
- random events in the electronics, perhaps due to EMI,
- latent bugs in software, or
- unforeseen gaps in “fail-safe” implementations.
With respect to this last point, he said fail-safes are the net under the high-wire circus act, but there could be holes in the net.
Regardless of cause, Barr said, “testing cannot prove the absence of bugs,” and systems must be designed to be safe despite software bugs.
Barr then addressed unintended acceleration problems with Toyota vehicles. Unfortunately, he said, he could not comment in detail because of confidentiality agreements he signed while preparing to testify as an expert witness in a related lawsuit. He did provide a review of publicly available information on two tragic cases, one involving the Saylor family and the other involving Jean Bookout and Barbara Schwarz.
He also recalled many unpleasant hours spent inside a noisy, poorly ventilated “clean room” in which he examined source code in preparation for expert testimony in the Bookout case. That work resulted in a 750-page report to which even he and the judge in the case no longer have access. In what he could reveal, he said that braking effectiveness was diminished when the throttle was open, whether due to software errors or inappropriate floor mats. Whatever the cause of specific accidents, he said, he and his colleagues did identify software bit-flip problems that could result in unintended acceleration, as reported by Junko Yoshida last year in EE Times.
He then looked to the future and a brave new world of autonomous vehicles: “Google's code driving Toyota's code—who feels safe?”
Whatever the application area, he asked, “How can we make our software safe?” He answered, “Unfortunately, there are no quick fixes. But certainly, the answer isn't, 'It can't be the software!'”
He said that safety will rely on three factors: culture, process, and architecture. “I can't tell you about [the details of] unintended acceleration,” he said, “so you can't learn for yourself.” He concluded, “Sunshine is needed, with informed oversight and less code confidentiality.”