The Software Crisis
Aim: To explain the background of the software crisis and the need for an engineering approach.
When projects became too big and complicated to easily maintain, the “software crisis” was born, with programmers saying, “We can’t get projects done, and if we can, they’re too expensive!”
Source: Bruce Eckel, Thinking in C++
[The major cause of the software crisis is] that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.
– Edsger Dijkstra, The Humble Programmer
The causes of the software crisis were linked to the overall complexity of the software process and the relative immaturity of software engineering as a profession. The crisis manifested itself in several ways:
- Projects running over-budget.
- Projects running over-time.
- Software was very inefficient.
- Software was of low quality.
- Software often did not meet requirements.
- Projects were unmanageable and code difficult to maintain.
- Some software was never delivered.
The most visible symptoms of the software crisis are:
- Late delivery, over budget
- Product does not meet specified requirements
- Inadequate documentation
Some observations on the software crisis:
- “A malady that has carried on this long must be called normal” (Booch, p. 8)
- Software system requirements are moving targets
- There may not be enough good developers around to create all the new software that users need
- A significant portion of developers’ time must often be dedicated to the maintenance or preservation of geriatric software
1965 to 1985: The software crisis
Software engineering was spurred by the so-called software crisis of the 1960s, 1970s, and 1980s, which exposed many of the problems of software development. Many software projects ran over budget and behind schedule. Some projects caused property damage. A few projects caused loss of life. The software crisis was originally defined in terms of productivity, but evolved to emphasize quality. Some used the term software crisis to refer to their inability to hire enough qualified programmers.
- Cost and Budget Overruns: The OS/360 operating system was a classic example. This decade-long project from the 1960s eventually produced one of the most complex software systems of its time. OS/360 was one of the first large (1000 programmers) software projects. Fred Brooks claims in The Mythical Man-Month that he made a multi-million dollar mistake by not developing a coherent architecture before starting development.
- Property Damage: Software defects can cause property damage. Poor software security allows hackers to steal identities, costing time, money, and reputations.
- Life and Death: Software defects can kill. Some embedded systems used in radiotherapy machines failed so catastrophically that they administered lethal doses of radiation to patients. The most famous of these failures is the Therac-25 incident.
Therac-25 incident:
Researchers who investigated the accidents found several contributing causes. These included the following institutional causes:
- AECL (Atomic Energy of Canada Limited, the manufacturer) did not have the software code independently reviewed.
- AECL did not consider the design of the software during its assessment of how the machine might produce the desired results and what failure modes existed. These form parts of the general techniques known as reliability modeling and risk management.
- The system noticed that something was wrong and halted the X-ray beam, but merely displayed the word “MALFUNCTION” followed by a number from 1 to 64. The user manual did not explain or even address the error codes, so the operator pressed the P key to override the warning and proceed anyway.
- AECL personnel, as well as machine operators, initially did not believe complaints. This was likely due to overconfidence.
- AECL had never tested the Therac-25 with the combination of software and hardware until it was assembled at the hospital.
The researchers also found several engineering issues:
- The failure only occurred when a particular nonstandard sequence of keystrokes was entered on the VT-100 terminal which controlled the PDP-11 computer: an “X” to (erroneously) select 25 MV photon mode, followed by “cursor up”, “E” to (correctly) select 25 MeV electron mode, then “Enter”. This sequence of keystrokes was improbable, and so the problem did not occur very often and went unnoticed for a long time.
- The design did not have any hardware interlocks to prevent the electron beam from operating in its high-energy mode without the target in place.
- The engineer had reused software from older models. These models had hardware interlocks that masked their software defects. Those hardware safeties had no way of reporting that they had been triggered, so there was no indication of the existence of faulty software commands.
- The hardware provided no way for the software to verify that sensors were working correctly (see open-loop controller). The table-position system was the first implicated in Therac-25’s failures; the manufacturer revised it with redundant switches to cross-check their operation.
- The equipment control task did not properly synchronize with the operator interface task, so that race conditions occurred if the operator changed the setup too quickly. This was missed during testing, since it took some practice before operators were able to work quickly enough for the problem to occur.
- The software set a flag variable by incrementing it. Occasionally an arithmetic overflow occurred, causing the software to bypass safety checks.
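The last item, a flag set by incrementing, can be illustrated with a minimal sketch. It is not the Therac-25 code: the function name, the one-byte flag width, and the number of passes are assumptions chosen for illustration. A nonzero flag means "run the safety check"; when the flag is incremented rather than assigned a fixed value, it periodically wraps around to zero and the check is skipped.

```python
def passes_where_check_skipped(assign_fixed_value: bool, passes: int = 1000) -> list[int]:
    """Return the pass numbers on which the safety check would be skipped.

    Hypothetical model: `flag` emulates a one-byte variable (an assumption),
    and a nonzero flag means "perform the safety check on this pass".
    """
    flag = 0
    skipped = []
    for n in range(1, passes + 1):
        if assign_fixed_value:
            flag = 1                  # safer style: assign a fixed nonzero value
        else:
            flag = (flag + 1) & 0xFF  # faulty style: increment, wrapping at 256
        if flag == 0:
            skipped.append(n)         # flag reads as "no check needed"
    return skipped


print(passes_where_check_skipped(assign_fixed_value=False))  # [256, 512, 768]
print(passes_where_check_skipped(assign_fixed_value=True))   # []
```

In this model the check is bypassed on every 256th pass, which matches the "occasionally" in the description: the defect is rare enough to survive casual testing, yet certain to occur eventually.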
Peter G. Neumann has kept a contemporary list of software problems and disasters. The software crisis has been slowly fizzling out, because it is unrealistic to remain in crisis mode for more than 20 years. Software engineers are accepting that the problems of software engineering are truly difficult and that only hard work over many decades can solve them.