Between 1985 and 1987, six patients received massive overdoses of radiation from a computer-controlled radiation therapy machine called the Therac-25; at least three of them died. The cause was a defective software program that allowed the machine to deliver far more radiation than was prescribed.
In 2012, Knight Capital Group, a US financial services company, lost US$440 million in roughly 45 minutes of trading due to another computer system defect.
On October 29, 2018, Lion Air Flight 610 crashed. 189 people died. A few months later, on March 10, 2019, Ethiopian Airlines Flight 302 crashed. 157 people died. The aircraft in both incidents was a Boeing 737 Max. A few days after the second crash, the Federal Aviation Administration (FAA) of the United States grounded the Boeing 737 Max after recurring flight control issues were found.
After several months of investigation, US investigators concluded that both crashes were caused by a design flaw in the Maneuvering Characteristics Augmentation System (MCAS) that caused the system to activate erroneously. MCAS is a combination of hardware sensors and software that activates automatically, without pilot input, with the intention of correcting the airplane when it appears to be approaching a stall. This automated flight control software contributed to the deaths of 346 people. Boeing would later pay more than $2.5 billion to settle a criminal charge related to these crashes.
All three incidents above were the result of inadequate defect prevention measures, measures that should have enabled the builders to detect the defects before or while the software systems were being developed.
In the software development industry, incidents like these are among the worst that can happen, but there are less publicized incidents that are far more common: defects that cause organizations to lose money through delayed projects, cost overruns, contract cancellations, product recalls, and expensive fixes for systems that fail in the hands of the customer. It is a common risk faced by the software development industry, and one that is often ignored until the project delays become too apparent to ignore.
It’s a common risk indeed, but it is rarely mitigated properly. Our research shows that the only risk mitigation most enterprises perform is testing, usually at the later stages of the software development process, when there is finally a tangible product to test. By then it is too late. Any defects found at that point are more expensive to fix, and any ‘fix’ introduced late in the construction stage may spawn new defects, further compounding the problem. Compounding defects then translate into more schedule slippage and more manpower needs, which directly means more money burned.
The solution is to start thinking about how to address quality during the conceptualization stage, long before the first line of code is even written. We need to answer questions such as:
- What will it cost us if this product fails at the hands of our customers?
- If this product fails, will it result in loss of lives or money?
- How likely is it to fail?
- How long is the intended life of the product?
- How much money should we put aside to fix problems in production?
- How many people should we hire to respond to problems in production?
- What is the estimated total cost of ownership of the product for the customer?
These questions, if sufficiently thought through prior to construction, force us to tackle the hard issues before investing hard cash in an idea. By quantifying the risks we might incur in production, we can quantify how much we are willing to spend on mitigation measures during construction.
If the risks are high, then we should increase the resources allocated to quality assurance. We can extend the schedule to accommodate it. We can add more people to help. Sure, it is an added cost, but the question becomes: are we willing to invest in quality now, or do we risk the prospect of lost lives, penalties, lost trust, and possibly business failure? The effort and resources we invest to ensure quality is the Cost of Software Quality (CoSQ).
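To make the trade-off concrete, here is a minimal sketch in Python. The figures and the simple expected-loss calculation are hypothetical assumptions used only to illustrate the comparison, not a formal CoSQ model.

```python
# A minimal sketch: compare the expected cost of a production failure against a
# proposed quality-assurance budget. All figures are hypothetical.

def expected_failure_cost(probability_of_failure: float, cost_if_it_fails: float) -> float:
    """Expected loss = likelihood of failure x cost incurred when it fails."""
    return probability_of_failure * cost_if_it_fails

cost_if_it_fails = 2_000_000      # penalties, recalls, lost trust (hypothetical, in dollars)
probability_of_failure = 0.15     # estimated likelihood over the product's intended life
proposed_qa_budget = 120_000      # extra reviews, testing, automation, people

exposure = expected_failure_cost(probability_of_failure, cost_if_it_fails)
print(f"Expected failure cost: ${exposure:,.0f}")
print(f"Proposed CoSQ budget:  ${proposed_qa_budget:,.0f}")
print("Quality investment is justified" if proposed_qa_budget < exposure else "Re-examine the budget")
```

If the expected loss dwarfs the proposed budget, the money spent on prevention is an investment rather than an expense.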
Fundamental Activities in Software Quality Assurance Planning
- Come up with a Software Quality Management Plan (SQMP). It defines the activities and tasks that need to be done to ensure quality, and it clearly states who will do what by when. Examples of activities include: have the product owner define prioritization criteria for features, perform design reviews with the audit department, perform security design reviews, perform code reviews, require all code to be covered by unit tests (a minimal unit-test sketch follows this list), implement Test-driven development, and implement Behaviour-driven development. We may add to this list depending on what the project needs in order to mitigate its risks. The important thing is that we calibrate what needs to be done based on the risks identified. A marketing website does not need the same set of SQA activities as a medical device.
- Define quantifiable standards of measurement: quantifiable targets that the software must meet to be considered good quality, set in accordance with the business objectives. For example, declaring that the software should be reliable is vague; a more specific measure is expressed in numbers, such as: “the system should be available to an estimated 10,000 concurrent users, with no more than one hour of downtime per month.” All required standards must be documented so the entire team is aware of them. When executing these checks during the project, it is much more cost-effective to automate them, since automated testing can cover more scenarios than manual testing; a sketch of such an automated check also follows this list. Again, what gets automated should be calibrated to the risks faced.
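As an illustration of the “require all code to be covered by unit tests” activity, here is a minimal pytest-style sketch; the `calculate_dose` function, its safety limit, and the test values are hypothetical, chosen only to show how a safety rule can be pinned down by a test.

```python
# Hypothetical dosage function with a hard safety limit, plus unit tests that
# pin down the rule. Names, limits, and values are illustrative assumptions.

MAX_SAFE_DOSE = 200.0  # hypothetical upper limit, in arbitrary units

def calculate_dose(prescribed: float, correction_factor: float) -> float:
    """Return the dose to administer, never exceeding the safety limit."""
    return min(prescribed * correction_factor, MAX_SAFE_DOSE)

def test_dose_never_exceeds_safety_limit():
    # Even with an erroneous correction factor, the dose must stay capped.
    assert calculate_dose(prescribed=180.0, correction_factor=5.0) <= MAX_SAFE_DOSE

def test_normal_dose_is_unchanged():
    assert calculate_dose(prescribed=100.0, correction_factor=1.0) == 100.0
```

In a test-driven workflow, tests like these would be written before the function itself, so the safety rule exists as an executable requirement from day one.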
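Similarly, here is a minimal sketch of how the downtime target from the example above could be turned into an automated check; the outage records are hypothetical monitoring data.

```python
# A minimal sketch: turn the documented target ("no more than one hour of
# downtime per month") into an automated check over hypothetical outage records.

DOWNTIME_BUDGET_MINUTES = 60  # the documented target: at most one hour per month

outages_this_month = [12, 5, 18]  # hypothetical outage durations, in minutes

def test_monthly_downtime_within_budget():
    total_downtime = sum(outages_this_month)
    assert total_downtime <= DOWNTIME_BUDGET_MINUTES, (
        f"{total_downtime} minutes of downtime exceeds the {DOWNTIME_BUDGET_MINUTES}-minute target"
    )
```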
The Software Quality Management Plan (SQMP) can be a simple document as long as it defines quantifiable targets. It should not be an abstract document with vague aspirations; every target should be expressed in numbers. If we cannot express a target in numbers, then it does not belong in the SQMP.
Remember, we cannot improve what we cannot measure.