Failures, errors, sustainability

Chapter 3

Errors and failures in software are common. Usually a failure causes inconvenience but no serious long-term damage. However, in some systems a failure can have very serious consequences, such as large financial losses or even harm to people’s health. A system of this type is called a critical system. There are three main types of critical systems:

  1. Safety-critical systems. A failure in such a system may result in injury, death or environmental damage. An example is a space shuttle with astronauts on board: if something goes wrong with the navigation system, people may die.
  2. Mission-critical systems. A failure in such a system may cause some goal-directed activity to fail, so that the main objective of the system is not reached. The same space shuttle is an example of a mission-critical system: even leaving the astronauts aside, the whole mission might fail.
  3. Business-critical systems. A failure may cause financial loss for the customers using the system. A bank’s money management system is an example.

The most important emergent property of a critical system is dependability. Systems that are unreliable and unsafe are often rejected by users. The possible cost of a failure may be so high that users refuse to use the system. A system that can easily lose information is also unsafe to rely on, because data is often the most valuable asset of an organization.

To sum up: given how serious failures of critical systems can be, only trusted methods and techniques should be used to develop them.

Dependability of a system is the main component in “calculating” trustworthiness. Trustworthiness is the degree of user confidence that the system will operate exactly as it is supposed to. Of course, “calculating” is not the right word, because such a value cannot be expressed numerically; instead, abstract terms such as “not dependable” or “very dependable” are used.

Trustworthiness and usefulness are not the same and are not even directly related. A program may be very useful and easy to work with in many areas, yet crash every time the user presses more than three buttons at once. Conversely, a system may be rock solid, but all it does is print random numbers. The four principal dimensions of system dependability are availability, reliability, safety and security.

Each of these may be decomposed further. For example, security includes integrity (ensuring that data is not damaged) and confidentiality, and reliability includes correctness, precision and timeliness. All of these properties are interrelated.

Some other system properties may be considered under the heading of dependability:

  1. Repairability. How easily and quickly the system can be restored after a failure. Unfortunately, most of today’s software systems use many third-party components, so a system’s repairability does not depend on the system alone.
  2. Maintainability. How easily and quickly the system can be changed to accommodate new requirements and rules.
  3. Survivability. The ability of a system to continue working and delivering services after a failure or attack.
  4. Error tolerance. The ability of a system to avoid and tolerate user input errors.

System availability and reliability are closely related. Both can be expressed as numerical probabilities: availability is the probability that the system will be up and running to deliver services, and reliability is the probability that the system will work correctly. More precise definitions are:

  • Reliability – the probability of failure-free operation over a specified time in a given environment for a specific purpose.
  • Availability – the probability that a system, at a point in time, will be operational and able to deliver the requested services.
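
Since both properties are probabilities, they can be estimated from observations. The following is only a minimal sketch: the function names and the simple ratio estimators are assumptions made for illustration, not something the chapter prescribes, and the resulting figures are valid only for the environment in which the observations were made.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Estimate availability as the fraction of observed time the system was up."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("observation period must be positive")
    return uptime_hours / total


def reliability(failure_free_runs: int, total_runs: int) -> float:
    """Estimate reliability as the fraction of runs (each of the specified
    duration, in the given environment) that completed without failure."""
    if total_runs <= 0:
        raise ValueError("at least one run is required")
    return failure_free_runs / total_runs
```

For example, a system that was down for 1 hour out of 1000 observed hours has an estimated availability of 0.999, while its reliability depends on how many of its runs finished without any failure at all.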

By definition, the environment of a system is important and has to be taken into account. Measuring a system in one environment does not mean it will behave the same way in a different environment. Three complementary approaches are used to improve the reliability of a system:

  1. Fault avoidance. Development techniques used to minimise the possibility of mistakes, or to catch them before they result in system faults.
  2. Fault detection and removal. Identifying and fixing faults before the system is put into use.
  3. Fault tolerance. Techniques used to ensure that system errors do not result in failure.
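
Fault tolerance is the easiest of the three to show in code. The sketch below assumes one possible technique, falling back to redundant providers; the name `call_with_fallback` and the idea of passing a list of callables are hypothetical and chosen only for illustration.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result.

    An error in one provider is tolerated as long as a later one succeeds,
    so the error does not become a failure of the whole service.
    """
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return provider()
        except Exception as error:  # tolerate the error and try the next one
            last_error = error
    raise RuntimeError("all providers failed") from last_error
```

A caller would pass the primary component first and the backups after it. The service fails only when every provider has failed.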

Safety-critical systems are systems where it is essential that system operation is always safe. The system should never harm people or its environment, even in case of failure. There are two classes of safety-critical software:

  1. Primary safety-critical software. This is software that is used as a controller in a system, so its malfunction can directly cause harm.
  2. Secondary safety-critical software. This is software that can indirectly result in injury.

There is no 100% safe and reliable system, so various methods are used to assure safety. What we can do is try to ensure that accidents do not occur, or that the consequences of an accident are minimal. Three complementary ways of doing that are:

  1. Hazard avoidance. The system is designed so that hazards are avoided. For example, a cutting machine can be run only by pressing two buttons at the same time, so both of the operator’s hands are busy (though he still has plenty of other stuff that can be cut).
  2. Hazard detection and removal. The system should have components responsible for detecting possible hazards and removing them before they cause an accident, for example a speed limiter in a car.
  3. Damage limitation. If an accident does happen, the system should keep the resulting damage to a minimum. Airbags in cars are an example.
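
The first two ideas can be sketched in a few lines. This is an illustration only: the function names, parameters and units are hypothetical, and a real machine or vehicle controller would need far more than this (hardware interlocks, timing checks, certification).

```python
def may_start_cutting(left_button_pressed: bool, right_button_pressed: bool) -> bool:
    """Hazard avoidance: the blade may run only while both buttons are held,
    which keeps both of the operator's hands away from it."""
    return left_button_pressed and right_button_pressed


def limit_speed(requested_kmh: float, max_kmh: float) -> float:
    """Hazard detection and removal: a request above the limit is detected
    and replaced by the maximum allowed speed."""
    return min(requested_kmh, max_kmh)
```

With these definitions, `may_start_cutting(True, False)` refuses to start the machine, and `limit_speed(180.0, 120.0)` returns the capped value rather than the requested one.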

Security is a system attribute that reflects the ability of the system to protect itself from external attacks, which may be accidental or deliberate. Nowadays security is a very serious issue, especially for Internet-connected or other networked systems. Mistakes in the security design of a system can lead to failures when the system is attacked. Three types of damage that may be caused by an external attack are:

  1. Denial of service. The system may be forced into a state in which it no longer delivers services and, from the user’s point of view, does not work at all.
  2. Corruption of programs or data. An attack may corrupt the system itself, as well as the data it operates on and has access to.
  3. Disclosure of confidential information. Information that the system operates on and has access to may be exposed to unauthorised people.
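
Corruption of data can at least be detected by comparing a stored digest with a freshly computed one. The sketch below uses SHA-256 from Python’s standard library; the function names are hypothetical. Note the assumption it rests on: a plain hash catches accidental corruption, but it helps against a deliberate attacker only if the digest is kept somewhere the attacker cannot modify, or if a keyed MAC is used instead.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, expected_digest: str) -> bool:
    """Check that the data still matches the digest recorded earlier."""
    # compare_digest compares in constant time to avoid timing leaks
    return hmac.compare_digest(digest(data), expected_digest)
```

The digest is recorded when the data is known to be good, and `is_unchanged` is called before the data is used again.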

My thoughts:

We have all got used to software failures; it has become normal that programs are unstable and you can never be sure of them. I think this is a bad thing, really bad. That is why, while working on this chapter, I was thinking of some “perfect future” in which all software-based systems somehow manage to work perfectly, but I don’t actually think this will ever happen. As you said in one lecture about mission-critical systems, the same software bug that occurred 20 years ago came up again and caused serious damage. Even though computers are machines, they make mistakes, because they were created by humans, and humans do make mistakes.

 
software_engineering/failures_errors_sustainability.txt · Last modified: 2009/09/17 03:12 by freetonik
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported