Many computer-based systems must recover from faults and resume processing within a pre-specified time. In some cases a system must be fault-tolerant, meaning that processing faults must not bring overall system function to a halt. In other cases a system failure must be corrected within a specified period of time or severe damage will occur (the damage can be economic, health-related, physical, and so on).
To test a system’s capability to handle faults, recovery testing is performed. This testing approach forces the software to fail and verifies that recovery is performed properly. It is essential for any mission-critical system (for example, FDA Class III products, defense systems, etc.), whose very nature imposes a strict protocol for how the system should behave in case of a failure. Other examples include large financial systems, banking, logistics and many more.
Recovery testing is executed to estimate how long recovery takes and how effectively the application returns to normal operation after a disturbance. The disturbances taken into account vary from system to system, according to the system’s nature and the analysis of the potential disturbances that might affect it; each product and industry faces entirely different recovery-related challenges that shape this analysis.
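To make the measurement concrete, a recovery test typically forces a fault, restores normal conditions, and then times how long the application under test (AUT) takes to report healthy again. The following is a minimal sketch of that pattern in Python; the health-endpoint URL, the 30-second budget and the inject_failure / clear_failure hooks are hypothetical placeholders to be replaced with whatever mechanisms the real environment provides:

```python
import time
import requests

HEALTH_URL = "http://localhost:8080/health"   # hypothetical health endpoint of the AUT
RECOVERY_BUDGET_SECONDS = 30                   # assumed pre-specified recovery time


def wait_until_healthy(timeout: float) -> float:
    """Poll the AUT's health endpoint and return how long recovery took."""
    start = time.monotonic()
    deadline = start + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(HEALTH_URL, timeout=2).status_code == 200:
                return time.monotonic() - start
        except requests.RequestException:
            pass                               # still down, keep polling
        time.sleep(1)
    raise AssertionError(f"AUT did not recover within {timeout} seconds")


def test_recovery_from_forced_failure(inject_failure, clear_failure):
    """inject_failure / clear_failure are environment-specific hooks
    (e.g. stop a dependency service, disconnect a network segment)."""
    inject_failure()                           # force the fault
    time.sleep(5)                              # let the failure propagate
    clear_failure()                            # restore normal conditions
    elapsed = wait_until_healthy(RECOVERY_BUDGET_SECONDS)
    print(f"Recovered in {elapsed:.1f} s (budget: {RECOVERY_BUDGET_SECONDS} s)")
```

The same harness can be reused for every failure cause discussed below by swapping in different inject/clear hooks.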
The main question at this stage is how to perform this analysis, since doing so effectively produces a large portion of the test plan itself. The following table presents key points of failure to consider, most of which are relevant to the majority of recovery tests:
| Fail-over cause | Possible impact | Impact severity (critical / high / medium / low)* | How to simulate |
| --- | --- | --- | --- |
| External server unreachable | I/O from the server fails, causing an error or crash | High – critical | Disconnect the server at different points in time, selected to simulate the major possible application states |
| Server reachable but not responding as expected | Probable error; crash in extreme cases | Medium – high, depending on the error | Simulate wrong responses on the server side |
| Power supply failure | Error, up to total shutdown if the auxiliary power source also fails | Critical | Unplug the power source; change the power strength suddenly |
| Wireless network signal loss | I/O from the network stops; error in most cases | Medium – high | Change network settings at the OS level, or shut down the network if it is local (where possible) |
| External device not responding | I/O from the device stops; error in most cases | Medium – high | Shut down / unplug the device, or change a relevant setting so it stops generating I/O |
| External device responding in an unexpected way | Probable error; crash in extreme cases | Medium – high, depending on the error | Simulate wrong responses on the device side |
| Physical conditions such as temperature, humidity, etc. | Slower response, application stuck, or total shutdown | Critical | Expose the whole environment to adverse physical conditions where possible, and run all tests within it |
| Electrical disturbances from nearby devices | Errors, slower response | Medium – high | Control / change the proximity of signal-generating devices to one another while tests run; keep a distinct set of tests with known outcomes to isolate this factor |
| Service stopped | Error, application stuck; sometimes no or minimal effect | Low – high | Stop the service manually |
| Missing resources (such as a DLL) | Error / crash | Medium – high | Remove resources while the application is working |
| Co-existence (for example, how does Chrome behave when other browsers are installed?) | Wrong behavior / error | Low – medium | Use real-life stations, e.g. a PC with other applications installed |
| DB overload | Slow response time / error | Low – medium | Run a load test of the application with relevant tools |
| Disconnected network | I/O from the network stops; error in most cases | Medium – high | Change network settings at the OS level, or shut down the network if it is local (where possible) |
| Network overload | Slow response time on network I/O, down to none; can cause errors or the application getting stuck | Medium – high | Generate load on the network, specifically on the components that generate I/O to the AUT |
| Other network issues such as jitter, packet loss and packet mis-ordering | Various errors / misbehavior | Low – medium | Simulate each network issue individually |
*Severity is relative to other fail-over issues, not to the AUT’s own severity definitions.
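Several of the rows above (“server reachable yet not responding as expected”, “external device responding in an unexpected way”) are usually simulated with a controllable stub that stands in for the real dependency. Below is a minimal sketch of such a stub, not a prescribed implementation: the port, the response modes and the payloads are assumptions, and the AUT would need to be configured to point at the stub instead of the real server:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading


class MisbehavingHandler(BaseHTTPRequestHandler):
    """Stub standing in for the external server, deliberately answering wrongly."""
    mode = "garbage"                        # flip to "ok" to restore normal replies

    def do_GET(self):
        if self.mode == "ok":
            body = b'{"status": "ok"}'      # well-formed reply the AUT expects
            self.send_response(200)
        elif self.mode == "garbage":
            body = b"<<<not-json-at-all>>>" # malformed payload the AUT must survive
            self.send_response(200)
        else:                               # "error" mode: outright server failure
            body = b""
            self.send_response(500)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):      # keep test output quiet
        pass


def start_stub(port: int = 9090) -> HTTPServer:
    """Start the stub in a background thread and return the server handle."""
    server = HTTPServer(("localhost", port), MisbehavingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

# Usage sketch: point the AUT's "external server" address at localhost:9090,
# flip MisbehavingHandler.mode between runs, and verify the AUT reports the
# fault, degrades gracefully, and recovers once mode is set back to "ok".
```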
A software recovery testing engineer does not only sign off on the recovery method itself, but also on the reliability of each integral part of that procedure.
The tester must ensure that the following tasks are done before executing a recovery test:
Testing how a process recovers from unpredictable problems is not simple and usually consumes considerable resources, but it offers the following benefits: