Modern applications depend on multiple servers, internal and external network communications, database services, operational processes, and a host of other infrastructure services that must all work together. The business ideal is a continuous flow of information, and any interruption costs the company money; creating high-availability applications therefore becomes an important business strategy.

Companies that increasingly rely on Web-based, distributed applications for important business activity need a range of availability engineering options to meet service level requirements in a cost-effective manner. Realistically, not all applications require 24/7 uptime with instantaneous response. Some applications can fail with no consequence. Others can tolerate unplanned downtime but may require varying recovery strategies, while still others must provide very high availability, using standby replication strategies to guarantee instant, transparent recovery with virtually no perceivable downtime.

As a general idea, availability is a measure of how often the application is available for use. More specifically, availability is a percentage calculation based on how often the application is actually available to handle service requests when compared to the total, planned, available runtime. The formal calculation of availability includes repair time because an application that is being repaired is not available for use.

The calculation for availability uses two measurements:

  • Mean Time Between Failures (MTBF) = Hours / Failure Count: the average length of time the application runs before failing.
  • Mean Time To Recovery (MTTR) = Repair Hours / Failure Count: the average length of time needed to repair and restore service after a failure.

The availability formula looks like this:

Availability = (MTBF / (MTBF + MTTR)) X 100

For example, consider an application that is intended to run perpetually. If we assume 1000 continuous hours as a checkpoint, two 1-hour failures during that time give MTBF = 1000 / 2 = 500 hours and MTTR = 2 repair hours / 2 failures = 1 hour, so availability = (500 / (500 + 1)) X 100 = .998 X 100 = 99.8%.
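
The same arithmetic is easy to script. The following is a minimal sketch in Python; the language choice and the sample numbers are illustrative, not part of the original example:

    def availability(mtbf_hours, mttr_hours):
        # Percentage availability from mean time between failures and mean time to recovery.
        return mtbf_hours / (mtbf_hours + mttr_hours) * 100

    # The worked example above: 1000 hours, two failures, one repair hour each.
    run_hours, failures, repair_hours = 1000, 2, 2
    mtbf = run_hours / failures      # 500 hours
    mttr = repair_hours / failures   # 1 hour
    print(f"availability: {availability(mtbf, mttr):.1f}%")   # availability: 99.8%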

One popular way to describe availability is by the “nines,” such as three nines for 99.9% availability. However, the implication of measuring by nines is sometimes misunderstood. You need to do the arithmetic to discover that three nines (99.9% availability) represents nearly 9 hours of service outage in a single year. The next level up, four nines (99.99%), represents about 1 hour of service outage in a single year. Five nines (99.999%) represents only about 5 minutes of outage per year.
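
Doing that arithmetic is a one-liner per level. The short Python sketch below prints the annual outage budget for the common nines levels:

    HOURS_PER_YEAR = 365 * 24  # 8760

    for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
        outage_minutes = (1 - pct / 100) * HOURS_PER_YEAR * 60
        print(f"{label} ({pct}%): about {outage_minutes:.0f} minutes of outage per year")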

As you reach for higher levels of reliability, several things happen to your project:

  • The hardware costs for the application increase due to server, network, and disk redundancy.
  • Identifying and eliminating complex failures becomes increasingly difficult, requiring highly trained, very skilled software engineers.
  • High availability requires comprehensive testing of every automated and people-based procedure that may affect your application for as long as it is in service.

The good news is that most business applications can run effectively at 99.9% availability. Given the right combination of people, engineering processes, and technology, three nines is very achievable, affordable, and common in typical service-level agreements.

Testing Availability

Testing for availability means running an application for a planned period of time, collecting failure events and repair times, and comparing the availability percentage to the original service level agreement.

Where reliability testing is about finding defects and reducing the number of failures, availability testing is primarily concerned with measuring and minimizing the actual repair time. That may seem odd at first, but take another look at the formula for calculating percentage availability: (MTBF / (MTBF + MTTR)) X 100. Notice that as MTTR trends towards zero, the percentage availability trends towards 100%. This idea becomes the essential focus of availability testing: reduce and eliminate downtime.
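
As an illustration of that focus, the sketch below derives MTBF and MTTR from a timestamped outage log and compares the measured availability to a service level target. It is Python for illustration only; the outage timestamps, test window, and SLA value are all invented for the example:

    from datetime import datetime

    # Hypothetical outage log collected during a 1000-hour availability test:
    # (service went down, service restored) pairs.
    outages = [
        (datetime(2003, 3, 10, 2, 0), datetime(2003, 3, 10, 3, 0)),
        (datetime(2003, 4, 2, 14, 30), datetime(2003, 4, 2, 15, 30)),
    ]
    test_hours = 1000.0
    sla_target = 99.5  # percent, from the service level agreement

    repair_hours = sum((up - down).total_seconds() / 3600 for down, up in outages)
    failures = len(outages)

    mtbf = (test_hours - repair_hours) / failures  # average hours of service per failure
    mttr = repair_hours / failures                 # average hours of repair per failure
    measured = mtbf / (mtbf + mttr) * 100

    print(f"MTBF {mtbf:.0f} h, MTTR {mttr:.1f} h, availability {measured:.2f}%")
    print("meets SLA" if measured >= sla_target else "misses SLA")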

This idea of measuring repair time modifies the usual test emphasis in two ways. First, availability testing must target the entire range of events and procedural scenarios possible in the lifecycle of an application to see if all of the automated support processes and people-based procedures are really production ready. Second, when the application fails, the test clock is still running, and a knowledgeable recovery team had better be onsite fixing the problem.

The closer the testing is to real-world situations, the better the test confidence. Some organizations are reluctant to allocate fully configured server machines and isolated network environments to a long battery of availability testing. Just remember that a software defect found after deployment costs ten times more to fix than if found before deployment.

Test the Change Control Process

Applications are always evolving to fit new business requirements and improve behavior. Even mission-critical applications change over time. Because the change control process is a large source of downtime-causing errors, you had better test and validate all of the change control procedures. A business- or mission-critical application must not go into production until you can repeatedly perform error-free change control.

Consider this: if you haven’t tested the full deployment and configuration process, how do you know if it will work at midnight when the application goes into production? Also, don’t overlook the sometimes difficult problem of eventually removing your application without impairing the operation of other applications.
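
One way to make that rehearsal repeatable is to script it. The sketch below is a minimal Python example; the step commands and health-check URLs are placeholders for whatever your own deployment and configuration process uses, not names from the original text:

    import subprocess
    import sys
    import urllib.request

    # Placeholder change-control steps; substitute your real deployment scripts.
    STEPS = [
        ["deploy_app.cmd", "--target", "staging"],      # install the new build
        ["configure_app.cmd", "--target", "staging"],   # apply configuration changes
    ]
    # Health checks for the application itself and for a neighboring application,
    # so a deployment (or a removal) does not silently impair other services.
    HEALTH_URLS = [
        "http://staging.example.com/health",
        "http://staging.example.com/other-app/health",
    ]

    for step in STEPS:
        result = subprocess.run(step, capture_output=True, text=True)
        if result.returncode != 0:
            sys.exit(f"change-control step failed: {' '.join(step)}\n{result.stderr}")

    for url in HEALTH_URLS:
        with urllib.request.urlopen(url, timeout=10) as resp:
            if resp.status != 200:
                sys.exit(f"post-change health check failed: {url}")

    print("deployment and configuration rehearsal passed")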

Test Catastrophic Failure

Before deploying your new application, make sure that the catastrophic recovery procedures you have created work as expected. Is the recovery team ready? The team must be trained, equipped, and well-rehearsed, and how do you know it is ready unless you test it? It is one thing to perform data backups and write a disaster recovery plan, but if you cannot actually perform disaster recovery, the plan is useless.

What you need to do is create outages of a catastrophic nature and test the recovery process. For example, pull the building’s power plug at midnight and see how long it takes to recover. Catastrophic testing validates not only the correctness of the recovery procedures (and proof of a backup strategy that works), but it also provides a measure of confidence in a well-prepared recovery response team.

Test the Failover Technologies

Before deploying your new application, make sure that the failover technologies you have implemented work as expected. This test should include both servers and RAID disks. For example, pick a favorite piece of hardware, say a disk controller or drive, and loosen the card or pull the connection. Then watch and measure the recovery team as they locate and replace the failed hardware.

As another test idea: let the application run for a few hours, put a few dozen users on the system making client requests, and then pull the power plug on a front-end server and a back-end data store. Watch as the clustering failover technology restarts the failed application service on another server. The application should not only stay online, but every user process should be completed correctly. Again, observe and measure the recovery team as it identifies and replaces the server hardware.
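
A simple probe loop makes the downtime during such a drill measurable rather than anecdotal. The sketch below is illustrative only; the URL, probe interval, and drill length are assumptions:

    import time
    import urllib.request

    URL = "http://app.example.com/health"   # hypothetical health-check endpoint
    INTERVAL = 1.0                          # seconds between probes
    DRILL_SECONDS = 15 * 60                 # how long to observe the drill

    def is_up(url):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    down_since = None
    outages = []
    end = time.time() + DRILL_SECONDS
    while time.time() < end:
        if is_up(URL):
            if down_since is not None:
                outages.append(time.time() - down_since)  # record how long the outage lasted
                down_since = None
        elif down_since is None:
            down_since = time.time()
        time.sleep(INTERVAL)

    print(f"outages observed: {len(outages)}")
    for seconds in outages:
        print(f"  service unavailable for {seconds:.0f} seconds")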

Test the Monitoring Technology

Since you are staging all these tests, you should analyze the Windows Management Instrumentation (WMI) data using the intended monitoring reports and make sure that you can plainly see the resource consumption data and, especially, all of the test outages. If you have implemented a management console (perhaps using Application Center 2000), make sure you are getting the necessary failure, availability, and trend analysis data.
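
If you want to pull a few of those WMI counters yourself during a test run, the sketch below uses the third-party Python wmi package on the machine under test; treat the specific WMI classes and the W3SVC service name as assumptions about your environment rather than requirements from the text:

    import wmi

    c = wmi.WMI()  # connect to the local machine's WMI provider

    # Resource consumption: processor utilization per logical CPU.
    for cpu in c.Win32_PerfFormattedData_PerfOS_Processor():
        print(cpu.Name, cpu.PercentProcessorTime)

    # A quick check that a monitored service (here, the IIS Web service) is still running.
    for svc in c.Win32_Service(Name="W3SVC"):
        print(svc.Name, svc.State)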

Test the Help Desk Procedures

For critical applications, the help desk must be fully trained and ready to handle customer inquiries and failure scenarios. How soon can the help desk identify a problem? Do help desk representatives clearly understand how to resolve the crisis? Test the escalation process by staging a serious failure while several people on the call list aren’t home. Remember, the clock is running on all repair time scenarios.

Test for Resource Conflicts

Availability engineering requires in-depth consideration of an application’s interactions with other system processes. You must look at how a particular service is provided, evaluate all the ways some other application might interfere with the intended service, test for conflicts, and possibly consider design alternatives.

Applications often run slowly because of competition for resources such as CPUs, memory, disk I/O, and network bandwidth. When an application service must wait for hardware to become available in order to complete the task, or when several background events are occurring simultaneously, your application may run slowly. This affects perceived availability in this way: a slow application is technically “available,” but who wants to wait for it? It might as well have failed.
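
One way to capture that in a test is to count any response slower than an agreed threshold as a failure when computing perceived availability. The sketch below is illustrative; the URL, threshold, and probe count are assumptions:

    import time
    import urllib.request

    URL = "http://app.example.com/orders"   # hypothetical application endpoint
    SLOW_SECONDS = 2.0                      # anything slower counts as "failed"
    PROBES = 100

    ok = 0
    for _ in range(PROBES):
        start = time.time()
        try:
            with urllib.request.urlopen(URL, timeout=10) as resp:
                if resp.status == 200 and time.time() - start <= SLOW_SECONDS:
                    ok += 1
        except OSError:
            pass  # outright failures also count against availability
        time.sleep(1)

    print(f"perceived availability: {ok / PROBES * 100:.1f}%")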

Testing for resource conflicts should be conducted in a full, production-like target environment where transient workloads cause multiple applications to compete for resource allocation.

Like reliability, high availability is certainly helped by choosing the right hardware and software technology infrastructure; there is no doubt that certain technology design choices are crucial to building a high-availability application. However, high levels of availability are not possible without a serious commitment to skilled personnel, quality lifecycle processes, and operational excellence. The toughest lesson to learn is this: high-availability applications generally owe their success to how you go about it rather than to any particular combination of technologies.