As part of an availability model that I am working on, I got stuck right at the beginning when trying to find a definition that fits. So I went back to first principles to try and decompose what is meant by availability. This is a conceptual view, separate from the measurement of availability (the ‘nines’ malarkey). Have a look at it and give me some input so that I can refine it further.
Availability is a term that is so widely used in different contexts that it is very difficult to define in a way that satisfies all audiences. At its most basic, availability is the ability of a system to provide the expected functionality to its users. Expected functionality means that the application needs to be responsive (not frustrating users by taking too long to respond), as well as able to reliably perform those functions. But that is not enough to understand the full story about availability.
Availability is simplistically viewed as binary – the application is either available at a point in time, or it is not. This leads to misunderstandings of availability targets (the ‘nines of availability’), misguided approaches to improving availability, and the ability of salespeople to sell availability snake oil off the shelf (see the 100% availability offered by Rackspace).
Application availability is influenced by a number of factors and has visible outcomes for the consumer, both discussed below.
The outcome, or end result, of availability is more than just ‘the site is down’. What does ‘down’ mean? Is it really ‘down’ or is that just the (possibly valid) opinion of a frustrated user (who is trying to capture an online claim after arriving late to work because they crashed their car)? The outcomes of availability are the behaviours that are perceived by the end users, as listed below.
The obvious visible indication of an unavailable application is one that indicates to the end user that something has failed, and no amount of retrying on the user’s part makes it work. The phrase ‘is down’ is commonly used to describe this situation, and it says more about the user’s perception of failure than it does about any specific technical fault. The types of failure include:
- Errors – where the application consistently returns errors. This is often seen on web applications where the chrome all works, but the content area contains an error message or garbage.
- Timeouts – an application that takes too long to respond may be seen as being ‘down’ by the user or even the browser or service that is calling it.
- Missing resources – a ‘404 – Not found’ response can have devastating effects on applications beyond missing image placeholders; a missing script or style sheet can ‘down’ an entire application.
- Not addressable – a DNS lookup error, a ‘destination host unreachable’ error, or other network errors can create the perception that an application is unavailable, regardless of whether it is reachable from other points on the network. This is particularly common for applications that don’t use standard HTTP ports, where traffic gets refused by firewalls.
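These failure types can also be detected mechanically rather than waiting for a user to complain. A minimal sketch of a probe that classifies what a user would see, along the lines of the list above (the category names are my own, not a standard taxonomy):

```python
import socket
import urllib.error
import urllib.request

def classify_outcome(status=None, exc=None):
    """Map a probe result onto the failure types above.
    Category names are illustrative, not a standard taxonomy."""
    if exc is None:
        return "ok" if status is not None and status < 400 else "errors"
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, but with a failure status
        return "missing-resource" if exc.code == 404 else "errors"
    if isinstance(exc, socket.timeout) or isinstance(
            getattr(exc, "reason", None), socket.timeout):
        return "timeout"
    if isinstance(exc, urllib.error.URLError):
        return "not-addressable"  # DNS failure, connection refused, etc.
    return "errors"

def probe(url, timeout=5):
    """Exercise a URL and classify the outcome as a user would see it."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_outcome(status=resp.status)
    except Exception as exc:
        return classify_outcome(exc=exc)
```

The point of separating the classification from the probe is that ‘down’ stops being one bucket: an operator can see whether users are hitting errors, timeouts, or addressing problems, which lead to very different fixes.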
While it may be easy to determine that an application that is switched off is unavailable, what about one that performs badly? If, for example, a user executes a search and it takes a minute to respond, would the user consider the application to be available? Would the operators share the same view? Apdex (Application Performance Index) incorporates this concept, classifying application responsiveness into three categories: Satisfied, Tolerating, and Frustrated. This can form the basis for a performance metric that can be understood, and also acknowledges that, while some degraded performance is inevitable, we should not have too many frustrated users for long or critical periods.
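Apdex makes the classification concrete: responses within a target time T count as Satisfied, responses within 4T as Tolerating, and anything slower as Frustrated; the score is (satisfied + tolerating/2) / total. A minimal sketch, with an assumed target of T = 0.5 seconds:

```python
def apdex(response_times, t=0.5):
    """Apdex score: satisfied (<= T) count in full, tolerating
    (<= 4T) count half, frustrated (> 4T) count nothing."""
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# Two snappy responses, two sluggish ones, one the user gave up on:
score = apdex([0.2, 0.3, 0.6, 1.5, 3.0], t=0.5)  # 0.6
```

A score of 0.6 tells a far more useful story than the binary view: the application is not ‘down’, but a meaningful fraction of users are tolerating or frustrated.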
In addition to features being snappy and responsive, users also expect that features can be used when they are needed and perform the actions that they expect. If, for example, an update on a social media platform posts immediately (it is responsive), but is not available for friends to see within a reasonable time, it may be considered unreliable.
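One way to test that kind of reliability is to measure an action by its observable effect rather than its immediate response: poll until the effect is visible, with a deadline. A sketch, where the check function and deadline are placeholders to be supplied by the caller:

```python
import time

def becomes_visible(check_visible, deadline_s=5.0, poll_s=0.1):
    """Return True if check_visible() succeeds within the deadline.
    Treats 'responsive, but the effect never shows up' as a failure."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if check_visible():
            return True
        time.sleep(poll_s)
    return False
```

For the social media example, check_visible would query the friend’s view of the feed; a post that accepts instantly but never passes this check is unreliable, and therefore arguably unavailable.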
While the availability outcomes receive the focus, simply saying “Don’t let the application go down” fails to direct effort and energy to the parts of the application that ultimately influence availability. Some of these availability influencers are discussed below.
The most important, and often unconsidered, influence on availability is the quality of the underlying components of the system. Beyond buggy (or not) code, there is the quality of the network (including the users’ own devices), the quality of the architecture, the quality of the testing, the quality of the development and operational processes, the quality of the data and many others. Applications that have a high level of quality, across all aspects of the system, will have higher availability – without availability being specifically addressed. An application hosted in a cheap data centre, with a jumble of cheap hardware, running a website off a single PHP script thrown together by a part-time student developer copying and pasting from forums, will have low availability – guaranteed.
Considering that any system is going to have failures at some point, the degree to which an application can handle faults determines its availability. For example, an application that handles database faults by failing over to another data source and retrying will be more available than one that reports an error.
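A minimal sketch of that failover-and-retry idea (the data-source interface here is invented for illustration – each source is just a callable):

```python
def query_with_failover(sources, query, attempts_per_source=2):
    """Try each data source in turn, retrying a couple of times,
    before surfacing an error to the user."""
    last_error = None
    for source in sources:
        for _ in range(attempts_per_source):
            try:
                return source(query)
            except Exception as exc:  # e.g. connection dropped
                last_error = exc
    raise RuntimeError("all data sources failed") from last_error
```

From the user’s perspective, a query that quietly succeeds against the secondary source is ‘available’; one that reports the primary’s error is not, even though the underlying fault was identical.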
If a frustratingly slow and unresponsive application can be considered to be unavailable (not responsive or reliable) and this responsiveness is due to high load on the application, then the ability to scale is an important part of keeping an application available. For example, a web server that is under such high load that it takes 20 seconds to return a result (unavailable) may be easily addressed by adding a bunch of web servers.
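A rough illustration of why adding servers helps, using a simple single-queue approximation (response ≈ service time / (1 − utilisation)); the numbers here are invented to match the 20-second example, not measured:

```python
def approx_response(service_s, arrival_rate, servers):
    """Very rough response-time estimate: each server sees
    arrival_rate / servers, and response time blows up as
    utilisation approaches 1. Illustrative only."""
    utilisation = arrival_rate * service_s / servers
    if utilisation >= 1:
        return float("inf")  # overloaded: the queue grows without bound
    return service_s / (1 - utilisation)

# One server at 97.5% utilisation: ~20 s per request.
# Four servers sharing the same load: ~0.66 s.
one = approx_response(0.5, 1.95, servers=1)
four = approx_response(0.5, 1.95, servers=4)
```

The non-linearity is the point: near saturation, a modest amount of extra capacity turns an ‘unavailable’ application back into a responsive one.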
If a fault occurs and an application needs to be fixed, the time to recovery is an important part of availability. The maintainability of the application, primarily the code base, is a big part of the time that it takes to find, fix, test and redeploy a defect. For example, applications that have no unit tests, and large chunks of code that have not been touched in years, cannot be fixed quickly. This is because a large code base needs to be understood, impacts assessed and regression tests performed – turning a single-line code change into days of delay before an important fix is deployed.
Modern web-based applications don’t have the luxury of the planned-maintenance downtime windows that internal enterprise applications enjoy (where planned maintenance frequently occurs on weekends). The ability of an application to have updates and enhancements deployed while it is live and under load is an important aspect of availability. An otherwise robust and high-quality application will have low availability if the entire system needs to be brought down for a few hours in order to roll out updates.
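The usual answer is a rolling deployment: take one server out of rotation, update it, health-check it, put it back, and only then move on, so the pool as a whole stays live. A simulation sketch (the server structure and hooks are invented for illustration):

```python
def rolling_deploy(servers, update, health_check):
    """Update servers one at a time so the pool never loses more
    than one server's capacity. Aborts on a failed health check
    rather than rolling a bad build out to the whole pool."""
    for server in servers:
        server["in_rotation"] = False   # drain: stop sending it traffic
        update(server)
        if not health_check(server):
            raise RuntimeError(f"{server['name']} failed its health check")
        server["in_rotation"] = True    # back behind the load balancer
```

The same shape applies whether ‘server’ means a VM, a container, or a deployment slot; the availability property comes from the ordering, not the infrastructure.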
Assuming that things break, the speed at which they can be fixed is a key influencer of availability. The degree of recoverability in an application is largely down to the operational team (including support/maintenance developers and testers) getting things going again. The ability to diagnose the root cause of a problem in a panic-free environment, and to take corrective action that is right the first time, is a sign of a high level of operational maturity and hence recoverability.
If availability is measured in seconds of permissible downtime, only discovering that the application is unavailable because a user has complained takes valuable chunks out of the availability target. There is a need not only for immediate detection of critical errors, but also for proactive monitoring of health, so that corrective action can be taken before a potential problem takes down the application.
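In code terms, that means checking health signals against warning thresholds set well below the point of failure, not just detecting hard errors after the fact. A sketch with invented signal names and thresholds:

```python
def health_status(signals, thresholds):
    """Compare health signals (e.g. latency, disk usage) against
    (warn, critical) thresholds, so corrective action can start
    at 'warn', before users ever see a failure."""
    status = "ok"
    for name, value in signals.items():
        warn, critical = thresholds[name]
        if value >= critical:
            return "critical"   # page someone now
        if value >= warn:
            status = "warn"     # act before it becomes an outage
    return status

# Hypothetical thresholds: latency in seconds, disk usage in percent.
thresholds = {"p95_latency_s": (1.0, 4.0), "disk_used_pct": (80, 95)}
# A disk filling up is caught while the site still looks fine to users.
status = health_status({"p95_latency_s": 0.4, "disk_used_pct": 85},
                       thresholds)  # 'warn'
```

The ‘warn’ state is the whole point: it is the window in which operations can add capacity, clear the disk, or fail over, and the downtime budget is never touched.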