High Availability


1- Introduction:

Today, everything has to work all the time. This is especially challenging for companies whose production activity (in services as well as in industry) depends more and more on the company's information system (CRM, websites, etc.) and on its connection to the outside world. This article presents the fundamental principles of high availability.

High availability is becoming really important for companies because of the dependence created by the Internet and new technologies, which must be available most of the time. There is no standard for the acceptable duration of a service outage; it depends on the context and on the criticality of the application.

2- Definition:

We define high availability as the ability of a system to ensure the operational continuity of a service over a given period of time. Availability is commonly measured on the scale of "nines" (99%, 99.9%, 99.99%, and so on): a service that is 99% available may be unavailable for at most 3.65 days per year. To calculate availability, the metrics used are as follows:

  • MTBF (Mean Time Between Failures): the estimated time between two system failures.
  • MTTR (Mean Time To Recovery): the estimated time needed to restore functionality after a failure.

The availability formula is: Availability = MTBF / (MTBF + MTTR).
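As a minimal sketch, the formula above can be turned into a small calculation that also converts an availability ratio into expected downtime per year. The MTBF and MTTR figures below are hypothetical, chosen purely for illustration.

```python
# Computing availability from MTBF and MTTR, and translating an
# availability ratio into expected downtime per year.
# The example figures (500 h MTBF, 2 h MTTR) are hypothetical.

HOURS_PER_YEAR = 365 * 24  # 8760

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def yearly_downtime_hours(avail: float) -> float:
    """Expected downtime per year for a given availability ratio."""
    return (1.0 - avail) * HOURS_PER_YEAR

# Hypothetical system: fails on average every 500 h, takes 2 h to restore.
a = availability(500, 2)
print(f"Availability: {a:.4%}")
print(f"Downtime per year: {yearly_downtime_hours(a):.1f} h")

# The "three nines" benchmark: 99.9% allows roughly 8.76 h of downtime a year.
print(f"99.9% -> {yearly_downtime_hours(0.999):.2f} h/year")
```

This makes the trade-off concrete: availability improves either by making failures rarer (raising MTBF) or by recovering faster (lowering MTTR).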

3- High availability fundamentals:

In most companies, the Internet is at the heart of all activities, so its availability is mandatory. It is used to communicate with the outside world, to support various applications within the company (CRM, ERP, etc.), and for voice and video services (VoIP). It is necessary to distinguish the company's needs on two levels: services available to clients versus services mandatory for internal operation. One of the most illustrative examples is the corporate website, which is now at the centre of communication and of most business activities. The high availability of websites is organised around several main axes:

  • Hardware redundancy,
  • The geographic location of that hardware,
  • The application of security updates to server software,
  • The security of the enterprise network,
  • The permanent availability of backup and disaster recovery solutions.


Redundancy is a mechanism by which one or more components of an architecture are duplicated by one or more identical elements. Having "n" servers across "x" sites provides redundancy of information and spreads the risk of a breakdown across those "x" sites and "n" servers.
However, systems that automatically switch from one site to another are required. The most commonly implemented systems to ensure this redundancy are clusters.
Clusters can be either active/passive or active/active. In the first case, a group of standby machines takes over when the infrastructure is switched to it, while an active/active system allows both systems to operate in parallel; in that case, each of the two devices must be able to carry the full load on its own if the other fails.
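The active/passive switchover described above can be sketched as a small piece of failover logic. The `health_check` callable and node names here are hypothetical; real clusters detect failure through heartbeats over dedicated links, using tools such as Pacemaker or keepalived.

```python
# A minimal sketch of active/passive failover, assuming a per-node
# health_check() callable (hypothetical). Real clusters use heartbeat
# protocols and dedicated cluster managers rather than this logic.

from typing import Callable

class ActivePassiveCluster:
    def __init__(self, active: str, passive: str,
                 health_check: Callable[[str], bool]):
        self.active = active
        self.passive = passive
        self.health_check = health_check

    def route_request(self) -> str:
        """Return the node that should serve traffic, failing over if needed."""
        if self.health_check(self.active):
            return self.active
        # Active node is down: promote the passive node and demote the old one.
        self.active, self.passive = self.passive, self.active
        return self.active

# Simulated health state: "web-1" has failed, "web-2" is healthy.
status = {"web-1": False, "web-2": True}
cluster = ActivePassiveCluster("web-1", "web-2", lambda n: status[n])
print(cluster.route_request())  # fails over to "web-2"
```

An active/active variant would instead distribute requests across both nodes while they are healthy, which is why each node must be sized to absorb the full load alone.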

Applications and software updates:

Applications can contain bugs; updates correct these defects. Keeping software up to date also prevents malicious actors from exploiting a vulnerability that would give access to company information. Having a maintenance service is therefore important, and given the technical skills required, it is sometimes wise to outsource maintenance operations.

Disaster Recovery:

This is a plan that allows a total or partial resumption of activity following a disaster affecting the information system. The purpose of the plan is to minimise the impact of the loss on the company's business. The key points of a recovery plan are:

  • Backup of the equipment,
  • High availability of the backup machines,
  • Disaster recovery solutions, including a degraded mode.

4- Perspectives:

High availability has become essential for many services. Hardware is becoming more and more powerful, the capacity and functionality of components keep increasing, and the techniques are better and better mastered.
So why do we not reach 100% availability? Technology is one component, but human interaction also plays a part. New technologies and constant improvements increasingly make it possible to add capacity without stopping the service, but there are always unforeseeable causes of failure. The objective is precisely to anticipate these unforeseen events and, above all, to minimise their causes. Looking ahead, the increase in bandwidth and the disappearance of digital boundaries will make it possible to generalise high availability. The aim would then be for these application and data servers to consume less and less energy, out of respect for the environment.