Disaster recovery policies for hyper-converged infrastructures

10 March 2016: Disaster recovery plans for hyper-converged infrastructures involve very different considerations from the approaches used in traditional environments. Discover the areas to examine.

Hyper-converged infrastructures are predominantly closed-loop systems. This means applications running on servers outside the hyper-converged infrastructure are not part of the cluster and cannot access its resources, such as compute, networking and storage. It is essential to keep this in mind when choosing, implementing and relying on hyper-converged infrastructure disaster recovery (DR) policies.
DR is essential for mission-critical workloads. A hyper-converged infrastructure does not eliminate or mitigate this requirement, but it can simplify it. There are some types of system failures for which hyper-converged infrastructures can be somewhat architecturally tolerant, for example, one or more server node or drive failures occurring concurrently.

In planning, implementing and executing hyper-converged disaster recovery policies and business continuity plans, there are three fundamental tasks that need to be considered.


Consideration 1: RPO and RTO

The first step is establishing recovery point objectives (RPO) and recovery time objectives (RTO) for each application workload, whether it runs on a physical host, in a virtual machine (VM) or in a container. RPO is the amount of data that can be tolerably lost when a disaster occurs. RTO is the maximum amount of downtime that can be tolerated before the application must be up and running with its data intact. There is a broad spectrum of RPOs and RTOs applicable to the different application workloads.
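As a rough illustration of the two definitions, the following Python sketch checks one recovery (or recovery test) against its objectives. The function name, parameters and timestamps are hypothetical; the data-loss window is taken as the time between the last usable recovery point and the failure, and downtime as the time between the failure and the restoration of service.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recovery_point, failure_time, service_restored, rpo, rto):
    """Return (rpo_met, rto_met) for a single recovery event.

    All names here are illustrative assumptions, not part of any product API.
    """
    # Data written after the last recovery point is lost in the disaster.
    data_loss_window = failure_time - last_recovery_point
    # Downtime runs from the failure until the application is usable again.
    downtime = service_restored - failure_time
    return data_loss_window <= rpo, downtime <= rto


if __name__ == "__main__":
    # Hypothetical event: last recovery point 09:40, failure 10:00, restored 10:45.
    result = meets_objectives(
        last_recovery_point=datetime(2016, 3, 10, 9, 40),
        failure_time=datetime(2016, 3, 10, 10, 0),
        service_restored=datetime(2016, 3, 10, 10, 45),
        rpo=timedelta(minutes=30),
        rto=timedelta(minutes=30),
    )
    print(result)  # (True, False): RPO met, RTO missed
```

In this made-up case the 20 minutes of lost data fit within a 30-minute RPO, but the 45 minutes of downtime exceed a 30-minute RTO.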


The most common alternative is to establish a single level of RPOs and RTOs for all application workloads, but this is a very costly mistake. Protecting every workload to the same level costs more, and recoveries cost even more because nothing differentiates one application from another. A DR administrator has no way to know which application should be recovered first, which considerably delays mission-critical recoveries. Business continuity is delayed as a result, along with revenue generation. A better compromise is to define at least a few generalized levels of RPOs and RTOs. Think of them as buckets of similar RPOs and RTOs with names such as Gold, Silver and Bronze, and assign each workload to a specific level. This compromise is not as detailed as assigning specific RPOs and RTOs to each application workload, but it generally accomplishes the task.
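The tiered approach can be sketched in a few lines of Python. This is only an illustration: the RPO and RTO values, the priority numbers and the workload names are all hypothetical, and real targets would come from the business. The point is that once each workload carries a tier, the recovery order is no longer ambiguous.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    priority: int    # lower number = recovered earlier
    rpo: timedelta   # maximum tolerable data loss
    rto: timedelta   # maximum tolerable downtime


# Illustrative values only; these are assumptions, not recommendations.
TIERS = {
    "Gold": Tier("Gold", 1, timedelta(minutes=5), timedelta(minutes=15)),
    "Silver": Tier("Silver", 2, timedelta(hours=1), timedelta(hours=4)),
    "Bronze": Tier("Bronze", 3, timedelta(hours=24), timedelta(hours=24)),
}

# Hypothetical workload-to-tier assignments.
WORKLOADS = {
    "order-db": "Gold",
    "intranet-wiki": "Bronze",
    "email": "Silver",
    "payment-api": "Gold",
}


def recovery_order(workloads: dict[str, str]) -> list[str]:
    """Return workload names sorted by tier priority, then alphabetically."""
    return sorted(workloads, key=lambda w: (TIERS[workloads[w]].priority, w))


if __name__ == "__main__":
    for name in recovery_order(WORKLOADS):
        tier = TIERS[WORKLOADS[name]]
        print(f"{name}: {tier.name} (RPO {tier.rpo}, RTO {tier.rto})")
```

With these sample assignments, the two Gold workloads are recovered first, then the Silver one, then the Bronze one.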


Consideration 2: Types of disasters to protect against

This is known as disaster mitigation. Not all disasters are created equal or occur at an equal frequency. There are component disasters, such as failed storage drives or system nodes, which are typically handled by the hyper-converged infrastructure, albeit at a nominal level. There are also system-level disasters; site-level disasters; regional disasters; malicious human acts such as malware, espionage and sabotage; and human errors. Each disaster type may require different RPOs and RTOs, as well as different technologies. Determining the requirements of each disaster type is a must if recoveries are to perform as planned.


Consideration 3: Type of technology

Only technologies that meet RPO, RTO and disaster mitigation requirements should be used. The next step is to determine the cost of implementation. Here are some examples of disaster recovery technologies for hyper-converged infrastructures. Note that each option has pros and cons, and one technology may not be enough to meet an organization’s requirements.

VM replication enables RPOs down to the level of individual writes and nearly instantaneous RTOs. It is very costly, since duplicate hardware and licensing are required. Not all VM replication works the same way; for example, some systems time stamp each write so a malware infection can be rolled back. VM replication varies by hypervisor, and most implementations do not work across hypervisors. This is mostly a nonissue in a hyper-converged infrastructure ecosystem, as there will normally be a single hypervisor. VM replication works for all disaster mitigation as long as the secondary hyper-converged infrastructure site is located more than 250 miles away. It is not a good technology for a container-centric hyper-converged infrastructure, since it requires that containers be run in VMs, eliminating much of the rationale behind container utilization.

Storage snapshots with asynchronous replication have low RPOs and very fast RTOs. But the software-defined storage used with the hyper-converged infrastructure must have that capability, and not all do. The secondary site requires the same or more storage capacity and compute capability as the primary hyper-converged infrastructure site, making this option costly. Like VM replication, this technology mitigates all disaster types, and it is also good for containers.

VM and container backup with fast guest mounting for recoveries delivers 24-hour RPOs but close-to-instantaneous RTOs. This tends to be a very effective data protection, DR and business continuity technology. It requires that a copy of the backed-up data be resident both locally and at a remote location to protect against site or regional disasters. The issue with many VM and container backups is their dependence on the hypervisor’s snapshot capabilities, which tend to be resource-intensive. This can slow the performance of other applications on the hyper-converged infrastructure while snapshots are taken.


Containers have different issues. If they run on bare metal rather than nested in a VM on a hypervisor, backup agents are used. One exception is Asigra software, which natively backs up Docker containers. Agents must be managed like all other server software and will consume quite a bit of hyper-converged infrastructure resources when backing up.

External shared SAN or file storage (NAS) is not an option, since hyper-converged infrastructures do not use the technology.
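Matching the options above to a set of requirements can be sketched as a simple filter. The Python below is a minimal illustration under stated assumptions: apart from the 24-hour backup RPO mentioned above, every RPO and RTO figure is a placeholder chosen for the example, and the class, field and function names are hypothetical. Actual figures depend on the product and configuration, and cost still has to be weighed separately.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Technology:
    name: str
    rpo: timedelta                 # worst-case data loss the option delivers
    rto: timedelta                 # worst-case downtime the option delivers
    suits_bare_metal_containers: bool


# Placeholder figures (assumptions), except the 24-hour backup RPO.
CANDIDATES = [
    Technology("VM replication", timedelta(seconds=1), timedelta(minutes=1), False),
    Technology("Snapshots with asynchronous replication",
               timedelta(minutes=15), timedelta(minutes=5), True),
    Technology("Backup with fast guest mounting",
               timedelta(hours=24), timedelta(minutes=5), True),
]


def qualifying(candidates, rpo, rto, needs_containers=False):
    """Return the names of technologies that satisfy the given requirements."""
    return [
        t.name
        for t in candidates
        if t.rpo <= rpo
        and t.rto <= rto
        and (t.suits_bare_metal_containers or not needs_containers)
    ]


if __name__ == "__main__":
    # Hypothetical requirement: 1-hour RPO, 15-minute RTO, bare-metal containers.
    print(qualifying(CANDIDATES, timedelta(hours=1), timedelta(minutes=15),
                     needs_containers=True))
```

An empty result for a given tier would indicate that no single option is sufficient, which is when a combination of technologies needs to be considered.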