Understanding Hyper-V Replica

In my previous article, “Understanding Disaster Recovery,” I talked about the concepts of disaster recovery (DR) and business continuity planning (BCP). In this article I will move from concepts to actual software features by explaining Hyper-V Replica (HVR), Microsoft’s DR solution for its Hyper-V virtualization platform. It’s probably fair to say that Microsoft underestimated just how popular this free DR solution would be. They knew small-to-medium enterprises (SMEs) would be interested, but the attention that HVR received from larger enterprises was a pleasant surprise. I will give an overview of Hyper-V Replica and how it works, and I’ll also post some how-to KB articles at the end.

Hyper-V Replica Overview

HVR is a feature that is built into Hyper-V and does not require any additional licensing. The feature was first introduced in Windows Server 2012 (WS2012) and was enhanced in Windows Server 2012 R2 (WS2012 R2). HVR takes advantage of the fact that the easiest way to replicate complex applications to a DR site is to abstract them as virtual machines, and to replicate the virtual machines, which are just a few files. And this follows Microsoft’s drive to get people to virtualize more of their systems (see the enhanced scalability of Hyper-V in WS2012 and WS2012 R2).

Simple Bandwidth Requirements

Microsoft’s initial ambition with HVR was to create a DR solution for SMEs. They decided to use asynchronous replication with HVR because those SMEs usually face challenges with WAN and Internet connectivity. Synchronous replication requires high-quality and low-latency connections, which are usually out of the realms of possibility for the SME, either due to cost or local availability. HVR was also designed to deal with the connectivity outages that are all too common for the SME that is using commercial broadband. The design decisions taken by Microsoft on behalf of the SME also attracted the attention of the large enterprise. A large data center might use SAN replication to replicate to another identically configured data center, but large enterprises often have regional/branch offices and/or retail outlets that they want to provide DR solutions for. Putting in low-latency connections is out of the question, so a free replication solution that uses asynchronous replication, such as HVR, sounds like a good option.

Hardware Agnostic

HVR operates at the software layer, not caring what type of server or storage hardware you have on any site. This is another one of the things that larger enterprises like: They sometimes find themselves trying to provide a central DR site for regional offices where those offices have local purchasing autonomy. Trying to do hardware-based replication from one branch office with EMC storage, another with Dell storage, and a third with NetApp storage… well, that complicates things for the central IT staff, who will have to buy matching storage and acquire additional skills. A software-based solution such as HVR rises above hardware fragmentation because it just does not care. You can replicate from a site with HP servers and an iSCSI SAN to another site with Dell servers and Storage Spaces on DataOn JBODs. It doesn’t get more complicated than that example.

Figure: Hyper-V Replica is hardware-agnostic replication.

Hosts vs. Clusters

The configuration is slightly different for clusters so that administration can be simplified, but HVR does not care if you use a standalone host or a Hyper-V cluster. You can replicate:

  • From one standalone host to another standalone host
  • From a standalone host to a Hyper-V cluster
  • From one Hyper-V cluster to another Hyper-V cluster
  • From a Hyper-V cluster to a standalone host

Note that you cannot replicate from one host to another if both hosts are in the same Hyper-V cluster.
The Hyper-V Replica settings of a standalone host are managed in the Host Settings in Hyper-V Manager. A Hyper-V cluster instead uses a Hyper-V Replica Broker, which gives all of the member nodes a central point of administration and a single identity for HVR.
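
The pairing rules above can be captured in a few lines. This is only a minimal Python sketch for illustration, not anything HVR exposes: the Endpoint class and the can_replicate function are names I made up, and rejecting a host that targets itself is my own assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    """A replication endpoint: a standalone host or a node of a cluster."""
    host: str
    cluster: Optional[str] = None  # None means a standalone host


def can_replicate(source: Endpoint, target: Endpoint) -> bool:
    """Return True if the source/target pairing is one of the allowed kinds."""
    # Assumption for this sketch: a host never replicates to itself.
    if source.host == target.host:
        return False
    # Two nodes of the same Hyper-V cluster cannot form a replication pair.
    if source.cluster is not None and source.cluster == target.cluster:
        return False
    # Every other host/cluster combination is allowed.
    return True
```

For example, can_replicate(Endpoint("node1", "ClusterA"), Endpoint("node2", "ClusterA")) is refused, while the same source paired with a standalone host or a node of another cluster is accepted.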

Hyper-V Replica and Security

Microsoft disabled HVR and inbound replication by default. A Hyper-V host or cluster will not accept replication traffic unless it is configured to. There are two basic policies:

  • Accept all authenticated hosts: Any host or cluster (the HVR Broker identity) that can authenticate itself can replicate to this host or cluster. All replica virtual machines are stored in a single location. This option should only be used in 1:1 implementations in small businesses or demo labs.
  • One policy per source host/cluster: Each host or cluster that can replicate to this host/cluster will have a policy to identify the source host/cluster and a unique location (volume and/or subfolder) for the replica virtual machines. This is a better solution for medium or larger businesses, or any implementation with multiple primary sites.
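
The difference between the two policies is easiest to see as a lookup. The following Python sketch is a simplified model under my own assumptions: the function name, its parameters, and the example paths are hypothetical, and the source is assumed to have already authenticated.

```python
from typing import Dict, Optional


def replica_storage_path(
    source: str,
    allow_any: bool,
    default_path: str,
    per_source: Dict[str, str],
) -> Optional[str]:
    """Return where replicas from `source` would be stored, or None if refused.

    `source` is assumed to be an already authenticated host or broker name.
    """
    if allow_any:
        # "Accept all authenticated hosts": every source shares one location.
        return default_path
    # "One policy per source": only listed sources are accepted,
    # and each one gets its own storage location.
    return per_source.get(source)
```

With allow_any set to False and per_source set to {"branch1.example.com": r"D:\Replica\Branch1"}, a request from branch1.example.com lands in its own folder, and a request from any unlisted source returns None.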

There are two ways that hosts can authenticate themselves when using Hyper-V Replica:

  • HTTP/Kerberos: This method uses TCP 80 and Active Directory-based (Kerberos) authentication. The replication traffic is not encrypted, so this type should be used when replicating within a forest across trusted networks.
  • HTTPS/SSL: Certificates are deployed to the source and replica hosts for authentication, and all replication traffic is encrypted (over TCP 443 by default). This method is recommended for untrusted networks.

Service Providers

There is a business opportunity for service providers. Hosts can authenticate and replicate using SSL certificates, and a single host/cluster can accept inbound replication from many hosts/clusters. This service could be useful for smaller businesses that cannot afford or manage a DR site infrastructure.

Performing the First Copy to the Disaster Recovery Site

Small businesses with limited bandwidth and larger businesses with many terabytes of VMs will want to know what options are available to perform the initial copy of virtual machines to the DR site. There are three ways to do this.

  • Network copy: The virtual machine is copied over the network to the secondary site and then replication starts. You can choose to delay the copy until a specific date and time. This method has the least manual effort (perfect for self-service clouds) but requires the most bandwidth.
  • Restore from backup: The desired virtual machine is restored from backup to the DR host/cluster. Replication is then configured to use the restored VM as a seed. A synchronization finds the differences between the VM in the primary site and the restored secondary site VM, fixes up the differences, and replication begins. This is bandwidth efficient (if the backup is recent) and requires little manual effort (if the backup is already in the DR site).
  • Out-of-band copy: The primary site VM is exported to removable storage, ideally encrypted (see BitLocker To Go). The storage is transported and connected to the secondary site/cluster, and the VM is imported. A synchronization fixes up the replica VM’s differences since the export, and replication begins. This is the most bandwidth efficient method but requires lots of manual effort.
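
A quick back-of-the-envelope calculation helps when choosing between these options. The Python sketch below is only a rough estimate: the function name is mine, sizes are treated as decimal gigabytes, and the 80% link efficiency default is an assumption rather than a measured figure.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to copy `size_gb` over a `link_mbps` link.

    `efficiency` is the assumed fraction of the link usable for the copy.
    """
    if size_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link > 0, and 0 < efficiency <= 1")
    megabits = size_gb * 8 * 1000               # decimal GB -> megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600
```

As a worked example, 500 GB over a 10 Mbps link with no overhead (efficiency of 1.0) is 4,000,000 megabits at 10 megabits per second, or roughly 111 hours. That is the kind of number that makes the backup restore or out-of-band options attractive.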

The Replication Mechanism

HVR uses asynchronous replication to send data to the secondary site. A replication policy is created in the primary site for each required virtual machine (including a selection of which virtual hard disks to replicate). Once the policy is in place, HVR starts to track the changes to each selected virtual hard disk. This is done by mirroring the changes to a Hyper-V Replica log (HRL) file, stored with the virtual hard disk.
In WS2012 Hyper-V, the replication interval is fixed at five minutes and cannot be changed. In WS2012 R2, you can choose a replication interval of 30 seconds, 5 minutes, or 15 minutes for each virtual machine.
At the end of each interval, the HRL file (or files) of a virtual machine is swapped out and replaced with a new one for the next interval. The replaced HRL is sent to the DR site and applied to the replica virtual machine, updating it with the latest changes.
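
The log swap can be pictured with a toy model. The Python sketch below is purely illustrative and says nothing about the real HRL file format: the class and function names are mine, a dictionary of block numbers stands in for a virtual hard disk, and repeated writes to the same block within an interval are collapsed to the latest value as a simplification.

```python
from typing import Dict


class ReplicatedDisk:
    """Toy model of a primary virtual hard disk tracked by a change log."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}   # primary disk contents
        self._log: Dict[int, bytes] = {}     # stand-in for the current HRL

    def write(self, block: int, data: bytes) -> None:
        """Apply a write to the disk and mirror it into the current log."""
        self.blocks[block] = data
        self._log[block] = data

    def swap_log(self) -> Dict[int, bytes]:
        """End the interval: hand back the filled log and start an empty one."""
        filled, self._log = self._log, {}
        return filled


def apply_log(replica: Dict[int, bytes], log: Dict[int, bytes]) -> None:
    """Replay a shipped log against the replica disk in the DR site."""
    replica.update(log)
```

Writes made after a swap stay in the new log, so the replica lags the primary until the next interval ends and that log is shipped and applied with apply_log(replica, primary.swap_log()).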

Figure: Hyper-V Replica logs are used for replication.

Restore Points

You can choose to maintain restore points in the secondary site for a replicated virtual machine. You can have up to 15 restore points (one per hour) in WS2012 and up to 24 restore points (one per hour) in WS2012 R2. This allows you to fail over the virtual machine as it was maybe 1 hour ago, 15 hours ago, or even 24 hours ago (WS2012 R2).
HVR accomplishes this by creating checkpoints (snapshots in WS2012) of the cold replica virtual machine in the DR site. Each checkpoint is presented in a drop-down list box; you choose which restore point to use. Note that you can also tell Hyper-V to use Volume Shadow Copy Service (VSS) to create these checkpoints to guarantee application consistency.
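
The retention behavior works like a fixed-size rolling window. Here is a minimal Python sketch of that idea, using the limits quoted above; the names are hypothetical, and I am assuming that the oldest restore point is discarded when a new one would exceed the configured count.

```python
from collections import deque

# Upper limits quoted in this article, keyed by Hyper-V version.
MAX_RESTORE_POINTS = {"WS2012": 15, "WS2012 R2": 24}


def make_restore_point_store(version: str, requested: int) -> deque:
    """Create a rolling store that keeps only the newest `requested` points."""
    limit = MAX_RESTORE_POINTS[version]
    if not 1 <= requested <= limit:
        raise ValueError(f"{version} allows 1 to {limit} restore points")
    # A deque with maxlen silently drops the oldest item when it is full.
    return deque(maxlen=requested)
```

Appending one entry per hour to a store created with make_restore_point_store("WS2012 R2", 24) keeps the last 24 hours available, with the oldest point falling off as each new one arrives.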

Responding to Disasters

In the secondary site, HVR maintains regularly updated offline (or cold) replica virtual machines. These are identical copies of the production virtual machines that are locked down in a replicating, powered-down state. There are two types of failover that you can perform.

  • Planned Failover: In the event of a forecasted disaster, such as Hurricane Sandy, you can perform a planned failover. This will power down the production virtual machines, flush any remaining replication traffic to the DR site, and power up the DR site virtual machines. Replication is then reversed, converting the DR site into the primary site and the original production site into the secondary site. If the disaster does not happen, then you reverse the entire process. The RTO of a planned failover is the time it takes to boot up your virtual machines (minutes). The RPO is zero because all remaining replication traffic is forced to the secondary site.
  • Unplanned Failover: This is the type of failover that is done after an unexpected disaster such as an earthquake or fire. The primary site is lost and replication stops. The virtual machines in the secondary site are started up using the unplanned failover option. The RTO is the length of time it takes to boot up the virtual machines. The maximum RPO is 30 seconds, 5 minutes, or 15 minutes depending on your chosen replication interval for each virtual machine. The vast majority of businesses will be delighted with this minimal data loss in conjunction with an economical and simple DR solution.
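
The RPO figures above can be expressed as a small helper. This Python sketch simply restates the numbers in this article; the function and table names are my own, and it treats the replication interval as the worst-case data loss for an unplanned failover, as described above.

```python
# Replication intervals (in seconds) offered by each version, per this article.
ALLOWED_INTERVALS = {"WS2012": {300}, "WS2012 R2": {30, 300, 900}}


def max_rpo_seconds(version: str, interval_seconds: int, planned: bool) -> int:
    """Return the worst-case data loss window, in seconds, for a failover."""
    if interval_seconds not in ALLOWED_INTERVALS[version]:
        raise ValueError(f"{version} does not offer a {interval_seconds}s interval")
    # A planned failover flushes outstanding changes first, so nothing is lost.
    return 0 if planned else interval_seconds
```

For a WS2012 R2 virtual machine replicating every 30 seconds, an unplanned failover gives a maximum RPO of 30 seconds, while a planned failover of the same machine gives zero.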

Testing the Business Continuity Plan

When we discussed the concepts of Disaster Recovery, we talked about the need for testing the Business Continuity Plan. Hyper-V Replica offers the option to perform a test failover, which does not impact the replica virtual machines. This is important because a disaster might happen during a test window:

  • Replication will continue
  • You can quickly bring online the replica virtual machines in a planned/unplanned failover

The method used to enable a test failover without impacting replication is quite elegant. A clone of the virtual machine configuration is created. This virtual machine is given a differencing virtual hard disk that points to the replica’s virtual hard disk (or recovery point snapshot) as its parent disk. This gives you an instantly created and slimmed-down clone of the replica virtual machine. Not only do you have zero impact on replication, but the storage space consumed is also minimized.
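
If you have not met parent and child disks before, the idea is copy-on-write. The following Python sketch is a toy model with made-up names, not the VHD/VHDX format: a dictionary of blocks stands in for the parent disk, the child never writes to it, and ongoing replication into the parent is not modeled.

```python
from typing import Dict


class DifferencingDisk:
    """Toy child disk: reads fall through to the parent, writes stay local."""

    def __init__(self, parent: Dict[int, bytes]) -> None:
        self._parent = parent              # never modified by this class
        self._delta: Dict[int, bytes] = {}  # blocks changed by the test VM

    def read(self, block: int) -> bytes:
        """Return the child's version of a block if it has one, else the parent's."""
        if block in self._delta:
            return self._delta[block]
        return self._parent.get(block, b"")

    def write(self, block: int, data: bytes) -> None:
        """Store the change in the child only, leaving the parent untouched."""
        self._delta[block] = data

    def changed_blocks(self) -> int:
        """Number of blocks the child stores itself (its storage footprint)."""
        return len(self._delta)
```

The test virtual machine sees a complete disk, but the child holds only the blocks written during the test, and throwing the child away leaves the replica exactly as it was.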

Virtual Machine IP Addresses

Getting your VMs online in the DR site is pointless if they cannot be contacted. There are several ways to deal with this, including:

  • Stretched VLANs: One of the more complex options is to stretch VLANs from the production site to the DR site so the addresses of VMs/services do not need to change. This simplifies DNS but is a challenge for network administrators. This is not an option for multi-tenant DR sites.
  • IP Address Virtualization Appliances: An option for massive enterprises and telecoms companies is to use a device that abstracts the actual IP addresses being used by virtual machines and services in the primary and secondary sites. The appliances sit between the clients and the primary and secondary data centers. The virtual IP presented by the appliance is used by DNS and the appliance routes traffic according to service/VM placement.
  • DHCP: Virtual machines are given reserved (statically assigned) DHCP addresses in each site. This requires static MAC addresses for each virtual NIC (required anyway for Linux) and adds considerable complexity.
  • IP Address Injection by HVR: The administrator of the secondary site Hyper-V hosts can configure alternative IP addresses for each virtual NIC. These addresses will be injected into the virtual machine during a failover (test or real).
  • Windows/Hyper-V Network Virtualization (WNV/HNV): Most of the other methods require some form of downtime as DNS sorts itself out. They also assume that changing the IP addresses of services won’t have consequences. HNV/WNV doesn’t change anything because the VLANs are abstracted as virtual subnets so the virtual machines and their services will carry on as normal. HNV/WNV should be used by multi-tenant cloud DR service providers.
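
To illustrate the IP address injection option, here is a small Python sketch of a failover address plan. It is only a model of the idea with hypothetical names and addresses, not the HVR feature itself: each (virtual machine, virtual NIC) pair maps to an alternative address, and the plan is checked against the DR subnet before it is trusted.

```python
import ipaddress
from typing import Dict, Tuple

NicKey = Tuple[str, str]  # (virtual machine name, virtual NIC name)


def validate_failover_plan(plan: Dict[NicKey, str], dr_subnet: str) -> None:
    """Raise ValueError if any alternative address falls outside the DR subnet."""
    network = ipaddress.ip_network(dr_subnet)
    for (vm, nic), address in plan.items():
        if ipaddress.ip_address(address) not in network:
            raise ValueError(f"{vm}/{nic}: {address} is not in {dr_subnet}")


def address_after_failover(key: NicKey, current: str, plan: Dict[NicKey, str]) -> str:
    """Return the planned alternative address, or the current one if none is set."""
    return plan.get(key, current)
```

With a plan of {("web01", "nic0"): "10.2.0.15"} and a DR subnet of 10.2.0.0/24, validation passes and web01 would come up on 10.2.0.15 after a failover, while any virtual NIC missing from the plan keeps its existing address.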