In this post, I will explain how storage resiliency reduces the downtime that transient storage issues cause for virtual machines running on Windows Server 2016 (WS2016) Hyper-V.
Storage Is Not Perfect
No matter how much money you spend on storage, outages will happen. Some folks think that because they have spent a fortune on switches, SAN controllers, disk trays, disks, and cables, downtime will never happen to them. I do love to burst bubbles! Sadly, no storage system is impervious to problems. I have known of a few sites, including a rumor of a certain large software and services company, that have had massive SAN outages. Outages like these can lead to corrupted data.
Those headline outages are few and far between. More commonly, you will see the transient error. This is the brief glitch in the controller software, a faulty switch port, or an operator pulling the wrong cable. This is the sort of error that, even though it lasts only a few seconds, can cause significant service disruption.
Let’s pretend that there is a storage glitch in your virtualization farm. Each of your virtual machines is performing reads and writes, that is, input/output (IO) operations, against the storage system. As soon as the glitch happens, the guest OS of each virtual machine will detect a failed IO. It will do what every operating system does. It will protect its own integrity and that of the hosted services by crashing.
After a few seconds, the glitch ends. This is no big deal, right? Wrong. Every virtual machine that was connected to the storage system has crash-dumped and is going to take several minutes to reboot. Most will be fine. Some will require manual intervention and some might even have more severe data or service issues. Those few seconds of a storage blip have just cost the business a ton of money.
A Focus Point for Windows Server 2016
Microsoft spent a lot of time talking to customers when planning for Windows Server 2016. Windows Server 2012 and 2012 R2 went a long way to win over hosting and large enterprise customers. Unfortunately, problems remained. Many of those problems resided outside of Hyper-V and Windows Server. Through meetings and analysis of support calls, Microsoft made some discoveries. Many of the problems that Hyper-V customers were experiencing were being caused by these transient issues in storage or networking. Hyper-V needed to become more tolerant of issues outside of the host.
WS2016 Hyper-V is resilient to these transient issues. There are some variations on how the feature works but the core concept is this:
- A virtual machine attempts to read or write to its virtual hard disk and the operation times out.
- Instead of the guest OS crashing, Hyper-V places the virtual machine into a paused-critical state.
- Soon after the storage reappears, the virtual machine resumes normal operations.
In effect, the virtual machine is frozen until the problem goes away. This saves countless crash-dumps, reboot storms, and the downtime while services return to the business. What might have been a 10-15 minute outage is now a brief pause. Of course, there is some administrative effort to resolve critical reboot failures.
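To see whether any virtual machines on a host are currently frozen in this way, you can filter on the virtual machine state. This is a minimal sketch, assuming an elevated PowerShell session on the Hyper-V host with the Hyper-V module loaded, and assuming that the paused-critical state is reported as the `PausedCritical` state value:

```powershell
# List virtual machines that Hyper-V is currently holding in a paused-critical state.
# Assumption: run on the Hyper-V host itself, in an elevated session.
Get-VM |
    Where-Object { $_.State -eq 'PausedCritical' } |
    Select-Object Name, State, Status
```

An empty result simply means that no virtual machine is paused because of a storage problem at that moment.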
We normally use Shared VHD with guest clusters to increase service availability. This works by using redundant virtual machines. If the service fails on one virtual machine, we move the service to another virtual machine. This is done as quickly as possible.
Therefore, it makes no sense to pause a guest cluster node when IO to the shared VHD file is timing out. Shared VHD leads to some different behavior:
- The IO fails on vNode1.
- Hyper-V removes the shared VHD from vNode1.
- Clustering detects the storage issue on vNode1 and fails the service over to vNode2.
- The service resumes on vNode2.
- Hyper-V checks the connection to the shared disk from vNode1 every 10 minutes. When possible, it automatically reconnects.
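If you are not sure which virtual machines use shared virtual hard disks, and therefore get this alternative behavior, a query along the following lines may help. It is a sketch rather than a definitive check: it assumes that a shared disk is one attached with persistent reservation support, which the Hyper-V module exposes as the `SupportPersistentReservations` property.

```powershell
# List virtual hard disks that are attached with sharing enabled.
# Assumption: shared disks report SupportPersistentReservations as $true.
Get-VM |
    Get-VMHardDiskDrive |
    Where-Object { $_.SupportPersistentReservations } |
    Select-Object VMName, ControllerType, ControllerLocation, Path
```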
The following are supported for Storage Resiliency:
- Generation 1 and Generation 2 virtual machines on WS2016 Hyper-V
- VHD, VHDX, and Shared VHD
- CSV on block storage (SAS, iSCSI, Fibre Channel, and FCoE)
- SMB 3.0 storage with continuous availability (Scale-Out File Server)
The following are not supported:
- Pass-through disks
- Local disks without CSV
- USB storage
- SMB 3.0 storage on normal file servers or file server clusters
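To compare a virtual machine against these lists, it helps to see where each of its disks actually lives. The following is a rough sketch for a virtual machine named VM1. The `PassThrough` column is a hypothetical label that I have added for illustration, and it assumes that `DiskNumber` is only populated for pass-through disks:

```powershell
# Show where each disk of VM1 is stored so that it can be compared with the lists above.
# Assumption: DiskNumber is empty for VHD/VHDX files and set only for pass-through disks.
Get-VMHardDiskDrive -VMName VM1 |
    Select-Object VMName, Path, DiskNumber,
        @{ Name = 'PassThrough'; Expression = { $null -ne $_.DiskNumber } }
```

The `Path` value then tells you whether a virtual hard disk sits on a CSV, an SMB share, or a local disk.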
Configuring Storage Resiliency
Storage Resiliency is managed via PowerShell on a per-virtual-machine basis. There are two settings to note:
- AutomaticCriticalErrorAction: This is what to do when an IO fails.
- AutomaticCriticalErrorActionTimeout: This is how long a virtual machine can remain in a paused-critical state without resumption of storage. The virtual machine will be powered off after this timeout expires.
For example, the following sets the action to None for a virtual machine named VM1:
Set-VM VM1 -AutomaticCriticalErrorAction None
The AutomaticCriticalErrorAction setting has the following possible values:
- Pause: Pause the virtual machine as soon as IO fails. This is the default.
- None: Let the guest OS do a crash dump.
By default, AutomaticCriticalErrorActionTimeout allows a virtual machine to remain in a paused-critical state for up to 30 minutes. You can set this to any value between 1 minute and 1440 minutes (24 hours).
Set-VM VM1 -AutomaticCriticalErrorActionTimeout 1440
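Before and after changing either setting, it is worth reading the values back. The sketch below assumes the same elevated session on the Hyper-V host. It lists both settings for every virtual machine, and then returns VM1 to the defaults described above:

```powershell
# Review the current Storage Resiliency settings for every virtual machine on the host.
Get-VM |
    Select-Object Name, AutomaticCriticalErrorAction, AutomaticCriticalErrorActionTimeout

# Return VM1 to the defaults described above: pause on IO failure, 30-minute timeout.
Set-VM -Name VM1 -AutomaticCriticalErrorAction Pause -AutomaticCriticalErrorActionTimeout 30
```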
Hopefully, you will not notice Storage Resiliency in action. It is the sort of thing that should pause and resume virtual machines within a very short time. You probably will not ever need to configure the feature. It is on by default. This is one of those things that will increase uptime for you without you having to do anything.