What Is the Storage Resiliency of Windows Server 2016?

In this post, I will explain how storage resiliency reduces downtime for virtual machines running on Windows Server 2016 (WS2016) Hyper-V when transient storage issues occur.

Storage Is Not Perfect

No matter how much money you spend on storage, outages will happen. Some folks think that because they have spent a fortune on switches, SAN controllers, disk trays, disks, and cables, downtime will never happen to them. I do love to burst bubbles! Sadly, no storage system is impervious to problems. I have known of a few sites, including a rumor of a certain large software and services company, that have had massive SAN outages. Outages like these can lead to corrupted data.
Those headline outages are few and far between. More commonly, you will see a transient error: a brief glitch in the controller software, a faulty switch port, or an operator pulling the wrong cable. This is the sort of error that, even though it lasts only a few seconds, can cause significant service disruption.
Let’s pretend that there is a storage glitch in your virtualization farm. Each of your virtual machines is performing reads/writes or inputs/outputs (IO) to the storage system. As soon as the glitch happens, the guest OS of each virtual machine will detect a failed IO. It will do what every operating system does. It will protect the integrity of itself and the hosted services by crashing.
After a few seconds, the glitch ends. This is no big deal, right? Wrong. Every virtual machine that was connected to the storage system has crash-dumped and is going to take several minutes to reboot. Most will be fine. Some will require manual intervention and some might even have more severe data or service issues. Those few seconds of a storage blip have just cost the business a ton of money.

A Focus Point for Windows Server 2016

Microsoft spent a lot of time talking to customers when planning for Windows Server 2016. Windows Server 2012 and 2012 R2 went a long way to win over hosting and large enterprise customers. Unfortunately, problems remained. Many of those problems resided outside of Hyper-V and Windows Server. Through meetings and analysis of support calls, Microsoft made some discoveries. Many of the problems that Hyper-V customers were experiencing were being caused by these transient issues in storage or networking. Hyper-V needed to become more tolerant of issues outside of the host.

Storage Resiliency

WS2016 Hyper-V is resilient to these transient issues. There are some variations on how the feature works but the core concept is this:

A virtual machine attempts to read or write to its virtual hard disk and the operation times out.
Instead of the guest OS crashing, Hyper-V places the virtual machine into a paused-critical state.
Soon after the storage reappears, the virtual machine resumes normal operations.

In effect, the virtual machine is frozen until the problem goes away. This saves countless crash-dumps, reboot storms, and the downtime while services return to the business. What might have been a 10-15 minute outage is now a brief pause. Of course, there is some administrative effort to resolve critical reboot failures.
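
If you want to see whether any virtual machines are currently being held in this state, a minimal sketch follows. It assumes the Hyper-V PowerShell module on the host and that the paused-critical state is reported as PausedCritical in the State property returned by Get-VM.

```powershell
# List VMs on the local host that are currently paused because of a critical error.
# Assumption: the paused-critical state is exposed as 'PausedCritical' in the State property.
Get-VM |
    Where-Object { $_.State -eq 'PausedCritical' } |
    Select-Object Name, State, Status
```

An empty result simply means that no virtual machine on that host is paused-critical at the moment.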

Shared VHD

We normally use Shared VHD with guest clusters to increase service availability. This works by using redundant virtual machines. If the service fails on one virtual machine, we move the service to another virtual machine. This is done as quickly as possible.

Therefore, it makes no sense to pause a guest cluster node when IO to the shared VHD file is timing out. Shared VHD leads to some different behavior:

The IO fails on vNode1.
Hyper-V removes the shared VHD from vNode1.
Clustering detects the storage issue on vNode1 and fails the service over to vNode2.
The service resumes on vNode2.
Hyper-V checks the connection to the shared disk from vNode1 every 10 minutes. When possible, it automatically reconnects.

Supported Systems

The following are supported for Storage Resiliency:

Generation 1 and Generation 2 virtual machines on WS2016 Hyper-V
VHD, VHDX, and Shared VHD
CSV on block storage: SAS, iSCSI, Fibre Channel, and FCoE
SMB 3.0 storage with continuous availability: Scale-Out File Server

The following are not supported:

Pass-through disks
Local disks without CSV
USB storage
SMB 3.0 storage on normal file servers or file server clusters

Configuring Storage Resiliency

Storage Resiliency is managed, via PowerShell, on a per-virtual machine basis. There are two settings to note:

AutomaticCriticalErrorAction: This is what to do when an IO fails.
AutomaticCriticalErrorActionTimeout: This is how long a virtual machine can remain in a paused-critical state without resumption of storage. The virtual machine will be powered off after this timeout expires.

Set-VM VM1 -AutomaticCriticalErrorAction None

The AutomaticCriticalErrorAction setting has the following possible values:

Pause: Pause the virtual machine as soon as IO fails. This is the default.
None: Let the guest OS do a crash dump.

The AutomaticCriticalErrorActionTimeout will allow a virtual machine to remain in a paused-critical state for up to 30 minutes. This is the default. You can set this between 1 minute and 1440 minutes.

Set-VM VM1 -AutomaticCriticalErrorActionTimeout 1440
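
Before changing anything, it can help to check what a virtual machine is currently set to. The sketch below reuses the example name VM1 from the commands above. The 60-minute timeout is an arbitrary illustrative value, not a recommendation.

```powershell
# 'VM1' is the example VM name used in this post; substitute your own.
# Show the current storage resiliency settings for the VM.
Get-VM -Name 'VM1' |
    Select-Object Name, AutomaticCriticalErrorAction, AutomaticCriticalErrorActionTimeout

# Set both values in one call: pause on IO failure, then power off
# if storage has not returned after 60 minutes (illustrative value only).
Set-VM -Name 'VM1' -AutomaticCriticalErrorAction Pause -AutomaticCriticalErrorActionTimeout 60
```

Setting both parameters in a single Set-VM call keeps the action and its timeout consistent with each other.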

Wrapping Up

Hopefully, you will not notice Storage Resiliency in action. It is the sort of thing that should pause and resume virtual machines within very short amounts of time. You probably will never need to configure the feature because it is on by default. This is one of those things that will increase uptime for you without you having to do anything.

by Aidan Finn
May 25, 2017

Aidan Finn, Microsoft Most Valuable Professional (MVP), has been working in IT since 1996. He has worked as a consultant and administrator for the likes of Innofactor Norway, Amdahl DMR, Fujitsu, Barclays and Hypo Real Estate Bank International where...
