This article describes some of the improvements that you’ll see in the availability services offered by failover clustering in Windows Server 2016 (WS2016) Technical Preview 5 (TP5).
Keeping Services Running with Failover Clustering
Failover clustering is all about keeping services running. And the failover clustering team is all about feedback. With each release of Windows Server, the team has improved uptime for services running on Windows Server and Hyper-V, and this continues with WS2016. Most of what you’ll read about in this article is the result of feedback from customers and partners, which proves that participating in the process works and gives us a better product.
Cluster Rolling Upgrade
Recent years have proven that Windows Server isn’t released every three years. Releases are much more frequent, and customers have been left behind. This isn’t just because of license costs or fear; it’s because upgrading a Hyper-V, storage, or application cluster in place just wasn’t possible. We had to build a new cluster and swing the highly available services across. That’s why you’ll still find lots of Windows Server 2008 R2 clusters out there, even though those companies own Windows Server 2012 R2 licensing.
Microsoft needed to allow cluster upgrades, especially if we might have a more frequent release schedule in the future. This is why we have Cluster Rolling Upgrade, a process that’s similar to a domain upgrade. It works for clusters running Windows Server 2012 R2 or later. For each node in the cluster, you will:
- Drain it of resources, such as virtual machines.
- Evict the node from the cluster.
- Rebuild the evicted node with the newest version of Windows Server, such as Windows Server 2016.
- Join the node to the cluster.
While nodes are running different versions, the cluster functions in what is called mixed mode, offering only down-level functionality. You should complete this process as quickly as possible to avoid unforeseen issues. When every node is running the newest version of Windows Server, you will:
- Upgrade the functional level of the cluster with a single PowerShell command.
- Upgrade the version level of each Hyper-V virtual machine, in the case of Hyper-V clusters.
Now you will have all the features of the current version of Windows Server.
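The procedure above can be modelled in a few lines. The following Python sketch is only an illustration: the `Cluster` class, its methods, and the version labels are hypothetical and are not a Microsoft API. It shows the cluster staying at the down-level functional level while in mixed mode, and refusing to raise that level until every node has been rebuilt. On a real cluster, the final step is the PowerShell cmdlet `Update-ClusterFunctionalLevel`.

```python
from dataclasses import dataclass, field

# Hypothetical version labels, used only for this illustration.
WS2012R2 = 8
WS2016 = 9


@dataclass
class Cluster:
    nodes: dict                                # node name -> OS version
    functional_level: int = WS2012R2
    vms: dict = field(default_factory=dict)    # VM name -> host node

    def drain(self, node):
        """Move every VM off `node` to the remaining nodes, round-robin."""
        targets = [n for n in self.nodes if n != node]
        moved = [vm for vm, host in self.vms.items() if host == node]
        for i, vm in enumerate(moved):
            self.vms[vm] = targets[i % len(targets)]

    def upgrade_node(self, node, new_version):
        """Drain, evict, rebuild, and rejoin a single node."""
        self.drain(node)
        del self.nodes[node]               # evict
        self.nodes[node] = new_version     # rebuild with the new OS, then rejoin

    @property
    def mixed_mode(self):
        """True while the nodes are not all on the same OS version."""
        return len(set(self.nodes.values())) > 1

    def update_functional_level(self):
        """Raise the functional level; allowed only once mixed mode has ended."""
        if self.mixed_mode:
            raise RuntimeError("cluster is still in mixed mode")
        self.functional_level = min(self.nodes.values())


cluster = Cluster(nodes={"N1": WS2012R2, "N2": WS2012R2},
                  vms={"vm1": "N1", "vm2": "N2"})
for name in list(cluster.nodes):
    cluster.upgrade_node(name, WS2016)
cluster.update_functional_level()
```

Calling `update_functional_level()` after only one node has been rebuilt raises an error in this model, which mirrors the rule that the functional level is raised last.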
Storage Replica
Storage Replica, the new synchronous and asynchronous volume replication feature in Windows Server 2016, is supported by failover clustering and allows you to stretch clusters without using third-party storage replication.
Cloud Witness
Speaking of stretched clusters, a requirement has always been to have a file share in a third site that can act as a witness to avoid split-brain scenarios if the stretch cluster were to become partitioned due to network issues. Not everyone has a third site, and because that file share really should itself be hosted on a cluster, the requirement adds even more complexity.
Cloud Witness allows you to use a storage account in Azure as an alternative to the file share witness. The witness is a small blob in that account that consumes a tiny amount of storage and will probably cost just a few cents per month.
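To see why a witness matters, here is a minimal Python sketch of majority voting. The function name and data layout are my own assumptions, and the model deliberately ignores refinements such as dynamic quorum. Each node has one vote, the witness adds one more, and a partition keeps running only if it holds a strict majority of all votes.

```python
def partitions_with_quorum(partitions, witness_owner=None):
    """Return the names of the partitions that keep quorum.

    partitions: mapping of partition name -> list of node names.
    witness_owner: the partition that can reach the witness, or None
    when no witness is configured.
    """
    total_votes = sum(len(nodes) for nodes in partitions.values())
    if witness_owner is not None:
        total_votes += 1  # the witness contributes one extra vote

    winners = []
    for name, nodes in partitions.items():
        votes = len(nodes) + (1 if name == witness_owner else 0)
        if 2 * votes > total_votes:  # strict majority
            winners.append(name)
    return winners


# A four-node stretch cluster split down the middle by a network failure.
split = {"site_a": ["A1", "A2"], "site_b": ["B1", "B2"]}
no_witness = partitions_with_quorum(split)
with_witness = partitions_with_quorum(split, witness_owner="site_a")
```

Without a witness, neither half has a majority, so `no_witness` is empty. With the witness reachable from one site, exactly that site keeps quorum and the other stops, so the two halves can never both run the same services.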
Site-Aware Failover Clusters
Stretch clusters are difficult, and a perennial challenge has been how to keep virtual machines running in the primary site during limited failures or maintenance. We can group nodes based on physical location in WS2016, and this affects many levels of operation within the cluster. The summarized version is that communications are more efficient, and virtual machines stay within their primary site unless they need to move to another location.
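The placement preference can be pictured with a short Python sketch. This is not how the cluster service is implemented; `pick_failover_target` and its arguments are hypothetical names chosen for illustration. The idea is simply to prefer a healthy node in the same site and to cross sites only when nothing local is left.

```python
def pick_failover_target(failed_node, node_sites, healthy_nodes):
    """Choose where to restart the workloads of `failed_node`.

    node_sites: mapping of node name -> site name.
    healthy_nodes: iterable of node names that are currently up.
    Returns a node name, or None if no healthy node exists.
    """
    home_site = node_sites[failed_node]
    candidates = [n for n in healthy_nodes if n != failed_node]
    same_site = [n for n in candidates if node_sites[n] == home_site]
    # Stay in the primary site when possible; otherwise use any other site.
    preferred = same_site or candidates
    return preferred[0] if preferred else None


sites = {"A1": "Dublin", "A2": "Dublin", "B1": "Cork", "B2": "Cork"}
local_target = pick_failover_target("A1", sites, ["A2", "B1", "B2"])
remote_target = pick_failover_target("A1", sites, ["B1", "B2"])
```

Here `local_target` stays in Dublin, while `remote_target` moves to Cork only because no Dublin node is healthy.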
Virtual Machine Node Fairness
Load balancing of virtual machines will be possible without System Center. It’s a simple mechanism, but it should be more than enough for most, if not all, admins. The feature is turned on by default (you can turn it off, and it’s automatically disabled when you use SCVMM Dynamic Optimization), and the aggressiveness of the algorithm can be controlled.
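As a rough picture of what such a mechanism does, consider the Python sketch below. It is a simplification under my own assumptions, not Microsoft’s algorithm: load is a single abstract number per virtual machine, and the hypothetical `threshold` parameter stands in for the aggressiveness setting, with a lower value triggering moves sooner.

```python
def rebalance(hosts, threshold=0.8):
    """Return a list of (vm, source, target) moves and apply them to `hosts`.

    hosts: mapping host -> {"capacity": float, "vms": {vm_name: load}}.
    A host is treated as overcommitted when load / capacity > threshold.
    """
    def utilization(host):
        return sum(hosts[host]["vms"].values()) / hosts[host]["capacity"]

    moves = []
    # Visit the busiest hosts first.
    for source in sorted(hosts, key=utilization, reverse=True):
        while utilization(source) > threshold:
            target = min(hosts, key=utilization)  # least-loaded host
            vm, load = min(hosts[source]["vms"].items(), key=lambda kv: kv[1])
            target_after = (sum(hosts[target]["vms"].values()) + load) \
                / hosts[target]["capacity"]
            # Give up if the move would only shift the problem elsewhere.
            if target == source or target_after > threshold:
                break
            del hosts[source]["vms"][vm]
            hosts[target]["vms"][vm] = load
            moves.append((vm, source, target))
    return moves


hosts = {
    "H1": {"capacity": 100, "vms": {"a": 50, "b": 40}},
    "H2": {"capacity": 100, "vms": {"c": 10}},
}
moves = rebalance(hosts, threshold=0.8)
```

In this example H1 starts at 90 percent, so the smallest virtual machine on it is moved to H2, leaving both hosts at 50 percent.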
Virtual Machine Start Order
We can group and order the startup of virtual machines in WS2016 Hyper-V clusters. Groups can model the tiers of a service, and dependencies can be defined between them. Virtual machines will not start until the groups they depend on have started.
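Ordering by dependency is a topological sort. The Python sketch below uses the standard library’s `graphlib` module (Python 3.9 or later) to illustrate the idea; the group names are hypothetical, and this is not how the cluster stores or evaluates its own dependencies.

```python
from graphlib import TopologicalSorter


def start_order(depends_on):
    """Return the groups in an order that starts every dependency first.

    depends_on: mapping of group name -> set of groups that must be
    running before that group may start.
    """
    return list(TopologicalSorter(depends_on).static_order())


# A three-tier service: web needs app, and app needs the database.
tiers = {
    "web": {"app"},
    "app": {"database"},
    "database": set(),
}
order = start_order(tiers)  # ['database', 'app', 'web']
```

If two groups depend on each other, `static_order()` raises `graphlib.CycleError`, which is a reminder that circular dependencies cannot be started in any order.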
Virtual Machine Resiliency
Microsoft has found that a lot of so-called Hyper-V outages are actually caused by third-party issues, such as storage or networking glitches that result from bugs, failures, or human operational error. Microsoft was asked to figure out a way to make clusters more tolerant of these transient issues without crashing virtual machines, so that’s just what Microsoft did in WS2016.
There are two features. The first is compute resiliency, where the Hyper-V cluster will be more tolerant of a node becoming isolated and failing to heartbeat. Hyper-V nodes already have a longer heartbeat timeout, but sometimes a node still becomes isolated and the cluster fails over its virtual machines, which must boot up on another node, so services suffer downtime.
Compute resiliency tells a cluster to be more tolerant and to wait a little longer. If, however, a node becomes isolated three times within an hour, the node will be quarantined, and virtual machines will be moved to other nodes, preferably by Live Migration. All settings are configurable.
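The quarantine rule is a sliding-window count. The Python sketch below models it with the values quoted above as defaults; `IsolationTracker` and its parameter names are hypothetical and are not the names of the real cluster settings.

```python
from collections import deque


class IsolationTracker:
    """Track isolation events for one node and decide when to quarantine it."""

    def __init__(self, threshold=3, window_seconds=3600):
        self.threshold = threshold      # isolations tolerated within the window
        self.window = window_seconds    # length of the sliding window
        self.events = deque()           # timestamps of recent isolations

    def record_isolation(self, now):
        """Record an isolation at time `now` (in seconds).

        Returns True if the node should now be quarantined.
        """
        self.events.append(now)
        # Forget isolations that are older than the window.
        while now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold


flaky = IsolationTracker()
results = [flaky.record_isolation(t) for t in (0, 1800, 3000)]
```

Three isolations within 50 minutes trip the rule on the third event, so `results` is `[False, False, True]`. If the same three events were spread over more than an hour, the oldest would age out and the node would stay in the cluster.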
The second feature is storage resiliency. Storage, even your $500,000 SAN, is not perfect and has issues. Historically, when a virtual machine failed to write an update to storage, the virtual machine crashed. Starting with WS2016, the virtual machine will pause, for up to a pre-determined time, until the storage resumes and the virtual machine can successfully write the changes to disk. Yes, the virtual machine will pause and be out of action, but there isn’t a time-consuming reboot, and services are back as soon as the storage is stable again.
Cluster Diagnostics
Sometimes bad stuff happens, and you need to figure out why. The cluster log files are usually a great place to dig deep, and Microsoft has improved their contents to make them easier to read.
The other improvement is a new Active Memory Dump, which filters out memory pages that are allocated to virtual machines. This means that crash dumps on Hyper-V hosts will be smaller, easier to upload to Microsoft, and have a lot less clutter.
Workgroup and Multi-Domain Clusters
Before I go anywhere with this topic, I think that Hyper-V admins should not start celebrating yet. For features such as Live Migration, we need a common Active Directory. However, I do think this is just step one, or I hope it is, and maybe one day, we will have AD-less Hyper-V clusters with full functionality. So what’s the fuss about? We can run clusters where:
- Nodes are in a single domain (previously the only supported scenario).
- Nodes are spread across different domains.
- Nodes are in a workgroup.
The two new scenarios will be good for DMZ and some SQL workloads, but make sure you understand the limitations before you start planning.