This post will explain the use of availability sets and how to deploy them with Azure Resource Manager (ARM) or Cloud Solution Provider (CSP) virtual machines.
Update Domains & Fault Domains
Like every form of computing (physical, virtual, or cloud), Azure has outages. Some of these outages can be planned, such as host patching, and some are unplanned, such as power failures. Microsoft has designed Azure to deal with this so that you can maximize the uptime of services that are running in virtual machines. This involves two concepts:
- Fault domain: A group of hosts that share common power and network connections. During a localized outage, the issue is constrained within a single fault domain. For example, a power distribution unit failure knocks a rack of hosts offline.
- Update domain: A logical boundary that controls how Microsoft deploys planned maintenance. Microsoft will only perform planned maintenance on one update domain at a time. There will be several update domains within a fault domain.
Let’s imagine a scenario where you deploy a tier of a service, such as five load balanced web servers, in Azure. You’ve deployed 5 web servers because:
- You need scaled-out capacity
- You are probably allowing for one web server going offline
But what if Azure places 3 of those web servers in the same update domain? When Microsoft deploys updates to Azure, the underlying hosts will reboot and 3 of your web servers will go offline at once! What if all 5 of your virtual machines are placed into a single fault domain and a fuse blows in the common network switch? Now your entire web service is knocked offline and you are out of business. This is why, on premises, we deploy anti-affinity in Hyper-V or vSphere, and why we can use availability sets in Azure.
A couple of notes:
- When you load balance or NAT ARM virtual machines, you are forced into using availability sets.
- Microsoft announced on 27th July that they are reducing planned maintenance downtime for virtual machines to a maximum of 30 seconds (it sounds like Hyper-V Quick Migration), and will improve this to no perceivable downtime before the end of 2016 (Live Migration is coming to Azure?).
- Virtual machines must be in an availability set to qualify for the Azure virtual machine SLA. Please don’t be silly, thinking that you’ll put a single domain controller and a single file server into an availability set to qualify!
An availability set is a way of tagging a group of virtual machines that perform the same task, such as a pair of domain controllers, the nodes of a SQL cluster, or a set of load balanced virtual machines. The availability set instructs Azure in how to place the virtual machines in different fault domains and update domains.
Classic or ASM virtual machines are pretty flexible about when you can assign an availability set; ARM virtual machines can only be assigned to an availability set at the time of the creation of the virtual machine. You cannot assign an ARM virtual machine to an availability set after creation.
Tip: If you need to move an existing ARM virtual machine into an availability set, delete the virtual machine while keeping its disks. Then create a new virtual machine from the existing disks, and make sure you assign it to the required availability set during creation.
You can pre-create an availability set before you create your virtual machines: browse to Availability Sets in the Azure Portal, click Add, name the availability set, and place it into the resource group that your virtual machines will reside in.
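If you prefer to describe the availability set in an ARM template rather than clicking through the portal, the resource could look something like the output of the sketch below. This is only an illustration: the helper function, the name "web-avset", the location "westeurope", and the API version are my assumptions, so check them against the current template reference before using them. The limits it checks are the fault domain and update domain ranges described in the next paragraph.

```python
import json


def availability_set_resource(name, location, fault_domains=3, update_domains=5):
    """Build a dict shaped like an ARM template availability set resource.

    Hypothetical helper for illustration only; the ranges checked here are
    the ones quoted in this post (1-3 fault domains, 1-20 update domains).
    """
    if not 1 <= fault_domains <= 3:
        raise ValueError("fault_domains must be between 1 and 3")
    if not 1 <= update_domains <= 20:
        raise ValueError("update_domains must be between 1 and 20")
    return {
        "type": "Microsoft.Compute/availabilitySets",
        "apiVersion": "2016-03-30",  # assumption: verify the version you need
        "name": name,
        "location": location,
        "properties": {
            "platformFaultDomainCount": fault_domains,
            "platformUpdateDomainCount": update_domains,
        },
    }


# "web-avset" and "westeurope" are placeholder values.
resource = availability_set_resource("web-avset", "westeurope")
print(json.dumps(resource, indent=2))
```

The resulting JSON would sit in the resources section of a template, alongside the virtual machines that reference the availability set.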
You can customize an availability set only at the time of creation. You can configure how many fault domains (between 1 and 3) and how many update domains (between 1 and 20) you wish to support. For example, if you used the defaults (3 fault domains and 5 update domains) and deployed 6 web servers in an availability set, your virtual machines would be:
- Spread across the 3 fault domains.
- Placed in the update domains in turn.
However, because there would be more virtual machines (6) than update domains (5), you would have one update domain that contains 2 virtual machines.
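The placement in this example can be sketched in a few lines of Python. Note the assumption: I am modelling "in turn" as a simple round-robin, which is a simplification for illustration and not a description of Azure's actual placement logic. The function name place_vms is hypothetical.

```python
from collections import Counter


def place_vms(vm_count, fault_domains=3, update_domains=5):
    """Assign each VM to a (fault domain, update domain) pair in turn.

    Assumption: plain round-robin, used here only to illustrate the counts.
    """
    return [(i % fault_domains, i % update_domains) for i in range(vm_count)]


placement = place_vms(6)

# Count how many virtual machines land in each domain.
per_fault_domain = Counter(fd for fd, _ in placement)
per_update_domain = Counter(ud for _, ud in placement)

print("VMs per fault domain: ", dict(per_fault_domain))
print("VMs per update domain:", dict(per_update_domain))
```

With 6 virtual machines, each of the 3 fault domains ends up with 2 machines, and the first update domain is the one that holds 2 machines while the other four hold 1 each.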
Imagine I deployed an availability set with 3 fault domains and 20 update domains, and I created 60 web server virtual machines in this set. The 60 virtual machines would be spread evenly over the 20 update domains, with 3 machines in each. During planned maintenance, I should expect 3 virtual machines to be affected at a time. My 20 update domains cannot be spread evenly over 3 fault domains (20 / 3), so I should expect 7 update domains to exist in two of the fault domains and 6 in the third. That means a localized outage could impact up to 21 virtual machines (7 update domains x 3 machines) at a time.
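The arithmetic for this scenario can be checked with a short Python sketch. It follows the model used in this post, where update domains sit inside fault domains, and rounds up to get the worst case. The function name worst_case_impact is hypothetical.

```python
import math


def worst_case_impact(vm_count, fault_domains, update_domains):
    """Estimate the worst-case number of VMs affected by each kind of outage.

    Assumes VMs are spread as evenly as possible over update domains, and
    update domains as evenly as possible over fault domains.
    """
    vms_per_update_domain = math.ceil(vm_count / update_domains)
    update_domains_per_fault_domain = math.ceil(update_domains / fault_domains)
    return {
        "planned_maintenance": vms_per_update_domain,
        "localized_outage": vms_per_update_domain * update_domains_per_fault_domain,
    }


impact = worst_case_impact(vm_count=60, fault_domains=3, update_domains=20)
print(impact)  # {'planned_maintenance': 3, 'localized_outage': 21}
```

Running the same function with your own numbers before you deploy is a quick way to see how much capacity you could lose at once.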
Availability sets are extremely rigid in an ARM/CSP deployment, so make sure you plan your availability sets before you start deploying your virtual machines or resource groups.