Amazon’s cloud service, AWS, recently had a major outage that affected almost an entire region. With that outage fresh in our memories, I thought I’d write a post on how to prevent services outages in Azure.
Note that the focus of this article is Infrastructure-as-a-Service (IaaS).
The AWS Outage
Amazon Web Services had a major outage on February 28th that affected thousands of its customers, including some very well-known names, and this likely affected millions of users. I was busy teaching a course on Azure at the time and did not notice the outage myself, but I did read that a large section of the Internet was affected. An after-action report by Amazon stated that the entire outage was the result of a typo by an administrator, who was supposed to remove a few machines (by command line, with obviously no process to prevent such mistakes) but accidentally remove a lot of machines.
It would be easy for a pro-Microsoft person to crow about Amazon’s failure, but Microsoft has not been immune to operational error, either — in 2014, a procedural update error brought down Azure’s storage system.
One of the four cloud myths that I debunk in my aforementioned training course is that you get disaster recovery by default from your cloud vendor (such as Microsoft and Amazon). Everything in the cloud is a utility, and every utility has a price. If you want it, you need to pay for it and deploy it, and this includes a scenario in which a data center burns down and you need to recover. If you didn’t design in and deploy a disaster recovery solution, you’re as cooked as the servers in the smoky data center.
In this post, I’ll explain a few things that you can do, from the basics to the advanced, to design in resilience for your in-Azure services.
Most outages in Azure are localized and planned, such as updates being deployed to the Hyper hosts by Microsoft. Luckily, the hosts are running a form of Nano Server (fewer updates and faster reboots) and Microsoft got warm reboots (no POST) working on its hosts (a feature that never made it past Technical Preview 1 in Windows Server 2016). Outages to virtual machines are generally very short, but we can still have services outages. There are two ways that we can get guaranteed uptime (i.e., the service level agreement or SLA) from Microsoft:
- Availability Sets: This feature enables us to scale out a service across 2 or more virtual machines and ensure that the virtual machines run in different fault domains and update domains. We can further enhance this uptime by using managed disks instead of un-managed disks (virtual hard disks in storage accounts); managed disks in an availability set ensure that disks are in different storage clusters.
- Premium Storage: A single virtual machine (no scale out or availability sets) gets the SLA from Microsoft if it is only using Premium Storage.
The primary purpose of an availability set is to get the SLA from Microsoft; however, hosting veterans will tell you that this is a financial bet by the hoster. The big outage can and will happen, and the finances cover the hoster for those outages, but will your business survive?
Azure Backup for IaaS Virtual Machines
We can backup running Windows or Linux Virtual machines using Azure Backup. We even have the ability to do item level restores from the backups. If a Microsoft engineer or one of your own administrators was to accidentally delete some machines, then Azure Backup can be used to restore those machines. If a region was to burn down, the default configuration of an Azure recovery services vault (storage used by Azure Backup) replicates the backups to a neighboring region, and you can do your restores there.
But this brings up a point that I constantly have to explain to those that are new to disaster recovery: backup is used when I need to do a small restore and disaster recovery is used when I need to get back an entire service or business. A restore from Azure Backup will take time, and that might be a lot of time if you’re talking about a large virtual machine or lots of virtual machines. I do not view Azure Backup (or any backup solution) as being suitable for a scenario such as the AWS one, but I still would want my backups at my secondary location so that I can still do those operational restores after the failover.
Design Disaster Recovery
As I said earlier, if you want to have a disaster recovery solution, then you need to design it. Right now, there is no solution for replicating virtual machines in Azure from one region to another, but it is on the way; Microsoft demonstrated Azure-to-Azure recovery using Azure Site Recovery (ASR) at Microsoft Ignite 2016. Instead, if you want a DR solution you’ll need to:
- Duplicate your workloads in a second Azure region.
- Replicate data across a VNet-to-VNet VPN connection.
- Use Traffic Manager to load balance or failover Internet services between the deployments in different Azure regions.
In the event of a disaster, you can failover from the primary location to the secondary one, either automatically or manually by configuring the Traffic Manager rules (I’d recommend scripting this via Azure Automation from a third region). The downside of this type of design is that you have had to double the number of virtual machines deployed and keep them running.
However, we do know that a better solution is coming from Microsoft; Azure-to-Azure Site Recovery will replicate most virtual machines (never mix application replication and virtual machine replication) from your production deployment to most other Azure regions (subject to service availability). In the event of an AWS-style outage, you would execute a recovery plan to failover your virtual machines to another location, and your services would be up and running in a matter of minutes.
Note that there are some that are considering the use of GRS storage accounts as a DR solution; there are two problems with this approach:
- Disk Replication: The metadata of your machine (name, CPU, ram, disk connections, and so forth) are not replicated in the storage account. You’ll have to rebuild everything from the replicated disks using PowerShell. That’ll take quite a few hours, assuming you already have all those scripts debugged and up-to-date with the latest changes to your environment.
- Disk Locks: The virtual hard disks will be locked for several hours by the now-missing virtual machines. You can remove the locks using PowerShell, but this will take time.
The previous issues with GRS storage combined with what’s coming from Azure Site Recovery are why I recommend using the cheaper LRS option, which is all you get from Managed Disks, anyway.
To be honest, the best possible solution is to use multiple cloud services. Imagine if the storage outage from 2014 was to repeat itself. Even if I was armed with a DR solution in Azure, the outage was global, and I would not be able to failover to another Azure region.
Ideally, I should use multiple cloud vendors. For example, my production system would run in Azure and my failover site could be in AWS. But this isn’t a VM replication solution; it’s a data replication solution, so I would have to run networks, storage, and virtual machines in the secondary cloud. If Azure was to have a global outage, I could failover from my production system in Azure to a secondary system in AWS.
The drawbacks of this design are huge:
- I would need to manage multiple vendor contracts, maybe including a third “witness” cloud vendor to redirect clients between the primary and secondary clouds.
- Costs would be more than double.
- The infrastructure is more complex.
- The cost of ownership has greatly increased.
- I would be restricted to common/abstracted services across the two clouds, such as virtual machines and containers.
The Silver Lining
Cloud-doubters often like to pick on these outages and say “this is why I don’t do cloud.” However, these things happen with on-premises deployments, too. Host clusters crash. SANs fail completely. Networks decide to go belly up. And the business suffers with each of these outages. The difference is that when an outage happens in the cloud, the very best of IT is working on a solution. As I like to remind my customers, if there’s a Hyper-V outage, I feel more confident in Microsoft fixing it in Azure than I do in a typical admin fixing it in their computer room or data center. When outages do happen within the big cloud platforms, they are typically short, and because of the financial penalties on the vendor, new processes are created to prevent a repeat. How many of us can truly state the same of our internal IT practices, product knowledge, and skills?