In this post, I will answer the question: is Azure SQL highly available? Spoiler alert! The answer is, yes, Azure SQL has automated high availability. Keep reading to learn how this is accomplished and what this means.
Attaining higher levels of service availability is an attitude thing. You either want to do it for real or you’re not really interested. I have worked in an environment with a 100 percent service level agreement (SLA) that was more of a marketing tool than a lock-down promise; attaining even 99.99 percent high availability is hard and expensive.
The reality is that things go wrong. Even in the cloud, things go wrong. At least in the cloud, the best people are fixing it with the best available tools. Operating systems crash. Patches don’t behave as expected when they’re deployed. Humans pull the wrong cable. And scheduled reboots happen.
I like the attitude of the Azure SQL group, as documented in their document on high availability. Sure, Azure SQL can give you a 99.99 percent SLA but if you fall into the 0.01 percent, then you’re not happy. And you’ll tell everyone you know about it. The Azure SQL team understands that data availability is critical to its customers, so they have engineered a platform that aims to ensure that the database remains online as long as data center does (see Disaster Recovery later in this post).
The key to understanding how Azure SQL remains online is Service Fabric. Service Fabric powers a lot of Microsoft’s cloud services in Azure and outside of Azure, including PowerBI, Skype for Business, Intune, Event Grid, Cosmos DB, and … Azure SQL.
By the way, Microsoft liked its own dog food so much that they released it to the world. Service Fabric is also available to customers using Windows Server or Azure as a microservices platform (Linux or Windows).
The feature of Service Fabric that is relevant is the reliable service partition. A service, or more correctly, a microservice, can be deployed as several instances with each handling a share of a workload across a cluster of machines. This gives us one level of high-availability and scalability by spanning many machines. But each instance can be deployed as a partition. A partition is made up of 1 primary copy and at least one secondary replica. The number of secondaries is definable. Each of these copies are also spanning different machines, giving us a second dimension of high availability.
When a “client” accesses the microservices instance, the transaction is processed by the primary copy of the instance. If the microservice is stateful, data is kept locally on the machine. In this case, data transactions happen only at the primary replica but are replicated to the secondaries. If the microservice is stateless, the data is stored remotely, such as on a backend storage cluster and the backend storage is responsible for data availability.
If the primary replica fails, the service coordinator will automatically switch over to a secondary copy. The data is already there, so there is no service interruption. This secondary is now the primary, and if necessary, the old primary is replaced by a newly created secondary.
SQL Always On
Azure SQL is built on Service Fabric and it uses the reliable service partition concept to provide us with a solution that uses SQL Server Always On. Note that Azure SQL uses the term tenant ring instead of partition.
Each database that you deploy in an Azure region is deployed as a tenant ring in a Service Fabric cluster in a single data center. The tenant ring is made up of a primary and two secondary replicas, each running on independent physical systems. All transactions are carried out by the primary.
There are two kinds of storage in Azure SQL. If you deploy an Azure SQL database using the Basic or Standard tiers, then the database uses remote storage (Service Fabric stateless service). Exactly one copy of the database is kept in blob storage in an Azure storage cluster and it is the responsibility of the storage cluster to provide redundancy, durability, bit-rot detection, and so on.
If you deploy a Premium tier Azure SQL database, then local storage is used (Service Fabric stateful service). The physical machines hosting the tenant ring replicas have SSD storage, offering high IOPS (flash storage) and low latency (flash storage and being in the server). Because the storage resides inside of the tenant ring, it is the responsibility of Azure SQL to make the data highly available. All reads and writes go to the primary replica. The primary uses asynchronous replication to copy the data to the two secondary replicas. To ensure consistency, a quorum-based commit scheme ensures that at least one secondary receives the data before a database transaction is committed.
People often confuse different availability terms with disaster recovery. High availability is about minimizing the downtime to a service inside of a location. Disaster recovery is about making the system available again if a location is no longer available.
Everything that I have described so far is about high availability inside of a single data center in a region. If I deploy an Azure SQL database to East US, it exists in a single facility in that single region only. If you wish to avoid an AWS-style outage from last February, then high availability is not enough. Luckily, Azure SQL does offer functionality to replicate and failover databases, pretty easily, from one region to another.