Best Practices for Domain Controller VMs in Azure

This post will explain the best practices and support policies for deploying domain controllers (DCs) as virtual machines in Microsoft Azure.

What About Azure AD Domain Services?

In the not too distant past, if you wanted to run an application in the cloud with domain membership and consistent usernames and passwords, then you had no choice: you had to deploy at least one, and preferably two or more, domain controllers as virtual machines in the cloud. Azure Active Directory (AD) didn’t offer domain membership, and couldn’t offer the same type of username/password authentication and authorization that you get with Active Directory Domain Services.

However, things have changed … slightly. Azure AD has recently added Domain Services as a generally available and supported feature. But be careful; Azure AD Domain Services might not be what you think it is!
Azure AD Domain Services allows you to deploy a domain-dependent application in the cloud without the additional cost of virtual machines that are functioning as domain controllers. However, Azure AD Domain Services is not another domain controller in your existing domain – in fact, it is not even your existing domain. Using Azure AD Connect you can clone your domain into Azure AD Domain Services. This means that your Organizational Units (OUs), group policies, groups, and so on can live on in the cloud, but in a different domain that is a clone of your on-premises domain.

Stretching an Active Directory domain to Azure virtual machines [Image Credit: Aidan Finn]
If you want your on-premises AD forest to be truly extended into the cloud, then today, the best option is to continue to use virtual machines running the Active Directory Domain Services role. I do suspect that this will eventually change (I hope that AD goes the way of Exchange). My rule of thumb is this: if I want a hybrid cloud with cross-site authentication and authorization, then I will run domain controllers in the cloud.

Backup

Running DCs as virtual machines in Azure is safe, as long as you follow some rules. If your domain controllers run an OS that is older than Windows Server 2012 (WS2012), then you should never copy a domain controller’s virtual hard disks or restore it from backup. Azure supports the VM-GenerationID feature of WS2012 and later, so you can safely restore those domain controllers from backup.
There is a bit of a “gotcha” with this VM-GenerationID feature. The normal practice for shutting down virtual machines in Azure is to do so from the portal or PowerShell. Doing so will deallocate the virtual machine and reset the VM-GenerationID, which is undesirable. We should always shut down domain controllers using the shutdown command in the guest OS; otherwise, when the VM-GenerationID is reset:

  • The invocationID of the AD DS database is reset
  • The RID pool is discarded
  • SYSVOL is marked as non-authoritative

IP Configuration

You should never set the IP configuration of an Azure virtual machine in the guest OS. A new domain controller will complain about having a DHCP configuration; let it complain, because there will be no harm if you follow the correct procedures.
Edit the settings of the NIC of each virtual domain controller in the Azure Portal. Set the NIC to use a static IP address and record this IP address. Your new DC(s) will be the DNS servers of your network; open the settings of the virtual network (VNet) and configure the DNS server settings to use the IP addresses of your new domain controllers.
Note that if you are adding a new domain controller to an existing on-premises domain, then you will need a site-to-site network connection and you should temporarily configure the VNet to use the IP address of one of your on-premises DCs as a DNS server; this will allow your new cloud-based DC to find the domain so that it can join it.
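
The two DNS phases described above can be sketched as a small helper. This is only an illustration of the logic, not an Azure API call; the function name `vnet_dns_servers` and all of the IP addresses are made up for the example.

```python
from ipaddress import ip_address


def vnet_dns_servers(azure_dc_ips, on_prem_dc_ip=None, joining_existing_domain=False):
    """Return the DNS server list to configure on the VNet.

    While a new cloud DC is still joining an existing on-premises domain,
    the VNet temporarily points at one on-premises DC. Once the Azure DCs
    are promoted, the VNet points at their static IP addresses instead.
    """
    if joining_existing_domain:
        if on_prem_dc_ip is None:
            raise ValueError("an on-premises DC IP is needed to find the domain")
        # ip_address() validates the string before we hand it back
        return [str(ip_address(on_prem_dc_ip))]
    return [str(ip_address(ip)) for ip in azure_dc_ips]


# Hypothetical addresses: two Azure DCs and one on-premises DC
azure_dcs = ["10.1.0.4", "10.1.0.5"]
print(vnet_dns_servers(azure_dcs, "192.168.1.10", joining_existing_domain=True))
print(vnet_dns_servers(azure_dcs))
```

Remember to switch the VNet back to the Azure DC addresses once the promotion is complete; the temporary setting is only there so that the new DC can locate the domain.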

Domain Controller Files

I rarely pay attention to anything in the wizard when promoting a new domain controller; it’s all next-next-next, and I doubt I’m unique. However, there is one very important screen that you must not overlook.
Azure implements write caching on the OS disk of virtual machines. This will cause an issue for databases such as AD, which can lead to corruption such as a USN rollback. You must add a data disk, with caching disabled, to the virtual machine and use this new volume to store:

  • AD DS database
  • Logs
  • SYSVOL
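
As a rough sketch of what that wizard screen should end up with, the helper below builds the three paths on a data disk and checks that none of them sits on the OS disk. The dictionary keys mirror the `-DatabasePath`, `-LogPath`, and `-SysvolPath` parameters of the `Install-ADDSDomainController` cmdlet; the `F:` drive letter, the folder names, and the function names are assumptions made for the example.

```python
from pathlib import PureWindowsPath


def dc_storage_paths(data_drive="F:"):
    """Build the AD DS database, log, and SYSVOL paths on the data disk."""
    root = PureWindowsPath(data_drive + "\\")
    return {
        "DatabasePath": str(root / "NTDS"),
        "LogPath": str(root / "NTDS"),
        "SysvolPath": str(root / "SYSVOL"),
    }


def on_os_disk(paths, os_drive="C:"):
    """Return the names of any paths that still point at the OS disk."""
    return [
        name
        for name, path in paths.items()
        if PureWindowsPath(path).drive.upper() == os_drive.upper()
    ]


paths = dc_storage_paths("F:")
for name, path in paths.items():
    print(f"{name} = {path}")
print("Still on OS disk:", on_os_disk(paths))
```

An empty list from `on_os_disk` means that all three locations are on the uncached data disk.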

There is no additional cost for this if you use standard storage disks; standard storage is billed based on the data stored, not on the overall size of the deployed disks. Note that Azure Backup instance charges are based on the size of the disks, but a domain controller shouldn’t hold enough data to exceed the 50GB-500GB price band and incur additional instance charges.

Active Directory Topology

If you work in a large enterprise, then you’ve probably already realized that it would be a good idea to define an AD topology for your new site (Azure). However, many of you work in the small-to-midsized enterprise world, so you’ve never had to do much in AD Sites and Services.
You should deploy the following for each region that you deploy AD DCs into:

  • An AD site
  • A network (subnet) definition for each address space in Azure that will have domain members. Associating this definition with a site will control authentication/authorization and AD replication traffic.
  • A site link to control the flow and timing of AD replication traffic – try to leverage lower cost links if you have a choice.
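
To show why the network definitions matter, here is a minimal sketch of how an IP address maps to an AD site through those definitions. It assumes that the most specific matching definition wins; the site names and address ranges are invented for the example, and this is not how you would query AD itself.

```python
from ipaddress import ip_address, ip_network

# Hypothetical network definitions and the AD site each one is associated with
SUBNET_TO_SITE = {
    "192.168.0.0/16": "OnPrem-HQ",
    "10.1.0.0/16": "Azure-NorthEurope",
    "10.2.0.0/16": "Azure-WestEurope",
}


def site_for(ip, subnets=SUBNET_TO_SITE):
    """Return the site for an IP address, or None if no definition covers it."""
    addr = ip_address(ip)
    matches = [
        (ip_network(cidr), site)
        for cidr, site in subnets.items()
        if addr in ip_network(cidr)
    ]
    if not matches:
        return None
    # Assumption: prefer the most specific (longest prefix) definition
    return max(matches, key=lambda match: match[0].prefixlen)[1]


print(site_for("10.1.4.20"))
print(site_for("172.16.0.9"))
```

A domain member whose address is not covered by any definition has no site association, which is why each Azure address space that contains domain members needs its own definition.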

You can perform some advanced engineering of AD replication to reduce outbound data transfer costs. Be careful because some advanced AD engineering can have unintended consequences!

  • You can disable the Bridge All Site Links option to prevent transitive inter-site replication.
  • If you have multiple site links to your Azure sites, then you can add costs to the links to mimic the costs of your networks. For example, a site-to-site VPN might be more cost effective than an ExpressRoute connection.
  • If you have a lot of data churn in your Azure site(s) that doesn’t affect your on-premises site(s), then you can reduce the frequency of replication to avoid redundant replication.
  • You can disable change notification to further reduce replication — be careful of this feature!
  • Changing the replication compression algorithm can reduce network costs. The DWORD value named "Replicator compression algorithm" under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters controls the algorithm that is used. The default is 3 (Xpress Compress). Changing this value to 2 (MSZip) increases compression but will also increase CPU utilization in domain controllers.
  • Read-Only Domain Controllers (RODCs) do not replicate changes outbound, but they are reliant on a network connection to full domain controllers to retrieve data to perform authentication and authorization.
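
The effect of site link costs and replication intervals can be illustrated with a small model. This is a sketch only: the `SiteLink` class, the link names, and the cost values are hypothetical, and the only rules it encodes are that the lowest-cost link is preferred and that 15 minutes is the most frequent site link interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteLink:
    name: str
    cost: int              # lower cost is preferred
    interval_minutes: int  # how often inter-site replication runs


def preferred_link(links):
    """Return the site link with the lowest cost."""
    return min(links, key=lambda link: link.cost)


def cycles_per_day(link):
    """Count how many replication cycles a link allows per day."""
    if link.interval_minutes < 15:
        raise ValueError("15 minutes is the most frequent site link interval")
    return (24 * 60) // link.interval_minutes


# Hypothetical links: the VPN is given a lower cost than ExpressRoute
links = [
    SiteLink("OnPrem-Azure-VPN", cost=100, interval_minutes=180),
    SiteLink("OnPrem-Azure-ExpressRoute", cost=200, interval_minutes=15),
]
print(preferred_link(links).name)
for link in links:
    print(link.name, cycles_per_day(link), "cycles per day")
```

Lengthening the interval cuts the number of cycles, and so reduces how often data leaves Azure, at the price of slower convergence between sites.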

Read-Only Domain Controllers

RODCs are supported in Azure. You can choose to deploy RODCs in Azure if you need to restrict what secrets are stored in the cloud; you can filter which attributes are available in the cloud if you wish. Most Windows roles work well with RODCs, but make sure that your applications will work well and not become overly dependent on site-to-site network links.

Global Catalog Servers

Every DC in a single-domain forest should be a global catalog server; this does not incur any additional replication traffic (outbound data transfer) costs.
However, multi-domain forests use universal groups and these require careful placement and usage of global catalog (GC) servers. You should place at least one GC server in Azure if you require the multi-domain forest to continue authenticating users if the site-to-site link fails – a GC is required to expand universal group membership, and a DC must verify that the user is not in a universal group with a DENY permission.
Note that the placement or lack of placement of GCs will impact traffic if you have stretched a multi-domain AD forest to the cloud:

  • A lack of a cloud-based GC will cause cross-VPN/ExpressRoute traffic for every authentication.
  • Having one or more GCs in the cloud will increase replication traffic.

ADFS and Azure AD Connect

One of the risks of using ADFS to integrate your AD forest with Azure AD is that all of your cloud services will be unavailable if Azure AD cannot talk to your ADFS cluster. The simplest solution to this is to move ADFS (and some domain controllers) from on-premises to Azure, effectively putting your critical component next door to the service that requires reliable connectivity.

I have also opted to deploy Azure AD Connect in an Azure virtual machine. The benefit is that in a disaster recovery scenario, my connection to Azure AD is already running in the cloud. The downside is latency: it can take up to 15 minutes (with the most frequent option in an AD site link) for a change to replicate from an on-premises AD site to a site in Azure, and then up to 30 minutes (the default and most frequent replication option in Azure AD Connect) for the change to appear in Azure AD. If you cannot wait, you can manually trigger inter-site replication in AD and a synchronization in Azure AD Connect.
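
The worst-case delay is simply the sum of the two intervals. A tiny sketch, using the 15-minute and 30-minute figures from above as defaults (the function name is made up for the example):

```python
def worst_case_sync_minutes(site_link_interval=15, aad_connect_interval=30):
    """Worst-case minutes for an on-premises change to reach Azure AD.

    The change first waits for inter-site AD replication to the Azure site,
    then for the next Azure AD Connect synchronization cycle.
    """
    return site_link_interval + aad_connect_interval


print(worst_case_sync_minutes())      # most frequent settings
print(worst_case_sync_minutes(180))   # a site link that replicates every 3 hours
```

With the most frequent settings, that is up to 45 minutes; a less frequent site link interval pushes the figure up accordingly.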