If an Office 365 Disaster Happened, What Would You Do?
Office 365, Retention, and Disaster Recovery
I recently wrote about how many people don’t realize that Office 365 includes retention policies that can keep SharePoint and OneDrive documents for much longer than the 93-day maximum supported by the traditional recycle bin. The result was a ton of comments, most of which I enjoyed. Mary-Jo Foley and I explored some of those comments in our recent MJFChat.
One piece of feedback was that you shouldn’t confuse retention with disaster recovery. This is absolutely true. As part of the Office 365 data governance framework, retention is all about keeping information for as long as you need it for and removing what you don’t need, which is how Office 365 retention policies work. Policies include settings to specify how long information is kept and what happens when that retention period finishes. In some cases, information is removed (including the option to go through a moderated disposition process). In others, it’s kept and left to the user to decide.
The Nature of Disaster
On the other hand, disaster recovery focuses on getting back to a normal operational state following a major outage. In the on-premises world, this could be an issue affecting a complete datacenter or a major piece of infrastructure that can’t be put right quickly, like anecdotal incidents of telecoms lines being severed.
Companies work out plans for recovering from different types of disasters and create IT infrastructures to cope, insofar as budgets and technology allow. In the cloud, the responsibility for disaster recovery transfers to the service provider. This then raises the question of how capable Office 365 is to cope with a disaster is and how will customers recover from service disruption.
Office 365 Track Record
I’m not privy to Microsoft internal operations so this commentary is limited to observation of incidents since Office 365 was launched in June 2011 and some knowledge of how the major workloads run inside Office 365 from information Microsoft has made public. With that health warning in place, here goes.
I’m unaware of any major disaster that has caused widespread multi-region disruption to Office 365 in the last five years. Sure, there are incidents every day and some of those incidents are more painful than others. The Teams outage of February 2019 is one example as is the failure that affected multi-factor authentication in November 2018. We can also go back to June 2014 to find directory issues that caused multi-hour outages for some U.S. customers.
Incidents, not Disasters
I don’t know if I would call any of these incidents a disaster. Painful and irritating, absolutely. Frustrating too, mostly because Microsoft still struggles with effective communication to customers affected by problems at times. Working in a vacuum (or maybe not being able to work in a vacuum of Microsoft communications) can be dreadful, but it’s not a disaster like a complete datacenter being blown up.
The lightning strike on the San Antonio datacenter in September 2018 might be the closest incident of that type seen to date. The outage affected many Azure services and disrupted Office 365 for customers in several regions, but everything was up-and running within 24 hours.
The San Antonio outage (and November’s MFA problems) pointed to some places in the Microsoft cloud where single point of failures can cause issues for multiple Office 365 datacenter regions. This isn’t good, and Microsoft needs to eliminate these weaknesses.
Office 365 Design Limits Outages
In general, the deployment of apps like Exchange Online, SharePoint Online, OneDrive for Business, and Teams in multiple datacenter regions limits any outage to a single region. An Exchange outage in the UK datacenter is unlikely to affect tenants in Western Europe, or a SharePoint problem in Australia won’t overflow to Japan.
All Office 365 regions have at least two datacenters (some regions have more) to provide redundancy. When problems happen, services can switch between the available datacenters. In smaller regions, the focus is to deliver the core workloads in-region, which means that some other services might come from another datacenter region. You can find out where services are delivered in an Office 365 region from Microsoft’s online map. Figure 1 shows the location of the Canadian region datacenters – in this case, services like Planner, Azure Active Directory, and Yammer come from the U.S. region.
Microsoft’s internal datacenter network is designed to cope with failures too, so if a major physical incident happens (like an explosion at a datacenter) traffic would be routed to a surviving datacenter and service would resume from there.
If Disaster Strikes, What Happens Next?
Getting back to disaster recovery, a more pertinent question is if a disaster happened to stop Office 365 delivering service to customers for extended periods (more than two days), what could those customers do? The answer is “not a lot.”
You might have backups of Exchange and SharePoint data, but where do you restore the data to? On-premises systems aren’t waiting to swing into action when cloud disasters occur, and anyway, some apps don’t work on-premises (Teams, Planner, Yammer) and some cloud features aren’t supported in their on-premises counterparts (like Exchange expanding archives). And given the amount of data now present in cloud Exchange mailboxes and SharePoint libraries, could you restore to on-premises servers in any reasonable time?
Switching to Another Service
If disaster happened, could you switch to another cloud service, like Google G Suite? Migration products do exist, but you’re going to be limited to basic email, calendar, contacts, and documents. Don’t even think about Teams, Office 365 Groups, Planner, or Yammer. Migrations are expensive, time-consuming, and complex, and moving from Office 365 to G Suite won’t happen overnight. You could sign up with Google to restore “dial-tone” services to users while you figure out the migration in the background, but is that any easier than waiting for Microsoft to fix whatever problem caused you to think about switching?
The horrible thing is that once you sign up with a cloud service like Office 365 or G Suite, you hand most operational responsibility over to the service provider. That includes coping with disasters. Unless you’re willing to take your data out of the cloud service and move it elsewhere, all you can do when problems happen is cope as best you can and hope that the service provider restores full operation speedily. As it happens, they have a good record in this respect.
Disasters and Clouds
Don’t confuse data retention with disaster recovery. Good retention practice solves some problems, like users messing up when they delete documents, but it won’t help you recover from an Office 365 disaster because your data is in the affected service and you can’t get to it until service resumes. Even if you have a copy, where can you restore it to? It’s a thought worth considering.