A recent blog post summed up Microsoft’s commitment to reliability and noted that Office 365 achieved an average of 99.98 percent uptime over the first three quarters of 2016. That equates to roughly 80 minutes of downtime across those nine months, which isn’t too bad when you look at it like that!
However, as Tony Redmond pointed out, certain factors govern how the SLA is calculated that can mask the actual availability numbers for a single tenant. Your Office 365 tenant may have achieved only 99.91 percent uptime against the backdrop of a 99.98 percent rating for the service.
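To make the gap between those two percentages concrete, here is a minimal sketch of the arithmetic. It assumes the measurement window runs from January 1 to September 30, 2016 and that downtime is simply the unavailable share of every minute in that window; Microsoft's actual SLA calculation is more involved, and the function name `downtime_minutes` is just an illustration.

```python
from datetime import date


def downtime_minutes(uptime_percent: float, start: date, end: date) -> float:
    """Minutes of downtime implied by an uptime percentage over [start, end)."""
    total_minutes = (end - start).days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# Assumed window: the first three quarters of 2016 (a leap year, so 274 days).
window_start = date(2016, 1, 1)
window_end = date(2016, 10, 1)  # exclusive end, i.e. through September 30

for pct in (99.98, 99.91):
    minutes = downtime_minutes(pct, window_start, window_end)
    print(f"{pct}% uptime -> {minutes:.0f} minutes ({minutes / 60:.1f} hours)")
```

Under these assumptions, the service-wide figure works out to well under an hour and a half of downtime, in the same ballpark as the number quoted above, while a tenant at 99.91 percent would have seen close to six hours. The difference looks tiny as a percentage and rather larger when you are the one waiting.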
In my experience, the major Office 365 workloads such as Exchange, SharePoint, and Skype for Business are built on solid foundations and are run by experts in their field. It’s very unusual to see a business-impacting outage on these platforms affect more than a few tenants for more than a few hours at a time.
Going Deeper into Secondary Office 365 Workloads
The trouble with positioning Office 365 as “more reliable than on-premises services delivered by the majority of IT departments” is that Microsoft’s SLAs don’t give the full picture of availability. The problem I see is with the secondary, less mature workloads, such as Power BI, that are not backed by comprehensive SLAs. For example, Figure 1 shows a screenshot of the Service Health Dashboard for our tenant, displaying a current Power BI incident.
Figure 1: A Power BI outage
This is an example of an Office 365 workload that has had an incident open against it for more than a month, with the next update promised 9 days after the last one. For many people, this may not seem like a big deal. After all, if you have never even heard of Power BI, you won’t care. And anyway, email is working just fine.
Yet more and more organizations are using Power BI, perhaps lured by Microsoft’s promises of intelligent decision making. Here’s an excerpt from a Power BI blog post published in the middle of this incident, which preaches about the benefits of the exact functionality that is currently broken:
“Schedule multiple data refreshes per day to keep everything and everybody up to date. Then you’ll be prepared for real-time analytics at any scale. You can help your colleagues—and your bosses—make decisions and take action based on what’s happening right now, or even make reliable predictions about what’s going to happen.”
Power BI is touted as a service that organizations can use to make reliable business decisions, yet the service itself is not backed by a reliability guarantee via an SLA.
If a company uses Power BI to help decide which stores should receive additional inventory of Christmas trinkets, and that data is affected by an “intermittent” data update failure, the business might make multimillion-dollar decisions based on potentially inaccurate information. The company may also use the same system to report revenues to the stock exchange; if those figures proved inaccurate, the result could be a decimated share value or fines for falsifying information.
Even if Power BI were backed by an SLA that applied to this incident, would a 50 percent reduction in the cost of the company’s Power BI licenses (currently $9.99 per user per month) make any difference to the potential loss of millions of dollars from making business decisions on inaccurate data? Probably not.
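A quick back-of-the-envelope sketch shows how lopsided that comparison is. The license price and the 50 percent credit come from the paragraph above; the 500-user headcount and the $2 million loss are purely hypothetical numbers chosen for illustration, not figures from any real incident or SLA.

```python
from decimal import Decimal

LICENSE_PRICE = Decimal("9.99")  # per user per month, as quoted above
CREDIT_RATE = Decimal("0.50")    # the 50 percent reduction discussed above


def monthly_credit(users: int) -> Decimal:
    """Service credit for one month if every license were discounted."""
    return LICENSE_PRICE * users * CREDIT_RATE


def credit_share_of_loss(users: int, loss: Decimal) -> Decimal:
    """Fraction of a business loss that one month's credit would cover."""
    return monthly_credit(users) / loss


# Hypothetical scenario: 500 Power BI users and a $2 million bad decision.
users = 500
loss = Decimal("2000000")

credit = monthly_credit(users)
share = credit_share_of_loss(users, loss)
print(f"Monthly credit: ${credit:,.2f}")
print(f"Share of loss covered: {float(share):.2%}")
```

In this made-up scenario the credit comes to about $2,497.50, or roughly 0.12 percent of the loss. Scale the user count up tenfold and the conclusion barely changes.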
This problem isn’t confined to just the secondary workloads of Office 365, either. Reliability also affects other components, such as the various APIs used to programmatically interact with Office 365 core services.
Two examples of this are the Management Activity API, which customers use to pull Office 365 audit data into third-party SIEM tools, and the Office 365 Reporting Web Service, which helps customers access usage and adoption metrics from their Office 365 tenant.
Both APIs have been faulty for most of December. Although the APIs are in General Availability status, they are not backed by an SLA. Any issues reported to Microsoft do not even result in the creation of an official Office 365 incident, which then makes it extremely difficult to pursue matters to a fix.
As an ISV, my company has seen thousands of Office 365 tenants globally impacted by API outages in the past three weeks. However, Microsoft has been incredibly slow to acknowledge the issue, let alone provide a resolution. The product team that the problem was eventually escalated to claimed to be working to an SLA of “providing an update every 2 business days.”
As an increasing number of tenants rely on these APIs to provision, report on, and audit cloud services, it becomes increasingly problematic when APIs fall over for long periods.
The original post said it perfectly – “You can really do nothing to restore service but wait for Microsoft to troubleshoot and fix.” A CIO or IT Manager is likely to be crucified if they report to an internal incident review board every day saying “Sorry, I still have no update, Microsoft has suggested that they will give us further information 9 days from now.”
Microsoft Needs to Commit to All of Office 365
The ideal solution would be for Microsoft to extend its commitment to reliability across all of its enterprise services, including the background systems that support Office 365. I am sure that this is the target for Microsoft, but in the meantime, it would go a long way if incidents were acknowledged in the Service Health Dashboard for all of these systems and if updates were provided more frequently. If we can’t fix it ourselves, we at least deserve to know what is going on.
Is this a deal breaker for cloud services? Absolutely not. You just need to be aware of what you are getting into, and that means looking deeper under the hood, past the wonderful marketing metrics that are pushed in front of our faces. It can still be argued that Microsoft has far better processes, tools, and people in place to resolve these types of incidents than the average corporate IT department. The only benefit you ultimately get from running these services on premises is the illusion that you are in control.