Fast-Growing Apps Sometimes Trip Up
Microsoft says Teams is the fastest-growing Office 365 application (a claim that might be challenged by Exchange) and is the current rock star of its cloud Office application suite. But February is developing into a bad month for Teams. In 2019, a failure involving keys stored in Azure Key Vault caused its first major outage. This year, the outage on February 3 was caused by an expired authentication certificate that stopped users from signing in to Teams.
Failures are part and parcel of application life, but letting a certificate expire surprised Office 365 observers who thought that the days of Microsoft tripping up over certificate management were long past, especially with the much-hyped automation deployed in Office 365 datacenter operations. Even acknowledging the complexity of Office 365 and Azure services used by Teams spread over multiple datacenter regions worldwide, failing to renew certificates is a fundamental and embarrassing issue.
Figure 1 shows the Downdetector graph for incident TM202916. Problems began around 13:00 UTC when users found that they were unable to connect with Teams. The number of reports spiked around 16:00 just as a new certificate was deployed to restore service.
The Development of a Teams Outage
Based on the incident history reported in the Office 365 Service Health Dashboard and some extra information in the post-incident report (PIR) released to affected tenants on February 10, the timeline developed as follows:
- 13:44: Alerts notified engineers of the problem, followed quickly by some social media posts saying that people couldn’t connect to Teams.
- 13:56: Engineers diagnosed that the problem was an expired authentication certificate.
- 14:13: First notice of the incident is posted to the Office 365 Service Health dashboard.
- 15:03: Microsoft notifies customers that an expired certificate is the root cause.
- 15:40: The new certificate is deployed.
- 15:58: Microsoft tells customers that they have deployed a new certificate.
- 16:20: A surge of reconnections caused some throttling to happen. Microsoft increased the throttling threshold to let clients connect at a higher rate. At this point, Microsoft believed that most users had service restored.
- 17:14: Engineers began to selectively restart components. This was probably done to force services to pick up the new certificate. Microsoft updated customers about this at 17:31.
- 19:50: “Remaining” user traffic was rerouted to alternate infrastructure (more servers) to restore service to those users.
- 20:06: Microsoft said that the service was largely restored.
- 20:30: Incident closed.
Apart from noting that people couldn’t connect if their authentication token had expired, Microsoft hasn’t said what Teams components were affected by the expired certificate. The PIR says that Microsoft is addressing the problem by implementing automated certificate rotation systems. Or, where automation isn’t possible, they are scheduling manual renewals.
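Microsoft hasn’t described how its rotation automation works, but the basic check behind any renewal process (automated or manual) is simple: compare each certificate’s expiry date against a warning window. Here’s a minimal sketch of that idea in Python. The inventory, certificate names, dates, and 30-day window are all hypothetical and say nothing about how Microsoft manages certificates for Teams.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days left before expiry (negative once the certificate has expired)."""
    return (not_after - now).days


def certificates_to_renew(inventory: dict, now: datetime, warn_days: int = 30) -> list:
    """Return names of certificates expiring within warn_days, soonest first.

    inventory maps a certificate name to its expiry (not-after) timestamp.
    """
    window = timedelta(days=warn_days)
    due = [(expiry, name) for name, expiry in inventory.items() if expiry - now <= window]
    return [name for _, name in sorted(due)]


# Hypothetical inventory with illustrative dates only.
inventory = {
    "auth-cert": datetime(2020, 2, 3, tzinfo=timezone.utc),
    "other-cert": datetime(2020, 12, 1, tzinfo=timezone.utc),
}
now = datetime(2020, 1, 20, tzinfo=timezone.utc)

for name in certificates_to_renew(inventory, now):
    print(f"{name}: renew within {days_until_expiry(inventory[name], now)} days")
```

A check like this only helps if something runs it on a schedule and someone (or some system) acts on the result, which is presumably the gap that the PIR’s remediation is meant to close.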
No Real Problems in Some Areas
I first noticed the problem when I was unable to switch tenants. Apart from that, my work wasn’t affected because (unusually) I had no Teams meetings scheduled during the outage and none of my clients needed to reauthenticate for the duration of the outage.
When you think about Teams functionality, being unable to access channels wasn’t critical because of several factors, including:
- Email was available to communicate with internal and external people.
- Files stored in document libraries belonging to Teams were accessible through SharePoint or copies existed on workstations via the OneDrive sync client (obvious tip: synchronize important libraries to your workstation).
- You can always chat with people the old-fashioned way (Try it! The method is surprisingly effective).
All in all, losing the ability to make Teams calls and connect to meetings caused the most user impact and discomfort.
The Promise of Voice and Audio Fails
Many companies have bought the vision of Teams as a one-stop shop for communication. They’ve moved or are moving from Skype for Business (Online or on-premises) or junked classic PABX systems to embrace Teams and the Microsoft Phone System. The promise of integrated communications using components like Stream to host recordings complete with automated transcripts (now in six languages) is compelling. Until the service fails.
Impact on Office 365 SLA
Even though this was a reasonably long incident, it won’t unduly affect Office 365 performance against its committed SLA of 99.9% availability. Teams has too few users and the incident was too short to clock up enough minutes to impact the overall availability of the service.
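To see why, it helps to run the arithmetic. The sketch below assumes availability is measured in user minutes, that is, (user minutes - downtime) / user minutes x 100, which is my understanding of how the SLA for Microsoft’s online services is framed. The user counts and the 90-day reporting period are invented for illustration. Only the 406-minute duration comes from the timeline above (13:44 to 20:30).

```python
def uptime_percentage(total_users: int, period_days: int,
                      affected_users: int, outage_minutes: int) -> float:
    """Availability as a percentage of user minutes over a reporting period."""
    user_minutes = total_users * period_days * 24 * 60
    downtime = affected_users * outage_minutes
    return (user_minutes - downtime) / user_minutes * 100


# Hypothetical figures: 1,000,000 users in total, 50,000 of them unable to
# use the service for the full 406 minutes, measured over a 90-day period.
result = uptime_percentage(
    total_users=1_000_000,
    period_days=90,
    affected_users=50_000,
    outage_minutes=406,
)
print(f"{result:.3f}%")
```

With these made-up numbers, the result stays comfortably above 99.9%. The outcome depends heavily on the fraction of users affected, which is why an incident confined to one workload barely registers against a figure calculated across the whole service.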
Failures Occur in Large Infrastructure Services
Slack is the major competitor for Teams. It has suffered its own outages, as have other meeting services like Cisco’s Webex. Utility services like power and water can suffer outages too, so it’s not unusual for problems in massive shared infrastructures to affect those who depend on them. In fact, at any time, incidents (or “advisories,” defined as problems limited in scope) are ongoing somewhere in Office 365. As I write this, the Service Health Dashboard for my tenant lists four advisories for the 24 available services, which seems like a normal (realistic) state of affairs.
Issue Won’t Stop Teams Growth
Eaten bread is soon forgotten and consciousness of an Office 365 incident soon disappears unless similar problems surface. I doubt that an outage of this nature will stop the rapid growth in Teams user numbers. What it might do is cause organizations to pay more attention to contingency plans to cover instances when key services are offline for extended periods.
As noted above, users could continue to work while they waited for Microsoft to restore the Teams service. The obvious question is what would happen if Teams and one of its fundamental underlying services (Exchange or SharePoint) were offline simultaneously. That prospect is enough to make tenant administrators squirm.