Mary Jo Foley, Moderator, February 26, 2019 at 10:02 am #614579
Microsoft’s Office 365 hasn’t had a good couple of months. There have been a couple of big outages, including, most recently, a multi-hour Teams outage on February 18. The common wisdom is Microsoft can run your cloud better than you can. But is this really true? What questions do you have for Redmond (as in Microsoft) and Redmond (as in Tony) on this hot topic? I’m going to be chatting with him on March 4 and will put some of your best questions directly to Tony.
gregalto, Participant, February 26, 2019 at 10:47 am #614580
Did Microsoft break any SLAs with the outages they have been having?
February 27, 2019 at 3:58 am #614612
I was reading an article about this where respondents had said their on-premises setups have had less downtime than Office 365 over the last couple of years. What’s the best way to mitigate this? The issue is not the downtime per se, but the fact that network administrators have no control over the remote network at all. At least in your own network, you can immediately set about basic troubleshooting, rebooting etc., to get the network back up within a few minutes.
Ossian, Moderator, February 27, 2019 at 9:28 am #614622
As far as I can see, the only mitigation is BOHICA – as you say, you have no control, or warning, or ability to manage it.
You then find that Microsoft’s “SLAs backed by robust financial penalties” turn into “we ruined your business with an outage – have a free week (or month) of cloud services”.
JeremyW, Moderator, February 27, 2019 at 10:31 am #614628
For a lot of the clients I manage, power and internet are the weakest links, and they’re small enough that it’s not cost-effective for them to get redundant connections or generators. For those clients, even with the outages, O365 is still a win. Also, what my clients currently see as most critical is Exchange Online, closely followed by SharePoint Online. And a much smaller category is the Dynamics 365 clients, where that is critical as well. The other services could be down without too much impact.
My questions would be:
- Is there a common thread among all the outages?
- Are we looking at all O365 services as a whole for uptime and should we be? Or should each major service be evaluated separately?
February 27, 2019 at 10:44 am #614629
This also brings to mind hybrid setups. We have not yet embarked on the 365 journey but during my research, I looked at having two copies of the data – cloud, and on-premise. However, I then read several articles stating that hybrid setups can prove tricky to manage but I don’t know if that is because they are intrinsically tricky or if you require a doctorate in 365 to manage it effectively.
February 27, 2019 at 10:45 am #614630
Given the size of Office 365 now, it is difficult for any single outage to affect the SLA. See https://www.petri.com/office-365-growth-good-sla-performance
February 27, 2019 at 11:04 am #614632
Re. “Is there a common thread?”: Azure AD might be the weak link for many services. Certainly, a number of the recent outages were caused by problems in Azure AD or associated infrastructure.
March 1, 2019 at 2:21 am #614729
First the “easy one”: Office 365 has already used up well more than the downtime that 99.9% reliability allows for this year. Our on-prem solutions have had no downtime this year. Last year, we had about 2 hours of unplanned downtime. How can Microsoft hope to compete with such a poor availability record?
As to GDPR, or DSGVO as we call it (Datenschutzgrundverordnung), here we are responsible for the data. If we blab the data, that is our fault and we have to pay a fine, if we put it in the cloud and a cloud provider (E.g. Microsoft with Office 365) blabs the data, hands it over to the US Government or takes it out of the EU without getting the written permission of the identifiable entities in the data, we are still liable for the breach of GDPR and will still have to pay the fine.
With moving the data outside the EU, the data has to be stored in a country with equivalent data protection to the EU; the USA does not fall into this category. There was an agreement between the USA and the EU to replace Safe Harbor, which was deemed non-compliant. The new Privacy Shield sees the appointment of an Ombudsman in the USA as a prerequisite, but after over 18 months the US Government has still failed to appoint an Ombudsman, which makes Privacy Shield non-compliant.
Then there is the matter of the FISA court. If Microsoft is presented with a FISA letter, they have to hand over the data without informing their customers. That is a breach of contract and a breach of GDPR: Microsoft cannot hand over the data to the US Government without first getting the written permission of all identifiable entities, but the FISA letter prohibits them from complying with the law. If Microsoft hands over the data and it comes out, the customer is liable to a fine of up to 20M€ or 4% of annual worldwide turnover, whichever is higher.
Add to this the data slurping of Windows 10 and Office 365 (480 data providers in Windows in default configuration, 420 in “private” mode and still 4 in “secure” mode and several thousand data providers in Office 365), this data slurping has to be opt-in, but Microsoft doesn’t even offer an opt-out. The Dutch Government has given Microsoft until April to provide a compliant version of Office 365 for EU customers.
At work, we have Microsoft 365, mainly for the CALs. Office 365 Pro Plus is installed. No hybrid or Azure domain can be used. Teams, Exchange, SharePoint and all other “cloudy” goodness is disabled by policy. Exchange, file servers and SQL Servers remain on-premises, mainly due to GDPR.
Given the uncertainty of the data storage, the legal problems and the low reliability at the current time, how is Microsoft going to make its cloud offerings attractive to potential customers?
March 1, 2019 at 4:03 am #614732
I clicked on the link on the blog post and my first question landed in a new thread…
Anyway, second question:
Why can’t you use long, complex passwords with Office 365? The “normal” sort of password I use is rejected by Office 365, because it is too long (19 – 21 characters is normal for the passwords I use). Office 365, I think, only allows 15 or 16 characters maximum. That seems much too short for me.
March 1, 2019 at 4:07 am #614733
Exactly, and when planned downtime runs to 6 figures or more per hour, that is a major expense caused by something outside your control.
March 1, 2019 at 9:18 am #614747
@tony Redmond – from the link you posted.
Microsoft calculates the Office 365 SLA in terms of downtime, or minutes when incidents deprive users of a contracted service such as Exchange Online or SharePoint Online. As an example of the calculation, if you assume that Microsoft has 100 million active users for Office 365, the total number of minutes available to Office 365 users in a 90-day quarter is 12,960,000,000,000. Achieving a 99.97% SLA means that Microsoft considers incidents caused downtime of 3,888,000,000 minutes or 64,800,000 hours. These are enormous numbers, but put in the context of the size of Office 365, each Office 365 user lost just 39 minutes to downtime during the quarter.
Of course, some users experienced zero downtime. Incidents might not have affected their tenant or they might not have been active when an incident happened. On the other hand, some tenants might have had a horrible quarter. Remember that Office 365 spreads across twelve datacenter regions and the service varies from region to region and from tenant to tenant, a fact that you should always remember when a Twitter storm breaks to discuss a new outage.
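The arithmetic in the quoted passage can be checked with a short sketch. The 100 million users, the 90-day quarter and the 99.97% figure come from the quote; the function and variable names are my own and purely illustrative, not anything Microsoft publishes.

```python
MINUTES_PER_DAY = 24 * 60


def sla_minutes(active_users, days, availability):
    """Return (total, downtime, downtime per user), all in minutes.

    Illustrative helper: treats the SLA as one aggregate pool of
    user-minutes, as the quoted article describes.
    """
    total = active_users * days * MINUTES_PER_DAY
    # Convert the unavailable fraction to hundredths of a percent so the
    # downtime can be computed with exact integer arithmetic.
    downtime_bp = round((1 - availability) * 10_000)
    downtime = total * downtime_bp // 10_000
    return total, downtime, downtime / active_users


total, downtime, per_user = sla_minutes(100_000_000, 90, 0.9997)
print(f"{total:,} user-minutes available in the quarter")
print(f"{downtime:,} minutes of downtime ({downtime // 60:,} hours)")
print(f"{per_user:.1f} minutes of downtime per user")
```

On these assumptions the per-user figure comes to roughly 39 minutes, matching the quote. Note that it is an average over every user, so it says nothing about how the downtime was spread across tenants.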
And that is a horrendous way to calculate it. If I have 5 days of no availability in a quarter, I don’t want to hear that the SLA for the complete Office 365 community hit its 99.9%. That is irrelevant. What is relevant is that my tenant / my users didn’t experience 99.97%. I don’t care if everybody else had 100% for the quarter, if I lost 5 days productivity at millions of Euros a day, that is a real loss for me, but due to Microsoft’s way of calculating, I’m SOL because the other tenants were AOK?
March 1, 2019 at 5:16 pm #614779
I can’t account for how Microsoft calculates its SLA. It publishes what it does and customers sign up for this when they buy Office 365.
If you want to know your own SLA, you need to run a tool like https://office365mon.com/
March 1, 2019 at 5:20 pm #614780
Re. GDPR and the Dutch DPIA, I wrote about this in https://www.petri.com/dutch-report-slams-microsoft-gdpr-violations
AFAIK, the agreement with the Dutch Government covers Office Pro Plus and not Office 365 apps like SharePoint Online.
tina, Keymaster, March 4, 2019 at 5:01 pm #615048
MJFChat – Mary Jo Foley speaks with Tony Redmond – Office 365 Availability Chat #Transcript
Mary Jo Foley: 00:00
Hi, you’re listening to Petri’s MJFChat show. Your Petri.com community magnate. That’s me, Mary Jo Foley. We’ll ask industry experts about various topics that you, our readers, will want to know about. Today is our first inaugural MJFChat, and it’s all about Office 365 stability and trust. For our guest, we have Tony Redmond, who is an independent consultant specializing in Microsoft collaboration technologies.
He advises many companies on how to best develop, use and exploit Microsoft technology. And he’s the lead author of Office 365 for IT Pros, which is a constantly updated ebook covering Office 365 and associated technologies. Hi Tony. Thanks for joining the chat.
Tony Redmond: 00:54
Yeah, it’s great to be with you. I’m just looking forward to seeing what we can come up with.
Mary Jo Foley: 00:58
Alright. So, we were, you and I were bantering about what can we talk about that would be interesting around Office 365 and both of us noted that there have been some problems in the past few months with Office 365. I remember in November there were two back-to-back authentication incidents that took Office 365 down. In January, Office 365 users in Europe were unable to access their mailboxes for at least a day, some longer.
And then in February, Teams went down for multiple hours. So the common wisdom is Microsoft can run your cloud better than you can, but is this really even true? That’s our main topic for today.
Tony Redmond: 01:42
Okay, good topic. You want me to respond to those or you just going to hit me with them one after another?
Mary Jo Foley: 01:49
Well I think if you, if you want to make any opening remarks and then we’ve got a bunch of questions from Petri.com readers in the forums.
Tony Redmond: 01:56
Okay. So I think one thing I would say upfront is that in any kind of service there’s going to be incidents happening, at any time of the day or night that you choose to measure it. Uh, that’s the first thing. I think the important thing is how many people are affected by any particular incident. And, in the incidents that you call out there, you know, the MFA authentication issues, which were in November, Exchange in January, and then Teams in February, it wasn’t the case that all of Office 365 was affected. You know, the last official number we have from Microsoft is 155 million monthly active users, that is, somebody who logs on at least once a month.
Over the last 24 months, they’ve been adding users at the rate of about 3 million a month. So you could probably say there’s at least 165 – 170 million today, and none of those incidents affected more than a small percentage of the total user base. And we can come back to why that is. I think it’s really important to say upfront that any incident, no matter how large, is only going to affect a portion of the Office 365 base.
Mary Jo Foley: 03:16
You know that, that’s a great point. And in fact, one of the Petri.com readers, Jeremy W said, should we be thinking about uptime for Office 365 as a whole or should we be evaluating services separately? So that very point that you just made.
Tony Redmond: 03:30
Yeah, well, you know, Microsoft has this financially backed service level agreement which people sign up to when they commit to Office 365. So if you look at, for example, the Exchange or SharePoint SLA, it says, you know, can you get to email, can you get to documents? And if you can, the SLA is met. And the financial backing is that, uh, if they dip under 99.9% of the SLA, well, they’ll pay out some credit to you for the service.
Now, Microsoft has not had to make a payout that I’m aware of since September 2011, which was in the very early days of Office 365. They had a DNS issue and the service went down for quite a long time, uh, comparatively speaking quite a long time, and the service wasn’t used by a lot of people at that stage. So that single incident had the capacity to affect the SLA. It moved the needle. Whereas today, any one of the incidents that we just talked about will barely budge the SLA needle, and that’s the reason why I think Microsoft has hit its SLA every quarter since the last quarter of 2011.
Mary Jo Foley: 04:48
I was wondering that too, and Greg Alto in the forum said, did they actually break the SLAs with any of these outages, and the answer’s no, right?
Tony Redmond: 04:55
Not at all. Not at all. I mean, I think it takes, it would take an incident which affected, uh, somewhere in the region of half a million users, which lasted more than eight hours. I think just off the top of my head, if you had one of those incidents that might affect it by 0.001%. Hmm.
Tony Redmond: 05:18
It’s just the law of numbers. As the numbers of users get higher and higher and the number of available minutes to those users gets bigger and bigger and bigger. And it’s billions of minutes every, every, uh, every quarter. So even if a million users go down for two hours, that’s still not a lot of minutes compared to the billions of minutes available — that’s the reason why.
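As a rough check of that law of numbers, the sketch below estimates how far a single incident moves the quarterly figure. It assumes the 155 million monthly active users mentioned earlier in the chat and a 90-day quarter; the function name is hypothetical, and the user-minutes approach is only the one described in the forum thread, not a confirmed Microsoft formula.

```python
MINUTES_PER_QUARTER = 90 * 24 * 60  # assumption: a 90-day quarter


def availability_hit(users_affected, hours_down, total_users=155_000_000):
    """Percentage points of aggregate availability lost to one incident."""
    lost_minutes = users_affected * hours_down * 60
    available_minutes = total_users * MINUTES_PER_QUARTER
    return 100 * lost_minutes / available_minutes


# Half a million users down for eight hours
print(f"{availability_hit(500_000, 8):.4f}%")
# A million users down for two hours
print(f"{availability_hit(1_000_000, 2):.4f}%")
```

Under these assumptions, half a million users down for eight hours costs roughly 0.001 percentage points, in line with the off-the-cuff estimate above, and a million users down for two hours costs about half of that.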
Mary Jo Foley: 05:42
You know, our view of this is skewed, right? I mean, Microsoft will always say it’s a small number of users affected, but if you’re on Twitter or you’re getting email from these people like I do, it feels like the entire world is affected, right? Everyone’s screaming, no one seems to know what’s going on. It just gets really out of hand.
And you know, a couple of people brought this up in the forum as well. Uh, you know, part of the problem is the way they’re giving us status updates through Twitter, you know, through what’s now called the Microsoft 365 support, um, I forget the exact name of it, but there’s that account that notifies people when Microsoft is working on an incident. What they do now is they say, go to your dashboard and look at incident number, blah, blah, blah, and you’ll see where we’re at. And a lot of the people we hear from are not the administrators of the account; they’re people who work in a company, and the company’s not telling them what’s going on. All they know is, I can’t get my email.
Tony Redmond: 06:38
Yep, that’s absolutely true. Microsoft has had this problem since day one. The account name in Twitter, by the way, is @MSFT365Status, which gives out Twitter updates for all of the Microsoft 365 services. But Microsoft has had a problem with communicating with administrators and users since day one.
Uh, I think one of the reasons why is that, as far as I can tell, Microsoft has nobody who is what I would call a user advocate working in the service. So they’ve a whole pile of really talented engineers and designers and architects and programmers and so forth and so on. But they have nobody who looks at Office 365 or Microsoft 365 through the eyes of users. And so what this has led to is that when an incident happens, they go very likely into technology mode, figuring out all the speeds and feeds.
So they get, oh my gosh, they have telemetry coming out of the wazoo to look at, and they look at all the telemetry and they figure out where the problem is and they swing into action. They go and fix things and reroute traffic and all the rest of it. But what they forget about is that, uh, the world of the tenant administrator who used to be an on-premises administrator has changed dramatically with the cloud. In the on-premises world, those people had total control of the situation. So they knew exactly what was going down. They knew, for example, if somebody had poured a Coca-Cola down the back of the server and it had become caramelized, which happened to me once a long, long time ago. But that’s not important right now. They knew exactly what happened.
And they knew what was going to happen to get the service back online. And they’d communicate that in their own way to their own users. Now they’re living in darkness because all they get is, they might get a whole pile of stuff on Twitter, some of which is pretty bad. Some of it is inaccurate, some of it is just plain wrong, some of it is vile rumors or whatever it is. And then there’s a little bit of truth sprinkled in there. And they’re expected to make sense of this.
They may or may not be able to get to the status pages where Microsoft is posting stuff. And even if they do get to those pages, let’s face it, Microsoft doesn’t assign the most gifted writers, you know. It’s obscured in wonderful engineering lingo which hides what’s really happening. And I sometimes feel that Microsoft would do the whole world a favor with two simple steps. One is that they would send administrators messages via SMS if their tenant was affected. Everybody’s got phones, everybody’s got SMS.
And that cuts out the whole dependency on being able to get to a Microsoft service when either your Internet connection isn’t there or their sites are offline. And the second thing is that they start to look at things through the eyes of users and try to communicate a lot better, a lot more clearly, a lot more precisely about what’s going on. Um, quite frankly, an update once an hour to say, you know what, we will be back to you once we’ve done this, is not enough when your business is offline and you don’t know what to do. I just wish they would do those two things and I think things would be a lot better.
Mary Jo Foley: 10:07
I think you’re right. I mean, Blood in the Petri.com forums says, you know, the fact that’s bugging him is that network administrators have no control over these remote networks. So if it’s your own network that goes down, you can start doing something, even if it’s not the right thing; you’re trying basic troubleshooting or rebooting or something. Whereas here you’re like, okay, Microsoft, what are you doing? We don’t even know what you’re doing. Tell us. Right?
Tony Redmond: 10:32
Yeah. And the interesting thing is if you look at the timeline of the Teams outage in February, uh, there was quite a substantial gap between the first signals showing up inside Microsoft, saying ‘something might be going on here,’ and the time when people actually swung into action. Now, that’s understandable because the Microsoft folks have got to be sure that there’s an incident. And they’ve got to be sure that the incident is happening at scale.
It’s not something that, uh, you know, is a minor fault that happened on a particular piece of kit that’s caused a ripple effect across some other pieces of kit. But then it all goes quiet. So, you know, it just again comes back to communication. You just tell people what’s happening and bring them along. It’s almost like storytelling to make people happy rather than giving them just the ‘Oh, we’re working on it.’
Mary Jo Foley: 11:32
Right. I mean, and then what do you, what do you suggest when IT Pros say to you, okay, when this does happen, there’s an outage of one of the services or multiple services. What do you do? Like, so you’re just sitting there waiting for Microsoft to fix something, but is there anything you on your end as the administrator should be doing?
Tony Redmond: 11:49
Well, I think they’ve got to look at all the available sources. Um, so first check the basic stuff, the Microsoft admin pages, the service health, to see if there is an incident that’s occurred and if that is what is affecting their tenant. They should only see stuff showing up in the admin pages if it’s affecting their tenant. Check around, definitely check Twitter, because Twitter is, yeah, it’s all sorts of false signals, but there’s a lot of good signals out there as well. And it’s a matter of being able to decipher what’s good and what’s bad, and you get that with a little bit of experience. Uh, maybe check in with the local user group. Local user groups normally have ways of communicating; they may have a WhatsApp group or something like that.
Because of the way Office 365 is built, because of the way that it is regionalized, because of the way that it’s designed to limit the effect of an outage to within a data center region, it’s likely that if you’re having a problem, the folks at the local user group are also having exactly the same problems, and you’ll help each other figure out what’s going on. And then if you really want to install some software that helps you know what’s going on, there is software out there, Office365mon.com for example, that will allow you to track exactly what’s happening for the various services viewed through the lens of your users, which I think is important because, you know, Office 365 is such an enormously immense place at this point in time. Uh, the view you’re getting from the whole internet is not necessarily what’s happening for you right now.
And an example of that is again, you know, from the Teams outage where, yes, there was an authentication issue. Yes, there was an overload on the Azure Key Vaults which caused this problem to occur, but users who had already authenticated were working quite happily all through the outage, right? People who used different authentication paths, like people who were using the Teams mobile clients, kept on working without a hitch; they never noticed it. Which just goes to prove that the experience that somebody has of Office 365 right now may be diametrically opposed to the experience somebody else is having, even in the same building connected to the same tenant. It just all depends.
Mary Jo Foley: 14:12
True. I mean, do you go so far as to say people should have backups of all their Office 365 data to other clouds just in case something like this happens, or is that really nothing that will help them when something like the Teams outage happens or Exchange Online goes down?
Tony Redmond: 14:27
It won’t help them one bit. I mean, I hear this, but I ask myself this question, right? I’m an admin. First of all, I only hear about an incident when my users start to be affected, and then it takes me a little bit of time to figure out whether or not it’s a true incident or just something that’s unique to those users. Okay. So we’re now maybe an hour into the incident. Now how long would it take me to get all of the data for my entire tenant, and to where?
Now there’s two big, uh, big question marks here. Firstly, how much of the data do you need? And let’s face it, we all have more data now than we ever have before. I mean, most Office 365 users now have hundred-gigabyte mailboxes and can keep just about as much stuff as they want in their OneDrive accounts or their SharePoint libraries. So there’s a heck of a lot of data out there. So that’s one thing. How do I move all that data? And then the second thing is, where do I move all that data? Because Office 365 is down, I can’t move it to Office 365. I can’t move it to another tenant. Uh, do I move to G Suite? Can I move it to G Suite? That’s another question. Can I move it on-prem? No, probably not because I don’t have the capacity.
So, unfortunately, one of the things I think that’s happening as we go further and further along this journey into the cloud is that we become more trapped by the cloud. So really, the only thing you can do is batten down the hatches, figure out what’s going on, communicate to your users, keep your users happy, and then wait for Microsoft to fix the problem. Fortunately, Microsoft is pretty good at fixing problems and they have an awful lot of resources to help them fix the problem when something goes bad, like the incidents we talked about earlier on.
Mary Jo Foley: 16:23
Right, right. Yeah. I mean, sometimes it feels like, is Microsoft doing this just to increase their own profits in a way? Right. Or have they actually set up Office 365 to be resilient? You know, there are people who are kind of, what shall we call them, I don’t know, um, doubters. They’re like, yeah, you know, are they doing this just to show that they’re really good at fixing things, or is Office 365 really, truly resilient? And did they build it to be something that would come back quickly from outages? Because, you know, to be frank, when your email’s down for an entire day, it’s like, come on guys, what are you doing over there? Right?
Tony Redmond: 17:10
Um, well I guess it all depends on how you look at this. Uh, I think with Office 365, firstly, Microsoft invested a whole lot of money in it. I mean, they keep on building out data center regions. You know, we see them around the world now: Japan, Korea, India, Australia, uh, France, western Europe, UK, and you see a new one in Germany. So they just continue to build out Office 365 regions to make sure that they have data sovereignty taken care of, and that they can accommodate the needs of multi-geo organizations.
Part of the goodness that they get from this is the fact that any Office 365 outage is restricted to the boundary of the region. Inside a region, you’re going to have at least two data centers. And, uh, you often have backup for some of the services coming in from another region. So, for Office 365 to have a catastrophic failure, I mean, you’d have to take something like the lightning strike that happened to the San Antonio data centers.
Mary Jo Foley: 18:20
I do remember that.
Tony Redmond: 18:22
You reported extensively on that, I seem to recall. If you remember, that was a pretty severe physical event. You know, you had this flash of lightning that came in and knocked a whole pile of stuff offline. It took them quite a long time, over a day, to get everything back up and operational. Uh, you know, think about a more severe physical event. What would that be? An explosion at a data center, right? Would that take out an entire data center? Probably not.
Could an entire data center be taken out by, uh, all of its Internet connectivity being taken out? What would it need for that kind of thing? Well, if you’ve visited any one of these Microsoft data centers, you’ll see a huge redundancy built into them. So, I’m pretty sure that the physical infrastructure of these data centers is good. I’m pretty sure about the software design that they’ve got. For example, Exchange uses database availability groups; every mailbox is protected by the fact that it’s copied into four different databases. Looking at my mailbox right now, its active copy is in Dublin, but there are passive copies in Amsterdam, Vienna and Helsinki. So you know, I think Microsoft has done their best to ensure both the software and the physical sides of the equation are taken care of.
Could they have a horrible, horrible incident which takes out a complete data center? Absolutely. It could be an earthquake, tidal wave or whatever. Would that stop them restoring service? Probably not, because they have their own, uh, dark fiber network, which connects together all these data centers. The data centers are able to take the full load for a region. Uh, you might get more incidents if, for example, in Canada, one of the Canadian data centers was offline for two weeks, but the second one would keep on rolling. Could you have more incidents in that one data center? Possibly, but you’d still be running. It’s kind of hard. Yeah. I think that’s probably as much as you can do, given the state of the art right now.
Mary Jo Foley: 20:32
Hmm. Okay. Well Tony, we’re out of time, but I wanted to, I know half-hour goes by fast. Um, but I wanted to let you, uh, tell people where they can find you if they want to follow you.
Tony Redmond: 20:47
Oh, um, yeah. So I’m @12knocksinna on Twitter, or office365itpros.com. I should say, people ask me, why 12Knocksinna? Because stupidly, I took that name a long time ago when Twitter was starting up. I didn’t even think about it. I just took that name. It’s the name of my house and I’ve used it ever since.
Mary Jo Foley: 21:09
Oh, you know what, I always wondered that. So it’s good to know the reason for that. I was like, what is it a code name or something?
Tony Redmond: 21:15
Yeah, that’s a much more intelligent reason. I’ll use that. It’s a code name and I’ll let people work out what the code is.
Mary Jo Foley: 21:22
There you go.
Tony Redmond: 21:24
Very boring. Sorry.
Mary Jo Foley: 21:24
Well, thank you so much for doing the first MJFChat with me. And, thanks to everyone who’s listening today to MJFChat. Um, we’ll be back again in a couple of weeks with our next guest, so make sure to check out the Petri.com forums to see who that is and send in your questions early and often. We’ll also be posting the audio recording and transcript of this and all our other chats in the Petri.com forums. So, thanks again to everyone and have a nice day.
- This reply was modified 7 months, 2 weeks ago by tina.
- This reply was modified 7 months, 2 weeks ago by Brad Sams.