Reply To: How stable is Office 365? Can I trust the cloud?

tina
Keymaster
#615048

MJFChat – Mary Jo Foley speaks with Tony Redmond – Office 365 Availability Chat #Transcript

Mary Jo Foley: 00:00
Hi, you’re listening to Petri’s MJFChat show. Your Petri.com community magnet. That’s me, Mary Jo Foley. We’ll ask industry experts about various topics that you, our readers, will want to know about. Today is our inaugural MJFChat, and it’s all about Office 365 stability and trust. For our guest, we have Tony Redmond, who is an independent consultant specializing in Microsoft collaboration technologies.

He advises many companies on how to best develop, use and exploit Microsoft technology. And he’s the lead author of Office 365 for IT Pros, which is a constantly updated ebook covering Office 365 and associated technologies. Hi Tony. Thanks for joining the chat.

Tony Redmond: 00:54
Yeah, it’s great to be with you. I’m just looking forward to seeing what we can come up with.

Mary Jo Foley: 00:58
Alright. So, we were, you and I were bantering about what can we talk about that would be interesting around Office 365 and both of us noted that there have been some problems in the past few months with Office 365. I remember in November there were two back-to-back authentication incidents that took Office 365 down. In January, Office 365 users in Europe were unable to access their mailboxes for at least a day, some longer.

And then in February, Teams went down for multiple hours. So the common wisdom is that Microsoft can run your cloud better than you can, but is this really even true? That’s our main topic for today.

Tony Redmond: 01:42
Okay, good topic. Do you want me to respond to those, or are you just going to hit me with them one after another?

Mary Jo Foley: 01:49
Well I think if you, if you want to make any opening remarks and then we’ve got a bunch of questions from Petri.com readers in the forums.

Tony Redmond: 01:56
Okay. So I think one thing that I would say upfront is that in any kind of service there’s going to be incidents happening, at any time of the day or night that you choose to measure it. Uh, that’s the first thing. I think the important thing is how many people are affected by any particular incident. And in the incidents that you call out there, you know, the MFA authentication issues, which were in November, Exchange in January, and then Teams in February, it was not the case that all of Office 365 was affected. You know, the last official number we have from Microsoft is 155 million monthly active users, that is, somebody who logs on at least once a month.

Over the last 24 months, they’ve been adding users at the rate of about 3 million a month. So you could probably say there’s at least 165 to 170 million today, and none of those incidents affected more than a small percentage of the total user base. We can come back to why that is. I think it’s really important to say upfront that any incident, no matter how large, is only going to affect a portion of the Office 365 base.

Mary Jo Foley: 03:16
You know that, that’s a great point. And in fact, one of the Petri.com readers, Jeremy W said, should we be thinking about uptime for Office 365 as a whole or should we be evaluating services separately? So that very point that you just made.

Tony Redmond: 03:30
Yeah, well, you know, Microsoft has this financially backed service level agreement which people sign up to when they commit to Office 365. So if you look at, for example, the Exchange or SharePoint SLA, it says, you know, can you get to email, can you get to documents? And if you can, the SLA is met. And the financial backing is that, uh, if they dip under the 99.9% in the SLA, well, they’ll pay out some credit to you for the service.

Now, Microsoft has not had to make a payout that I’m aware of since September 2011, which was in the very early days of Office 365. They had a DNS issue and the service went down for quite a long time, uh, comparatively speaking quite a long time, and the service wasn’t used by a lot of people at that stage. So that single incident had the capacity to affect the SLA. It moved the needle. Whereas today, any one of the incidents that we just talked about will barely budge the SLA needle, and that’s the reason why I think Microsoft has hit its SLA every quarter since the last quarter of 2011.

Mary Jo Foley: 04:48
I was wondering that too, and Greg Alto in the forum said, did they actually break the SLA with any of these outages? And the answer’s no, right?

Tony Redmond: 04:55
Not at all. Not at all. I mean, I think it takes, it would take an incident which affected, uh, somewhere in the region of half a million users, which lasted more than eight hours. I think just off the top of my head, if you had one of those incidents that might affect it by 0.001%. Hmm.

Tony Redmond: 05:18
It’s just the law of numbers. As the numbers of users get higher and higher and the number of available minutes to those users gets bigger and bigger and bigger. And it’s billions of minutes every, every, uh, every quarter. So even if a million users go down for two hours, that’s still not a lot of minutes compared to the billions of minutes available — that’s the reason why.
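
Editor’s sketch: the user-minutes arithmetic Tony describes can be checked with a few lines of Python. This is a minimal illustration, assuming availability is measured as the share of user-minutes not lost to downtime, a 90-day quarter, and the 155 million user figure quoted earlier in the chat. The function name and the simplified formula are assumptions made for illustration, not Microsoft’s official SLA calculation.

```python
# Hypothetical helper: availability as the share of user-minutes not lost.
# The formula is a simplification assumed for this sketch.

QUARTER_MINUTES = 90 * 24 * 60  # assumes a 90-day quarter


def availability_percent(total_users, period_minutes, incidents):
    """Return availability in percent.

    incidents is a list of (affected_users, outage_minutes) pairs.
    """
    total_user_minutes = total_users * period_minutes
    lost_user_minutes = sum(users * minutes for users, minutes in incidents)
    return 100.0 * (total_user_minutes - lost_user_minutes) / total_user_minutes


if __name__ == "__main__":
    users = 155_000_000  # figure quoted in the chat

    # One million users down for two hours.
    one_million = availability_percent(users, QUARTER_MINUTES, [(1_000_000, 120)])
    print(f"1M users, 2 hours:   {one_million:.4f}%")

    # Half a million users down for eight hours.
    half_million = availability_percent(users, QUARTER_MINUTES, [(500_000, 480)])
    print(f"0.5M users, 8 hours: {half_million:.4f}%")
    print(f"Drop in percentage points: {100 - half_million:.4f}")
```

Under these assumptions, half a million users down for eight hours takes roughly 0.001 percentage points off quarterly availability, which is in line with Tony’s off-the-cuff estimate and far from the 0.1 points needed to fall below 99.9%.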

Mary Jo Foley: 05:42
You know, our view of this is skewed, right? I mean, Microsoft will always say it’s a small number of users affected, but if you’re on Twitter or you’re getting email from these people like I do, it feels like the entire world is affected, right? Everyone’s screaming, no one seems to know what’s going on. It just gets really out of hand.

And you know, a couple of people brought this up in the forum as well. Uh, you know, part of the problem is the way they’re giving us status updates through Twitter, you know, through what’s now called the Microsoft 365 support. Um, I forget the exact name of it, but there’s that account that notifies people when Microsoft is working on an incident. What they do now is they say, go to your dashboard and look at incident number, blah, blah, blah, and you’ll see where we’re at. And a lot of the people we hear from are not the administrators of the account. They’re people who work in a company, and the company’s not telling them what’s going on. All they know is, I can’t get my email.

Tony Redmond: 06:38
Yep, that’s absolutely true. Microsoft has had this problem since day one. The account name in Twitter, by the way, is @MSFT365Status, which gives out Twitter updates for all of the Microsoft 365 services. But Microsoft has had a problem with communicating with administrators and users since day one.

Uh, I think one of the reasons why is that, as far as I can tell, Microsoft has nobody who is what I would call a user advocate working in the service. So they’ve a whole pile of really talented engineers and designers and architects and programmers and so forth and so on. But they have nobody who looks at Office 365 or Microsoft 365 through the eyes of users. And so what this has led to is that when an incident happens, they go very likely into technology mode, figuring out all the speeds and feeds.

So they get on, my gosh, they have telemetry coming out of the wazoo to look at, and they look at all the telemetry and they figure out where the problem is and they swing into action. They go and fix things and reroute traffic and all the rest of it. But what they forget about is that, uh, the world of the tenant administrator who used to be an on-premises administrator has changed dramatically with the cloud. In the on-premises world, those people had total control of the situation. So they knew exactly what was going down. They knew, for example, if somebody had poured a Coca-Cola down the back of the server and it had become caramelized, which happened to me once, a long, long time ago. But that’s not important right now. They knew exactly what happened.

And they knew what was going to happen to get the service back online. And they’d communicate that in their own way to their own users. Now they’re living in darkness, because all they get is, they might get a whole pile of stuff on Twitter, some of which is pretty bad. Some of it is inaccurate, some of it is just plain wrong, some of it is vile rumors or whatever it is. And then there’s a little bit of truth sprinkled in there. And they’re expected to make sense of this.

They may or may not be able to get to the status pages where Microsoft is posting stuff. And even if they do get to those pages, let’s face it, Microsoft doesn’t assign the most gifted writers, you know. It’s obscured in wonderful engineering lingo which hides what’s really happening. And I sometimes feel that Microsoft would do the whole world a favor with two simple steps. One is that they would send administrators messages via SMS if their tenant was affected. Everybody’s got phones, everybody’s got SMS.

And that cuts out the whole thing of dependency on being able to get to a Microsoft service when either your Internet connection isn’t there or their sites are offline. And the second thing is that they start to look at things through the eyes of users and try to communicate a lot better, a lot more clearly, a lot more precisely about what’s going on. Um, quite frankly, an update once an hour to say, you know what, we will be back to you once we’ve done this, is not enough when your business is offline and you don’t know what to do. I just wish they would do those two things and I think things would be a lot better.

Mary Jo Foley: 10:07
I think you’re right. I mean, Blood in the Petri.com forums says, you know, the fact that’s bugging him is that network administrators have no control over these remote networks. So if it’s your own network that goes down, you can start doing something, even if it’s not the right thing. You’re trying basic troubleshooting or rebooting or something. Whereas now you’re like, okay, Microsoft, what are you doing? We don’t even know what you’re doing. Tell us. Right?

Tony Redmond: 10:32
Yeah. And the interesting thing is, if you look at the timeline of the Teams outage in February, uh, there was quite a substantial gap between the first signals showing up inside Microsoft, saying ‘something might be going on here,’ and the time when people actually swung into action. Now, that’s understandable, because the Microsoft folks have got to be sure that there’s an incident. And they’ve got to be sure that the incident is happening at scale.

It’s not something that’s, uh, you know, a minor fault that happened on a particular piece of kit that’s caused a ripple effect across some other pieces of kit. But then it all goes quiet. So, you know, it just again comes back to communication. You just tell people what’s happening and bring them along. It’s almost like storytelling to make people happy, rather than giving them just their ‘Oh, we’re working on it.’

Mary Jo Foley: 11:32
Right. I mean, and then what do you, what do you suggest when IT Pros say to you, okay, when this does happen, there’s an outage of one of the services or multiple services. What do you do? Like, so you’re just sitting there waiting for Microsoft to fix something, but is there anything you on your end as the administrator should be doing?

Tony Redmond: 11:49
Well, I think they’ve got to look at all the available sources. Um, so first, check the basic stuff: the Microsoft admin pages, the service health, to see if there is an incident that’s occurred and if that is what is affecting their tenant. They should only see stuff showing up in the admin pages if it’s affecting their tenant. Check around, and definitely check Twitter, because Twitter is, yeah, it’s got all sorts of false signals, but there’s a lot of good signals out there as well. And it’s a matter of being able to decipher what’s good and what’s bad, and you get that with a little bit of experience. Uh, maybe check in with the local user group. Local user groups normally have ways of communicating; they may have a WhatsApp group or something like that.

Because of the way Office 365 is built, because of the way that it is regionalized, because of the way that it’s designed to limit the effect of an outage to within a data center region, it’s likely that if you’re having a problem, the folks in the local user group are also having exactly the same problems, and you’ll help each other figure out what’s going on. And then if you really want to install some software that helps you know what’s going on, there is software out there, Office365mon.com for example, that will allow you to track exactly what’s happening for the various services, viewed through the lens of your users, which I think is important because, you know, Office 365 is such an enormously immense place at this point in time. Uh, the view you’re getting from the whole internet is not necessarily what’s happening for you right now.

And an example of that is again, you know, from the Teams outage where, yes, there was an authentication issue. Yes, there was an overload on the Azure Key Vaults which caused this problem to occur, but users who had already authenticated were working quite happily all through the outage, right? People who used different authentication paths, like people who were using the Teams mobile clients, kept on working without a hitch. They never noticed it, which just goes to prove that the experience that somebody has of Office 365 right now may be diametrically opposed to the experience somebody else is having, even in the same building, connected to the same tenant. It just all depends.

Mary Jo Foley: 14:12
True. I mean, do you go so far as to say people should have backups of all their Office 365 data to other clouds just in case something like this happens, or is that really nothing that will help them when something like the Teams outage happens or Exchange Online goes down?

Tony Redmond: 14:27
It won’t help them one bit. I mean, I hear this, but I ask myself this question, right? I’m an admin. First of all, I only hear about an incident when my users start to be affected, and then it takes me a little bit of time to figure out whether or not it’s a true incident or just something that’s unique to those users. Okay. So we’re now maybe an hour into the incident. Now, how long would it take me to get all of the data for my entire tenant, and move it to where?

Now there’s two big, uh, big question marks here. Firstly, how much of the data do you need? And let’s face it, we all have more data now than we ever have before. I mean, most Office 365 users now have hundred-gigabyte mailboxes and can keep just about as much stuff as they want in their OneDrive accounts or their SharePoint libraries. So there’s a heck of a lot of data out there. So that’s one thing. How do I move all that data? And then the second thing is, where do I move all that data? Because Office 365 is down, I can’t move it to Office 365. I can’t move it to another tenant. Uh, do I move it to G Suite? Can I move it to G Suite? That’s another question. Can I move it on-prem? No, probably not, because I don’t have the capacity.

So, unfortunately, one of the things I think that’s happening as we go further and further along this journey into the cloud is that we become more trapped by the cloud. So really, the only thing you can do is batten down the hatches, figure out what’s going on, communicate to your users, keep your users happy, and then wait for Microsoft to fix the problem. Fortunately, Microsoft is pretty good at fixing problems, and they have an awful lot of resources to help them fix the problem when something goes bad, like the incidents we talked about earlier on.

Mary Jo Foley: 16:23
Right, right. Yeah. I mean, sometimes it feels like, is Microsoft doing this just to increase their own profits in a way? Right. Or are they actually doing this, the way they’ve set up Office 365, to be resilient? You know, there are people who are kind of, what shall we call them, I don’t know, um, doubters. They’re like, yeah, ‘you know, are they doing this just to show, like, you know, they’re really good at fixing things and they can do this,’ or is Office 365 really, truly resilient? And did they build it to be something that would come back quickly from outages? Because, you know, to be frank, when your email’s down for an entire day, it’s like, come on guys, what are you doing over there? Right?

Tony Redmond: 17:10
Um, well, I guess it all depends on how you look at this. Uh, I think with Office 365, firstly, Microsoft invested a whole lot of money in it. I mean, they keep on building out data center regions. You know, we see them around the world now: Japan, Korea, India, Australia, uh, France, western Europe, UK. You see a new one in Germany. So they just continue to build out Office 365 regions to make sure that they have data sovereignty taken care of, and that they can accommodate the needs of multi-geo organizations.

Part of the goodness that they get from this is the fact that any Office 365 outage is restricted to the boundary of the region. Inside a region, you’re going to have at least two data centers, and so on. And, uh, you often have backup for some of the services coming in from another region. So, for Office 365 to have a catastrophic failure, I mean, you’d have to take something like the lightning strike that happened to the San Antonio data centers.

Mary Jo Foley: 18:20
I do remember that.

Tony Redmond: 18:22
You reported extensively on that, I seem to recall. If you remember, that was a pretty severe physical event. You know, you had this flash of lightning that came in and knocked a whole pile of stuff offline. It took them quite a long time, it took over a day, to get everything back up and operational. Uh, you know, think about a more severe physical event. What would that be? An explosion at a data center, right? Would that take out an entire data center? Probably not.

Could an entire data center be taken out by, uh, all of its Internet connectivity being taken out? What would it need for that kind of thing? Well, if you’ve visited any one of these Microsoft data centers, you’ll see a huge redundancy built into them. So I’m pretty sure that the physical infrastructure of these data centers is good. I’m pretty sure about the software design that they’ve got too. For example, Exchange uses database availability groups, and every mailbox is protected by the fact that it’s copied into four different databases. Looking at my mailbox right now, its active copy is in Dublin, but there are passive copies in Amsterdam, Vienna and Helsinki. So, you know, I think Microsoft has done their best to ensure both the software and the physical sides of the equation are taken care of.

Could they have a horrible, horrible incident which takes out a complete data center? Absolutely. It could be an earthquake, tidal wave or whatever. Would that stop them restoring service? Probably not, because they have their own, uh, dark fiber network, which connects together all these data centers. The data centers are able to take the full load for a region. Uh, you might get more incidents. If, for example, in Canada, one of the Canadian data centers was offline for two weeks, the second one would keep on rolling. Could you have more incidents in that one data center? Possibly, but you’d still be running. It’s kind of hard. Yeah. I think that’s probably about as much as you can do, given the state of the art right now.

Mary Jo Foley: 20:32
Hmm. Okay. Well Tony, we’re out of time, but I wanted to, I know half-hour goes by fast. Um, but I wanted to let you, uh, tell people where they can find you if they want to follow you.

Tony Redmond: 20:47
Oh, um, yeah. So I’m @12knocksinna on Twitter, or office365itpros.com. I should say, people ask me, why 12Knocksinna? Because stupidly, I took that name a long time ago when Twitter was starting up. I didn’t even think about it. I just took that name. It’s the name of my house and I’ve used it ever since.

Mary Jo Foley: 21:09
Oh, you know what, I always wondered that. So it’s good to know the reason for that. I was like, what is it a code name or something?

Tony Redmond: 21:15
Yeah, that’s a much more intelligent reason. I’ll use that. It’s a code name and I’ll let people work out what the code is.

Mary Jo Foley: 21:22
There you go.

Tony Redmond: 21:24
Very boring. Sorry.

Mary Jo Foley: 21:24
Well, thank you so much for doing the first MJFChat with me. And, thanks to everyone who’s listening today to MJFChat. Um, we’ll be back again in a couple of weeks with our next guest, so make sure to check out the Petri.com forums to see who that is and send in your questions early and often. We’ll also be posting the audio recording and transcript of this and all our other chats in the Petri.com forums. So, thanks again to everyone and have a nice day.

  • This reply was modified 1 month, 2 weeks ago by tina.
  • This reply was modified 1 month, 2 weeks ago by Brad Sams.