When you place your data in the ‘cloud’, you are always promised one thing: increased availability. The idea is that companies like Microsoft, Amazon, and Google can build out redundancy at a scale that most companies cannot afford.
But even the best data centers in the world will go offline, which is why we always preach that you should be prepared for an outage: it’s not a matter of if, but when. A couple of weeks back, one of Microsoft’s data centers went offline, and now we have the full triage report of what happened.
Thanks to the company’s transparency, you can read exactly what happened here. The short version is that an electrical storm, after repeated strikes, tripped all of the protections put in place to prevent such a failure. Specifically, the cooling system inside the data center failed, and as temperatures quickly rose above safe levels, automated shutdown procedures started running to protect the hardware inside the facility.
The temperature rose so quickly that some hardware was damaged by the heat before the shutdown procedures could complete; this is why some users experienced an extended outage while Microsoft recovered and migrated data.
This type of outage was not directly Microsoft’s fault: despite the company’s best efforts to prepare for an electrical storm, its protections failed to isolate the data center. It is a good lesson that building out a data center is not an easy task, and that despite everything we know about avoiding disaster at this scale, we still have a lot to learn.