Understanding Disaster Recovery

This article introduces the concepts of disaster recovery (DR). Once seen as something that only Fortune 500 enterprises did, DR has been democratized by virtualization and third-party software vendors. Furthermore, Microsoft offered an amazing solution in the form of Hyper-V Replica, making DR replication possible for small businesses, solving problems for large enterprises, and introducing a new business opportunity for service providers. But you’ll need to learn to walk before you run. You’ll need to understand what DR is, along with its concepts and terminology, before you start learning about one of the most popular features in Hyper-V. Read on for an overview of disaster recovery and what it means for you and your business.

The Need for Disaster Recovery

Disasters happen more often than one might think. Some, like Hurricanes Sandy or Katrina, make headlines, and the chaos that they create is obvious and widely felt. Tornadoes or floods might destroy part of a small town with barely a mention in the news, but the damage caused to personal lives and businesses is no less real and devastating.
Life does not stop with a disaster. Shareholders, employees, customers, partners, and the community will depend on those businesses once the initial effects are dealt with. A few enterprises, such as stock markets, will need to have zero downtime. Some businesses can survive a few minutes of an outage. And most can survive a few hours or even a few days. But one thing is certain: No modern business can survive the loss of the IT infrastructure that provides them with their data (customer transactions and financial records), their processes (applications and services), and their availability (client computers and/or remote access).

Virtualization, such as Hyper-V, is a superb enabler for disaster recovery. Without virtualization we have to replicate databases, files, configurations, application installations, and so on. The complexity is incredible – so much so that designing a reliable DR architecture was nearly impossible, and testing completely was impossible without affecting those same production systems we want to protect. Virtualization encapsulates our services (operating systems, applications, configurations, and data) into virtual machines. Virtual machines are just files, and files are easy to replicate.

Essential DR Terminology

There are a few terms that you should become familiar with when discussing DR.

  • Primary or Production Site: This is where services normally function.
  • Secondary or DR Site: This is the alternative location where services are replicated to and started in the event of a disaster.
  • Business Continuity Plan (BCP): DR is usually considered a function of IT. The BCP is the business’s all-encompassing plan that describes when the DR site is used, who is involved, what those people’s roles are, how the DR site is started up, and importantly, how the business returns to a primary site.
  • Recovery Time Objective (RTO): This is a measure of how long it takes to complete the invocation of the BCP and get essential services operational in the secondary site.
  • Recovery Point Objective (RPO): This is a measure of how much data, expressed in time, is lost by invoking the BCP.
  • Synchronous Replication: This is a continuous form of replication that offers zero data loss after a disaster. Typically this works by not acknowledging a change in the primary site until the storage in both the primary and secondary sites has been updated. Synchronous replication offers a zero seconds RPO.
  • Asynchronous Replication: This is a form of replication that copies the changes from the primary site to the secondary site on a regular basis. You will not have a zero seconds RPO with asynchronous replication, but the amount of loss can be small and acceptable depending on the frequency and dependability of replication.
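To make RTO and RPO more concrete, here is a minimal Python sketch of how the two could be measured after an incident. The function names are hypothetical, and the worst-case formula is a simplified model that assumes fixed-interval asynchronous replication where each cycle takes a known time to transfer. It is an illustration, not how any particular product calculates these figures.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated: datetime, disaster: datetime) -> timedelta:
    """Data loss measured in time: the gap between the last change that
    reached the secondary site and the moment of the disaster."""
    return disaster - last_replicated


def achieved_rto(disaster: datetime, services_online: datetime) -> timedelta:
    """Time taken to get essential services running in the secondary site."""
    return services_online - disaster


def worst_case_rpo(interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Simplified assumption: a disaster strikes just before a replication
    cycle finishes transferring, so the newest usable copy is one full
    interval plus one transfer time old."""
    return interval + transfer_time


if __name__ == "__main__":
    disaster = datetime(2024, 1, 1, 12, 0)
    print(achieved_rpo(datetime(2024, 1, 1, 11, 55), disaster))
    print(achieved_rto(disaster, datetime(2024, 1, 1, 12, 45)))
    print(worst_case_rpo(timedelta(minutes=5), timedelta(minutes=2)))
```

With synchronous replication the last replicated change and the disaster coincide, so the achieved RPO in this model is zero.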

Disaster Recovery and Backup Are Different Things

Those who are new to the concepts of DR often mix up the roles of backup/restore and disaster recovery. The function of backup is to archive content in an offline store. That data or virtual machines can be restored with some effort and with some delay to the original or an alternative location. There is some amount of data loss in restoring from backup and using the restored data as the primary content. The amount of time to restore the business could be significant and the amount of data loss could be huge, from half a day to a week, depending on when backup media were last sent off-site.
The role of disaster recovery is to replicate data or virtual machines to a secondary (or DR) site on a regular basis. The business can quickly bring its services online in the secondary site in the event of a disaster with minimal or even no data loss.
When comparing the functions of IT, DR is seen as being either hot or warm, and backup is seen as a cold copy. Ask an administrator to restore a business critical application from backup and they’ll shiver with dread. Does the entire business really want to rely on a three-cent roller in an LTO tape or a backup solution that even the IT department fears? That’s why we should use:

  • A DR replica to bring the business back from a disaster. There is usually a low RPO and RTO.
  • Backups as a cold archive for reaching back in time. The RTO is huge because data or virtual machines must be restored across a network, and the RPO is often a day or more, assuming that the backup jobs were functioning correctly before the disaster.

Backup and DR Replication

DR Desires vs. Reality

Every business owner and CIO will say that they want an RTO and RPO of zero seconds. This is possible, in theory. However, they usually change their minds when presented with a proposal that estimates the cost of such an undertaking, and their real needs are soon revealed. Ideally RTO and RPO are zero, but the cost rises exponentially as you get closer to that zero. In reality, most enterprises would be delighted to have an RPO/RTO of less than an hour.
Synchronous replication can be a critical piece of getting a zero seconds RPO. There are a few problems with this:

  • Network link: A very low latency network connection with a large amount of bandwidth is required. While this is not a problem between data centers, it is an issue for branch offices or remote locations.
  • Remote DR site: Some disasters affect not just towns or cities, but entire regions. In those situations you need to have the secondary site as far away as possible. This can cause a problem for synchronous replication because the latency of the replication link may exceed what synchronous replication can tolerate.
  • Cost and complexity: Synchronous replication is done using very expensive solutions, such as a SAN with replication functionality that must be identically deployed in both the primary and secondary sites. This adds cost and increases complexity.

On the other hand, asynchronous replication offers:

  • Higher latency links: More affordable and readily available links can be used by asynchronous replication because the storage in the primary site will acknowledge a change before replicating it to the secondary site.
  • Larger distance between primary and secondary sites: This is made possible by the tolerance of higher latency connections, and means that a huge earthquake or flood in a region doesn’t necessarily destroy the business if the DR site is in another state, country, or continent.
  • Simpler and more affordable: The systems involved in asynchronous replication are usually less complex and more affordable because they don’t have the same requirements for low-latency, fault-tolerant networking and storage.
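The effect of distance on the two replication modes can be roughed out with a little arithmetic. The Python sketch below is a simplified model with hypothetical function names. It assumes that light travels roughly 200 km per millisecond in optical fiber, that local and remote writes proceed in parallel, and it ignores switching, routing, and protocol overhead, so real links will be slower.

```python
# Assumption: signal speed in optical fiber is roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Best-case network round trip between the two sites."""
    return 2 * distance_km / FIBER_KM_PER_MS


def sync_write_latency_ms(local_write_ms: float, remote_write_ms: float,
                          distance_km: float) -> float:
    """Synchronous: the write is acknowledged only after BOTH sites have it,
    so the slower of the local path and the remote path decides."""
    remote_path_ms = round_trip_ms(distance_km) + remote_write_ms
    return max(local_write_ms, remote_path_ms)


def async_write_latency_ms(local_write_ms: float) -> float:
    """Asynchronous: the primary storage acknowledges first, replicates later."""
    return local_write_ms


if __name__ == "__main__":
    for km in (10, 100, 1000):
        print(km, "km:",
              sync_write_latency_ms(1.0, 1.0, km), "ms sync vs",
              async_write_latency_ms(1.0), "ms async")
```

In this model every synchronous write pays for the round trip to the secondary site, while the asynchronous write latency does not change with distance. That is why a far-away DR site and a zero seconds RPO are hard to combine.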

Disaster Recovery Plan: Simplification, Practice, Automation

Think back: What was the worst day of your career in IT? There’s a good chance that it was a time when some essential database was missing or corrupted and you needed to restore it from backup. Your manager was standing behind you, phone in hand, and you could hear an executive screaming for an update on why the CRM (or similar) system was offline. Am I close?
Now imagine that a fire has destroyed the office, taking down your computer room, or your data center was flattened by an earthquake. What do you think that day will be like? It won’t be a walk in the park, that’s for certain!
Three things make for a BCP that you can depend upon: simplification, practice, and automation. I’ll go through each one.

Simplification

This is the beauty of virtualization. Virtual machines are simple because they are files. Or at least, that’s the goal. Unfortunately, some people did not get the memo, and they have continued to deploy passthrough disks in their Hyper-V virtual machines. These raw partitions that are presented to virtual machines are inflexible and create complexity. What we want are files, like VHDX files, that are easy to replicate at the storage or host level. At this point in time, it is safe to say that anyone who is deploying passthrough disks is out of touch with Hyper-V and should seek education.
Simplicity breeds success in DR. Virtualize everything you possibly can. Windows Server 2012 Hyper-V made that easier thanks to the possibility of virtual machines with 64 virtual processors, 1 TB RAM, and lots of 64 TB VHDX files.
That’s the technology end, but you will also find complexity in the human element of the BCP. This is more of a business and company politics issue, but every decision must be clearly binary, every process must be documented, and every communication must be clear. When you hear “We should probably let Bob in Accounts know X because he might get upset,” that is the time to raise your hand, profess your willingness to do what the boss wants, and explain how the added complexity could affect the success of the BCP.

Practice

Why do (American) football teams train so much? It’s because they play a complex sport with many moving pieces and no time for communication. That sounds like a disaster to me! You cannot know if your BCP processes will work or if the DR replication system functions unless you test them. Ideally you will do this on a regular basis and involve all of the possible players that could be engaged when the real thing happens. This will mean rotating team members, not just the seniors, and getting executives involved because they will play a significant role during that dreaded day if it happens.
A good DR solution will allow you to test failover of your services to the secondary site without affecting production systems. This will allow you to verify that data is being replicated, confirm that services will start up, and measure how long the BCP will really take to complete.
The BCP is a living document. Processes will be tuned, replaced, and rewritten based on your practice experience. People will be added or removed. And hopefully, with enough dry runs, you will have BCP veterans on board should a disaster really occur.

Automation

Your worst day of practice will be better than your best day in a disaster. As I just said, practice is essential to success, but people will be stressed when the real thing happens. Parents will be worried about children, people will be worried about spouses, transport will be chaos, and executives will be stressing out the IT staff that they suddenly realized the business depends upon.
Any BCP that relies on huge amounts of manual effort is doomed to fail. What we need is orchestration. For example:

  1. The domain controllers start up first.
  2. Second, the database servers that provide the backend start.
  3. BI and middleware servers then fire up.
  4. Now web servers can be started.
  5. Finally PCs and/or Remote Desktop Services servers can be started to allow end users access to the services of the business.

There is no point in starting up web servers before anything else because those tiers in the services are not ready yet. Orchestration allows us to:

  • Model the dependencies of the services
  • Automate the start-up of tiers in those services
  • Simplify the BCP by removing the stressed-out human element from the processes

The role of the humans is:

  • Start the orchestration
  • Monitor the infrastructure and services and deal with exceptions
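The start-up order described above is a dependency ordering problem, and it can be sketched in a few lines of Python using the standard library’s graphlib module (Python 3.9 or later). The tier names and the start_tier callback are hypothetical placeholders for whatever actually starts your virtual machines. Treat this as an illustration of the idea, not as how any particular orchestration product works.

```python
from graphlib import TopologicalSorter

# Hypothetical tiers: each key lists the tiers that must be running first.
DEPENDENCIES = {
    "domain_controllers": set(),
    "database_servers": {"domain_controllers"},
    "middleware_servers": {"database_servers"},
    "web_servers": {"middleware_servers"},
    "remote_desktop_servers": {"web_servers"},
}


def startup_order(dependencies):
    """Return the tiers in an order that respects every dependency.
    graphlib raises CycleError if the model contains a circular dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_failover(dependencies, start_tier):
    """Start each tier in dependency order using the supplied callback."""
    started = []
    for tier in startup_order(dependencies):
        start_tier(tier)  # placeholder: start the virtual machines in this tier
        started.append(tier)
    return started


if __name__ == "__main__":
    run_failover(DEPENDENCIES, lambda tier: print("Starting", tier))
```

The human role stays small: someone calls run_failover, watches the output, and deals with any tier that fails to start.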

Wrapping Up

That’s disaster recovery in a 2,000-word nutshell! This is a huge topic, and it’s why Hyper-V Replica created so much interest. Over the next few weeks I will spend some time introducing you to Hyper-V Replica and making you an expert in it.