Who's afraid of the big bad DR?
At Sorint we have a lot of experience with Backup & Recovery solutions for the datacenter, of every kind, from simple on-site architectures to the most challenging distributed-site scenarios. And from all of this, the one question that almost always gets answered the "wrong way" by our customers when we are involved in the assessment of their backup design is: "When did you last test your DR plan, and how did it go?"
My question then is: are you happy with your straw house?
There are other versions of that same question, but in summary, to be able to fully answer them you will need:
- To have a documented DR plan for all your assets
- To have tested those DR plans... in the last 6 months?
- To perform a review of the results to fix any inconsistencies in the plan
- To frequently align them with your "Business Continuity Plan"
Unfortunately, the most common scenario is that there is no plan, and where there is one, it exists mostly to comply with the company's internal policies and hasn't really been tested. Now, if neither of those is your case and you were able to answer the previous questions confidently, congratulations! You have built your house out of bricks and you may not need to keep reading.
The background
Traditionally, maintaining DR plans has simply not been a priority, mostly because it has not been cost-efficient... or at least not perceived that way. This is changing now, and we have been able to help some of our customers reach a level of automation that allows them to perform their DRs and DR tests painlessly and frequently.
Some of the scenarios involving (non-automated) DR that have required our assistance in the past have been:
- Common HW failures (from a host to the storage holding the database)
- Infrastructure HW refresh (migration to a new server)
- Ransomware encryption of production data including backup servers
None of those have been easy situations for our customers, and the most common problems found while trying to get their backup services back have been:
- Unknown or unexpected dependencies between services have delayed decision making or led to failed or incomplete DR of those services.
- Lack of well-known communication channels, or their unavailability, has severely impacted the coordination of teams and the prioritization of activities.
- Bottlenecks and errors in planning human activities. It does not matter how many people you involve: recoveries take time and often cannot be parallelized. Knowing when and where to involve the right people will dramatically speed up getting back to business.
The previous problems are not going to be solved by automation alone, but we've found that having an automated process makes testing possible, and with it the early detection of these situations, preparation for them, and mitigation of those issues.
Where to start
Usually, it's the starting point that feels overwhelming. For us, it is always being able to perform a robust recovery of your backup solution, and being able to do it fast. In the end, there may be a multi-TB critical database waiting for the backup infrastructure to come back so it can access that last backup that saves the day.
Three simple steps to get started:
- Choose the simplest scenario to recover
- Automate the process
- Test it (from here, aim for more complex situations and keep automating)
The main outcome of going through this is building trust in the figures for how long it takes to recover the backup infrastructure and bring that service back to the rest of the business.
Other benefits you'll get:
- Deep knowledge of the DR process for your team.
- Early detection of any misconfiguration in your backup system.
- A clear understanding of the dependencies on other teams and services (e.g., how long does it take to provision a new backup server, and how do you request it?).
Automation 101
Some important tips to keep in mind:
- Automation is not magic: there needs to be a manual process stable enough to be automated.
- The tool being automated has to come with the right interfaces to interact with it. Most modern applications do, but a complete CLI or, even better, a RESTful API is a must (see the sketch after this list).
- Automation also requires maintenance (software changes, and DR procedures or systems change over time).
- Production-like infrastructure to test on is a must (virtualization and cloud offerings make this really affordable and easy to accomplish).
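To illustrate what "the right interfaces" enables, a pre-flight check against a management API can be a single Ansible task. The sketch below is generic: the URL path, port and token variable are assumptions for illustration, not any specific product's API.

# Minimal sketch: probe the backup application's management API before any DR step.
# The URL path, port and token variable are illustrative assumptions.
- name: Check that the backup application's API is reachable
  uri:
    url: "https://{{ backup_server }}:{{ api_port }}/api/ping"
    method: GET
    headers:
      Authorization: "Bearer {{ api_token }}"
    validate_certs: false
    status_code: 200
  register: api_ping
  retries: 3
  delay: 10
  until: api_ping.status == 200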
How do we do it
In our case, we start simple, choose the right set of tools, and iterate.
We've mainly been using Ansible to automate the required operations. Why Ansible? For us, Ansible is an easy-to-learn automation tool with a really low footprint on the general configuration of our customers' infrastructure, which eases the adoption of any automation by their operations teams.
One of the keys to automation done right is that it actually reduces the work for the team. If using or maintaining it is complicated, it won't last.
Depending on the scope, we also integrate other tools such as Molecule for Ansible code testing, Packer for OS image creation, Vagrant or Terraform to provision infrastructure... you'll probably need to start thinking about code development and CI pipelines at this point.
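To give a taste of that tooling, a minimal Molecule scenario that spins up a disposable container, applies a role and verifies it can be as small as the sketch below; the platform image and names are assumptions, not our actual setup.

# molecule/default/molecule.yml -- minimal sketch; image and instance name are assumptions
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: nbu-dr-test
    image: rockylinux:8
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible

Running molecule test then creates the instance, converges the role, runs the verification and destroys everything, which is exactly the kind of repeatability a DR procedure benefits from.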
The most important point I want to make here is that there is no rocket science required, just a slightly different set of skills than the ones we are used to handling in these services. And those skills can be learned.
One of our use cases
For this simple case, we automated the provisioning and DR of a NetBackup Master Server for one of our customers. This used to take at least a day of professional services, if not more, depending on the planning required.
Now, the automation allows our customer to independently:
- Prepare the server with the recommended settings from the vendor
- Install and configure the software
- Recover the application catalog of images
- Bring the service back to production
... all of this without supervision from one of our senior consultants.
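To give an idea of the shape of the solution, the top-level playbook for such a flow can stay very small. The sketch below is illustrative only; the play and role names are hypothetical, chosen to mirror the four steps above.

# dr_master.yml -- illustrative sketch only; the role names are hypothetical
- name: Disaster recovery of the NetBackup Master Server
  hosts: nbu_master
  become: true
  roles:
    - role: os_prepare     # recommended settings from the vendor
    - role: nbu_install    # install and configure the software
    - role: dr_process     # recover the application catalog of images
    - role: nbu_validate   # bring the service back to production

With an inventory pointing at the replacement server, the whole procedure becomes a single ansible-playbook run.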
Our code is around 1,000 lines, but with Ansible it all looks as simple as the following excerpt, which checks the status of the NetBackup Master Server before running the DR:
- name: dr_process | dr_pre_tests.yml | Verify NetBackup is up
  shell: /usr/openv/netbackup/bin/bpps -x
  changed_when: false
  register: local_process_list

- name: dr_process | dr_pre_tests.yml | Show running processes
  debug:
    var: local_process_list.stdout_lines

# Abort early if any daemon required for the recovery is missing
- name: dr_process | dr_pre_tests.yml | Verify NBU processes are running
  fail:
    msg: "Some critical processes for recovery are not running"
  when: >
    not local_process_list.stdout is search('bprd') or
    not local_process_list.stdout is search('bpjobd') or
    not local_process_list.stdout is search('nbrb') or
    not local_process_list.stdout is search('bpdbm') or
    not local_process_list.stdout is search('bpcd') or
    not local_process_list.stdout is search('nbemm') or
    not local_process_list.stdout is search('vmd')

# Extract the number of active jobs from the bpdbjobs summary output
- name: dr_process | dr_pre_tests.yml | Read running jobs
  shell: /usr/openv/netbackup/bin/admincmd/bpdbjobs -summary -noheader | tr -s ' ' | cut -d' ' -f4
  changed_when: false
  register: local_job_status

- name: dr_process | dr_pre_tests.yml | Read running jobs details
  shell: /usr/openv/netbackup/bin/admincmd/bpdbjobs -noheader -most_columns
  changed_when: false
  register: local_job_details
  when: local_job_status.stdout != "0"

- name: dr_process | dr_pre_tests.yml | Show running jobs details
  debug:
    var: local_job_details.stdout_lines
  when: local_job_status.stdout != "0"

- name: dr_process | dr_pre_tests.yml | Jobs currently running in the NBU domain
  fail:
    msg: "Jobs are currently running in NetBackup"
  when: local_job_status.stdout != "0"

- name: dr_process | dr_pre_tests.yml | Calculating free space in "{{ dr_process__catalog_path }}"
  shell: df "{{ dr_process__catalog_path }}" --output=avail | tail -1
  changed_when: false
  register: local_available_catalog_space

# Both values are expected in the same unit (1K blocks, as reported by df --output=avail)
- name: dr_process | dr_pre_tests.yml | Checking available space in "{{ dr_process__catalog_path }}" is at least 20% bigger than catalog size
  fail:
    msg: "Not enough space, required {{ dr_process__catalog_size | int * 1.2 }}, available {{ local_available_catalog_space.stdout }}"
  when: ( local_available_catalog_space.stdout | int ) < ( dr_process__catalog_size | int * 1.2 )
The result of the automation is an unattended DR of the Master Server on a brand-new server, finished in 15 minutes (plus the time spent waiting for the actual catalog recovery to finish).
Conclusion
Nowadays, it is no secret that the most complex DR scenarios are created by ransomware attacks (we already covered general design principles of the backup architecture to address these problems in another entry), but here I just introduced some benefits and tips for automating the process of recovering your backup infrastructure.
Over the years, we have been involved in different scenarios that have required our professional services to perform DR procedures on the backup infrastructure. Depending on the solution, the pain points of this process lie in different areas, but in the end, the common outcome is a lot of planning and unexpected situations on the actual day of the DR.
Being prepared for the worst in IT shouldn't be a project for the future. Having an automated DR plan can save time, but sometimes it simply saves companies.
The time spent creating the automation for these activities quickly pays for itself for operations teams.