The art of backup - an introduction

 

For first-timers trying to understand how to back up anything beyond your own PC (read our blog on the subject or look at our FAQs), the wealth of terminology can be daunting. As you search the internet, you will find references to terms such as RTO and RPO, and methodologies with odd names such as “Tower of Hanoi”, “Grandfather-Father-Son” and many more. Some of these terms originate from the time when we backed up to tape drives, and while they are still relevant today, understanding these concepts and applying them to a modern environment can be challenging. Add to that the complexity of multi-site setups, cloud technology, agile development and the shift of control over the entire IT environment to a shared model that includes DevOps teams, and it can become overwhelming for a newcomer. With this first blog of the series we aim to provide an introduction and discuss some of the key concepts and terms you will encounter. Later blogs will deep-dive into specific subjects. So let's get started.
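To make one of those oddly named methodologies a little more concrete: a Grandfather-Father-Son scheme keeps daily (“son”), weekly (“father”) and monthly (“grandfather”) copies, each with a different retention period. The sketch below classifies a backup date into a tier; the first-of-month and Sunday boundaries are just one common convention, not a standard, and real schemes vary per organization.

```python
from datetime import date

def gfs_tier(day: date) -> str:
    """Classify a backup date under a simple Grandfather-Father-Son scheme.

    Monthly ("grandfather") on the first of the month, weekly ("father")
    on Sundays, daily ("son") otherwise. Boundaries are illustrative only.
    """
    if day.day == 1:
        return "monthly"    # grandfather: retained the longest
    if day.weekday() == 6:  # Sunday
        return "weekly"     # father: retained for a few weeks
    return "daily"          # son: recycled soonest

# Example: classify a few dates from March 2024
print(gfs_tier(date(2024, 3, 1)))  # first of the month -> "monthly"
print(gfs_tier(date(2024, 3, 3)))  # a Sunday -> "weekly"
print(gfs_tier(date(2024, 3, 4)))  # an ordinary Monday -> "daily"
```

The point of the tiers is that older copies are thinned out rather than kept forever, trading media cost against how far back you can restore.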

Before getting into the guts of approaches, technology and so on, let’s start with the business requirements. Understanding the requirements of your business is key and the starting point of every backup initiative. In most cases you will quickly find that you have different sources of data, ranging from laptops and desktops to file servers, databases, cloud systems and services, and more. Most likely the data residing on these systems does not all need to be treated the same way: your e-commerce system's back-end database is almost certainly more critical than an employee laptop, so the different sources will need to be reviewed separately and requirements established for each. More often than not your architects will need to be involved as well to help you understand the system dependencies, since they will know what measures have been put in place to ensure high availability and fault tolerance of critical systems, and also which components make up an entire system. This helps you avoid backing up only parts of a critical system and then finding out that you cannot restore service and data quickly enough in case of failure (something tests should naturally have uncovered before you find yourself in a pickle). Ultimately, two key metrics that this process will deliver are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). Let’s have a closer look at these terms:

 

Recovery Point Objective (RPO)

Recovery Point Objective (RPO) defines the maximum interval of time that might pass during a failure or disruption before the amount of data lost exceeds an acceptable threshold. Put otherwise, it defines the point in time you need to be able to restore back to. The RPO is usually defined in a company's Business Continuity Plan.

Example: You have an outage and the company’s RPO is 24 hours. Your last available good copy of the data affected is from 18 hours ago, so you are still within the parameters of the RPO set in the Business Continuity Plan (BCP).

Recovery Time Objective (RTO)

The Recovery Time Objective (RTO) is the time within which, and the service level to which, a service must be restored after a disaster to avoid unacceptable consequences, in line with the goals defined in the Business Continuity Plan. RTO answers the question: “How much time can pass after notification of a business process disruption before the service must be recovered?”

RPO and RTO are linked

The two objectives are obviously linked: the RPO designates the amount of data that will be lost or will have to be re-entered during an incident, while the RTO designates the amount of real time that can pass before the disruption begins to seriously and unacceptably impede the flow of normal business operations.

There is always a gap between the objectives and the actuals – Recovery Time Actual (RTA) and Recovery Point Actual (RPA) – introduced by the various manual and automated steps required to bring a business application back up. These actuals can only be exposed by disaster and business disruption rehearsals, which should be standard practice and performed regularly.

Example combining RTO and RPO

For simplicity’s sake, let’s assume in this example that you are using traditional tape-based backups, you run your backups at 06:00 and 18:00, and each backup run takes 1 hour. If you had a failure at 13:00, the only option left would be to restore the 06:00 backup, which leaves you with an RPA of 7 hours and an RTA of 1 hour.
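The arithmetic above can be sketched as a small helper. The schedule, durations and failure time are the hypothetical values from the example, and the restore is assumed to take as long as the backup run.

```python
from datetime import datetime, timedelta

def recovery_actuals(backup_starts, backup_duration, restore_duration, failure_time):
    """Return (RPA, RTA) for a failure, given a list of backup start times.

    A backup is only usable if it finished before the failure occurred;
    data written after the most recent usable backup started is lost.
    """
    completed = [s for s in backup_starts if s + backup_duration <= failure_time]
    last_good = max(completed)       # most recent usable backup
    rpa = failure_time - last_good   # data written since then is lost
    rta = restore_duration           # time needed to restore that copy
    return rpa, rta

day = datetime(2024, 1, 1)
rpa, rta = recovery_actuals(
    backup_starts=[day.replace(hour=6), day.replace(hour=18)],
    backup_duration=timedelta(hours=1),
    restore_duration=timedelta(hours=1),
    failure_time=day.replace(hour=13),
)
print(rpa, rta)  # 7:00:00 1:00:00 — the 18:00 backup never ran, so 06:00 is the last good copy
```

Note that the 18:00 backup is excluded because it had not completed (or even started) by the time of the 13:00 failure.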

 

One of the first questions you will likely ask yourself is where to store your backups, and the wealth of options can be daunting. More traditional options include tapes, read-only media and disks, which can then be stored safely off-site. In environments with larger data volumes, snapshots have added another tool to our toolbelt for certain situations, and over the past years a range of cloud-based options has emerged, from simple ones such as moving data from an AWS S3 bucket to AWS Glacier, to full hybrid solutions such as BeBack backup. The choice of medium and approach will largely depend on the environment and infrastructure you possess (main site, remote site, cloud environment, etc.), your RTO and RPO, cost considerations, security, privacy and compliance requirements, as well as the amount of data, to name just a few factors. An additional consideration, specifically in Disaster Recovery situations, is whether the system state needs to be backed up. A system state backup allows you to restore a system onto different hardware and run from there quickly in case of a failure. For example, if a physical server fails you could restore it to a virtual machine from the system state backup and bring it back up very quickly. In combination with cloud offerings this can help improve availability while reducing cost.
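As an illustration of the S3-to-Glacier option mentioned above, an S3 bucket lifecycle rule can transition objects to the Glacier storage class after a number of days. The sketch below only builds the rule document; the prefix, day count and bucket name are hypothetical, and actually applying the rule would require the `boto3` library and AWS credentials.

```python
def glacier_lifecycle_rule(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that moves objects under
    `prefix` to the GLACIER storage class after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }

config = glacier_lifecycle_rule("backups/", 30)

# Applying it (not run here; "my-backup-bucket" is a placeholder) would
# look roughly like:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-backup-bucket", LifecycleConfiguration=config)
```

The appeal of this kind of rule is that the archival step runs inside the storage service itself, so no backup job of your own has to move the data.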

Closely linked is the question of how to manage your backups. If you have a smaller environment and disks or tapes do the trick, managing backups can be as simple as keeping a log of the backups and tape or disk rotations. But as your environment becomes larger, more diverse and more complex, and possibly even includes multiple sites, cloud and other platforms such as Microsoft Office 365 or Google G-Suite, automation support is required. The wealth of solutions on the market is staggering, but at the end of the day they can be separated into “point solutions” and “framework solutions”. Point offerings can be very powerful, but often fail to provide a good overview of where you stand and fall short once requirements change. Framework solutions aim to provide a full framework that lets you grow from a single platform and usually manage your desktops, servers, databases and cloud environments from a “single pane of glass”, giving you a near-real-time overview of the status of your backups.

Whatever solution you end up selecting for your environment and needs, security and compliance are paramount. Data needs to be secured in transit as well as at rest. In traditional environments, with a SCSI tape drive sitting next to your server in a secured data center and tapes being moved to a safe, things are comparatively simple. But in today’s enterprise environments it is essential that data in transit is encrypted and secured, and that data at rest is equally encrypted, specifically if the backups are stored on systems that you don’t own and that operate under a shared responsibility model, as is the case with public cloud systems. Regulations and standards such as privacy legislation (EU GDPR, California CCPA, Singapore PDPA and others), healthcare-specific regulation such as the US HIPAA, or payment industry standards such as the Payment Card Industry Data Security Standard (PCI DSS) all affect how we need to treat our data and will consequently also affect your backup approach.

In this blog we have covered some of the key terms and points that will affect your approach to your organization's business continuity and backup. This introduction is far from exhaustive, but hopefully it has provided an understanding of the kinds of considerations that will ultimately drive the technical and operational approach. In future blogs in the series we will drill down into specific subjects and questions.

To hear about future blogs, please follow us on LinkedIn or Twitter.