Thursday, February 08, 2018

On the importance of testing your backups

You're backing up your databases. The jobs run when they're supposed to, you've chosen the right types of backups, got all your settings dialed in, they take the expected amount of time, the files are there on disk. Everything's good!

Well, maybe.

Have you restored your backups? You should. I'd even go so far as to say that without test restores, your careful backup strategy was a waste of time and energy.

"But the backups complete without errors," you say. And that's certainly good, but it's only part of it. Things can go wrong in backups. Disk hiccups after the backup is generated can cause problems. Bugs in the database engine can cause unexpected problems as well. Different database products will have different issues and quirks, but the key thing to remember is that they ALL have them. Your preferred RDBMS is no different. It's up to you as the DBA to learn what they are, and to learn how to work with and through them.

This is one that I've personally run into: "RESTORE DATABASE is terminating abnormally" error message when you perform a full backup of a database in SQL Server 2008 R2

We added a new data file to a database, and due to a quirk in our naming convention the truncation caused a problem: we wound up with two identical logical filenames in the backup file list, and we couldn't restore the backup. We were fortunate that we didn't need to do any restores for that database before we found the problem, and Microsoft did release the fix in the linked article, but we would've been in serious trouble if we hadn't caught the issue and suffered data loss.

You also need to test your restores in a realistic fashion. How much data can you recover? How long will it take? This is when you will need to consider the business's needs.

There are two objectives around recovery that you need to consider: the recovery point objective (RPO) and recovery time objective (RTO). In brief, they respectively address how much data the business is comfortable with losing, and how long the business is comfortable with a database being offline in case of failures. People too often assume that both RPO and RTO will be close to zero. It's certainly possible to get very close to instantaneous recovery with no data loss, but it can get very expensive very quickly as you're then getting into high availability (HA) territory. The old saw, 'fast, cheap, and good: pick two,' definitely holds true here!

If you don't know how long your restore will take, you won't know if you can meet the RTO. If you don't know how much data is available in your backups, you won't know if you can meet RPO. If your backups are bad, you're guaranteed not going to meet either the RPO or RTO unless neither matters for the instance (but how many instances do you manage that fall under that 'it would be nice to recover it, but we can live without it' heading? I know I don't have many in production). If you haven't talked with your management about their expectations around RPO and RTO, please do so.

While RPO and RTO are certainly important considerations, they can be ongoing topics of conversation - your first order of business is to ensure that your backups first CAN be restored, and second that they give you the results you expect. As with anything, practice makes perfect. Openness and honesty will go a long way if you identify gaps. These can be hard things to bring up, I know - I've had to have a few unpleasant conversations on this topic over the course of my career. But how much worse will the conversations be if something fails and you can't recover?

Test your backups, folks! You'll be glad you did!

No comments: