Delta Lake table restore

One of the most common reasons to perform a restore is to recover a single table. In this post, we’ll look at how one of delta lake’s neat features allows us to accomplish fast and simple table restores to previous versions.

The big deal

Having done a fair share of table restores in SQL Server/Azure SQL Database, I really do appreciate the ability to restore tables in delta.

For many years, table restore has been a popular feature request for Microsoft to introduce and, in its absence, one has to perform a clunky restore of an entire database first, prior to restoring the table from it.

As we’ll show in the rest of the post, the restore process in delta lake is a lot simpler to implement.

Delta restore

If you’re following along with this article, I assume you know what a delta table is and that it’s really the gold standard for data lakes. If not, feel free to browse the documentation in the additional resources section at the end of the post.

Let’s dive straight into it.

Restore syntax

The syntax to restore a delta table can target either a metastore table or a delta file path, and can be based on a version or a timestamp. Simple and neat, right? 😊

RESTORE TABLE db.target_table TO VERSION AS OF <version>

RESTORE TABLE delta.`/data/target/` TO TIMESTAMP AS OF <timestamp>

How we perform a restore will be covered next, in our example scenario.

Example scenario

Set up

This demo is run on the Community Edition of Databricks, on Databricks Runtime 7.4 or later.

The demo includes the following steps to get the data set up:

1. Create a sample database.

2. Load the AdventureWorks Sales data, available in dbfs, for 2013-10-01.

3. Create a delta table using that sample data set (a sketch of these steps follows the list).
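The sketch below is a minimal, hypothetical version of this set-up: the parquet source path and the OrderDate filter are assumptions, so substitute the dbfs location and columns of your copy of the data.

-- 1. Create a sample database.
CREATE DATABASE IF NOT EXISTS demo;

-- 2 & 3. Load the AdventureWorks Sales data for 2013-10-01 and create a
-- delta table from it. The source path below is hypothetical.
CREATE TABLE demo.sales
USING DELTA
AS
SELECT *
FROM parquet.`/path/to/adventureworks/sales/`
WHERE OrderDate = '2013-10-01';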

Set the scene

In this scenario, we have some “Sales” orders loaded and we’ve accidentally issued an update against all rows. We want to restore the table to an earlier point, before the data was accidentally updated.

Starter for 10

Let’s begin by viewing the current state of the newly created delta table by checking its contents and version history.

To view table versions in delta, we can issue the DESCRIBE HISTORY command on the “Sales” delta table as shown below.

DESCRIBE HISTORY demo.sales

[Screenshot: check orders and current version]

Dodgy update

To make our scenario relatable, we’ll issue an UPDATE without a WHERE clause to mimic a data change mishap. The stuff of nightmares!

We’ll also check the state of sales orders for '2013-10-01', of which we expect zero rows to be returned, and view the version history of the demo.sales table to show what operations have taken place. A sketch of these statements follows.
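As a rough sketch, and assuming it was the OrderDate column that the accidental update overwrote with a wrong value (the original statements aren’t shown, so both the column and the value are assumptions):

-- The accidental update: no WHERE clause, so every row is changed.
UPDATE demo.sales SET OrderDate = '2013-10-10';

-- Check the orders for the loaded date; we now expect zero rows back.
SELECT * FROM demo.sales WHERE OrderDate = '2013-10-01';

-- View the operations that have taken place on the table.
DESCRIBE HISTORY demo.sales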

[Screenshot: delta update and missing Sales]

Restore to the rescue

Don’t fear,

Delta Lake restores are here!

By running the RESTORE command, we can restore the “Sales” delta table to the original version quickly and simply, as follows:

RESTORE TABLE demo.sales TO VERSION AS OF 0

This is great. No mess, no fuss and, personally, I love the fact that the restore operation is logged in the table history, so the audit trail of events is clear for all (with permissions) to see. The screenshot below shows this.

[Screenshot: restore to previous version]

Timestamp

For the sake of showing how we can restore the delta table using a timestamp, we can run the following command:

RESTORE TABLE demo.sales TO TIMESTAMP AS OF '2021-04-14T06:22:00.000'

The results are shown in the screenshot below.

[Screenshot: restore delta table to previous time]

Python example

For those who prefer Python (or hate SQL), or are curious about what this would look like in PySpark, it’s similarly straightforward.

We use the classmethod forName to specify which metastore table we want to restore, and then we issue the restore command based on version or timestamp, as sketched below.
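The following is a minimal sketch of those steps, using the DeltaTable class from the delta.tables module (the exact notebook cells from the original screenshots aren’t reproduced here):

from delta.tables import DeltaTable

# Reference the metastore table to restore by name.
sales_table = DeltaTable.forName(spark, "demo.sales")

# Restore based on version...
sales_table.restoreToVersion(0)

# ...or, alternatively, based on timestamp.
# sales_table.restoreToTimestamp("2021-04-14T06:22:00.000")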

Looking at the version history, we’re back to our original version with the correct sales OrderDate.

[Screenshot: restore delta table to oldest version]

Under the hood

While this is all fine and dandy, what is actually happening under the covers when we update data and restore it to a previous version?

There’s quite a bit happening, and that’s the subject of a later post. 😉

Challenges of restoring data

With what has been mentioned so far, the following is a list of some challenges of restoring data and how delta lake helps mitigate them. It’s important to note that, even with Delta, the data must still fall within the data retention threshold for these mitigations to apply.

Complexity

Scenario: A separate process needs to locate backups externally, then restore the backup and finally, the table data.

Delta restores: Single command for table restores to a previous version.

Time

Scenario: Time taken to locate a backup, restore a backup and potentially downtime during restore.

Delta restores: Time taken is minimised as the restore of table versions is built in. This also means a reduction in downtime.

Effort

Scenario: Effort needed to coordinate processes to locate backups externally, then restore the backup and finally, the table data.

Delta restores: Minimises effort by using built-in commands for table restores.

Considerations

Runtime

We covered this earlier, but make sure you’re running Databricks Runtime 7.4 or later. Without this, the RESTORE command will be unavailable.

Data files

Simply put, the restore will fail if the data files for your delta table are missing because, as one would expect, delta lake restores rely on having retained data old enough to restore to. There is the possibility of a partial restore in cases of missing data by setting spark.sql.files.ignoreMissingFiles to true.
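As a quick sketch, that setting can be enabled for the session like so:

SET spark.sql.files.ignoreMissingFiles = true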

Where we use the VACUUM command on our delta tables to clean up data files older than the specified retention period, we lose the ability to restore data to a previous version that falls outside of that retention interval. We can mitigate against this, as we’ll see in the following section.
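For reference, a typical VACUUM invocation on our demo table looks like the line below; the RETAIN clause is optional and makes the retention threshold explicit.

-- Remove data files no longer referenced by table versions, retaining 7 days.
VACUUM demo.sales RETAIN 168 HOURS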

Data retention

By default, VACUUM will purge data files belonging to table versions older than 7 days. This can be changed, for example, to extend to 30 days or longer depending on your requirements, but be careful.

Set the value too high and, whilst you’ll have more data available to recover to, storage costs will increase as a result. Set it too low and you’ll minimise storage costs, but at the risk of a smaller window of data available for immediate recovery.

Configure the retention properties to an interval that covers your intended restore window. Getting this right requires an understanding of your data recovery requirements.

The following example shows us how we can configure data retention.

set delta.deletedFileRetentionDuration = "interval 30 days"
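For completeness, the same property can also be set per table via table properties; demo.sales below stands in for your own table:

ALTER TABLE demo.sales SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days')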

Summary

It’s hard not to appreciate the work that has been done to enable this small but quite powerful feature of delta lake.

These restores are useful in development or testing, where we quickly want data restored to a specified version or time, or in production, for more serious cases where data has been accidentally inserted, updated or deleted.

It’s no substitute for making sure we’re careful with our data processing though, so keep those data retention settings optimal and watch those maintenance jobs like VACUUM.

As much as delta restores can be fast and simple, let’s minimise the need to do them in the first place so we can focus on other activities we ought to be doing.

Thanks for reading! Until next time.

Additional resources

https://delta.io/

https://docs.microsoft.com/en-us/azure/databricks/delta/delta-batch#data-retention

https://docs.microsoft.com/en-us/azure/databricks/delta/delta-utility#restore-a-delta-table-to-an-earlier-state