Part 4 - Bad records path
In part 4, the final part of this beginner’s mini-series on handling bad data, we will look at how we can capture bad data while letting our jobs proceed uninterrupted.
Specifically, we’ll use the “badRecordsPath” option in Azure Databricks, which has been available since Azure Databricks Runtime 3.0.
In Parts 1-3, we looked at the following options for handling bad data using the DataFrameReader class:
- PERMISSIVE
- DROPMALFORMED
- FAILFAST
Of the options considered so far, PERMISSIVE has proved to be the most flexible for a common need to be able to capture bad data and run uninterrupted.
Demo
Set up
This demo is run on a premium version of Databricks, using Databricks runtime 8.2 and the demo code can be found in the additional resources section at the bottom of the post.
The data setup will be similar to the setup in part 1, part 2 and part 3, where we’ll create a file with a mixture of “good” and “bad” data in it. Here we’ll demonstrate how we can use the “badRecordsPath” option to divert bad data, so our Spark jobs can continue to run.
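As a sketch of that setup, the snippet below writes a small CSV with a mixture of rows; the path, column names and values are illustrative rather than the actual demo data (which is in the additional resources):

```python
# Illustrative source file: two rows that will parse against an
# (int, double) schema and one row that won't. The path and
# column names are made up for this sketch.
csv_lines = [
    "id,amount",
    "1,100.50",    # "good" - matches the target schema
    "2,75.00",     # "good"
    "three,oops",  # "bad" - neither field parses
]

with open("/tmp/demo_source.csv", "w") as f:
    f.write("\n".join(csv_lines) + "\n")
```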
Recall that bad data can be a subjective term and essentially, what we have is some data that won’t match our target schema.
Bad records path
According to Azure Databricks, “badRecordsPath” is a “unified interface for both corrupt records and files”. Xiao Li summarised the benefits well at Spark Summit June 2017:
- Enabling multi-phase data cleaning - DROPMALFORMED + Exception files
- No need for an extra column for bad data (corrupt records)
- Recording the exception data, reasons and time.
Usage
In the example below, we will read the source file into a DataFrame and pass the “badRecordsPath” option, specifying a file path for any exceptions that do not match our schema.
We will then immediately write the data to a “target” table, which will contain only the good data. An example of this is in the following screenshot.
Source data
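As a rough sketch (not the post’s exact notebook code), the read-and-write step looks something like this, assuming a CSV source and a simple two-column schema; the paths, schema and table name are illustrative:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Illustrative target schema - substitute your own.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("amount", DoubleType(), True),
])

df = (spark.read
      .format("csv")
      .option("header", "true")
      # Records that fail to parse are diverted to JSON exception
      # files under this path instead of failing the job.
      .option("badRecordsPath", "/tmp/demo_bad_records")
      .schema(schema)
      .load("/tmp/demo_source.csv"))

# Only rows that matched the schema reach the target table.
df.write.mode("overwrite").saveAsTable("target")
```

Note that this needs a Databricks session (with `spark` already defined, as in a notebook), since badRecordsPath is a Databricks-specific option rather than part of open-source Spark.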
Now when we query the target table, we only have the 2 rows that match our target schema as shown below.
Select target table
Regarding the bad data that didn’t make it into our target table, we’ll create a log table that will allow us to easily query the bad data we’ve captured.
Setup logging
After the table setup, we’ll then read and insert the log data as shown in the following screenshot.
Capture bad data
Using the nifty COPY INTO command, we go through the directory location where the exception files are saved, read the data and INSERT INTO the log table. And that’s it!
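In sketch form, the log-table step might look like the following; the table name, columns and path are illustrative, and the exception-file fields (path, record, reason) are assumptions based on the Databricks documentation:

```python
# Create a log table for the exceptions, then load the JSON
# exception files that badRecordsPath wrote. Names and columns
# here are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS bad_records_log (
        path   STRING,
        record STRING,
        reason STRING
    )
""")

# COPY INTO only loads files it hasn't seen before, so this can
# be re-run safely as new exception files appear.
spark.sql("""
    COPY INTO bad_records_log
    FROM '/tmp/demo_bad_records'
    FILEFORMAT = JSON
""")
```

As with the read step, this is a notebook sketch that assumes a live Databricks session.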
Having removed bad data via the badRecordsPath from one table, we’re now able to easily analyse and remediate in another table.
The log table in our demo gives us the ability to see why we have bad data (reason), where the bad data came from (path) and the log location (file_name).
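Outside of Spark, you can also inspect the exception files directly. The sketch below assumes the files are newline-delimited JSON under a timestamped `bad_records` subdirectory, each entry carrying fields such as `record` and `reason` - the field names are assumptions based on the documented layout:

```python
import glob
import json
import os

def collect_bad_records(bad_records_path):
    """Walk a badRecordsPath directory and collect each logged
    exception along with the exception file it came from.
    Assumes newline-delimited JSON files under .../bad_records/."""
    entries = []
    pattern = os.path.join(bad_records_path, "**", "bad_records", "*")
    for file_name in glob.glob(pattern, recursive=True):
        with open(file_name) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                entry = json.loads(line)
                entry["file_name"] = file_name  # track the log location
                entries.append(entry)
    return entries
```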
Caveats
The badRecordsPath option is not perfect and the Databricks documentation lists out some caveats to be aware of such as:
- It is non-transactional and can lead to inconsistent results.
- Transient errors are treated as failures.
There’s also different treatment for CSV and JSON sources vs other file-based sources to be aware of, which we’ll cover in a future post.
Summary
In the absence of elaborate exception handling solutions, using the “badRecordsPath” is a decent option to separate good and bad data, providing a simple solution for handling bad data gracefully, rather than allowing bad data to interrupt our data pipelines.
Have a play with it and do share your experiences with what has been presented in this post and mini-series.
Thanks for dropping by. Until next time.
Additional resources
Handling bad records and files - Azure Databricks - Workspace | Microsoft Docs
Building Robust ETL Pipelines with Apache Spark (slideshare.net)