Part 1 - Failfast
Receiving bad data is often a case of “when” rather than “if”, so the ability to handle bad data is critical in maintaining the robustness of data pipelines.
In this beginner-friendly four-part mini-series, we’ll look at how we can use the Spark DataFrameReader to handle bad data and minimise disruption in Spark pipelines. There are many other creative methods beyond what will be discussed here, and I invite you to share them if you’d like.
What is bad data?
This series is about demonstrating a few simple methods and the considerations that come with each one.
I like the idea that “bad data is data that gets in the way”. Data that is incomplete, inaccurate or missing makes us work harder than we should, instead of letting us focus on getting insights from our data for informed decision making.
Bad data is also subjective: for some, the phrase “Garbage In, Garbage Out” springs to mind when thinking of bad data, while for others, bad data may still yield valuable insights.
Ultimately, with the widespread use of machine learning to predict outcomes from data, bad data can have far-reaching consequences.
Handling bad data
There are several ways we can handle bad data:
- Option 0 - Do nothing
- Option 1 - Stop/Fail further data processing
- Option 2 - Remove
- Option 3 - Replace
- Option 4 - Redirect
Do nothing
This is what we start with by default if we do not put any measures in place to handle bad data. The risk to the business of inaccurate, missing or downright wrong data is at its highest, and so is the risk of significant troubleshooting or re-work later. It’s also the easiest option in terms of short-term coding effort.
Arguably, for a small, throwaway demonstration it may be acceptable to ignore handling of bad data, but if we want to build mature, robust data pipelines, this should be our least favoured option.
I will assume that, in reading this post, you want to do something about bad data, so this first option of “do nothing” will not be elaborated on any further.
Demo
Set up
Throughout this mini-series, the demonstrations will be run on a premium version of Databricks using Databricks Runtime 8.1, and the demo code can be found in the additional resources section at the bottom of each post.
Typically, a set-up notebook will accompany the demo notebooks and, in this demo, we’ll create a file with a mix of “good” and “bad” data in it. See below.
set up demo data
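The original post shows the set-up as a notebook screenshot. A minimal sketch of what that set-up might look like is below; the file path (`/tmp/demo/bad_data.csv`) and the three-column layout, in which `score` should be an integer, are assumptions for illustration and may differ from the actual demo notebook.

```python
# A minimal sketch of the set-up step (the original uses a Databricks notebook).
# The file path and column layout below are assumptions for illustration.
demo_csv = """id,name,score
1,Alice,10
2,Bob,20
3,Charlie,not_a_number
4,Dana,40
"""

# dbutils is available in Databricks notebooks; outside Databricks you could
# write the file with plain Python instead.
dbutils.fs.put("/tmp/demo/bad_data.csv", demo_csv, overwrite=True)
```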
Option 1 - FAILFAST
The first option we’ll look at is to stop processing if any bad data is encountered. There are numerous ways of doing this; the one we’ll explore here is the FAILFAST option.
In Spark, when we read files using the DataFrameReader, the default mode is “PERMISSIVE”. We’ll cover this in later parts of the series, but essentially, malformed column values (i.e. bad data) are set to NULL. An example using our demo data is as follows:
default bad data handling
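The screenshot above comes from the demo notebook; a sketch of the equivalent code is below, reusing the file path and columns assumed in the set-up sketch. With an explicit schema in which `score` is an integer, the row containing `not_a_number` is malformed and its value comes through as NULL under PERMISSIVE mode.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema: "score" is an integer, so the "not_a_number" row is
# malformed with respect to this schema.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("score", IntegerType(), True),
])

# PERMISSIVE is the default mode: malformed column values are set to NULL.
df = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .schema(schema)
    .csv("/tmp/demo/bad_data.csv")
)
df.show()  # the malformed row shows NULL in the "score" column
```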
There are options we can employ to stop processing when we encounter bad data rather than setting it to NULL, and one of those is FAILFAST.
FAILFAST does what it says on the tin: it “throws an exception when it meets corrupted records”. We can see this in an example as shown below.
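A sketch of the same read with FAILFAST is below, reusing the schema and path assumed earlier. Note that the exception surfaces when an action (such as `show`) triggers the read.

```python
# FAILFAST: the first corrupted/malformed record raises an exception instead
# of being silently set to NULL.
try:
    (
        spark.read
        .option("header", "true")
        .option("mode", "FAILFAST")
        .schema(schema)
        .csv("/tmp/demo/bad_data.csv")
        .show()
    )
except Exception as e:
    print(f"Processing stopped: {e}")
```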
It’s as straightforward as it is a sledgehammer of an option. If this were used in production as the means of filtering out bad data, the process would be best described as brittle. Any row of bad data would immediately stop data processing, crippling our ability to gather insights from the data.
As options go, it’s better than not handling bad data at all, but it is rudimentary at best. Arguably, the PERMISSIVE default makes more sense than FAILFAST in many cases.
Delta Lake
Arguably, the tried and trusted approach of schema enforcement on a table could be put in the same category as FAILFAST.
Schema enforcement in Delta will prevent bad data insofar as only data that complies with the enforced schema can be brought in. Any write of non-compliant data will fail unless we evolve or overwrite the schema.
This is great because, by default, we have a method of handling bad data, but simply throwing an exception when data type mismatches occur is rarely enough.
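As a rough illustration (not the post’s demo code), the sketch below appends mismatched data to a Delta table at a hypothetical path; schema enforcement rejects the write unless a schema change is explicitly opted into.

```python
# Create a small Delta table where "score" is a numeric column.
good_df = spark.createDataFrame(
    [(1, "Alice", 10), (2, "Bob", 20)], ["id", "name", "score"]
)
good_df.write.format("delta").mode("overwrite").save("/tmp/demo/delta_scores")

# Appending data whose "score" column is a string violates the enforced schema,
# so the write fails with a schema mismatch error.
bad_df = spark.createDataFrame(
    [(3, "Charlie", "not_a_number")], ["id", "name", "score"]
)
try:
    bad_df.write.format("delta").mode("append").save("/tmp/demo/delta_scores")
except Exception as e:
    print(f"Write rejected by schema enforcement: {e}")

# Intentional schema changes require an explicit opt-in, e.g.
# .option("mergeSchema", "true") for new columns, or
# .option("overwriteSchema", "true") when overwriting the table.
```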
Summary
To introduce the series, we’ve looked at what bad data is (a subjective matter) and mentioned some ways we can handle it.
Failing data processing when bad data is encountered, as FAILFAST does, is at least some effort towards mitigating bad data, but in most cases it is a harsh approach.
In the second part, we’ll look at removing bad data altogether and the implications of this.
Thanks for dropping by and see you in part 2.
Additional resources
pyspark.sql.DataFrameReader.csv — PySpark 3.1.1 documentation (apache.org)
https://towardsdatascience.com/data-quality-garbage-in-garbage-out-df727030c5eb
https://en.wikipedia.org/wiki/Garbage_in,_garbage_out
https://oreilly.com/library/view/bad-data-handbook/9781449324957/ch01.html