Part 2 - DROPMALFORMED
In this second part, we’ll continue to focus on the DataFrameReader class and look at the DROPMALFORMED option to remove bad data.
In the first part of this 4-part mini-series, we had a gentle introduction to what bad data is and the various options for handling it to maintain the robustness of data pipelines.
We also covered the ways we can handle bad data using the DataFrameReader class:
- Option 0 - Do nothing
- Option 1 - Stop/Fail further data processing (FAILFAST)
- Option 2 - Remove (DROPMALFORMED)
- Option 3 - Replace
- Option 4 - Redirect
Demos will be run on a premium version of Databricks, using Databricks Runtime 8.1, and the demo code can be found in the additional resources section at the bottom of each post.
DROPMALFORMED
Set up data
Using “DROPMALFORMED”, we can drop corrupted/bad data when we read files with the DataFrameReader. It really is that simple; to see an example, check out the screenshot below.
Before and after DROPMALFORMED
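For readers who want to try this without the screenshot, here is a minimal sketch of the kind of sample data the demo relies on. The file path, column names, and values are all assumptions for illustration; the bad value in the quantity column is what we’ll treat as malformed.

```python
# Hypothetical sample data: the "quantity" value in the last row is not an
# integer, so it will not match the target schema defined in the next step.
sample_csv = """order_id,order_type,quantity
1,online,2
2,store,5
3,online,not-a-number
"""

# Write the sample file to DBFS (hypothetical path; dbutils is Databricks-only).
dbutils.fs.put("/tmp/demo/orders.csv", sample_csv, True)  # True = overwrite
```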
In the first step, we want to view the file data based on our target schema. Column values that do not match the target schema are replaced with NULL marks. This is the default behaviour of the DataFrameReader class, also known as PERMISSIVE mode.
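As a sketch of what that first step could look like (reusing the hypothetical file and columns from above), a schema-on-read in the default PERMISSIVE mode turns the bad value into a NULL mark:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical target schema: "quantity" is an IntegerType, so the
# "not-a-number" value in the sample file is malformed.
target_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("order_type", StringType(), True),
    StructField("quantity", IntegerType(), True),
])

# Default (PERMISSIVE) mode: the malformed value comes back as a NULL mark.
df_permissive = (
    spark.read
    .option("header", "true")
    .schema(target_schema)
    .csv("/tmp/demo/orders.csv")
)
df_permissive.show()
```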
When we specify the DROPMALFORMED mode option, we drop rows that contain malformed data. Essentially, any row with a column value that does not match the target schema will be dropped.
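The DROPMALFORMED read is the same call with the mode option added; a sketch, again assuming the file and schema from the setup above:

```python
# DROPMALFORMED mode: any row containing a malformed value is dropped entirely.
df_dropped = (
    spark.read
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .schema(target_schema)
    .csv("/tmp/demo/orders.csv")
)
df_dropped.show()  # the row with the bad "quantity" value is no longer returned
```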
Next, we’ll cover some implications of using DROPMALFORMED, but for those keen to experiment with the sample code, see the additional resources section at the end of this post.
Considerations
Keep “good data” and carry on
For cases where we only need to load data that conforms to our target schema, this may be a viable option, and it arguably works better than the FAILFAST option we explored previously because our data pipeline is not interrupted when it reads bad data.
Continuing data processing after dropping bad data does, however, carry some serious consequences, as described in the next sections.
Loss of insight
Dropping bad (malformed) data means we lose valuable insights into our data. Although there may be an acceptable business case for doing so, from personal experience it is often the case that we do not want to drop bad data.
Dropping bad data can in some cases make things worse. For example, if a customer order row were missing an order type, dropping the row entirely would be bad for data analysis: it would be better to record a NULL mark for the unknown order type than to drop the row and report fewer orders than we actually received.
Change of mind
If the definition of our target schema changes and what was previously “bad data” is now “good data”, what would this mean for your reporting and analysis?
For engineering, it would likely mean a messy backfill process, even if you were fortunate enough to have landed the data in Delta tables and could use time travel, or restore a table to a point from which the data can be re-read.
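For illustration only, and assuming the data had been landed in a Delta table at a hypothetical path, time travel and RESTORE look roughly like this:

```python
# Hypothetical Delta table location, purely for illustration.
delta_path = "/mnt/bronze/orders_delta"

# Time travel: re-read the table as it was at an earlier version.
df_v5 = (
    spark.read
    .format("delta")
    .option("versionAsOf", 5)
    .load(delta_path)
)

# Or roll the table itself back to that version.
spark.sql(f"RESTORE TABLE delta.`{delta_path}` TO VERSION AS OF 5")
```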
Loss of audit
An additional benefit of retaining bad data rather than dropping it is that it highlights source system inadequacies. In our example, the NULL marks are a prompt to investigate the incomplete data: where it went wrong and at what time. This helps with addressing issues at source.
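As a rough sketch of that audit idea, reusing the permissively loaded DataFrame from the earlier sketch, we can count NULL marks per column to quantify how much incomplete data the source is sending:

```python
from pyspark.sql import functions as F

# Count NULL marks per column in the permissively loaded DataFrame; a simple
# audit signal of how much incomplete data arrived from the source system.
null_counts = df_permissive.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_permissive.columns]
)
null_counts.show()
```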
Summary
In this second part of the series, we’ve explored how we can remove bad data altogether using DROPMALFORMED, along with the implications of doing so: some good, but a fair few bad. It carries consequences that should not be taken lightly.
Unless you’re absolutely sure (and even then, I still wouldn’t drop bad data), use DROPMALFORMED with great caution.
In part 3, we will explore a slightly more palatable approach to uninterrupted processing using PERMISSIVE mode.
Thank you for reading and see you all in part 3.
Additional resources
pyspark.sql.DataFrameReader.csv — PySpark 3.1.1 documentation (apache.org)
Table batch reads and writes - Azure Databricks - Workspace | Microsoft Docs