Part 3 - Permissive


In the 3rd instalment of this 4-part mini-series, we will look at how we can handle bad data using PERMISSIVE mode. It is the default mode when reading data using the DataFrameReader, but there’s a bit more to it than simply replacing bad data with NULLs.

Here is a recap of what we have covered so far, specific to the DataFrameReader class.

  • Part 1 - Stop/Fail further data processing (FAILFAST)

  • Part 2 - Remove (DROPMALFORMED)

  • Part 3 - Replace (PERMISSIVE)

  • Part 4 - Redirect

The data setup will be similar to the setup used in parts 1 and 2.

[Screenshot: data setup]
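
For anyone without parts 1 and 2 to hand, below is a minimal sketch of the kind of setup used in this series. The file path, column names and values are illustrative assumptions rather than the exact data from the earlier parts: a small CSV of fruit counts in which one row holds text in a numeric column.

# Illustrative setup (assumed): a small CSV where "qty" should be an integer,
# but one row contains text that cannot be parsed as a number.
rows = [
    "fruit,qty",
    "apples,10",
    "oranges,5",
    "pears,lots",
]

with open("/tmp/fruit.csv", "w") as f:
    f.write("\n".join(rows) + "\n")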

Part 3 - PERMISSIVE

“PERMISSIVE” is the default mode for the DataFrameReader class. By default, it replaces malformed/bad data with NULLs.

This makes sense as a default mode for reading data: in most cases, we want to read data uninterrupted while flagging that bad data has been encountered. An example of this is shown in the next screenshot.

[Screenshot: view file data]
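
As a rough sketch of what the screenshot shows, the read below uses the default PERMISSIVE mode against the illustrative file created in the setup above; the schema and path are assumptions, not the exact code from the original post.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Schema for the illustrative CSV; "qty" is an integer, so the text value
# in the malformed row cannot be parsed.
schema = StructType([
    StructField("fruit", StringType(), True),
    StructField("qty", IntegerType(), True),
])

# PERMISSIVE is the default, so the mode option is spelled out only for clarity.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .schema(schema)
      .load("/tmp/fruit.csv"))

df.show()
# The malformed row comes back with qty = NULL instead of failing the read.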

More than just NULL

PERMISSIVE does more than merely set bad data values to NULL, however. If you venture into the Python API documentation, you’ll notice the section below and, in particular, the additional column that can be “configured by columnNameOfCorruptRecord”.

[Screenshot: more than just NULL]

Keep “all data” and carry on

Using PERMISSIVE mode, we can add a column to our schema that will contain the malformed data. This is really handy because not only can we substitute malformed data with NULL, but we can also keep the raw bad row data in the column specified by “columnNameOfCorruptRecord”.

To see an example of this, let’s take a look at the following screenshot. For those eager to run code, help yourself to the example code in the additional resources section at the end of this post.

[Screenshot: default PERMISSIVE mode]
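
The following is a sketch along those lines, continuing with the illustrative file and SparkSession from the earlier sketches. With Spark’s default settings the corrupt-record column is named "_corrupt_record" (governed by spark.sql.columnNameOfCorruptRecord), so adding a string column with that name to the schema captures the raw malformed rows.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Same illustrative schema as before, plus the default corrupt-record column.
schema_with_corrupt = StructType([
    StructField("fruit", StringType(), True),
    StructField("qty", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(schema_with_corrupt)
      .load("/tmp/fruit.csv"))

df.show(truncate=False)
# Good rows show NULL in _corrupt_record; the malformed row keeps its raw text there.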

Customisation

It gets even better: if the default column name isn’t to your liking, you can change it by specifying your preferred column name in the schema and passing the same name under the “columnNameOfCorruptRecord” option when calling the DataFrameReader. An example of this is as follows:

[Screenshot: PERMISSIVE mode customisation]
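
A sketch of that customisation, again against the illustrative file from earlier; the column name "bad_data" is simply an example choice, not anything mandated by Spark.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# The corrupt-record column is renamed to "bad_data" (an illustrative name):
# it must appear in the schema and be passed via columnNameOfCorruptRecord.
schema_custom = StructType([
    StructField("fruit", StringType(), True),
    StructField("qty", IntegerType(), True),
    StructField("bad_data", StringType(), True),
])

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("columnNameOfCorruptRecord", "bad_data")
      .schema(schema_custom)
      .load("/tmp/fruit.csv"))

df.show(truncate=False)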

Benefits

Track and trace

The immediate benefit of using PERMISSIVE is the substitution of bad data with NULL and the ability to view the raw data that was read.

The fact that this is possible, and is the default mode, minimises the impact of bad data on our data pipelines and adds rigour to our processing. In the apples and oranges example, we can filter good and bad data apart, as shown in the following example.
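
As a sketch of that filtering, using the illustrative custom column "bad_data" and the DataFrame from the customisation sketch above (note the cache, which the drawback section below explains):

from pyspark.sql.functions import col

# Persist the parsed result before filtering on the corrupt-record column;
# since Spark 2.3, queries on a raw CSV/JSON read that reference only that
# column are disallowed (see the drawback below).
df.cache()

good_rows = df.filter(col("bad_data").isNull())
bad_rows = df.filter(col("bad_data").isNotNull())

good_rows.show()
bad_rows.show(truncate=False)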

Drawback

Filter overhead

There’s also a quirk when filtering data using the column specified by “columnNameOfCorruptRecord”: the parsed results must be persisted (cached) before filtering, and failing to do so results in an exception when reading the file. To see this in action, remove the caching from the code, un-cache the data and re-run.
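
To illustrate the quirk, here is a sketch of the failure mode using the same illustrative file and schema as above. The count-only query references nothing but the corrupt-record column, a pattern Spark (2.3 onwards) rejects on an un-cached read of the raw file.

from pyspark.sql.functions import col

# Re-read the file without caching the parsed result.
df_uncached = (spark.read
               .format("csv")
               .option("header", "true")
               .option("columnNameOfCorruptRecord", "bad_data")
               .schema(schema_custom)
               .load("/tmp/fruit.csv"))

try:
    # Only the corrupt-record column is referenced, so Spark raises an
    # AnalysisException instead of returning a count.
    df_uncached.filter(col("bad_data").isNotNull()).count()
except Exception as err:
    print(type(err).__name__, err)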

A potential complication is still the mix of good and bad data. This is easily remedied by adding a filter predicate on the “columnNameOfCorruptRecord” column to obtain only the good or the bad data (as shown in the previous section), but it does not escape the fact that there will be a filtering overhead.

Summary

In this third part of the series, we’ve explored how we can use PERMISSIVE mode to have good and bad data co-exist so we can carry on with data processing. This carries obvious benefits, such as data retention and auditing, but it still leaves an overhead for downstream processes that query the mixed set of good and bad data.

In part 4, we’ll look at how we can redirect bad data for separate capture and analysis and retain the data we expect.

See you all in part 4.

Additional resources

  • pyspark.sql.readwriter — PySpark 3.1.1 documentation (apache.org)

  • pyspark.sql.DataFrameReader.json — PySpark documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.json.html?highlight=permissive