You can pass mode as PERMISSIVE, FAILFAST, or DROPMALFORMED. PERMISSIVE (the default) sets malformed fields to null and can store the raw record in a dedicated column, FAILFAST raises an exception on the first malformed record, and DROPMALFORMED silently drops malformed rows.
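Note that the mode can be given either as a reader option or as a keyword argument to load(); a minimal sketch showing both spellings (the file path here is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both reads behave identically: "mode" goes through .option() or
# as a keyword to load(). "input.csv" is a placeholder path.
df_a = spark.read.format("csv").option("mode", "DROPMALFORMED").load("input.csv")
df_b = spark.read.format("csv").load("input.csv", mode="DROPMALFORMED")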
Scenario 1: Store the raw, unparsed record in an extra column:
import json
from pyspark.sql.types import StructType

# "name" is deliberately declared as integer: string values such as "Name1"
# cannot be parsed, so those rows are treated as malformed.
schema = '''{"fields":[
    {"metadata":{},"name":"id0","nullable":true,"type":"integer"},
    {"metadata":{},"name":"id1","nullable":true,"type":"integer"},
    {"metadata":{},"name":"name","nullable":true,"type":"integer"},
    {"metadata":{},"name":"timestamp","nullable":true,"type":"string"},
    {"metadata":{},"name":"_corrupt_record","nullable":true,"type":"string"}],
    "type":"struct"}'''

schemaFromJson = StructType.fromJson(json.loads(schema))

df2 = spark.read.format("csv") \
    .option("header", True) \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(schemaFromJson) \
    .load(s3InputPath, mode='PERMISSIVE')

df2.show()
Output:

+---+---+--------+----------------+--------------------+
|id0|id1|    name|       timestamp|     _corrupt_record|
+---+---+--------+----------------+--------------------+
| 10|  1|    null|01/03/2021 13:00|10,1,Name1,01/03/...|
| 10|  2|    null|01/03/2021 13:00|10,2,Name2,01/03/...|
| 10|  3|    null|01/03/2021 13:00|10,3,Name3,01/03/...|
| 10|  4|    null|01/03/2021 13:00|10,4,Name4,01/03/...|
| 10|  5|40028922|01/03/2021 13:00|                null|
+---+---+--------+----------------+--------------------+
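From here you may want to separate the rows that failed parsing from the clean ones. A sketch (the variable names are mine, and the cache() call is a workaround I'm assuming you'll need: several Spark versions raise an AnalysisException when a query references only the internal corrupt-record column of an un-cached read):

from pyspark.sql.functions import col

# Cache first: without it, filtering on only the corrupt-record column
# is disallowed in some Spark versions.
df2.cache()

bad_rows = df2.filter(col("_corrupt_record").isNotNull())   # rows that failed parsing
good_rows = df2.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")

bad_rows.show(truncate=False)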
Scenario 2: Raise an exception
df2 = spark.read.format("csv") \
    .option("header", True) \
    .schema(schemaFromJson) \
    .load(s3InputPath, mode='FAILFAST')

df2.show()  # Spark is lazy: the failure surfaces when an action runs
Output:

[...]
Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: java.lang.NumberFormatException: For input string: "Name1"
[...]
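If you'd rather trap the failure in the driver than let the job die, you can wrap the action; a sketch (catching the broad Exception type is deliberate, since the Python-side wrapper class differs across PySpark versions):

try:
    df_strict = spark.read.format("csv") \
        .option("header", True) \
        .schema(schemaFromJson) \
        .load(s3InputPath, mode='FAILFAST')
    df_strict.show()  # the action triggers parsing, and with it the failure
except Exception as e:  # Py4JJavaError on older PySpark, SparkException-derived on newer
    print("Malformed input detected:", e)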
Scenario 3: Silently drop malformed records and keep only valid data
df2 = spark.read.format("csv") \
    .option("header", True) \
    .schema(schemaFromJson) \
    .load(s3InputPath, mode='DROPMALFORMED')

df2.show()
Output:

+---+---+--------+----------------+
|id0|id1|    name|       timestamp|
+---+---+--------+----------------+
| 10|  5|40028922|01/03/2021 13:00|
+---+---+--------+----------------+
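If you also need to know how many rows DROPMALFORMED threw away, one approach (my sketch, not part of the original answer) is to compare against a PERMISSIVE read of the same file:

# Total row count under PERMISSIVE minus the surviving count gives the
# number of malformed records that DROPMALFORMED discarded.
total = spark.read.format("csv") \
    .option("header", True) \
    .schema(schemaFromJson) \
    .load(s3InputPath, mode='PERMISSIVE') \
    .count()

kept = df2.count()
print(f"dropped {total - kept} malformed row(s) out of {total}")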