
Answer by Kafels for Wrong data types when reading data with spark

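When reading the CSV, you can control how Spark handles malformed records with the mode option, which accepts PERMISSIVE (the default), FAILFAST, or DROPMALFORMED: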

Scenario 1: Keep malformed records in a separate column

import json

from pyspark.sql.types import StructType

# "name" is declared as integer, so rows with a non-numeric name are malformed.
schema = '''{
  "fields": [
    {"metadata": {}, "name": "id0", "nullable": true, "type": "integer"},
    {"metadata": {}, "name": "id1", "nullable": true, "type": "integer"},
    {"metadata": {}, "name": "name", "nullable": true, "type": "integer"},
    {"metadata": {}, "name": "timestamp", "nullable": true, "type": "string"},
    {"metadata": {}, "name": "_corrupt_record", "nullable": true, "type": "string"}
  ],
  "type": "struct"
}'''
schemaFromJson = StructType.fromJson(json.loads(schema))

# spark is an existing SparkSession; s3InputPath is the path to the input CSV.
df2 = spark.read.format("csv") \
    .option("header", True) \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(schemaFromJson) \
    .load(s3InputPath, mode='PERMISSIVE')
df2.show()
Output:
+---+---+--------+----------------+--------------------+
|id0|id1|    name|       timestamp|     _corrupt_record|
+---+---+--------+----------------+--------------------+
| 10|  1|    null|01/03/2021 13:00|10,1,Name1,01/03/...|
| 10|  2|    null|01/03/2021 13:00|10,2,Name2,01/03/...|
| 10|  3|    null|01/03/2021 13:00|10,3,Name3,01/03/...|
| 10|  4|    null|01/03/2021 13:00|10,4,Name4,01/03/...|
| 10|  5|40028922|01/03/2021 13:00|                null|
+---+---+--------+----------------+--------------------+
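The schema JSON in Scenario 1 is tedious to write by hand. As a convenience, a small helper could build the same structure from (name, type) pairs. This is only a sketch: struct_schema_json is a hypothetical name, not part of PySpark, and it assumes every column is nullable with empty metadata, as in the schema above.

```python
import json


def struct_schema_json(columns):
    """Build a struct-schema JSON string from (name, type) pairs.

    Hypothetical helper: every field is nullable and carries empty
    metadata, mirroring the hand-written schema in Scenario 1.
    """
    fields = [
        {"metadata": {}, "name": name, "nullable": True, "type": dtype}
        for name, dtype in columns
    ]
    return json.dumps({"fields": fields, "type": "struct"})


schema = struct_schema_json([
    ("id0", "integer"),
    ("id1", "integer"),
    ("name", "integer"),
    ("timestamp", "string"),
    ("_corrupt_record", "string"),
])
```

The resulting string can be passed through StructType.fromJson(json.loads(schema)) exactly as in Scenario 1.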

Scenario 2: Raise an exception

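# Reuses schemaFromJson and s3InputPath from Scenario 1.
df2 = spark.read.format("csv") \
    .option("header", True) \
    .schema(schemaFromJson) \
    .load(s3InputPath, mode='FAILFAST')
df2.show()  # the read is lazy, so the error surfaces once an action runs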
Output:
[...]
Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: java.lang.NumberFormatException: For input string: "Name1"
[...]

Scenario 3: Handle only valid data

df2 = spark.read.format("csv") \
    .option("header", True) \
    .schema(schemaFromJson) \
    .load(s3InputPath, mode='DROPMALFORMED')
df2.show()
Output:
+---+---+--------+----------------+
|id0|id1|    name|       timestamp|
+---+---+--------+----------------+
| 10|  5|40028922|01/03/2021 13:00|
+---+---+--------+----------------+
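To compare the three modes side by side without a cluster, here is a rough pure-Python imitation of their behaviour. It is not how Spark implements them: read_rows, CASTS and RAW are made-up names, and the sample rows are only modeled on the output shown above.

```python
import csv

# Sample data modeled on the outputs above (an assumption, not the real file).
RAW = """id0,id1,name,timestamp
10,1,Name1,01/03/2021 13:00
10,5,40028922,01/03/2021 13:00
"""

# Declared column types; "name" is an integer on purpose, as in the schema.
CASTS = {"id0": int, "id1": int, "name": int, "timestamp": str}


def read_rows(text, mode="PERMISSIVE"):
    """Imitate PERMISSIVE, FAILFAST and DROPMALFORMED on a CSV string."""
    lines = text.strip().splitlines()
    header = next(csv.reader([lines[0]]))
    rows = []
    for line in lines[1:]:
        values = next(csv.reader([line]))
        row, malformed = {}, False
        for col, raw in zip(header, values):
            try:
                row[col] = CASTS[col](raw)
            except ValueError:
                # A value that cannot be cast becomes null.
                row[col] = None
                malformed = True
        if malformed and mode == "FAILFAST":
            raise ValueError(f"malformed record: {line}")
        if malformed and mode == "DROPMALFORMED":
            continue  # silently skip the bad row
        if mode == "PERMISSIVE":
            # Keep the raw line next to the partially parsed row.
            row["_corrupt_record"] = line if malformed else None
        rows.append(row)
    return rows


for row in read_rows(RAW, "PERMISSIVE"):
    print(row)
```

In this sketch PERMISSIVE keeps both rows and stores the raw text of the bad one, DROPMALFORMED returns only the row whose name parses as an integer, and FAILFAST raises on the first bad row.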
