Channel: Wrong data types when reading data with spark - Stack Overflow

Wrong data types when reading data with spark


If I specify a schema when reading a file with PySpark, the read does not fail when the schema is incorrect. For example:

import json
from pyspark.sql.types import StructType

schema = '''{"fields":[{"metadata":{},"name":"id0","nullable":true,"type":"integer"},{"metadata":{},"name":"id1","nullable":true,"type":"integer"},{"metadata":{},"name":"name","nullable":true,"type":"integer"},{"metadata":{},"name":"timestamp","nullable":true,"type":"string"}],"type":"struct"}'''
schemaFromJson = StructType.fromJson(json.loads(schema))
df2 = spark.read.format("csv") \
    .option("header", True) \
    .schema(schemaFromJson) \
    .load(s3InputPath)

yields

+----+----+----+---------+
| id0| id1|name|timestamp|
+----+----+----+---------+
|null|null|null|     null|
|null|null|null|     null|
|null|null|null|     null|
|null|null|null|     null|
+----+----+----+---------+

because the "name" column actually holds strings, but the schema declares it as integer, so every row fails to parse and comes back as nulls. Is there any way to make the read fail in such cases instead?

Data sample:

id0,id1,name,timestamp
10,1,Name1,01/03/2021 13:00
10,2,Name2,01/03/2021 13:00
10,3,Name3,01/03/2021 13:00
10,4,Name4,01/03/2021 13:00

Thanks

