Dataset schema incompatibility bug when reading column partitioned data

Dávid Szakállas Fri, 29 Mar 2019 06:15:44 -0700

We observed the following bug on Spark 2.4.0:

scala> spark.createDataset(Seq((1,2))).write.partitionBy("_1").parquet("foo.parquet")


scala> import org.apache.spark.sql.types._

scala> val schema = StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))

scala> spark.read.schema(schema).parquet("foo.parquet").as[(Int, Int)].show
+---+---+
| _2| _1|
+---+---+
|  2|  1|
+---+---+

That is, when reading column-partitioned Parquet files, the explicitly specified
schema is not adhered to; instead, the partitioning columns are appended to the
end of the column list. This is quite a severe issue, as some operations, such
as union, fail if the columns are in a different order in the two datasets. Thus
we have to work around the issue with a select:

val columnNames = schema.fields.map(_.name)
ds.select(columnNames.head, columnNames.tail: _*)
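For reference, the same workaround can be wrapped in a small helper. This is
only a sketch: reorderToSchema and the ReorderToSchema object are hypothetical
names, not part of Spark, and the DataFrame built in main merely simulates the
swapped column order shown above rather than reading the partitioned files. It
also assumes plain top-level column names without dots.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object ReorderToSchema {
  // Hypothetical helper: select the columns in the order the schema declares them.
  def reorderToSchema(df: DataFrame, schema: StructType): DataFrame =
    df.select(schema.fieldNames.map(col): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reorder-to-schema")
      .getOrCreate()
    import spark.implicits._

    val schema = StructType(Seq(
      StructField("_1", IntegerType),
      StructField("_2", IntegerType)))

    // Stand-in for the read result: partition column "_1" ends up last.
    val swapped = Seq((2, 1)).toDF("_2", "_1")

    // Restore the declared order before converting to a typed Dataset.
    val ds = reorderToSchema(swapped, schema).as[(Int, Int)]
    ds.printSchema()

    spark.stop()
  }
}
```

Applying the helper to both sides before a union keeps the positional column
matching consistent.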


Thanks, 
David Szakallas
Data Engineer | Whitepages, Inc.
