Hi,

I would like to repartition / coalesce my data so that it is saved into one
Parquet file per output partition. I would also like to use the Spark SQL
partitionBy API. So I could do it like this:

df.coalesce(1)
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")

I've tested this and it doesn't seem to perform well, because coalesce(1)
collapses the dataset to a single partition, so all the partitioning,
compression and writing of files has to be done by one CPU core.

I could rewrite this to do the partitioning manually (for example by
filtering on the distinct partition values) and only then call coalesce on
each filtered subset.
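Roughly what I have in mind is something like the sketch below (it reuses the
same df and location as above, collects the distinct partition value
combinations to the driver, and assumes there aren't too many of them; the
per-partition writes also run one after another unless I parallelize them
myself):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

val partitionCols = Seq("entity", "year", "month", "day", "status")

// Distinct combinations of partition values (assumed to be a small number).
val partitions = df.select(partitionCols.map(col): _*).distinct().collect()

partitions.foreach { row =>
  // Build a predicate matching exactly one partition's rows.
  val predicate = partitionCols
    .map(c => col(c) === row.getAs[Any](c))
    .reduce(_ && _)

  // Coalesce only this subset, so each partition ends up as a single file
  // while the other partitions are handled by separate write jobs.
  df.filter(predicate)
    .coalesce(1)
    .write
    .partitionBy(partitionCols: _*)
    .mode(SaveMode.Append)
    .parquet(s"$location")
}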

But is there a better way to do this using the standard Spark SQL API?

Best regards,

Patrick
