Hi,
Kudos on Spark 1.3.x, it's a great release - loving data frames!
One thing I noticed after upgrading is that if I use the generic save
DataFrame function with Overwrite mode and a "parquet" source, it produces
a much larger output parquet file.
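For reference, the call looks roughly like this (a minimal sketch against
the Spark 1.3 DataFrame API; sc is an existing SparkContext and the paths
are placeholders, not my real ones):

  import org.apache.spark.sql.{SQLContext, SaveMode}

  val sqlContext = new SQLContext(sc)
  // Load the source json (placeholder path):
  val df = sqlContext.jsonFile("/data/source.json")
  // Generic save with an explicit "parquet" source and Overwrite mode:
  df.save("/data/output.parquet", "parquet", SaveMode.Overwrite)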
Source json data: ~500GB
Originally saved parquet:

Additionally, if I delete the parquet and recreate it using the same
generic save function with 1000 partitions and Overwrite mode, the size is
again correct.
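That recreate step is roughly the following (again a sketch with
placeholder paths; repartition(1000) is what I mean by "1000 partitions"):

  // Rewrite the same data as 1000 partitions via the same generic save:
  df.repartition(1000).save("/data/output.parquet", "parquet", SaveMode.Overwrite)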