Re: DataFrame.save with SaveMode.Overwrite produces 3x higher data size

2015-06-10 Thread bkapukaranov
Additionally, if I delete the Parquet output and recreate it with the same generic save function, 1000 partitions, and Overwrite mode, the size is correct again.

DataFrame.save with SaveMode.Overwrite produces 3x higher data size

2015-06-10 Thread bkapukaranov
Hi, kudos on Spark 1.3.x, it's a great release - loving data frames! One thing I noticed after upgrading is that if I use the generic save DataFrame function with Overwrite mode and a "parquet" source, it produces a much larger output Parquet file. Source JSON data: ~500GB. Originally saved Parquet:
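For readers unfamiliar with the call being described, a minimal sketch of the Spark 1.3-era generic save path might look like the following. The input/output paths and the partition count are illustrative assumptions, not values from the original post:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object SaveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-example"))
    val sqlContext = new SQLContext(sc)

    // Load the source JSON (~500GB in the report; path is hypothetical).
    val df = sqlContext.jsonFile("/data/source.json")

    // Generic save with an explicit "parquet" source and Overwrite mode,
    // as described in the post; 1000 partitions matches the follow-up message.
    df.repartition(1000)
      .save("/data/output.parquet", "parquet", SaveMode.Overwrite)
  }
}
```

Note that `DataFrame.save` and `SQLContext.jsonFile` were later deprecated in favor of the `df.write.mode(...).parquet(...)` and `spark.read.json(...)` APIs.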