Hi All, I have the following code, which produces a single 600 MB Parquet file as expected; however, within this Parquet file there are 42 row groups! I would expect it to create at most 6 row groups. Could someone please shed some light on this? Is there any config setting I can enable when submitting the application with spark-submit?
df = spark.read.parquet(INPUT_PATH)
df.coalesce(1).write.parquet(OUT_PATH)

I did try --conf spark.parquet.block.size and spark.dfs.blocksize, but neither made any difference.

--
Regards,
Rishi Shah
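[Editor's note] A hedged sketch of one thing worth trying: in parquet-mr the row-group size is governed by the Hadoop property parquet.block.size (in bytes), not a spark.parquet.* conf, and Spark forwards any spark.hadoop.* conf to the underlying Hadoop Configuration. So one option is submitting with --conf spark.hadoop.parquet.block.size=134217728, or setting it programmatically before the write. This assumes a standard SparkSession named spark and is not a guaranteed fix; Parquet writers can also flush a row group early under memory pressure, which would still yield more, smaller row groups.

```python
# Sketch, assuming parquet.block.size is the parquet-mr knob for row-group size.
# Equivalent submit-time form (assumption, not verified on this job):
#   spark-submit --conf spark.hadoop.parquet.block.size=134217728 app.py

# Set the target row-group size (128 MB here) on the Hadoop configuration
# that Spark's Parquet writer reads, then rewrite the data.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.block.size", str(128 * 1024 * 1024))

df = spark.read.parquet(INPUT_PATH)
df.coalesce(1).write.parquet(OUT_PATH)  # fewer, larger row groups expected
```

Afterwards, inspecting the file with parquet-tools (or similar) should show whether the row-group count dropped toward the expected ~5-6 for a 600 MB file.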