Hi Team,

We have scheduled jobs that read new records from a MySQL database every
hour and write (append) them to Parquet. For each append operation, Spark
creates 10 new partitions in the Parquet output.

Some of these partitions are very small (20-40 KB), so we end up with a
large number of small partitions, which hurts overall read performance.

Is there any way to configure Spark to merge smaller partitions into
bigger ones, so that we avoid having too many partitions? Or is there a
Parquet setting that enforces a minimum partition size, say 64 MB?

Coalesce/repartition with a fixed partition count will not work for us,
because activity on the database varies widely between peak and non-peak
hours.
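
To make the request concrete, here is a rough sketch of the size-driven
behaviour we are after. It is only an illustration: the helper names
(dir_size_bytes, target_file_count) and the 64 MB constant are our own
assumptions, not Spark or Parquet settings, and the size scan assumes the
data sits on a local filesystem.

```python
import math
import os

# Assumed target size per output file (the 64 MB figure from the question).
TARGET_FILE_BYTES = 64 * 1024 * 1024


def dir_size_bytes(path):
    """Sum the sizes of all .parquet files under `path` (local filesystem only)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.endswith(".parquet"):
                total += os.path.getsize(os.path.join(root, name))
    return total


def target_file_count(total_bytes, target_bytes=TARGET_FILE_BYTES):
    """Number of output files so each is roughly `target_bytes`, never fewer than 1."""
    return max(1, math.ceil(total_bytes / target_bytes))


# Hypothetical use in the hourly job (needs pyspark, not executed here):
#   n = target_file_count(estimated_bytes_for_this_hour)
#   df.coalesce(n).write.mode("append").parquet(output_path)
```

The idea is that the partition count would be derived from the data
volume of each run instead of being hard-coded, so a quiet hour would
produce one file and a busy hour several.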

Regards,
Sonal
