Hi Team,

We have scheduled jobs that read new records from a MySQL database every hour and append them to a Parquet dataset. For each append operation, Spark creates 10 new partitions (part files) in the Parquet output.
Some of these partitions are quite small (20-40 KB), so the dataset accumulates a large number of small partitions, which hurts overall read performance. Is there a way to configure Spark to merge the smaller partitions into larger ones and avoid having too many of them? Alternatively, can we set a minimum partition size in Parquet, say 64 MB? Coalesce/repartition with a fixed partition count will not work for us, because activity on the database varies widely between peak and non-peak hours.

Regards,
Sonal
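
To make the 64 MB target concrete, here is a minimal sketch of the kind of size-aware file count we have in mind, as opposed to a fixed number passed to coalesce/repartition. The helper `target_file_count` and the constant `TARGET_FILE_BYTES` are hypothetical names used only for illustration; they are not Spark or Parquet settings, and the batch size in bytes is assumed to be estimated separately.

```python
import math

# Assumed target size per output file: the 64 MB figure from the question.
TARGET_FILE_BYTES = 64 * 1024 * 1024


def target_file_count(batch_bytes: int, target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return how many files an hourly batch of `batch_bytes` should be written as.

    Hypothetical helper: it rounds up so that no file exceeds the target size,
    and always returns at least 1 so that small off-peak batches become one file.
    """
    if batch_bytes < 0:
        raise ValueError("batch_bytes must be non-negative")
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    return max(1, math.ceil(batch_bytes / target_bytes))


# A 30 KB off-peak batch and a 200 MB peak batch get different file counts.
off_peak_files = target_file_count(30 * 1024)
peak_files = target_file_count(200 * 1024 * 1024)
```

In a PySpark job the result could, under these assumptions, be passed to `df.coalesce(n)` before `df.write.mode("append").parquet(path)`, so the count follows the hourly volume instead of being fixed. This only limits the files written per append; it does not merge files that already exist.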
