Hi All,

I have a requirement where I have to run 100 group-by queries, each over a different set of columns. I have generated a Parquet dataset with 30 columns; 200 files were produced and each file has a different size. My question is: what is the best approach for running group-by queries on Parquet files? Are more files recommended, or should I create fewer files to get better performance?
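
For illustration, this is the kind of compaction I would consider if fewer, larger files are the way to go (the path and the target file count are only placeholders, not what I actually use):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CompactParquet")
      .getOrCreate()

    // Read the 200 generated parquet files (path is a placeholder)
    val df = spark.read.parquet("/data/input_parquet")

    // Rewrite into fewer, larger files so each file is close to the
    // 128 MB HDFS block size; 40 is only an illustrative target count
    df.coalesce(40)
      .write
      .mode("overwrite")
      .parquet("/data/input_parquet_compacted")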

Right now, with 65 executors of 2 cores each on a 4-node cluster (320 cores available), Spark takes on average 1.4 minutes to finish one query. We want to bring that down to around 30 or 40 seconds per query. The HDFS block size is 128 MB, Spark launches 2400 tasks, and the input dataset has 2252 partitions.
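
For reference, the queries are simple group-by aggregations of the shape below. I am also unsure whether spark.sql.shuffle.partitions should be tuned for this; the value and column names here are just placeholders, not settings I have validated:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("GroupByQuery")
      .getOrCreate()

    val df = spark.read.parquet("/data/input_parquet")   // placeholder path

    // Group-by shuffles default to 200 partitions; 320 here simply matches
    // the number of available cores and is only a guess, not a tested value
    spark.conf.set("spark.sql.shuffle.partitions", "320")

    // Shape of one of the 100 queries (column names are placeholders)
    val result = df.groupBy("col1", "col2").agg(Map("col3" -> "sum"))
    result.write.mode("overwrite").parquet("/data/output/q_col1_col2")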

I have implemented threading in the Spark driver to launch all these queries at the same time with the fair scheduler enabled; however, I see that most of the time the jobs still run sequentially.
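
This is a simplified sketch of that setup (the column lists, pool name, paths, and thread-pool size are placeholders, not my exact code):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ConcurrentGroupBys")
      .config("spark.scheduler.mode", "FAIR")    // fair scheduling across jobs
      .getOrCreate()

    val df = spark.read.parquet("/data/input_parquet")   // placeholder path

    // Placeholder standing in for the 100 column combinations
    val groupings: Seq[Seq[String]] = Seq(Seq("col1", "col2"), Seq("col3", "col4"))

    // Thread pool in the driver; each thread submits one group-by job
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

    val jobs = groupings.map { cols =>
      Future {
        // Assign the job to a fair-scheduler pool from this thread
        spark.sparkContext.setLocalProperty("spark.scheduler.pool", "queries")
        df.groupBy(cols.head, cols.tail: _*)
          .count()
          .write
          .mode("overwrite")
          .parquet("/data/output/" + cols.mkString("_"))
      }
    }

    jobs.foreach(job => Await.result(job, Duration.Inf))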

Any input in this regard is appreciated.

Best Regards,
Anil Langote
+1-425-633-9747
