Hi All,

I have a requirement to run about 100 group-by queries, each grouping on different columns, against a dataset I have written out as Parquet. The dataset has 30 columns, and the write produced 200 files of varying sizes. My question: for group-by performance, is it better to have more (smaller) files, or should I create fewer (larger) files?
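One way to reason about file count is through task waves: each file split becomes a task, and with far more tasks than cores, per-task scheduling overhead adds up. A rough back-of-the-envelope sketch, using the figures from this setup (the 2-4 tasks-per-core target is a common rule of thumb, not something specific to this cluster):

```python
# Figures from the setup described: 65 executors x 2 cores, 2400 tasks/query.
executors = 65
cores_per_executor = 2
active_cores = executors * cores_per_executor   # 130 of the 320 cluster cores in use

waves = 2400 / active_cores                     # scheduling waves per query today
tasks_per_core = 3                              # rule of thumb: aim for 2-4 tasks per core
target_partitions = active_cores * tasks_per_core

print(active_cores, round(waves, 1), target_partitions)
```

So roughly 18 waves of tasks run per query today; compacting the data toward a few hundred block-sized (~128 MB) files, e.g. via `df.repartition(n).write.parquet(...)`, would cut that down while keeping all cores busy.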
Right now, with 65 executors of 2 cores each on a 4-node cluster (320 cores available), Spark takes on average 1.4 minutes to finish one query; we want to bring that down to around 30-40 seconds per query. The HDFS block size is 128 MB, Spark launches 2400 tasks per query, and the input dataset has 2252 partitions.

I have implemented threading in the Spark driver to launch all of these queries at the same time, with the fair scheduler enabled, but most of the time I see the jobs still running sequentially. Any input in this regard is appreciated.

Best Regards,
Anil Langote
+1-425-633-9747
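For context on the sequential behavior: jobs submitted from driver threads only interleave if `spark.scheduler.mode=FAIR` is set and each thread assigns itself a scheduler pool before triggering an action (`setLocalProperty` is per-thread, so calling it once in the main thread has no effect on workers). A minimal sketch of that driver pattern, with the actual Spark calls reduced to comments so the threading part is self-contained; the query strings, pool names, and `run_query` helper are illustrative, not the original code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the ~100 group-by queries from the post.
QUERIES = [f"SELECT col{i}, COUNT(*) FROM t GROUP BY col{i}" for i in range(100)]

def run_query(idx, sql):
    # In the real driver, each worker thread would do something like:
    #   spark.sparkContext.setLocalProperty("spark.scheduler.pool", f"pool{idx % 4}")
    #   return spark.sql(sql).collect()
    # The setLocalProperty call must happen inside this function (i.e. on the
    # thread that triggers the action), not in the main thread.
    return idx  # placeholder for the collected result

# Cap in-flight queries so they share cluster cores rather than queueing.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_query, range(len(QUERIES)), QUERIES))

print(len(results))
```

The pools themselves (weights, minShare) come from a `fairscheduler.xml` referenced by `spark.scheduler.allocation.file`; without that, all threads fall into the default pool and the jobs can still serialize.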