Hi, I am running a benchmark to compare the performance of SparkSQL, Apache Drill and Presto. My experimental setup:
- TPC-DS dataset with scale factor 100 (size 100GB).
- Spark, Drill and Presto have the same number of workers: 12.
- Each worker has the same amount of allocated memory: 4GB.
- Data is stored in Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"

The issue is that for some small tables (even tables with only a few dozen records), SparkSQL still took about 7-8 seconds to finish, while Drill and Presto needed less than 1 second. For the other, larger tables with billions of records, SparkSQL's performance was reasonable: it took 20-30 seconds to scan the whole table.

Do you have any idea or reasonable explanation for this issue?

Thanks,