I did not partition the data when I created the Parquet files (CTAS without a PARTITION BY).
Here is the file list. Thank you.

[dholmes@ip-10-20-49-40 sales_p]$ ll
total 1021372
-rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
-rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
-rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com

-----Original Message-----
From: Dan Holmes [mailto:[email protected]]
Sent: Thursday, July 27, 2017 3:59 PM
To: [email protected]
Subject: Drill performance tuning parquet

I am performance testing a single Drill instance with different vCPU configurations in AWS. I have Parquet files on an EFS volume and use the same data for each EC2 instance. I have used 4, 8 and 16 vCPUs. The query takes ~25, 15 and 12 seconds, respectively. I have not changed any of the options. This is an out-of-the-box 1.11 installation.

What Drill tuning options should I experiment with? I have read https://drill.apache.org/docs/asynchronous-parquet-reader/ but it is too technical for me to follow, although it reads as though the default options are the best ones.

The query looks like this:

SELECT store_key, SUM(sales_dollars) sd
FROM dfs.root.sales_p
GROUP BY store_key
ORDER BY sd DESC
LIMIT 10

Dan Holmes | Architect | Revenue Analytics, Inc.
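
To put the reported timings in perspective, here is a minimal Python sketch that works out the speedup and parallel efficiency relative to the 4 vCPU run. The only inputs are the approximate figures quoted in the message (~25, 15 and 12 seconds); the function name `scaling` and the choice of the 4 vCPU run as the baseline are assumptions made for illustration, not anything Drill itself reports.

```python
# Approximate timings quoted in the message: vCPU count -> query time in seconds.
timings = {4: 25.0, 8: 15.0, 16: 12.0}


def scaling(timings):
    """Return (vcpus, seconds, speedup, efficiency) rows.

    Speedup is measured against the smallest vCPU count (assumed baseline).
    Efficiency is speedup divided by the factor by which vCPUs increased.
    """
    base_cpus = min(timings)
    base_secs = timings[base_cpus]
    rows = []
    for cpus in sorted(timings):
        secs = timings[cpus]
        speedup = base_secs / secs
        efficiency = speedup / (cpus / base_cpus)
        rows.append((cpus, secs, speedup, efficiency))
    return rows


rows = scaling(timings)
for cpus, secs, speedup, efficiency in rows:
    print(f"{cpus:>2} vCPUs: {secs:.0f} s, "
          f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On these numbers, doubling to 8 vCPUs gives roughly a 1.67x speedup, while quadrupling to 16 gives only about 2.08x, so the extra cores are used far less efficiently at the larger size. The arithmetic alone does not say where the limit is; it only shows that the query stops scaling with vCPU count.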
