I did not partition the data when I created the parquet files (CTAS without a 
PARTITION BY).

Here is the file list.

Thank you.


[dholmes@ip-10-20-49-40 sales_p]$ ll
total 1021372
-rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
-rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
-rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-----Original Message-----
From: Dan Holmes [mailto:[email protected]] 
Sent: Thursday, July 27, 2017 3:59 PM
To: [email protected]
Subject: Drill performance tuning parquet 

I am performance testing a single Drill instance with different vCPU 
configurations in AWS.  I have parquet files on an EFS volume and use the 
same data for each EC2 instance.

I have used 4, 8 and 16 vCPUs.  Query time is ~25, 15 and 12 seconds 
respectively.  I have not changed any of the options.  This is an 
out-of-the-box 1.11 installation.
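
For reference, here is a rough sketch of the scaling those timings imply, 
taking the 4 vCPU run as the baseline.  The function name and the efficiency 
definition (speedup divided by the increase in vCPUs) are illustrative 
assumptions, and the timings are the approximate figures quoted above.

```python
# Approximate wall-clock timings quoted above, keyed by vCPU count (seconds).
timings = {4: 25.0, 8: 15.0, 16: 12.0}


def scaling(timings):
    """Return (vcpus, speedup, efficiency) rows relative to the smallest instance."""
    base_cpus = min(timings)
    base_time = timings[base_cpus]
    rows = []
    for cpus in sorted(timings):
        speedup = base_time / timings[cpus]
        # Efficiency: achieved speedup divided by the factor of added vCPUs.
        efficiency = speedup / (cpus / base_cpus)
        rows.append((cpus, speedup, efficiency))
    return rows


for cpus, speedup, efficiency in scaling(timings):
    print(f"{cpus:>2} vCPUs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

By this measure, 8 vCPUs gives about a 1.67x speedup (roughly 83% efficiency) 
and 16 vCPUs about 2.08x (roughly 52%), so the gain from each doubling shrinks.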

What Drill tuning options should I experiment with?  I have read 
https://drill.apache.org/docs/asynchronous-parquet-reader/ but it is too 
technical for me to follow fully.  It reads as though the default options are 
the best ones.

The query looks like this:
SELECT store_key, SUM(sales_dollars) sd
FROM dfs.root.sales_p
GROUP BY store_key
ORDER BY sd DESC
LIMIT 10

Dan Holmes | Architect | Revenue Analytics, Inc.
