Hi all,

I'm doing some testing on query performance, especially in a clustered
environment.

The test data is 5 Parquet files with 2.2 million records in each file
(total of ~11m).

The cluster is an Amazon EMR cluster with a total of 10 drillbits
(c3.xlarge instances).

A single query with one SUM() and a GROUP BY takes ~700 ms.

We set up about 30 agents, each running a query every second (30 queries
per second in total), and query times rise to about 6-7 seconds.

The bottleneck seems to be entirely CPU based - all drillbits' CPUs are
fairly swamped.
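
For a rough sense of scale, here is a back-of-the-envelope sketch in Python.
It applies Little's law (queries in flight = arrival rate x latency) to the
numbers above, and works out how much CPU each query could use before the
cluster saturates. The figure of 4 vCPUs per c3.xlarge and the 6.5 s latency
midpoint are assumptions for illustration, not measurements, and the helper
names are made up for this example.

```python
def in_flight(arrival_rate_qps: float, latency_s: float) -> float:
    """Little's law: mean number of queries in the system at once."""
    return arrival_rate_qps * latency_s


def cpu_budget_per_query(total_cores: int, arrival_rate_qps: float) -> float:
    """Core-seconds one query may consume before all cores are saturated."""
    return total_cores / arrival_rate_qps


DRILLBITS = 10
VCPUS_PER_NODE = 4      # assumption: c3.xlarge exposes 4 vCPUs
RATE_QPS = 30.0         # 30 agents, one query per second each
LATENCY_S = 6.5         # assumption: midpoint of the observed 6-7 s

total_cores = DRILLBITS * VCPUS_PER_NODE

print("queries in flight:", in_flight(RATE_QPS, LATENCY_S))
print("core-seconds available per query:",
      round(cpu_budget_per_query(total_cores, RATE_QPS), 2))
```

If a single query needs more core-seconds than that budget, requests arrive
faster than the cluster can finish them and queueing for CPU would be
expected to dominate the observed latency.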

Looking at the plans, the Parquet scan still performs fairly well, but the
hash aggregate gets progressively slower (presumably because it is
competing for CPU time).

Are these the expected query times for such a setup? Is there anything
further I can investigate to improve performance?
