Hi Dan,

Here are some thoughts from my end.

This is just one query, and you have the numbers for it. But what about a
representative collection? Do you have the use cases? I know from
experience that if you can predict the pattern of about 60% of your
queries, you are in good shape. The rest can be ad hoc, and you can plan
for it separately.

For that 60%, it would be good to share some numbers along these lines:

1. The SQL query, the measured response time, the expected response time,
and the sizes of the tables involved in the query (one way to get row
counts is sketched after this list)
2. Do you have any data skew?
3. What EC2 configuration do you have (memory, CPU cores)?
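
For the row counts, a minimal sketch of one way to get them in Drill (the
table name is copied from the query later in this thread; adjust to your
schema):

  SELECT COUNT(*) AS row_cnt FROM dfs.root.sales_p;

Combined with the on-disk file sizes, that gives a reasonable picture of
the data volume.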

The approach, then, would be to tune for the entire set (which means you
will end up trading off the various parameters) and only then scale out.
(Scaling out is not cheap.)

Thanks,
Saurabh

On Thu, Jul 27, 2017 at 1:37 PM, Kunal Khatua <kkha...@mapr.com> wrote:

> You haven't specified what kind of query you are running.
>
> The Async Parquet Reader tuning should be more than sufficient for your
> use case, since you seem to be processing only 3 files.
>
> The feature introduces a small fixed pool of threads that are responsible
> for the actual fetching of bytes from the disk, without blocking the
> fragments that already have some data available to work on.
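>
> To see what the reader-related options are currently set to, one way
> (just a sketch; the LIKE pattern is my own filter) is to query the
> sys.options table:
>
>   SELECT * FROM sys.options WHERE name LIKE '%pagereader%';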
>
> The "store.parquet.reader.pagereader.buffersize" might be of interest.
> The default for this is 4MB and can be tuned to match the parquet page size
> (usually 1MB). This can reduce memory pressure and improve the pipeline
> behavior.
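>
> A minimal example of lowering it for a session (the 1MB value here is an
> assumption chosen to match a typical page size, not a recommendation):
>
>   ALTER SESSION SET `store.parquet.reader.pagereader.buffersize` = 1048576;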
>
> Apart from this, the primary factors affecting your query performance are
> the number of cores (which is what you seem to be tuning) and memory.
> By design, the parallelization level is a function of the number of cores.
> From the look of things, that is helping. You can try tuning it further
> with:
> planner.width.max_per_node (default is 70% of the number of cores)
>
> For memory,
> planner.memory.max_query_memory_per_node (default is 2GB)
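>
> As a sketch, both can be set per session before running the query (the
> values below are illustrative only; max_query_memory_per_node is in
> bytes, so 4294967296 is 4GB):
>
>   ALTER SESSION SET `planner.width.max_per_node` = 12;
>   ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296;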
>
>
> You'll find more about this here:
> https://drill.apache.org/docs/performance-tuning/
>
> ~ Kunal
>
> -----Original Message-----
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 1:06 PM
> To: user@drill.apache.org
> Subject: RE: Drill performance tuning parquet
>
> I did not partition the data when I created the Parquet files (CTAS
> without a PARTITION BY).
>
> Here is the file list.
>
> Thank you.
>
>
> [dholmes@ip-10-20-49-40 sales_p]$ ll
> total 1021372
> -rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
> -rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
> -rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet
>
> Dan Holmes | Revenue Analytics, Inc.
> Direct: 770.859.1255
> www.revenueanalytics.com
>
> -----Original Message-----
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 3:59 PM
> To: user@drill.apache.org
> Subject: Drill performance tuning parquet
>
> I am performance testing a single Drill instance with different vCPU
> configurations in AWS.  I have the Parquet files on an EFS volume and use
> the same data for each EC2 instance.
>
> I have used 4, 8, and 16 vCPUs.  Drill performance is ~25, 15, and 12
> seconds respectively.  I have not changed any of the options.  This is an
> out-of-the-box 1.11 installation.
>
> What Drill tuning options should I experiment with?  I have read
> https://drill.apache.org/docs/asynchronous-parquet-reader/ but it is so
> technical that I can't fully digest it; it reads as though the default
> options are already the best ones.
>
> The query looks like this:
> SELECT store_key, SUM(sales_dollars) sd
> FROM dfs.root.sales_p
> GROUP BY store_key
> ORDER BY sd DESC
> LIMIT 10
>
> Dan Holmes | Architect | Revenue Analytics, Inc.
>
>
