Re: Suggestion needed for UNION ALL performance in Apache drill

Paul Rogers Thu, 23 Apr 2020 11:30:09 -0700

Hi Sreeparna,

As suggested in the earlier e-mail, we would not expect to see different 
performance in UNION ALL than in a simple scan. Clearly you've found some kind 
of issue. The next step is to investigate that issue, which is a bit hard to do 
over e-mail.



Please file a JIRA ticket to describe the issue and provide a reproducible test 
case including query and data. If your data is sensitive, please create a dummy 
data set, or use the provided TPC-H data set to recreate the issue. We can then 
take a look to see what might be happening.

Thanks,
- Paul

 

    On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani 
<bhabani.sreepa...@gmail.com> wrote:  
 
 Hi Team,
In addition to the mail below, I have another finding. Please consider the 
scenarios below. The first 2 scenarios give the expected performance. But we 
are not getting the expected performance for the 3rd scenario, which is a 
UNION ALL of 2 different types of datasets.

Scenario 1 - Parquet UNION ALL Parquet
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 5 secs
UNION ALL of both queries execution time - 10 secs

Scenario 2 - DB query UNION ALL DB query
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 5 secs
UNION ALL of both queries execution time - 10 secs

Scenario 3 - Parquet UNION ALL DB query
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 1 sec
UNION ALL execution time - 20 secs
Ideally the execution time should not be more than 6 secs.
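
As a quick check of the arithmetic in the three scenarios, here is a minimal 
Python sketch that compares each observed UNION ALL time with the sum of the 
two individual query times. The names (scenarios, slowdown) are only 
illustrative, and the figures are the ones listed in this mail.

```python
# Times in seconds, taken from the three scenarios in this mail:
# (first query alone, second query alone, observed UNION ALL time).
scenarios = {
    "Parquet UNION ALL Parquet": (5, 5, 10),
    "DB query UNION ALL DB query": (5, 5, 10),
    "Parquet UNION ALL DB query": (5, 1, 20),
}


def slowdown(first, second, observed):
    """Ratio of the observed UNION ALL time to the sum of the individual times."""
    return observed / (first + second)


for name, (first, second, observed) in scenarios.items():
    expected = first + second
    ratio = slowdown(first, second, observed)
    print(f"{name}: expected ~{expected} s, observed {observed} s, ratio {ratio:.2f}")
```

A ratio close to 1.0 means the UNION ALL costs about as much as running both 
queries one after the other. Only the 3rd scenario is well above that.
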
May I request you to check whether the UNION ALL performance of 3rd scenario is 
expected with different dataset types.
Please suggest if there is any specific way to bring down the execution time of 
3rd scenario.
Thanks in advance.
Sreeparna Bhabani


On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <bhabani.sreepa...@gmail.com> 
wrote:

Hi Team,
Apart from the issue below, I have another question.
Is there any relation between the number of row groups and performance?
In the query below the number of files is 13 and numRowGroups is 69. Does the 
UNION ALL take more time when the number of row groups is that high?
Please note that the individual Parquet query takes 6 secs. But UNION ALL takes 
20 secs. Details are given in the trailing mail.
Thanks,
Sreeparna Bhabani

On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <dishari.5...@gmail.com> wrote:

Hi Paul,
Please find the details below. We are using 2 drillbits. Heap memory 16 G, Max 
direct memory 32 G. One query selects from Parquet. Another one selects from 
JDBC. The Parquet file size is 849 MB. It is UNION ALL. There is no sorting.

Single Parquet query -
Total execution time - 6.6 sec
Scan time - 0.152 sec
Screen wait time - 5.3 sec

Single JDBC query -
Total execution time - 0.261 sec
JDBC scan - 0.152 sec
Screen wait - 0.004 sec

UNION ALL query -
Execution time - 21.118 sec
Screen wait time - 5.351 sec
Parquet scan - 15.368 sec
Unordered receiver wait time - 14.41 sec

Thanks,
Sreeparna Bhabani

On Thu, 23 Apr 2020, 10:43 Paul Rogers, <par0...@yahoo.com> wrote:

Hi Sreeparna,

The short answer is it *should* work: a UNION ALL is simply an append. (Be sure 
you are not using a plain UNION as that needs to do more work to remove 
duplicates.)
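
To illustrate the difference, here is a minimal sketch. It uses Python's 
built-in sqlite3 module rather than Drill, and the table names and rows are 
made up for the example, so it shows only the SQL semantics and says nothing 
about Drill's execution or performance.

```python
import sqlite3

# In-memory SQLite database standing in for the two inputs.
# Table names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE parquet_side (id INTEGER, name TEXT);
    CREATE TABLE jdbc_side (id INTEGER, name TEXT);
    INSERT INTO parquet_side VALUES (1, 'a'), (2, 'b');
    INSERT INTO jdbc_side VALUES (2, 'b'), (3, 'c');
    """
)

# UNION ALL appends the second result to the first and keeps duplicates.
union_all_rows = conn.execute(
    "SELECT id, name FROM parquet_side "
    "UNION ALL SELECT id, name FROM jdbc_side"
).fetchall()

# Plain UNION must also remove duplicate rows, which is extra work.
union_rows = conn.execute(
    "SELECT id, name FROM parquet_side "
    "UNION SELECT id, name FROM jdbc_side"
).fetchall()

print(len(union_all_rows))  # 4: the row (2, 'b') appears twice
print(len(union_rows))      # 3: the duplicate (2, 'b') is removed
conn.close()
```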

Since you are seeing unexpected behavior, we may have some kind of issue to 
investigate and perhaps fix. Always hard to do over e-mail, but let's see what 
we can do.


The first question is to understand the full query: are you doing more than a 
simple scan of two files and a UNION ALL? Are there sorts or joins involved?

The best place to start to investigate performance issues is the query profile, 
which it looks like you are doing. What is the time for the scans if you run 
each of the two scans separately? You said that they take 8 and 1 seconds. Is 
that for the whole query or just the scan operators?

Then, when you run the UNION ALL, again looking at the scan operators, is there 
any difference in run times? If the scans take longer, that is one thing to 
investigate. If the scans take the same amount of time, what other operator(s) 
are taking the rest of the time? Your note suggests that it is the scan taking 
the time. But, there should be two scan operators: one for each file. How is 
the time divided between them?
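
The comparison described above can be sketched in a few lines of Python. The 
figures are the ones reported elsewhere in this thread for the standalone 
Parquet query and for the UNION ALL query. The dictionary keys and the compare 
function are hypothetical labels for this example, not field names from the 
Drill query profile.

```python
# Times in seconds, as reported in this thread.
standalone_parquet = {
    "total": 6.6,
    "parquet_scan": 0.152,
    "screen_wait": 5.3,
}
union_all = {
    "total": 21.118,
    "parquet_scan": 15.368,
    "screen_wait": 5.351,
    "unordered_receiver_wait": 14.41,
}


def compare(baseline, combined):
    """Return (operator, baseline, combined, delta) for each combined entry.

    The delta is None when the operator has no baseline figure.
    """
    rows = []
    for op, secs in combined.items():
        base = baseline.get(op)
        delta = None if base is None else round(secs - base, 3)
        rows.append((op, base, secs, delta))
    return rows


for op, base, secs, delta in compare(standalone_parquet, union_all):
    print(f"{op}: standalone={base} union_all={secs} delta={delta}")
```

With these figures the screen wait is nearly unchanged, while the reported 
Parquet scan time grows by about 15 seconds, which is close to the increase in 
total time.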


How large are the data files? Using what storage system? How many Drillbits? 
How much memory?


Thanks,
- Paul

 

    On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani 
<bhabani.sreepa...@gmail.com> wrote:  
 
 Hi Team,

I am reaching out to you about a specific problem regarding UNION ALL. There
is one UNION ALL statement which combines 2 queries. The individual queries
take 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache Drill version is 1.17.

Please suggest how to improve this UNION ALL performance. We are
using Parquet files.

Thanks,
Sreeparna Bhabani
