Re: Suggestion needed for UNION ALL performance in Apache drill

sreeparna bhabani Mon, 04 May 2020 09:39:37 -0700

Hi Team,

After further checking on this UNION ALL, I found that UNION ALL
(between Parquet and database) behaves as expected with a limited number of
rows and columns. But for a larger Parquet file and a higher number of
selected rows and columns, the UNION ALL takes much longer than the sum of
the execution times of the individual Parquet and DB queries.
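
To put a number on that gap, here is a minimal Python sketch. It assumes
UNION ALL should behave as a plain append, so the individual run times added
together give a rough upper bound. The three timings are the profile figures
quoted further down in this thread; the function name is only illustrative.

```python
# Timings in seconds, copied from the query profiles quoted later in this thread.
parquet_total = 6.6        # single Parquet query, total execution time
jdbc_total = 0.261         # single JDBC query, total execution time
union_all_total = 21.118   # Parquet UNION ALL JDBC, total execution time


def union_all_overhead(individual_times, observed):
    """Compare an observed UNION ALL time with the sum of its inputs.

    Assumption: UNION ALL is a plain append, so the sum of the individual
    query times is a rough upper bound on the combined time.
    """
    expected = sum(individual_times)
    extra = observed - expected
    ratio = observed / expected
    return expected, extra, ratio


expected, extra, ratio = union_all_overhead(
    [parquet_total, jdbc_total], union_all_total
)
print(f"expected at most about {expected:.3f} s, observed {union_all_total:.3f} s")
print(f"unexplained extra time: {extra:.3f} s ({ratio:.1f}x the expected time)")
```

The same function can be reused for the smaller data set to confirm that the
ratio stays close to 1 there.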


As per the analysis, the source of this issue appears to be the following:
although we are using distributed mode, the UNION ALL query is executed
on only 1 node in the case of Parquet UNION ALL DB. It is not distributed and
parallelized across multiple nodes.

Whereas, for an individual query or a UNION ALL between datasets of the same
type (Parquet + Parquet), it is distributed across 2 nodes.

Do you have any findings or ideas on this?
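
One way to compare how the two cases are planned is to ask Drill for the
query plan with EXPLAIN PLAN FOR. The Python sketch below only builds such a
request for Drill's REST endpoint (/query.json); it does not send it. The
host, port, column name and both table names are hypothetical placeholders,
not values from our setup.

```python
import json
import urllib.request

# Hypothetical values: adjust the host, port and both table names to your setup.
DRILL_URL = "http://localhost:8047/query.json"
PARQUET_TABLE = "dfs.`/data/sample_parquet`"
JDBC_TABLE = "mydb.`sample_table`"


def build_explain_request(parquet_table, jdbc_table, url=DRILL_URL):
    """Build (but do not send) a REST request asking Drill for the plan."""
    # col1 is a placeholder column assumed to exist in both tables.
    sql = (
        "EXPLAIN PLAN FOR "
        f"SELECT col1 FROM {parquet_table} "
        "UNION ALL "
        f"SELECT col1 FROM {jdbc_table}"
    )
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_explain_request(PARQUET_TABLE, JDBC_TABLE)
print(request.data.decode("utf-8"))

# To run it against a live Drillbit, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Running the same request for Parquet UNION ALL Parquet and comparing the two
plan texts (for example, whether exchange operators and more than one major
fragment appear) should show where the plans diverge.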

Thanks,
Sreeparna Bhabani

On Tue, Apr 28, 2020 at 9:00 PM sreeparna bhabani <
[email protected]> wrote:

> Hi Paul and Team,
>
> Please check the observation mentioned in the Jira below, where we found
> that the UNION ALL query is not parallelized across multiple nodes when there
> are 2 types of dataset (Parquet and database). But it is parallelized if we
> query an individual Parquet file.
>
> Is there any way to enforce parallel execution across multiple nodes?
>
> Thanks,
> Sreeparna Bhabani
>
>
> On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, <[email protected]>
> wrote:
>
>>
>> Hi Paul and Team,
>>
>> As you suggested, I have created a Jira ticket:
>> https://issues.apache.org/jira/browse/DRILL-7720.
>> I have included the details you asked for in the Jira. Please have a look. As
>> the data is sensitive, I am trying to create a dummy dataset. I will
>> provide it once it is ready.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
>> [email protected]> wrote:
>>
>>>
>>> ---------- Forwarded message ---------
>>> From: Paul Rogers <[email protected]>
>>> Date: Thu, 23 Apr 2020, 23:59
>>> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
>>> To: <[email protected]>, sreeparna bhabani <
>>> [email protected]>
>>> Cc: <[email protected]>, <[email protected]>
>>>
>>>
>>> Hi Sreeparna,
>>>
>>>
>>> As suggested in the earlier e-mail, we would not expect to see different
>>> performance in UNION ALL than in a simple scan. Clearly you've found some
>>> kind of issue. The next step is to investigate that issue, which is a bit
>>> hard to do over e-mail.
>>>
>>>
>>> Please file a JIRA ticket to describe the issue and provide a
>>> reproducible test case including query and data. If your data is sensitive,
>>> please create a dummy data set, or use the provided TPC-H data set to
>>> recreate the issue. We can then take a look to see what might be happening.
>>>
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>>
>>>
>>> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
>>> [email protected]> wrote:
>>>
>>>
>>> Hi Team,
>>>
>>> In addition to the mail below, I have another finding. Please consider
>>> the scenarios below. The first 2 scenarios give the expected results in terms
>>> of performance. But we are not getting the expected performance for the 3rd
>>> scenario, which is a UNION ALL of 2 different types of dataset.
>>>
>>> *Scenario 1 - Parquet UNION ALL Parquet*
>>> Individual execution time of 1st query - 5 secs
>>> Individual execution time of 2nd query - 5 secs
>>> UNION ALL of both queries execution time - 10 secs
>>>
>>> *Scenario 2 - DB query UNION ALL DB query*
>>> Individual execution time of 1st query - 5 secs
>>> Individual execution time of 2nd query - 5 secs
>>> UNION ALL of both queries execution time - 10 secs
>>>
>>> *Scenario 3 - Parquet UNION ALL DB query*
>>> Individual execution time of 1st query - 5 secs
>>> Individual execution time of 2nd query - 1 sec
>>> UNION ALL execution time - 20 secs
>>> Ideally the execution time should not be more than 6 secs.
>>>
>>> May I request you to check whether the UNION ALL performance of 3rd
>>> scenario is expected with different dataset types.
>>>
>>> Please suggest if there is any specific way to bring down the execution
>>> time of 3rd scenario.
>>>
>>> Thanks in advance.
>>>
>>> Sreeparna Bhabani
>>>
>>>
>>>
>>> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <
>>> [email protected]> wrote:
>>>
>>> Hi Team,
>>>
>>> Apart from the below issue I have another question.
>>>
>>> Is there any relation between the number of row groups and performance?
>>>
>>> In the below query the number of files is 13 and numRowGroups is 69. Does
>>> UNION ALL take more time if the number of row groups is that high?
>>>
>>> Please note that the individual Parquet query takes 6 secs. But UNION
>>> ALL takes 20 secs. Details are given in the trailing mail.
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <[email protected]>
>>> wrote:
>>>
>>> Hi Paul,
>>>
>>> Please find the details below. We are using 2 drillbits. Heap memory 16
>>> G, Max direct memory 32 G. One query selects from Parquet. Another one
>>> selects from JDBC. The Parquet file size is 849 MB. It is UNION ALL. There
>>> is no sorting.
>>>
>>> Single parquet query-
>>> Total execution time - 6.6 sec
>>> Scan time - 0.152 sec
>>> Screen wait time - 5.3 sec
>>>
>>> Single JDBC query-
>>> Total execution time - 0.261 sec
>>> JDBC scan - 0.152 sec
>>> Screen wait - 0.004 sec
>>>
>>>
>>> Union all query -
>>> Execution time - 21.118 sec
>>> Screen wait time - 5.351 sec
>>> Parquet scan - 15.368 sec
>>> Unordered receiver wait time - 14.41 sec
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>>
>>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <[email protected]> wrote:
>>>
>>> Hi Sreeparna,
>>>
>>>
>>> The short answer is it *should* work: a UNION ALL is simply an append.
>>> (Be sure you are not using a plain UNION as that needs to do more work to
>>> remove duplicates.)
>>>
>>>
>>> Since you are seeing unexpected behavior, we may have some kind of issue
>>> to investigate and perhaps fix. Always hard to do over e-mail, but let's
>>> see what we can do.
>>>
>>>
>>> The first question is to understand the full query: are you doing more
>>> than a simple scan of two files and a UNION ALL? Are there sorts or joins
>>> involved?
>>>
>>>
>>> The best place to start to investigate performance issues is the query
>>> profile, which it looks like you are doing. What is the time for the scans
>>> if you run each of the two scans separately? You said that they take 8 and
>>> 1 seconds. Is that for the whole query or just the scan operators?
>>>
>>>
>>> Then, when you run the UNION ALL, again looking at the scan operators,
>>> is there any difference in run times? If the scans take longer, that is one
>>> thing to investigate. If the scans take the same amount of time, what other
>>> operator(s) are taking the rest of the time? Your note suggests that it is
>>> the scan taking the time. But, there should be two scan operators: one for
>>> each file. How is the time divided between them?
>>>
>>>
>>> How large are the data files? Using what storage system? How many
>>> Drillbits? How much memory?
>>>
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>>
>>>
>>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
>>> [email protected]> wrote:
>>>
>>>
>>> Hi Team,
>>>
>>> I am reaching out to you about a specific problem regarding UNION ALL.
>>> There is one UNION ALL statement which combines 2 queries. The individual
>>> queries take 8 secs and 1 sec respectively. But the UNION ALL takes 30 secs.
>>> PARQUET_SCAN_ROW_GROUP takes the maximum time. The Apache Drill version is
>>> 1.17.
>>>
>>> Please suggest how to improve this UNION ALL performance. We are
>>> using Parquet files.
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>>
>>
>> --
>>
>> Thanks n Regards,
>> *Sreeparna Bhabani*
>>
>

-- 

Thanks n Regards,
*Sreeparna Bhabani*
