Hello,

Thanks a lot for reviewing HIVE-28489 in detail. So, finally it looks like
all the three patches will be merged to Hive!

While analyzing query plans generated by Hive and Trino, we identified a
few more categories of queries on which Trino was much faster than Hive,
such as:

1. The query plan is identical, but Trino is much faster. This is due to
the architectural difference between Trino and Hive (on shuffle-intensive
queries): Trino is based on MPP and thus uses the push model, while Hive
uses the pull model. There is not much we can do about this type of
queries. (Note that the push model has its own drawbacks and thus does not
always win over the pull model. That's why Trino is much slower than Hive
on many queries.)

2. Trino generates a query plan that is clearly more efficient than Hive.
We made some attempt to find a solution in Hive, but came to a preliminary
conclusion that this would require a significant change in the query
compiler (e.g., if the decision made later during query compilation is
inconsistent with an earlier assumption, retry with a different assumption
until consistency is reached).

Currently we don't fully understand the internals of Trino compilation, and
don't know how Trino generates efficient query plans. If we figure it out,
we can try to replicate the process in Hive. If not, we can open a
discussion where we can discuss potential solutions.

Thanks,

--- Sungwoo


On Mon, Nov 25, 2024 at 11:18 PM Okumin <m...@okumin.com> wrote:

> Hi,
>
> Thanks for submitting the patches and writing the post with fantastic
> illustrations. The impressive documentation linking each ticket made it
> easy for me to review.
>
> I gave +1 to the last pull request. Unless anyone finds another issue, I
> will merge it in 1 day.
> https://github.com/apache/hive/pull/5424
>
> Let's keep Hive's performance competitive.
>
> Regards,
> Okumin
>
> On Thu, Oct 10, 2024 at 12:11 Sungwoo Park <glap...@gmail.com> wrote:
>
>> Hello,
>>
>> For your query, pre-partitioning is performed on if() expressions. Since
>> the output of if() expressions is skewed, the resultant query plan is
>> inefficient.
>>
>> We think that the pre-partitioning columns should be restricted to those
>> found in grouping sets (which I think is the case in Trino). We will test a
>> new patch for HIVE-28489 implementing this idea.
>>
>> Thanks a lot for the feedback on HIVE-28489!
>>
>> --- Sungwoo
>>
>>
>> On Wed, Oct 9, 2024 at 7:07 PM lisoda <lis...@yeah.net> wrote:
>>
>>> Hi.
>>>
>>> It's great to see this article. This indicates that since Hive 4.0,
>>> downstream vendors of the Hive project have also started to become more
>>> active. For the project's activity, this is a good thing. It shows that
>>> more and more developers and users are starting to try version 4.0.
>>>
>>> Back to this blog, it mentions three optimization patches, in fact, we
>>> noticed these three patches two weeks ago and have added them to our
>>> production environment. In most cases, these three patches can indeed
>>> greatly improve the efficiency of HIVEQL.
>>> However, we also encountered some problems in the process of using these
>>> three patches, and we raise these issues here in the hope of discussing
>>> them with everyone.
>>>
>>> The main problem we encountered was with HIVE-28489. Although in most
>>> cases, HIVE-28489 can effectively reduce the amount of SHUFFLE data and the
>>> load produced by grouping sets, it currently cannot optimize this type of
>>> scenario very well:
>>>
>>>
>>> SELECT
>>> count(distinct if(flag=0,uni_id,null)),
>>> count(distinct if(flag=1,uni_id,null)),
>>> count(distinct if(flag=2,uni_id,null))
>>> from tbl
>>> group by c1,c2,c3
>>> grouping sets(
>>> (c1,c2),
>>> (c2,c3),
>>> (c1,c3)
>>> );
>>>
>>>
>>> In this type of SQL, we need to make precise statistics on the uni_id
>>> field based on the value of flag. After introducing HIVE-28489, it uses the
>>> value of the expression if(flag=0,uni_id,null) as the grouping condition.
>>> However, expressions like if(flag=0,uni_id,null) generate a large number of
>>> null values, ultimately leading to skewed tasks. This makes the execution
>>> of the Query very slow.
>>>
>>> At present, we do not have a very good way to handle this scenario. Does
>>> anyone have experience dealing with this kind of problem? If someone can
>>> guide me or join the discussion, I would be very grateful.
>>>
>>> Thanks,
>>> --LiSoDa
>>>
>>>
>>>
>>>
>>> 在 2024-10-09 16:21:20,"Sungwoo Park" <glap...@gmail.com> 写道:
>>>
>>> Hi everyone,
>>>
>>> We have published a blog article that reports the performance
>>> improvement from three patches HIVE-28488, HIVE-28489, and HIVE-28490 which
>>> we submitted some time ago. It evaluates Hive 4.0.1 on MR3 and Trino 453 on
>>> the 10TB TPC-DS benchmark, but the results could be useful to users of
>>> Apache Hive 4 (with or without LLAP).
>>>
>>>
>>> https://www.datamonad.com/post/2024-10-09-optimizing-hive-4.0-performance/
>>>
>>> We got the ideas for the three patches by comparing query plans
>>> generated by Hive 4 and Trino (for those queries that Trino executes much
>>> faster than Hive 4). Currently the patches are not actively reviewed, so we
>>> would appreciate it if some committers could take a look and try merging
>>> them to the master branch. If you are currently using Hive 4, backporting
>>> these patches should improve the performance on some class of queries.
>>>
>>> Thanks,
>>>
>>> --- Sungwoo Park
>>>
>>>
>>>
>>>

Reply via email to