Hello,
Hive ORDER BY is known to be slow. It is slightly odd that it is slow even when
we use a LIMIT under strict mode. I am running this query over 3 billion rows
with a limit of 20. It takes an hour to run. I expect the maps to do some
sorting and limiting in parallel. That way the reducer load would be small. I
don't think it does that. Can you tell me why?
I am running an older version of Hive on MR. Does it have it too?
Mainak
> On Feb 6, 2019, at 3:43 PM, Gopal Vijayaraghavan wrote:
>
>> I expect the maps to do some sorting and limiting in parallel. That way the
>> reducer load would be small. I don’t think it does that.
> I am running an older version of Hive on MR. Does it have it too?
Hard to tell without an explain.
AFAIK, this was fixed in Aug 2013 - how old is your build?
Cheers,
Gopal
Hi,
It looks like the TopN hash optimization didn't kick in; that must be a
settings issue in the install.
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
|
> I expect the maps to do some sorting and limiting in parallel. That way the
> reducer load would be small. I don’t think it does that. Can you tell me why?
They do.
Which version are you running, is it on Tez, and do you have an explain for
the plan?
Cheers,
Gopal
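
For readers following along, the map-side sort-and-limit discussed above can be sketched roughly as below. This is only an illustration in Python of the general idea, not Hive's actual implementation: each "mapper" keeps just its own N smallest keys, and the single "reducer" merges those short lists and applies the final limit. The function names (map_side_top_n, reduce_side_merge, top_n_order_by) are hypothetical and chosen for this sketch.

```python
import heapq
import itertools
from typing import Iterable, List


def map_side_top_n(rows: Iterable[str], n: int) -> List[str]:
    """Keep only the n smallest keys seen by one mapper, in ascending order."""
    return heapq.nsmallest(n, rows)


def reduce_side_merge(partials: Iterable[List[str]], n: int) -> List[str]:
    """Merge the already-sorted per-mapper lists and apply the final limit."""
    merged = heapq.merge(*partials)  # lazy merge of sorted inputs
    return list(itertools.islice(merged, n))


def top_n_order_by(splits: Iterable[Iterable[str]], n: int) -> List[str]:
    """Emulate ORDER BY key LIMIT n over several input splits."""
    partials = [map_side_top_n(split, n) for split in splits]
    return reduce_side_merge(partials, n)


if __name__ == "__main__":
    # Two hypothetical input splits, each handled by its own mapper.
    splits = [["d", "a", "f"], ["c", "b", "e"]]
    print(top_n_order_by(splits, 2))
```

With this shape, the reducer receives at most N rows per mapper instead of every row, which is why a limit of 20 should keep the reducer load small. In Hive itself, the setting usually associated with this optimization is hive.limit.pushdown.memory.usage; whether that is what is misconfigured in this install is an assumption, not something the thread confirms.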
Hey Gopal,
I am using Apache Hive v2.3.2. Here is the explain:
++
| Explain |
++
| STAGE DEPENDENCIES:|
| Stage-1