Hive Order By Question

2019-02-06 Thread Mainak Ghosh
Hello, Hive Order by is known to be slow. It is slightly odd that it is slow even when we use a limit under strict mode. I am running this query over 3 billion rows with a limit of 20. It takes an hour to run. I expect the maps to do some sorting and limiting in parallel. That way the reducer

Re: Hive Order By Question

2019-02-06 Thread Mainak Ghosh
I am running an older version of Hive on MR. Does it have it too?
Mainak
> On Feb 6, 2019, at 3:43 PM, Gopal Vijayaraghavan wrote:
> >> I expect the maps to do some sorting and limiting in parallel. That way the reducer load would be small. I don’t think it does

Re: Hive Order By Question

2019-02-06 Thread Gopal Vijayaraghavan
> I am running an older version of Hive on MR. Does it have it too?
Hard to tell without an explain. AFAIK, this was fixed in Aug 2013 - how old is your build?
Cheers, Gopal

Re: Hive Order By Question

2019-02-06 Thread Gopal Vijayaraghavan
Hi,
That looks like the TopN hash optimization didn't kick in; that must be a settings issue in the install.
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
|

Re: Hive Order By Question

2019-02-06 Thread Gopal Vijayaraghavan
> I expect the maps to do some sorting and limiting in parallel. That way the reducer load would be small. I don’t think it does that. Can you tell me why?
They do. Which version are you running, is it Tez, and do you have an explain for the plan?
Cheers, Gopal
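The map-side sort-and-limit idea in this exchange can be sketched in a few lines of Python. This is only an illustration of the general technique, not Hive's implementation: the function names (`map_side_top_n`, `reduce_top_n`, `order_by_limit`) are hypothetical, and the splits are plain in-memory lists standing in for mapper inputs.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Sequence


def map_side_top_n(rows: Iterable[str], n: int) -> List[str]:
    """Hypothetical mapper step: keep only the n smallest keys of one split, in sorted order."""
    return heapq.nsmallest(n, rows)


def reduce_top_n(partials: Sequence[List[str]], n: int) -> List[str]:
    """Hypothetical reducer step: merge the sorted per-mapper lists and stop after n rows."""
    return list(islice(heapq.merge(*partials), n))


def order_by_limit(splits: Sequence[Iterable[str]], n: int) -> List[str]:
    """Emulate ORDER BY key LIMIT n (ascending) over several input splits."""
    partials = [map_side_top_n(split, n) for split in splits]
    return reduce_top_n(partials, n)


if __name__ == "__main__":
    # Toy data; each inner list plays the role of one mapper's input split.
    splits = [["pear", "apple", "fig"], ["kiwi", "banana"], ["cherry"]]
    print(order_by_limit(splits, 2))
```

With this shape, the single reducer receives at most n rows per mapper instead of the full input, which is why the limit is expected to keep the reducer load small. In Hive itself the corresponding behaviour is, to my knowledge, governed by the `hive.limit.pushdown.memory.usage` setting; treat that as an assumption to verify against the documentation for your version.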

Re: Hive Order By Question

2019-02-06 Thread Mainak Ghosh
Hey Gopal, I am using Apache Hive v2.3.2. Here is the explain:
| Explain |
| STAGE DEPENDENCIES: |
| Stage-1