Re: Obvious and not so obvious query optimizations in Hive

Igor Tatarinov Wed, 27 Jun 2012 15:44:33 -0700

If you are optimizing for latency (running time) as opposed to throughput,
it's best to have a single "wave" of reducers. So if your cluster is set up
with a limit of, say, 2 reducers per node, using 2*N reduce tasks (where N
is the number of nodes) would work best for large queries. You have to
specify that in your script using
SET mapred.reduce.tasks = ...;
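
As a rough sketch of that arithmetic, assume a cluster of 10 nodes with 2
reduce slots each (both numbers are made up for illustration, not taken from
any real setup). A single wave would then be 2 * 10 = 20 reduce tasks:

```sql
-- Assumed cluster (illustrative numbers only): 10 nodes x 2 reduce slots per node = 20 slots.
-- One reduce task per slot means every reducer can start at once, in a single wave.
SET mapred.reduce.tasks = 20;
```

With more reduce tasks than slots, the extra tasks would wait for a second
wave, which is what hurts latency.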

GROUP BY doesn't limit the number of reducers, but ORDER BY does use a
single reducer, so that's slow. I never use ORDER BY though (Unix's sort is
probably faster). For analytics queries I need DISTRIBUTE BY/SORT BY (with
UDFs), which can use multiple reducers.
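
A minimal sketch of the DISTRIBUTE BY/SORT BY pattern, assuming a
hypothetical table page_views with columns user_id, view_time and url (the
table and column names are invented for illustration):

```sql
-- Hypothetical table: page_views(user_id, view_time, url).
-- DISTRIBUTE BY sends all rows with the same user_id to the same reducer.
-- SORT BY orders rows within each reducer only (no global order),
-- so the work can be spread over many reducers.
SELECT user_id, view_time, url
FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id, view_time;
```

Each reducer's output is sorted per user, but the output as a whole is not
globally ordered the way ORDER BY would make it.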

Hope this helps.
igor
decide.com

On Wed, Jun 27, 2012 at 8:47 AM, <[email protected]> wrote:

> 5. How does the number of reducers get set for a Hive query (the way
> group by and order by set the number of reducers to 1)? If I am not
> changing it explicitly, does it pick it up from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and
> cluster size.
