If you are optimizing for latency (running time) rather than throughput, it's best to have a single "wave" of reducers. So if your cluster is set up with a limit of, say, 2 reducers per node, then on N nodes 2*N reduce tasks would work best (for large queries). You have to specify that in your script using SET mapred.reduce.tasks = ...;
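To make the arithmetic concrete, here is a rough sketch in HiveQL. The cluster size (10 nodes), the per-node limit (2 reducers) and the page_views table with its columns are all made up for illustration, so plug in your own numbers and names.

```sql
-- Assumption: 10 nodes, each limited to 2 concurrent reducers.
-- One "wave" of reducers = 10 nodes * 2 reducers per node = 20 reduce tasks.
SET mapred.reduce.tasks = 20;

-- Hypothetical large aggregation; page_views and its columns are
-- placeholders, not names from the original question.
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;
```

With fewer than 20 reduce tasks some reducer slots sit idle, and with more than 20 the extra tasks have to wait for a second wave, which is what hurts running time.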
GROUP BY doesn't limit the number of reducers, but ORDER BY does use a single reducer, so it's slow. I never use ORDER BY though (Unix's sort is probably faster). For analytics queries I need DISTRIBUTE BY / SORT BY (with UDFs), which can use multiple reducers.

Hope this helps.

igor
decide.com

On Wed, Jun 27, 2012 at 8:47 AM, <[email protected]> wrote:
> 5. How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop cluster?
> I am trying to understand the bottleneck between query and cluster size.
