Re: How to move tasks under reducer to Mapper phase

Furcy Pin Sat, 10 Dec 2016 00:41:44 -0800

Hi Mahender,

It's hard to say what is happening without seeing the actual query.


Hive has several ways to perform joins. There is a complete description of
how it does so here:

https://cwiki.apache.org/confluence/display/Hive/MapJoinOptimization
Sadly, the illustrations are broken.

There is also this presentation:
https://www.youtube.com/watch?v=OB4H3Yt5VWM

And the corresponding slides:
https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf

However, these docs date from the MapReduce era and are quite old now,
so it is hard to tell whether everything still works the same way with Tez today.

If all your tables are big, I would say there is not much to optimize,
except trying to bucket and sort them beforehand.
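
To illustrate why bucketing can help, here is a minimal Python sketch (not
Hive code) of the idea: if both tables are split into the same number of
buckets by a hash of the join key, rows with the same key always land in the
same bucket number, so each pair of buckets can be joined on its own. The
function names, the sample tables and the modulo-based bucket assignment are
assumptions made for illustration; Hive's actual hashing and join
implementation differ in detail, and the sketch does an inner join for
simplicity.

```python
from collections import defaultdict

def bucketize(rows, key_index, num_buckets):
    """Split rows into buckets by hashing the join key (illustrative only)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key_index]) % num_buckets].append(row)
    # Sort each bucket by key, mirroring a table that is bucketed AND sorted.
    for bucket in buckets.values():
        bucket.sort(key=lambda r: r[key_index])
    return buckets

def bucketed_join(left, right, num_buckets=4):
    """Inner-join two tables on column 0 by joining matching buckets pairwise."""
    left_buckets = bucketize(left, 0, num_buckets)
    right_buckets = bucketize(right, 0, num_buckets)
    out = []
    for b in range(num_buckets):
        # Only bucket b of the left side can match bucket b of the right side.
        for l in left_buckets.get(b, []):
            for r in right_buckets.get(b, []):
                if l[0] == r[0]:
                    out.append((l[0], l[1], r[1]))
    return sorted(out)

# Hypothetical sample data: (join key, payload).
orders = [(1, "o1"), (2, "o2"), (2, "o3"), (5, "o4")]
customers = [(1, "alice"), (2, "bob"), (3, "carol")]
result = bucketed_join(orders, customers)
print(result)
```

Each bucket pair is a small, independent unit of work, which is what lets
the join be spread evenly instead of piling up on a few tasks.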


Last but not least:

When I see this kind of behavior (a reducer stuck during a JOIN), more often
than not
it is simply because the JOIN clause is incorrect, and the reducer
generates far too much data.

Just imagine what would happen if you did a "JOIN ON 1 = 1"
between two tables with 10^9 records... You can actually kill a cluster
with this
if you let it run long enough.
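
As a rough way to see the arithmetic, here is a small Python sketch (again
not Hive code) that estimates how many rows an inner equi-join produces from
the key counts on each side: every key contributes left_count * right_count
rows. The function names and sample keys are hypothetical. For a left join,
each unmatched left row would add one more output row.

```python
from collections import Counter

def join_output_rows(left_keys, right_keys):
    """Rows an inner equi-join produces: sum over keys of left_count * right_count."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return sum(n * right[k] for k, n in left.items())

def heaviest_keys(left_keys, right_keys, top=3):
    """Keys ranked by how many output rows they contribute."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    contrib = {k: n * right[k] for k, n in left.items() if right[k]}
    return sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)[:top]

# A well-behaved join: keys are unique on the right side.
small = join_output_rows([1, 2, 2, 3], [1, 2, 3])

# A skewed join: key 0 (say, a default placeholder value) repeats on both sides.
left = [0] * 1000 + [1, 2, 3]
right = [0] * 1000 + [1, 2, 3]
skewed = join_output_rows(left, right)
worst = heaviest_keys(left, right, top=1)

# "JOIN ON 1 = 1" degenerates to a cross product: every row matches every row.
cross = 10**9 * 10**9

print(small, skewed, worst, cross)
```

In the skewed case a single key accounts for almost all of the output, and
all of its rows go to the same reducer, which matches the "one task stuck
at 95%" symptom.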





On Fri, Dec 9, 2016 at 10:31 PM, Mahender Sarangam <
mahender.bigd...@outlook.com> wrote:

> Hi,
>
> We are performing left joining on 5-6 larger tables. We see job is hanging
> around 95%. All the mappers completed fast and some of the reducer are also
> completed fast. but some of reducer are hanging state because single task
> is running on large data. Below are the Mapper and Reducer captured.
>
>
>
>    - Is there a way to move task running under Reducer phase to Mapper
>    phase. I mean tweaking with memory settings or modifying the query to have
>    more mapper tasks than reducer task.
>
>
>    - Is there a way to know what part of query is taken by task which is
>    running for long time. or what amount of rows this task is running upon (
>    so that i can think of partition or alternate approach)
>    - Any other memory setting to resolve hanging issue. Below is our
>    memory settings
>
> SET hive.tez.container.size = -1;
> SET hive.execution.engine=tez;
> SET hive.mapjoin.hybridgrace.hashtable=FALSE;
> SET hive.optimize.ppd=true;
> SET hive.cbo.enable =true;
> SET hive.compute.query.using.stats =true;
> SET hive.exec.parallel=true;
> SET hive.vectorized.execution.enabled=true;
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.auto.convert.join=false;
> SET hive.auto.convert.join.noconditionaltask=false;
> set hive.tez.java.opts = "-Xmx3481m";
> set hive.tez.container.size = 4096;
> --SET mapreduce.map.memory.mb=4096;
> --SET mapreduce.map.java.opts = -Xmx3000M;
> --SET mapreduce.reduce.memory.mb = 2048;
> --SET mapreduce.reduce.java.opts = -Xmx1630M;
> SET fs.block.size=67108864;
>
>
> Thanks in advance
>
>
> -Mahender
>
>
>
>
>
