> 1. OOM condition -- I get the following error when I force a map join in
>hive/tez with low container size and heap size:"
>java.lang.OutOfMemoryError: Java heap space". I was wondering what is the
>condition which leads to this error.

You are not modifying the noconditionaltasksize to match the Xmx at all.

hive.auto.convert.join.noconditionaltask.size=(Xmx - io.sort.mb)/3.0;


> 2.  Shuffle Hash Join -- I am using hive 2.0.1. What is the way to force
>this join implementation? Is there any documentation regarding the same?

<https://issues.apache.org/jira/browse/HIVE-10673>


For full-fledged speed-mode, do

set hive.vectorized.execution.reduce.enabled=true;
set hive.optimize.dynamic.partition.hashjoin=true;
set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
set hive.mapjoin.hybridgrace.hashtable=false;

> 3. Hash table size: I use "--hiveconf hive.root.logger=INFO,console" for
>seeing logs. I do not see the hash table size in these logs.

No, the hashtables are no longer built on the gateway nodes - that used to
be a single point of failure when 20-25 usere are connected via the same
box.

The hashtable logs are in the task side (in this case, I would guess Map
2's logs would have it). The output is from a log like which looks like

yarn logs -applicationId <app-id> | grep Map.*metrics

> Map 1                      3                0            0
>37.11             65,710              1,039     15,000,000
>15,000,000


So you have 15 million keys going into a single hashtable? The broadcast
output rows is fed into the hashtable on the other side.

The map-join sort of runs out of steam after about ~4 million entries - I
would guess for your scenario setting the noconditional size to 8388608
(~8Mb) might trigger the good path.

Cheers,
Gopal




Reply via email to