> 1. OOM condition -- I get the following error when I force a map join in >hive/tez with low container size and heap size:" >java.lang.OutOfMemoryError: Java heap space". I was wondering what is the >condition which leads to this error.
You are not modifying the noconditionaltasksize to match the Xmx at all. hive.auto.convert.join.noconditionaltask.size=(Xmx - io.sort.mb)/3.0; > 2. Shuffle Hash Join -- I am using hive 2.0.1. What is the way to force >this join implementation? Is there any documentation regarding the same? <https://issues.apache.org/jira/browse/HIVE-10673> For full-fledged speed-mode, do set hive.vectorized.execution.reduce.enabled=true; set hive.optimize.dynamic.partition.hashjoin=true; set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true; set hive.mapjoin.hybridgrace.hashtable=false; > 3. Hash table size: I use "--hiveconf hive.root.logger=INFO,console" for >seeing logs. I do not see the hash table size in these logs. No, the hashtables are no longer built on the gateway nodes - that used to be a single point of failure when 20-25 usere are connected via the same box. The hashtable logs are in the task side (in this case, I would guess Map 2's logs would have it). The output is from a log like which looks like yarn logs -applicationId <app-id> | grep Map.*metrics > Map 1 3 0 0 >37.11 65,710 1,039 15,000,000 >15,000,000 So you have 15 million keys going into a single hashtable? The broadcast output rows is fed into the hashtable on the other side. The map-join sort of runs out of steam after about ~4 million entries - I would guess for your scenario setting the noconditional size to 8388608 (~8Mb) might trigger the good path. Cheers, Gopal