hive auto join conversion

Chen Song Wed, 30 Jul 2014 18:05:26 -0700

I am using CDH5 with Hive 0.12. We have some Hive jobs that were migrated
from Hive 0.10 and are written like this:


select /*+ MAPJOIN(sup) */ c1, c2, sup.c
from
(
select key, c1, c2 from table1
union all
select key, c1, c2 from table2
) table
left outer join
sup
on (table.c1 = sup.key)
distribute by c1

In Hive 0.10 (CDH4), Hive translates the left outer join into a map join
(a map-only job), followed by a regular MR job for the distribute by.

In Hive 0.12 (CDH5), Hive does not convert the join into a map join.
Instead it launches a common join as a MapReduce job, followed by another
MR job for the distribute by. However, when I take out the union all, Hive
seems to be able to create a single MR job, with the map join in the map
phase and the distribute by handled in the reduce phase.
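
One way to compare the two cases is to prefix each query with EXPLAIN and
look at the stages Hive plans. The statement below is only a sketch of the
variant without the union: reading table1 directly under the alias t is my
simplification for illustration, not the production query.

```sql
-- Illustrative variant without the union all (assumption: only table1
-- is on the big side). EXPLAIN prints the planned stages without
-- running the job, so the map join operator and the number of MR
-- stages can be compared against the plan for the original query.
explain
select /*+ MAPJOIN(sup) */ t.c1, t.c2, sup.c
from table1 t
left outer join
sup
on (t.c1 = sup.key)
distribute by c1;
```

Running the same EXPLAIN on the original query with the union all should
show whether the join ends up in a separate MR stage.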

I read a bit on
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
and found that there are some restrictions on map-side joins starting with
Hive 0.11. The following are not supported:


   - Union Followed by a MapJoin
   - Lateral View Followed by a MapJoin
   - Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed
   by MapJoin
   - MapJoin Followed by Union
   - MapJoin Followed by Join
   - MapJoin Followed by MapJoin


So if one side of the join (the big side) is a union of several tables and
the other side is a small table, Hive cannot do a map join at all? Is that
correct?

If correct, what should I do to make the job backward compatible?
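
For reference, these are the settings that, as far as I understand,
influence map join conversion in Hive 0.11 and later. Running set with only
the property name prints the current value, so this is just a sketch for
comparing the CDH4 and CDH5 environments, not a fix:

```sql
-- Print current values (no assignment) to compare the two clusters.
-- Automatic conversion of common joins to map joins:
set hive.auto.convert.join;
-- Whether /*+ MAPJOIN(...) */ hints are ignored:
set hive.ignore.mapjoin.hint;
-- Conversion without a conditional task, and its size threshold in bytes:
set hive.auto.convert.join.noconditionaltask;
set hive.auto.convert.join.noconditionaltask.size;
-- Size threshold in bytes for a table to count as small:
set hive.mapjoin.smalltable.filesize;
```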

-- 
Chen Song
