I am using CDH5 with Hive 0.12. We have some Hive jobs that were migrated from Hive 0.10 and are written like the one below:

    select /*+ MAPJOIN(sup) */ c1, c2, sup.c
    from (
      select key, c1, c2 from table1
      union all
      select key, c1, c2 from table2
    ) table
    left outer join sup on (table.c1 = sup.key)
    distribute by c1

In Hive 0.10 (CDH4), Hive translates the left outer join into a map join (a map-only job), followed by a regular MR job for the distribute by.

In Hive 0.12 (CDH5), Hive is not able to convert the join into a map join. Instead, it launches a common-join MapReduce job for the join, followed by another MR job for the distribute by. However, when I take out the union all operator, Hive is able to create a single MR job, with the map join in the map phase and the reduce phase handling the distribute by.

I read a bit on https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins and found out that there are some restrictions on map-side joins starting with Hive 0.11. The following are not supported:

- Union Followed by a MapJoin
- Lateral View Followed by a MapJoin
- Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed by MapJoin
- MapJoin Followed by Union
- MapJoin Followed by Join
- MapJoin Followed by MapJoin

So if one side of the join (the big side) is a union of several tables and the other side is a small table, Hive would not be able to do a map join at all? Is that correct?

If so, what should I do to make the job backward compatible?

--
Chen Song
