Hi, I was wondering if there was a way in Hive to trigger it to perform an efficient equality join on large tables? Specifically, I have two or more tables where the joined key is relatively rare in each table. A good example would be an AdClick scenario where you would have two tables, one for ad loads and one for ad clicks, where there is a reference id to connect a click to a specific ad load and that's the key to join on.
In the MapReduce framework, this join could be done efficiently by using the reference id as the key from the map output and then the join would be done in the reducer since there are a small number of rows associated with each reference id. However, hive implements join by buffering all but the last tables in the reducer, and then streams the last table through. This works great if the first tables are relatively small, but if the tables are large and the associated rows are rare, it's very inefficient. So, can hive perform the join the first way I described? If so, any idea how I trigger it? Or do I have to write my own MR job? Cheers, Mark