Efficient Equality Joins of Large Tables

Mark Desnoyer Mon, 09 Jun 2014 09:59:16 -0700

Hi,

I was wondering if there was a way in Hive to trigger it to perform an
efficient equality join on large tables? Specifically, I have two or more
tables where the joined key is relatively rare in each table. A good
example would be an AdClick scenario where you would have two tables, one
for ad loads and one for ad clicks, where there is a reference id to
connect a click to a specific ad load and that's the key to join on.


In the MapReduce framework, this join could be done efficiently by using
the reference id as the key from the map output and then the join would be
done in the reducer since there are a small number of rows associated with
each reference id. However, hive implements join by buffering all but the
last tables in the reducer, and then streams the last table through. This
works great if the first tables are relatively small, but if the tables are
large and the associated rows are rare, it's very inefficient.

So, can hive perform the join the first way I described? If so, any idea
how I trigger it? Or do I have to write my own MR job?

Cheers,
Mark

Efficient Equality Joins of Large Tables

Reply via email to