DISTRIBUTE BY to distribute joins or maybe something else?

Premal Shah Thu, 16 Oct 2014 01:22:41 -0700

I have 2 tables that are inner joined and the join keys have low
cardinality and high volume within each key ie. key1 on both sides
sometimes have millions of rows


When joining, which happens in the reduce stage, the tasks take forever
since there are too many keys to join.

I tried using DISTRIBUTE BY using another field assuming that the data will
get partitioned on that key on the left table (effectively using more
reducers) and steam the table on the right side. But taht does not seem to
work. Is DISTRIBUTE BY the wrong thing to use for this use case?

Is there any other way to partition the tables so that the joins are faster?

-- 
Regards,
Premal Shah.

DISTRIBUTE BY to distribute joins or maybe something else?

Reply via email to