I have 2 tables that are inner joined and the join keys have low cardinality and high volume within each key ie. key1 on both sides sometimes have millions of rows
When joining, which happens in the reduce stage, the tasks take forever since there are too many keys to join. I tried using DISTRIBUTE BY using another field assuming that the data will get partitioned on that key on the left table (effectively using more reducers) and steam the table on the right side. But taht does not seem to work. Is DISTRIBUTE BY the wrong thing to use for this use case? Is there any other way to partition the tables so that the joins are faster? -- Regards, Premal Shah.
