On Jun 17, 2013, at 7:24 AM, Ido Hadanny wrote:

> Hey,
> 
> We noticed that the current skewed join supports only 1 skewed table, and
> assumes that the second table isn't skewed.
> Please review this suggestion for a 2 skewed tables design:
> 
>   - Sample both tables
>   - For each skewed key (with many records in at least one table), build a
>   surrogate key in a GFCross style - e.g. if for this key there are 3M records
>   from the left table and 7M from the right table, and there are 100 reducers
>   available, build GFCross with dimensions of sqrt(100*3/7) and sqrt(100*7/3)
> 
> What do you say? Is this a necessary enhancement request? Or is it safe to
> assume that only one table will be skewed in each join?
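
For illustration, the sizing rule in the proposal above can be sketched as
follows. This is only a sketch under stated assumptions, not Pig's
implementation: the function names (gfcross_dims, surrogate_keys), the
rounding down to whole partitions, and the replicate-across-the-other-dimension
step are assumptions about how a GFCross-style grid would be applied per key.

```python
import math
import random


def gfcross_dims(left_count, right_count, reducers):
    """Hypothetical helper: size a 2-D reducer grid for one skewed key.

    The ideal real-valued dimensions satisfy a * b == reducers and
    a / b == left_count / right_count, which gives the two square roots
    from the proposal.
    """
    a = math.sqrt(reducers * left_count / right_count)
    b = math.sqrt(reducers * right_count / left_count)
    # Assumption: round down and clamp to [1, reducers] so the grid
    # never needs more cells than there are reducers.
    rows = min(reducers, max(1, math.floor(a)))
    cols = min(reducers, max(1, math.floor(b)))
    return rows, cols


def surrogate_keys(key, side, dims, rng=random):
    """Hypothetical helper: grid cells that one record is sent to.

    A left record picks one random row and is copied to every column;
    a right record picks one random column and is copied to every row.
    Any left/right pair for the key therefore meets in exactly one cell.
    """
    rows, cols = dims
    if side == "left":
        r = rng.randrange(rows)
        return [(key, r, c) for c in range(cols)]
    c = rng.randrange(cols)
    return [(key, r, c) for r in range(rows)]


# The example from the proposal: 3M left records, 7M right records, 100 reducers.
dims = gfcross_dims(3_000_000, 7_000_000, 100)
left_cells = surrogate_keys("k", "left", dims)
right_cells = surrogate_keys("k", "right", dims)
```

With these numbers the grid comes out as 6 x 15, so each left record is copied
15 times and each right record 6 times, and 90 of the 100 reducers are used for
this key.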

When we built the original skewed join we chose to worry about it only in the 
case of 1 table being skewed for two reasons:

1) It made joins of skewed tables (even two skewed tables) possible.  Before, 
it was possible to have a join where neither table could fit all instances of 
a given key in memory (as the default hash join implementation requires), and 
thus the join could not be done.  With this implementation you are guaranteed that 
you can split key instances for one of the inputs and thus complete the join.  
If it is skewed on both sides the join will still be slow, as you point out.
2) It addressed most of our use cases.

Obviously being able to handle cases where both sides are skewed more 
efficiently will be very valuable.  If you're thinking of contributing in this 
area I encourage you to file a JIRA with your proposal.

Alan.
> 
> Thanks, Dudu and Ido
> 
> -- 
> Sent from my androido
