Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Stuart White
I need to do a reduce-side join of two datasets. It's a many-to-many join; that is, each dataset can contain multiple records with any given key. Every description of a reduce-side join I've seen involves constructing your keys in your mapper such that records from one dataset will be presented …
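The pattern Stuart describes can be sketched without Hadoop: the mapper tags each value with the dataset it came from, the shuffle delivers all tagged values for a key to one reducer, and the reducer buffers one side and cross-joins the other against it. This is a minimal, plain-Java sketch (class and tag names are illustrative, not from the thread), and it shows exactly why many-to-many is awkward — one whole side must fit in the reducer's memory:

```java
import java.util.*;

// Sketch (not real Hadoop code): a many-to-many reduce-side join where
// each mapper output value carries a tag ("A" or "B") naming its source
// dataset. This stands in for one reduce call: it buffers side A, then
// streams side B and emits the cross product for the key.
public class TaggedJoinSketch {

    // taggedValues: all {tag, value} pairs the shuffle grouped under `key`
    static List<String> reduce(String key, List<String[]> taggedValues) {
        List<String> sideA = new ArrayList<>();   // the buffered side
        List<String> joined = new ArrayList<>();
        for (String[] tv : taggedValues) {
            if (tv[0].equals("A")) sideA.add(tv[1]);
        }
        for (String[] tv : taggedValues) {
            if (tv[0].equals("B")) {
                for (String a : sideA) {          // many-to-many cross product
                    joined.add(key + "\t" + a + "\t" + tv[1]);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> values = Arrays.asList(
            new String[]{"A", "a1"}, new String[]{"A", "a2"},
            new String[]{"B", "b1"}, new String[]{"B", "b2"});
        System.out.println(reduce("k1", values)); // 2 x 2 = 4 joined rows
    }
}
```

In real Hadoop jobs the tag is usually folded into a composite key with a secondary sort, so one side's records arrive before the other's; the buffering problem this thread discusses is the same either way.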

Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Todd Lipcon
Hi Stuart, It seems to me like you have a few options. Option 1: Just use a lot of RAM. Unless you really expect many millions of entries on both sides of the join, you might be able to get away with buffering despite its inefficiency. Option 2: Use LocalDirAllocator to find some local storage …
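Todd's Option 2 — spill one side to local disk instead of RAM — can be sketched like this (my reading of the truncated suggestion; in real Hadoop code `LocalDirAllocator` picks the local directory, and a plain temp file stands in for it here):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Sketch of "spill one side to local storage": inside one reduce call,
// write side A's values to a local file, then re-scan that file once for
// every record of side B. Memory use stays constant; the cost is
// re-reading the spill file |B| times.
public class DiskBufferedJoinSketch {

    static List<String> join(String key, List<String> sideA, List<String> sideB) {
        try {
            Path spillFile = Files.createTempFile("join-sideA", ".txt");
            try (BufferedWriter w = Files.newBufferedWriter(spillFile)) {
                for (String a : sideA) { w.write(a); w.newLine(); }
            }
            List<String> joined = new ArrayList<>();
            for (String b : sideB) {
                // "rewind" by reopening the spill file for each B record
                try (BufferedReader r = Files.newBufferedReader(spillFile)) {
                    for (String a = r.readLine(); a != null; a = r.readLine()) {
                        joined.add(key + "\t" + a + "\t" + b);
                    }
                }
            }
            Files.delete(spillFile);
            return joined;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(join("k1",
            Arrays.asList("a1", "a2"), Arrays.asList("b1", "b2")));
        // 2 x 2 = 4 joined rows
    }
}
```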

Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Todd Lipcon
One last possible trick to consider: If you were to subclass SequenceFileRecordReader, you'd have access to its seek method, allowing you to rewind the reducer input. You could then implement a block hash join with something like the following pseudocode: ahash = new HashMap&lt;Key, Val&gt;(); while (i …
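The pseudocode is cut off, but the block hash join Todd names works like this (a plain-Java sketch; a list index stands in for `SequenceFileRecordReader.seek`, and the names are illustrative): buffer a fixed-size block of dataset A in a hash map, stream all of dataset B against it, rewind B, and repeat with the next block of A.

```java
import java.util.*;

// Sketch of a block hash join. Only BLOCK records of A are ever held in
// memory; dataset B is re-streamed once per block, which is what the
// seek/rewind trick on SequenceFileRecordReader would enable in Hadoop.
public class BlockHashJoinSketch {
    static final int BLOCK = 2; // max A records held in memory at once

    // a, b: lists of {joinKey, value} records
    static List<String> join(List<String[]> a, List<String[]> b) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < a.size(); start += BLOCK) {
            // build the hash for this block of A: join key -> values
            Map<String, List<String>> ahash = new HashMap<>();
            int end = Math.min(start + BLOCK, a.size());
            for (String[] rec : a.subList(start, end)) {
                ahash.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
            }
            // stream (or, after a rewind, re-stream) all of B
            for (String[] rec : b) {
                for (String av : ahash.getOrDefault(rec[0], Collections.emptyList())) {
                    out.add(rec[0] + "\t" + av + "\t" + rec[1]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> a = Arrays.asList(new String[]{"k1", "a1"},
            new String[]{"k1", "a2"}, new String[]{"k2", "a3"});
        List<String[]> b = Arrays.asList(new String[]{"k1", "b1"},
            new String[]{"k2", "b2"});
        System.out.println(join(a, b)); // 3 joined rows across two blocks
    }
}
```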

Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Chris K Wensel
I believe PIG, and I know Cascading, use a kind of 'spillable' list that can be re-iterated across. PIG's version is a bit more sophisticated, last I looked. That said, if you were using either one of them, you wouldn't need to write your own many-to-many join. cheers, ckw
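The spillable-list idea can be reduced to a toy version (Pig's `DataBag` and Cascading's spillable tuple collections are the real, more sophisticated implementations; this sketch assumes nothing about their APIs): keep the first few items in memory, overflow the rest to a temp file, and let the whole sequence be iterated any number of times.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Toy spillable list: the first MEMORY_LIMIT items stay in memory, the
// rest spill to a temp file, and the collection is re-iterable — exactly
// what a reducer needs to cross-join one side repeatedly.
public class SpillableListSketch implements Iterable<String> {
    static final int MEMORY_LIMIT = 2;
    private final List<String> inMemory = new ArrayList<>();
    private Path spill; // created lazily on first overflow

    public void add(String item) {
        if (inMemory.size() < MEMORY_LIMIT) {
            inMemory.add(item);
            return;
        }
        try {
            if (spill == null) spill = Files.createTempFile("bag", ".txt");
            Files.write(spill, Collections.singletonList(item),
                        StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override public Iterator<String> iterator() {
        // For brevity the spill file is re-read whole on each iteration;
        // a real implementation streams it record by record.
        List<String> all = new ArrayList<>(inMemory);
        if (spill != null) {
            try { all.addAll(Files.readAllLines(spill)); }
            catch (IOException e) { throw new UncheckedIOException(e); }
        }
        return all.iterator();
    }

    public static void main(String[] args) {
        SpillableListSketch bag = new SpillableListSketch();
        for (String s : Arrays.asList("v1", "v2", "v3", "v4")) bag.add(s);
        for (String s : bag) System.out.println(s); // first pass: 4 items
        for (String s : bag) System.out.println(s); // second pass works too
    }
}
```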

Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread jason hadoop
Use the map-side join stuff; if I understand your problem, it provides a good solution, but requires getting over the learning hurdle. Well described in chapter 8 of my book :)
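Hadoop's map-side join (the `CompositeInputFormat` machinery Jason is referring to) requires both inputs to be sorted by key and partitioned identically, so each map task can merge-join its two splits with no reduce phase at all. The core of that merge, including many-to-many keys, looks roughly like this (a plain-Java sketch, not Hadoop's actual implementation):

```java
import java.util.*;

// Sketch of a sorted merge join: walk two key-sorted inputs in lockstep,
// and when the keys match, find each side's run of equal keys and emit
// the cross product. This is what a map-side join does per split.
public class SortedMergeJoinSketch {

    // a, b: {joinKey, value} records, each list sorted by joinKey
    static List<String> join(List<String[]> a, List<String[]> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i)[0].compareTo(b.get(j)[0]);
            if (cmp < 0) { i++; continue; }
            if (cmp > 0) { j++; continue; }
            String key = a.get(i)[0];
            int iEnd = i, jEnd = j; // find each side's run of equal keys
            while (iEnd < a.size() && a.get(iEnd)[0].equals(key)) iEnd++;
            while (jEnd < b.size() && b.get(jEnd)[0].equals(key)) jEnd++;
            for (int x = i; x < iEnd; x++)      // many-to-many cross product
                for (int y = j; y < jEnd; y++)
                    out.add(key + "\t" + a.get(x)[1] + "\t" + b.get(y)[1]);
            i = iEnd;
            j = jEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> a = Arrays.asList(new String[]{"k1", "a1"},
            new String[]{"k1", "a2"}, new String[]{"k2", "a3"});
        List<String[]> b = Arrays.asList(new String[]{"k1", "b1"},
            new String[]{"k3", "b2"});
        System.out.println(join(a, b)); // k1 matches: 2 x 1 = 2 joined rows
    }
}
```

The "learning hurdle" is real: getting the inputs pre-sorted and partitioned the same way is the job's responsibility, not the join's.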