Unfortunately, neither side is small enough to support either a cross or an in-memory replicated join.
But opt3 does make sense; I think I'm overthinking things. I can utilise a UDF to do the equivalent of tokenisation and then, like you say, just do a join. In terms of the multiple joins, I can just do all three, count the matches, and only allow the cases where all three match.

Thanks!
Mat

On Aug 29, 2012 5:06 PM, "Jonathan Coveney" wrote:

> You're not missing anything obvious... what you're trying to do, on face
> value, is not an easy thing to do. In M/R, joining is done based on
> partitioning to the same reducer... how can you do that if you have a case
>
> foo
> bar
>
> foo bar
>
> and foo is sent to reducer 1, bar to reducer 2? There's no way to know
> where keys should be sent.
>
> That said, there are options.
>
> Option 1: a cross. Undesirable because of data explosion.
> Option 2: If one of the data sets is small enough to fit in memory, you can
> make a UDF that brings it in, and does the join for you. This is
> essentially option 1.
> Option 3: Less generically, exploit the join you're actually doing. In the
> dummy example, it looks like you're checking if a token is contained in
> another string. You could convert this into a join by tokenizing,
> flattening, doing the join, etc. I don't know how close your real use case
> is to what you posted.
>
> Jon
>
>
> 2012/8/29 Mat Kelcey
>
> > Hello!
> >
> > Considering the following two relations...
> >
> > grunt> querys = load 'query' as (id:int, token:chararray);
> > grunt> dump querys
> > (11,foo)
> > (12,bar)
> > (13,frog)
> >
> > and
> >
> > grunt> documents = load 'document' as (id:int, text:chararray);
> > grunt> dump documents;
> > (21,foo bar frog)
> > (22,hello frog)
> >
> > Is it possible to do a join where the query:token is not equal to but
> > contained in documents:text ?
> >
> > eg
> > (11,foo,21,foo bar frog)
> > (12,bar,21,foo bar frog)
> > (13,frog,21,foo bar frog)
> > (13,frog,22,hello frog)
> >
> > I can certainly do this in Java map/reduce (as we all had to in the
> > dark days before pig) but is there a way to hack this together
> > with a custom udf or some other weird join backdoor (custom
> > partitioner for a group or something whacky) ???
> >
> > It's been a long day, maybe I'm just missing something super obvious..
> >
> > Cheers!
> > Mat
> >
>
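
For reference, the tokenise, flatten and join idea from option 3 could be sketched in Pig roughly as below. This is only a sketch under assumptions: it uses the built-in TOKENIZE in place of a custom UDF, it assumes TOKENIZE's default delimiters (which include whitespace) are adequate for the real data, and the alias names doc_tokens, uniq_tokens, matches and result are made up for illustration.

```pig
-- Same inputs as in the original question.
querys    = LOAD 'query'    AS (id:int, token:chararray);
documents = LOAD 'document' AS (id:int, text:chararray);

-- One row per (document, token): TOKENIZE returns a bag of tokens,
-- and FLATTEN un-nests that bag into separate rows.
doc_tokens = FOREACH documents GENERATE
                 id AS doc_id, text, FLATTEN(TOKENIZE(text)) AS token;

-- A document that repeats a token would otherwise match the same query twice.
uniq_tokens = DISTINCT doc_tokens;

-- "Contained in" has now become a plain equi-join on token.
matches = JOIN querys BY token, uniq_tokens BY token;

-- Keep only the columns shown in the expected output of the question.
result = FOREACH matches GENERATE
             querys::id, querys::token, uniq_tokens::doc_id, uniq_tokens::text;
DUMP result;
```

On the sample data this should give the four rows listed in the question, though the order of the output is not guaranteed. Since neither input fits in memory, the point of the design is that only the documents side grows (by one row per distinct token), and the join itself stays an ordinary reduce-side join.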
