Unfortunately neither side is small enough to support either a cross or an
in-memory replicated join.

But option 3 does make sense; I think I'm overthinking things. I can use a
UDF to do the equivalent of tokenisation and then, like you say, just do a join.
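
A minimal sketch of that tokenise-then-join idea, assuming Pig's built-in
TOKENIZE (which splits on whitespace and a few punctuation characters) is
close enough to the real tokenisation; a custom UDF could be swapped in.
The aliases doc_tokens, doc_tokens_d, joined and result are made up for
illustration:

```pig
querys    = load 'query' as (id:int, token:chararray);
documents = load 'document' as (id:int, text:chararray);

-- One row per (document, token); TOKENIZE returns a bag, flatten unnests it.
doc_tokens = foreach documents generate id, text, flatten(TOKENIZE(text)) as token;

-- Drop repeats so a token occurring twice in a document matches only once.
doc_tokens_d = distinct doc_tokens;

-- The "contains" check is now an ordinary equi-join on token.
joined = join querys by token, doc_tokens_d by token;
result = foreach joined generate querys::id, querys::token,
                                 doc_tokens_d::id, doc_tokens_d::text;
```

The flatten multiplies the document side by its token count rather than by
the size of the query side, which is what makes this cheaper than a cross.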

As for the multiple joins, I can just do all three, count the matches,
and only keep the cases where all three match.
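
Roughly, and only as a sketch: this assumes the three things to match can be
laid out one token per row under a shared query id, which lets a single join
plus a count stand in for three separate joins. The 'query_tokens' input and
every alias below are hypothetical, not from the real data:

```pig
-- Assumed layout: one row per (query id, token), three tokens per query.
query_tokens = load 'query_tokens' as (qid:int, token:chararray);
documents    = load 'document' as (id:int, text:chararray);

doc_tokens     = foreach documents generate id as did, flatten(TOKENIZE(text)) as token;
doc_tokens_d   = distinct doc_tokens;    -- each token counted once per document
query_tokens_d = distinct query_tokens;  -- each token counted once per query

matches = join query_tokens_d by token, doc_tokens_d by token;
pairs   = foreach matches generate query_tokens_d::qid as qid, doc_tokens_d::did as did;

-- Count how many of a query's tokens each document matched.
grouped = group pairs by (qid, did);
counts  = foreach grouped generate flatten(group) as (qid, did), COUNT(pairs) as n;

-- Keep only the query/document pairs where all three tokens matched.
all_three = filter counts by n == 3;
```

The two distinct steps matter here: without them a repeated token could push
the count to 3 without all three tokens actually being present.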

Thanks!
Mat
On Aug 29, 2012 5:06 PM, "Jonathan Coveney" <[email protected]> wrote:

> You're not missing anything obvious... what you're trying to do, on face
> value, is not an easy thing to do. In M/R, joining is done based on
> partitioning to the same reducer...how can you do that if you have a case
>
> foo
> bar
>
> foo bar
>
> and foo is sent to reducer 1, bar to reducer 2? There's no way to know
> where keys should be sent.
>
> That said, there are options.
>
> Option 1: a cross. Undesirable because of data explosion.
> Option 2: If one of the data sets is small enough to fit in memory, you can
> make a UDF that brings it in, and does the join for you. This is
> essentially option 1.
> Option 3: Less generically, exploit the join you're actually doing. In the
> dummy example, it looks like you're checking if a token is contained in
> another string. You could convert this into a join by tokenizing,
> flattening, doing the join, etc. I don't know how close your real use case
> is to what you posted.
>
> Jon
>
>
> 2012/8/29 Mat Kelcey <[email protected]>
>
> > Hello!
> >
> > Considering the following two relations...
> >
> > grunt> querys = load 'query' as (id:int, token:chararray);
> > grunt> dump querys
> > (11,foo)
> > (12,bar)
> > (13,frog)
> >
> > and
> >
> > grunt> documents = load 'document' as (id:int, text:chararray);
> > grunt> dump documents;
> > (21,foo bar frog)
> > (22,hello frog)
> >
> > Is it possible to do a join where the query:token is not equal to but
> > contained in documents:text ?
> >
> > eg
> > (11,foo,21,foo bar frog)
> > (12,bar,21,foo bar frog)
> > (13,frog,21,foo bar frog)
> > (13,frog,22,hello frog)
> >
> > I can certainly do this in Java map/reduce (as we all had to in the
> > dark days before pig) but is there a way to hack this together
> > with a custom udf or some other weird join backdoor (custom
> > partitioner for a group or something wacky) ???
> >
> > It's been a long day, maybe I'm just missing something super obvious...
> >
> > Cheers!
> > Mat
> >
>
