And I just realised this last statement makes no sense in the context
of my original contrived example (I originally asked about a join, not
a filter).
Don't mind me! :)

On 29 August 2012 17:29, Mat Kelcey <[email protected]> wrote:
> Actually, given the nature of my Query data I might just pack a few bloom
> filters and stream Document through a udf, I've got plenty of data and can
> guard against mistakes downstream.
> It's wonderful what leaving the office and getting on the bus does for your
> thought process....
> Mat
>
> On Aug 29, 2012 5:14 PM, "Mat Kelcey" <[email protected]> wrote:
>>
>> Unfortunately neither side is small enough to support either a cross or an
>> in-memory replicated join.
>>
>> But opt3 does make sense, I think I'm overthinking things. I can utilise
>> a udf to do the equivalent of tokenisation and do, like you say, just a
>> join.
>>
>> In terms of the multiple joins I can just do all three, count the matches,
>> and only allow the cases of all three matching.
>>
>> Thanks!
>> Mat
>>
>> On Aug 29, 2012 5:06 PM, "Jonathan Coveney" <[email protected]> wrote:
>>>
>>> You're not missing anything obvious... what you're trying to do, on face
>>> value, is not an easy thing to do. In M/R, joining is done based on
>>> partitioning to the same reducer...how can you do that if you have a case
>>>
>>> foo
>>> bar
>>>
>>> foo bar
>>>
>>> and foo is sent to reducer 1, bar to reducer 2? There's no way to know
>>> where keys should be sent.
>>>
>>> That said, there are options.
>>>
>>> Option 1: a cross. Undesirable because of data explosion.
>>> Option 2: If one of the data sets is small enough to fit in memory, you
>>> can make a UDF that brings it in, and does the join for you. This is
>>> essentially option 1.
>>> Option 3: Less generically, exploit the join you're actually doing. In
>>> the dummy example, it looks like you're checking if a token is contained
>>> in another string. You could convert this into a join by tokenizing,
>>> flattening, doing the join, etc. I don't know how close your real use
>>> case is to what you posted.
>>>
>>> Jon
>>>
>>>
>>> 2012/8/29 Mat Kelcey <[email protected]>
>>>
>>> > Hello!
>>> >
>>> > Considering the following two relations...
>>> >
>>> > grunt> querys = load 'query' as (id:int, token:chararray);
>>> > grunt> dump querys;
>>> > (11,foo)
>>> > (12,bar)
>>> > (13,frog)
>>> >
>>> > and
>>> >
>>> > grunt> documents = load 'document' as (id:int, text:chararray);
>>> > grunt> dump documents;
>>> > (21,foo bar frog)
>>> > (22,hello frog)
>>> >
>>> > Is it possible to do a join where the query:token is not equal to but
>>> > contained in documents:text ?
>>> >
>>> > eg
>>> > (11,foo,21,foo bar frog)
>>> > (12,bar,21,foo bar frog)
>>> > (13,frog,21,foo bar frog)
>>> > (13,frog,22,hello frog)
>>> >
>>> > I can certainly do this in Java map/reduce (as we all had to in the
>>> > dark days before pig) but is there a way to hack this together
>>> > with a custom udf or some other weird join backdoor (custom
>>> > partitioner for a group or something whacky) ???
>>> >
>>> > It's been a long day, maybe I'm just missing something super obvious..
>>> >
>>> > Cheers!
>>> > Mat
>>> >
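
Jon's Option 3 (tokenise, flatten, join) can be sketched against the two
example relations from the original question. This is only a rough sketch,
not something posted in the thread: it assumes tab-delimited input files,
that Pig's built-in TOKENIZE splits the text the way you need (whitespace
is enough for the sample data), and the aliases doc_tokens, matches and
result are made up for illustration.

```pig
-- Load the two relations exactly as in the original question.
querys    = load 'query'    as (id:int, token:chararray);
documents = load 'document' as (id:int, text:chararray);

-- Split each document into tokens and flatten the resulting bag,
-- giving one row per (document, token) pair.
doc_tokens = foreach documents generate
                 id as doc_id,
                 text,
                 flatten(TOKENIZE(text)) as token;

-- "Contained in" is now plain equality, so an ordinary join works.
matches = join querys by token, doc_tokens by token;

-- Keep the columns in the order used in the question's expected output.
result = foreach matches generate
             querys::id, querys::token,
             doc_tokens::doc_id, doc_tokens::text;

dump result;
```

On the sample data this should give the four rows listed in the original
question, in no guaranteed order. If a token can occur more than once in a
document, running distinct over result would drop the repeated rows.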
