And I just realised this last statement makes no sense in the context of my original contrived example (I originally asked about a join, not a filter). Don't mind me! :)
On 29 August 2012 17:29, Mat Kelcey <[email protected]> wrote:
> Actually, given the nature of my Query data I might just pack a few bloom
> filters and stream Document through a udf. I've got plenty of data and can
> guard against mistakes downstream.
> It's wonderful what leaving the office and getting on the bus does for your
> thought process....
> Mat
>
> On Aug 29, 2012 5:14 PM, "Mat Kelcey" <[email protected]> wrote:
>>
>> Unfortunately neither side is small enough to support either a cross or a
>> replicated in-memory join approach.
>>
>> But opt3 does make sense; I think I'm overthinking things. I can utilise
>> a udf to do the equivalent of tokenisation and, like you say, just do a
>> join.
>>
>> In terms of the multiple joins, I can just do all three, count the
>> matches, and only keep the cases where all three match.
>>
>> Thanks!
>> Mat
>>
>> On Aug 29, 2012 5:06 PM, "Jonathan Coveney" <[email protected]> wrote:
>>>
>>> You're not missing anything obvious... what you're trying to do, on face
>>> value, is not an easy thing to do. In M/R, joining is done by
>>> partitioning matching keys to the same reducer... how can you do that if
>>> you have a case like
>>>
>>> foo
>>> bar
>>>
>>> foo bar
>>>
>>> where foo is sent to reducer 1 and bar to reducer 2? There's no way to
>>> know where keys should be sent.
>>>
>>> That said, there are options.
>>>
>>> Option 1: a cross. Undesirable because of data explosion.
>>> Option 2: if one of the data sets is small enough to fit in memory, you
>>> can make a UDF that brings it in and does the join for you. This is
>>> essentially option 1.
>>> Option 3: less generically, exploit the join you're actually doing. In
>>> the dummy example, it looks like you're checking whether a token is
>>> contained in another string. You could convert this into a join by
>>> tokenizing, flattening, doing the join, etc. I don't know how close your
>>> real use case is to what you posted.
>>>
>>> Jon
>>>
>>>
>>> 2012/8/29 Mat Kelcey <[email protected]>
>>>
>>> > Hello!
>>> >
>>> > Consider the following two relations...
>>> >
>>> > grunt> querys = load 'query' as (id:int, token:chararray);
>>> > grunt> dump querys;
>>> > (11,foo)
>>> > (12,bar)
>>> > (13,frog)
>>> >
>>> > and
>>> >
>>> > grunt> documents = load 'document' as (id:int, text:chararray);
>>> > grunt> dump documents;
>>> > (21,foo bar frog)
>>> > (22,hello frog)
>>> >
>>> > Is it possible to do a join where querys:token is not equal to, but
>>> > contained in, documents:text?
>>> >
>>> > e.g.
>>> > (11,foo,21,foo bar frog)
>>> > (12,bar,21,foo bar frog)
>>> > (13,frog,21,foo bar frog)
>>> > (13,frog,22,hello frog)
>>> >
>>> > I can certainly do this in Java map/reduce (as we all had to in the
>>> > dark days before pig), but is there a way to hack this together
>>> > with a custom udf or some other weird join backdoor (a custom
>>> > partitioner for a group, or something whacky)?
>>> >
>>> > It's been a long day; maybe I'm just missing something super obvious...
>>> >
>>> > Cheers!
>>> > Mat
>>> >
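For reference, Jon's option 3 against the toy relations above can be sketched roughly as below, assuming Pig's built-in TOKENIZE (which splits on whitespace, so it suits the space-delimited text field here; a real corpus would likely need a custom tokenising udf, as discussed upthread). The doc_tokens and joined aliases are illustrative names, not from the original thread:

```
-- explode each document into one row per token, keeping the original text
grunt> doc_tokens = foreach documents generate id as doc_id, text,
                    flatten(TOKENIZE(text)) as token;
-- now a plain equi-join on token does the "contained in" check
grunt> joined = join querys by token, doc_tokens by token;
grunt> result = foreach joined generate querys::id, querys::token,
                doc_tokens::doc_id, doc_tokens::text;
```

Dumping result should yield the four (query, document) pairs listed in the original mail, modulo ordering.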
