For the sake of discussion I simplified things, but perhaps in a critical way...

Query actually has 3 token fields and Document has 2 text fields, and I really require token1 to be in text1, token2 to also be in text1, and token3 to be in text2. (Damn bizarre NLP.) These additional complexities might change things...

On Aug 29, 2012 4:55 PM, "Mat Kelcey" <[email protected]> wrote:
> Hello!
>
> Considering the following two relations...
>
> grunt> querys = load 'query' as (id:int, token:chararray);
> grunt> dump querys;
> (11,foo)
> (12,bar)
> (13,frog)
>
> and
>
> grunt> documents = load 'document' as (id:int, text:chararray);
> grunt> dump documents;
> (21,foo bar frog)
> (22,hello frog)
>
> Is it possible to do a join where the query:token is not equal to but
> contained in documents:text ?
>
> eg
> (11,foo,21,foo bar frog)
> (12,bar,21,foo bar frog)
> (13,frog,21,foo bar frog)
> (13,frog,22,hello frog)
>
> I can certainly do this in Java map/reduce (as we all had to in the
> dark days before pig) but is there a way to hack this together
> with a custom udf or some other weird join backdoor (custom
> partitioner for a group or something whacky) ???
>
> It's been a long day, maybe I'm just missing something super obvious..
>
> Cheers!
> Mat
>
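
For the simplified single-token case in the quoted question, one possible approach (a sketch only, not something confirmed in this thread) is to explode each document into words and then do an ordinary equi-join. It assumes "contained in" means a whole-word match after whitespace tokenization, not an arbitrary substring match, and the aliases doc_words, joined and result are made up for illustration.

```pig
-- Same relations as in the question.
querys    = LOAD 'query'    AS (id:int, token:chararray);
documents = LOAD 'document' AS (id:int, text:chararray);

-- One row per (document, word). TOKENIZE splits the text into a bag of words
-- and FLATTEN turns that bag into separate rows.
doc_words = FOREACH documents GENERATE id AS doc_id, text, FLATTEN(TOKENIZE(text)) AS word;

-- Assumption: a word repeated within one document should match only once.
doc_words = DISTINCT doc_words;

-- Plain equi-join on the exploded word.
joined = JOIN querys BY token, doc_words BY word;
result = FOREACH joined GENERATE querys::id, querys::token, doc_words::doc_id, doc_words::text;
DUMP result;
```

On the sample data this should give the four rows listed in the question, in some order. The three-token variant could probably be built the same way, by exploding text1 and text2 separately and chaining joins keyed on (doc_id, word), but that is untested here.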
