Kevin,

Merge join in particular hasn't found a customer willing to do the work to make it great. Parts that get more use get more polish. Opening tickets with concrete suggestions for improvement helps. Submitting patches helps even more :).
D

On Sun, Aug 21, 2011 at 2:00 PM, Kevin Burton <[email protected]> wrote:

> Ah! I think I was close to figuring that out last night…
>
> It seems that for the most part Pig is 'functional' right now but
> optimization takes a lot of work.
>
> Right now the major tasks I can see are a more efficient binary encoding as
> well as making merge joins 'just work'…
>
> I have a functional version of my code in Pig right now, it's just not as
> fast as I would like. :-)
>
> Just my $0.02
>
> On Sun, Aug 21, 2011 at 9:51 AM, Ashutosh Chauhan <[email protected]> wrote:
>
> > Try the following:
> >
> > data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int, target:int);
> >
> > by_source = ORDER data BY source;
> > by_target = FOREACH (ORDER data BY target) GENERATE target, source;
> >
> > STORE by_source INTO 'tmp/by_source' USING PigStorage();
> > STORE by_target INTO 'tmp/by_target' USING PigStorage();
> >
> > -- Add this magical keyword here.
> > exec;
> >
> > by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int, target:int);
> > by_target = LOAD 'tmp/by_target' USING PigStorage() AS (source:int, target:int);
> >
> > joined = JOIN by_source BY source, by_target BY target USING 'merge';
> >
> > STORE joined INTO 'tmp/joined' ;
> >
> > Since pig looks across the entire script it finds the lineage of data across
> > 'load-store' and finds all the predecessors for Merge-Join and since
> > currently only few of those operators can be predecessors, query fails to
> > compile. By introducing 'exec', compiler is forced to stop and compile and
> > execute the script till there and then picks up execution after the current
> > one is finished, and then it only sees loads as the predecessors for merge
> > join.
> >
> > Hope it helps,
> > Ashutosh
> >
> > On Sat, Aug 20, 2011 at 14:03, Kevin Burton <[email protected]> wrote:
> >
> > > OK….. I still can't get this to work.
> > >
> > > I've read the documentation and i still get the same error on 0.9.0 …
> > >
> > > Here's my code. I think it's implying that I need to have the predecessor as
> > > a LOAD and meet the following conditions:
> > >
> > > Inner merge join (between two tables) will only work under these
> > > conditions:
> > >
> > > > - Between the load of the sorted input and the merge join statement
> > > >   there can only be filter statements and foreach statement where the
> > > >   foreach statement should meet the following conditions:
> > > >
> > > > - There should be no UDFs in the foreach statement.
> > > >
> > > > - The foreach statement should not change the position of the join
> > > >   keys.
> > > >
> > > > - There should be no transformation on the join keys which will change
> > > >   the sort order.
> > > >
> > > > - Data must be sorted on join keys in ascending (ASC) order on both
> > > >   sides.
> > > >
> > > > - Right-side loader must implement either the {OrderedLoadFunc}
> > > >   interface or {IndexableLoadFunc} interface.
> > > >
> > > > - Type information must be provided for the join key in the schema.
> > > >
> > > > The Zebra and PigStorage loaders satisfy all of these conditions.
> > >
> > > …… which I believe I AM….. but it's still not working.
> > >
> > > Here's the data:
> > >
> > > 1,1
> > > 1,2
> > > 1,3
> > > 1,4
> > > 1,1000000000
> > > 0,1
> > > 0,2
> > > 0,3
> > > 0,4
> > > 0,1000000000
> > >
> > > … and the script.
> > >
> > > data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int, target:int);
> > >
> > > by_source = ORDER data BY source;
> > > by_target = FOREACH (ORDER data BY target) GENERATE target, source;
> > >
> > > STORE by_source INTO 'tmp/by_source' USING PigStorage();
> > > STORE by_target INTO 'tmp/by_target' USING PigStorage();
> > >
> > > by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int, target:int);
> > > by_target = LOAD 'tmp/by_target' USING PigStorage() AS (source:int, target:int);
> > >
> > > joined = JOIN by_source BY source, by_target BY target USING 'merge';
> > >
> > > STORE joined INTO 'tmp/joined' ;
> > >
> > > --
> > >
> > > Founder/CEO Spinn3r.com
> > >
> > > Location: *San Francisco, CA*
> > > Skype: *burtonator*
> > >
> > > Skype-in: *(415) 871-0687*
> > >
> >
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: *San Francisco, CA*
> Skype: *burtonator*
>
> Skype-in: *(415) 871-0687*
>
