Ah! I think I was close to figuring that out last night… It seems that for the most part Pig is 'functional' right now but optimization takes a lot of work.
Right now the major tasks I can see are a more efficient binary encoding as well as making merge joins 'just work'… I have a functional version of my code in Pig right now, it's just not as fast as I would like. :-)

Just my $0.02

On Sun, Aug 21, 2011 at 9:51 AM, Ashutosh Chauhan <[email protected]> wrote:

> Try the following:
>
> data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int, target:int);
>
> by_source = ORDER data BY source;
> by_target = FOREACH (ORDER data BY target) GENERATE target, source;
>
> STORE by_source INTO 'tmp/by_source' USING PigStorage();
> STORE by_target INTO 'tmp/by_target' USING PigStorage();
>
> -- Add this magical keyword here.
> exec;
>
> by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int, target:int);
> by_target = LOAD 'tmp/by_target' USING PigStorage() AS (source:int, target:int);
>
> joined = JOIN by_source BY source, by_target BY target USING 'merge';
>
> STORE joined INTO 'tmp/joined' ;
>
> Since pig looks across the entire script it finds the lineage of data across
> 'load-store' and finds all the predecessors for Merge-Join and since
> currently only few of those operators can be predecessors, query fails to
> compile. By introducing 'exec', compiler is forced to stop and compile and
> execute the script till there and then picks up execution after the current
> one is finished, and then it only sees loads as the predecessors for merge
> join.
>
> Hope it helps,
> Ashutosh
>
> On Sat, Aug 20, 2011 at 14:03, Kevin Burton <[email protected]> wrote:
>
> > OK….. I still can't get this to work.
> >
> > I've read the documentation and i still get the same error on 0.9.0 …
> >
> > Here's my code.
> > I think it's implying that I need to have the predecessor as
> > a LOAD and meet the following conditions:
> >
> > > Inner merge join (between two tables) will only work under these
> > > conditions:
> > >
> > > - Between the load of the sorted input and the merge join statement
> > >   there can only be filter statements and foreach statement where the
> > >   foreach statement should meet the following conditions:
> > >     - There should be no UDFs in the foreach statement.
> > >     - The foreach statement should not change the position of the join
> > >       keys.
> > >     - There should be no transformation on the join keys which will
> > >       change the sort order.
> > > - Data must be sorted on join keys in ascending (ASC) order on both
> > >   sides.
> > > - Right-side loader must implement either the {OrderedLoadFunc}
> > >   interface or {IndexableLoadFunc} interface.
> > > - Type information must be provided for the join key in the schema.
> > >
> > > The Zebra and PigStorage loaders satisfy all of these conditions.
> >
> > …… which I believe I AM….. but it's still not working.
> >
> > Here's the data:
> >
> > 1,1
> > 1,2
> > 1,3
> > 1,4
> > 1,1000000000
> > 0,1
> > 0,2
> > 0,3
> > 0,4
> > 0,1000000000
> >
> > … and the script.
> >
> > data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int, target:int);
> >
> > by_source = ORDER data BY source;
> > by_target = FOREACH (ORDER data BY target) GENERATE target, source;
> >
> > STORE by_source INTO 'tmp/by_source' USING PigStorage();
> > STORE by_target INTO 'tmp/by_target' USING PigStorage();
> >
> > by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int, target:int);
> > by_target = LOAD 'tmp/by_target' USING PigStorage() AS (source:int, target:int);
> >
> > joined = JOIN by_source BY source, by_target BY target USING 'merge';
> >
> > STORE joined INTO 'tmp/joined' ;
> >
> > --
> > Founder/CEO Spinn3r.com
> > Location: *San Francisco, CA*
> > Skype: *burtonator*
> > Skype-in: *(415) 871-0687*

--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
Skype-in: *(415) 871-0687*
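
For anyone reading this thread later: the "sorted ascending on both sides" condition quoted above is easier to see with a picture of how a sort-merge join walks its two inputs. Below is a rough Python sketch of the general technique, not Pig's actual implementation. The function name `merge_join` and its parameters are made up for illustration, and the rows are the ten-line test data from this thread.

```python
def merge_join(left, right, left_key, right_key):
    """Inner join of two lists of tuples.

    Assumes both inputs are already sorted ascending on their key column;
    this is an illustrative sketch, not how Pig implements USING 'merge'.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1          # left key is behind, advance the left cursor
        elif lk > rk:
            j += 1          # right key is behind, advance the right cursor
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][left_key] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for l_row in left[i:i_end]:
                for r_row in right[j:j_end]:
                    out.append(l_row + r_row)
            i, j = i_end, j_end
    return out


# (source, target) rows from the test file in this thread.
data = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 1000000000),
        (0, 1), (0, 2), (0, 3), (0, 4), (0, 1000000000)]

# Left side: sorted on source (column 0).
by_source = sorted(data, key=lambda row: row[0])
# Right side: columns swapped to (target, source), sorted on target (column 0).
by_target = sorted(((t, s) for s, t in data), key=lambda row: row[0])

joined = merge_join(by_source, by_target, 0, 0)
```

Each cursor only ever moves forward, so a key that appears out of order on either side can be skipped without being matched. That is why the join refuses inputs whose sort order it cannot trust.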
