Ah! I think I was close to figuring that out last night…

It seems that for the most part Pig is 'functional' right now but
optimization takes a lot of work.

Right now the major tasks I can see are a more efficient binary encoding as
well as making merge joins 'just work'…

I have a functional version of my code in Pig right now; it's just not as
fast as I would like. :-)

Just my $0.02
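
For anyone following along, here is a minimal Python sketch of what a merge
join does, and why the docs quoted below insist that both sides be sorted in
ascending order on the join key. This is not Pig's actual implementation; the
names merge_join, Row, rows, by_source and by_target are made up for
illustration, and the rows are the ones from test2.csv further down.

```python
from typing import Iterator, List, Tuple

Row = Tuple[int, int]  # (source, target), same layout as test2.csv


def merge_join(left: List[Row], right: List[Row]) -> Iterator[Tuple[Row, Row]]:
    """Inner join on left.source == right.target.

    Assumes left is sorted ascending by source and right is sorted
    ascending by target. Nothing here checks that assumption.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][1]
        if lkey < rkey:
            i += 1          # left key too small, advance left
        elif lkey > rkey:
            j += 1          # right key too small, advance right
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][1] == lkey:
                j_end += 1
            # Emit the cross product of the two runs.
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    yield l, r
            i, j = i_end, j_end


rows: List[Row] = [
    (1, 1), (1, 2), (1, 3), (1, 4), (1, 1000000000),
    (0, 1), (0, 2), (0, 3), (0, 4), (0, 1000000000),
]
by_source = sorted(rows, key=lambda r: r[0])
by_target = sorted(rows, key=lambda r: r[1])

good = list(merge_join(by_source, by_target))  # both sides sorted
bad = list(merge_join(by_source, rows))        # right side NOT sorted by target
print(len(good), len(bad))
```

With both sides sorted the sketch finds 10 matching pairs. With the right
side left unsorted it finds only 5 and raises no error, which is why a merge
join over input that is not really sorted on the join key can silently return
wrong results.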

On Sun, Aug 21, 2011 at 9:51 AM, Ashutosh Chauhan <[email protected]> wrote:

> Try the following:
>
> data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int, target:int);
>
> by_source = ORDER data BY source;
> by_target = FOREACH (ORDER data BY target) GENERATE target, source;
>
> STORE by_source INTO 'tmp/by_source' USING PigStorage();
> STORE by_target INTO 'tmp/by_target' USING PigStorage();
>
> -- Add this magical keyword here.
> exec;
>
> by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int,
> target:int);
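> -- by_target was stored as (target, source), so declare the columns in
> -- that order; otherwise the join key is not the sorted column.
> by_target = LOAD 'tmp/by_target' USING PigStorage() AS (target:int,
> source:int);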
>
> joined = JOIN by_source BY source, by_target BY target USING 'merge';
>
> STORE joined INTO 'tmp/joined';
>
>
>
> Pig looks across the entire script, so it traces the lineage of the data
> through the 'store-load' pair and treats everything before it as a
> predecessor of the merge join. Currently only a few operators are allowed
> as predecessors of a merge join, so the query fails to compile. Adding
> 'exec' forces Pig to compile and execute the script up to that point
> before continuing. When it then compiles the rest, the only predecessors
> it sees for the merge join are the loads.
>
> Hope it helps,
> Ashutosh
>
> On Sat, Aug 20, 2011 at 14:03, Kevin Burton <[email protected]> wrote:
>
> > OK… I still can't get this to work.
> >
> > I've read the documentation and I still get the same error on 0.9.0.
> >
> > Here's my code. I think the error implies that the predecessor needs to
> > be a LOAD and that I have to meet the following conditions:
> >
> >
> > > Inner merge join (between two tables) will only work under these
> > > conditions:
> > >
> > >    - Between the load of the sorted input and the merge join statement
> > >      there can only be filter statements and foreach statement where the
> > >      foreach statement should meet the following conditions:
> > >
> > >       - There should be no UDFs in the foreach statement.
> > >       - The foreach statement should not change the position of the
> > >         join keys.
> > >       - There should be no transformation on the join keys which will
> > >         change the sort order.
> > >
> > >    - Data must be sorted on join keys in ascending (ASC) order on both
> > >      sides.
> > >
> > >    - Right-side loader must implement either the {OrderedLoadFunc}
> > >      interface or {IndexableLoadFunc} interface.
> > >
> > >    - Type information must be provided for the join key in the schema.
> > >
> > > The Zebra and PigStorage loaders satisfy all of these conditions.
> >
> >
> > … which I believe I am, but it's still not working.
> >
> > Here's the data:
> >
> >
> > 1,1
> > 1,2
> > 1,3
> > 1,4
> > 1,1000000000
> > 0,1
> > 0,2
> > 0,3
> > 0,4
> > 0,1000000000
> >
> >
> > … and the script.
> >
> > data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int, target:int);
> >
> > by_source = ORDER data BY source;
> > by_target = FOREACH (ORDER data BY target) GENERATE target, source;
> >
> > STORE by_source INTO 'tmp/by_source' USING PigStorage();
> > STORE by_target INTO 'tmp/by_target' USING PigStorage();
> >
> > by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int,
> > target:int);
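> > -- by_target was stored as (target, source), so declare the columns in
> > -- that order; otherwise the join key is not the sorted column.
> > by_target = LOAD 'tmp/by_target' USING PigStorage() AS (target:int,
> > source:int);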
> >
> > joined = JOIN by_source BY source, by_target BY target USING 'merge';
> >
> > STORE joined INTO 'tmp/joined';
> >
> >
> > --
> >
> > Founder/CEO Spinn3r.com
> >
> > Location: *San Francisco, CA*
> > Skype: *burtonator*
> >
> > Skype-in: *(415) 871-0687*
> >
>



-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*
