I'm willing to do both, especially if I get this into production.  Right now
I'm trying to wrap my head around the code and make some progress to see
what needs work and what needs fixing.

… so definitely thanks for all the feedback.

:)

Willing to step in and do the work myself but I need to grok the internals
of Pig more.

I think the general lay of the land right now is that Pig is functional and
has a lot of potential, but it's not yet optimized for massive performance.

Once I figure out what we need to get working to get this into production
I'll be able to help improve things.

Part of the reason for sending some of these emails is to make sure I'm not
insane and that these things are actually problems.

This is replacing a hand-written system that is VERY tuned and fast but
uses a custom map reduce framework which isn't designed for the general case,
which is something we want to avoid moving forward.

I'm not totally in love with Hadoop but for ETL jobs it's fine (which is
what we're using it for).

Kevin

On Sun, Aug 21, 2011 at 9:03 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Kevin,
> Merge join in particular hasn't found a customer willing to do the work to
> make it great. Parts that get more use get more polish. Opening tickets with
> concrete suggestions for improvement helps. Submitting patches helps even
> more :).
>
> D
>
> On Sun, Aug 21, 2011 at 2:00 PM, Kevin Burton <[email protected]> wrote:
>
> > Ah!  I think I was close to figuring that out last night…
> >
> > It seems that for the most part Pig is 'functional' right now but
> > optimization takes a lot of work.
> >
> > Right now the major tasks I can see are a more efficient binary encoding
> > as well as making merge joins 'just work'…
> >
> > I have a functional version of my code in Pig right now, it's just not as
> > fast as I would like. :-)
> >
> > Just my $0.02
> >
> > On Sun, Aug 21, 2011 at 9:51 AM, Ashutosh Chauhan <[email protected]> wrote:
> >
> > > Try the following:
> > >
> > > data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int,
> > > target:int);
> > >
> > > by_source = ORDER data BY source;
> > > by_target = FOREACH (ORDER data BY target) GENERATE target, source;
> > >
> > > STORE by_source INTO 'tmp/by_source' USING PigStorage();
> > > STORE by_target INTO 'tmp/by_target' USING PigStorage();
> > >
> > > -- Add this magical keyword here.
> > > exec;
> > >
> > > by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int,
> > > target:int);
> > > by_target = LOAD 'tmp/by_target' USING PigStorage() AS (source:int,
> > > target:int);
> > >
> > > joined = JOIN by_source BY source, by_target BY target USING 'merge';
> > >
> > > STORE joined           INTO 'tmp/joined' ;
> > >
> > >
> > >
> > > Since Pig looks across the entire script, it traces the lineage of the data
> > > across the 'load-store' boundary and finds all the predecessors of the merge
> > > join. Currently only a few operators can be predecessors of a merge join, so
> > > the query fails to compile. Introducing 'exec' forces the compiler to stop,
> > > compile, and execute the script up to that point. Execution of the rest of
> > > the script picks up after that run has finished, and at that point the
> > > compiler sees only loads as the predecessors of the merge join.
> > >
> > > Hope it helps,
> > > Ashutosh
> > >
> > > On Sat, Aug 20, 2011 at 14:03, Kevin Burton <[email protected]> wrote:
> > >
> > > > OK….. I still can't get this to work.
> > > >
> > > > I've read the documentation and I still get the same error on 0.9.0 …
> > > >
> > > > Here's my code. I think it's implying that I need to have the predecessor
> > > > as a LOAD and meet the following conditions:
> > > >
> > > >
> > > > Inner merge join (between two tables) will only work under these conditions:
> > > > >
> > > > >    - Between the load of the sorted input and the merge join statement
> > > > >    there can only be filter statements and foreach statement where the
> > > > >    foreach statement should meet the following conditions:
> > > > >       - There should be no UDFs in the foreach statement.
> > > > >       - The foreach statement should not change the position of the
> > > > >       join keys.
> > > > >       - There should be no transformation on the join keys which will
> > > > >       change the sort order.
> > > > >    - Data must be sorted on join keys in ascending (ASC) order on both
> > > > >    sides.
> > > > >    - Right-side loader must implement either the {OrderedLoadFunc}
> > > > >    interface or {IndexableLoadFunc} interface.
> > > > >    - Type information must be provided for the join key in the schema.
> > > > >
> > > > > The Zebra and PigStorage loaders satisfy all of these conditions.
> > > >
> > > >
> > > > …… which I believe I AM….. but it's still not working.
> > > >
> > > > Here's the data:
> > > >
> > > >
> > > > 1,1
> > > > 1,2
> > > > 1,3
> > > > 1,4
> > > > 1,1000000000
> > > > 0,1
> > > > 0,2
> > > > 0,3
> > > > 0,4
> > > > 0,1000000000
> > > >
> > > >
> > > > … and the script.
> > > >
> > > > data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int,
> > > > target:int);
> > > >
> > > > by_source = ORDER data BY source;
> > > > by_target = FOREACH (ORDER data BY target) GENERATE target, source;
> > > >
> > > > STORE by_source INTO 'tmp/by_source' USING PigStorage();
> > > > STORE by_target INTO 'tmp/by_target' USING PigStorage();
> > > >
> > > > by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int,
> > > > target:int);
> > > > by_target = LOAD 'tmp/by_target' USING PigStorage() AS (source:int,
> > > > target:int);
> > > >
> > > > joined = JOIN by_source BY source, by_target BY target USING 'merge';
> > > >
> > > > STORE joined           INTO 'tmp/joined' ;
> > > >
> > > >
> > > > --
> > > >
> > > > Founder/CEO Spinn3r.com
> > > >
> > > > Location: *San Francisco, CA*
> > > > Skype: *burtonator*
> > > >
> > > > Skype-in: *(415) 871-0687*
> > > >
> > >
> >
>



