Re: Merge join

Ankur Jain Wed, 20 Jul 2011 14:16:59 -0700

Yeah, I need a (full) outer join, which is the case that has this constraint on the loader.

Thanks.



On Wed, Jul 20, 2011 at 1:15 PM, Ashutosh Chauhan <[email protected]> wrote:

> It depends on whether you want to do inner or outer (also called
> co-group) merge join. If you are doing inner merge join on two tables,
> PigStorage satisfies all the criteria and can be used.  If you want to
> do outer merge join (or inner merge join on more than two tables),
> then you need CollectableLoadFunc, which PigStorage doesn't implement
> and only Zebra's TableLoader does.
>
> Hope it helps,
> Ashutosh
> On Wed, Jul 20, 2011 at 12:54, Tomas Svarovsky
> <[email protected]> wrote:
> > Not sure if this would be helpful, but the docs say that the default
> > PigStorage does implement that. I guess that your data already needs
> > to be sorted if you do not want to go through the reduce phase
> > during the join.
> >
> > T
> >
> > On Wed, Jul 20, 2011 at 12:13 PM, Ankur Jain <[email protected]> wrote:
> >> Thanks Ashutosh! Right, I too realized that yesterday. So, is there any
> >> other loader that implements the CollectableLoadFunc interface required
> >> by the merge join?
> >>
> >>
> >> Thanks,
> >> Ankur
> >>
> >>
> >> On Wed, Jul 20, 2011 at 10:22 AM, Ashutosh Chauhan <[email protected]> wrote:
> >>
> >>> Hey Ankur,
> >>>
> >>> Zebra's TableLoader works with the data written out using Zebra's
> >>> TableStorer. So, you need to write the data first using Zebra and then
> >>> subsequently load using TableLoader and do merge-join.
> >>>
> >>> Ashutosh
> >>> On Tue, Jul 19, 2011 at 14:28, Ankur Jain <[email protected]> wrote:
> >>> > Hi all,
> >>> >
> >>> > I'm trying to do a map-side-only merge join [1] in Pig using Zebra's
> >>> > TableLoader. (My data allows merge join.) But I am unable to use the
> >>> > TableLoader. Even a simple script that loads a table and just stores it
> >>> > back doesn't work -
> >>> >
> >>> >  ----
> >>> >  A = load 'my_input' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
> >>> >  store A into 'my_output';
> >>> >  ----
> >>> >
> >>> >
> >>> >  'my_input' is an input directory containing a single file with just 1 column -
> >>> >  ---
> >>> >  1
> >>> >  2
> >>> >  3
> >>> >  ---
> >>> >
> >>> >  The error I get is -
> >>> >
> >>> >  "ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal
> >>> > error. Failed to find deleted column groups
> >>> > java.io.IOException: BT Schema file doesn't exist: file:/......./my_input/.btschema"
> >>> >
> >>> >
> >>> >  I have tried specifying the schema using the 'AS' clause and the DESCRIBE
> >>> > statement as well, but it gives me the same error. Is the .btschema file
> >>> > required? Is there any documentation available on its format? (I tried
> >>> > comma-separated column names with/without type info.)
> >>> >
> >>> >
> >>> > I am also willing to work with any other loader that satisfies the merge
> >>> > join constraints. Thanks in anticipation.
> >>> >
> >>> >
> >>> >  Regards,
> >>> >  Ankur
> >>> >
> >>> >
> >>> >  [1] http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins
> >>> >
> >>>
> >>
> >
>
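
For readers following the thread, here is a rough sketch of why a merge join
needs both inputs sorted on the join key, and what the full outer case adds.
It is plain Python over in-memory lists, not Pig's or Zebra's implementation;
the function names and the sample rows are made up for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def _groups(rows):
    """Yield (key, [values]) for rows that are already sorted by key."""
    for key, grp in groupby(rows, key=itemgetter(0)):
        yield key, [value for _, value in grp]


def merge_full_outer_join(left, right):
    """Full outer join of two key-sorted lists of (key, value) pairs.

    Both inputs are walked once, in order, so no shuffle or re-sort is
    needed. This only works because the inputs are sorted on the key.
    """
    lit, rit = _groups(left), _groups(right)
    l, r = next(lit, None), next(rit, None)
    out = []
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            # Key exists only on the left: pad the right side with None.
            out.extend((l[0], v, None) for v in l[1])
            l = next(lit, None)
        elif l is None or r[0] < l[0]:
            # Key exists only on the right: pad the left side with None.
            out.extend((r[0], None, w) for w in r[1])
            r = next(rit, None)
        else:
            # Same key on both sides: emit every pairing of the two groups.
            out.extend((l[0], v, w) for v in l[1] for w in r[1])
            l, r = next(lit, None), next(rit, None)
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
result = merge_full_outer_join(left, right)
```

Note that the outer case has to see all rows for a key together on each side
before it can decide whether to pad with nulls, which is roughly the kind of
guarantee the stricter loader requirement discussed above is about.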
