Thanks Ashutosh. Let me reconsider the options available to me.

-Ankur

On Wed, Jul 20, 2011 at 2:21 PM, Ashutosh Chauhan <[email protected]> wrote:

> If you control the generation of data which needs to be joined, then
> you can store it with Zebra and then do the joins. If not, then you
> either need to rewrite the data using Zebra or need to implement
> another loader which implements CollectableLoadFunc.
>
> Ashutosh
> On Wed, Jul 20, 2011 at 14:16, Ankur Jain <[email protected]> wrote:
> > Yeah, I need (full) outer join, which has this constraint on the loader.
> >
> > Thanks.
> >
> >
> > On Wed, Jul 20, 2011 at 1:15 PM, Ashutosh Chauhan <[email protected]> wrote:
> >
> >> It depends on whether you want to do an inner or an outer (also called
> >> co-group) merge join. For an inner merge join on two tables,
> >> PigStorage satisfies all the criteria and can be used. For an outer
> >> merge join (or an inner merge join on more than two tables), you need
> >> a loader that implements CollectableLoadFunc, which PigStorage doesn't
> >> and only Zebra's TableLoader does.
> >>
> >> Hope it helps,
> >> Ashutosh
> >> On Wed, Jul 20, 2011 at 12:54, Tomas Svarovsky
> >> <[email protected]> wrote:
> >> > Not sure if this would be helpful, but the docs say that the default
> >> > PigStorage does implement that. I guess that your data needs to be
> >> > already sorted if you do not want to go through the reduce phase
> >> > during the join.
> >> >
> >> > T
> >> >
> >> > On Wed, Jul 20, 2011 at 12:13 PM, Ankur Jain <[email protected]> wrote:
> >> >> Thanks Ashutosh! Right, I realized that too yesterday. So, is there any
> >> >> other loader that implements the CollectableLoadFunc interface required
> >> >> by the merge join?
> >> >>
> >> >>
> >> >> Thanks,
> >> >> Ankur
> >> >>
> >> >>
> >> >> On Wed, Jul 20, 2011 at 10:22 AM, Ashutosh Chauhan <[email protected]> wrote:
> >> >>
> >> >>> Hey Ankur,
> >> >>>
> >> >>> Zebra's TableLoader works with data written out using Zebra's
> >> >>> TableStorer. So, you need to first write the data using Zebra, then
> >> >>> load it using TableLoader and do the merge join.
> >> >>>
> >> >>> Ashutosh
> >> >>> On Tue, Jul 19, 2011 at 14:28, Ankur Jain <[email protected]> wrote:
> >> >>> > Hi all,
> >> >>> >
> >> >>> > I'm trying to do a map-side-only merge join [1] in Pig using Zebra's
> >> >>> > TableLoader. (My data allows a merge join.) But I'm unable to use the
> >> >>> > TableLoader. Even a simple script that loads a table and just stores it
> >> >>> > back doesn't work:
> >> >>> >
> >> >>> >  ----
> >> >>> >  A = load 'my_input' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
> >> >>> >  store A into 'my_output';
> >> >>> >  ----
> >> >>> >
> >> >>> >
> >> >>> >  'my_input' is an input directory containing a single file with just 1 column:
> >> >>> >  ---
> >> >>> >  1
> >> >>> >  2
> >> >>> >  3
> >> >>> >  ---
> >> >>> >
> >> >>> >  The error I get is -
> >> >>> >
> >> >>> >  "ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal
> >> >>> >  error. Failed to find deleted column groupsjava.io.IOException: BT Schema
> >> >>> >  file doesn't exist: *file:/......./my_input/.btschema*"
> >> >>> >
> >> >>> >
> >> >>> >  I have tried specifying the schema using the 'AS' clause and the DESCRIBE
> >> >>> >  statement as well, but it gives me the same error. Is the .btschema file
> >> >>> >  required? Is there any documentation available on its format? (I tried
> >> >>> >  comma-separated column names with/without type info.)
> >> >>> >
> >> >>> >
> >> >>> > I am also willing to work with any other loader that satisfies the merge
> >> >>> > join constraints. Thanks in anticipation.
> >> >>> >
> >> >>> >
> >> >>> >  Regards,
> >> >>> >  Ankur
> >> >>> >
> >> >>> >
> >> >>> >  [1] http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins
> >> >>> >
> >> >>>
> >> >>
> >> >
> >>
> >
>
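
For reference, the route Ashutosh describes in the quoted replies (rewrite the data with Zebra's TableStorer, then load it with TableLoader and do the outer merge join) might look roughly like the untested sketch below. The jar path, the 'my_input_a'/'my_input_b' and 'zebra_a'/'zebra_b' paths, and the single 'id' column are all assumptions made up for illustration; only the TableLoader('', 'sorted') call is taken from the thread. The empty TableStorer storage hint and the 'full outer ... using merge' form should be checked against the Zebra and Pig docs for the version in use.

```pig
-- Untested sketch; all paths, the jar location, and the column name are hypothetical.
register /path/to/zebra.jar;

-- Step 1: rewrite each plain-text input as a sorted Zebra table.
-- Ordering on the join key before storing is what lets the table be loaded as 'sorted'.
raw_a    = load 'my_input_a' using PigStorage() as (id:int);
sorted_a = order raw_a by id;
store sorted_a into 'zebra_a' using org.apache.hadoop.zebra.pig.TableStorer('');

raw_b    = load 'my_input_b' using PigStorage() as (id:int);
sorted_b = order raw_b by id;
store sorted_b into 'zebra_b' using org.apache.hadoop.zebra.pig.TableStorer('');

-- Step 2 (probably safest as a separate script, run once the tables exist):
-- load both tables with the loader that implements CollectableLoadFunc,
-- then do the outer merge join on the sort key.
A = load 'zebra_a' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = load 'zebra_b' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
J = join A by id full outer, B by id using 'merge';
store J into 'my_output';
```

Step 1 is also why the original script failed: TableLoader expects a directory that TableStorer wrote (including its .btschema file), not a plain text file.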
