Yeah, I need (full) outer join, which has this constraint on the loader. Thanks.
On Wed, Jul 20, 2011 at 1:15 PM, Ashutosh Chauhan <[email protected]> wrote:

> It depends on whether you want to do inner or outer (also called
> co-group) merge join. If you are doing inner merge join on two tables
> PigStorage satisfies all the criteria and can be used. If you want to
> do outer merge join (or inner merge join on more than two tables),
> then you need CollectableLoadFunc which PigStorage doesn't implement
> and only Zebra's TableLoader does.
>
> Hope it helps,
> Ashutosh
>
> On Wed, Jul 20, 2011 at 12:54, Tomas Svarovsky
> <[email protected]> wrote:
> > Not sure if this would be helpful, but the docs say that the default
> > PigStorage does implement that. I guess that your data needs to be
> > already sorted if you do not want to go through the reduce phase
> > during the join.
> >
> > T
> >
> > On Wed, Jul 20, 2011 at 12:13 PM, Ankur Jain <[email protected]> wrote:
> >> Thanks Ashutosh! Right, I too realized that yesterday. So, is there any
> >> other loader that implements the
> >> CollectableLoadFunc interface required by the merge join?
> >>
> >>
> >> Thanks,
> >> Ankur
> >>
> >>
> >> On Wed, Jul 20, 2011 at 10:22 AM, Ashutosh Chauhan <[email protected]> wrote:
> >>
> >>> Hey Ankur,
> >>>
> >>> Zebra's TableLoader works with the data written out using Zebra's
> >>> TableStorer. So, you need to write the data first using Zebra and then
> >>> subsequently load using TableLoader and do merge-join.
> >>>
> >>> Ashutosh
> >>> On Tue, Jul 19, 2011 at 14:28, Ankur Jain <[email protected]> wrote:
> >>> > Hi all,
> >>> >
> >>> > I'm trying to do a map-side only merge join [1] in pig using Zebra's
> >>> > TableLoader. (My data allows merge join.) But I'm unable to use the
> >>> > TableLoader. Even a simple script that loads a table and just stores it
> >>> > back doesn't work -
> >>> >
> >>> > ----
> >>> > A = load 'my_input' using org.apache.hadoop.zebra.pig.TableLoader('',
> >>> > 'sorted');
> >>> > store A into 'my_output';
> >>> > ----
> >>> >
> >>> >
> >>> > 'my_input' is input directory containing a single file with just 1
> >>> > column -
> >>> > ---
> >>> > 1
> >>> > 2
> >>> > 3
> >>> > ---
> >>> >
> >>> > The error I get is -
> >>> >
> >>> > "ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal
> >>> > error. Failed to find deleted column groupsjava.io.IOException: BT Schema
> >>> > file doesn't exist: *file:/......./my_input/.btschema*"
> >>> >
> >>> >
> >>> > I have tried specifying the schema using the 'AS' clause and the DESCRIBE
> >>> > statement as well, but it fetches me the same error. Is the .btschema
> >>> > file required? Is there any documentation available on its format? (I
> >>> > tried comma-separated column names with/without type info)
> >>> >
> >>> >
> >>> > I am also willing to work with any other loader that satisfies the merge
> >>> > join constraints. Thanks in anticipation.
> >>> >
> >>> >
> >>> > Regards,
> >>> > Ankur
> >>> >
> >>> >
> >>> > [1] *http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins*
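
For readers following the thread, here is a rough conceptual sketch of what a full outer merge join does with two inputs that are already sorted on the join key, which is why the data has to arrive sorted to avoid the reduce phase. It is plain Python, not Pig's or Zebra's implementation; the function name `full_outer_merge_join` and the `(key, value)` tuple layout are assumptions made for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def full_outer_merge_join(left, right):
    """Full outer join of two lists of (key, value) pairs sorted by key.

    Hypothetical helper for illustration; not part of Pig or Zebra.
    Returns (key, left_value, right_value) tuples, with None on the side
    that has no match.
    """
    # Collect the values for each key; this relies on the inputs being sorted.
    lgroups = [(k, [v for _, v in g]) for k, g in groupby(left, key=itemgetter(0))]
    rgroups = [(k, [v for _, v in g]) for k, g in groupby(right, key=itemgetter(0))]

    out = []
    i = j = 0
    # Advance through both sides in key order, one pass over each input.
    while i < len(lgroups) and j < len(rgroups):
        lk, lvals = lgroups[i]
        rk, rvals = rgroups[j]
        if lk == rk:
            out.extend((lk, a, b) for a in lvals for b in rvals)
            i += 1
            j += 1
        elif lk < rk:
            out.extend((lk, a, None) for a in lvals)  # key only on the left
            i += 1
        else:
            out.extend((rk, None, b) for b in rvals)  # key only on the right
            j += 1

    # Whatever remains on either side has no partner on the other.
    for lk, lvals in lgroups[i:]:
        out.extend((lk, a, None) for a in lvals)
    for rk, rvals in rgroups[j:]:
        out.extend((rk, None, b) for b in rvals)
    return out


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (3, "c")]
    right = [(2, "x"), (4, "y")]
    for row in full_outer_merge_join(left, right):
        print(row)
```

Because each side is read once in key order, no shuffle or re-sort is needed; unsorted input would silently produce wrong matches in a sketch like this one.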
