Thanks Ashutosh. Let me reconsider the options available to me.

-Ankur

On Wed, Jul 20, 2011 at 2:21 PM, Ashutosh Chauhan <[email protected]> wrote:

> If you control the generation of data which needs to be joined, then
> you can store it with Zebra and then do the joins. If not, then you
> either need to rewrite the data using Zebra or need to implement
> another loader which implements CollectableLoadFunc.
>
> Ashutosh
>
> On Wed, Jul 20, 2011 at 14:16, Ankur Jain <[email protected]> wrote:
> > Yeah, I need (full) outer join, which has this constraint on the
> > loader.
> >
> > Thanks.
> >
> > On Wed, Jul 20, 2011 at 1:15 PM, Ashutosh Chauhan
> > <[email protected]> wrote:
> >
> >> It depends on whether you want to do inner or outer (also called
> >> co-group) merge join. If you are doing inner merge join on two
> >> tables, PigStorage satisfies all the criteria and can be used. If
> >> you want to do outer merge join (or inner merge join on more than
> >> two tables), then you need CollectableLoadFunc, which PigStorage
> >> doesn't implement and only Zebra's TableLoader does.
> >>
> >> Hope it helps,
> >> Ashutosh
> >>
> >> On Wed, Jul 20, 2011 at 12:54, Tomas Svarovsky
> >> <[email protected]> wrote:
> >> > Not sure if this would be helpful, but the docs say that the
> >> > default PigStorage does implement that. I guess that your data
> >> > needs to be already sorted if you do not want to go through the
> >> > reduce phase during the join.
> >> >
> >> > T
> >> >
> >> > On Wed, Jul 20, 2011 at 12:13 PM, Ankur Jain <[email protected]>
> >> > wrote:
> >> >> Thanks Ashutosh! Right, I too realized that yesterday. So, is
> >> >> there any other loader that implements the CollectableLoadFunc
> >> >> interface required by the merge join?
> >> >>
> >> >> Thanks,
> >> >> Ankur
> >> >>
> >> >> On Wed, Jul 20, 2011 at 10:22 AM, Ashutosh Chauhan
> >> >> <[email protected]> wrote:
> >> >>
> >> >>> Hey Ankur,
> >> >>>
> >> >>> Zebra's TableLoader works with the data written out using
> >> >>> Zebra's TableStorer. So, you need to write the data first
> >> >>> using Zebra and then subsequently load it using TableLoader
> >> >>> and do the merge join.
> >> >>>
> >> >>> Ashutosh
> >> >>>
> >> >>> On Tue, Jul 19, 2011 at 14:28, Ankur Jain <[email protected]>
> >> >>> wrote:
> >> >>> > Hi all,
> >> >>> >
> >> >>> > I'm trying to do a map-side only merge join [1] in Pig using
> >> >>> > Zebra's TableLoader. (My data allows merge join.) But I'm
> >> >>> > unable to use the TableLoader. Even a simple script that
> >> >>> > loads a table and just stores it back doesn't work -
> >> >>> >
> >> >>> > ----
> >> >>> > A = load 'my_input' using
> >> >>> >     org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
> >> >>> > store A into 'my_output';
> >> >>> > ----
> >> >>> >
> >> >>> > 'my_input' is an input directory containing a single file
> >> >>> > with just 1 column -
> >> >>> > ---
> >> >>> > 1
> >> >>> > 2
> >> >>> > 3
> >> >>> > ---
> >> >>> >
> >> >>> > The error I get is -
> >> >>> >
> >> >>> > "ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999:
> >> >>> > Unexpected internal error. Failed to find deleted column
> >> >>> > groupsjava.io.IOException: BT Schema file doesn't exist:
> >> >>> > file:/......./my_input/.btschema"
> >> >>> >
> >> >>> > I have tried specifying the schema using the 'AS' clause and
> >> >>> > the DESCRIBE statement as well, but it gives me the same
> >> >>> > error. Is the .btschema file required? Is there any
> >> >>> > documentation available on its format? (I tried
> >> >>> > comma-separated column names with/without type info)
> >> >>> >
> >> >>> > I am also willing to work with any other loader that
> >> >>> > satisfies the merge join constraints. Thanks in anticipation.
> >> >>> >
> >> >>> > Regards,
> >> >>> > Ankur
> >> >>> >
> >> >>> > [1] http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins
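
For anyone landing on this thread later, the two-step workflow Ashutosh describes (write with TableStorer first, then load with TableLoader and merge join) might look roughly like the sketch below. This is untested and rests on several assumptions: the jar path, the input/output paths, the column name k, and the '[k]' storage hint passed to TableStorer are all made up for illustration, and it assumes that storing an ORDER BY result through TableStorer yields a sorted Zebra table. Only the TableLoader('', 'sorted') call is taken from the original script. The two scripts are kept separate on the assumption that TableLoader needs the table (and its .btschema) to exist before the second script is parsed.

```pig
-- Script 1 (sketch): rewrite the plain-text inputs as sorted Zebra tables.
-- Hypothetical jar path; point it at the Zebra jar of your installation.
REGISTER /path/to/zebra.jar;

raw_left  = LOAD 'left_input'  USING PigStorage() AS (k:int);
raw_right = LOAD 'right_input' USING PigStorage() AS (k:int);

-- Merge join needs both sides sorted on the join key.
sorted_left  = ORDER raw_left  BY k;
sorted_right = ORDER raw_right BY k;

-- '[k]' is an assumed storage hint: one column group holding column k.
STORE sorted_left  INTO 'left_zebra'
    USING org.apache.hadoop.zebra.pig.TableStorer('[k]');
STORE sorted_right INTO 'right_zebra'
    USING org.apache.hadoop.zebra.pig.TableStorer('[k]');
```

```pig
-- Script 2 (sketch): run only after script 1 has finished.
REGISTER /path/to/zebra.jar;

-- Same loader arguments as in the original question.
A = LOAD 'left_zebra'  USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = LOAD 'right_zebra' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');

-- Inner merge join on two tables; no other operators between load and join.
C = JOIN A BY k, B BY k USING 'merge';

STORE C INTO 'joined_output';
```

The sketch uses an inner join only to keep the syntax uncontroversial; the outer case discussed above depends on the loader implementing CollectableLoadFunc, as Ashutosh notes.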
