Sure, it is not so fast while loading, but on the other hand I can save the foreach operation after the load function. The best way would be to get all columns and return a bag, but I see no way to do that because the LoadFunc returns a Tuple and not a Bag. I will try this way and see how fast it is. If there are other ideas to make it faster, I will try them.
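For what it's worth, with batch=1 the pipeline could stay entirely in Pig Latin, with no map-to-bag UDF between the load and the join. A rough sketch (relation and column names are invented, and the batch-1 loader "MyBatchOneHBaseStorage" is hypothetical):

```pig
-- each scan.next() now yields a single column, so the loader can emit
-- one (rowkey, column) tuple per call instead of one map per row
A = LOAD 'hbase://mytable'
        USING MyBatchOneHBaseStorage('cf:*', '-loadKey true')
        AS (rowkey:chararray, col:chararray);
B = LOAD 'otherdata' AS (rowkey:chararray, val:chararray);

-- no UDF between LOAD and JOIN, so the merge-join restriction is satisfied
J = JOIN A BY rowkey, B BY rowkey USING 'merge';
```

Whether the loader still presents tuples in ascending row-key order with batch=1 is the thing to verify before relying on the merge join.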
regards,
john

2013/9/13 Shahab Yunus <[email protected]>

> Wouldn't this slow down your data retrieval? One column in each call
> instead of a batch?
>
> Regards,
> Shahab
>
> On Fri, Sep 13, 2013 at 2:34 PM, John <[email protected]> wrote:
>
> > I think I might have found a way to transform it directly into a bag.
> > Inside the HBaseStorage() load function I have set the HBase scan batch
> > to 1, so I get one column for every scan.next() instead of all columns.
> > See http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
> >
> > setBatch(int batch)
> > Set the maximum number of values to return for each call to next()
> >
> > I think this will work. Any idea if this approach has disadvantages?
> >
> > regards
> >
> > 2013/9/13 John <[email protected]>
> >
> > > hi,
> > >
> > > the join key is in the bag, that's the problem. The load function
> > > returns only one element, $0, and that is the map. This map is
> > > transformed in the next step with the UDF "MapToBagUDF" into a bag.
> > > For example, the load function returns this: ([col1,col2,col3]),
> > > then this map inside the tuple is transformed to:
> > >
> > > (col1)
> > > (col2)
> > > (col3)
> > >
> > > Maybe there is a way to transform the map into a bag directly in the
> > > load function? The problem I see is that the next() method in the
> > > LoadFunc has to return a Tuple and not a Bag. :/
> > >
> > > 2013/9/13 Pradeep Gollakota <[email protected]>
> > >
> > > > Since your join key is not in the Bag, can you do your join first
> > > > and then execute your UDF?
> > > >
> > > > On Fri, Sep 13, 2013 at 10:04 AM, John <[email protected]>
> > > > wrote:
> > > >
> > > > > Okay, I think I have found the problem. Here:
> > > > > http://pig.apache.org/docs/r0.11.1/perf.html#merge-joins
> > > > > ... there it is written:
> > > > >
> > > > > There may be filter statements and foreach statements between the
> > > > > sorted data source and the join statement. The foreach statement
> > > > > should meet the following conditions:
> > > > >
> > > > > - There should be no UDFs in the foreach statement.
> > > > > - The foreach statement should not change the position of the
> > > > >   join keys.
> > > > > - There should be no transformation on the join keys which will
> > > > >   change the sort order.
> > > > >
> > > > > I have to use a UDF to transform the Map into a Bag ... any
> > > > > workaround idea?
> > > > >
> > > > > thanks
> > > > >
> > > > > 2013/9/13 John <[email protected]>
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I try to use a merge join for 2 bags. Here is my pig code:
> > > > > > http://pastebin.com/Y9b2UtNk .
> > > > > >
> > > > > > But I got this error:
> > > > > >
> > > > > > Caused by: org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
> > > > > > ERROR 1103: Merge join/Cogroup only supports Filter, Foreach,
> > > > > > Ascending Sort, or Load as its predecessors. Found
> > > > > >
> > > > > > I think the reason is that there is no sort function or
> > > > > > something like this. But the bags are definitely sorted. How
> > > > > > can I do the merge join?
> > > > > >
> > > > > > thanks
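One workaround consistent with the quoted restriction (an untested sketch, with relation and field names invented): ERROR 1103 lists "Ascending Sort" as an allowed predecessor, so an explicit ORDER after the UDF step should make the relation acceptable to the merge join again, at the price of an extra sort job:

```pig
raw    = LOAD 'hbase://mytable'
             USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true')
             AS (rowkey:chararray, cols:map[]);
other  = LOAD 'otherdata' AS (rowkey:chararray, val:chararray);

-- the UDF breaks the merge-join precondition ...
flat   = FOREACH raw GENERATE rowkey, FLATTEN(MapToBagUDF(cols)) AS col;

-- ... but an ascending sort is a legal predecessor, so re-sort explicitly
sorted = ORDER flat BY rowkey ASC;
joined = JOIN sorted BY rowkey, other BY rowkey USING 'merge';
```

The extra ORDER costs a full MapReduce sort, which may cancel out the merge join's benefit, so it is worth measuring against a plain hash join on the same data.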
