Sure, it is not so fast while loading, but on the other hand I can save the foreach operation after the load function. The best way would be to get all columns and return a bag, but I see no way to do that because the LoadFunc returns a Tuple and not a Bag. I will try this way and see how fast it is. If there are other ideas to make it faster, I will try them.
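For what it's worth, with batch=1 the pipeline could stay entirely in Pig Latin, with no map-to-bag UDF between the load and the join. A rough sketch (relation and column names are invented, and the batch-1 loader "MyBatchOneHBaseStorage" is hypothetical):

```pig
-- each scan.next() now yields a single column, so the loader can emit
-- one (rowkey, column) tuple per call instead of one map per row
A = LOAD 'hbase://mytable'
        USING MyBatchOneHBaseStorage('cf:*', '-loadKey true')
        AS (rowkey:chararray, col:chararray);
B = LOAD 'otherdata' AS (rowkey:chararray, val:chararray);

-- no UDF between LOAD and JOIN, so the merge-join restriction is satisfied
J = JOIN A BY rowkey, B BY rowkey USING 'merge';
```

Whether the loader still presents tuples in ascending row-key order with batch=1 is the thing to verify before relying on the merge join.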
regards,
john

2013/9/13 Shahab Yunus <[email protected]>

> Wouldn't this slow down your data retrieval? One column in each call
> instead of a batch?
>
> Regards,
> Shahab
>
> On Fri, Sep 13, 2013 at 2:34 PM, John <[email protected]> wrote:
>
> > I think I might have found a way to transform it directly into a bag.
> > Inside the HBaseStorage() load function I have set the HBase scan batch
> > to 1, so I get one column for every scan.next() instead of all columns.
> > See http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
> >
> > setBatch(int batch)
> > Set the maximum number of values to return for each call to next()
> >
> > I think this will work. Any idea if this approach has disadvantages?
> >
> > regards
> >
> > 2013/9/13 John <[email protected]>
> >
> > > hi,
> > >
> > > the join key is in the bag, that's the problem. The load function
> > > returns only one element, $0, and that is the map. This map is
> > > transformed in the next step with the UDF "MapToBagUDF" into a bag.
> > > For example, the load function returns this: ([col1,col2,col3]),
> > > then this map inside the tuple is transformed to:
> > >
> > > (col1)
> > > (col2)
> > > (col3)
> > >
> > > Maybe there is a way to transform the map into a bag directly in the
> > > load function? The problem I see is that the next() method in the
> > > LoadFunc has to return a Tuple and not a Bag. :/
> > >
> > > 2013/9/13 Pradeep Gollakota <[email protected]>
> > >
> > > > Since your join key is not in the Bag, can you do your join first
> > > > and then execute your UDF?
> > > >
> > > > On Fri, Sep 13, 2013 at 10:04 AM, John <[email protected]>
> > > > wrote:
> > > >
> > > > > Okay, I think I have found the problem. Here:
> > > > > http://pig.apache.org/docs/r0.11.1/perf.html#merge-joins
> > > > > ... there it is written:
> > > > >
> > > > > There may be filter statements and foreach statements between the
> > > > > sorted data source and the join statement. The foreach statement
> > > > > should meet the following conditions:
> > > > >
> > > > > - There should be no UDFs in the foreach statement.
> > > > > - The foreach statement should not change the position of the
> > > > >   join keys.
> > > > > - There should be no transformation on the join keys which will
> > > > >   change the sort order.
> > > > >
> > > > > I have to use a UDF to transform the Map into a Bag ... any
> > > > > workaround idea?
> > > > >
> > > > > thanks
> > > > >
> > > > > 2013/9/13 John <[email protected]>
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I try to use a merge join for 2 bags. Here is my pig code:
> > > > > > http://pastebin.com/Y9b2UtNk .
> > > > > >
> > > > > > But I got this error:
> > > > > >
> > > > > > Caused by: org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
> > > > > > ERROR 1103: Merge join/Cogroup only supports Filter, Foreach,
> > > > > > Ascending Sort, or Load as its predecessors. Found
> > > > > >
> > > > > > I think the reason is that there is no sort function or
> > > > > > something like this. But the bags are definitely sorted. How
> > > > > > can I do the merge join?
> > > > > >
> > > > > > thanks
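One workaround consistent with the quoted restriction (an untested sketch, with relation and field names invented): ERROR 1103 lists "Ascending Sort" as an allowed predecessor, so an explicit ORDER after the UDF step should make the relation acceptable to the merge join again, at the price of an extra sort job:

```pig
raw    = LOAD 'hbase://mytable'
             USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true')
             AS (rowkey:chararray, cols:map[]);
other  = LOAD 'otherdata' AS (rowkey:chararray, val:chararray);

-- the UDF breaks the merge-join precondition ...
flat   = FOREACH raw GENERATE rowkey, FLATTEN(MapToBagUDF(cols)) AS col;

-- ... but an ascending sort is a legal predecessor, so re-sort explicitly
sorted = ORDER flat BY rowkey ASC;
joined = JOIN sorted BY rowkey, other BY rowkey USING 'merge';
```

The extra ORDER costs a full MapReduce sort, which may cancel out the merge join's benefit, so it is worth measuring against a plain hash join on the same data.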
