Hi Nithin,

Unfortunately, the HBase classes aren't included in the published API docs. I just took a look at adding them, but it appears to be more complex than I would have hoped -- I'll create a JIRA ticket to look into this further, but I won't be able to get to it right away.
In any case, these HBase classes (HBaseFrom, HBaseTo) are in the
org.apache.crunch.io.hbase package in the crunch-hbase module.

- Gabriel

On Wed, Mar 18, 2015 at 2:16 AM Nithin Asokan <[email protected]> wrote:

> Thanks for looking at this, everyone.
>
> I can try the suggestion Gabriel posted here. I'm not familiar with the
> HBaseFrom.table(String) API and tried searching online. It would be really
> helpful if someone could point me to the API.
>
> Thanks everyone!
>
> On Tue, Mar 17, 2015 at 3:34 PM Gabriel Reid <[email protected]> wrote:
>
>> Yep, that looks like it could be pretty handy -- according to that ticket
>> it's in 0.98.1 as well.
>>
>> On Tue, Mar 17, 2015 at 8:54 PM Josh Wills <[email protected]> wrote:
>>
>>> Would this help for 0.99+?
>>>
>>> https://issues.apache.org/jira/browse/HBASE-10413
>>>
>>> On Tue, Mar 17, 2015 at 12:35 PM, Gabriel Reid <[email protected]> wrote:
>>>
>>>> That sounds like it would work pretty well, although the situation
>>>> where a custom Scan is used is still problematic.
>>>>
>>>> I think Hannibal [1] does some clever stuff as far as figuring out data
>>>> size as well (I think just via HBase RPC and not by looking at HDFS);
>>>> there could be some useful ideas in there.
>>>>
>>>> - Gabriel
>>>>
>>>> 1. https://github.com/sentric/hannibal
>>>>
>>>> On Tue, Mar 17, 2015 at 5:27 PM Micah Whitacre <[email protected]> wrote:
>>>>
>>>>> Could we make an estimate based on # of regions *
>>>>> hbase.hregion.max.filesize? The case where this would overestimate is
>>>>> if someone pre-split a table upon creation. Otherwise, as the table
>>>>> fills up over time, in theory each region would grow and split evenly
>>>>> (and possibly hit max size and therefore split again).
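Micah's back-of-the-envelope estimate can be sketched in plain Java. The 10 GB figure below is the default hbase.hregion.max.filesize in recent HBase releases; fetching the live region count (e.g. via HBaseAdmin#getTableRegions) needs a running cluster, so it appears only as a comment here:

```java
public class RegionCountEstimate {

    // Default hbase.hregion.max.filesize in recent HBase releases: 10 GB.
    static final long DEFAULT_MAX_FILESIZE = 10L * 1024 * 1024 * 1024;

    /**
     * Upper-bound size estimate for a table: number of regions times the
     * configured maximum region size. This overestimates for pre-split
     * tables whose regions have not filled up yet, as noted above.
     */
    static long estimateTableSize(int regionCount, long maxFileSize) {
        return regionCount * maxFileSize;
    }

    public static void main(String[] args) {
        // In real code the region count would come from the cluster, e.g.
        // admin.getTableRegions(TableName.valueOf("my_table")).size().
        int regionCount = 20;
        System.out.println(estimateTableSize(regionCount, DEFAULT_MAX_FILESIZE));
    }
}
```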
>>>>>
>>>>> On Tue, Mar 17, 2015 at 11:20 AM, Josh Wills <[email protected]> wrote:
>>>>>
>>>>>> Also open to suggestions here -- this has annoyed me for some time (as
>>>>>> Gabriel pointed out), but I don't have a good fix for it.
>>>>>>
>>>>>> On Tue, Mar 17, 2015 at 9:10 AM, Gabriel Reid <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Nithin,
>>>>>>>
>>>>>>> This is a long-standing issue in Crunch (I think it's been present
>>>>>>> since Crunch was originally open-sourced). I'd love to get this fixed
>>>>>>> somehow, although it doesn't seem trivial to do -- it can be difficult
>>>>>>> to accurately estimate the size of data that will come from an HBase
>>>>>>> table, especially considering that filters and selections of a subset
>>>>>>> of columns can be applied to an HBase table.
>>>>>>>
>>>>>>> One short-term way of working around this is to add a simple identity
>>>>>>> function directly after the HBaseSourceTarget that implements the
>>>>>>> scaleFactor method to manipulate the calculated size of the HBase
>>>>>>> data, but this is just another hack.
>>>>>>>
>>>>>>> Maybe the better solution would be to estimate the size of the HBase
>>>>>>> table based on its size on HDFS when using the HBaseFrom.table(String)
>>>>>>> method, and then also overload the HBaseFrom.table(String, Scan)
>>>>>>> method to take a long value which is the estimated byte size (or
>>>>>>> perhaps scale factor) of the table content that is expected to be
>>>>>>> returned from the given Scan.
>>>>>>>
>>>>>>> Any thoughts on either of these?
>>>>>>>
>>>>>>> - Gabriel
>>>>>>>
>>>>>>> On Tue, Mar 17, 2015 at 1:51 PM Nithin Asokan <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I came across a unique behavior while using HBaseSourceTarget.
>>>>>>>> Suppose I have a job (from MRPipeline) that reads from HBase using
>>>>>>>> HBaseSourceTarget and passes all the data to a reduce phase; the
>>>>>>>> number of reducers set by the planner will be equal to 1. The reason
>>>>>>>> is [1]. So it looks like the planner assumes there is only about 1 GB
>>>>>>>> of data read from the source, and sets the number of reducers
>>>>>>>> accordingly. However, whether my HBase scan returns very little data
>>>>>>>> or huge amounts of data, the planner still assigns 1 reducer
>>>>>>>> (crunch.bytes.per.reduce.task=1Gb). What's more interesting is that
>>>>>>>> if there are dependent jobs, the planner will set their number of
>>>>>>>> reducers based on the initially determined size from the HBase
>>>>>>>> source.
>>>>>>>>
>>>>>>>> As a fix for the above problem, I can set the number of reducers on
>>>>>>>> the groupByKey(), but that does not offer much flexibility when
>>>>>>>> dealing with data of varying sizes. The other option is to have a
>>>>>>>> map-only job that reads from HBase and writes to HDFS, followed by a
>>>>>>>> run(). The next job will then determine the size correctly, since
>>>>>>>> FileSourceImpl calculates the size on disk.
>>>>>>>>
>>>>>>>> I noticed the comment on HBaseSourceTarget, and was wondering if
>>>>>>>> there was anything planned to have it implemented.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/crunch/blob/apache-crunch-0.8.4/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HBaseSourceTarget.java#L173
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nithin
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
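For anyone landing on this thread later: the identity-function workaround Gabriel describes relies on Crunch's DoFn scaleFactor() hook, which feeds into the planner's size estimate before it is divided by crunch.bytes.per.reduce.task. Leaving the Crunch wiring aside, the arithmetic is just the following (a toy model of the effect, not the planner's actual code; the 1 GB default and the factor of 50 are illustrative):

```java
public class ScaleFactorSketch {

    // crunch.bytes.per.reduce.task defaults to 1 GB.
    static final long BYTES_PER_REDUCE_TASK = 1L << 30;

    /** Estimated output size after applying a DoFn's scale factor. */
    static long scaledSize(long inputSizeBytes, float scaleFactor) {
        return (long) (inputSizeBytes * scaleFactor);
    }

    /** Reducer count derived from an estimated size, never less than 1. */
    static int numReducers(long sizeBytes) {
        return (int) Math.max(1, sizeBytes / BYTES_PER_REDUCE_TASK);
    }

    public static void main(String[] args) {
        // The HBase source reports ~1 GB, so on its own it yields 1 reducer.
        // An identity DoFn right after it with scaleFactor() == 50 bumps the
        // estimate to ~50 GB, and the reducer count follows.
        long scaled = scaledSize(1L << 30, 50f);
        System.out.println(numReducers(scaled));
    }
}
```

The same mechanism is why this is "just another hack": the factor is a hard-coded guess, so it fixes one table size and breaks for another.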

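Gabriel's other proposal -- sizing HBaseFrom.table(String) from the table's footprint on HDFS -- amounts to summing file lengths under the table's directory beneath the HBase root. On a real cluster that would be a call to Hadoop's FileSystem#getContentSummary; the recursion is the same as this local-filesystem sketch (the directory layout in main() is a stand-in, not the real HBase layout):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class TableDirSize {

    /**
     * Recursively sum file sizes under a directory -- the same number HDFS
     * exposes via FileSystem#getContentSummary(path).getLength().
     */
    static long sizeOf(File f) {
        if (f.isFile()) {
            return f.length();
        }
        long total = 0;
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                total += sizeOf(child);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway tree standing in for a table dir with one region.
        File tableDir = Files.createTempDirectory("table").toFile();
        File regionDir = new File(tableDir, "region-0");
        regionDir.mkdir();
        try (FileOutputStream out = new FileOutputStream(new File(regionDir, "hfile"))) {
            out.write(new byte[4096]); // one 4 KB "store file"
        }
        System.out.println(sizeOf(tableDir)); // 4096
    }
}
```

Note this shares the weakness discussed above: the on-disk size reflects the whole table, so a custom Scan with filters or a column subset would still need the separately supplied estimate Gabriel suggests for HBaseFrom.table(String, Scan).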