Yep, that looks like it could be pretty handy -- according to that ticket it's in 0.98.1 as well.
On Tue, Mar 17, 2015 at 8:54 PM Josh Wills wrote:

> Would this help for 0.99+?
>
> https://issues.apache.org/jira/browse/HBASE-10413
>
> On Tue, Mar 17, 2015 at 12:35 PM, Gabriel Reid wrote:
>
>> That sounds like it would work pretty well, although the situation where
>> a custom Scan is used is still problematic.
>>
>> I think Hannibal [1] does some clever stuff as far as figuring out data
>> size as well (I think just via HBase RPC and not by looking at HDFS);
>> there could be some useful ideas in there.
>>
>> - Gabriel
>>
>> 1. https://github.com/sentric/hannibal
>>
>> On Tue, Mar 17, 2015 at 5:27 PM Micah Whitacre wrote:
>>
>>> Could we make an estimate based on # of regions *
>>> hbase.hregion.max.filesize?
>>> The case where this would overestimate would be if someone pre-split a
>>> table upon creation. Otherwise, as the table fills up over time, in
>>> theory each region would grow and split evenly (and possibly hit max
>>> size and therefore split again).
>>>
>>> On Tue, Mar 17, 2015 at 11:20 AM, Josh Wills wrote:
>>>
>>>> Also open to suggestion here-- this has annoyed me for some time (as
>>>> Gabriel pointed out), but I don't have a good fix for it.
>>>>
>>>> On Tue, Mar 17, 2015 at 9:10 AM, Gabriel Reid wrote:
>>>>
>>>>> Hi Nithin,
>>>>>
>>>>> This is a long-standing issue in Crunch (I think it's been present
>>>>> since Crunch was originally open-sourced). I'd love to get this fixed
>>>>> somehow, although it seems to not be trivial to do -- it can be
>>>>> difficult to accurately estimate the size of data that will come from
>>>>> an HBase table, especially considering that filters and selections of
>>>>> a subset of columns can be done on an HBase table.
>>>>>
>>>>> One short-term way of working around this is to add a simple identity
>>>>> function directly after the HBaseSourceTarget that implements the
>>>>> scaleFactor method to manipulate the calculated size of the HBase
>>>>> data, but this is just another hack.
>>>>>
>>>>> Maybe the better solution would be to estimate the size of the HBase
>>>>> table based on its size on HDFS when using the HBaseFrom.table(String)
>>>>> method, and then also overload the HBaseFrom.table(String, Scan)
>>>>> method to take a long value which is the estimated byte size (or
>>>>> perhaps scale factor) of the table content that is expected to be
>>>>> returned from the given Scan.
>>>>>
>>>>> Any thoughts on either of these?
>>>>>
>>>>> - Gabriel
>>>>>
>>>>> On Tue, Mar 17, 2015 at 1:51 PM Nithin Asokan wrote:
>>>>>
>>>>>> Hello,
>>>>>> I came across some unexpected behavior while using
>>>>>> HBaseSourceTarget. Suppose I have a job (from MRPipeline) that reads
>>>>>> from HBase using HBaseSourceTarget and passes all the data to a
>>>>>> reduce phase; the number of reducers set by the planner will be
>>>>>> equal to 1. The reason is [1]. So, it looks like the planner assumes
>>>>>> there is only about 1 GB of data that's read from the source, and
>>>>>> sets the number of reducers accordingly. However, let's say my HBase
>>>>>> scan is returning very little data or huge amounts of data. The
>>>>>> planner still assigns 1 reducer (crunch.bytes.per.reduce.task=1GB).
>>>>>> What's more interesting is, if there are dependent jobs, the planner
>>>>>> will set the number of reducers based on the initially determined
>>>>>> size from the HBase source.
>>>>>>
>>>>>> As a fix for the above problem, I can set the number of reducers on
>>>>>> the groupByKey(), but that does not offer much flexibility when
>>>>>> dealing with data that is of varying sizes. The other option is to
>>>>>> have a map-only job that reads from HBase and writes to HDFS, and
>>>>>> have a run(). The next job will determine the size correctly, since
>>>>>> FileSourceImpl calculates the size on disk.
>>>>>>
>>>>>> I noticed the comment on HBaseSourceTarget, and was wondering if
>>>>>> there was anything planned to have it implemented.
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/crunch/blob/apache-crunch-0.8.4/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HBaseSourceTarget.java#L173
>>>>>>
>>>>>> Thanks
>>>>>> Nithin
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
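
For reference, the arithmetic behind the "# of regions * hbase.hregion.max.filesize" idea from the thread can be sketched in plain Java. This is only an illustration, not Crunch or HBase code: the class and method names are hypothetical, the region count and the 10 GiB max file size are made-up inputs, and the ceiling division used for the reducer count is an assumption about how a size-based planner might behave rather than the planner's actual logic.

```java
// Hypothetical helper; not part of Crunch or HBase.
final class HBaseSizeEstimate {

    // Upper-bound estimate: assume every region has grown to the configured
    // maximum file size. This overestimates for pre-split or sparsely filled tables.
    static long estimateTableBytes(int regionCount, long maxRegionFileSizeBytes) {
        return Math.multiplyExact((long) regionCount, maxRegionFileSizeBytes);
    }

    // Assumed rule: one reducer per bytesPerReduceTask of input, rounded up,
    // and never fewer than one reducer.
    static int reducersFor(long estimatedBytes, long bytesPerReduceTask) {
        if (bytesPerReduceTask <= 0) {
            throw new IllegalArgumentException("bytesPerReduceTask must be positive");
        }
        long n = estimatedBytes / bytesPerReduceTask
                + (estimatedBytes % bytesPerReduceTask != 0 ? 1 : 0);
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long maxFileSize = 10 * oneGiB;   // illustrative hbase.hregion.max.filesize
        long estimate = estimateTableBytes(12, maxFileSize);
        System.out.println(estimate + " bytes -> "
                + reducersFor(estimate, oneGiB) + " reducers");
    }
}
```

With these made-up inputs (12 regions, 10 GiB per region, 1 GiB per reducer) the estimate comes out to 120 GiB and 120 reducers, compared with the single reducer described in the original report. A custom Scan with filters would still need its own scale factor on top of this.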
