Yep, that looks like it could be pretty handy -- according to that ticket
it's in 0.98.1 as well.

On Tue, Mar 17, 2015 at 8:54 PM Josh Wills <[email protected]> wrote:

> Would this help for 0.99+?
>
> https://issues.apache.org/jira/browse/HBASE-10413
>
> On Tue, Mar 17, 2015 at 12:35 PM, Gabriel Reid <[email protected]>
> wrote:
>
>> That sounds like it would work pretty well, although the situation where
>> a custom Scan is used is still problematic.
>>
>> I think Hannibal [1] does some clever stuff to figure out data size as
>> well (I think just via HBase RPC, not by looking at HDFS); there could
>> be some useful ideas in there.
>>
>> - Gabriel
>>
>> 1. https://github.com/sentric/hannibal
>>
>>
>> On Tue, Mar 17, 2015 at 5:27 PM Micah Whitacre <[email protected]>
>> wrote:
>>
>>> Could we make an estimate based on # of regions *
>>> hbase.hregion.max.filesize?
>>> The case where this would overestimate is if someone pre-split a table
>>> upon creation. Otherwise, as the table fills up over time, in theory
>>> each region would grow and split evenly (and possibly hit max size and
>>> therefore split again).
>>>
>>> On Tue, Mar 17, 2015 at 11:20 AM, Josh Wills <[email protected]>
>>> wrote:
>>>
>>>> Also open to suggestion here-- this has annoyed me for some time (as
>>>> Gabriel pointed out), but I don't have a good fix for it.
>>>>
>>>> On Tue, Mar 17, 2015 at 9:10 AM, Gabriel Reid <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Nithin,
>>>>>
>>>>> This is a long-standing issue in Crunch (I think it's been present
>>>>> since Crunch was originally open-sourced). I'd love to get this fixed
>>>>> somehow, although it doesn't seem trivial to do -- it can be difficult
>>>>> to accurately estimate the size of data that will come from an HBase
>>>>> table, especially considering that filters and selections of a subset
>>>>> of columns can be applied to an HBase table.
>>>>>
>>>>> One short-term way of working around this is to add a simple identity
>>>>> function directly after the HBaseSourceTarget that implements the
>>>>> scaleFactor method to manipulate the calculated size of the HBase data, 
>>>>> but
>>>>> this is just another hack.
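>>>>>
>>>>> The effect of that workaround boils down to the following (a
>>>>> standalone sketch, not the actual Crunch planner code; in real code
>>>>> you'd pass an identity DoFn with an overridden scaleFactor() to
>>>>> parallelDo):

```java
// Sketch of how a scaleFactor-style correction works (illustrative only):
// the planner scales its size estimate for a PCollection by the
// scaleFactor() of the function that produced it, so an identity function
// that only overrides scaleFactor() can correct a bad source estimate.
public class ScaleFactorSketch {
    static long adjustedEstimate(long sourceEstimateBytes, float scaleFactor) {
        return (long) (sourceEstimateBytes * scaleFactor);
    }

    public static void main(String[] args) {
        long assumedHBaseEstimate = 1L << 30; // the planner's fixed ~1Gb guess
        // If you know the scan really returns ~50Gb, use a scale factor of 50.
        System.out.println(adjustedEstimate(assumedHBaseEstimate, 50.0f));
    }
}
```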
>>>>>
>>>>> Maybe the better solution would be to estimate the size of the HBase
>>>>> table based on its size on HDFS when using the HBaseFrom.table(String)
>>>>> method, and to overload the HBaseFrom.table(String, Scan) method to
>>>>> also take a long value giving the estimated byte size (or perhaps
>>>>> scale factor) of the table content expected to be returned from the
>>>>> given Scan.
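>>>>>
>>>>> The decision that overload would encode might look like this (purely
>>>>> a sketch of the proposal; a three-argument table() method doesn't
>>>>> exist in Crunch today, and the helper below is hypothetical):

```java
// Hypothetical sketch of what a HBaseFrom.table(String, Scan, long)
// overload would decide internally (not current Crunch API):
public class EstimateChoice {
    static long chooseEstimate(long hdfsTableBytes, Long callerEstimateBytes) {
        // With a custom Scan, only the caller can know how much data will
        // actually come back; otherwise the table's HDFS footprint is a
        // reasonable default.
        return (callerEstimateBytes != null) ? callerEstimateBytes : hdfsTableBytes;
    }

    public static void main(String[] args) {
        long hdfsSize = 100L << 30; // table occupies ~100Gb on HDFS
        System.out.println(chooseEstimate(hdfsSize, null));     // full-table scan
        System.out.println(chooseEstimate(hdfsSize, 2L << 31)); // filtered scan, caller says ~4Gb
    }
}
```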
>>>>>
>>>>> Any thoughts on either of these?
>>>>>
>>>>> - Gabriel
>>>>>
>>>>>
>>>>> On Tue, Mar 17, 2015 at 1:51 PM Nithin Asokan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>> I came across a unique behavior while using HBaseSourceTarget.
>>>>>> Suppose I have a job (from MRPipeline) that reads from HBase using
>>>>>> HBaseSourceTarget and passes all the data to a reduce phase; the
>>>>>> number of reducers set by the planner will be equal to 1. The reason
>>>>>> is [1]. So it looks like the planner assumes there is only about 1Gb
>>>>>> of data read from the source, and sets the number of reducers
>>>>>> accordingly. However, whether my HBase scan returns very little data
>>>>>> or huge amounts of data, the planner still assigns 1 reducer
>>>>>> (crunch.bytes.per.reduce.task=1Gb). What's more interesting is that
>>>>>> if there are dependent jobs, the planner will set the number of
>>>>>> reducers based on the size initially determined from the HBase
>>>>>> source.
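>>>>>>
>>>>>> To illustrate the mismatch, the sizing logic is roughly this (a
>>>>>> simplified sketch, not the actual planner code):

```java
// Simplified sketch (not the actual Crunch planner): reducers are sized by
// dividing the estimated input bytes by crunch.bytes.per.reduce.task
// (~1Gb by default), with a floor of 1. A fixed ~1Gb source estimate
// therefore always yields 1 reducer, however much the scan really returns.
public class ReducerSketch {
    static int numReducers(long estimatedInputBytes, long bytesPerReduceTask) {
        return (int) Math.max(1, estimatedInputBytes / bytesPerReduceTask);
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        System.out.println(numReducers(oneGb, oneGb));       // planner's view
        System.out.println(numReducers(50L * oneGb, oneGb)); // actual ~50Gb of data
    }
}
```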
>>>>>>
>>>>>> As a fix for the above problem, I can set the number of reducers on
>>>>>> the groupByKey(), but that does not offer much flexibility when
>>>>>> dealing with data of varying sizes. The other option is to have a
>>>>>> map-only job that reads from HBase and writes to HDFS, followed by a
>>>>>> run(). The next job will then determine the size correctly, since
>>>>>> FileSourceImpl calculates the size on disk.
>>>>>>
>>>>>> I noticed the comment on HBaseSourceTarget, and was wondering if
>>>>>> anything is planned to implement it.
>>>>>>
>>>>>> [1]
>>>>>>
>>>>>> https://github.com/apache/crunch/blob/apache-crunch-0.8.4/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HBaseSourceTarget.java#L173
>>>>>>
>>>>>> Thanks
>>>>>> Nithin
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>
>>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
