I'm not sure there's a direct way to get that.  You could get a rough
estimate by running the job I described and subtracting the total bytes
calculated for the records from the "hadoop fs -dus /hbase/<table_name>"
output.  The difference would give you an idea of the amount of overhead.
I have a feeling it's negligible in the grand scheme of things.
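
If you wanted to script that comparison, a rough (untested) sketch might
look like the following -- getContentSummary() should report the same
figure "hadoop fs -dus" prints, and the table name and job total here are
just placeholder arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TableOverhead {
  public static void main(String[] args) throws Exception {
    // args[0] = table name, args[1] = total record bytes from the MR job
    long recordBytes = Long.parseLong(args[1]);

    FileSystem fs = FileSystem.get(new Configuration());
    // Same figure "hadoop fs -dus /hbase/<table_name>" would show.
    long hdfsBytes =
        fs.getContentSummary(new Path("/hbase/" + args[0])).getLength();

    System.out.println("HDFS bytes:   " + hdfsBytes);
    System.out.println("Record bytes: " + recordBytes);
    System.out.println("Overhead:     " + (hdfsBytes - recordBytes));
  }
}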

-Xavier

On 1/24/11 10:23 AM, Norbert Burger wrote:
> Good idea.  But it seems like this approach would give me the size of just
> the raw data itself, ignoring any kind of container (like HFiles) that is
> used to store the data.  What I'd like, ideally, is to get an idea of the
> fixed cost (in terms of bytes) for each of my tables, and then understand
> how I can calculate a variable bytes/record cost.
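> In other words, I'm hoping to fit something roughly like
>   total_bytes ~= fixed_overhead + (bytes_per_record * num_records)
> and get a handle on both terms.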
>
> Is this feasible?
>
> Norbert
>
> On Mon, Jan 24, 2011 at 1:16 PM, Xavier Stevens <[email protected]> wrote:
>
>> Norbert,
>>
>> It would probably be best if you wrote a quick MapReduce job that
>> iterates over those records and outputs the sum of bytes for each one.
>> You could then use that output to compute some general descriptive
>> statistics.
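>>
>> Something along these lines might work as a starting point -- an untested
>> sketch, assuming the TableMapper API, with KeyValue.getLength() as a
>> stand-in for per-cell size; table name and output path are arguments:
>>
>> import java.io.IOException;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.KeyValue;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>> import org.apache.hadoop.hbase.mapreduce.TableMapper;
>> import org.apache.hadoop.hbase.util.Bytes;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Job;
>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>
>> public class RecordSizeJob {
>>
>>   static class SizeMapper extends TableMapper<Text, LongWritable> {
>>     @Override
>>     protected void map(ImmutableBytesWritable row, Result value, Context context)
>>         throws IOException, InterruptedException {
>>       long bytes = 0;
>>       for (KeyValue kv : value.raw()) {   // key + value + timestamp, per cell
>>         bytes += kv.getLength();
>>       }
>>       // One output line per row: <row key> <byte count>
>>       context.write(new Text(Bytes.toStringBinary(row.get(), row.getOffset(),
>>           row.getLength())), new LongWritable(bytes));
>>       // Running total, handy for comparing against "hadoop fs -dus" later.
>>       context.getCounter("recordsize", "TOTAL_BYTES").increment(bytes);
>>     }
>>   }
>>
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = HBaseConfiguration.create();
>>     Job job = new Job(conf, "record-sizes-" + args[0]);
>>     job.setJarByClass(RecordSizeJob.class);
>>
>>     Scan scan = new Scan();
>>     scan.setCaching(500);
>>     scan.setCacheBlocks(false);   // full scan; don't churn the block cache
>>
>>     TableMapReduceUtil.initTableMapperJob(args[0], scan, SizeMapper.class,
>>         Text.class, LongWritable.class, job);
>>     job.setNumReduceTasks(0);     // map-only: one line per row
>>     job.setOutputKeyClass(Text.class);
>>     job.setOutputValueClass(LongWritable.class);
>>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>
>>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>>   }
>> }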
>>
>> Cheers,
>>
>>
>> -Xavier
>>
>>
>> On 1/24/11 9:37 AM, Norbert Burger wrote:
>>> Hi folks - is there a recommended way of estimating HBase HDFS usage for
>>> a new environment?
>>>
>>> We have a DEV HBase cluster in place, and from this, I'm trying to
>>> estimate the specs of our not-yet-built PROD environment.  One of the
>>> variables we're considering is HBase usage of HDFS.  What I've just tried
>>> is to calculate an average bytes/record ratio by using
>>> "hadoop dfs -du /hbase", and dividing by the number of records/table.
>>> But this ignores any kind of fixed overhead, so I have concerns about it.
>>>
>>> Is there a better way?
>>>
>>> Norbert
>>>
