I'm not sure there's a way to do that directly. You could get a really rough estimate by running the job I described and subtracting the total bytes calculated for the records from the "hadoop fs -dus /hbase/<table_name>" total. That would give you an idea of the amount of overhead. I have a feeling it is negligible in the grand scheme of things.
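Something along these lines is what I had in mind -- just a rough sketch against the 0.90-style TableMapper API; the table name "mytable" and the output path are placeholders, so adjust to your setup:

// Map-only job: emits rowkey -> total serialized bytes of that row's KeyValues.
// Sum/average the output afterwards and compare against "hadoop fs -dus".
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RowSizeJob {

  static class RowSizeMapper extends TableMapper<Text, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context)
        throws IOException, InterruptedException {
      long bytes = 0;
      for (KeyValue kv : result.raw()) {
        bytes += kv.getLength();  // full serialized KeyValue: key + value + framing
      }
      context.write(new Text(Bytes.toStringBinary(row.get())), new LongWritable(bytes));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "row-size");
    job.setJarByClass(RowSizeJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fewer RPC round trips per mapper
    scan.setCacheBlocks(false);  // don't pollute the block cache with a full scan

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, RowSizeMapper.class, Text.class, LongWritable.class, job);

    job.setNumReduceTasks(0);    // map-only; post-process the text output for stats
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}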
-Xavier

On 1/24/11 10:23 AM, Norbert Burger wrote:
> Good idea. But it seems like this approach would give me the size of just
> the raw data itself, ignoring any kind of container (like HFiles) that are
> used to store the data. What I'd like ideally is to get an idea of what the
> fixed cost (in terms of bytes) is for each of my tables, and then understand
> how I can calculate a variable bytes/record cost.
>
> Is this feasible?
>
> Norbert
>
> On Mon, Jan 24, 2011 at 1:16 PM, Xavier Stevens <[email protected]> wrote:
>
>> Norbert,
>>
>> It would probably be best if you wrote a quick MapReduce job that
>> iterates over those records and outputs the sum of bytes for each one.
>> Then you could use that output and get some general descriptive
>> statistics based on it.
>>
>> Cheers,
>>
>> -Xavier
>>
>> On 1/24/11 9:37 AM, Norbert Burger wrote:
>>> Hi folks - is there a recommended way of estimating HBase HDFS usage
>>> for a new environment?
>>>
>>> We have a DEV HBase cluster in place, and from this, I'm trying to
>>> estimate the specs of our not-yet-built PROD environment. One of the
>>> variables we're considering is HBase usage of HDFS. What I've just
>>> tried is to calculate an average bytes/record ratio by using
>>> "hadoop dfs -du /hbase", and dividing by the number of records/table.
>>> But this ignores any kind of fixed overhead, so I have concerns about it.
>>>
>>> Is there a better way?
>>>
>>> Norbert
