Good idea. But it seems like this approach would give me the size of just the raw data itself, ignoring the containers (like HFiles) used to store the data. Ideally, what I'd like is to get an idea of the fixed cost (in bytes) for each of my tables, and then understand how to calculate a variable bytes/record cost on top of that.
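In other words, I'm thinking of each table's footprint as roughly size(N) = fixed + perRecord * N. Here's a rough, untested sketch of the two-point fit I have in mind, where the byte counts would come from "hadoop dfs -du /hbase/<table>" taken at two known record counts (the class name and arguments are just placeholders):

public class TableCostFit {
  public static void main(String[] args) {
    // args: <records1> <bytes1> <records2> <bytes2>
    long n1 = Long.parseLong(args[0]);      // record count at measurement 1
    long bytes1 = Long.parseLong(args[1]);  // dfs -du result at measurement 1
    long n2 = Long.parseLong(args[2]);      // record count at measurement 2
    long bytes2 = Long.parseLong(args[3]);  // dfs -du result at measurement 2

    // Slope of the line is the variable cost; intercept is the fixed overhead.
    double perRecord = (double) (bytes2 - bytes1) / (n2 - n1);
    double fixed = bytes1 - perRecord * n1;

    System.out.printf("bytes/record ~= %.1f, fixed overhead ~= %.0f bytes%n",
        perRecord, fixed);
  }
}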
Is this feasible?

Norbert

On Mon, Jan 24, 2011 at 1:16 PM, Xavier Stevens <[email protected]> wrote:

> Norbert,
>
> It would probably be best if you wrote a quick MapReduce job that
> iterates over those records and outputs the sum of bytes for each one.
> Then you could use that output and get some general descriptive
> statistics based on it.
>
> Cheers,
>
> -Xavier
>
> On 1/24/11 9:37 AM, Norbert Burger wrote:
> > Hi folks - is there a recommended way of estimating HBase HDFS usage
> > for a new environment?
> >
> > We have a DEV HBase cluster in place, and from this, I'm trying to
> > estimate the specs of our not-yet-built PROD environment. One of the
> > variables we're considering is HBase usage of HDFS. What I've just
> > tried is to calculate an average bytes/record ratio by using
> > "hadoop dfs -du /hbase", and dividing by the number of records/table.
> > But this ignores any kind of fixed overhead, so I have concerns
> > about it.
> >
> > Is there a better way?
> >
> > Norbert
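For concreteness, Xavier, is something like this roughly the kind of job you have in mind? A minimal, untested sketch against the TableMapper / TableMapReduceUtil API; the class name and arguments are placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RowSizeJob {

  static class RowSizeMapper extends TableMapper<Text, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context)
        throws IOException, InterruptedException {
      long bytes = 0;
      // Sum the serialized key + value length of every cell in this row.
      for (KeyValue kv : result.raw()) {
        bytes += kv.getLength();
      }
      context.write(new Text(Bytes.toStringBinary(row.get())), new LongWritable(bytes));
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] = table name, args[1] = HDFS output dir (both placeholders)
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "row-size");
    job.setJarByClass(RowSizeJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner batches for a full-table scan
    scan.setCacheBlocks(false);  // don't churn the block cache while scanning

    TableMapReduceUtil.initTableMapperJob(args[0], scan, RowSizeMapper.class,
        Text.class, LongWritable.class, job);
    job.setNumReduceTasks(0);    // map-only: emit one (rowkey, bytes) line per row
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One caveat I can already see: KeyValue.getLength() counts the uncompressed key and value (plus the KeyValue's own framing), so this should give me the variable bytes/record piece, but it still won't reflect HFile block indexes, bloom filters, or compression, which I guess is the fixed/container overhead I'd have to estimate separately (e.g., from hadoop dfs -du).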
