Tom,

Yes, you are right. According to this analysis (http://prafull-blog.blogspot.in/2012/06/how-to-calculate-record-size-of-hbase.html), if it is correct, the overhead is quite big when the cell value makes up only a small portion of each KeyValue.

In the example in that link the overhead is actually 10x! (The real values take only 12B, yet it costs 123B to store them in HBase.) Is that real? If so, should we combine some columns into a single cell to reduce the overhead?
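To sanity-check the math, here is a rough back-of-the-envelope sketch in plain Java (it does not use the HBase API; the field sizes are my reading of the classic KeyValue layout described in that post: length prefixes, row key, family, qualifier, timestamp, type, value), so take the numbers as an estimate rather than gospel:

public class KeyValueSizeEstimate {

    // Rough size of one KeyValue, assuming the classic (pre-0.96) layout:
    //   keyLen(4) + valueLen(4) + rowLen(2) + row + famLen(1) + family
    //   + qualifier + timestamp(8) + keyType(1) + value
    static long cellSize(int rowKeyLen, int familyLen, int qualifierLen, int valueLen) {
        return 4 + 4              // key length + value length prefixes
             + 2 + rowKeyLen      // row length + row key bytes
             + 1 + familyLen      // family length + family bytes
             + qualifierLen       // qualifier bytes
             + 8                  // timestamp
             + 1                  // key type (Put, Delete, ...)
             + valueLen;          // the actual value
    }

    public static void main(String[] args) {
        // One 'readings' cell from the schema below:
        // 8-byte row key, family "readings", qualifier like "P12", 4-byte value.
        long reading = cellSize(8, "readings".length(), 3, 4);
        // The single 'date' cell: family "date", qualifier "value", 16-byte value.
        long date = cellSize(8, "date".length(), "value".length(), 16);

        long perRow = 80 * reading + date;
        long rawPayload = 8 + 16 + 80 * 4;   // binary payload per CSV line

        System.out.println("bytes per reading cell: " + reading);
        System.out.println("bytes per row in HBase: " + perRow);
        System.out.println("blow-up vs raw payload: " + (double) perRow / rawPayload);
    }
}

Under those assumptions each 4-byte reading costs about 43 bytes on disk, and a full row comes to roughly 3.5KB against ~344 bytes of actual payload, i.e. about a 10x blow-up, which is in the same ballpark as the 6x I measured against the text CSV (the CSV line itself is larger than 344 bytes since the numbers are stored as text).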
Thanks,

Nick

On Mon, Jan 27, 2014 at 2:33 PM, Tom Brown <[email protected]> wrote:

> I believe each cell stores its own copy of the entire row key, column
> qualifier, and timestamp. Could that account for the increase in size?
>
> --Tom
>
>
> On Mon, Jan 27, 2014 at 3:12 PM, Nick Xie <[email protected]> wrote:
>
> > I'm importing a set of data into HBase. The CSV file contains 82 entries
> > per line, starting with an 8-byte ID, followed by a 16-byte date, and
> > then 80 numbers of 4 bytes each.
> >
> > The current HBase schema is: ID as the row key, the date in a 'date'
> > family with a 'value' qualifier, and the rest in another family called
> > 'readings' with 'P0', 'P1', 'P2', ... through 'P79' as qualifiers.
> >
> > I'm testing this on a single-node cluster with HBase running in
> > pseudo-distributed mode (no replication, no compression for HBase).
> > After importing a CSV file of 150MB in HDFS (no replication), I checked
> > the table size, and it shows ~900MB, which is 6x larger than it is in
> > HDFS.
> >
> > Why is there such a large overhead? Am I doing anything wrong here?
> >
> > Thanks,
> >
> > Nick
