There can be a lot of duplication in what ends up in HFiles, but 500MB -> 32MB does seem too good to be true.
Could you try writing without GZIP, or mess with the hfile reader [1] to see
what your keys look like when at rest in an HFile (and maybe save the
decompressed HFile to compare sizes)?

St.Ack

1. http://hbase.apache.org/book.html#hfile

On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela <[email protected]> wrote:

> I'm talking about the store file size and the ratio between store file
> size and the byte count as counted in PutSortReducer.
>
>
> On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu <[email protected]> wrote:
>
> > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
> >
> >
> > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <[email protected]> wrote:
> >
> > > Hi all,
> > > I'm trying to measure the size (in bytes) of the data I'm about to
> > > load into HBase.
> > > I'm using bulk load with PutSortReducer.
> > > All bulk load data is loaded into new regions and not added to
> > > existing ones.
> > >
> > > In order to count the size of all KeyValues in the Put object, I
> > > iterate over the Put's familyMap.values() and sum the KeyValue lengths.
> > > After loading the data, I check the region size by summing
> > > RegionLoad.getStorefileSizeMB().
> > > Counting the Put objects' size predicted ~500MB per region, but in
> > > practice I got ~32MB per region.
> > > The table uses GZ compression, but this cannot be the cause of such a
> > > difference.
> > >
> > > Is counting the Put's KeyValues the correct way to count a row size?
> > > Is it comparable to the store file size?
> > >
> > > Thanks,
> > > Amit.
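
For [1], the HFile tool can dump a store file's metadata and keys so you can
see exactly what is sitting on disk. A minimal invocation might look like the
line below; the HDFS path is only a placeholder for one of your own store
files, and the exact flags vary a bit between HBase versions:

  hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -f hdfs://namenode:8020/hbase/mytable/<region>/<family>/<hfile>

-m prints the file's meta information (including the compression in use), and
adding -p prints every key/value so the amount of repeated row/family/qualifier
data in the keys is easy to eyeball.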

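On the counting side, here is a minimal sketch of the per-Put byte count
described in the thread, assuming a 0.94-style client where Put.getFamilyMap()
returns lists of KeyValues (the class name is made up for illustration):

import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;

public class PutSizeEstimator {

  // Sum the serialized length of every KeyValue carried by the Put.
  // getLength() covers the whole serialized cell: the length fields plus the
  // key (row, family, qualifier, timestamp, type) plus the value.
  public static long keyValueBytes(Put put) {
    long total = 0L;
    for (List<KeyValue> kvs : put.getFamilyMap().values()) {
      for (KeyValue kv : kvs) {
        total += kv.getLength();
      }
    }
    return total;
  }
}

Note that this total is the flat, uncompressed KeyValue size, with the row key,
family and qualifier repeated in every cell, whereas getStorefileSizeMB()
reports the on-disk size after GZ compression, so some gap is expected; as
noted above, though, 500MB -> 32MB still looks too good to be true.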