See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <[email protected]> wrote:

> Hi all,
> I'm trying to measure the size (in bytes) of the data I'm about to load
> into HBase. I'm using bulk load with PutSortReducer. All bulk load data
> is loaded into new regions, not added to existing ones.
>
> To count the size of all KeyValues in a Put object, I iterate over the
> Put's familyMap.values() and sum the KeyValue lengths. After loading the
> data, I check the region size by summing RegionLoad.getStorefileSizeMB().
> Counting the Put objects' sizes predicted ~500MB per region, but in
> practice I got ~32MB per region. The table uses GZ compression, but that
> alone cannot account for such a difference.
>
> Is counting the Put's KeyValues the correct way to measure a row's size?
> Is it comparable to the store file size?
>
> Thanks,
> Amit.
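For reference, a minimal sketch of the two measurements described above, assuming the 0.94/0.96-era client API (Put.getFamilyMap(), KeyValue.getLength(), ClusterStatus.getLoad(), RegionLoad.getStorefileSizeMB()). Exact method names and return types vary between HBase versions, so treat this as illustrative, not definitive:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.RegionLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Put;

public class SizeEstimates {

  // Sum the serialized lengths of every KeyValue in a Put, as described
  // in the question above. KeyValue.getLength() covers key + value plus
  // the KeyValue's own metadata (row/family/qualifier/timestamp), so this
  // measures the uncompressed, serialized footprint in memory, not the
  // on-disk store file size.
  public static long putSizeBytes(Put put) {
    long total = 0;
    for (List<KeyValue> kvs : put.getFamilyMap().values()) {
      for (KeyValue kv : kvs) {
        total += kv.getLength();
      }
    }
    return total;
  }

  // Sum store file sizes (in MB) across all regions in the cluster by
  // walking ClusterStatus, i.e. the RegionLoad.getStorefileSizeMB()
  // approach mentioned above. Filtering to a single table's regions is
  // left out for brevity.
  public static long totalStorefileSizeMB(Configuration conf) throws IOException {
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      long totalMB = 0;
      ClusterStatus status = admin.getClusterStatus();
      for (ServerName server : status.getServers()) {
        for (RegionLoad rl : status.getLoad(server).getRegionsLoad().values()) {
          totalMB += rl.getStorefileSizeMB();
        }
      }
      return totalMB;
    } finally {
      admin.close();
    }
  }
}

Note that the two numbers are not directly comparable: the first is the uncompressed serialized size of the KeyValues, while the second reflects on-disk HFiles after compression and encoding.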
