Hi all,

I'm trying to measure the size (in bytes) of the data I'm about to load into HBase. I'm using bulk load with PutSortReducer. All bulk load data is loaded into new regions, not added to existing ones.
To estimate the size of all KeyValues in a Put object, I iterate over the Put's familyMap.values() and sum the KeyValue lengths. After loading the data, I check the region size by summing RegionLoad.getStorefileSizeMB() across the regions. Counting the Put sizes predicted ~500MB per region, but in practice I got ~32MB per region. The table uses GZ compression, but that alone shouldn't account for such a large difference.

Is summing a Put's KeyValue lengths the correct way to measure a row's size? Is it comparable to the store file size?

Thanks,
Amit.
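For reference, here's a minimal sketch of the estimate I'm computing, assuming the older (0.94-era) HBase client API where Put.getFamilyMap() returns Map<byte[], List<KeyValue>> (the class and method names PutSizeEstimator / estimatePutSize are just illustrative):

```java
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;

public class PutSizeEstimator {
    /** Sum the serialized lengths of all KeyValues in a Put. */
    public static long estimatePutSize(Put put) {
        long bytes = 0;
        for (List<KeyValue> kvs : put.getFamilyMap().values()) {
            for (KeyValue kv : kvs) {
                // getLength() is the full serialized KeyValue length:
                // key (row + family + qualifier + timestamp + type) + value.
                bytes += kv.getLength();
            }
        }
        return bytes;
    }
}
```

Note that this counts the full serialized KeyValue, so the row key, family, and qualifier are counted once per cell, whereas the on-disk HFiles are block-compressed, which is why I'd expect some shrinkage, just not ~15x.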
