I'm importing a set of data into HBase. Each line of the CSV file contains 82 fields: an 8-byte ID, followed by a 16-byte date, and then 80 numeric readings of 4 bytes each.
The current HBase schema is: the ID as the row key, the date in a 'date' family under a 'value' qualifier, and the 80 readings in a second family called 'readings' under the qualifiers 'P0', 'P1', 'P2', ... through 'P79'. A rough sketch of how each CSV line maps onto this schema is at the end of this message.

I'm testing this on a single-node cluster with HBase running in pseudo-distributed mode (no replication, no compression for HBase). After importing a CSV file that takes 150 MB in HDFS (no replication), I checked the table size, and it shows ~900 MB, about 6x larger than the file in HDFS. Why is there such a large overhead? Am I doing anything wrong here?

Thanks,
Nick
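
For reference, this is roughly how each CSV line maps onto the schema; it is only an illustrative sketch with the standard HBase client API (Connection/Table/Put), not my actual importer. The table name 'sensor_readings' and the parsing of the ID as a long and the readings as floats are just assumptions for the example.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CsvToHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // table name is hypothetical, used only for this sketch
             Table table = conn.getTable(TableName.valueOf("sensor_readings"));
             BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                // 82 fields per line: id, date, P0..P79
                String[] f = line.split(",");
                // 8-byte ID as the row key (assuming it is numeric)
                Put put = new Put(Bytes.toBytes(Long.parseLong(f[0])));
                // 16-byte date under the 'date' family, 'value' qualifier
                put.addColumn(Bytes.toBytes("date"), Bytes.toBytes("value"),
                              Bytes.toBytes(f[1]));
                // 80 readings of 4 bytes each under the 'readings' family, P0..P79
                for (int i = 0; i < 80; i++) {
                    put.addColumn(Bytes.toBytes("readings"),
                                  Bytes.toBytes("P" + i),
                                  Bytes.toBytes(Float.parseFloat(f[i + 2])));
                }
                table.put(put);
            }
        }
    }
}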
