I believe each cell stores its own copy of the entire row key, column family, column qualifier, and timestamp. Could that account for the increase in size?
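For a rough sense of the multiplier, here's a back-of-envelope sketch (not a definitive accounting) that assumes the classic on-disk KeyValue layout: 4-byte key length, 4-byte value length, then the key (2-byte row length, row key, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte key type), then the value. The schema constants below mirror your description; note the "raw" figure is the binary payload per row, so the exact ratio against the CSV text on disk will differ since the CSV stores the numbers as delimited ASCII:

    // Approximate on-disk size of one cell in the classic KeyValue layout:
    // [4B key len][4B val len][2B row len][row][1B fam len][family]
    // [qualifier][8B timestamp][1B key type][value]
    public class KeyValueOverhead {
        static long cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
            int keyLen = 2 + rowLen + 1 + familyLen + qualifierLen + 8 + 1;
            return 4 + 4 + keyLen + valueLen;
        }

        public static void main(String[] args) {
            // One row: 8-byte ID as row key, one 16-byte date cell, and 80
            // readings of 4 bytes each under qualifiers 'P0'..'P79'.
            long row = cellSize(8, "date".length(), "value".length(), 16);
            for (int i = 0; i < 80; i++) {
                row += cellSize(8, "readings".length(), ("P" + i).length(), 4);
            }
            long rawBytes = 8 + 16 + 80 * 4;  // 344 bytes of raw data per row
            System.out.printf("HBase bytes per row: ~%d (~%.1fx the raw %d bytes)%n",
                    row, (double) row / rawBytes, rawBytes);
        }
    }

With a 4-byte value carrying ~38 bytes of key metadata per cell, an overhead in the 6-10x range is about what you'd expect; short family and qualifier names (or wider columns) shrink it considerably.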
--Tom

On Mon, Jan 27, 2014 at 3:12 PM, Nick Xie <[email protected]> wrote:
> I'm importing a set of data into HBase. The CSV file contains 82 entries
> per line, starting with an 8-byte ID, followed by a 16-byte date; the rest
> are 80 numbers of 4 bytes each.
>
> The current HBase schema is: the ID as the row key, the date in a 'date'
> family under a 'value' qualifier, and the rest in another family called
> 'readings' with 'P0', 'P1', 'P2', ... through 'P79' as qualifiers.
>
> I'm testing this on a single-node cluster with HBase running in
> pseudo-distributed mode (no replication, no compression for HBase). After
> importing a CSV file of 150MB in HDFS (no replication), I checked the
> table size, and it shows ~900MB, which is 6x larger than it is in HDFS.
>
> Why is there such a large overhead? Am I doing anything wrong here?
>
> Thanks,
>
> Nick
