Re: HBase 6x bigger than raw data

Tom Brown Mon, 27 Jan 2014 15:00:07 -0800

Does enabling compression include prefix compression (HBASE-4218), or is
there a separate switch for that?


--Tom


On Mon, Jan 27, 2014 at 3:48 PM, Ted Yu <[email protected]> wrote:

> To make better use of block cache, see:
>
> HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix
> compression)
>
> which is in 0.94 and above
>
> To reduce size of HFiles, please see:
> http://hbase.apache.org/book.html#compression
>
>
> On Mon, Jan 27, 2014 at 2:40 PM, Nick Xie <[email protected]>
> wrote:
>
> > Tom,
> >
> > Yes, you are right. According to this analysis
> > (http://prafull-blog.blogspot.in/2012/06/how-to-calculate-record-size-of-hbase.html),
> > if it is right, the overhead is quite big when the cell value occupies
> > only a small portion of the stored cell.
> >
> > In the analysis in that link, the overhead is actually about 10x (the real
> > values take only 12B, but it costs 123B in HBase to store them). Is that
> > real?
> >
> > In this case, should we combine several values into one cell to reduce
> > the overhead?
> >
> > Thanks,
> >
> > Nick
> >
> >
> >
> >
> > On Mon, Jan 27, 2014 at 2:33 PM, Tom Brown <[email protected]> wrote:
> >
> > > I believe each cell stores its own copy of the entire row key, column
> > > qualifier, and timestamp. Could that account for the increase in size?
> > >
> > > --Tom
> > >
> > >
> > > On Mon, Jan 27, 2014 at 3:12 PM, Nick Xie <[email protected]>
> > > wrote:
> > >
> > > > I'm importing a set of data into HBase. The CSV file contains 82
> > > > entries per line: an 8-byte ID, followed by a 16-byte date, and then
> > > > 80 numbers of 4 bytes each.
> > > >
> > > > The current HBase schema is: ID as the row key, the date in a 'date'
> > > > family with a 'value' qualifier, and the rest in another family called
> > > > 'readings' with 'P0', 'P1', 'P2', ... through 'P79' as qualifiers.
> > > >
> > > > I'm testing this on a single-node cluster with HBase running in
> > > > pseudo-distributed mode (no replication, no compression for HBase).
> > > > After importing a 150MB CSV file stored in HDFS (no replication), I
> > > > checked the table size, and it shows ~900MB, which is 6x larger than
> > > > the file is in HDFS.
> > > >
> > > > Why is there such a large overhead? Am I doing anything wrong here?
> > > >
> > > > Thanks,
> > > >
> > > > Nick
> > > >
> > >
> >
>
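
As a rough cross-check of the per-cell overhead discussed in this thread,
here is a minimal Python sketch that adds up the classic KeyValue fields
(4-byte key length, 4-byte value length, 2-byte row length, 1-byte family
length, 8-byte timestamp, 1-byte key type) for the schema Nick describes.
It assumes the ID is stored as an 8-byte row key and each reading as a
4-byte value, which the thread does not confirm; the function and variable
names are made up for illustration. It ignores HFile indexes, checksums,
bloom filters and any per-cell MVCC field, so it is not meant to reproduce
the ~900MB figure.

```python
# Fixed bytes stored with every cell in the classic KeyValue layout:
# key length (4) + value length (4) + row length (2) + family length (1)
# + timestamp (8) + key type (1).
FIXED_BYTES_PER_CELL = 4 + 4 + 2 + 1 + 8 + 1


def cell_size(row_key: bytes, family: bytes, qualifier: bytes, value_len: int) -> int:
    """Estimated serialized size of one cell (hypothetical helper)."""
    return FIXED_BYTES_PER_CELL + len(row_key) + len(family) + len(qualifier) + value_len


# Assumption: the 8-byte ID is used as-is for the row key.
row_key = b"\x00" * 8

# Schema from the thread: one 'date:value' cell plus 'readings:P0'..'readings:P79'.
date_cell = cell_size(row_key, b"date", b"value", 16)
reading_cells = sum(
    cell_size(row_key, b"readings", f"P{i}".encode(), 4) for i in range(80)
)
per_row_wide = date_cell + reading_cells

# Bytes of actual payload per row: ID + date + 80 readings.
raw_bytes = 8 + 16 + 80 * 4

# Alternative: pack all 80 readings into a single cell. The one-byte
# qualifier 'P' is a placeholder chosen for this sketch.
per_row_packed = date_cell + cell_size(row_key, b"readings", b"P", 80 * 4)

print(f"one cell per reading : {per_row_wide} bytes/row "
      f"({per_row_wide / raw_bytes:.1f}x raw)")
print(f"readings in one cell : {per_row_packed} bytes/row "
      f"({per_row_packed / raw_bytes:.1f}x raw)")
```

Under these assumptions each row costs 3483 bytes as 81 cells against 344
bytes of values, roughly 10x, while packing the 80 readings into one cell
brings it to 410 bytes. Data block encoding or compression may reduce the
on-disk size further without a schema change.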
