Yes.
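
Block compression (GZ, Snappy, LZO) is applied only when HFile blocks are
written to and read from disk, while data block encoding also keeps blocks
encoded in the block cache, so once compression is on, the extra saving from
DATA_BLOCK_ENCODING shows up mostly in memory. For what it's worth, a minimal
sketch of enabling it on the 'readings' family, assuming an 0.94 client, the
HColumnDescriptor/HBaseAdmin API, and a hypothetical table name 'sensor_data':

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class EnableEncoding {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Keep on-disk compression and add data block encoding, which
            // also applies to blocks held in the block cache.
            HColumnDescriptor readings = new HColumnDescriptor("readings");
            readings.setCompressionType(Compression.Algorithm.SNAPPY);
            readings.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);

            admin.disableTable("sensor_data");            // hypothetical table name
            admin.modifyColumn("sensor_data", readings);
            admin.enableTable("sensor_data");
            admin.close();
        }
    }
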
On Mon, Jan 27, 2014 at 3:34 PM, Koert Kuipers <[email protected]> wrote:

> If compression is already enabled on a column family, do I understand it
> correctly that the main benefit of DATA_BLOCK_ENCODING is in memory?
>
> On Mon, Jan 27, 2014 at 6:02 PM, Nick Xie <[email protected]> wrote:
>
> > Thanks all for the information. Appreciated! I'll take a look and try.
> >
> > Thanks,
> >
> > Nick
> >
> > On Mon, Jan 27, 2014 at 2:43 PM, Vladimir Rodionov
> > <[email protected]> wrote:
> >
> > > The overhead of storing small values is quite high in HBase unless you
> > > use DATA_BLOCK_ENCODING (not available in 0.92). I recommend moving to
> > > 0.94.latest.
> > >
> > > See: https://issues.apache.org/jira/browse/HBASE-4218
> > >
> > > Best regards,
> > > Vladimir Rodionov
> > > Principal Platform Engineer
> > > Carrier IQ, www.carrieriq.com
> > > e-mail: [email protected]
> > >
> > > ________________________________________
> > > From: Nick Xie [[email protected]]
> > > Sent: Monday, January 27, 2014 2:40 PM
> > > To: [email protected]
> > > Subject: Re: HBase 6x bigger than raw data
> > >
> > > Tom,
> > >
> > > Yes, you are right. According to this analysis
> > > (http://prafull-blog.blogspot.in/2012/06/how-to-calculate-record-size-of-hbase.html),
> > > if it is correct, the overhead is quite big when the cell value makes up
> > > only a small portion of each record.
> > >
> > > In that analysis the overhead is actually 10x! (The real values take
> > > only 12B, yet it costs 123B in HBase to store them.) Is that real?
> > >
> > > In this case, should we do some combination to reduce the overhead?
> > >
> > > Thanks,
> > >
> > > Nick
> > >
> > > On Mon, Jan 27, 2014 at 2:33 PM, Tom Brown <[email protected]> wrote:
> > >
> > > > I believe each cell stores its own copy of the entire row key, column
> > > > qualifier, and timestamp. Could that account for the increase in size?
> > > >
> > > > --Tom
> > > >
> > > > On Mon, Jan 27, 2014 at 3:12 PM, Nick Xie <[email protected]> wrote:
> > > >
> > > > > I'm importing a set of data into HBase. The CSV file contains 82
> > > > > entries per line: an 8-byte ID, followed by a 16-byte date, and then
> > > > > 80 numbers of 4 bytes each.
> > > > >
> > > > > The current HBase schema is: the ID as the row key, the date in a
> > > > > 'date' family under a 'value' qualifier, and the rest in another
> > > > > family called 'readings' with 'P0', 'P1', 'P2', ... through 'P79' as
> > > > > qualifiers.
> > > > >
> > > > > I'm testing this on a single-node cluster with HBase running in
> > > > > pseudo-distributed mode (no replication, no compression for HBase).
> > > > > After importing a CSV file of 150MB in HDFS (no replication), I
> > > > > checked the table size, and it shows ~900MB, which is 6x larger than
> > > > > it is in HDFS.
> > > > >
> > > > > Why is there such a large overhead? Am I doing anything wrong here?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Nick
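
For reference against the numbers quoted above, a rough per-cell estimate for
this schema, following the pre-0.96 KeyValue layout that the linked blog post
describes (it ignores block index and HFile metadata, so it is only a ballpark,
but it lands close to the ~10x figure from the post):

    // Rough size of one 'readings' cell: keyLen(4) + valLen(4) + rowLen(2) + row
    // + familyLen(1) + family + qualifier + timestamp(8) + keyType(1) + value.
    public class CellOverhead {
        public static void main(String[] args) {
            int row = 8;                       // 8-byte ID used as the row key
            int family = "readings".length();  // 8 bytes
            int qualifier = "P0".length();     // 2-3 bytes ('P0' .. 'P79')
            int value = 4;                     // one 4-byte reading

            int key = 2 + row + 1 + family + qualifier + 8 + 1;  // 30 bytes
            int cell = 4 + 4 + key + value;                      // 42 bytes

            // ~42 bytes stored per 4-byte reading, i.e. roughly 10x overhead,
            // or about 3.4KB per row for 320 bytes of raw readings.
            System.out.println(cell + "B per cell, " + (80 * cell) + "B per row");
        }
    }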
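
On the "should we do some combination" question quoted above: a common way to
cut this overhead is to pack all 80 readings into a single cell, so the row
key, family, qualifier and timestamp are stored once per row instead of once
per reading. A minimal sketch, assuming the 0.94 HTable/Put API and the same
hypothetical 'sensor_data' table; the 'all' qualifier and the key/date
encodings are just placeholders:

    import java.nio.ByteBuffer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PackedRowWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "sensor_data");  // hypothetical table name

            byte[] rowKey = Bytes.toBytes(12345678L);        // the 8-byte ID

            // Serialize the 80 4-byte readings into a single value.
            ByteBuffer packed = ByteBuffer.allocate(80 * 4);
            for (int i = 0; i < 80; i++) {
                packed.putInt(i);                            // stand-in for reading P<i>
            }

            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("readings"), Bytes.toBytes("all"), packed.array());
            put.add(Bytes.toBytes("date"), Bytes.toBytes("value"), Bytes.toBytes("2014-01-27"));
            table.put(put);
            table.close();
        }
    }

The trade-off is that individual readings can no longer be fetched or filtered
server-side without reading and decoding the whole packed value.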
