Yes.
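
Block compression (GZ, Snappy, LZO) is applied only when HFile blocks are
written to and read from disk, while data block encoding also keeps blocks
encoded in the block cache, so once compression is on, the extra saving from
DATA_BLOCK_ENCODING shows up mostly in memory. For what it's worth, a minimal
sketch of enabling it on the 'readings' family, assuming an 0.94 client, the
HColumnDescriptor/HBaseAdmin API, and a hypothetical table name 'sensor_data':

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class EnableEncoding {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Keep on-disk compression and add data block encoding, which
            // also applies to blocks held in the block cache.
            HColumnDescriptor readings = new HColumnDescriptor("readings");
            readings.setCompressionType(Compression.Algorithm.SNAPPY);
            readings.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);

            admin.disableTable("sensor_data");            // hypothetical table name
            admin.modifyColumn("sensor_data", readings);
            admin.enableTable("sensor_data");
            admin.close();
        }
    }
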
On Mon, Jan 27, 2014 at 3:34 PM, Koert Kuipers <[email protected]> wrote:

> If compression is already enabled on a column family, do I understand it
> correctly that the main benefit of DATA_BLOCK_ENCODING is in memory?
>
> On Mon, Jan 27, 2014 at 6:02 PM, Nick Xie <[email protected]> wrote:
>
> > Thanks all for the information. Appreciated! I'll take a look and try.
> >
> > Thanks,
> >
> > Nick
> >
> > On Mon, Jan 27, 2014 at 2:43 PM, Vladimir Rodionov
> > <[email protected]> wrote:
> >
> > > The overhead of storing small values is quite high in HBase unless you
> > > use DATA_BLOCK_ENCODING (not available in 0.92). I recommend moving to
> > > 0.94.latest.
> > >
> > > See: https://issues.apache.org/jira/browse/HBASE-4218
> > >
> > > Best regards,
> > > Vladimir Rodionov
> > > Principal Platform Engineer
> > > Carrier IQ, www.carrieriq.com
> > > e-mail: [email protected]
> > >
> > > ________________________________________
> > > From: Nick Xie [[email protected]]
> > > Sent: Monday, January 27, 2014 2:40 PM
> > > To: [email protected]
> > > Subject: Re: HBase 6x bigger than raw data
> > >
> > > Tom,
> > >
> > > Yes, you are right. According to this analysis
> > > (http://prafull-blog.blogspot.in/2012/06/how-to-calculate-record-size-of-hbase.html),
> > > if it is correct, the overhead is quite big when the cell value makes up
> > > only a small portion of each record.
> > >
> > > In that analysis the overhead is actually 10x! (The real values take
> > > only 12B, yet it costs 123B in HBase to store them.) Is that real?
> > >
> > > In this case, should we do some combination to reduce the overhead?
> > >
> > > Thanks,
> > >
> > > Nick
> > >
> > > On Mon, Jan 27, 2014 at 2:33 PM, Tom Brown <[email protected]> wrote:
> > >
> > > > I believe each cell stores its own copy of the entire row key, column
> > > > qualifier, and timestamp. Could that account for the increase in size?
> > > >
> > > > --Tom
> > > >
> > > > On Mon, Jan 27, 2014 at 3:12 PM, Nick Xie <[email protected]> wrote:
> > > >
> > > > > I'm importing a set of data into HBase. The CSV file contains 82
> > > > > entries per line: an 8-byte ID, followed by a 16-byte date, and then
> > > > > 80 numbers of 4 bytes each.
> > > > >
> > > > > The current HBase schema is: the ID as the row key, the date in a
> > > > > 'date' family under a 'value' qualifier, and the rest in another
> > > > > family called 'readings' with 'P0', 'P1', 'P2', ... through 'P79' as
> > > > > qualifiers.
> > > > >
> > > > > I'm testing this on a single-node cluster with HBase running in
> > > > > pseudo-distributed mode (no replication, no compression for HBase).
> > > > > After importing a CSV file of 150MB in HDFS (no replication), I
> > > > > checked the table size, and it shows ~900MB, which is 6x larger than
> > > > > it is in HDFS.
> > > > >
> > > > > Why is there such a large overhead? Am I doing anything wrong here?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Nick
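
For reference against the numbers quoted above, a rough per-cell estimate for
this schema, following the pre-0.96 KeyValue layout that the linked blog post
describes (it ignores block index and HFile metadata, so it is only a ballpark,
but it lands close to the ~10x figure from the post):

    // Rough size of one 'readings' cell: keyLen(4) + valLen(4) + rowLen(2) + row
    // + familyLen(1) + family + qualifier + timestamp(8) + keyType(1) + value.
    public class CellOverhead {
        public static void main(String[] args) {
            int row = 8;                       // 8-byte ID used as the row key
            int family = "readings".length();  // 8 bytes
            int qualifier = "P0".length();     // 2-3 bytes ('P0' .. 'P79')
            int value = 4;                     // one 4-byte reading

            int key = 2 + row + 1 + family + qualifier + 8 + 1;  // 30 bytes
            int cell = 4 + 4 + key + value;                      // 42 bytes

            // ~42 bytes stored per 4-byte reading, i.e. roughly 10x overhead,
            // or about 3.4KB per row for 320 bytes of raw readings.
            System.out.println(cell + "B per cell, " + (80 * cell) + "B per row");
        }
    }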
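
On the "should we do some combination" question quoted above: a common way to
cut this overhead is to pack all 80 readings into a single cell, so the row
key, family, qualifier and timestamp are stored once per row instead of once
per reading. A minimal sketch, assuming the 0.94 HTable/Put API and the same
hypothetical 'sensor_data' table; the 'all' qualifier and the key/date
encodings are just placeholders:

    import java.nio.ByteBuffer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PackedRowWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "sensor_data");  // hypothetical table name

            byte[] rowKey = Bytes.toBytes(12345678L);        // the 8-byte ID

            // Serialize the 80 4-byte readings into a single value.
            ByteBuffer packed = ByteBuffer.allocate(80 * 4);
            for (int i = 0; i < 80; i++) {
                packed.putInt(i);                            // stand-in for reading P<i>
            }

            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("readings"), Bytes.toBytes("all"), packed.array());
            put.add(Bytes.toBytes("date"), Bytes.toBytes("value"), Bytes.toBytes("2014-01-27"));
            table.put(put);
            table.close();
        }
    }

The trade-off is that individual readings can no longer be fetched or filtered
server-side without reading and decoding the whole packed value.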
