Re: should i use compression?

Marcos Luis Ortiz Valmaseda Wed, 03 Apr 2013 08:42:02 -0700

Here's the API documentation:

*FAST_DIFF*:
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.html


"Encoder similar to
DiffKeyDeltaEncoder
<http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.html>
but supposedly faster.

Compress using:
- store size of common prefix
- save column family once in the first KeyValue
- use integer compression for key, value and prefix (7-bit encoding)
- use bits to avoid duplication key length, value length and type if it
  same as previous
- store in 3 bits length of prefix timestamp with previous KeyValue's
  timestamp
- one bit which allow to omit value if it is the same

Format:
- 1 byte: flag
- 1-5 bytes: key length (only if FLAG_SAME_KEY_LENGTH is not set in flag)
- 1-5 bytes: value length (only if FLAG_SAME_VALUE_LENGTH is not set in
  flag)
- 1-5 bytes: prefix length
- ... bytes: rest of the row (if prefix length is small enough)
- ... bytes: qualifier (or suffix depending on prefix length)
- 1-8 bytes: timestamp suffix
- 1 byte: type (only if FLAG_SAME_TYPE is not set in the flag)
- ... bytes: value (only if FLAG_SAME_VALUE is not set in the flag)"
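
To make the "size of common prefix" and "7-bit encoding" points a bit more concrete, here is a rough Java sketch. It is not the HBase code: the class name PrefixVarintSketch, its method names, and the exact byte order of the variable-length integer are my own assumptions for illustration. It only shows why a key that shares most of its bytes with the previous key shrinks to a few bytes, and why a length takes 1-5 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; NOT the HBase FastDiffDeltaEncoder implementation.
final class PrefixVarintSketch {

    // Number of leading bytes that the previous and current key share.
    static int commonPrefixLength(byte[] previous, byte[] current) {
        int limit = Math.min(previous.length, current.length);
        int i = 0;
        while (i < limit && previous[i] == current[i]) {
            i++;
        }
        return i;
    }

    // Write an int using 7 data bits per byte; the high bit of each byte
    // means "more bytes follow". A 32-bit value therefore takes 1 to 5 bytes.
    // (Low-order group first is an assumption of this sketch.)
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Hypothetical layout: varint(prefix length), varint(suffix length), suffix bytes.
    static byte[] encodeKey(byte[] previous, byte[] current) {
        int prefix = commonPrefixLength(previous, current);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, prefix);
        writeVarint(out, current.length - prefix);
        out.write(current, prefix, current.length - prefix);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] previous = "row0001/q1".getBytes(StandardCharsets.UTF_8);
        byte[] current = "row0001/q2".getBytes(StandardCharsets.UTF_8);
        // The 10-byte key shares 9 bytes with its predecessor,
        // so only the lengths and the final byte are written.
        System.out.println(Arrays.toString(encodeKey(previous, current)));
    }
}
```

For a table like Prakash's, where each row has many qualifiers and null values, consecutive keys share the whole row and family, which is the case where this kind of prefix sharing should pay off most.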

*DIFF*:
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.html

"Compress using:
- store size of common prefix
- save column family once, it is same within HFile
- use integer compression for key, value and prefix (7-bit encoding)
- use bits to avoid duplication key length, value length and type if it
  same as previous
- store in 3 bits length of timestamp field
- allow diff in timestamp instead of actual value

Format:
- 1 byte: flag
- 1-5 bytes: key length (only if FLAG_SAME_KEY_LENGTH is not set in flag)
- 1-5 bytes: value length (only if FLAG_SAME_VALUE_LENGTH is not set in
  flag)
- 1-5 bytes: prefix length
- ... bytes: rest of the row (if prefix length is small enough)
- ... bytes: qualifier (or suffix depending on prefix length)
- 1-8 bytes: timestamp or diff
- 1 byte: type (only if FLAG_SAME_TYPE is not set in the flag)
- ... bytes: value"
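
The "timestamp or diff" idea can be sketched in a similar way. Again this is only my reading of the Javadoc, not the HBase source: the class TimestampDiffSketch and the flag bit positions below are hypothetical. The sketch picks whichever of the full timestamp or the difference from the previous timestamp needs fewer bytes, and records the byte count (1 to 8, stored as count minus one) in 3 bits of the flag.

```java
// Illustrative sketch only; NOT the HBase DiffKeyDeltaEncoder implementation.
final class TimestampDiffSketch {

    // Hypothetical flag layout for this sketch (not taken from HBase):
    // bits 0-2: (byte count - 1) of the stored number
    // bit 3:    the stored number is a diff from the previous timestamp
    // bit 4:    the stored diff is negative
    static final int LENGTH_MASK = 0x07;
    static final int FLAG_IS_DIFF = 0x08;
    static final int FLAG_NEGATIVE = 0x10;

    // Minimal number of bytes (1 to 8) needed to hold the value.
    // Negative inputs are treated as needing all 8 bytes.
    static int bytesNeeded(long value) {
        int bits = 64 - Long.numberOfLeadingZeros(value);
        return Math.max(1, (bits + 7) / 8);
    }

    // Decide whether to store the timestamp itself or the diff, and
    // return the flag bits that describe that choice.
    static int chooseFlag(long previousTs, long currentTs) {
        long diff = currentTs - previousTs;
        int tsBytes = bytesNeeded(currentTs);
        int diffBytes = bytesNeeded(Math.abs(diff));
        if (diffBytes < tsBytes) {
            int flag = (diffBytes - 1) | FLAG_IS_DIFF;
            if (diff < 0) {
                flag |= FLAG_NEGATIVE;
            }
            return flag;
        }
        return tsBytes - 1;
    }

    public static void main(String[] args) {
        long previous = 1364999922000L; // epoch millis
        long current = 1364999922005L;  // 5 ms later
        int flag = chooseFlag(previous, current);
        System.out.println("stored bytes: " + ((flag & LENGTH_MASK) + 1));
        System.out.println("is diff: " + ((flag & FLAG_IS_DIFF) != 0));
    }
}
```

With millisecond timestamps a full value takes 6 bytes, while cells written close together differ by an amount that fits in 1 or 2 bytes, which is where the saving comes from.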

I went through the FAQ and found nothing on this topic. It would be nice
to cover it in the documentation.

Lars, what do you think? It would be nice if you could write a detailed
blog post about this topic.





2013/4/3 Jean-Marc Spaggiari <[email protected]>

> I read the JIRA already but it was not clear to me. However Cloudera's
> link is very clear. Thanks for that. Any idea what's the difference
> between DIFF and FAST_DIFF?
>
> 2013/4/3 Marcos Luis Ortiz Valmaseda <[email protected]>:
> > You can also read this JIRA issue:
> > https://issues.apache.org/jira/browse/HBASE-4218
> >
> >
> >
> > 2013/4/3 Marcos Luis Ortiz Valmaseda <[email protected]>
> >>
> >> Regards, Jean-Marc.
> >> The best resource that I found for this is a great blog post called
> >> "Apache HBase I/O - HFile" by Matteo Bertozzi on Cloudera's blog.
> >> Here's the link:
> >> http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
> >>
> >>
> >>
> >>
> >> 2013/4/3 Jean-Marc Spaggiari <[email protected]>
> >>>
> >>> Is there any documentation anywhere regarding the differences between
> >>> PREFIX, DIFF and FAST_DIFF?
> >>>
> >>> 2013/4/3 prakash kadel <[email protected]>:
> >>> > thank you very much.
> >>> > i will try with snappy compression with data_block_encoding
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Apr 3, 2013 at 11:21 PM, Kevin O'dell
> >>> > <[email protected]>wrote:
> >>> >
> >>> >> Prakash,
> >>> >>
> >>> >>   Yes, I would recommend Snappy Compression.
> >>> >>
> >>> >> On Wed, Apr 3, 2013 at 10:18 AM, Prakash Kadel
> >>> >> <[email protected]>
> >>> >> wrote:
> >>> >> > Thanks,
> >>> >> >     is there any specific compression that is recommended for the
> >>> >> > use case i have?
> >>> >> >    Since my values are all null, will compression help?
> >>> >> >
> >>> >> >  I am thinking of using prefix data_block_encoding.
> >>> >> > Sincerely,
> >>> >> > Prakash Kadel
> >>> >> >
> >>> >> >
> >>> >> > On Apr 3, 2013, at 10:55 PM, Ted Yu wrote:
> >>> >> >
> >>> >> >> You should use data block encoding (in 0.94.x releases only). It is
> >>> >> >> helpful for reads.
> >>> >> >>
> >>> >> >> You can also enable compression.
> >>> >> >>
> >>> >> >> Cheers
> >>> >> >>
> >>> >> >>
> >>> >> >> On Wed, Apr 3, 2013 at 6:42 AM, Prakash Kadel
> >>> >> >> <[email protected]> wrote:
> >>> >> >>
> >>> >> >>> Hello,
> >>> >> >>>    I have a question.
> >>> >> >>>    I have a table where i store data in the column qualifiers (the
> >>> >> >>> values themselves are null).
> >>> >> >>>    I just have 1 column family.
> >>> >> >>>   The number of columns per row is variable (1~ few thousands)
> >>> >> >>>
> >>> >> >>> Currently i don't use compression or the data_block_encoding.
> >>> >> >>>
> >>> >> >>> Should i?
> >>> >> >>> I want to have faster reads.
> >>> >> >>>
> >>> >> >>> Please suggest.
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> Sincerely,
> >>> >> >>> Prakash Kadel
> >>> >> >
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Kevin O'Dell
> >>> >> Systems Engineer, Cloudera
> >>> >>
> >>
> >>
> >>
> >>
> >> --
> >> Marcos Ortiz Valmaseda,
> >> Data-Driven Product Manager at PDVSA
> >> Blog: http://dataddict.wordpress.com/
> >> LinkedIn: http://www.linkedin.com/in/marcosluis2186
> >> Twitter: @marcosluis2186
> >
> >
> >
> >
> > --
> > Marcos Ortiz Valmaseda,
> > Data-Driven Product Manager at PDVSA
> > Blog: http://dataddict.wordpress.com/
> > LinkedIn: http://www.linkedin.com/in/marcosluis2186
> > Twitter: @marcosluis2186
>



-- 
Marcos Ortiz Valmaseda,
*Data-Driven Product Manager* at PDVSA
*Blog*: http://dataddict.wordpress.com/
*LinkedIn*: http://www.linkedin.com/in/marcosluis2186
*Twitter*: @marcosluis2186 <http://twitter.com/marcosluis2186>
