[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174961#comment-13174961 ]

Phabricator commented on HBASE-4218:

mbautin has commented on the revision "[jira] [HBASE-4218] HFile data block encoding (delta encoding)".

Replying to Matt's comments. A new version of the diff will follow.

@mcorgan: thanks for reviewing!

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java:137 Done.
  src/test/java/org/apache/hadoop/hbase/regionserver/EncodedSeekPerformanceTest.java:157 Done.
  src/test/java/org/apache/hadoop/hbase/regionserver/EncodedSeekPerformanceTest.java:161 Done.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java:171 Done.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java:175 Done.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java:850 Done.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java:162 Done.

REVISION DETAIL
  https://reviews.facebook.net/D447

Delta Encoding of KeyValues (aka prefix compression)
----------------------------------------------------

Key: HBASE-4218
URL: https://issues.apache.org/jira/browse/HBASE-4218
Project: HBase
Issue Type: Improvement
Components: io
Affects Versions: 0.94.0
Reporter: Jacek Migdal
Assignee: Mikhail Bautin
Labels: compression
Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, Delta_encoding_with_memstore_TS.patch, open-source.diff

A compression for keys. Keys are sorted in HFile and they are usually very similar. Because of that, it is possible to design better compression than general purpose algorithms.

It is an additional step designed to be used in memory. It aims to save memory in cache as well as to speed up seeks within HFileBlocks. It should improve performance a lot if key lengths are larger than value lengths. For example, it makes a lot of sense to use it when the value is a counter.

Initial tests on real data (key length = ~90 bytes, value length = 8 bytes) show that I could achieve a decent level of compression:
  key compression ratio: 92%
  total compression ratio: 85%
  LZO on the same data: 85%
  LZO after delta encoding: 91%
while having much better performance (20-80% faster decompression than LZO). Moreover, it should allow far more efficient seeking, which should improve performance a bit.

It seems that simple compression algorithms are good enough. Most of the savings are due to prefix compression, int128 encoding, timestamp diffs and bitfields to avoid duplication. That way, comparisons of compressed data can be much faster than a byte comparator (thanks to prefix compression and bitfields).

In order to implement it in HBase, two important changes in design will be needed:
- solidify the interface to HFileBlock / HFileReader Scanner to provide seeking and iterating; access to the uncompressed buffer in HFileBlock will have bad performance
- extend comparators to support comparison assuming that the first N bytes are equal (or some fields are equal)

Link to a discussion about something similar:
http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windowssubj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
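The issue description for HBASE-4218 names prefix compression of sorted keys, and comparators that may assume the first N bytes are equal, as the core ideas. The following is a minimal sketch of both, not the encoding used by the attached patches: the class `PrefixEncodingSketch`, its `Entry` type and the methods `encode`, `decode` and `compareFrom` are hypothetical names chosen for illustration, and the real on-disk layout (variable-length integers, timestamp diffs, bitfields) is not modelled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not the HBASE-4218 format. */
public class PrefixEncodingSketch {

    /** One encoded key: number of bytes shared with the previous key, plus the new suffix. */
    public static final class Entry {
        public final int sharedLen;
        public final byte[] suffix;

        Entry(int sharedLen, byte[] suffix) {
            this.sharedLen = sharedLen;
            this.suffix = suffix;
        }
    }

    /** Length of the common prefix of a and b. */
    static int commonPrefix(byte[] a, byte[] b) {
        int max = Math.min(a.length, b.length);
        int i = 0;
        while (i < max && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    /** Encodes keys (assumed to be sorted) as deltas against the previous key. */
    public static List<Entry> encode(List<byte[]> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        byte[] prev = new byte[0];
        for (byte[] key : sortedKeys) {
            int shared = commonPrefix(prev, key);
            out.add(new Entry(shared, Arrays.copyOfRange(key, shared, key.length)));
            prev = key;
        }
        return out;
    }

    /** Rebuilds the full keys by replaying the deltas in order. */
    public static List<byte[]> decode(List<Entry> entries) {
        List<byte[]> out = new ArrayList<>();
        byte[] prev = new byte[0];
        for (Entry e : entries) {
            byte[] key = new byte[e.sharedLen + e.suffix.length];
            System.arraycopy(prev, 0, key, 0, e.sharedLen);
            System.arraycopy(e.suffix, 0, key, e.sharedLen, e.suffix.length);
            out.add(key);
            prev = key;
        }
        return out;
    }

    /**
     * Compares a and b as unsigned bytes, skipping the first knownEqual bytes,
     * which the caller guarantees are identical in both arrays.
     */
    public static int compareFrom(byte[] a, byte[] b, int knownEqual) {
        int max = Math.min(a.length, b.length);
        for (int i = knownEqual; i < max; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList(
                "row0001/cf:a".getBytes(StandardCharsets.UTF_8),
                "row0001/cf:b".getBytes(StandardCharsets.UTF_8),
                "row0002/cf:a".getBytes(StandardCharsets.UTF_8));
        for (Entry e : encode(keys)) {
            System.out.println(e.sharedLen + " + \""
                    + new String(e.suffix, StandardCharsets.UTF_8) + "\"");
        }
    }
}
```

In this toy input the second key stores only one new byte because it shares 11 bytes with its predecessor. The shared length recorded during encoding is the kind of "first N bytes are equal" knowledge that `compareFrom` can use to skip work during a seek.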
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175013#comment-13175013 ]

Zhihong Yu commented on HBASE-4218:
-----------------------------------

Please remove the last hunk from HFilePerformanceEvaluation.java which led to:
{code}
1 out of 2 hunks FAILED -- saving rejects to file src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java.rej
{code}
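As a rough sanity check on the figures quoted in the HBASE-4218 issue description (key length of about 90 bytes, value length of 8 bytes, key compression ratio 92%, total compression ratio 85%), the two ratios are consistent with each other under some assumptions that the description does not state: that "ratio" means the fraction of bytes removed, that values are stored unchanged, and that per-KeyValue length fields are ignored.

```latex
\[
\text{encoded size} \approx (1 - 0.92)\cdot 90 + 8 = 15.2 \text{ bytes},
\qquad
1 - \frac{15.2}{90 + 8} \approx 0.845 \approx 85\%
\]
```

Under those assumptions nearly all of the total saving comes from the keys, which matches the description's remark that the technique pays off when keys are much longer than values.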
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175021#comment-13175021 ]

Hadoop QA commented on HBASE-4218:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12508425/Delta-encoding.patch-2011-12-22_11_52_07.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 92 new or modified tests.
    -1 javadoc. The javadoc tool appears to have generated -142 warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    -1 findbugs. The patch appears to introduce 80 new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed these unit tests:
                  org.apache.hadoop.hbase.replication.TestReplication

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/582//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/582//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/582//console

This message is automatically generated.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175258#comment-13175258 ]

Zhihong Yu commented on HBASE-4218:
-----------------------------------

Hadoop QA remembers attachment Id and wouldn't retest the same attachment.
Please attach the patch again.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174401#comment-13174401 ]

Phabricator commented on HBASE-4218:

tedyu has commented on the revision "[jira] [HBASE-4218] HFile data block encoding (delta encoding)".

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java:42 Should read 'have been created'
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java:49 I think delta should be removed here to be consistent with new naming convention
    I like the javadoc in HColumnDescriptor.java @ line 601 - it is more detailed.

REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174586#comment-13174586 ]

Phabricator commented on HBASE-4218:

mbautin has commented on the revision "[jira] [HBASE-4218] HFile data block encoding (delta encoding)".

My most recent update also addresses the two new comments from Ted.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java:42 Done.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java:49 Done.

REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174590#comment-13174590 ]

Hadoop QA commented on HBASE-4218:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12508340/0001-Delta-encoding.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 92 new or modified tests.
    -1 javadoc. The javadoc tool appears to have generated -142 warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    -1 findbugs. The patch appears to introduce 80 new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed these unit tests:
                  org.apache.hadoop.hbase.io.TestHeapSize

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/578//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/578//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/578//console

This message is automatically generated.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174621#comment-13174621 ]

Zhihong Yu commented on HBASE-4218:
-----------------------------------

TestHeapSize.testSizes error should be caused by this JIRA.
Please adjust heap size accordingly.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174641#comment-13174641 ]

Phabricator commented on HBASE-4218:

mcorgan has commented on the revision "[jira] [HBASE-4218] HFile data block encoding (delta encoding)".

First try at phabricator - hope i'm using it correctly. Found a few minor uses of the delta terminology. Looking great in general.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java:137 update to DATA_BLOCK_ENCODING
  src/test/java/org/apache/hadoop/hbase/regionserver/EncodedSeekPerformanceTest.java:157 should rename deltaAlgo to encoderAlgo?
  src/test/java/org/apache/hadoop/hbase/regionserver/EncodedSeekPerformanceTest.java:161 encoderAlgo
  src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java:162 rename to testDataBlockEncodingWithNormalSeek
  src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java:171 rename to testDataBlockEncodingWithEncodedSeek
  src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java:175 majorCompactionWithDataBlockEncoding
  src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java:850 testDataBlockEncodingMetaData

REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173514#comment-13173514 ]

Phabricator commented on HBASE-4218:

mbautin has commented on the revision "[jira] [HBASE-4218] Delta encoding for keys in HFile".

Replying to the rest of comments. A new version of the patch will follow.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java:65 Added missing javadoc for includingMemstoreTS.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java:126 seekBefore only matters in case of an exact match. I will update the javadoc.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/PrefixKeyDeltaEncoder.java:34 Updated.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/PrefixKeyDeltaEncoder.java:147 Added an assertion.
  src/test/java/org/apache/hadoop/hbase/io/deltaencoder/TestBufferedDeltaEncoder.java:34 Fixed.
  src/test/java/org/apache/hadoop/hbase/io/deltaencoder/TestDeltaEncoders.java:47 Fixed (LargeTests -- runs in 2 minutes).
  src/test/java/org/apache/hadoop/hbase/io/deltaencoder/TestBufferedDeltaEncoder.java:34 Fixed (SmallTests).
  src/test/java/org/apache/hadoop/hbase/util/TestByteBufferUtils.java:35 Fixed (SmallTests)

REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173764#comment-13173764 ]

Hadoop QA commented on HBASE-4218:

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12508181/D447.9.patch
  against trunk revision .

  +1 @author. The patch does not contain any @author tags.
  +1 tests included. The patch appears to include 65 new or modified tests.
  -1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/561//console

This message is automatically generated.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173792#comment-13173792 ]

Hadoop QA commented on HBASE-4218:

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12508190/D447.10.patch
  against trunk revision .

  +1 @author. The patch does not contain any @author tags.
  +1 tests included. The patch appears to include 65 new or modified tests.
  -1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/563//console

This message is automatically generated.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173865#comment-13173865 ]

Zhihong Yu commented on HBASE-4218:

Thanks for the nice work, Mikhail.
{code}
1 out of 1 hunk ignored -- saving rejects to file src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java.rej
1 out of 2 hunks FAILED -- saving rejects to file src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java.rej
{code}
Please fix the above conflicts by rebasing against TRUNK.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172872#comment-13172872 ]

Phabricator commented on HBASE-4218:

mbautin has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.

Replying to a part of the comments. Will post a new version when I am done going through all the pending comments. Running tests, too.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java:93
    It is possible to use two different delta encodings on disk and in the block cache. So e.g. we could use no delta encoding on disk and only delta-encode in cache. This is the option that we want to use for testing. In addition to that, there is a boolean option, DELTA_ENCODING_IN_MEMORY, probably somewhat confusingly named, that Jacek implemented towards the end of his internship. This option allows to use encoded scanners. I think this might be OK if we rename this option to make it less confusing and document all three of these options.
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2020
    Done.
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:153
    Done.
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2036
    Done.
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2130
    commonPrefix does include the rowkey portion, but it is OK to pass zero as commonPrefix at line 2051, because this function will not compare the row anyway. I modified the documentation and got rid of passing lrowlength and rrowlength to this function, replacing them by only one parameter, because they are always equal.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java:443
    Moved the above methods to ByteBufferUtils.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java:470
    Nice catch! Fixed this (also made sure that newKeyBufferLength is set to at least 1).
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java:475
    Yes, nice catch. Added a unit test.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java:635
    Yes, seems like a bug. Fixed.

REVISION DETAIL
  https://reviews.facebook.net/D447
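The replies on BufferedDeltaEncoder.java:470 and :475 above concern how the key buffer grows. A minimal Java sketch of that pattern follows, assuming the buffer is a plain byte[] that doubles until it is large enough. The class GrowableKeyBuffer and its methods are hypothetical names used only for illustration and are not taken from the D447 patch; only keyBuffer, newKeyBuffer, newKeyBufferLength and keyLength echo identifiers mentioned in the review.

```java
import java.util.Arrays;

/** Hypothetical illustration of the buffer growth discussed in the review; not the patch's code. */
final class GrowableKeyBuffer {
    private byte[] keyBuffer = new byte[0];
    private int keyLength = 0;

    /** Ensures capacity for requiredLength bytes, preserving the first keyLength bytes. */
    void ensureCapacity(int requiredLength) {
        if (requiredLength <= keyBuffer.length) {
            return;
        }
        // Grow from the current capacity (keyBuffer.length, not keyLength) and start
        // from at least 1 so that doubling an empty buffer makes progress.
        int newKeyBufferLength = Math.max(1, keyBuffer.length * 2);
        while (newKeyBufferLength < requiredLength) {
            newKeyBufferLength *= 2;
        }
        byte[] newKeyBuffer = new byte[newKeyBufferLength];
        System.arraycopy(keyBuffer, 0, newKeyBuffer, 0, keyLength);
        // The step flagged as missing in the review: switch to the new array.
        keyBuffer = newKeyBuffer;
    }

    /** Keeps the first commonPrefix bytes of the current key and appends suffix after them. */
    void setSuffix(int commonPrefix, byte[] suffix) {
        if (commonPrefix < 0 || commonPrefix > keyLength) {
            throw new IllegalArgumentException("commonPrefix out of range: " + commonPrefix);
        }
        ensureCapacity(commonPrefix + suffix.length);
        System.arraycopy(suffix, 0, keyBuffer, commonPrefix, suffix.length);
        keyLength = commonPrefix + suffix.length;
    }

    /** Returns a copy of the current key. */
    byte[] currentKey() {
        return Arrays.copyOf(keyBuffer, keyLength);
    }

    int capacity() {
        return keyBuffer.length;
    }
}
```

Without the final assignment in ensureCapacity, the copy would be made and then discarded, so the following write would still go to the old, too-small array.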
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171648#comment-13171648 ]

Phabricator commented on HBASE-4218:

Kannan has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.

some more comments...

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java:65
    javadoc fix for the new param includesMemstoreTS is needed on a few of these methods.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java:126
    little confused with the doc. Could you clarify what happens in the inexact match case: where are we left pointing to for the seekBefore = true and seekBefore=false cases.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/PrefixKeyDeltaEncoder.java:34
    here and a bunch of other places... 128 bit encoding should read 7 bit encoding
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java:475
    It seems like we are missing a:
      keyBuffer = newKeyBuffer;
    step here after the arrayCopy step.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java:470
    I think the logic here has an unintentional bug.
      newKeyBufferLength = keyLength * 2;
    should be:
      newKeyBufferLength = keyBuffer.length * 2;
    Otherwise, the check on the subsequent line will always be FALSE.

REVISION DETAIL
  https://reviews.facebook.net/D447
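On the remark that "128 bit encoding" should read "7 bit encoding": the name presumably refers to a variable-length integer written in base 128, that is, 7 payload bits per byte plus one continuation bit. The sketch below illustrates that general technique in Java. It assumes the least significant group comes first and that the high bit marks "more bytes follow" (the layout used by Protocol Buffers varints); the byte order actually used by PrefixKeyDeltaEncoder is not stated in this thread and may differ. SevenBitInt is a hypothetical class name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

/** Hypothetical base-128 (7 bits per byte) integer codec; not the patch's implementation. */
final class SevenBitInt {
    /** Encodes a non-negative int into 1 to 5 bytes. */
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not supported: " + value);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            // Low 7 bits of the value, with the continuation bit set.
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value); // last group, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes one integer starting at the buffer's current position. */
    static int decode(ByteBuffer in) {
        int result = 0;
        int shift = 0;
        while (true) {
            if (shift > 28) {
                throw new IllegalStateException("encoded integer is longer than 5 bytes");
            }
            byte b = in.get();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }
}
```

Small values such as common-prefix lengths or short key lengths fit in a single byte this way, which is why the scheme suits delta-encoded keys.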
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171662#comment-13171662 ]

Phabricator commented on HBASE-4218:

Kannan has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java:635
    Since we are only copying the non-common-suffix part in this case, shouldn't the offset arguments in both current previous be current.lastCommonPrefix (instead of 0s)?
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/PrefixKeyDeltaEncoder.java:147
    perhaps we add an assertion that the commonLength == 0 for the first key in the block?

REVISION DETAIL
  https://reviews.facebook.net/D447
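To make the commonLength remark above concrete, here is a small Java sketch of prefix encoding over sorted keys: each key is stored as the number of leading bytes it shares with the previous key plus the remaining suffix. The first key of a block has no predecessor, so its common length must be 0, which is what the suggested assertion checks. PrefixKeySketch and Entry are hypothetical names, and the in-memory list stands in for the real block format, which this thread does not spell out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of prefix encoding of sorted keys; not the patch's on-disk format. */
final class PrefixKeySketch {
    /** One encoded key: bytes shared with the previous key, then the differing tail. */
    static final class Entry {
        final int commonLength;
        final byte[] suffix;

        Entry(int commonLength, byte[] suffix) {
            this.commonLength = commonLength;
            this.suffix = suffix;
        }
    }

    static int commonPrefixLength(byte[] a, byte[] b) {
        int max = Math.min(a.length, b.length);
        int i = 0;
        while (i < max && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    static List<Entry> encode(List<byte[]> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        byte[] previous = new byte[0]; // empty predecessor, so the first commonLength is 0
        for (byte[] key : sortedKeys) {
            int common = commonPrefixLength(previous, key);
            out.add(new Entry(common, Arrays.copyOfRange(key, common, key.length)));
            previous = key;
        }
        return out;
    }

    static List<byte[]> decode(List<Entry> entries) {
        List<byte[]> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (int i = 0; i < entries.size(); i++) {
            Entry e = entries.get(i);
            if (i == 0 && e.commonLength != 0) {
                throw new IllegalStateException("first key in a block must have commonLength == 0");
            }
            if (e.commonLength > previous.length) {
                throw new IllegalStateException("commonLength exceeds previous key length");
            }
            byte[] key = new byte[e.commonLength + e.suffix.length];
            // Only the suffix is new; the shared prefix is taken from the previous key.
            System.arraycopy(previous, 0, key, 0, e.commonLength);
            System.arraycopy(e.suffix, 0, key, e.commonLength, e.suffix.length);
            out.add(key);
            previous = key;
        }
        return out;
    }
}
```

For the keys row1/a, row1/b, row2/a this yields common lengths 0, 5 and 3, so only the suffixes "row1/a", "b" and "2/a" need to be stored.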
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13169613#comment-13169613 ]

Phabricator commented on HBASE-4218:

Kannan has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java:93
    I forget how there ended up being 3 options here. Jacek would have more context here. But I am guessing maybe there should just be 2 options:
    a) What delta encoding algo is to be used for a CF?
    b) Whether the encoding is to be in-memory only or on-disk also? [This is primarily a testing mode/dev-time option, where one can experiment with different delta encoders without touching on-disk format or risking corrupting on disk data. So most folks should not even have to worry about this option.]

REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13169621#comment-13169621 ]

Phabricator commented on HBASE-4218:

tedyu has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2036
    I think SamePrefixComparator should carry byte[] as type parameter.
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2020
    How about 'avoids redundant comparisons for better performance' ?
  src/test/java/org/apache/hadoop/hbase/util/TestByteBufferUtils.java:35
    Missing test category.
  src/test/java/org/apache/hadoop/hbase/io/deltaencoder/TestBufferedDeltaEncoder.java:34
    Missing test category.
  src/test/java/org/apache/hadoop/hbase/io/deltaencoder/TestDeltaEncoders.java:47
    Missing test category.

REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13169741#comment-13169741 ]

Phabricator commented on HBASE-4218:

Kannan has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:153
    perhaps change these too to use the newly introduced constants..
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2130
    In this function (compareWithoutRow), is commonPrefix the common part including the rowkey portion?
    - If no, then @line 2119, should you pass commonPrefix - (rowLen + sizeOfShort) instead of commonPrefix
    - If yes, then @line 2051, should you pass rowLen + sizeOfShort instead of 0?

REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169842#comment-13169842 ]

Mikhail Bautin commented on HBASE-4218:

bq. Maybe we could call it KeyValueEncoding, DataBlockEncoding, HCellEncoding, BlockEncoding...

Matt: do you have a specific renaming of delta encoders in mind? Jacek's original delta encoding algorithm names are {Bitset,Prefix,Diff,FastDiff}KeyDeltaEncoder. How do these correspond to the alternative encoder names you are suggesting?
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169860#comment-13169860 ]

Phabricator commented on HBASE-4218:

stack has commented on the revision "[jira] [HBASE-4218] Delta encoding for keys in HFile".

More to follow (sorry for piecemealing this review...).

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java:443
    Do all methods up to here belong elsewhere, out in a utility class? CompressedInts or something? Would ByteBufferUtils be a better place?

REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169894#comment-13169894 ]

Matt Corgan commented on HBASE-4218:

Another thought I had was that all reading and writing could go through the encoder/decoder. The current patch leaves the old access path in place and has the DeltaEncoderSeeker on the side. It would reduce the code base's complexity if everything passed through the DeltaEncoder and you set DeltaEncoderAlgorithm.NONE if you didn't want any encoding.

That could be done later though. Would need to be careful of performance regressions.
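Matt's suggestion of a single access path with a NONE algorithm can be sketched roughly as follows. Everything here is an assumption made for illustration: CodecSketch, TOY_DELTA and BlockPathSketch are hypothetical names, not the DeltaEncoder or DeltaEncoderAlgorithm types from the patch, and the toy delta scheme (each byte stored as its difference from the previous byte) merely stands in for a real encoder. The point is only that callers always go through the same read/write methods, and "no encoding" becomes one more enum constant rather than a separate code path.

```java
/** Hypothetical codec selector: NONE is an identity pass-through. */
enum CodecSketch {
  NONE {
    @Override
    byte[] encode(byte[] raw) {
      return raw.clone();
    }

    @Override
    byte[] decode(byte[] encoded) {
      return encoded.clone();
    }
  },
  TOY_DELTA {
    @Override
    byte[] encode(byte[] raw) {
      // Store each byte as the difference from its predecessor (wraps modulo 256).
      byte[] out = new byte[raw.length];
      byte prev = 0;
      for (int i = 0; i < raw.length; i++) {
        out[i] = (byte) (raw[i] - prev);
        prev = raw[i];
      }
      return out;
    }

    @Override
    byte[] decode(byte[] encoded) {
      // Undo the differences with a running sum.
      byte[] out = new byte[encoded.length];
      byte prev = 0;
      for (int i = 0; i < encoded.length; i++) {
        out[i] = (byte) (encoded[i] + prev);
        prev = out[i];
      }
      return out;
    }
  };

  abstract byte[] encode(byte[] raw);

  abstract byte[] decode(byte[] encoded);
}

/** Hypothetical single access path: every block goes through the chosen codec. */
final class BlockPathSketch {
  static byte[] write(CodecSketch codec, byte[] block) {
    return codec.encode(block);
  }

  static byte[] read(CodecSketch codec, byte[] stored) {
    return codec.decode(stored);
  }
}
```

The sketch copies the array even for NONE, which is the kind of extra cost the performance-regression caveat above is about; a real implementation would presumably avoid that copy.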
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169981#comment-13169981 ]

stack commented on HBASE-4218:

@Matt That's a reasonable point re: naming, as is your latter note wondering if all reading/writing could go the same path. Out of interest, do you think you could shoehorn your TRIE encoder/decoder into the frame that Jacek has rigged here?
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169987#comment-13169987 ]

Matt Corgan commented on HBASE-4218:

Shoehorn is probably the right term, but yeah, I got it mostly working a couple of months ago. The fit actually isn't too bad (though far from ideal) and could be improved over time. I'll try to work it into this newest patch in the next few weeks.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169989#comment-13169989 ]

stack commented on HBASE-4218:

Then I'd say that if you managed to make your trie encoder/decoder fit the deltaencoder framework, it helps your case that the framework name should be broadened beyond deltaencoding only. Good stuff.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168935#comment-13168935 ]

Zhihong Yu commented on HBASE-4218:

There are two files which need to be refreshed:
{code}
1 out of 2 hunks FAILED -- saving rejects to file src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java.rej
14 out of 14 hunks ignored -- saving rejects to file src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java.rej
{code}
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168970#comment-13168970 ]

Phabricator commented on HBASE-4218:

mbautin has commented on the revision "[jira] [HBASE-4218] Delta encoding for keys in HFile".

Addressing Michael's comments. A new version of the diff will follow. Running unit tests.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java:99
    Renamed to DEFAULT_DELTA_ENCODING_IN_MEMORY_ENABLED.

  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2022
    How about SamePrefixComparator? This means the same thing as the latter but is shorter.

  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:34-42
    Done.

  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:56
    Done.

  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:69
    Done.

  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java:32-35
    Done.

  src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java:90
    Fixed.

  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2020
    This extension to the comparator interface is used in BufferedDeltaEncoder to improve performance if the supplied comparator implements this interface. We don't need to compare the first commonPrefix bytes of the two keys if we already know they are the same.

  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2148
    This is the same as the old comparator code. We are assuming that the two KVs are valid.

  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2156
    I've looked into this and indeed saw some code duplication. I refactored the rest of this function into a common one shared between the two comparators.

  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:89
    I guess we might need to think about a bigger unified compression framework for HFiles, HLogs, and RPC at some point.

REVISION DETAIL
  https://reviews.facebook.net/D447
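The idea behind the comparator extension discussed for KeyValue.java:2020 (skip the bytes already known to be equal) can be sketched as below. This is a simplified assumption, not the code under review: SamePrefixSketch and compareSkippingPrefix are hypothetical names, and the sketch compares whole byte arrays lexicographically, whereas a real KeyValue comparator is field-aware (row, family, qualifier, timestamp, type).

```java
/** Hypothetical illustration of comparing two keys whose first bytes are known to match. */
final class SamePrefixSketch {

  /** Unsigned lexicographic comparison that starts looking at index 'from'. */
  static int compareFrom(byte[] left, byte[] right, int from) {
    int n = Math.min(left.length, right.length);
    for (int i = from; i < n; i++) {
      int l = left[i] & 0xff;
      int r = right[i] & 0xff;
      if (l != r) {
        return l - r;
      }
    }
    // All compared bytes were equal: the shorter key sorts first.
    return left.length - right.length;
  }

  /** Ordinary comparison over all bytes. */
  static int compare(byte[] left, byte[] right) {
    return compareFrom(left, right, 0);
  }

  /**
   * Comparison for callers that already know the first commonPrefix bytes are
   * identical (for example, because an encoder just computed that prefix).
   * The caller is responsible for that guarantee; it is not re-checked here.
   */
  static int compareSkippingPrefix(int commonPrefix, byte[] left, byte[] right) {
    return compareFrom(left, right, commonPrefix);
  }
}
```

When the precondition holds, both methods return results with the same sign; the second simply does less work, which matters most for long keys that differ only near the end.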
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168996#comment-13168996 ]

Hadoop QA commented on HBASE-4218:

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12507282/D447.8.patch
  against trunk revision .

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 65 new or modified tests.

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/502//console

This message is automatically generated.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167721#comment-13167721 ]
Phabricator commented on HBASE-4218:
mbautin has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.
See responses inline. I will follow up with a new version of the diff shortly.
INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderAlgorithms.java:65 Removed javadoc comments from these enum items, because they don't add information.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderAlgorithms.java:33 Jacek's delta encoding algorithm names are {Bitset,Prefix,Diff,FastDiff}KeyDeltaEncoder. I don't see how Matt's alternative encoding names correspond to these. I will follow up with Matt on the JIRA.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderBufferTooSmallException.java:22 Fixed, thanks!
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java:28 Done.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java:49 Done.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java:346 Done.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java:405 Done.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/FastDiffDeltaEncoder.java:28 Done.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/FastDiffDeltaEncoder.java:53 Done.
  src/main/java/org/apache/hadoop/hbase/io/hfile/EmptyBlockDeltaEncoder.java:29 Done.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java:337 Fixed. As far as I understand, this fix takes advantage of the fact that the delta encoding API is designed to be idempotent (i.e. when we do beforeBlockCache and give the already-encoded block to afterReadFromDiskAndPuttingIntoCache, it will work correctly).
REVISION DETAIL
  https://reviews.facebook.net/D447
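Editorial sketch: the HFileReaderV1.java:337 reply relies on the encoding step being idempotent, so that handing an already-encoded block to it a second time is harmless. A minimal illustration of that property follows. Every name here (IdempotentEncodingSketch, Block, toyEncode, encodeIfNeeded) is hypothetical and the "encoding" is a stand-in; this is not the API from the patch.

```java
/** Illustrative only: hypothetical names, not the API from the HBASE-4218 patch. */
final class IdempotentEncodingSketch {

  /** A block of bytes plus a flag recording whether it has already been encoded. */
  static final class Block {
    final byte[] data;
    final boolean encoded;

    Block(byte[] data, boolean encoded) {
      this.data = data;
      this.encoded = encoded;
    }
  }

  /** Stand-in for a real encoder: prepends a one-byte marker so the effect is visible. */
  static byte[] toyEncode(byte[] raw) {
    byte[] out = new byte[raw.length + 1];
    out[0] = (byte) 0x7f;
    System.arraycopy(raw, 0, out, 1, raw.length);
    return out;
  }

  /** Encodes a raw block; an already-encoded block is returned unchanged. */
  static Block encodeIfNeeded(Block block) {
    if (block.encoded) {
      return block;
    }
    return new Block(toyEncode(block.data), true);
  }

  public static void main(String[] args) {
    Block raw = new Block(new byte[] {1, 2, 3}, false);
    Block once = encodeIfNeeded(raw);
    // The caller keeps the returned block, so the second call has nothing left to do.
    Block twice = encodeIfNeeded(once);
    System.out.println("same object after second call: " + (once == twice));
    System.out.println("encoded length: " + twice.data.length);
  }
}
```

The sketch also shows why a caller should keep the returned block rather than the original: if the raw block is passed in again, the work is simply repeated.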
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167784#comment-13167784 ]
Hadoop QA commented on HBASE-4218:
--
-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12507046/D447.6.patch
  against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 63 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/491//console
This message is automatically generated.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168115#comment-13168115 ]
Hadoop QA commented on HBASE-4218:
--
-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12507122/0001-Delta-encoding-fixed-encoded-scanners.patch
  against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 85 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated -135 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 80 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
  org.apache.hadoop.hbase.io.TestHeapSize
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/495//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/495//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/495//console
This message is automatically generated.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168165#comment-13168165 ]
Phabricator commented on HBASE-4218:
stack has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.
INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2148 Are these calculations dangerous? Could they go beyond commonPrefix into unallocated space?
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2020 I'm not sure I understand what this is for. Any chance of an example showing when this would be used?
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2156 This code looks like the old comparator code. We are not duplicating it here, are we? (That's some ugly code... it would be a tragedy having it show up twice.) We should at minimum tie the two together with comments warning not to change one without changing the other.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:53 I love this.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:89 I wonder if we could use this stuff writing over rpc; it might be too costly compressing but maybe for big KVs. Anyways.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:158 I love it.
REVISION DETAIL
  https://reviews.facebook.net/D447
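Editorial sketch: the KeyValue.java:2020 comment asks for an example of when a comparator that assumes an equal prefix would be used. One plausible case, sketched below under assumptions, is a seek inside a prefix-encoded block: the scanner already knows how many leading bytes two keys share, so only the remaining bytes need comparing. The names (SuffixCompareSketch, compareIgnoringPrefix) and the simplified signature are illustrative, not taken from the patch. The range check addresses the concern about reading past the valid bytes.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: hypothetical names and signature, not the code under review. */
final class SuffixCompareSketch {

  /**
   * Compares two keys as unsigned bytes, skipping the first commonPrefix bytes,
   * which the caller asserts are equal in both arrays.
   */
  static int compareIgnoringPrefix(int commonPrefix, byte[] left, byte[] right) {
    int max = Math.min(left.length, right.length);
    // Reject a prefix length that would point past the end of either key.
    if (commonPrefix < 0 || commonPrefix > max) {
      throw new IllegalArgumentException("commonPrefix out of range: " + commonPrefix);
    }
    for (int i = commonPrefix; i < max; i++) {
      int l = left[i] & 0xff;
      int r = right[i] & 0xff;
      if (l != r) {
        return l - r;
      }
    }
    // All compared bytes are equal: the shorter key sorts first.
    return left.length - right.length;
  }

  public static void main(String[] args) {
    byte[] a = "user0001:clicks".getBytes(StandardCharsets.UTF_8);
    byte[] b = "user0001:views".getBytes(StandardCharsets.UTF_8);
    // Both keys start with "user0001:", so the comparison can start at offset 9.
    int result = compareIgnoringPrefix(9, a, b);
    System.out.println(result < 0 ? "a sorts before b" : "a does not sort before b");
  }
}
```

When the prefix claim is true, the result has the same sign as a full comparison from offset 0, only with fewer bytes inspected. The method cannot verify the claim itself without doing the work it is meant to skip, so correctness rests on the caller.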
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159391#comment-13159391 ]
Phabricator commented on HBASE-4218:
tedyu has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.
Nice work, Mikhail and Jacek. Please add category to the new tests. Are there performance numbers for various encoders other than Prefix encoder?
INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java:337 As Matt pointed out, the return value should be stored in hfileBlock so that we don't incur double encoding.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:305 Similar to the case in HFileReaderV1, return value should be stored in dataBlock.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderAlgorithms.java:33 Matt suggested alternative names for DeltaEncoding: KeyValueEncoding, DataBlockEncoding, HCellEncoding, BlockEncoding. DataBlockEncoding sounds good.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java:405 Misspelling: comperator should be comparator.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderAlgorithms.java:65 Javadoc doesn't match actual class name.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/FastDiffDeltaEncoder.java:53 The tail should read '128 bit encoding'
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/FastDiffDeltaEncoder.java:28 This class is only used locally. It should be an inner class.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java:49 Tail should read '128 bit encoding'
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java:346 Please remove extra blank line.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java:28 Please change this class to inner class.
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderBufferTooSmallException.java:22 Should read 'which indicates'
REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159531#comment-13159531 ]
Phabricator commented on HBASE-4218:
todd has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile.
I only got through a little bit of the giant patch, but it looks well done and decently unit-tested, so I'm +1 once you have some cluster testing results that show it basically works :)
Test-plan should include an upgrade test from an unpatched HFile v2 format and an HFile v1 (0.90) upgrade
INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java:99 seems odd that the type of this is boolean whereas the IN_CACHE one is an Algorithm type. If it's a requirement that the algo be the same, then maybe rename this one to be DEFAULT_DELTA_ENCODING_IN_MEMORY_ENABLED
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:2022 This interface name isn't quite clear to me, since it doesn't compare prefixes. Maybe SuffixComparator? Or ComparatorAssumingEqualPrefix (though that's a bit lengthy)?
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:34-42 should use inline HTML to format this right
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:56 s/writeHere/out/g for consistent style
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java:69 s/source/in/g
  src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java:32-35 use HTML ul...
  src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java:90 typo
  src/main/java/org/apache/hadoop/hbase/io/hfile/EmptyBlockDeltaEncoder.java:29 maybe NoOpDeltaEncoder is a better name? (it's not that the block is empty)
REVISION DETAIL
  https://reviews.facebook.net/D447
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159045#comment-13159045 ]
Hadoop QA commented on HBASE-4218:
--
-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12505438/Delta_encoding_with_memstore_TS.patch
  against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 81 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated -145 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 72 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/399//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/399//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/399//console
This message is automatically generated.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159048#comment-13159048 ]
Ted Yu commented on HBASE-4218:
---
HadoopQA isn't functioning as usual. Manual execution of test suite is needed.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150197#comment-13150197 ]
Ted Yu commented on HBASE-4218:
---
I went over some of my earlier comments and found that exceptDeltaEncoderId is still misspelled. Please go over my comments.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127079#comment-13127079 ]

jirapos...@reviews.apache.org commented on HBASE-4218:

This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2308/#review2573

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
https://reviews.apache.org/r/2308/#comment5767
    Nit [Coding style]: space between (byte) and 9.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
https://reviews.apache.org/r/2308/#comment5769
    Add a comment about what the following string constants are for (presumably FileInfo keys).

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
https://reviews.apache.org/r/2308/#comment5768
    Remove trailing whitespace here and below.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
https://reviews.apache.org/r/2308/#comment5770
    Create a string constant for NONE.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
https://reviews.apache.org/r/2308/#comment5771
    Use the string constant for NONE.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2308/#comment5772
    Size of the key length field in bytes

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2308/#comment5773
    Size of the key type field in bytes

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2308/#comment5774
    Size of the row length field in bytes

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2308/#comment5775
    Size of the family length field in bytes

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2308/#comment5776
    Size of the timestamp field in bytes

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2308/#comment5777
    This needs to use the new constants defined for row length, etc.

- Mikhail

On 2011-10-08 00:51:01, Jacek Migdal wrote:
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/2308/
bq.
bq. (Updated 2011-10-08 00:51:01)
bq.
bq. Review request for hbase.
bq.
bq. Summary
bq. -------
bq.
bq. Delta encoding for key values.
bq.
bq. This addresses bug HBASE-4218.
bq.     https://issues.apache.org/jira/browse/HBASE-4218
bq.
bq. Diffs
bq. -----
bq.
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/CompressionState.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/CopyKeyDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncodedBlock.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderAlgorithms.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderToSmallBufferException.java PRE-CREATION
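The KeyValue.java review comments above ask for documented size constants and for the code to use them. A minimal standalone sketch of what that might look like is below. Apart from FAMILY_LENGTH_SIZE, which is quoted elsewhere in this thread, the constant names, the class KeyValueFieldSizes and the helper keyLength are assumptions chosen for illustration, not the contents of the patch. The field widths follow the usual KeyValue key layout (2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type).

```java
/** Hypothetical sketch of documented field-size constants; not the patch's KeyValue.java. */
final class KeyValueFieldSizes {
    /** Size of the key length field in bytes */
    static final int KEY_LENGTH_SIZE = Integer.BYTES;
    /** Size of the key type field in bytes */
    static final int TYPE_SIZE = Byte.BYTES;
    /** Size of the row length field in bytes */
    static final int ROW_LENGTH_SIZE = Short.BYTES;
    /** Size of the family length field in bytes */
    static final int FAMILY_LENGTH_SIZE = Byte.BYTES;
    /** Size of the timestamp field in bytes */
    static final int TIMESTAMP_SIZE = Long.BYTES;

    /** Length of the key part of a KeyValue, built from the constants rather than magic numbers. */
    static int keyLength(int rowLength, int familyLength, int qualifierLength) {
        return ROW_LENGTH_SIZE + rowLength
            + FAMILY_LENGTH_SIZE + familyLength
            + qualifierLength
            + TIMESTAMP_SIZE
            + TYPE_SIZE;
    }
}
```

For example, a 10-byte row, 2-byte family and 5-byte qualifier give a key of 2 + 10 + 1 + 2 + 5 + 8 + 1 = 29 bytes.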
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127185#comment-13127185 ]

Matt Corgan commented on HBASE-4218:

I'm trying to hook the prefix trie code into this, which is going well enough. Testing on some HFileV1 data, I think I'm seeing some double-decoding in HFileReaderV1.java:328. You encode the block to put in the block cache in blockDeltaEncoder.beforeBlockCache(..), but then go back to using the unencoded version, which triggers a second encoding a few lines later at blockDeltaEncoder.afterReadFromDiskAndPuttingInCache(..).

Possible change, from:

{code}
// Cache the block
if (cacheBlock && blockCache != null) {
  HFileBlock cachedBlock = blockDeltaEncoder.beforeBlockCache(hfileBlock);
  blockCache.cacheBlock(cacheKey, cachedBlock, inMemory);
}
hfileBlock = blockDeltaEncoder.afterReadFromDiskAndPuttingInCache(
    hfileBlock, isCompaction);
{code}

to:

{code}
// Cache the block
if (cacheBlock && blockCache != null) {
  hfileBlock = blockDeltaEncoder.beforeBlockCache(hfileBlock);
  blockCache.cacheBlock(cacheKey, hfileBlock, inMemory);
}
hfileBlock = blockDeltaEncoder.afterReadFromDiskAndPuttingInCache(
    hfileBlock, isCompaction);
{code}

A few other comments:
* I wonder if we could make some of the naming more general than "Delta encoding", since that's not the only type it can support. I added a TRIE entry to DeltaEncoderAlgorithms. Maybe we could call it KeyValueEncoding, DataBlockEncoding, HCellEncoding, BlockEncoding, etc.
* Saw "comparator" spelled "comperator" in several places.
* Seems like PREFIX is always the winner. Are the others better on certain datasets, or are they just there for comparison?
* I've been running the tests on different block sizes from 1KB to 1MB and seeing seeks/s decline from ~300,000/s to ~3,000/s because of the sequential access inside a block. Even a 64KB block is ~6x slower than 1KB blocks.

{code}
table,encoding,blockSize,numCells,avgKeyBytes,avgValueBytes,sequentialMB/s,seeks/s,~cycles/seek
Count5s,PREFIX,1KB  ,1338940,85,9,167,323685,  6178
Count5s,PREFIX,4KB  ,1338627,85,9,281,334873,  5972
Count5s,PREFIX,16KB ,1338420,85,9,381,168987, 11835
Count5s,PREFIX,64KB ,1338016,85,9,380, 52781, 37891
Count5s,PREFIX,256KB,1339210,85,9,392, 14203,140810
Count5s,PREFIX,1MB  ,1337318,85,9,371,  3703,539958
{code}
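The block-size effect reported in the comment above follows from how prefix-style encodings are read: each key is stored as a difference from the previous one, so a seek has to rebuild keys from the start of the block. The sketch below illustrates that with a deliberately simplified encoding. The class PrefixBlockSketch, its Entry type and both methods are hypothetical and are not the PREFIX encoder from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified prefix encoding; not the encoder from the HBASE-4218 patch. */
final class PrefixBlockSketch {

    /** One encoded key: bytes shared with the previous key, plus the differing suffix. */
    static final class Entry {
        final int sharedPrefixLength;
        final byte[] suffix;

        Entry(int sharedPrefixLength, byte[] suffix) {
            this.sharedPrefixLength = sharedPrefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes keys that are already sorted, storing only what differs from the previous key. */
    static List<Entry> encode(List<byte[]> sortedKeys) {
        List<Entry> block = new ArrayList<>();
        byte[] previous = new byte[0];
        for (byte[] key : sortedKeys) {
            int shared = 0;
            int max = Math.min(previous.length, key.length);
            while (shared < max && previous[shared] == key[shared]) {
                shared++;
            }
            block.add(new Entry(shared, Arrays.copyOfRange(key, shared, key.length)));
            previous = key;
        }
        return block;
    }

    /**
     * Returns the index of the first key >= target, or block.size() if none.
     * Every key before the match must be rebuilt, so the cost grows with block size.
     */
    static int seek(List<Entry> block, byte[] target) {
        byte[] current = new byte[0];
        for (int i = 0; i < block.size(); i++) {
            Entry e = block.get(i);
            // Keep the shared prefix of the previous key, then append this entry's suffix.
            byte[] next = Arrays.copyOf(current, e.sharedPrefixLength + e.suffix.length);
            System.arraycopy(e.suffix, 0, next, e.sharedPrefixLength, e.suffix.length);
            current = next;
            if (compareUnsigned(current, target) >= 0) {
                return i;
            }
        }
        return block.size();
    }

    private static int compareUnsigned(byte[] left, byte[] right) {
        int min = Math.min(left.length, right.length);
        for (int i = 0; i < min; i++) {
            int diff = (left[i] & 0xff) - (right[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return left.length - right.length;
    }
}
```

Because seek() cannot binary-search the entries, a block holding 64x more keys needs roughly 64x more key reconstructions per seek, which is consistent in direction with the seeks/s column in the table.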
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127238#comment-13127238 ]

Ted Yu commented on HBASE-4218:

I think a similar change (as suggested by Matt) should be made in HFileReaderV2.java @ line 279.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123704#comment-13123704 ]

Ted Yu commented on HBASE-4218:

There seems to be a typo in the comment of KeyValue.java:

{noformat}
  /** Size in bytes of field the row length */
  public static final int FAMILY_LENGTH_SIZE = Bytes.SIZEOF_BYTE;
{noformat}
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123709#comment-13123709 ]

Ted Yu commented on HBASE-4218:

HFileBlockDeltaEncoder.java, RedundantKVGenerator.java, TestBufferedDeltaEncoder.java and TestDeltaEncoders.java need a license header.

The RedundantKVGenerator ctor has many parameters. Is it possible to use some wrapper to hold the parameters?
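One common way to wrap a long constructor parameter list, as suggested above for RedundantKVGenerator, is a small immutable configuration object with a builder. The sketch below is generic: the class KvGeneratorConfig, its fields and its default values are invented for illustration and do not reflect the actual constructor arguments in the patch.

```java
/** Hypothetical parameter holder; field names and defaults are illustrative only. */
final class KvGeneratorConfig {
    final long randomSeed;
    final int numberOfRows;
    final int averageKeyLength;
    final int valueLength;

    private KvGeneratorConfig(Builder b) {
        this.randomSeed = b.randomSeed;
        this.numberOfRows = b.numberOfRows;
        this.averageKeyLength = b.averageKeyLength;
        this.valueLength = b.valueLength;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        // Illustrative defaults, so callers only set what they care about.
        private long randomSeed = 42L;
        private int numberOfRows = 1000;
        private int averageKeyLength = 90;
        private int valueLength = 8;

        Builder randomSeed(long value) { this.randomSeed = value; return this; }
        Builder numberOfRows(int value) { this.numberOfRows = value; return this; }
        Builder averageKeyLength(int value) { this.averageKeyLength = value; return this; }
        Builder valueLength(int value) { this.valueLength = value; return this; }

        KvGeneratorConfig build() {
            if (numberOfRows <= 0) {
                throw new IllegalArgumentException("numberOfRows must be positive: " + numberOfRows);
            }
            return new KvGeneratorConfig(this);
        }
    }
}
```

A generator could then take a single KvGeneratorConfig argument, and tests would read as KvGeneratorConfig.builder().numberOfRows(10).build() rather than a long positional list.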
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123785#comment-13123785 ]

Ted Yu commented on HBASE-4218:

For BlockDeltaEncoder.decodeDataBlock():

{code}
  private HFileBlock decodeDataBlock(HFileBlock block, boolean verifyEncoding,
      short exceptDeltaEncoderId) {
{code}

exceptDeltaEncoderId should be called expectedDeltaEncoderId.

RuntimeException is thrown in case of IOException. I think decodeDataBlock() can be declared to throw IOException.
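The suggestion above (declare IOException instead of wrapping it in a RuntimeException) could look roughly like the following. This is a standalone sketch under stated assumptions: the class DecodeSketch, the method decode and the two-byte encoder id header are hypothetical and do not mirror BlockDeltaEncoder or the HFileBlock format.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical sketch; the byte layout here is invented for illustration. */
final class DecodeSketch {

    /**
     * Reads a 2-byte encoder id followed by the payload. I/O problems and
     * id mismatches are reported as IOException, so callers handle them
     * explicitly instead of receiving an unchecked RuntimeException.
     */
    static byte[] decode(byte[] encoded, short expectedDeltaEncoderId) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
        short actualId = in.readShort(); // throws EOFException (an IOException) on short input
        if (actualId != expectedDeltaEncoderId) {
            throw new IOException("Expected delta encoder id " + expectedDeltaEncoderId
                + " but found " + actualId);
        }
        byte[] payload = new byte[in.available()];
        in.readFully(payload);
        return payload;
    }
}
```

Including both ids in the message also covers the kind of diagnostic detail requested for other exceptions in this review.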
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123792#comment-13123792 ]

Ted Yu commented on HBASE-4218:

For BlockDeltaEncoder.inMemory:

{code}
  private final boolean inMemory;
{code}

Would encodedInMemory be a better name? From the javadoc in the code, it seems inMemory indicates whether in-memory encoding is desired.

For BlockDeltaEncoder.afterReadFromDiskAndPuttingInCache():

{code}
    if (block.getBlockType() == BlockType.ENCODED_DATA) {
      throw new IllegalStateException("Unexcepted encoding");
    }
{code}

I think block.getDeltaEncodingId() should be included in the exception.

Further, can we use a call such as the following to decode the block instead of throwing an exception?

{code}
decodeDataBlock(block, true, block.getDeltaEncodingId())
{code}
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123799#comment-13123799 ]

Ted Yu commented on HBASE-4218:

For BlockDeltaEncoder.useEncodedScanner(), why doesn't isCompaction appear in the second condition on line 227?

TestHFileBlockDeltaEncoder and DeltaEncodingSeekPerformance need a license header.

For BitsetKeyDeltaEncoder.uncompressKeyValues(), the IllegalStateException on line 81 should contain source.available() and skipLastBytes.

BitsetKeyDeltaEncoder.isPartEqual() should be named arePartsEqual().
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123510#comment-13123510 ]

jirapos...@reviews.apache.org commented on HBASE-4218:

This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2308/#review2466

I ran unit tests with Jacek's patch. 1199 unit tests passed. The only one that failed was ServerCustomProtocol, which also seems to fail sporadically without the patch. Without the patch, there are only 1028 tests, so the patch is apparently very well unit-tested.

- Mikhail

On 2011-10-08 00:51:01, Jacek Migdal wrote:
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/2308/
bq.
bq. (Updated 2011-10-08 00:51:01)
bq.
bq. Review request for hbase.
bq.
bq. Summary
bq. -------
bq.
bq. Delta encoding for key values.
bq.
bq. This addresses bug HBASE-4218.
bq.     https://issues.apache.org/jira/browse/HBASE-4218
bq.
bq. Diffs
bq. -----
bq.
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/CompressionState.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/CopyKeyDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncodedBlock.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderAlgorithms.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderToSmallBufferException.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/FastDiffDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/PrefixKeyDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/EmptyBlockDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockDeltaEncoder.java PRE-CREATION
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java 1180113
bq.   http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java 1180113
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123576#comment-13123576 ]

Ted Yu commented on HBASE-4218:

For BlockDeltaEncoder.afterBlockCache(), I am not sure if the following matches the logic:

{code}
    // Postcondition: if (isCompaction is set and onDisk is not NONR) or
    //inMemory is not set - don;t encode.
{code}
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123597#comment-13123597 ]

Ted Yu commented on HBASE-4218:

EmptyBlockDeltaEncoder, CompressionState and BlockDeltaEncoder need a license header.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123330#comment-13123330 ]

jirapos...@reviews.apache.org commented on HBASE-4218:

This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2308/

Review request for hbase.

Summary

Delta encoding for key values.

This addresses bug HBASE-4218.
https://issues.apache.org/jira/browse/HBASE-4218

Diffs

  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BitsetKeyDeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/CompressionState.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/CopyKeyDeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncodedBlock.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderAlgorithms.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderToSmallBufferException.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DiffKeyDeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/FastDiffDeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/PrefixKeyDeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockDeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/EmptyBlockDeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockDeltaEncoder.java PRE-CREATION
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 1180113
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/CompressionTest.java 1180113
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123337#comment-13123337 ]

Jacek Migdal commented on HBASE-4218:

Performance results on production data.

CopyKeyDeltaEncoder:
  Compression performance: 1136.33 MB/s (+/- 60.91 MB/s)
  Decompression performance: 373.29 MB/s (+/- 281.22 MB/s)
BitsetKeyDeltaEncoder:
  Compression performance: 147.57 MB/s (+/- 0.58 MB/s)
  Decompression performance: 166.78 MB/s (+/- 54.81 MB/s)
PrefixKeyDeltaEncoder:
  Compression performance: 293.94 MB/s (+/- 1.97 MB/s)
  Decompression performance: 233.61 MB/s (+/- 91.97 MB/s)
FastDiffDeltaEncoder:
  Compression performance: 203.47 MB/s (+/- 0.37 MB/s)
  Decompression performance: 196.77 MB/s (+/- 43.22 MB/s)
DiffKeyDeltaEncoder:
  Compression performance: 187.74 MB/s (+/- 0.24 MB/s)
  Decompression performance: 163.13 MB/s (+/- 12.17 MB/s)
LZO:
  Compression performance: 260.35 MB/s (+/- 0.76 MB/s)
  Decompression performance: 173.45 MB/s (+/- 76.13 MB/s)

CopyKeyDeltaEncoder
  Saved bytes: -4
  Key compression ratio: -0.00 %
  All compression ratio: -0.00 %
  LZO compressed size: 152019
  LZO compression ratio: 85.79 %
BitsetKeyDeltaEncoder
  Saved bytes: 747061
  Key compression ratio: 75.46 %
  All compression ratio: 69.82 %
  LZO compressed size: 124438
  LZO compression ratio: 88.37 %
PrefixKeyDeltaEncoder
  Saved bytes: 831602
  Key compression ratio: 84.00 %
  All compression ratio: 77.72 %
  LZO compressed size: 117285
  LZO compression ratio: 89.04 %
FastDiffDeltaEncoder
  Saved bytes: 935275
  Key compression ratio: 94.47 %
  All compression ratio: 87.41 %
  LZO compressed size: 94360
  LZO compression ratio: 91.18 %
DiffKeyDeltaEncoder
  Saved bytes: 909175
  Key compression ratio: 91.84 %
  All compression ratio: 84.97 %
  LZO compressed size: 96597
  LZO compression ratio: 90.97 %

Total KV prefix length: 8
Total key length: 91
Total key redundancy: 781606
Total value length: 8

DeltaEncodingSeekPerformance

BlockDeltaEncoder onDisk='NONE' inCache='NONE' inMemory=false
  Read speed: 63.99 (MB/s)
  Seeks per second: 54901.21 (#/s)
BlockDeltaEncoder onDisk='NONE' inCache='BITSET' inMemory=false
  Read speed: 46.73 (MB/s)
  Seeks per second: 13570.50 (#/s)
BlockDeltaEncoder onDisk='NONE' inCache='PREFIX' inMemory=false
  Read speed: 55.88 (MB/s)
  Seeks per second: 20298.89 (#/s)
BlockDeltaEncoder onDisk='NONE' inCache='DIFF' inMemory=false
  Read speed: 54.39 (MB/s)
  Seeks per second: 15082.79 (#/s)
BlockDeltaEncoder onDisk='NONE' inCache='FAST_DIFF' inMemory=false
  Read speed: 54.12 (MB/s)
  Seeks per second: 15432.61 (#/s)
BlockDeltaEncoder onDisk='NONE' inCache='NONE' inMemory=true
  Read speed: 64.37 (MB/s)
  Seeks per second: 56779.82 (#/s)
BlockDeltaEncoder onDisk='NONE' inCache='BITSET' inMemory=true
  Read speed: 35.42 (MB/s)
  Seeks per second: 46170.87 (#/s)
BlockDeltaEncoder onDisk='NONE' inCache='PREFIX' inMemory=true
  Read speed: 43.54 (MB/s)
  Seeks per second: 60108.48 (#/s)
BlockDeltaEncoder onDisk='NONE' inCache='DIFF' inMemory=true
  Read speed: 40.62 (MB/s)
  Seeks per second: 48779.68 (#/s)
BlockDeltaEncoder onDisk='NONE' inCache='FAST_DIFF' inMemory=true
  Read speed: 40.76 (MB/s)
  Seeks per second: 57291.22 (#/s)
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123366#comment-13123366 ]

jirapos...@reviews.apache.org commented on HBASE-4218:

This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2308/#review2460

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java
https://reviews.apache.org/r/2308/#comment5565

    Should be 'bytes are required'

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java
https://reviews.apache.org/r/2308/#comment5564

    The value of i should be included in the exception.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java
https://reviews.apache.org/r/2308/#comment5566

    Can this logic be written without recursion?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java
https://reviews.apache.org/r/2308/#comment5567

    Should this exception be called DeltaEncoderBufferTooSmallException?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/deltaencoder/BufferedDeltaEncoder.java
https://reviews.apache.org/r/2308/#comment5568

    Would arePartsEqual be a better name?

- Ted
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121752#comment-13121752 ]

Matt Corgan commented on HBASE-4218:

Jacek - have you done anything with the KeyValue/scanner/searching interfaces? I'm curious to see your approach. Like you, I'm materializing the iterator's current cell, but the materialized row/family/qualifier/timestamp/type/value all reside in separate arrays/fields. The scanner can only materialize one cell at a time, which I think can work long term but doesn't play well with some of the current scanner interfaces. The problem can be dodged by spawning a new array and copying everything into the KeyValue format, but we would see a massive speedup and could possibly eliminate all object instantiation (and furious garbage collection) if we could do comparisons on the intermediate arrays. I've mocked up some cell interfaces and comparators but am wondering what you've already got in progress.

Regarding scanners - supported operations on a block are next(), previous(), nextRow(), previousRow(), positionAt(KeyValue kv, boolean beforeIfMiss), and some others. The main problem is that I can't peek(), which is used in the current version of the KeyValue heap, though I've mocked up an alternate approach without it. I'm also starting to think that a traditional iterator's hasNext() method should not be supported, so that true streaming can be done and so that blocks don't need to know about their neighbors.
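To make the scanner operations in the comment above more concrete, here is a minimal sketch of a block-level scanner that offers next(), previous() and positionAt(key, beforeIfMiss) but neither peek() nor hasNext(). It is only a guess at the shape of such an interface: the class name `BlockScannerSketch`, the use of plain String keys instead of KeyValue, and the return-value conventions are all assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical block scanner over sorted keys; not Matt's or Jacek's actual code.
// Keys are Strings here purely to keep the example short.
final class BlockScannerSketch {
    private final String[] sortedKeys;
    private int pos = -1; // -1 means "before the first cell"

    BlockScannerSketch(String[] sortedKeys) {
        this.sortedKeys = sortedKeys.clone();
    }

    /** Moves to the next cell. Returns false at the end of the block, so no hasNext() is needed. */
    boolean next() {
        if (pos + 1 >= sortedKeys.length) {
            return false;
        }
        pos++;
        return true;
    }

    /** Moves to the previous cell. Returns false if there is none. */
    boolean previous() {
        if (pos <= 0) {
            return false;
        }
        pos--;
        return true;
    }

    /**
     * Positions the scanner at key and returns true on an exact match.
     * On a miss (assumed semantics) it lands on the nearest smaller key when
     * beforeIfMiss is true, otherwise on the nearest larger key.
     */
    boolean positionAt(String key, boolean beforeIfMiss) {
        int idx = Arrays.binarySearch(sortedKeys, key);
        if (idx >= 0) {
            pos = idx;
            return true;
        }
        int insertionPoint = -idx - 1;
        pos = beforeIfMiss ? insertionPoint - 1 : insertionPoint;
        return false;
    }

    /** Key of the current cell; fails if the scanner is positioned outside the block. */
    String currentKey() {
        if (pos < 0 || pos >= sortedKeys.length) {
            throw new IllegalStateException("scanner is not positioned on a cell");
        }
        return sortedKeys[pos];
    }
}
```

Since next() itself reports exhaustion, a caller can stream through one block and then open the following one without the block having to know about its neighbors.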
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088827#comment-13088827 ]

Jacek Migdal commented on HBASE-4218:

Regarding variable byte encoding: besides VInt and FInt there is another option. Within a block, all ints have the same width, but the width can differ across blocks.
* exploits similarity of data within a given block
* usually has the same size as VInt
* few branches
* the key value format is not uniform across all of the data

Having said that, in many KeyValues there are only a few different sizes. That allows even more efficient encoding.

On the other hand, when value lengths get longer, they vary a lot. But in that case keys are a tiny percentage of the whole file, so any savings from VB will be insignificant.

Your mileage may vary.
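The per-block width option from the comment above can be sketched as follows: pick the narrowest width that fits the largest value in the block, store that width once in a block header, and write every int at that width. This is an illustration under assumptions, not code from the patch; `FixedWidthBlockSketch`, the one-byte header and the big-endian layout are choices made for this example.

```java
// Hypothetical sketch of "same int width within a block, different across blocks".
// Layout assumed here: [1 byte width][value 0][value 1]..., big-endian values.
final class FixedWidthBlockSketch {
    /** Smallest number of bytes (1 to 4) that can hold every value from 0 to max. */
    static int widthFor(int max) {
        if (max < 0) {
            throw new IllegalArgumentException("negative value");
        }
        if (max < (1 << 8)) {
            return 1;
        }
        if (max < (1 << 16)) {
            return 2;
        }
        if (max < (1 << 24)) {
            return 3;
        }
        return 4;
    }

    /** Encodes non-negative ints at one shared width chosen for this block. */
    static byte[] encode(int[] values) {
        int max = 0;
        for (int v : values) {
            if (v < 0) {
                throw new IllegalArgumentException("negative value");
            }
            max = Math.max(max, v);
        }
        int width = widthFor(max);
        byte[] out = new byte[1 + values.length * width];
        out[0] = (byte) width; // stored once per block, not per value
        int pos = 1;
        for (int v : values) {
            for (int shift = 8 * (width - 1); shift >= 0; shift -= 8) {
                out[pos++] = (byte) (v >>> shift);
            }
        }
        return out;
    }

    /** Random access: reads the value at index without scanning earlier values. */
    static int get(byte[] block, int index) {
        int width = block[0];
        int pos = 1 + index * width;
        int v = 0;
        for (int i = 0; i < width; i++) {
            v = (v << 8) | (block[pos + i] & 0xFF);
        }
        return v;
    }
}
```

The decode loop has no per-byte branch on the data, and constant-width entries allow direct indexing, which is what makes binary search inside a block cheap.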
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088266#comment-13088266 ]

Matt Corgan commented on HBASE-4218:

I lean towards byte-encoding ints whenever they're used often enough to have an impact on memory. KeyValue could probably do better with some VInts. You can encode 128 values in 1 byte and decode it with just one branch to check if b[0] < 0. Given the number of other byte comparisons going on while reading the key, that doesn't seem too heavyweight (especially since many of those other byte comparisons are casting the byte to a positive integer before comparing). If you reserved 2-4 bytes for that same number, then you may be doing even more work.

One problem with VInt decoders is that sometimes they do bounds checking, which can slow things down a lot. I think validation should be done at write time, and then possibly using a block-level checksum when a block is copied back into memory. Then assume everything is correct.

For prefix compression, we're talking about encoding things at the block level where most of the ints are internal pointers that are less than the block size of 64k, so most ints can fit in 2 bytes. But it's important that they be able to grow gracefully when block sizes grow beyond 64k or are configured to be bigger.

I've been using two types of encoded integers: VInt and FInt. FInts are basically an optimization over VInts for cases where you have many ints with the same characteristics, and can therefore store their width at the block level rather than encoding it in every occurrence.
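The single-branch decode mentioned above can be illustrated with a common 7-bits-per-byte scheme in which the high bit of each byte says whether more bytes follow. The comment does not spell out the exact VInt layout in use, so the byte order and the class name `VIntSketch` below are assumptions; Hadoop's own WritableUtils VInt format is different.

```java
// Hypothetical VInt sketch: 7 value bits per byte, low-order group first,
// high bit set on every byte except the last. No bounds checking on read,
// in the spirit of "validate at write time, then assume everything is correct".
final class VIntSketch {
    /** Writes a non-negative value at offset and returns the number of bytes used. */
    static int write(byte[] buf, int offset, int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value");
        }
        int pos = offset;
        while (value > 0x7F) {
            buf[pos++] = (byte) ((value & 0x7F) | 0x80); // high bit: more bytes follow
            value >>>= 7;
        }
        buf[pos++] = (byte) value;
        return pos - offset;
    }

    /** Reads the value starting at offset. */
    static int read(byte[] buf, int offset) {
        byte b = buf[offset];
        if (b >= 0) {
            return b; // one branch covers all 128 single-byte values
        }
        int result = b & 0x7F;
        int shift = 7;
        int pos = offset + 1;
        while (true) {
            b = buf[pos++];
            result |= (b & 0x7F) << shift;
            if (b >= 0) {
                return result;
            }
            shift += 7;
        }
    }
}
```

Java bytes are signed, so a set high bit shows up as a negative byte; that is why the continuation test is a plain sign check.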
VInt (variable width int)
* width is not known ahead of time, so it must be interpreted byte by byte
* slower because of the branch on each byte, but still pretty fast
* only 2^7 values/byte, so 2 bytes can hold 16k values

FInt (fixed width int)
* width is known ahead of time and stored externally (at block level in PtBlockMeta in this project)
* an FInt is faster to encode and decode because of the lack of if-statements
* each byte can store 2^8 values, so 2 bytes gets you 64k values (hbase block size)
* a list of these numbers provides random access, which is important for binary searching
* if encoding the numbers 0-10,000, for example, then VInts will save you 1 byte on the numbers 0-127, but that is a small % savings, so use FInts for lists of numbers

Sidenote: I've been meaning to make a CVInt (comparable variable width int) that:
* sorts based on raw bytes even if different widths (good for suffixing hbase row/colQualifier values)
* to interpret, count the number of leading 1 bits, and that is how many additional bytes there are beyond the first byte
* bits beyond the first 0 bit comprise the value
* should also be faster to decode because of fewer branches
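The CVInt idea in the sidenote is only described, not implemented, in this thread. One possible reading is sketched below: the number of leading 1 bits in the first byte gives the count of additional bytes, a 0 bit terminates the marker, and the remaining bits hold the value in big-endian order. The class `CVIntSketch`, the limit to non-negative 32-bit ints and the always-minimal-width rule are assumptions of this example.

```java
// Hypothetical CVInt sketch (comparable variable width int), following the
// description in the comment above. Values are always written at their
// minimal width, which is what makes raw-byte ordering match numeric ordering.
final class CVIntSketch {
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value");
        }
        // With `extra` additional bytes there is room for 7 + 7 * extra value bits.
        int extra = 0;
        while (extra < 4 && (value >>> (7 + 7 * extra)) != 0) {
            extra++;
        }
        byte[] out = new byte[1 + extra];
        long bits = value;
        for (int i = extra; i >= 1; i--) { // trailing bytes, least significant last
            out[i] = (byte) bits;
            bits >>>= 8;
        }
        int marker = (0xFF << (8 - extra)) & 0xFF; // `extra` leading 1 bits, then a 0 bit
        out[0] = (byte) (marker | bits);
        return out;
    }

    static int decode(byte[] buf, int offset) {
        int first = buf[offset] & 0xFF;
        // Leading 1 bits of the first byte = leading 0 bits of its complement.
        int extra = Integer.numberOfLeadingZeros(~first << 24);
        if (extra > 4) {
            throw new IllegalArgumentException("too wide for a 32-bit int");
        }
        long value = first & (0xFF >>> (extra + 1)); // bits after the terminating 0
        for (int i = 1; i <= extra; i++) {
            value = (value << 8) | (buf[offset + i] & 0xFF);
        }
        if (value > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("value does not fit in an int");
        }
        return (int) value;
    }

    /** Lexicographic comparison of raw bytes, treating each byte as unsigned. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

Decoding needs one leading-bit count and a fixed-length loop rather than a test on every byte, and a wider encoding always starts with a larger first byte, so encoded values of different widths still sort correctly as raw bytes.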
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087805#comment-13087805 ]

stack commented on HBASE-4218:

I was reading a paper this morning and it was going on about size savings doing variable byte encoding. Should KV do VB? At implementation time, using VB made the parse harder so we punted on it. Maybe now we have smarter fellas in the mix, VB is worth a second look (in this context)?
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087538#comment-13087538 ]

stack commented on HBASE-4218:

/me hearts this issue
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086545#comment-13086545 ]

Ted Yu commented on HBASE-4218:

bq. Moreover, it should allow far more efficient seeking which should improve performance a bit.

Can performance improvement be quantified ?
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086556#comment-13086556 ]

Jacek Migdal commented on HBASE-4218:

Yes, I plan to measure seek performance within one block. I haven't implemented it yet, but I expect that it will make seeking and decompressing KeyValues as fast as operating on uncompressed bytes.

The primary goal is to save memory in buffers.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086574#comment-13086574 ]

Matt Corgan commented on HBASE-4218:

Sorry I haven't chimed in on this in a while, but I've made significant progress implementing some of the ideas I mentioned in the discussion you linked to: taking a sorted List<KeyValue>, converting it to a compressed byte[], and then providing fast mechanisms for reading the byte[] back to KeyValues. It should work for block indexes and data blocks.

I don't think I'll be able to do the full integration into HBase, but I'm trying to get the code to a point where it's well designed, tested, and easy (possible) to start working into the code base. I'll try to get it on github in the next couple weeks. I wish I could dedicate more time, but it's been a nights/weekends project.

Here's a quick storage format overview. Class names begin with Pt for Prefix Trie. A block of KeyValues gets converted to a byte[] composed of 5 sections:

1) PtBlockMeta stores some offsets into the block, the width of some byte-encoded integers, etc. http://pastebin.com/iizJz3f4

2) PtRowNodes are the bulk of the complexity. They store a trie structure for rebuilding the row keys in the block. Each Leaf node has a list of offsets that point to the corresponding columns, timestamps, and data offsets/lengths in that row. The row data is structured for efficient sequential iteration and/or individual row lookups. http://pastebin.com/cb79N0Ge

3) PtColNodes store a trie structure that provides random access to column qualifiers. A PtRowNode points at one of these and it traverses its parents backwards through the trie to rebuild the full column qualifier. Important for wide rows. http://pastebin.com/7rsq7epp

4) TimestampDeltas are byte-encoded deltas from the minimum timestamp in the block. The PtRowNodes contain pointers to these deltas. The width of all deltas is determined by the longest one. Supports having all timestamps equal to the minTimestamp, resulting in zero storage cost.

5) A data section made of all data values concatenated together. The PtRowNodes contain the offsets/lengths.

My first priority is getting the storage format right. Then optimizing the read path. Then the write path. I'd love to hear any comments, and will continue to work on getting the full code ready.
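The TimestampDeltas section described in item 4 can be sketched as follows. This is only an illustration, not Matt's code: the class name TimestampDeltaSketch, the big-endian byte order, and the assumption that timestamps are non-negative are all hypothetical. Every delta from the block minimum is stored in the same number of bytes, chosen to fit the largest delta, so a block in which all timestamps are equal needs zero bytes of delta storage.

```java
/** Hypothetical sketch of fixed-width timestamp deltas; assumes non-negative timestamps. */
final class TimestampDeltaSketch {
  final long minTimestamp;
  final int width;      // bytes per delta, between 0 and 8
  final byte[] deltas;  // count * width bytes, each delta big-endian (assumption)

  private TimestampDeltaSketch(long minTimestamp, int width, byte[] deltas) {
    this.minTimestamp = minTimestamp;
    this.width = width;
    this.deltas = deltas;
  }

  static TimestampDeltaSketch encode(long[] timestamps) {
    if (timestamps.length == 0) {
      throw new IllegalArgumentException("no timestamps");
    }
    long min = timestamps[0];
    long max = timestamps[0];
    for (long t : timestamps) {
      min = Math.min(min, t);
      max = Math.max(max, t);
    }
    // Bytes needed for the largest delta; evaluates to 0 when all timestamps are equal.
    long maxDelta = max - min;
    int width = (64 - Long.numberOfLeadingZeros(maxDelta) + 7) / 8;
    byte[] deltas = new byte[timestamps.length * width];
    for (int i = 0; i < timestamps.length; i++) {
      long delta = timestamps[i] - min;
      for (int b = 0; b < width; b++) {
        deltas[i * width + b] = (byte) (delta >>> (8 * (width - 1 - b)));
      }
    }
    return new TimestampDeltaSketch(min, width, deltas);
  }

  /** Random access: rebuilds the timestamp at the given index. */
  long get(int index) {
    long delta = 0;
    for (int b = 0; b < width; b++) {
      delta = (delta << 8) | (deltas[index * width + b] & 0xFF);
    }
    return minTimestamp + delta;
  }
}
```

The fixed width is what makes pointer-style access possible: the delta for entry i always starts at byte i * width, so no scanning is needed to find it.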
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086634#comment-13086634 ]

Matt Corgan commented on HBASE-4218:

That sounds great Jacek. Let me know how to get the interfaces, tests, and benchmarks when you're ready to share them. They would be really helpful.
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086650#comment-13086650 ]

Jacek Migdal commented on HBASE-4218:

So far the implemented interface looks like:

{noformat}
/**
 * Fast compression of KeyValue. It aims to be fast and efficient
 * using assumptions:
 * - the KeyValue are stored sorted by key
 * - we know the structure of KeyValue
 * - the values are iterated always forward from beginning of block
 * - application specific knowledge
 *
 * It is designed to work fast enough to be feasible as in memory compression.
 */
public interface DeltaEncoder {
  /**
   * Compress KeyValues and write them to output buffer.
   * @param writeHere Where to write compressed data.
   * @param rawKeyValues Source of KeyValue for compression.
   * @throws IOException If there is an error in writeHere.
   */
  public void compressKeyValue(OutputStream writeHere, ByteBuffer rawKeyValues)
      throws IOException;

  /**
   * Uncompress assuming that original size is known.
   * @param source Compressed stream of KeyValues.
   * @param decompressedSize Size in bytes of uncompressed KeyValues.
   * @return Uncompressed block of KeyValues.
   * @throws IOException If there is an error in source.
   * @throws DeltaEncoderToSmallBufferException If specified uncompressed
   *     size is too small.
   */
  public ByteBuffer uncompressKeyValue(DataInputStream source, int decompressedSize)
      throws IOException, DeltaEncoderToSmallBufferException;
}
{noformat}

I also need some kind of interface for iterating and seeking. I haven't got it yet but would like to have something like:

{noformat}
public Iterator<KeyValue> getIterator(ByteBuffer encodedKeyValues);
public Iterator<KeyValue> getIteratorStartingFrom(ByteBuffer encodedKeyValues,
    byte[] keyBuffer, int offset, int length);
{noformat}

That would work for me, but for you I might have to change it to something like:

{noformat}
public EncodingIterator getState(ByteBuffer encodedKeyValues);

class EncodingIterator implements Iterator<KeyValue> {
  ...
  public void seekToBeginning();
  public void seekTo(byte[] keyBuffer, int offset, int length);
}
{noformat}

I will figure out how we could share the code.
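To make the seekToBeginning / seekTo idea above more concrete, here is a rough sketch of a seekable iterator over a prefix-encoded buffer. Everything in it is an assumption for illustration: the class name SeekingIteratorSketch, the entry layout ([shared prefix length: 1 byte][suffix length: 1 byte][suffix bytes]), iterating over raw key bytes instead of KeyValue objects, and having seekTo return the first key greater than or equal to the target (the proposed seekTo returns void).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical seekable iterator over keys stored as [shared][suffixLen][suffix]. */
final class SeekingIteratorSketch {
  private final ByteBuffer block;
  private byte[] currentKey = new byte[0];

  SeekingIteratorSketch(ByteBuffer encodedKeys) {
    // slice() gives an independent view whose position 0 is the start of the block.
    this.block = encodedKeys.slice();
  }

  void seekToBeginning() {
    block.rewind();
    currentKey = new byte[0];
  }

  boolean hasNext() {
    return block.hasRemaining();
  }

  /** Decodes the next key by reusing the shared prefix of the previous one. */
  byte[] next() {
    int shared = block.get() & 0xFF;
    int suffixLength = block.get() & 0xFF;
    byte[] key = Arrays.copyOf(currentKey, shared + suffixLength);
    block.get(key, shared, suffixLength);
    currentKey = key;
    return key;
  }

  /** Scans from the start and returns the first key >= the target, or null if none. */
  byte[] seekTo(byte[] keyBuffer, int offset, int length) {
    seekToBeginning();
    while (hasNext()) {
      byte[] key = next();
      if (compare(key, keyBuffer, offset, length) >= 0) {
        return key;
      }
    }
    return null;
  }

  /** Unsigned lexicographic comparison of key against other[offset, offset + length). */
  static int compare(byte[] key, byte[] other, int offset, int length) {
    int n = Math.min(key.length, length);
    for (int i = 0; i < n; i++) {
      int diff = (key[i] & 0xFF) - (other[offset + i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return key.length - length;
  }
}
```

The seek here is a plain linear scan, which matches the stated assumption that values are always iterated forward from the beginning of a block. After seekTo returns, next() continues with the key that follows the match.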
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086704#comment-13086704 ]

Matt Corgan commented on HBASE-4218:

I should be able to work with ByteBuffer as the backing block data. Like you said above, we'll have to work on smarter iterators and comparators that can do most things without instantiating a full KeyValue in its current form. Sounds like it will be a longer-term project to make KeyValue into a more flexible interface, so in the meantime there will be places where it has to cut a full KeyValue by copying bytes.
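One way to picture the "smarter comparator" mentioned in this thread (a comparison that assumes the first N bytes are already known to be equal) is the small sketch below. The class and method names are hypothetical and are not part of HBase; the sketch only shows how the known common prefix can be skipped.

```java
/** Hypothetical comparator that skips a prefix already known to be equal. */
final class PrefixAwareComparatorSketch {

  /**
   * Compares a and b as unsigned bytes, trusting the caller that the first
   * commonPrefix bytes of both arrays are identical and need no inspection.
   */
  static int compareIgnoringPrefix(byte[] a, byte[] b, int commonPrefix) {
    int n = Math.min(a.length, b.length);
    for (int i = commonPrefix; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    // All compared bytes are equal, so the shorter array sorts first.
    return a.length - b.length;
  }
}
```

With prefix-encoded data the shared prefix length is stored alongside each key, so it is available without any extra work, and the comparison only touches the bytes that can actually differ.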
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086708#comment-13086708 ]

Jonathan Gray commented on HBASE-4218:

bq. in the mean time there will be places it has to cut a full KeyValue by copying bytes

Agreed. There's some other work going on around slab allocators and object reuse that could be paired with this to ameliorate some of that overhead.