Prefix Compression - Trie data block encoding
---------------------------------------------

                 Key: HBASE-4676
                 URL: https://issues.apache.org/jira/browse/HBASE-4676
             Project: HBase
          Issue Type: New Feature
          Components: io
    Affects Versions: 0.94.0
            Reporter: Matt Corgan


The HBase data block format has room for 2 significant improvements for 
applications that have high block cache hit ratios.  

First, there is no prefix compression, and the current KeyValue format is 
somewhat metadata heavy, so there can be tremendous memory bloat for many 
common data layouts, specifically those with long keys and short values.

Second, there is no random access to KeyValues inside data blocks.  This means 
that every time you double the datablock size, average seek time (or average 
cpu consumption) goes up by a factor of 2.  The standard 64KB block size is 
~10x slower for random seeks than a 4KB block size, but block sizes as small as 
4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB or more may be 
more efficient from a disk access and block-cache perspective in many big-data 
applications, but doing so is infeasible from a random seek perspective.

The PrefixTrie block encoding format attempts to solve both of these problems.  
Some features:

* trie format for row key encoding completely eliminates duplicate row keys and 
encodes similar row keys into a standard trie structure which also saves a lot 
of space
* the column family is currently stored once at the beginning of each block.  
this could easily be modified to allow multiple family names per block
* all qualifiers in the block are stored in their own trie format which caters 
nicely to wide rows.  duplicate qualifers between rows are eliminated.  the 
size of this trie determines the width of the block's qualifier fixed-width-int
* the minimum timestamp is stored at the beginning of the block, and deltas are 
calculated from that.  the maximum delta determines the width of the block's 
timestamp fixed-width-int

The block is structured with metadata at the beginning, then a section for the 
row trie, then the column trie, then the timestamp deltas, and then then all 
the values.  Most work is done in the row trie, where every leaf node 
(corresponding to a row) contains a list of offsets/references corresponding to 
the cells in that row.  Each cell is fixed-width to enable binary searching and 
is represented by [1 byte operationType, X bytes qualifier offset, X bytes 
timestamp delta offset].

If all operation types are the same for a block, there will be zero per-cell 
overhead.  Same for timestamps.  Same for qualifiers when i get a chance.  So, 
the compression aspect is very strong, but makes a few small sacrifices on 
VarInt size to enable faster binary searches in trie fan-out nodes.

A more compressed but slower version might build on this by also applying 
further (suffix, etc) compression on the trie nodes at the cost of slower write 
speed.  Even further compression could be obtained by using all VInts instead 
of FInts with a sacrifice on random seek speed (though not huge).

One current drawback is the current write speed.  While programmed with good 
constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not 
programmed with the same level of optimization as the read path.  Work will 
need to be done to optimize the data structures used for encoding and could 
probably show a 10x increase.  It will still be slower than delta encoding, but 
with a much higher decode speed.  I have not yet created a thorough benchmark 
for write speed nor sequential read speed.

Though the trie is reaching a point where it is internally very efficient 
(probably within half or a quarter of its max read speed) the way that hbase 
currently uses it is far from optimal.  The KeyValueScanner and related classes 
that iterate through the trie will eventually need to be smarter and have 
methods to do things like skipping to the next row of results without scanning 
every cell in between.  When that is accomplished it will also allow much 
faster compactions because the full row key will not have to be compared as 
often as it is now.

Current code is on github.  The trie code is in a separate project than the 
slightly modified hbase.  There is an hbase project there as well with the 
DeltaEncoding patch applied, and it builds on top of that.

https://github.com/hotpads/hbase/tree/delta-encoding-plus-trie
https://github.com/hotpads/hbase-prefix-trie

I'll follow up later with more implementation ideas.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to