I have been digging into some details of Accumulo to model the disk and network costs associated with various types of scan patterns, and I have a few questions regarding compression.
Assume an inverted index table with rows following the pattern <key><value><id>, and a scan that specifies an exact key and value to constrain the range. It seems that the dominant factor in network utilization would be sending key-value pairs from the tablet server to the client, with a secondary factor being transmitting data from non-local RFiles (assuming no caching).

Is my understanding correct that the on-disk compression of this type of table is predominantly a function of the average number of differing bits between adjacent ids? Or has anyone observed a significant improvement with gz or lzo versus no additional compression? I'm considering running some experiments to measure the difference for a few types of ids (uuid, snowflake-like, content-based hashes; a rough harness is sketched in the P.S. below), but I'm curious whether anyone else has done similar experiments.

Given a scan that specifies a range for an exact key and value, is any transport compression performed for tablet server to client communication beyond the Key.compress method, which appears to compress only equivalent rows, columns, etc., rather than those that share a common prefix?

It seems possible to implement a more specialized compression scheme with the iterator framework, performing the decompression on the client side, but I'm curious whether general scan performance could improve if the default compression also involved run-length encoding.

Any insight on this subject is much appreciated.

V/R
Jonathan
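
P.S. For concreteness, here is roughly the kind of measurement harness I have in mind for the id comparison. Everything in it is made up for illustration: the snowflake-like generator is just a millisecond timestamp plus a sequence number, the content hash is SHA-256 of a synthetic document name, and average shared-prefix bytes between lexicographically adjacent ids is only a crude proxy for what the RFile relative-key encoding and the block compressor would actually see.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

public class IdPrefixStats {

    // Average number of leading bytes shared by lexicographically adjacent ids,
    // a rough stand-in for how compressible the <id> portion of the row is.
    static double avgSharedPrefix(List<String> ids) {
        List<String> sorted = new ArrayList<>(ids);
        Collections.sort(sorted);
        long shared = 0;
        for (int i = 1; i < sorted.size(); i++) {
            byte[] a = sorted.get(i - 1).getBytes(StandardCharsets.UTF_8);
            byte[] b = sorted.get(i).getBytes(StandardCharsets.UTF_8);
            int j = 0;
            while (j < a.length && j < b.length && a[j] == b[j]) {
                j++;
            }
            shared += j;
        }
        return (double) shared / (sorted.size() - 1);
    }

    public static void main(String[] args) throws Exception {
        int n = 100_000;
        List<String> uuids = new ArrayList<>();
        List<String> snowflakeLike = new ArrayList<>();
        List<String> hashes = new ArrayList<>();
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        long base = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            uuids.add(UUID.randomUUID().toString());
            // crude snowflake-like id: millisecond timestamp + sequence number
            snowflakeLike.add(String.format("%013d%06d", base + (i / 100), i % 100));
            // crude content-based id: hex-encoded SHA-256 of a synthetic name
            byte[] digest = sha.digest(("doc-" + i).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte x : digest) hex.append(String.format("%02x", x));
            hashes.add(hex.toString());
        }
        System.out.printf("uuid           : %.2f shared bytes%n", avgSharedPrefix(uuids));
        System.out.printf("snowflake-like : %.2f shared bytes%n", avgSharedPrefix(snowflakeLike));
        System.out.printf("content hash   : %.2f shared bytes%n", avgSharedPrefix(hashes));
    }
}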

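P.P.S. In case it helps to be concrete about the scan pattern I'm describing, it is essentially the following (using the ZooKeeperInstance/Connector client API). The table name, the \0 delimiters in the row, and the connection details are all placeholders; the point is just that every matching <key><value><id> row comes back to the client as a full Key/Value pair, which is the transfer I'm trying to model.

import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class IndexScanSketch {
    public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
            .getConnector("user", new PasswordToken("secret"));

        // Rows are <key>\0<value>\0<id>, so an exact key and value constrain
        // the scan to a single row prefix and only the matching ids are read.
        String rowPrefix = "color" + "\0" + "red" + "\0";

        Scanner scanner = conn.createScanner("invertedIndex", Authorizations.EMPTY);
        scanner.setRange(Range.prefix(new Text(rowPrefix)));

        long entries = 0;
        long rowBytes = 0;
        for (Map.Entry<Key,Value> e : scanner) {
            // each matching id is shipped to the client as a full Key/Value pair
            entries++;
            rowBytes += e.getKey().getRow().getLength() + e.getValue().getSize();
        }
        System.out.println(entries + " entries, ~" + rowBytes + " row/value bytes returned");
    }
}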