Jonathan Wonders wrote:
I have been digging into some details of Accumulo to model the disk and
network costs associated with various types of scan patterns and I have
a few questions regarding compression.
Assuming an inverted index table with rows following the pattern of
<key><value><id>
and a scan that specifies an exact key and value so as to constrain the
range, it seems that the dominant factor in network utilization would
be sending key-value pairs from the tablet server to the client and a
secondary factor would be transmitting data from non-local RFiles
(assuming no caching).
Sounds about right to me.
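Just to make sure we're picturing the same access pattern, here's roughly what I assume that scan looks like on the client side. This is only a sketch: the table name, the Connector, and the null-byte delimiter in the row are assumptions on my part, not anything from your schema.

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class InvertedIndexScan {
  // Rows are assumed to be built as <key>\0<value>\0<id>; substitute
  // whatever delimiter your index actually uses.
  public static void scanExact(Connector conn, String table,
      String indexedKey, String indexedValue) throws Exception {
    Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
    // Constrain the scan to every row sharing the <key>\0<value>\0 prefix,
    // i.e. all ids indexed under that exact key/value.
    scanner.setRange(Range.prefix(indexedKey + "\0" + indexedValue + "\0"));
    for (Entry<Key,Value> entry : scanner) {
      // Only the trailing <id> portion of the row matters here.
      System.out.println(entry.getKey().getRow());
    }
  }
}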
Is my understanding correct that the on-disk compression of this type of
table is predominantly a function of the average number of differing
bits between adjacent ids? Or, has anyone observed a significant
improvement with gz or lzo vs no additional compression? I'm
considering running some experiments to measure the difference for a few
types of ids (uuid, snowflake-like, content based hashes), but I'm
curious if anyone else has done similar experiments.
There is definitely a noticeable difference between gzip, lzo/snappy
(they're about on par in speed and size), and no compression. I'm not
sure what the deltas are off the top of my head, but I would expect
that you would see noticeable differences.
You can also notice differences in "density" depending on the
lexicographical delta between adjacent keys. Sequential keys that differ
very little will result in very dense RFiles (lots of keys), whereas keys
that vary greatly will result in fewer keys per file. We've had some
discussions about the run-length encoding in RFile lately -- have you
stumbled on that?
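If you do run those codec experiments, switching between runs is cheap since the codec is just the table.file.compress.type property. A rough sketch (the table name and Connector are placeholders; the compaction is only there so existing data gets rewritten with the new codec before you measure on-disk size):

import org.apache.accumulo.core.client.Connector;

public class SetCompression {
  // Switch the RFile compression codec for a table between experiment runs.
  // Valid values include "gz", "lzo", "snappy", and "none"; only files
  // written after the change (flushes, compactions) pick up the new codec.
  public static void setCodec(Connector conn, String table, String codec)
      throws Exception {
    conn.tableOperations().setProperty(table, "table.file.compress.type", codec);
    // Force a full compaction (null start/end row = whole table) so existing
    // data is rewritten with the new codec before measuring on-disk size.
    conn.tableOperations().compact(table, null, null, true, true);
  }
}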
Given a scan that specifies a range for an exact key and value, is there
any transport compression performed for tablet server to client
communication beyond the Key.compress method, which appears to only
compress equivalent rows, columns, etc., as opposed to those that share
a common prefix?
No, there are no compression algorithms applied to the data before
sending it. By default we use the TCompactProtocol from Thrift. If
you're interested in the specifics, Thrift has some documentation on the
subject.
I do recall some experiments I tried previously in which a logical
"object" in Accumulo was comprised of multiple key-value pairs which
were returned in one key-value pair (very similar to how the
WholeRowIterator serializes things). In this case, I experimented with
compressing the serialized Key-Values before returning from the Iterator
stack. Sadly, I don't recall what the takeaway was, but I don't believe
it was outright "bad" :)
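I don't have that code handy anymore, but the core of it looked roughly like the sketch below. The GZIP choice and the class/method names are just illustrative, and in the actual experiment the encoding happened inside a custom iterator rather than a standalone helper:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.SortedMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;

public class CompressedRowCodec {

  // Server side (in the iterator stack): serialize a row's Key-Values the
  // same way WholeRowIterator does, then gzip the result so fewer bytes
  // cross the wire.
  public static Value encode(List<Key> keys, List<Value> values) throws IOException {
    byte[] encodedRow = WholeRowIterator.encodeRow(keys, values);
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(encodedRow);
    }
    return new Value(bos.toByteArray());
  }

  // Client side: undo the gzip wrapper, then let WholeRowIterator's decoder
  // rebuild the original Key-Value pairs for the row.
  public static SortedMap<Key,Value> decode(Key rowKey, Value compressed) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream in =
        new GZIPInputStream(new ByteArrayInputStream(compressed.get()))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    }
    return WholeRowIterator.decodeRow(rowKey, new Value(out.toByteArray()));
  }
}

The client then calls the decode method on each returned pair instead of using WholeRowIterator.decodeRow directly.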
It seems possible to implement a more specialized compression scheme
with the iterator framework, performing the decompression on the client
side, but I'm curious whether general scan performance would improve if
the default compression also involved run-length encoding.
This would be really neat to test.
Some rigor in designing an experiment would be the first step, IMO. If
you want to spearhead this, I'm sure there are many who would be happy
to give some feedback. It would be prudent to figure out what the
variables in the "equation" would be, specifically the distribution of
Key-Value pairs -- amount stored, compression on disk, cache sizes,
query workload. First figure out what you want to test and what you
will tweak, and define a hypothesis you want to test.
You could consider using the continuous ingest framework for
data-generation and query workloads. You could also look at YCSB [1]
for some inspiration. Either of these would make your life easier by
saving you from having to generate the read/write workloads yourself.
[1] https://github.com/brianfrankcooper/YCSB
Any insight on this subject is much appreciated.
V/R
Jonathan