Jonathan Wonders wrote:
I have been digging into some details of Accumulo to model the disk and
network costs associated with various types of scan patterns and I have
a few questions regarding compression.

Assuming an inverted index table with rows following the pattern of

<key><value><id>

and a scan that specifies an exact key and value so as to constrain the
range, it seems that the dominant factor in network utilization would
be sending key-value pairs from the tablet server to the client and a
secondary factor would be transmitting data from non-local RFiles
(assuming no caching).

Sounds about right to me.
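
For concreteness, I picture that scan looking roughly like the following (a rough sketch against the 1.x Connector API; the table name, the "color"/"red" literals, and the null-byte delimiter are placeholders for whatever your schema actually uses):

    // Rough sketch of the inverted-index scan being described: the row is
    // <key><value><id>, and the scan fixes <key> and <value> so only the
    // trailing <id> varies.
    import java.util.Map.Entry;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class InvertedIndexScanSketch {
      // "conn" is an existing Connector; "invIndex", "color", "red", and the
      // null-byte delimiter are made-up placeholders.
      static void scanExactKeyValue(Connector conn) throws Exception {
        String prefix = "color" + "\0" + "red" + "\0"; // the <key><value> portion of the row
        Scanner scanner = conn.createScanner("invIndex", Authorizations.EMPTY);
        // every matching row starts with that prefix, so a prefix range
        // constrains the scan to exactly the ids indexed under the pair
        scanner.setRange(Range.prefix(prefix));
        for (Entry<Key,Value> entry : scanner) {
          // the id is whatever follows the key/value prefix in the row
          String id = entry.getKey().getRow().toString().substring(prefix.length());
          System.out.println(id);
        }
      }
    }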

Is my understanding correct that the on-disk compression of this type of
table is predominantly a function of the average number of differing
bits between adjacent ids?  Or, has anyone observed a significant
improvement with gz or lzo vs no additional compression?  I'm
considering running some experiments to measure the difference for a few
types of ids (uuid, snowflake-like, content based hashes), but I'm
curious if anyone else has done similar experiments.

There is definitely a difference between gzip, lzo/snappy (those two are roughly on par in both speed and size), and no compression. I'm not sure what the deltas are off the top of my head, but I would expect them to be noticeable.
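
If you do end up measuring it, the codec is just a per-table property, so it's easy to stand up otherwise-identical tables that differ only in compression. Something along these lines (1.x client API; the table names are made up), or the same thing from the shell with config -t <table> -s table.file.compress.type=<codec>:

    // Sketch: one table per codec so the same data can be written to each
    // and the resulting RFile sizes compared on disk.
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.admin.TableOperations;

    public class CompressionTestTables {
      // "conn" is an existing Connector
      static void create(Connector conn) throws Exception {
        TableOperations ops = conn.tableOperations();
        for (String codec : new String[] {"gz", "lzo", "snappy", "none"}) {
          String table = "invIndex_" + codec;
          ops.create(table);
          // table.file.compress.type selects the codec used when RFiles are written
          ops.setProperty(table, "table.file.compress.type", codec);
        }
      }
    }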

You can also notice differences in "density" depending on the lexicographical delta between adjacent keys. Sequential keys that differ very little will result in very dense RFiles (lots of keys per file), whereas keys that vary greatly will result in fewer keys per file. We've had some discussions about the run-length encoding in RFile lately -- have you stumbled on that?
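
To make the density point concrete, here is a trivial way to eyeball how much adjacent ids share; this is only an illustration of the idea, not RFile's actual relative-key encoding:

    // Illustration only: count the leading bytes two adjacent row values
    // have in common. The longer the shared prefix, the less "new" data
    // each successive key contributes to the file.
    public class PrefixSharing {
      static int sharedPrefixLength(byte[] prev, byte[] cur) {
        int n = Math.min(prev.length, cur.length);
        int i = 0;
        while (i < n && prev[i] == cur[i]) {
          i++;
        }
        return i;
      }

      public static void main(String[] args) {
        // sequential, snowflake-like ids share almost everything...
        System.out.println(sharedPrefixLength(
            "user00000041".getBytes(), "user00000042".getBytes())); // 11
        // ...while random UUIDs or content hashes share almost nothing
        System.out.println(sharedPrefixLength(
            "6b3f2b1c-9a0e-4d5f-8c2a-0f1e2d3c4b5a".getBytes(),
            "e7a1c9d2-3b4f-4a6e-9d8c-5b6a7f8e9d0c".getBytes())); // 0
      }
    }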

Given a scan that specifies a range for an exact key and value, is there
any transport compression performed for tablet server to client
communication beyond the Key.compress method which appears to only
compress equivalent rows, columns, etc as opposed to those that share a
common prefix?

No, there are no compression algorithms applied to the data before sending it. By default we use the TCompactProtocol from Thrift. If you're interested in the specifics, Thrift has some documentation on the subject.

I do recall some experiments I tried previously in which a logical "object" in Accumulo was composed of multiple key-value pairs that were returned as one key-value pair (very similar to how the WholeRowIterator serializes things). In that case, I experimented with compressing the serialized Key-Values before returning them from the iterator stack. Sadly, I don't recall what the takeaway was, but I don't believe it was outright "bad" :)
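
For what it's worth, the compression half of that experiment is just java.util.zip wrapped around whatever bytes the iterator would otherwise hand back. A minimal sketch of the server-side/client-side pairing, with the actual serialization of the Key-Values (WholeRowIterator-style or otherwise) and the iterator plumbing left out:

    // Sketch of the compress-before-returning / decompress-on-the-client idea.
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class ValueCompression {
      // server side: run inside the iterator, on the serialized key-value payload
      static byte[] compress(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
          gz.write(payload);
        }
        return bos.toByteArray(); // becomes the bytes of the Value the iterator returns
      }

      // client side: undo it before deserializing the individual key-value pairs
      static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
          byte[] buf = new byte[4096];
          int n;
          while ((n = gz.read(buf)) != -1) {
            bos.write(buf, 0, n);
          }
          return bos.toByteArray();
        }
      }
    }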

It seems possible to implement a more specialized compression algorithm
with the iterator framework, performing the decompression on the client
side, but I'm curious if it could lead to general scan performance
improvements if the default compression also involved run-length encoding.

This would be really neat to test.

Some rigor in designing an experiment would be the first step, IMO. If you want to spearhead this, I'm sure there are many who would be happy to give feedback. It would be prudent to figure out what the variables in the "equation" are, specifically around the distribution of Key-Value pairs -- amount stored, compression on disk, cache sizes, query workload. First, figure out what you want to test and what you will tweak, and then define a hypothesis to evaluate.

You could consider using the continuous ingest framework for data generation and query workloads. You could also look at YCSB [1] for inspiration. Either of these would make your life easier by sparing you from having to generate the read/write workloads yourself.


[1] https://github.com/brianfrankcooper/YCSB

Any insight on this subject is much appreciated.

V/R
Jonathan
