Definitely. This is pretty cool. Thanks for sharing. I'll have to try to make some time to poke around with this :)
William Slacum wrote:
Thanks, Jonathan! I've wondered about specific numbers on this topic when dealing with geohashes, so this is a very useful tool.

On Sun, Oct 25, 2015 at 11:22 AM, Jonathan Wonders <[email protected]> wrote:

I have been able to put some more thought into this over the weekend and make some initial observations on tables I currently have populated.

Looking at the rfile-info for a few different tables, I noticed that one which has particularly small lexicographical deltas between keys costs an average of ~2.5 bits per key to store on disk. All of the data is stored in the row component of the key, and a full row is typically about 36 bytes. I wrote a little utility to recreate ScanResult objects for batches of sequential key-value pairs returned from a scanner and then used the TCompactProtocol to write the ScanResult to a byte array. Each key-value pair costs roughly 48 bytes, which makes sense given that every row is different and there will be some space required for the timestamps, visibilities, and other bookkeeping info.

Another table I looked at has larger lexicographical deltas between keys and costs roughly 5 bytes per key to store on disk. This table is a reverse index with very large rows; each column within a row identifies data that resides in another table. Each column is roughly 12 bytes uncompressed. When encoded in a ScanResult, each key-value pair costs roughly 25 bytes, which makes sense since the row cost should be negligible for large batch sizes and the overhead from timestamp, visibility, and other bookkeeping info is roughly the same as for the other table.

Since compression depends heavily on both table design and the actual data, it seemed the next logical step would be to create a tool that the community could use to easily measure the compression ratio for ScanResult objects. So, I threw together a shell extension to wrap the utility that I previously described. It measures the compression ratio for the default strategy (Key.compress) as well as a few other simple strategies that seemed reasonable to test. The usage is almost the same as the scan command; it just prints out compression statistics rather than the data. It lives at https://github.com/jwonders/accumulo-experiments with branches for Accumulo 1.6.x, 1.7.x, and 1.8.x.

Any feedback is welcome. I hope others find this useful for understanding this particular aspect of scan performance.
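For anyone curious about the mechanics, the core of that measurement is roughly the following. This is a simplified sketch rather than the actual code in the repository above; it assumes the 1.6.x-style client and Thrift-generated classes, and the class and method names here are made up for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.KeyValue;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.data.thrift.ScanResult;
import org.apache.accumulo.core.data.thrift.TKeyValue;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TCompactProtocol;

public class ScanResultSizer {

  // Serializes up to batchSize entries from the scanner the way a tablet server
  // would (Key.compress into a ScanResult, written with TCompactProtocol) and
  // returns the average encoded size per key-value pair.
  public static double bytesPerEntry(Scanner scanner, int batchSize) throws Exception {
    List<KeyValue> batch = new ArrayList<KeyValue>();
    for (Map.Entry<Key,Value> e : scanner) {
      batch.add(new KeyValue(e.getKey(), e.getValue().get()));
      if (batch.size() >= batchSize)
        break;
    }
    if (batch.isEmpty())
      return 0.0;

    // Key.compress is the default strategy: it drops key fields that are
    // identical to the corresponding field of the previous key.
    List<TKeyValue> compressed = Key.compress(batch);
    ScanResult result = new ScanResult(compressed, false);

    byte[] wire = new TSerializer(new TCompactProtocol.Factory()).serialize(result);
    return (double) wire.length / batch.size();
  }
}

The shell extension linked above wraps this same style of measurement and adds the alternative strategies; the ~48 and ~25 byte-per-pair figures quoted earlier come from measuring the two tables described in this way.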
V/R Jonathan

On Thu, Oct 22, 2015 at 4:37 PM, Jonathan Wonders <[email protected]> wrote:

Josh,

Thanks for the information. I did read through the discussion about compression of visibility expressions and columns within RFiles a while back, which got me thinking about some of this. It makes sense that gzip or lzo/snappy compression would have a very noticeable impact when there are columns or visibility expressions that are not compressed with RLE, even if neighboring rows have very small lexicographical deltas.

I will put some thought into designing an experiment to evaluate whether or not there is any benefit to applying RLE during key-value transport from server to client. Even if it proves to be situationally beneficial, I think it could be implemented as a common iterator similar to the WholeRowIterator.

Given the current compression strategy, I would expect better server-client transport compression retrieving a single row with many columns

  <key><value> : <id> []

compared to many lexicographically close rows

  <key><value><id> : []

with the understanding that very large rows can lead to poor load balancing.

V/R Jonathan

On Thu, Oct 22, 2015 at 11:54 AM, Josh Elser <[email protected]> wrote:

Jonathan Wonders wrote:

I have been digging into some details of Accumulo to model the disk and network costs associated with various types of scan patterns, and I have a few questions regarding compression. Assuming an inverted index table with rows following the pattern of <key><value><id> and a scan that specifies an exact key and value so as to constrain the range, it seems that the dominant factor in network utilization would be sending key-value pairs from the tablet server to the client, and a secondary factor would be transmitting data from non-local RFiles (assuming no caching).

Sounds about right to me.

Is my understanding correct that the on-disk compression of this type of table is predominantly a function of the average number of differing bits between adjacent ids? Or, has anyone observed a significant improvement with gz or lzo vs no additional compression? I'm considering running some experiments to measure the difference for a few types of ids (uuid, snowflake-like, content-based hashes), but I'm curious if anyone else has done similar experiments.

There is definitely a noticeable difference between gzip, lzo/snappy (they're about on par with speed and size), and no compression. I'm not sure what the deltas are off the top of my head, but I would expect that you see noticeable differences. You can also notice differences in "densities" depending on the lexicographical delta between two keys: sequential keys that differ very little will result in very dense RFiles (lots of keys), whereas keys that vary greatly will result in fewer keys per file. We've had some discussions about the run-length encoding in RFile lately -- have you stumbled on that?

Given a scan that specifies a range for an exact key and value, is there any transport compression performed for tablet server to client communication beyond the Key.compress method, which appears to only compress equivalent rows, columns, etc. as opposed to those that share a common prefix?

No, there are no compression algorithms applied to the data before sending it. By default we use the TCompactProtocol from Thrift. If you're interested in the specifics, Thrift has some documentation on the subject.

I do recall some experiments I tried previously in which a logical "object" in Accumulo was comprised of multiple key-value pairs which were returned in one key-value pair (very similar to how the WholeRowIterator serializes things). In this case, I experimented with compressing the serialized Key-Values before returning from the Iterator stack. Sadly, I don't recall what the takeaway was, but I don't believe it was outright "bad" :)
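A cheap way to get a feel for whether that kind of compression would pay off, before writing an actual iterator, is to serialize a batch of pairs client side and gzip the bytes. Below is a rough sketch -- only the Writable encoding that Key and Value already provide plus java.util.zip are involved, and the class and method names are made up for illustration.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;

public class GzipTransportCheck {

  // Returns {uncompressedBytes, gzipBytes} for a batch of scanner entries,
  // using the Hadoop Writable encoding of Key and Value.
  public static long[] measure(List<Map.Entry<Key,Value>> batch) throws Exception {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    DataOutputStream rawOut = new DataOutputStream(raw);
    for (Map.Entry<Key,Value> e : batch) {
      e.getKey().write(rawOut);
      e.getValue().write(rawOut);
    }
    rawOut.close();

    ByteArrayOutputStream zipped = new ByteArrayOutputStream();
    DataOutputStream zipOut = new DataOutputStream(new GZIPOutputStream(zipped));
    for (Map.Entry<Key,Value> e : batch) {
      e.getKey().write(zipOut);
      e.getValue().write(zipOut);
    }
    zipOut.close();

    return new long[] {raw.size(), zipped.size()};
  }
}

Comparing those two numbers against the ScanResult sizes from the measurements above should give a first read on whether gzip buys anything beyond Key.compress for a given table layout.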
It seems possible to implement a more specialized compression algorithm with the iterator framework, performing the decompression on the client side, but I'm curious if it could lead to general scan performance improvements if the default compression also involved run-length encoding.

This would be really neat to test. Some rigor in designing an experiment would be the first step, IMO. If you want to spear-head this, I'm sure there are many who would be happy to give some feedback. It would be prudent to figure out what the variables in the "equation" would be, specifically the distribution of Key-Value pairs -- amount stored, compression on disk, cache sizes, query workload. Figure out what you want to test, what you will tweak, and define a hypothesis you want to test.

You could consider using the continuous ingest framework for data generation and query workloads. You could also look at YCSB [1] for some inspiration. Either of these would make your life easier in not having to generate the read/write workloads.

[1] https://github.com/brianfrankcooper/YCSB

Any insight on this subject is much appreciated.

V/R Jonathan
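For anyone picking this thread up later, the "compression on disk" variable mentioned above is just a per-table property, so it is easy to fold into such an experiment. Here is a minimal sketch using the Java client API; the instance name, ZooKeepers, credentials, and table name are placeholders.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class SetTableCompression {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("accumulo", "zkhost:2181")
        .getConnector("user", new PasswordToken("secret"));

    // Valid codecs include gz, lzo, snappy, and none.
    conn.tableOperations().setProperty("test_index", "table.file.compress.type", "snappy");

    // Force a full major compaction so existing RFiles are rewritten with the
    // new codec before re-running rfile-info or the scan measurements.
    conn.tableOperations().compact("test_index", null, null, true, true);
  }
}

After the compaction rewrites the files, rfile-info gives the on-disk side of the comparison and the shell extension linked earlier gives the transport side.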
