Thanks, Jonathan! I've wondered about specific numbers on this topic when dealing with geohashes, so this is a very useful tool.
On Sun, Oct 25, 2015 at 11:22 AM, Jonathan Wonders <[email protected]> wrote:

> I have been able to put some more thought into this over the weekend and make initial observations on tables I currently have populated. Looking at the rfile-info for a few different tables, I noticed that one which has particularly small lexicographical deltas between keys costs an average of ~2.5 bits per key to store on disk. All of the data is stored in the row component of the key, and a full row is typically about 36 bytes. I wrote a little utility to recreate ScanResult objects for batches of sequential key-value pairs returned from a scanner and then used the TCompactProtocol to write the ScanResult to a byte array. Each key-value pair costs roughly 48 bytes, which makes sense given that every row is different and there will be some space required for the timestamps, visibilities, and other bookkeeping info.
>
> Another table I looked at has larger lexicographical deltas between keys and costs roughly 5 bytes per key to store on disk. This table is a reverse index with very large rows; each column within a row identifies data that resides in another table. Each column is roughly 12 bytes uncompressed. When encoded in a ScanResult, each key-value pair costs roughly 25 bytes, which makes sense since the row cost should be negligible for large batch sizes and the overhead from timestamps, visibilities, and other bookkeeping info is roughly the same as for the other table.
>
> Since compression depends heavily on both table design and the actual data, it seemed the next logical step would be to create a tool that the community could use to easily measure the compression ratio for ScanResult objects. So, I threw together a shell extension to wrap the utility that I previously described. It measures the compression ratio for the default strategy, Key.compress, as well as a few other simple strategies that seemed reasonable to test. The usage is almost the same as the scan command; it just prints out compression statistics rather than the data.
>
> It lives at https://github.com/jwonders/accumulo-experiments with branches for Accumulo 1.6.x, 1.7.x, and 1.8.x.
>
> Any feedback is welcome. I hope others find this useful for understanding this particular aspect of scan performance.
>
> V/R
> Jonathan
>
> On Thu, Oct 22, 2015 at 4:37 PM, Jonathan Wonders <[email protected]> wrote:
>
>> Josh,
>>
>> Thanks for the information. I did read through the discussion about compression of visibility expressions and columns within RFiles a while back, which got me thinking about some of this. It makes sense that gzip or lzo/snappy compression would have a very noticeable impact when there are columns or visibility expressions that are not compressed with RLE, even if neighboring rows have very small lexicographical deltas.
>>
>> I will put some thought into designing an experiment to evaluate whether or not there is any benefit to applying RLE during key-value transport from server to client. Even if it proves to be situationally beneficial, I think it could be implemented as a common iterator similar to the WholeRowIterator.
>>
>> Given the current compression strategy, I would expect better server-client transport compression when retrieving a single row with many columns,
>>
>> <key><value> : <id> []
>>
>> compared to many lexicographically close rows,
>>
>> <key><value><id> : []
>>
>> with the understanding that very large rows can lead to poor load balancing.
>>
>> V/R
>> Jonathan
>>
>> On Thu, Oct 22, 2015 at 11:54 AM, Josh Elser <[email protected]> wrote:
>>
>>> Jonathan Wonders wrote:
>>>
>>>> I have been digging into some details of Accumulo to model the disk and network costs associated with various types of scan patterns, and I have a few questions regarding compression.
>>>>
>>>> Assuming an inverted index table with rows following the pattern of
>>>>
>>>> <key><value><id>
>>>>
>>>> and a scan that specifies an exact key and value so as to constrain the range, it seems that the dominant factor in network utilization would be sending key-value pairs from the tablet server to the client, and a secondary factor would be transmitting data from non-local RFiles (assuming no caching).
>>>
>>> Sounds about right to me.
>>>
>>>> Is my understanding correct that the on-disk compression of this type of table is predominantly a function of the average number of differing bits between adjacent ids? Or, has anyone observed a significant improvement with gz or lzo vs no additional compression? I'm considering running some experiments to measure the difference for a few types of ids (uuid, snowflake-like, content-based hashes), but I'm curious if anyone else has done similar experiments.
>>>
>>> There is definitely a noticeable difference between gzip, lzo/snappy (they're about on par with speed and size), and no compression. I'm not sure what the deltas are off the top of my head, but I would expect that you see noticeable differences.
>>>
>>> You can also notice differences in "densities" depending on the lexicographical delta between two keys. Sequential keys that differ very little will result in very dense RFiles (lots of keys), whereas keys that vary greatly will result in fewer keys per file. We've had some discussions about the run-length encoding in RFile lately -- have you stumbled on that?
>>>
>>>> Given a scan that specifies a range for an exact key and value, is there any transport compression performed for tablet server to client communication beyond the Key.compress method, which appears to only compress equivalent rows, columns, etc., as opposed to those that share a common prefix?
>>>
>>> No, there are no compression algorithms applied to the data before sending it. By default we use the TCompactProtocol from Thrift. If you're interested in the specifics, Thrift has some documentation on the subject.
>>>
>>> I do recall some experiments I tried previously in which a logical "object" in Accumulo comprised multiple key-value pairs which were returned in one key-value pair (very similar to how the WholeRowIterator serializes things). In this case, I experimented with compressing the serialized Key-Values before returning from the Iterator stack. Sadly, I don't recall what the takeaway was, but I don't believe it was outright "bad" :)
>>>
>>>> It seems possible to implement a more specialized compression algorithm with the iterator framework, performing the decompression on the client side, but I'm curious if it could lead to general scan performance improvements if the default compression also involved run-length encoding.
>>>
>>> This would be really neat to test.
>>>
>>> Some rigor in designing an experiment would be the first step, IMO.
>>> If you want to spearhead this, I'm sure there are many who would be happy to give some feedback. It would be prudent to figure out what the variables in the "equation" would be, specifically the distribution of Key-Value pairs -- amount stored, compression on disk, cache sizes, query workload. Figure out what you want to test, what you will tweak, and define a hypothesis you want to test.
>>>
>>> You could consider using the continuous ingest framework for data generation and query workloads. You could also look at YCSB [1] for some inspiration. Either of these would make your life easier in not having to generate the read/write workloads.
>>>
>>> [1] https://github.com/brianfrankcooper/YCSB
>>>
>>>> Any insight on this subject is much appreciated.
>>>>
>>>> V/R
>>>> Jonathan
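
For anyone who wants to reproduce the measurement Jonathan describes above, here is a minimal sketch of the idea: rebuild a ScanResult from a batch of scanned entries and measure its TCompactProtocol-serialized size. The thrift class and package names (ScanResult, TKeyValue under org.apache.accumulo.core.data.thrift) are the ones I believe the 1.6.x/1.7.x line uses and may differ in other versions, and ScanResultSizeEstimator/serializedSize/batchSize are just illustrative names -- this is not the code from the linked repository.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.KeyValue;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.data.thrift.ScanResult;
import org.apache.accumulo.core.data.thrift.TKeyValue;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TCompactProtocol;

public class ScanResultSizeEstimator {

  // Rebuilds a ScanResult from up to batchSize scanned entries and returns its
  // TCompactProtocol-serialized size in bytes, for comparison against the
  // on-disk bytes-per-key reported by rfile-info.
  public static int serializedSize(Scanner scanner, int batchSize) throws TException {
    List<KeyValue> batch = new ArrayList<KeyValue>();
    for (Map.Entry<Key,Value> entry : scanner) {
      batch.add(new KeyValue(entry.getKey(), entry.getValue().get()));
      if (batch.size() >= batchSize) {
        break;
      }
    }

    // Key.compress applies the default wire-format de-duplication: key fields
    // that are identical to the previous key's are omitted from the TKeyValue.
    List<TKeyValue> compressed = Key.compress(batch);
    ScanResult result = new ScanResult(compressed, false /* more */);

    TSerializer serializer = new TSerializer(new TCompactProtocol.Factory());
    return serializer.serialize(result).length;
  }
}

Dividing the result by the batch size gives the per-key transport cost behind the ~48-byte and ~25-byte figures quoted above.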
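
To make the common-prefix point concrete: as noted in the thread, Key.compress appears to drop only fields that are byte-for-byte equal to the previous key, so rows that are lexicographically close but not identical get no benefit. Below is a toy front-coding sketch over raw row bytes; FrontCodedRows is a made-up name, and this is only meant to illustrate the shared-prefix idea, not to match Accumulo's wire format or RFile's relative-key encoding.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.WritableUtils;

// Toy front coding: each row is stored as (shared-prefix length, suffix), so a
// run of lexicographically close rows costs only a few bytes per row.
public class FrontCodedRows {

  public static byte[] encode(List<byte[]> sortedRows) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    byte[] previous = new byte[0];
    for (byte[] row : sortedRows) {
      int shared = sharedPrefixLength(previous, row);
      WritableUtils.writeVInt(out, shared);              // bytes reused from the previous row
      WritableUtils.writeVInt(out, row.length - shared); // length of the new suffix
      out.write(row, shared, row.length - shared);       // only the differing tail is written
      previous = row;
    }
    out.flush();
    return bytes.toByteArray();
  }

  private static int sharedPrefixLength(byte[] a, byte[] b) {
    int limit = Math.min(a.length, b.length);
    int i = 0;
    while (i < limit && a[i] == b[i]) {
      i++;
    }
    return i;
  }
}

Running it over rows from the two tables described above (small deltas vs. a reverse index) and over random UUIDs would give a quick feel for how much of the on-disk density comes purely from shared prefixes.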
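
And a rough sketch of the transport idea discussed above: serialize a batch of key-value pairs into a single value and gzip it, in the spirit of the WholeRowIterator, with the client undoing it. GzipRowCodec and its encode/decode methods are hypothetical names; a real server-side version would sit inside a SortedKeyValueIterator and handle seeking and batch boundaries, which this sketch ignores.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.KeyValue;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class GzipRowCodec {

  // Serializes a batch of key-value pairs and gzips the result into one Value.
  public static Value encode(List<KeyValue> batch) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
      out.writeInt(batch.size());
      for (KeyValue kv : batch) {
        Key key = kv.getKey();
        writeText(out, key.getRow());
        writeText(out, key.getColumnFamily());
        writeText(out, key.getColumnQualifier());
        writeText(out, key.getColumnVisibility());
        out.writeLong(key.getTimestamp());
        byte[] value = kv.getValue().get();
        out.writeInt(value.length);
        out.write(value);
      }
    }
    return new Value(bytes.toByteArray());
  }

  // Client-side inverse of encode: gunzips and rebuilds the key-value pairs.
  public static List<KeyValue> decode(Value encoded) throws IOException {
    List<KeyValue> batch = new ArrayList<KeyValue>();
    try (DataInputStream in = new DataInputStream(
        new GZIPInputStream(new ByteArrayInputStream(encoded.get())))) {
      int count = in.readInt();
      for (int i = 0; i < count; i++) {
        Key key = new Key(readText(in), readText(in), readText(in), readText(in), in.readLong());
        byte[] value = new byte[in.readInt()];
        in.readFully(value);
        batch.add(new KeyValue(key, value));
      }
    }
    return batch;
  }

  private static void writeText(DataOutputStream out, Text text) throws IOException {
    out.writeInt(text.getLength());
    out.write(text.getBytes(), 0, text.getLength());
  }

  private static Text readText(DataInputStream in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    return new Text(bytes);
  }
}

Since the decompression happens on the client after the batch comes back, any transport savings has to outweigh the extra CPU spent on both ends, which is exactly the trade-off an experiment like the one sketched in this thread would need to measure.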
