I have been digging into some details of Accumulo to model the disk and network costs associated with various types of scan patterns, and I have a few questions regarding compression.
Assume an inverted index table with rows following the pattern <key><value><id>, and a scan that specifies an exact key and value to constrain the range. It seems that the dominant factor in network utilization would be sending key-value pairs from the tablet server to the client, with a secondary factor being transmitting data from non-local RFiles (assuming no caching).

Is my understanding correct that the on-disk compression of this type of table is predominantly a function of the average number of differing bits between adjacent ids? Or has anyone observed a significant improvement with gz or lzo versus no additional compression? I'm considering running some experiments to measure the difference for a few types of ids (uuid, snowflake-like, content-based hashes; a rough harness is sketched in the P.S. below), but I'm curious whether anyone else has done similar experiments.

Given a scan that specifies a range for an exact key and value, is any transport compression performed for tablet server to client communication beyond the Key.compress method, which appears to compress only equivalent rows, columns, etc., rather than those that share a common prefix?

It seems possible to implement a more specialized compression scheme with the iterator framework, performing the decompression on the client side, but I'm curious whether general scan performance could improve if the default compression also involved run-length encoding.

Any insight on this subject is much appreciated.

V/R
Jonathan
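
P.S. For concreteness, here is roughly the kind of measurement harness I have in mind for the id comparison. Everything in it is made up for illustration: the snowflake-like generator is just a millisecond timestamp plus a sequence number, the content hash is SHA-256 of a synthetic document name, and average shared-prefix bytes between lexicographically adjacent ids is only a crude proxy for what the RFile relative-key encoding and the block compressor would actually see.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

public class IdPrefixStats {

    // Average number of leading bytes shared by lexicographically adjacent ids,
    // a rough stand-in for how compressible the <id> portion of the row is.
    static double avgSharedPrefix(List<String> ids) {
        List<String> sorted = new ArrayList<>(ids);
        Collections.sort(sorted);
        long shared = 0;
        for (int i = 1; i < sorted.size(); i++) {
            byte[] a = sorted.get(i - 1).getBytes(StandardCharsets.UTF_8);
            byte[] b = sorted.get(i).getBytes(StandardCharsets.UTF_8);
            int j = 0;
            while (j < a.length && j < b.length && a[j] == b[j]) {
                j++;
            }
            shared += j;
        }
        return (double) shared / (sorted.size() - 1);
    }

    public static void main(String[] args) throws Exception {
        int n = 100_000;
        List<String> uuids = new ArrayList<>();
        List<String> snowflakeLike = new ArrayList<>();
        List<String> hashes = new ArrayList<>();
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        long base = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            uuids.add(UUID.randomUUID().toString());
            // crude snowflake-like id: millisecond timestamp + sequence number
            snowflakeLike.add(String.format("%013d%06d", base + (i / 100), i % 100));
            // crude content-based id: hex-encoded SHA-256 of a synthetic name
            byte[] digest = sha.digest(("doc-" + i).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte x : digest) hex.append(String.format("%02x", x));
            hashes.add(hex.toString());
        }
        System.out.printf("uuid           : %.2f shared bytes%n", avgSharedPrefix(uuids));
        System.out.printf("snowflake-like : %.2f shared bytes%n", avgSharedPrefix(snowflakeLike));
        System.out.printf("content hash   : %.2f shared bytes%n", avgSharedPrefix(hashes));
    }
}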

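P.P.S. In case it helps to be concrete about the scan pattern I'm describing, it is essentially the following (using the ZooKeeperInstance/Connector client API). The table name, the \0 delimiters in the row, and the connection details are all placeholders; the point is just that every matching <key><value><id> row comes back to the client as a full Key/Value pair, which is the transfer I'm trying to model.

import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class IndexScanSketch {
    public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
            .getConnector("user", new PasswordToken("secret"));

        // Rows are <key>\0<value>\0<id>, so an exact key and value constrain
        // the scan to a single row prefix and only the matching ids are read.
        String rowPrefix = "color" + "\0" + "red" + "\0";

        Scanner scanner = conn.createScanner("invertedIndex", Authorizations.EMPTY);
        scanner.setRange(Range.prefix(new Text(rowPrefix)));

        long entries = 0;
        long rowBytes = 0;
        for (Map.Entry<Key,Value> e : scanner) {
            // each matching id is shipped to the client as a full Key/Value pair
            entries++;
            rowBytes += e.getKey().getRow().getLength() + e.getValue().getSize();
        }
        System.out.println(entries + " entries, ~" + rowBytes + " row/value bytes returned");
    }
}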