Hi Nicolas, Apologies for the slow response. Answers inline:
On Mon, Dec 5, 2016 at 8:21 PM, Nicolas Fouché <[email protected]> wrote:

> Hi,
>
> I'm evaluating Kudu and I'd need some hints about column encoding and
> compression.
>
> A- Does it make sense adding LZ4 compression to a field with Dictionary
> Encoding ?
>

This has a slightly complex answer. If the column has low cardinality, then
dictionary encoding stores the codeword blocks (i.e., the numeric indexes into
the dictionary) using bitshuffle encoding, which is inherently LZ4-compressed.
So, adding LZ4 on top will do nothing except add overhead.

The complexity comes in that the dictionary encoding implementation
automatically falls back to "PLAIN" if the cardinality is too high to create
an effective dictionary. In that case, the LZ4 compression would be useful
(just as it would on PLAIN).

Given this, I'm hoping to work on a patch very soon which allows you to
specify LZ4 compression and have it take effect only in the fall-back case.
See KUDU-1600 for more info on this.

> B- Does it make sense adding LZ4 compression to a field with Run-Length
> Encoding ?
>

Probably wouldn't help much. LZ4 only compresses repeated sequences, and
typically the only cross-row sequences you'd have in an integer column would
be runs, which are already well compressed by RLE. Something like ZLIB
compression (which does Huffman coding) would be effective on top of RLE, but
at a pretty high CPU cost.

> C- I have a non-key column with randomly distributed INT32 numbers, I
> guess I won't add an encoding. But what about compression ? Would LZ4 make
> sense ? Would it slow down aggregations (`SUM`) ?
>

If they're truly randomly distributed, then no compression or encoding will
be able to do much with them. If they're randomly distributed but tend to be
clustered together into a particular range within the whole INT32 domain
(e.g., something like timestamps), then BITSHUFFLE is probably a good bet.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
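
P.S. For question C, a rough way to see the difference between "truly random" and "random within a narrow range" is the sketch below. It is not Kudu code. It assumes Python's standard zlib module as a stand-in for LZ4, and a plain byte transposition as a simplified stand-in for bitshuffle (which works at the bit level). All function names and the sample value range are made up for illustration.

```python
import random
import struct
import zlib


def pack_int32(values):
    """Serialize values as consecutive little-endian 32-bit integers."""
    return struct.pack("<%di" % len(values), *values)


def byte_shuffle(raw, width=4):
    """Group byte 0 of every value, then byte 1, and so on.

    Simplified stand-in for bitshuffle: it transposes bytes, not bits.
    """
    return b"".join(raw[i::width] for i in range(width))


def ratio(raw):
    """Compressed size divided by original size (zlib standing in for LZ4)."""
    return len(zlib.compress(raw)) / len(raw)


rng = random.Random(0)  # fixed seed so repeated runs use the same data
n = 50_000

# Case 1: values spread over the whole INT32 domain.
uniform = pack_int32([rng.randint(-2**31, 2**31 - 1) for _ in range(n)])

# Case 2: random values confined to a narrow range (hypothetical
# timestamp-like values within one hour of an arbitrary base).
base = 1_480_000_000
clustered = pack_int32([base + rng.randint(0, 3600) for _ in range(n)])

results = {
    "uniform_plain": ratio(uniform),
    "uniform_shuffled": ratio(byte_shuffle(uniform)),
    "clustered_plain": ratio(clustered),
    "clustered_shuffled": ratio(byte_shuffle(clustered)),
}

for name, value in results.items():
    print(f"{name:20s} {value:.3f}")
```

The uniform column should stay close to a ratio of 1.0 either way, while the clustered column should shrink noticeably once the near-constant high bytes are grouped together. Exact numbers will differ with LZ4 and real bitshuffle, so treat this only as a way to get intuition before testing on a real table.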
