Hi,
  The RLE in ORC is a tradeoff (as is all compression) between tight
representations for commonly occurring patterns and longer representations
for rarely occurring patterns. The question at hand is how to use the bits
available to reduce the average size of the column. In Hive 0.12, ORC
gained a second version of the RLE, so I'll split out the two versions:

ORC RLEv1 (max run = 130):

   1 million integer 0: 7692 copies of 7f 00 00 followed by 25 00 00 =
   23,079 bytes (the control byte stores the run length minus 3, so the
   final run of 40 values is 0x25)
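
The arithmetic behind the RLEv1 figure can be checked with a small sketch. This is plain Python, not ORC code; the helper name and its parameters are made up for illustration, and it assumes every run of identical values costs the same fixed number of bytes (3 here: control byte, delta byte, one-byte base value).

```python
def run_count_and_bytes(n_values, max_run, bytes_per_run):
    """Return (runs, total_bytes) for n_values identical values.

    Hypothetical helper: assumes each run, full or partial, costs
    exactly bytes_per_run bytes.
    """
    runs = -(-n_values // max_run)  # ceiling division
    return runs, runs * bytes_per_run


# RLEv1 as described above: at most 130 values per run, 3 bytes per run.
runs, size = run_count_and_bytes(1_000_000, max_run=130, bytes_per_run=3)
print(runs, size)  # 7693 23079 (7692 full runs plus one partial run)
```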

ORC RLEv2 (max run = 512; the 9-bit length field stores the run length
minus 1):

  1 million integer 0: 1953 copies of c1 ff 00 00 followed by c0 3f 00 00 =
  7,816 bytes

With generic compression (ZLIB, Snappy) on top of this, the data shrinks
even further.

So, back to the original question: the limit is a tradeoff between
complexity and the size of the common cases. The run length in both
versions has a fixed number of bits, and if we had used 32 bits for the
repetition length, the more typical case of 5 to 10 repetitions would have
been far worse.
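
To put rough numbers on that, here is a deliberately simplified model in Python. It is not ORC's actual byte layout: each run is assumed to cost one length field plus a one-byte value, and the function name and parameters are hypothetical.

```python
def column_bytes(n_values, run_length, length_field_bytes, value_bytes=1):
    """Bytes for n_values stored as equal-sized runs of run_length values.

    Simplified model (an assumption, not the ORC format): each run costs
    length_field_bytes for the repeat count plus value_bytes for the value.
    """
    runs = -(-n_values // run_length)  # ceiling division
    return runs * (length_field_bytes + value_bytes)


# Compare a 1-byte and a 4-byte (32-bit) length field on short runs.
for run_length in (5, 10):
    narrow = column_bytes(1_000_000, run_length, length_field_bytes=1)
    wide = column_bytes(1_000_000, run_length, length_field_bytes=4)
    print(run_length, narrow, wide)
# 5 400000 1000000
# 10 200000 500000
```

Under these assumptions the 32-bit length field makes the short-run case 2.5 times larger, while it only pays off on runs far longer than the typical ones.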


On Mon, Nov 11, 2013 at 6:22 AM, qihua wu wrote:

> In Vertica, if I have a sorted column and the same value repeats 1M times,
> it uses very little storage because it only stores (value, 1M). But in ORC,
> it looks like the max run length is less than 200 (I'm not sure, but it is
> on the order of hundreds). Why restrict the max run length?
>
