As Owen noted, the max run for version 0.11 is 130. The minimum run for RLE to be used is 3, so the 7-bit length field (0 to 127) is stored with an offset of 3, and the longest run it can represent is 127 + 3 = 130.
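A rough sketch of that arithmetic in Python, in case it helps. The helper names max_run and constant_column_bytes are made up for illustration and are not part of the ORC code; the per-run costs (3 bytes for RLEv1, 4 bytes for RLEv2) are simply read off the byte sequences in Owen's message below, and the column is assumed to hold one repeated value.

```python
import math


def max_run(length_bits: int, min_run: int) -> int:
    """Longest run a length field can express when the stored value is offset by min_run."""
    return (2 ** length_bits - 1) + min_run


def constant_column_bytes(n_values: int, run_cap: int, bytes_per_run: int) -> int:
    """Encoded size of a column of identical values, assuming a fixed byte cost per run."""
    runs = math.ceil(n_values / run_cap)  # the last run may be shorter than run_cap
    return runs * bytes_per_run


print(max_run(7, 3))                              # 130
print(constant_column_bytes(1_000_000, 130, 3))   # 23079 (RLEv1 figures)
print(constant_column_bytes(1_000_000, 511, 4))   # 7828  (RLEv2 figures)
```

The two sizes match the 23,079 and 7,828 bytes quoted below, before any generic compression is applied on top.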
Thanks
Prasanth Jayachandran

On Nov 11, 2013, at 9:51 AM, Owen O'Malley <[email protected]> wrote:

> Hi,
> The RLE in ORC is a tradeoff (as is all compression) between tight
> representations for commonly occurring patterns and longer representations
> for rarely occurring patterns. The question at hand is how to use the bits
> available to reduce the average size of the column. In Hive 0.12, ORC gained
> a second version of the RLE, so I'll split out the two versions:
>
> ORC RLEv1 (max run = 130):
>
> 1 million integer 0: 7692 copies of 7f 00 00 followed by 24 00 00 = 23,079
> bytes
>
> ORC RLEv2 (max run = 511):
>
> 1 million integer 0: 1956 copies of c1 ff 00 00 followed by c1 e4 00 00 =
> 7,828 bytes
>
> With generic compression (ZLIB, Snappy) on top of this, it shrinks even
> smaller.
>
> So back to the original question, it was a tradeoff between complexity and
> size of common cases. The length of the run in both cases has a fixed number
> of bits and if we had used 32 bits of the repetition length, the more typical
> case of 5 to 10 repetitions would have been far worse.
>
>
> On Mon, Nov 11, 2013 at 6:22 AM, qihua wu <[email protected]> wrote:
> In vertica, if I have a column sorted, and the same value repeat 1M times, it
> only used very small storage as it only stores (value, 1M). But in ORC, looks
> like the max length is less than 200 ( not very sure, but at about the same
> level of hundreds), why restrict the max run length?
