Hi Dejun,
Currently the encoding mode is decided on the buffered data (up to 512 
values) as a whole. Only short repeat (delta 0 and run length <10 IIRC) is 
encoded immediately. For all other encodings, all 512 buffered values are 
analyzed together to make a decision. The sequence you mentioned, 
1 2 3 4 5 6 7 3, is not monotonic, and hence direct encoding is chosen. The 
encoding headers have 9 bits to encode the run length, and hence 512 literals 
are analyzed together to minimize the header-to-data overhead. Patched base is 
chosen only when there is a sudden spike in bit requirements between the 95th 
and 100th percentile values. Choosing this way results in smaller overhead for 
encoding the header information. 
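
The selection described above can be sketched roughly as follows. This is a 
simplified illustration in Python, not the actual RunLengthIntegerWriterV2 
logic: the function names, the short-repeat bounds (3 to 10 values) and the 
spike_bits threshold are assumptions made for the example, and zigzag 
encoding of signed values is ignored.

```python
import math

MAX_RUN = 512  # 9 header bits for the run length, as described above


def bit_width(n):
    """Bits needed to store a non-negative integer (at least 1)."""
    return max(1, n.bit_length())


def percentile_bits(values, p):
    """Bit width that is enough for the fraction p of the values."""
    widths = sorted(bit_width(v) for v in values)
    index = max(0, math.ceil(p * len(widths)) - 1)
    return widths[index]


def choose_encoding(values, spike_bits=1):
    """Pick an encoding for one buffered run (hypothetical helper).

    spike_bits is an assumed threshold: how many extra bits the largest
    values must need, compared with the 95th percentile, to count as a spike.
    """
    if not 1 <= len(values) <= MAX_RUN:
        raise ValueError("a run holds between 1 and 512 values")

    deltas = [b - a for a, b in zip(values, values[1:])]

    # Short repeat: a short run of identical values (bounds are assumed).
    if 3 <= len(values) <= 10 and all(d == 0 for d in deltas):
        return "short_repeat"

    # Delta: the whole run is monotonic (never decreases or never increases).
    if deltas and (all(d >= 0 for d in deltas) or all(d <= 0 for d in deltas)):
        return "delta"

    # Patched base: subtract the minimum, then look for a spike in bit
    # requirements between the 95th and 100th percentile values.
    base = min(values)
    reduced = [v - base for v in values]
    p95 = percentile_bits(reduced, 0.95)
    p100 = percentile_bits(reduced, 1.0)
    if p100 - p95 > spike_bits:
        return "patched_base"

    # Everything else is written with direct encoding.
    return "direct"


print(choose_encoding([1, 2, 3, 4, 5, 6, 7, 3]))
print(choose_encoding([1, 2, 3, 4, 5, 6, 7]))
print(choose_encoding([5, 3] * 50 + [1_000_000]))
```

In this sketch the first call falls through to direct encoding because the 
run is not monotonic and has no outlier, the second is monotonic, and the 
third has a single large outlier among small values.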
Thanks
Prasanth

                _____________________________
From: Teng, dejun <[email protected]>
Sent: Wednesday, December 28, 2016 1:55 AM
Subject: about rle encoding method selection
To:  <[email protected]>




Hi,

 

I’m currently working on a project that requires me to dig deep into the RLE 
encoding of ORC. ORC’s version 2 RLE supports 4 encoding methods (direct, 
repeat, delta and patched), as defined in the documentation. My question is: 
is there a standard for selecting among these encoding methods?

 

For example, a list of integers such as 1 2 3 4 5 6 7 3 could be encoded with 
direct encoding for all 8 integers, or with delta encoding for the first 7 
integers and direct encoding for the last one. The latter has a better 
compression ratio. 

I’ve read the source code of the class in RunLengthIntegerWriterV2.java. It 
seems that the writer always tracks whether any repeated values were written 
recently. If so, it calls writeValues() to write the values before the 
repeated values in the proper encoding, and then starts another run with the 
newly written repeated values. So the integer list given above would be 
encoded with direct encoding for all 8 integers. However, for the integer 
list 1 2 3 4 5 6 7 3 3 3, the first 7 integers would be encoded with delta 
encoding and the last three with short repeat encoding.
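
The splitting behaviour described above can be modelled with a short Python 
sketch. It is only an illustration of the description in this message, not 
the writer's actual code: split_on_repeats and the min_repeat threshold of 3 
are assumptions made for the example, and the 512-value buffer limit and the 
maximum short-repeat length are ignored.

```python
from itertools import groupby


def split_on_repeats(values, min_repeat=3):
    """Split values into literal segments and repeat segments.

    A run of at least min_repeat equal values becomes its own "repeat"
    segment; the values buffered before it are flushed as a "literals"
    segment. min_repeat=3 is an assumed threshold for this sketch.
    """
    segments = []
    literals = []  # values waiting to be flushed together
    for _, group in groupby(values):
        run = list(group)
        if len(run) >= min_repeat:
            # Flush whatever came before the repeated values.
            if literals:
                segments.append(("literals", literals))
                literals = []
            segments.append(("repeat", run))
        else:
            # Too short to count as a repeat: keep buffering.
            literals.extend(run)
    if literals:
        segments.append(("literals", literals))
    return segments


print(split_on_repeats([1, 2, 3, 4, 5, 6, 7, 3]))
print(split_on_repeats([1, 2, 3, 4, 5, 6, 7, 3, 3, 3]))
```

Under these assumptions the first list stays in one literal segment of 8 
values, while the second is split into the first 7 values followed by a 
repeat segment of three 3s.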

 

I am puzzled by the selection of encoding methods. Is there a fixed policy 
for it, such as a standard finite state machine that describes the whole 
process? And what is the principle behind the selection? If it is to compress 
the data as much as possible, the released code is not optimal, as the case 
above shows. For my project, I need an official definition of the encoding 
policy of ORC’s RLE2, but I cannot find any hint on the official website. I 
would appreciate an official response on this issue. Is there a standard 
encoding policy? Or is any policy fine as long as it uses the four defined 
encoding methods?

 

 

Thanks,

Dejun 

 


        
