Hi, I'm currently working on a project that requires me to dig deep into ORC's RLE encoding. Version 2 of ORC's RLE supports four encodings, as defined in the documentation: short repeat, direct, patched base and delta. My question is: is there a standard for selecting among these encodings?
For example, the integer list 1 2 3 4 5 6 7 3 could be encoded with direct encoding for all 8 integers, or with delta encoding for the first 7 integers and direct encoding for the last one. The latter has a better compression ratio.

I've read the source of RunLengthIntegerWriterV2.java. It seems that the writer always tracks whether repeated values were written recently. When it sees a repeat, it calls writeValues() to write the values preceding the repeat with a suitable encoding, then starts a new run with the repeated values. So the list above is encoded with direct encoding for all 8 integers. However, for the list 1 2 3 4 5 6 7 3 3 3, the first 7 integers are encoded with delta encoding and the last three with short repeat encoding.

I am puzzled by how the encoding is selected. Is there a fixed policy for it, such as a standard finite state machine that describes the whole process? And what is the principle behind the selection? If the goal is to make the data as small as possible, the released code is not optimal, as the first example above shows.

For my project, I need an official definition of the encoding policy of ORC's RLE v2, but I cannot find any hint of one on the official website. I would appreciate an official response on this. Is there a standard encoding policy? Or is any policy fine as long as it uses the four defined encodings?

Thanks,
Dejun
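
For reference, the behaviour described above can be modelled with the minimal Java sketch below. It is a simplified illustration under stated assumptions, not the logic of RunLengthIntegerWriterV2.java: the class RleSegmentSketch and its methods are hypothetical names, a repeat is assumed to be detected after 3 equal values in a row, only constant-step runs longer than 3 values are labelled as delta, and patched base is ignored entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of the run splitting described in this post; NOT the ORC writer. */
public class RleSegmentSketch {
    // Assumption: a repeat run is recognised once 3 equal values appear in a row.
    static final int MIN_REPEAT = 3;

    /** Hypothetical classifier: labels one run, ignoring patched base. */
    static String classify(long[] run) {
        boolean allEqual = true;
        boolean fixedDelta = true;
        for (int i = 1; i < run.length; i++) {
            if (run[i] != run[0]) allEqual = false;
            if (i >= 2 && run[i] - run[i - 1] != run[1] - run[0]) fixedDelta = false;
        }
        // Short repeat covers 3 to 10 identical values.
        if (allEqual && run.length >= MIN_REPEAT && run.length <= 10) return "SHORT_REPEAT";
        // Assumption: very short runs are not worth analysing and fall back to direct.
        if (fixedDelta && run.length > MIN_REPEAT) return "DELTA";
        return "DIRECT";
    }

    static void emit(List<String> runs, long[] values, int from, int to) {
        runs.add(classify(Arrays.copyOfRange(values, from, to)) + " x" + (to - from));
    }

    /** Splits the input into runs: flush the prefix when a repeat appears. */
    static List<String> segment(long[] values) {
        List<String> runs = new ArrayList<>();
        int start = 0;   // first index of the pending run
        int repeat = 1;  // length of the trailing streak of equal values
        for (int i = 1; i < values.length; i++) {
            boolean same = values[i] == values[i - 1];
            if (!same && repeat >= MIN_REPEAT) {
                // The pending run is a pure repeat; a different value ends it.
                emit(runs, values, start, i);
                start = i;
            }
            repeat = same ? repeat + 1 : 1;
            int tailStart = i - MIN_REPEAT + 1;
            if (repeat == MIN_REPEAT && tailStart > start) {
                // A repeat was just detected: write everything before it as one run.
                emit(runs, values, start, tailStart);
                start = tailStart;
            }
        }
        if (values.length > 0) emit(runs, values, start, values.length);
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(segment(new long[] {1, 2, 3, 4, 5, 6, 7, 3}));
        // prints [DIRECT x8]
        System.out.println(segment(new long[] {1, 2, 3, 4, 5, 6, 7, 3, 3, 3}));
        // prints [DELTA x7, SHORT_REPEAT x3]
    }
}
```

In this model the trailing 3 in the first list never triggers a flush, so the whole list stays in one run that is no longer constant-step and ends up as direct. In the second list the three trailing 3s trigger a flush of the first 7 values, which are then recognised as a delta run.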
