> the decoding becomes unnecessarily slow, especially after I vectorized all
> decoding functions, decoding the header of each run becomes the bottleneck.
> On the other hand, it doesn’t make the compression ratio bigger for many
> cases.
> I tried to disable this encoding method and re-encoding the lineitem table,
> which is the biggest table of TPC-H benchmark, I find most of the columns are
> even smaller without the short repeat encoding.
…
> store only the differences of all values. I think this is a common case. I
> also added this feature in my test and the sizes of some columns are
> significantly smaller.

Those two statements sound like you've been doing active modifications to the encoding loops for ORC.

I don't think the integer encoding in ORC is a closed chapter, just in a temporary state of stability & I've been holding back most of my changes till we put all of ORC into one repo. Specifically, work on improving timestamp streams for click-streams (which fits the base + direct encoding case) has been on my TODO list for a while.

If you have built a faster encoding loop or data layout, I encourage you to contribute to ORC & I will definitely review/benchmark any improvements to help you get your changes in.

Cheers,
Gopal
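
For anyone following the thread, here is a rough sketch of the "base + direct" idea as I read it: store one base value per run and then only the offsets from it, which need far fewer bits for clustered values such as click-stream timestamps. This is an illustration only, not ORC's actual wire format or code; the function names and the choice of the minimum as the base are assumptions made for the example.

```python
from typing import List, Tuple


def encode_base_plus_offsets(values: List[int]) -> Tuple[int, int, List[int]]:
    """Hypothetical helper: return (base, bit_width, offsets).

    The base is assumed to be the minimum of the run, so every offset is
    non-negative and can be bit-packed at a fixed width.
    """
    if not values:
        raise ValueError("values must be non-empty")
    base = min(values)
    offsets = [v - base for v in values]
    # Width needed for the largest offset; at least 1 bit so an all-equal
    # run still has a well-defined packing width.
    bit_width = max(1, max(offsets).bit_length())
    return base, bit_width, offsets


def decode_base_plus_offsets(base: int, offsets: List[int]) -> List[int]:
    """Hypothetical helper: rebuild the original values from base + offsets."""
    return [base + off for off in offsets]


# Millisecond timestamps that fall within a few seconds of each other.
timestamps = [1700000000000, 1700000000250, 1700000000100, 1700000003000]
base, width, offsets = encode_base_plus_offsets(timestamps)
raw_width = max(t.bit_length() for t in timestamps)
print(f"base={base} offsets={offsets}")
print(f"bits per value: {raw_width} raw vs {width} as offsets")
```

Decoding is a single add per value with no per-value branching, so a loop of this shape should be straightforward to vectorize.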
