> the decoding becomes unnecessarily slow, especially after I vectorized all 
> decoding functions, decoding the header of each run becomes the bottleneck. 
> On the other hand, it doesn’t make the compression ratio bigger for many 
> cases. 
> I tried to disable this encoding method and re-encoding the lineitem table, 
> which is the biggest table of TPC-H benchmark, I find most of the columns are 
> even smaller without the short repeat encoding. 
…
> store only the differences of all values. I think this is a common case. I 
> also added this feature in my test and the sizes of some columns are 
> significantly smaller. 
 
Those two statements sound like you've been actively modifying the encoding 
loops for ORC.

I don't think the integer encoding in ORC is a closed chapter; it is just in a 
temporary state of stability & I've been holding back most of my changes till 
we put all of ORC into one repo.

Specifically, work on improving timestamp streams for click-streams (which fits 
the base + direct encoding case) has been on my TODO list for a while.
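
To make the base + direct idea concrete, here is a minimal sketch in Python. 
It is not ORC's actual encoder: the function names and the sample timestamps 
are made up for illustration, and it assumes the stream is a plain list of 
integers. It stores the minimum as the base and keeps only the non-negative 
offsets, which need far fewer bits than the raw values.

```python
def encode_base_direct(values):
    """Split values into a base (the minimum) and non-negative offsets."""
    if not values:
        raise ValueError("need at least one value")
    base = min(values)
    offsets = [v - base for v in values]
    # Bits needed for the largest offset; at least 1 so zero is representable.
    width = max(1, max(offsets).bit_length())
    return base, width, offsets


def decode_base_direct(base, offsets):
    """Rebuild the original values by adding the base back to each offset."""
    return [base + off for off in offsets]


# Hypothetical click-stream timestamps in epoch milliseconds.
clicks = [
    1_450_000_000_000,
    1_450_000_000_250,
    1_450_000_001_900,
    1_450_000_000_700,
]
base, width, offsets = encode_base_direct(clicks)
raw_width = max(clicks).bit_length()
assert decode_base_direct(base, offsets) == clicks
print(f"raw width: {raw_width} bits, offset width: {width} bits")
```

For these sample values each offset fits in 11 bits, against 41 bits for the 
raw timestamps, which is why click-stream timestamps look like a good fit for 
this case.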

If you have built a faster encoding loop or data layout, I encourage you to 
contribute to ORC & I will definitely review/benchmark any improvements to help 
you get your changes in.

Cheers,
Gopal

