Re: about rle encoding method selection

Teng, dejun Fri, 13 Jan 2017 11:59:49 -0800

Hi Gopal,

Thanks for your response.
I'm working on a project that uses ORC as its data storage format. I have
implemented my own version of the ORC decoder, based on the originally released
one, with vectorization enabled. I also modified the original writer slightly
to test some of the improvements I suggested in my last email.
It would be my pleasure to contribute to ORC if possible.


Thanks,
Dejun 




From: Gopal Vijayaraghavan [mailto:[email protected]] On Behalf Of Gopal Vijayaraghavan
Sent: January 12, 2017 4:53
To: [email protected]
Cc: Teng, dejun
Subject: Re: RE: about rle encoding method selection

> the decoding becomes unnecessarily slow; especially after I vectorized all 
> decoding functions, decoding the header of each run becomes the bottleneck. 
> On the other hand, it doesn’t improve the compression ratio in many 
> cases. 
> I tried disabling this encoding method and re-encoding the lineitem table, 
> which is the biggest table in the TPC-H benchmark, and found that most of the 
> columns are even smaller without the short repeat encoding. 
…
> store only the differences of all values. I think this is a common case. I 
> also added this feature in my test and the sizes of some columns are 
> significantly smaller. 
 
Those two statements sound like you've been actively modifying the 
encoding loops for ORC.

I don't think the integer encoding in ORC is a closed chapter; it is just in a 
temporary state of stability, and I've been holding back most of my changes 
until we put all of ORC into one repo.

Specifically, work on improving timestamp streams for click-streams (which 
fits the base + direct encoding case) has been on my TODO list for a while.

If you have built a faster encoding loop or data layout, I encourage you to 
contribute to ORC & I will definitely review/benchmark any improvements to help 
you get your changes in.

Cheers,
Gopal
