Dejun,
    This is great. We absolutely can and should improve the RLE. We have
had two versions of the RLE so far: v1 in the first version of ORC, and v2
later in 2013. Given the three additional years, we have a lot more
experience with storing real-world data in ORC. As you correctly point out,
we need to push the vectorization all the way down into the RLE
implementations.
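
For concreteness, here is a rough sketch of the shape that could take
(the names are illustrative, not the actual ORC API): moving from a
value-at-a-time decoder to a batch-oriented one amortizes the run-header
decoding and the virtual-call overhead across a whole vector of values.

    // Illustrative interfaces only -- not the real ORC classes.
    #include <cstdint>

    // Value-at-a-time: one virtual call, plus run-header bookkeeping,
    // for every single decoded value.
    class RleDecoder {
     public:
      virtual ~RleDecoder() = default;
      virtual int64_t next() = 0;
    };

    // Batch-oriented: decode the run header once, then unpack the run
    // body into `values` in a tight loop the compiler can vectorize.
    // Returns the number of values actually written.
    class BatchRleDecoder {
     public:
      virtual ~BatchRleDecoder() = default;
      virtual uint64_t nextBatch(int64_t* values, uint64_t count) = 0;
    };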

   Another point to consider is the bit encoders. Right now they have a lot
of low-hanging fruit for improving both the encoding and the performance.
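
As one illustrative example of that fruit (hypothetical code, not the
current implementation): replacing a generic bit-by-bit reader with
specialized unpackers for the common bit widths removes the per-value
shift and branch bookkeeping, and leaves a loop the compiler can
auto-vectorize.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical fast path for bit width == 4: two values per input
    // byte and no state carried across bytes, so the loop vectorizes.
    // Assumes values are packed most-significant nibble first.
    static void unpack4(const uint8_t* in, int64_t* out,
                        size_t numPairs) {
      for (size_t i = 0; i < numPairs; ++i) {
        out[2 * i]     = in[i] >> 4;    // high nibble
        out[2 * i + 1] = in[i] & 0x0f;  // low nibble
      }
    }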

.. Owen


On Thu, Jan 12, 2017 at 1:52 AM, Gopal Vijayaraghavan <[email protected]>
wrote:

> > the decoding becomes unnecessarily slow; especially after I vectorized
> > all the decoding functions, decoding the header of each run became the
> > bottleneck. On the other hand, it does not improve the compression ratio
> > in many cases.
> > I tried disabling this encoding method and re-encoding the lineitem
> > table, the biggest table in the TPC-H benchmark, and found that most of
> > the columns are actually smaller without the short repeat encoding.
> …
> > store only the differences of all values. I think this is a common case.
> > I also added this feature in my test, and the sizes of some columns are
> > significantly smaller.
>
> Those two statements sound like you've been actively modifying the
> encoding loops in ORC.
>
> I don't think the integer encoding in ORC is a closed chapter, just in a
> temporary state of stability, & I've been holding back most of my changes
> till we put all of ORC into one repo.
>
> Specifically, work on improving timestamp streams for click-streams
> (which fits the base + direct encoding case) has been on my TODO list for
> a while.
>
> If you have built a faster encoding loop or data layout, I encourage you
> to contribute to ORC & I will definitely review/benchmark any improvements
> to help you get your changes in.
>
> Cheers,
> Gopal
