Gang,
   When decimal was first introduced in Hive, it was infinite precision,
so ORC had to support that. You should look at the discussion on
https://issues.apache.org/jira/browse/ORC-161, but you are absolutely
right that we should create a new encoding for decimal that doesn't encode
the scale per value. We should also use RLE for the values.
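
To make this concrete, here is a rough sketch of the value path for
precisions that fit in a long; the names are illustrative, not a spec:

import java.math.BigDecimal;

final class DecimalV2Sketch {
  // Writer side: normalize the value to the scale from the type in the
  // file footer and emit only the unscaled long, which can then go
  // through signed-integer RLE. No SECONDARY stream is needed.
  static long encode(BigDecimal value, int footerScale) {
    // setScale without a rounding mode throws if rounding would be
    // required, which keeps the encoding lossless.
    return value.setScale(footerScale).unscaledValue().longValueExact();
  }

  // Reader side: rebuild the decimal from the RLE-decoded long and the
  // footer scale.
  static BigDecimal decode(long unscaled, int footerScale) {
    return BigDecimal.valueOf(unscaled, footerScale);
  }

  public static void main(String[] args) {
    int footerScale = 2;                      // scale from the footer type
    BigDecimal in = new BigDecimal("12.3");   // scale 1 on the wire today
    long unscaled = encode(in, footerScale);  // 1230, ready for RLE
    System.out.println(decode(unscaled, footerScale));  // prints 12.30
  }
}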

.. Owen

On Thu, Mar 16, 2017 at 10:17 PM, Wu Gang <[email protected]> wrote:

> Hi,
>
> I'm Gang Wu, an inactive Spark contributor, and I'm now curious about the
> design of the decimal type in ORC. From the documentation and the Java
> code, I think it works as follows; please correct me if I'm wrong:
>
> 1. A decimal value is at most 127 bits for the magnitude plus 1 bit for
> the sign, so at most 128 bits in total;
> 2. Although the precision and scale of a decimal column are stored in the
> file footer, we still write the scale of every element in the SECONDARY
> stream using signed-integer RLE. The scale written there is independent of
> the scale stored in the file footer: it can be the same or totally
> different. (A toy reader model is sketched below.)
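>
> As a toy model of how I read the current layout (the stream names are
> from the spec; everything else is illustrative):
>
> import java.math.BigDecimal;
> import java.math.BigInteger;
>
> final class CurrentDecimalModel {
>   // DATA carries the unscaled value (base-128 varint on the wire) and
>   // SECONDARY carries a per-value scale decoded with signed-integer RLE.
>   static BigDecimal readOne(BigInteger unscaledFromData, long scaleFromSecondary) {
>     // The footer scale is never consulted here; whatever scale was
>     // written for this value wins.
>     return new BigDecimal(unscaledFromData, (int) scaleFromSecondary);
>   }
>
>   public static void main(String[] args) {
>     // Two rows with the same unscaled value but different written scales.
>     System.out.println(readOne(BigInteger.valueOf(123), 1)); // 12.3
>     System.out.println(readOne(BigInteger.valueOf(123), 3)); // 0.123
>   }
> }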
>
> If all of the above is correct, then why not do something like the
> PRESENT stream? We could make the SECONDARY stream of decimal columns
> optional: if every scale in a column is the same as the scale in the
> footer, we can simply omit the stream. A sketch of the writer-side check
> follows.
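>
> For example (the class and method names here are hypothetical, just to
> show the check I have in mind):
>
> import java.math.BigDecimal;
> import java.util.Arrays;
> import java.util.List;
>
> final class SecondarySuppressionSketch {
>   // Returns true when every value already carries the footer scale, in
>   // which case the SECONDARY stream could be dropped, the same way
>   // PRESENT is dropped for columns without nulls.
>   static boolean canSuppressSecondary(List<BigDecimal> column, int footerScale) {
>     for (BigDecimal value : column) {
>       if (value.scale() != footerScale) {
>         return false;
>       }
>     }
>     return true;
>   }
>
>   public static void main(String[] args) {
>     List<BigDecimal> uniform = Arrays.asList(new BigDecimal("1.20"), new BigDecimal("3.45"));
>     List<BigDecimal> mixed = Arrays.asList(new BigDecimal("1.2"), new BigDecimal("3.45"));
>     System.out.println(canSuppressSecondary(uniform, 2)); // true
>     System.out.println(canSuppressSecondary(mixed, 2));   // false
>   }
> }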
>
> Also, I think we can save more space by writing delta scales in the
> SECONDARY stream, i.e. writing (actualScale - scaleInFileFooter) instead
> of actualScale, so the common case becomes a run of zeros that RLE
> compresses very well. But this may break backward compatibility. A tiny
> illustration follows.
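>
> A tiny, self-contained illustration of the delta idea (pure arithmetic,
> no ORC API):
>
> final class DeltaScaleSketch {
>   public static void main(String[] args) {
>     int footerScale = 2;
>     int[] actualScales = {2, 2, 2, 3, 2};  // mostly the footer scale
>     for (int s : actualScales) {
>       int delta = s - footerScale;         // value written to the stream
>       int restored = delta + footerScale;  // reader inverts the delta
>       System.out.println(delta + " -> " + restored);
>     }
>   }
> }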
>
> Any reply is welcome! Thanks!
>
> Best,
> Gang
>
>
