Hi

I'm Gang Wu, an inactive Spark contributor, and I'm curious about the
design of the decimal type in ORC. From the documentation and the Java
code, my understanding is as follows; please correct me if I'm wrong:

1. A decimal value is at most 127 bits long, plus 1 bit for the sign,
so at most 128 bits in total;
2. Although the precision and scale of a decimal column are stored in
the file footer, the scale of every element is still written to the
SECONDARY stream using signed-integer RLE. The per-element scale is
independent of the scale in the file footer: it may be the same as the
footer scale or entirely different (see the sketch after this list).
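
To make sure I read it correctly, here is a minimal sketch (plain
Java, not the actual ORC reader code) of how a reader would put the
two streams back together, assuming the unscaled value has already
been decoded from the value stream:

    import java.math.BigDecimal;
    import java.math.BigInteger;

    public class DecimalRebuild {
        // `unscaled` comes from the value stream; `scale` is the
        // per-element value read from the SECONDARY stream via
        // signed-integer RLE.
        static BigDecimal rebuild(BigInteger unscaled, int scale) {
            return new BigDecimal(unscaled, scale);
        }

        public static void main(String[] args) {
            // unscaled = 12345, scale = 2  ->  123.45
            System.out.println(rebuild(BigInteger.valueOf(12345), 2));
        }
    }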

If all the above statements are correct, why not treat the SECONDARY
stream the way the PRESENT stream is treated? We could make the
SECONDARY stream of decimal columns optional: if every scale in a
column equals the scale in the footer, the stream can simply be
omitted.
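
Roughly, the writer-side check could look like this (the names here
are made up for illustration, not an actual ORC API):

    import java.math.BigDecimal;
    import java.util.List;

    public class ScaleCheck {
        // If every element's scale equals the footer scale, the
        // SECONDARY stream carries no information and could be
        // suppressed, like PRESENT for a column without nulls.
        static boolean canSuppressSecondary(List<BigDecimal> batch,
                                            int footerScale) {
            for (BigDecimal value : batch) {
                if (value.scale() != footerScale) {
                    return false;  // this element needs its own scale
                }
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(canSuppressSecondary(
                List.of(new BigDecimal("1.23"), new BigDecimal("4.56")), 2));
            // prints true: both scales are 2
        }
    }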

Also, I think we could save more space by writing delta scales in the
SECONDARY stream, i.e. writing (actualScale - scaleInFileFooter)
instead of actualScale. But this might break backward compatibility.
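
The arithmetic I have in mind is just this (hypothetical, not the
current spec):

    public class DeltaScale {
        // Writer: store the difference from the footer scale. Scales
        // clustered at the footer scale become long runs of zeros,
        // which signed-integer RLE compresses very well.
        static int encodeScale(int actualScale, int footerScale) {
            return actualScale - footerScale;
        }

        // Reader: reverse the subtraction.
        static int decodeScale(int delta, int footerScale) {
            return footerScale + delta;
        }

        public static void main(String[] args) {
            int footerScale = 2;
            System.out.println(encodeScale(2, footerScale));  // 0, the common case
            System.out.println(decodeScale(0, footerScale));  // 2
        }
    }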

Any reply is welcome! Thanks!

Best,
Gang
