Gang,

When decimal was first introduced in Hive, it was infinite precision, so ORC had to support that. You should look at the discussion on https://issues.apache.org/jira/browse/ORC-161, but you are absolutely right that we should create a new encoding for decimal that doesn't encode the scale. We should also use RLE for the values.
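To sketch the idea in plain Java (this is not the ORC writer API, the names are made up, and it assumes the precision is small enough that the unscaled value fits in a long): each value gets normalized to the footer scale and only its unscaled integer is written, so no scale appears in the data streams and the values can go straight through an integer RLE.

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.List;

// Hypothetical sketch, not ORC code: a fixed-scale decimal encoding where the
// scale lives only in the file footer and each value is stored as an unscaled
// integer, ready for the integer RLE encoder. Assumes precision <= 18 so the
// unscaled value fits in a long.
public class FixedScaleDecimalSketch {

  // Writer side: normalize every value to the footer scale and keep only the
  // unscaled integer. No secondary (scale) stream is produced.
  static long[] encode(List<BigDecimal> column, int footerScale) {
    long[] unscaled = new long[column.size()];
    for (int i = 0; i < column.size(); i++) {
      // setScale with UNNECESSARY throws if the value doesn't fit the
      // declared scale exactly, i.e. it would need rounding.
      BigDecimal v = column.get(i).setScale(footerScale, RoundingMode.UNNECESSARY);
      unscaled[i] = v.unscaledValue().longValueExact();
    }
    return unscaled;
  }

  // Reader side: rebuild each decimal from its unscaled integer plus the
  // single scale taken from the footer.
  static BigDecimal decode(long unscaled, int footerScale) {
    return BigDecimal.valueOf(unscaled, footerScale);
  }

  public static void main(String[] args) {
    List<BigDecimal> column = List.of(
        new BigDecimal("12.30"), new BigDecimal("0.05"), new BigDecimal("-7.00"));
    for (long u : encode(column, 2)) {          // 1230, 5, -700
      System.out.println(u + " -> " + decode(u, 2));
    }
  }
}

With a fixed scale the unscaled values are just integers, so the existing integer RLE can be reused for them directly.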
.. Owen

On Thu, Mar 16, 2017 at 10:17 PM, Wu Gang <[email protected]> wrote:
> Hi,
>
> I'm Gang Wu, an inactive Spark contributor, and I'm now curious about the
> design of the decimal type in ORC. From the documentation and Java code, I
> think it works as follows (correct me if I'm wrong):
>
> 1. A decimal value is at most 127 bits long, plus 1 bit for the sign, so at
> most 128 bits in total;
> 2. Although the precision and scale of a decimal column are stored in the
> file footer, we still write the scale of every element in the secondary
> stream using signed integer RLE. The scale written there is independent of
> the scale stored in the file footer: it may be the same or completely
> different.
>
> If all of the above is correct, why not do something like the present
> stream? We could make the secondary stream of decimal columns optional: if
> all the scales of a column are the same as the scale in the footer, we can
> simply omit it.
>
> Also, I think we could save more space by writing delta scales in the
> secondary stream, i.e. (actualScale - scaleInFileFooter) instead of
> actualScale. But this may break backward compatibility.
>
> Any reply is welcome! Thanks!
>
> Best,
> Gang
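The two proposals in the quoted message could look roughly like this (a plain-Java sketch with hypothetical names, not ORC internals): skip the secondary stream when every scale already equals the footer scale, and otherwise write deltas against the footer scale.

import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not ORC internals: (1) detect when the secondary
// (scale) stream could be omitted, and (2) compute delta scales relative to
// the scale stored in the file footer.
public class DeltaScaleSketch {

  // True when the secondary stream could simply be omitted for this column.
  static boolean scalesMatchFooter(List<BigDecimal> column, int footerScale) {
    return column.stream().allMatch(v -> v.scale() == footerScale);
  }

  // Delta scales (actualScale - scaleInFileFooter); mostly zeros in practice,
  // which a signed integer RLE compresses very well.
  static int[] deltaScales(List<BigDecimal> column, int footerScale) {
    int[] deltas = new int[column.size()];
    for (int i = 0; i < column.size(); i++) {
      deltas[i] = column.get(i).scale() - footerScale;
    }
    return deltas;
  }

  public static void main(String[] args) {
    List<BigDecimal> column = List.of(
        new BigDecimal("1.50"), new BigDecimal("2.25"), new BigDecimal("3.1"));
    System.out.println(scalesMatchFooter(column, 2));            // false
    System.out.println(Arrays.toString(deltaScales(column, 2))); // [0, 0, -1]
  }
}

Since most columns would produce all-zero deltas, the stream would collapse to almost nothing under signed integer RLE, which is also what makes dropping it entirely in a new encoding attractive.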
