Hi, I'm Gang Wu, an inactive Spark contributor, and I'm now curious about the design of the decimal type in ORC. From the documentation and the Java code, my understanding is as follows; please correct me if I'm wrong:
1. A decimal value is at most 127 bits long for the magnitude, plus 1 bit for the sign, so at most 128 bits in total.
2. Although the precision and scale of a decimal column are stored in the file footer, the writer still writes the scale of every element into the SECONDARY stream using signed integer RLE. The scale written there is independent of the scale stored in the file footer: it may be the same, or completely different.

If both statements are correct, then why not do something like the PRESENT stream? We could make the SECONDARY stream of decimal columns optional: if every scale in a column equals the scale in the footer, we can simply omit the stream. We could also save more space by writing delta scales in the SECONDARY stream, i.e. writing (actualScale - scaleInFileFooter) instead of actualScale, though that may break backward compatibility. A rough sketch of both ideas follows below.
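To make the proposal concrete, here is a minimal standalone sketch in plain Java. This is not actual ORC writer code; the names secondaryStreamNeeded and deltaScales are made up for illustration, and real integration would happen inside the decimal tree writer:

    import java.util.List;

    // Hypothetical sketch: how a writer might decide whether the SECONDARY
    // stream can be omitted, and what it would emit under delta-scale encoding.
    public class DecimalScaleSketch {

        // If every value's scale matches the footer scale, the stream carries
        // no information and (under the proposal) could be dropped, analogous
        // to how an all-true PRESENT stream is suppressed.
        static boolean secondaryStreamNeeded(List<Integer> actualScales, int footerScale) {
            for (int scale : actualScales) {
                if (scale != footerScale) {
                    return true;
                }
            }
            return false;
        }

        // Delta-scale encoding: write (actualScale - footerScale) per value.
        // Mostly-zero deltas compress very well under signed integer RLE.
        static int[] deltaScales(List<Integer> actualScales, int footerScale) {
            int[] deltas = new int[actualScales.size()];
            for (int i = 0; i < deltas.length; i++) {
                deltas[i] = actualScales.get(i) - footerScale;
            }
            return deltas;
        }

        public static void main(String[] args) {
            List<Integer> scales = List.of(2, 2, 2, 3, 2);
            int footerScale = 2;
            if (secondaryStreamNeeded(scales, footerScale)) {
                // Prints [0, 0, 0, 1, 0]
                System.out.println(java.util.Arrays.toString(deltaScales(scales, footerScale)));
            }
        }
    }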
Any reply is welcome! Thanks!

Best,
Gang