> If all the above statements are correct, then why not do something like the
> present stream? We can make the secondary stream of decimal columns optional.
> If all the scales of a column are same as the scale in the footer, then we
> can just ignore it.
I think that is a valid case for suppressing that stream, since most people
using Decimal in SQL form will specify a consistent scale across all
values.
> Also, I think we can save more spaces by writing delta scales in the
> secondary stream, meaning that we can write (actualScale - scaleInFileFooter)
> instead of actualScale.
Integer streams should compress well without us having to manually apply a
delta encoding on top.
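For concreteness, the delta idea from the quote could look roughly like this
(a minimal sketch; the function names and footer_scale parameter are
illustrative, not ORC's actual writer API):

```python
# Hypothetical sketch of the proposed delta-scale secondary stream:
# write (actualScale - scaleInFileFooter) instead of actualScale.
# encode_scales/decode_scales and footer_scale are made-up names.

def encode_scales(actual_scales, footer_scale):
    """Turn per-value scales into deltas against the footer scale."""
    return [s - footer_scale for s in actual_scales]

def decode_scales(deltas, footer_scale):
    """Recover the per-value scales from the deltas."""
    return [d + footer_scale for d in deltas]

scales = [10, 10, 10, 12, 10]          # mostly equal to the footer scale
deltas = encode_scales(scales, footer_scale=10)
# deltas is mostly zeros, which any RLE will collapse to almost nothing
assert decode_scales(deltas, footer_scale=10) == scales
```

The point being that the deltas are mostly zero in the common case, so the
stream RLEs down to a handful of bytes either way.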
I was under the impression that the current scale folds neatly into the integer
encoding for RLE.
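That impression is easy to check with a toy run-length encoder: a constant
scale column collapses to a single (value, run) pair, independent of the row
count. This is a sketch to illustrate the principle, not ORC's actual RLEv2
(which caps run lengths, hence the 102 bytes below rather than a handful):

```python
def rle(values):
    """Toy run-length encoding: a list of [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([v, 1])     # start a new run
    return runs

# 1.9M identical scale values reduce to one run in this toy model.
assert rle([10] * 1_920_800) == [[10, 1_920_800]]
```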
In my experiment, 1.9M values became 102 bytes, which is not nothing - but is
very small.
Column 1: count: 1920800 hasNull: false min: 1 max: 1920800 sum: 1844737280400
Column 2: count: 1920800 hasNull: false min: 1 max: 1920800 sum: 1844737280400
…
Stream: column 1 section DATA start: 7210 length 3178854
Stream: column 1 section SECONDARY start: 3186064 length 102
Stream: column 2 section DATA start: 3186166 length 5056
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DIRECT_V2
This is a Decimal(28,10), Bigint table. The part that matters is the DATA
size there, considering I inserted 1.9M sequential integers into this.
The same data as bigint is only 5056 bytes.
The SECONDARY stream seems to compress pretty tightly; however, as you
mention, it is completely unnecessary & can be suppressed when the scale is
the same throughout.
A better decimal encoding is badly needed. To get the best out of this, the
breaking change should also tackle DATA - experiments and ideas would be
appreciated.
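One direction worth measuring (an assumption on my part, not a concrete
proposal): the current DATA stream writes each unscaled decimal as an
unbounded zigzag varint, so sequential values cost 3-4 bytes apiece; routing
values that fit in 64 bits through the same delta-RLE path the bigint column
uses is presumably what gets DATA from MBs down to the 5056-byte range seen
above. A rough cost model of the varint side:

```python
def varint_size(v):
    """Rough model: bytes for a zigzag + LEB128-style varint of a 64-bit int."""
    v = (v << 1) ^ (v >> 63)   # zigzag so negative values stay small
    n = 1
    while v >= 0x80:           # 7 payload bits per byte
        v >>= 7
        n += 1
    return n

# Sequential unscaled values 1..1.9M cost several MB as per-value varints
# (before general-purpose compression), which matches the shape of the
# multi-MB DATA stream above; a delta-RLE describes the same run in a
# few bytes per chunk.
total = sum(varint_size(v) for v in range(1, 1_920_801))
assert total > 3_000_000
```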
Cheers,
Gopal