> If all the above statements are correct, then why not do something like the 
> present stream? We can make the secondary stream of decimal columns optional. 
> If all the scales of a column are the same as the scale in the footer, then 
> we can just ignore it. 

I think that is a valid case for suppressing that stream, since most people 
using Decimal in SQL form will specify a consistent decimal size across all 
values.
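
For illustration, a minimal writer-side check might look like the sketch below. 
This is my own hypothetical code, not the actual ORC writer API; it only shows 
the condition under which the SECONDARY stream could be dropped.

    import java.math.BigDecimal;
    import java.util.List;

    // Hypothetical sketch of the suppression condition, not the real ORC writer:
    // if every value's scale matches the scale declared in the footer, the
    // SECONDARY stream carries no information and can be omitted.
    public class ScaleStreamCheck {
        static boolean canSuppressScaleStream(List<BigDecimal> values, int footerScale) {
            for (BigDecimal v : values) {
                if (v.scale() != footerScale) {
                    return false;   // at least one value deviates; keep the stream
                }
            }
            return true;            // all scales match the declared scale
        }

        public static void main(String[] args) {
            List<BigDecimal> batch = List.of(new BigDecimal("1.0000000000"),
                                             new BigDecimal("2.0000000000"));
            System.out.println(canSuppressScaleStream(batch, 10));  // prints: true
        }
    }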

> Also, I think we can save more space by writing delta scales in the 
> secondary stream, meaning that we can write (actualScale - scaleInFileFooter) 
> instead of actualScale.

Integer columns should compress well without having to manually apply a delta 
encoding there.

I was under the impression that the current scale folds neatly into the integer 
encoding for RLE.
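
To make that concrete, here is a toy run-length model (my own simplified 
illustration, not ORC's actual RLEv2): whether you store the raw scale or the 
delta against the footer scale, a constant column collapses to a single 
(value, run-length) pair.

    import java.util.ArrayList;
    import java.util.List;

    // Toy run-length model, not ORC's RLEv2: a constant run of scales (or of
    // delta scales, which is just a constant run of zeros) collapses to one
    // (value, runLength) pair, which is why SECONDARY stays tiny.
    public class ConstantScaleRun {
        static List<long[]> runLengthEncode(long[] values) {
            List<long[]> runs = new ArrayList<>();
            int i = 0;
            while (i < values.length) {
                int j = i;
                while (j < values.length && values[j] == values[i]) {
                    j++;
                }
                runs.add(new long[]{values[i], j - i});   // (value, run length)
                i = j;
            }
            return runs;
        }

        public static void main(String[] args) {
            long[] scales = new long[1_920_800];
            java.util.Arrays.fill(scales, 10);            // every value has scale 10
            List<long[]> runs = runLengthEncode(scales);
            System.out.println(runs.size());                              // 1
            System.out.println(runs.get(0)[0] + " x " + runs.get(0)[1]);  // 10 x 1920800
        }
    }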

In my experiment, 1.9M values became 102 bytes, which is not nothing - but is 
very small.

    Column 1: count: 1920800 hasNull: false min: 1 max: 1920800 sum: 1844737280400
    Column 2: count: 1920800 hasNull: false min: 1 max: 1920800 sum: 1844737280400
    …
    Stream: column 1 section DATA start: 7210 length 3178854
    Stream: column 1 section SECONDARY start: 3186064 length 102
    Stream: column 2 section DATA start: 3186166 length 5056
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2

This is a Decimal(28,10), Bigint table. The part that matters is the DATA size 
there, considering I inserted 1.9M sequential integers into it.

The same data as bigint is only 5056 bytes.
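
Putting those stream sizes on a per-value basis (just arithmetic over the 
numbers in the dump above, nothing ORC-specific):

    // Per-value arithmetic over the stream sizes reported in the dump above.
    public class StreamSizePerValue {
        public static void main(String[] args) {
            long rows = 1_920_800L;
            long decimalData   = 3_178_854L;  // column 1 DATA (decimal(28,10))
            long decimalScales = 102L;        // column 1 SECONDARY (scales)
            long bigintData    = 5_056L;      // column 2 DATA (bigint)

            System.out.printf("decimal DATA:      %.3f bytes/value%n", (double) decimalData / rows);
            System.out.printf("decimal SECONDARY: %.5f bytes/value%n", (double) decimalScales / rows);
            System.out.printf("bigint DATA:       %.5f bytes/value%n", (double) bigintData / rows);
            System.out.printf("decimal/bigint DATA ratio: %.0fx%n", (double) decimalData / bigintData);
        }
    }

That works out to roughly 1.65 bytes per decimal value versus about 0.003 bytes 
per value for the same integers stored as bigint, a gap of around 630x, which is 
the part a better DATA encoding has to close.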

The SECONDARY stream seems to compress pretty tightly; however, as you mention, 
it is completely unnecessary and can be suppressed when the scale is the same 
for all values.

A better decimal encoding is badly needed. To get the best out of this, the 
breaking change should also tackle DATA - experiments and ideas would be 
appreciated.
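
Not proposing a specific design here, but as one illustration of where the gap 
comes from (my own assumption, not an existing ORC encoding): with a fixed 
scale, each value could travel as its unscaled integer and go through the same 
integer RLE path that got the bigint column down to ~5 KB.

    import java.math.BigDecimal;

    // Illustrative only: with a fixed scale, 1.0000000000 at scale 10 is just the
    // unscaled long 10000000000, so sequential decimals look like sequential longs
    // to an integer RLE. This only works while the unscaled value fits in 64 bits;
    // a decimal(28,10) in general needs a wider fallback.
    public class UnscaledSketch {
        static long toUnscaledLong(BigDecimal value, int footerScale) {
            return value.setScale(footerScale).unscaledValue().longValueExact();
        }

        public static void main(String[] args) {
            System.out.println(toUnscaledLong(new BigDecimal("1"), 10));  // 10000000000
            System.out.println(toUnscaledLong(new BigDecimal("2"), 10));  // 20000000000
        }
    }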

Cheers,
Gopal

