Hi, I think that unfortunately Parquet is underdetermined here. For example, in the RLE/bit-packed hybrid encoding, whether to emit an RLE run or a bit-packed run is left for implementations to decide: one implementation may use only bit-packed runs, while another may use a combination of both. This leads to different binaries with equal semantics. The hybrid encoding is commonly used to store validities (definition levels) and repetition levels, so you are likely to hit it in normal uses of Parquet.
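To make this concrete, here is a toy sketch (plain Python, not taken from any Parquet library; the helper names are mine) that produces two spec-conformant byte sequences for the same eight definition levels:

```python
# Minimal sketch (not a full Parquet codec): two valid RLE/bit-packed hybrid
# encodings of the same run of eight 1-bit values (e.g. definition levels).
# Header layout follows the Parquet format spec: a ULEB128 varint whose low
# bit selects the run type (0 = RLE, 1 = bit-packed).

def uleb128(n):
    """Unsigned LEB128 varint, as used for the hybrid run headers."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def rle_run(value, count, bit_width):
    """RLE run: header = count << 1, then the value in ceil(bit_width/8) bytes."""
    width_bytes = (bit_width + 7) // 8
    return uleb128(count << 1) + value.to_bytes(width_bytes, "little")

def bit_packed_run(values, bit_width):
    """Bit-packed run: header = (len/8) << 1 | 1, values packed LSB-first."""
    assert len(values) % 8 == 0
    header = uleb128((len(values) // 8) << 1 | 1)
    buf, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc |= v << nbits
        nbits += bit_width
        while nbits >= 8:
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    return header + bytes(buf)

levels = [1] * 8
print(rle_run(1, 8, bit_width=1).hex())          # '1001'
print(bit_packed_run(levels, bit_width=1).hex()) # '03ff'
```

Both runs decode to the same eight levels, so a reader sees identical data, but a byte-level hash of the page differs.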
Best,
Jorge

On Sat, Dec 4, 2021 at 9:44 PM Sergii Mikhtoniuk <[email protected]> wrote:

> Thanks everyone for the suggestions,
>
> Hashing a _serialized_ representation is what I'm doing now (CSV, yuck),
> but no format seems strict enough to guarantee hash stability - my
> hashes may differ depending on the implementation, or even drift with a
> new version of the same library.
>
> Arrow standardizes the exact memory layout of data, making it already more
> strict than any serialization format, so that's why I was hoping to use it
> directly (which looks very doable, but a bit more work than I hoped).
>
> Another possibility I've been asking the Parquet community about is
> whether I can achieve stable hashing with it. Assuming I:
> - Sort all records
> - Disable statistics
> - Write all data in a single row group
> ... would that yield a reproducible binary representation?
>
> The Parquet spec is very vague on this. From what I could gather it doesn't
> produce value gaps [1, 2, ?, 4] for nullable data, but I could not confirm
> whether there are any other sources of non-determinism in there.
>
> Cheers,
> Sergii
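For what it's worth, here is a minimal pyarrow sketch of the setup described in the quoted question (the writer options named here are pyarrow's, i.e. an assumption about one particular implementation; it can only illustrate the knobs involved, not prove cross-implementation stability):

```python
# Sketch only: write the same (already-sorted) table twice with statistics
# disabled and a single row group, then hash the resulting file bytes.
import hashlib
import pyarrow as pa
import pyarrow.parquet as pq

# Already-sorted records with a nullable column.
table = pa.table({"k": [1, 2, 3, 4], "v": ["a", None, "c", "d"]})

def write_and_hash(path):
    pq.write_table(
        table,
        path,
        write_statistics=False,       # disable statistics
        row_group_size=len(table),    # single row group
        use_dictionary=False,
        compression="NONE",
    )
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Typically equal for repeated writes with one library version, but another
# implementation (or a newer release) may legally choose different run types
# in the RLE/bit-packed hybrid encoding for the definition levels of "v",
# so the byte-level hash is not guaranteed to stay stable.
print(write_and_hash("/tmp/a.parquet") == write_and_hash("/tmp/b.parquet"))
```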
