Thanks everyone for the suggestions,

Hashing a _serialized_ representation is what I'm doing now (CSV, yuck),
but it seems no format is strict enough to guarantee hash stability - my
hashes may differ between implementations or even drift with a new
version of the same library.
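
To make that concrete, the current approach is roughly along these lines
(a simplified Python sketch, not my actual code):

    import csv
    import hashlib
    import io

    def hash_rows(rows):
        # Serialize rows to CSV in memory and hash the resulting bytes.
        # Fragile: float formatting, quoting rules, null representation
        # and line endings all depend on the writer, so the hash can
        # change between implementations or library versions.
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerows(rows)
        return hashlib.sha256(buf.getvalue().encode("utf-8")).hexdigest()

    print(hash_rows([(1, "a"), (2, None), (3, 0.1 + 0.2)]))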

Arrow standardizes the exact memory layout of the data, making it already
stricter than any serialization format, which is why I was hoping to use it
directly (that looks very doable, just a bit more work than I hoped).
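
What I have in mind for hashing Arrow directly is something like the sketch
below (pyarrow; the buffer handling is simplified - validity bitmaps can
carry undefined bits past the array length and sliced arrays share buffers
with their parent, so those cases would still need care):

    import hashlib
    import pyarrow as pa

    def hash_record_batch(batch: pa.RecordBatch) -> str:
        # Hash the serialized schema plus the raw Arrow buffers of every
        # column. Sketch only: arrays may need to be compacted first so
        # that equal data is guaranteed to produce equal buffers.
        h = hashlib.sha256()
        h.update(memoryview(batch.schema.serialize()))
        for column in batch.columns:
            for buf in column.buffers():  # validity / offsets / values
                if buf is not None:
                    h.update(memoryview(buf))
        return h.hexdigest()

    batch = pa.RecordBatch.from_pydict(
        {"id": [1, 2, None, 4], "name": ["a", "b", None, "d"]}
    )
    print(hash_record_batch(batch))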

Another possibility I've been asking the Parquet community about is whether
I can achieve stable hashing with Parquet (I sketch the writer settings I
have in mind below the list). Assuming I:
- Sort all records
- Disable statistics
- Write all data in a single row group
... would that yield a reproducible binary representation?
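
In pyarrow terms I'm picturing roughly the following. The parameter choices
here are my own assumptions about how to pin things down, not something the
spec promises; in particular the footer still embeds a created_by string
with the writer version, and encodings/page sizes remain writer choices:

    import hashlib
    import io
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [3, 1, None, 2], "name": ["c", "a", None, "b"]})
    table = table.sort_by([("id", "ascending")])  # sort all records first

    sink = io.BytesIO()
    pq.write_table(
        table,
        sink,
        row_group_size=table.num_rows,  # everything in a single row group
        write_statistics=False,         # no per-column statistics
        compression="none",             # avoid codec-dependent bytes
        use_dictionary=False,           # avoid encoder-dependent dictionary pages
    )

    print(hashlib.sha256(sink.getvalue()).hexdigest())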

The Parquet spec is very vague on this. From what I could gather it doesn't
produce value gaps [1, 2, ?, 4] for nullable data, but I could not confirm
whether there are any other sources of non-determinism in the format.

Cheers,
Sergii
