Thanks everyone for the suggestions. Hashing a _serialized_ representation is what I'm doing now (CSV, yuck), but it seems no format is strict enough to guarantee hash stability - my hashes may differ depending on the implementation, or even drift with a new version of the same library.
Arrow standardizes the exact memory layout of the data, which already makes it stricter than any serialization format - that's why I was hoping to hash it directly (which looks very doable, but a bit more work than I had hoped; rough sketch below).

Another possibility I've been asking about in the Parquet community is whether I can achieve stable hashing there. Assuming I:

- Sort all records
- Disable statistics
- Write all data in a single row group

... would that yield a reproducible binary representation? The Parquet spec is very vague on this. From what I could gather it doesn't produce value gaps [1, 2, ?, 4] for nullable data, but I could not confirm whether there are any other sources of non-determinism in there.
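For reference, here is roughly what I have in mind for the Arrow route - a minimal sketch assuming pyarrow, hashing the raw buffers behind each column. One caveat I already see: the bytes behind null slots and any padding are not pinned down by the spec, so a real version would probably have to canonicalize those regions (and handle sliced arrays, whose buffers carry an offset) before hashing.

import hashlib

import pyarrow as pa


def hash_table(table: pa.Table) -> str:
    digest = hashlib.sha256()
    # Collapse chunking first so the buffer layout does not depend on how
    # the table happened to be assembled.
    table = table.combine_chunks()
    for batch in table.to_batches():
        for column in batch.columns:
            # buffers() yields validity bitmaps, offsets and data buffers;
            # absent buffers come back as None.
            for buf in column.buffers():
                if buf is not None:
                    digest.update(buf.to_pybytes())
    return digest.hexdigest()


t = pa.table({"id": [1, 2, None, 4], "name": ["a", "b", None, "d"]})
print(hash_table(t))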
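And the Parquet variant from the list above, again just a sketch assuming pyarrow.parquet: sort, disable statistics, force a single row group and no compression, then hash the resulting bytes. Even then, as far as I can tell the footer's created_by field embeds the writer name/version, and encoding and page-splitting choices are writer decisions rather than spec guarantees, so I'm not sure byte-identical output across libraries is achievable.

import hashlib
import io

import pyarrow as pa
import pyarrow.parquet as pq


def hash_parquet_bytes(table: pa.Table, sort_keys) -> str:
    table = table.sort_by(sort_keys)
    sink = io.BytesIO()
    pq.write_table(
        table,
        sink,
        write_statistics=False,          # drop per-column statistics
        row_group_size=table.num_rows,   # everything in one row group
        compression="NONE",              # avoid codec/version differences
    )
    return hashlib.sha256(sink.getvalue()).hexdigest()


t = pa.table({"id": [3, 1, None, 2], "name": ["c", "a", None, "b"]})
print(hash_parquet_bytes(t, [("id", "ascending")]))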
Cheers,
Sergii