Hi, I think that unfortunately Parquet is underdetermined here. For example, in the RLE/bit-packed hybrid encoding, whether to emit an RLE run or a bit-packed run is left for implementations to decide: one implementation may use only bit-packed runs, while another may use a combination of both. This leads to different binaries with equal semantics. The hybrid encoding is commonly used to store validities (definition levels) and repetition levels, so you are likely to hit it in normal uses of Parquet.
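To make this concrete, here is a toy sketch (plain Python, not taken from any Parquet library; the helper names are mine) that produces two spec-conformant byte sequences for the same eight definition levels:

```python
# Minimal sketch (not a full Parquet codec): two valid RLE/bit-packed hybrid
# encodings of the same run of eight 1-bit values (e.g. definition levels).
# Header layout follows the Parquet format spec: a ULEB128 varint whose low
# bit selects the run type (0 = RLE, 1 = bit-packed).

def uleb128(n):
    """Unsigned LEB128 varint, as used for the hybrid run headers."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def rle_run(value, count, bit_width):
    """RLE run: header = count << 1, then the value in ceil(bit_width/8) bytes."""
    width_bytes = (bit_width + 7) // 8
    return uleb128(count << 1) + value.to_bytes(width_bytes, "little")

def bit_packed_run(values, bit_width):
    """Bit-packed run: header = (len/8) << 1 | 1, values packed LSB-first."""
    assert len(values) % 8 == 0
    header = uleb128((len(values) // 8) << 1 | 1)
    buf, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc |= v << nbits
        nbits += bit_width
        while nbits >= 8:
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    return header + bytes(buf)

levels = [1] * 8
print(rle_run(1, 8, bit_width=1).hex())          # '1001'
print(bit_packed_run(levels, bit_width=1).hex()) # '03ff'
```

Both runs decode to the same eight levels, so a reader sees identical data, but a byte-level hash of the page differs.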
Best,
Jorge

On Sat, Dec 4, 2021 at 9:44 PM Sergii Mikhtoniuk <[email protected]> wrote:

> Thanks everyone for the suggestions,
>
> Hashing a _serialized_ representation is what I'm doing now (CSV, yuck),
> but no format seems strict enough to guarantee hash stability - my
> hashes may differ depending on the implementation, or even drift with a
> new version of the same library.
>
> Arrow standardizes the exact memory layout of data, making it already more
> strict than any serialization format, so that's why I was hoping to use it
> directly (which looks very doable, but a bit more work than I hoped).
>
> Another possibility I've been asking the Parquet community about is
> whether I can achieve stable hashing with it. Assuming I:
> - Sort all records
> - Disable statistics
> - Write all data in a single row group
> ... would that yield a reproducible binary representation?
>
> The Parquet spec is very vague on this. From what I could gather it doesn't
> produce value gaps [1, 2, ?, 4] for nullable data, but I could not confirm
> whether there are any other sources of non-determinism in there.
>
> Cheers,
> Sergii
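For what it's worth, here is a minimal pyarrow sketch of the setup described in the quoted question (the writer options named here are pyarrow's, i.e. an assumption about one particular implementation; it can only illustrate the knobs involved, not prove cross-implementation stability):

```python
# Sketch only: write the same (already-sorted) table twice with statistics
# disabled and a single row group, then hash the resulting file bytes.
import hashlib
import pyarrow as pa
import pyarrow.parquet as pq

# Already-sorted records with a nullable column.
table = pa.table({"k": [1, 2, 3, 4], "v": ["a", None, "c", "d"]})

def write_and_hash(path):
    pq.write_table(
        table,
        path,
        write_statistics=False,       # disable statistics
        row_group_size=len(table),    # single row group
        use_dictionary=False,
        compression="NONE",
    )
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Typically equal for repeated writes with one library version, but another
# implementation (or a newer release) may legally choose different run types
# in the RLE/bit-packed hybrid encoding for the definition levels of "v",
# so the byte-level hash is not guaranteed to stay stable.
print(write_and_hash("/tmp/a.parquet") == write_and_hash("/tmp/b.parquet"))
```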
