Hi,

I'm working on a data processing tool that guarantees reproducibility and
determinism of operations, so I frequently need to verify that one dataset
(Table) is equivalent to another.

I didn't find any functions for computing hash sums in Arrow, but I'm
wondering if anyone knows of existing implementations?

If I were to implement hashing over chunked arrays myself, does Arrow
guarantee that any padding between aligned values is zeroed out, so that
hashes are perfectly stable? A sketch of what I have in mind follows.
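
For context, here is roughly the approach I'm considering, in Python with
pyarrow (the function name is just mine, and hashing raw buffers like this
is only stable if the padding question above is answered affirmatively;
chunk boundaries would also have to match, hence the combine_chunks() call):

    import hashlib
    import pyarrow as pa

    def hash_table(table: pa.Table) -> str:
        # Normalize chunking first: chunk boundaries change the raw
        # bytes (bit-packed validity bitmaps, offset buffers, ...).
        table = table.combine_chunks()
        h = hashlib.sha256()
        for column in table.columns:
            for chunk in column.chunks:
                for buf in chunk.buffers():
                    if buf is not None:  # e.g. absent validity bitmap
                        h.update(memoryview(buf))
        return h.hexdigest()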

Bonus question: Has anyone seen hashing algorithms for tabular data that
can check for equivalence (rather than equality)? That is, I consider two
datasets equivalent if they contain the same set of records, but not
necessarily in the same order.
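
One idea I've been toying with is a multiset hash: hash every record
independently and combine the per-record digests with a commutative
operation such as addition modulo 2^256 (XOR would let duplicate records
cancel out in pairs). A rough sketch, again in Python with pyarrow, with
names of my own choosing:

    import hashlib
    import pyarrow as pa

    def equivalence_hash(table: pa.Table) -> int:
        # Order-independent (multiset) hash: hash each record on its
        # own and sum the digests modulo 2**256. Addition is commutative,
        # so record order doesn't matter, yet duplicates still count.
        total = 0
        for record in table.to_pylist():  # slow, but fine for a sketch
            payload = repr(sorted(record.items())).encode("utf-8")
            digest = hashlib.sha256(payload).digest()
            total = (total + int.from_bytes(digest, "big")) % (1 << 256)
        return total

Two tables holding the same records in a different order would then
produce the same value, but I'd be curious whether something more
principled already exists.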

Thank you!
- Sergii
