You could also let Arrow handle these cases by hashing a serialized RecordBatch of your data selection. I'm not sure whether it yields the exact same byte sequence in every implementation, though. I would expect so, but I haven't tested it yet.
Best,
Marnix

On Sat, Dec 4, 2021 at 7:02 AM Jorge Cardoso Leitão <[email protected]> wrote:

> AFAIK hashing in this context needs to be done on a slot by slot basis,
> just like array equality, as any item on a null slot has a value on the
> buffer that is undetermined.
>
> E.g. the layout of a primitive array [1, 2, None, 4] is two buffer
> regions:
> * [1, 2, ?, 4] and
> * [true, true, false, true] (in bitmap)
>
> The question mark can be any number. Hashing needs to skip the "?", which
> is achieved by iterating over [(1, true), (2, true), (?, false), (4, true)]
> and using a unique hash for the false case (representing the None)
>
> Best,
> Jorge
>
> On Sat, Dec 4, 2021 at 6:26 AM Weston Pace <[email protected]> wrote:
>
>> One possibility could be to calculate the hash of the logical data
>> when serializing and then put the hash in the metadata.
>>
>> > I'm not even sure this can actually happen ... After all buffers should
>> > only carry primitive types (not some complex structs) and they all seem to
>> > be 16/32/64/128 bit long and should produce "gapless" buffers.
>>
>> Arrow buffers are aligned on 8 or 64 byte boundaries and there is a
>> preference to align on 64 byte boundaries. So I think gaps/padding is
>> a real possibility.
>>
>> On Fri, Dec 3, 2021 at 3:05 PM Sergii Mikhtoniuk <[email protected]>
>> wrote:
>> >
>> > Apologies for the confusion, I was using wrong terminology. When I was
>> > talking about "array chunks" I meant Buffers - contiguous slices of memory
>> > with nullability, offsets, or value data.
>> >
>> > If Arrow is not explicit about Buffers having to be memset to zero
>> > before use - whenever the size of the value is not a multiple of its
>> > alignment we would have garbage in between, messing up the stability of a
>> > buffer-wise hash.
>> >
>> > I'm not even sure this can actually happen ... After all buffers should
>> > only carry primitive types (not some complex structs) and they all seem to
>> > be 16/32/64/128 bit long and should produce "gapless" buffers.
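
To make the slot-by-slot idea from Jorge's message concrete, here is a minimal sketch in plain Python (standard library only, no Arrow involved). It is an illustration under assumptions, not how any Arrow implementation hashes arrays: the function names, the tag bytes, the choice of SHA-256, and the little-endian int64 encoding are all made up for the example, and the validity bitmap is modelled as a list of booleans rather than a packed bitmap.

```python
import hashlib
import struct

# Hypothetical tag bytes: one marks a null slot, the other precedes a valid value.
NULL_TAG = b"\x00"
VALID_TAG = b"\x01"


def hash_slots(values, validity):
    """Hash an int64 array slot by slot, never reading values under null slots."""
    if len(values) != len(validity):
        raise ValueError("values and validity must have the same length")
    h = hashlib.sha256()
    # Hash the length first so arrays of different lengths differ up front.
    h.update(struct.pack("<q", len(values)))
    for value, is_valid in zip(values, validity):
        if is_valid:
            h.update(VALID_TAG)
            h.update(struct.pack("<q", value))
        else:
            # The "?" in the value buffer is skipped; only the null marker is hashed.
            h.update(NULL_TAG)
    return h.hexdigest()


def hash_buffers(values, validity):
    """Naive buffer-wise hash: covers every value byte, including null slots."""
    h = hashlib.sha256()
    h.update(struct.pack(f"<{len(values)}q", *values))
    h.update(bytes(validity))  # one byte per slot, a simplification of the bitmap
    return h.hexdigest()


# The same logical array [1, 2, None, 4] with different garbage in the null slot.
validity = [True, True, False, True]
a = [1, 2, 0, 4]
b = [1, 2, 99, 4]

print(hash_slots(a, validity) == hash_slots(b, validity))      # slot-wise hashes agree
print(hash_buffers(a, validity) == hash_buffers(b, validity))  # buffer-wise hashes do not
```

The buffer-wise variant is sensitive to whatever happens to sit under a null slot, which is the same instability that alignment padding would cause. The slot-wise variant only depends on the logical values and their validity.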
