You could also let Arrow handle these cases by hashing a serialized
RecordBatch of your data selection. I'm not sure whether it yields the exact
same byte sequence in every implementation, though. I would expect so, but I
haven't tested it yet.
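
For comparison, the slot-by-slot approach Jorge describes in the quoted
message could be sketched roughly as follows. This is plain Python over raw
bytes, not an Arrow API: the function names, the choice of SHA-256, and the
one-byte null/valid markers are all assumptions made for illustration. It
only covers a little-endian int32 array with an LSB-first validity bitmap.

```python
import hashlib
import struct

# Hypothetical one-byte tags, so a null slot can never collide with a value.
NULL_MARKER = b"\x00"
VALID_MARKER = b"\x01"


def is_valid(bitmap: bytes, i: int) -> bool:
    # Arrow validity bitmaps are LSB-first: slot i is bit (i % 8) of byte i // 8.
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def hash_int32_slots(values: bytes, bitmap: bytes, length: int) -> str:
    """Hash an int32 array slot by slot, ignoring bytes under null slots."""
    h = hashlib.sha256()
    for i in range(length):
        if is_valid(bitmap, i):
            h.update(VALID_MARKER)
            h.update(values[4 * i : 4 * i + 4])
        else:
            # Whatever sits in the value buffer here is undetermined; skip it.
            h.update(NULL_MARKER)
    return h.hexdigest()


# [1, 2, None, 4] with two different "garbage" values under the null slot.
bitmap = bytes([0b00001011])  # slots 0, 1 and 3 are valid
a = struct.pack("<4i", 1, 2, 0, 4)
b = struct.pack("<4i", 1, 2, 12345, 4)

# The slot-wise hash agrees, while a naive buffer-wise hash does not.
assert hash_int32_slots(a, bitmap, 4) == hash_int32_slots(b, bitmap, 4)
assert hashlib.sha256(a).hexdigest() != hashlib.sha256(b).hexdigest()
```

Because only `length` slots are read, any alignment padding after the last
value is ignored as well.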

Best,

Marnix




On Sat, Dec 4, 2021 at 7:02 AM Jorge Cardoso Leitão <
[email protected]> wrote:

> AFAIK hashing in this context needs to be done on a slot by slot basis,
> just like array equality, as any item on a null slot has a value on the
> buffer that is undetermined.
>
> E.g. the layout of a primitive array [1, 2, None, 4] is two buffer
> regions:
> * [1, 2, ?, 4] and
> * [true, true, false, true] (in bitmap)
>
> The question mark can be any number. Hashing needs to skip the "?", which
> is achieved by iterating over [(1, true), (2, true), (?, false), (4, true)]
> and using a unique hash for the false case (representing the None)
>
> Best,
> Jorge
>
>
>
> On Sat, Dec 4, 2021 at 6:26 AM Weston Pace <[email protected]> wrote:
>
>> One possibility could be to calculate the hash of the logical data
>> when serializing and then put the hash in the metadata.
>>
>> > I'm not even sure this can actually happen ... After all buffers should
>> > only carry primitive types (not some complex structs) and they all seem to
>> > be 16/32/64/128 bit long and should produce "gapless" buffers.
>>
>> Arrow buffers are aligned on 8 or 64 byte boundaries and there is a
>> preference to align on 64 byte boundaries.  So I think gaps/padding is
>> a real possibility.
>>
>> On Fri, Dec 3, 2021 at 3:05 PM Sergii Mikhtoniuk <[email protected]>
>> wrote:
>> >
>> > Apologies for the confusion, I was using wrong terminology. When I was
>> > talking about "array chunks" I meant Buffers - contiguous slices of memory
>> > with nullability, offsets, or value data.
>> >
>> > If Arrow is not explicit about Buffers having to be memset to zero
>> > before use - whenever the size of the value is not a multiple of its
>> > alignment we would have garbage in between, messing up the stability of a
>> > buffer-wise hash.
>> >
>> > I'm not even sure this can actually happen ... After all buffers should
>> > only carry primitive types (not some complex structs) and they all seem to
>> > be 16/32/64/128 bit long and should produce "gapless" buffers.
>>
>
