If you are going to read all the dictionary blocks before reading any record batch anyway, there is certainly a way to make it work now without changing the file format itself. I think, however, that if what is currently there works, there is no meaningful advantage to adding whatever it would take to make replacement dictionaries work.
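To make that idea concrete, here is an illustrative sketch only (hypothetical message objects, not pyarrow's actual reader internals): scan every dictionary block first, folding each delta into a master dictionary per dictionary id, and only then serve random access to record batches.

    # Illustrative only: "dictionary_blocks", "block.id", "block.is_delta" and
    # "block.values" are hypothetical names, not a real pyarrow API.
    def build_master_dictionaries(dictionary_blocks):
        """Fold all dictionary blocks (initial values plus any deltas) into one
        master dictionary per dictionary id, before any record batch is decoded."""
        master = {}  # dictionary id -> list of values
        for block in dictionary_blocks:
            if block.is_delta:
                # Deltas are purely additive, so appending is enough.
                master.setdefault(block.id, []).extend(block.values)
            else:
                # Initial dictionary for this id.
                master[block.id] = list(block.values)
        return master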
On Tue, Feb 22, 2022 at 12:29 PM Micah Kornfield <[email protected]> wrote:

>> I guess since the keys are only additive then you just create the master
>> dictionary before allowing random access to the data.
>
> Yes, this is what the implementation does.
>
> At some point we might want to create an updated file format that can
> handle replacements also, but this hasn't been a priority for anyone.
>
> On Tue, Feb 22, 2022 at 10:12 AM Chris Nuernberger <[email protected]>
> wrote:
>
>> I guess since the keys are only additive then you just create the master
>> dictionary before allowing random access to the data.
>>
>> On Tue, Feb 22, 2022 at 11:08 AM Chris Nuernberger <[email protected]>
>> wrote:
>>
>>> OK, thanks, I will work with delta dictionaries.
>>>
>>> How do delta dictionaries solve the random access issue?
>>>
>>> On Tue, Feb 22, 2022 at 9:51 AM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> Dictionary replacement isn't supported in the file format because the
>>>> metadata makes it difficult to associate a particular dictionary with a
>>>> record batch for random access.
>>>>
>>>> Delta dictionaries are supported, but there was a long-standing bug that
>>>> prevented their use in Python
>>>> (https://issues.apache.org/jira/browse/ARROW-13467). If you are still
>>>> seeing issues in pyarrow 7.0, please open a bug.
>>>>
>>>> Regarding the usefulness of the file format without these features,
>>>> that is really use-case dependent.
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>> On Tuesday, February 22, 2022, Chris Nuernberger <[email protected]>
>>>> wrote:
>>>>
>>>>> How are dictionaries intended to be used in a file with multiple
>>>>> record batches?
>>>>>
>>>>> I tried saving record-batch-specific dictionaries and got this error
>>>>> from Python:
>>>>>
>>>>> > pyarrow.lib.ArrowInvalid: Unsupported dictionary replacement or
>>>>> dictionary delta in IPC file
>>>>>
>>>>> This seems to defeat the purpose of having multiple record batches in
>>>>> a single Arrow file; the workaround appears to be either to preprocess
>>>>> the entire sequence of datasets to unify the dictionaries or to save
>>>>> multiple Arrow files.
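For reference, one way the "unify the dictionaries up front" workaround discussed above could look in pyarrow (a sketch, not code from this thread): build the batches, unify their dictionaries with Table.unify_dictionaries(), and write a single IPC file so every record batch shares one dictionary. The file name and example data are made up; the delta-dictionary option mentioned at the end assumes a pyarrow version where ARROW-13467 is fixed.

    import pyarrow as pa

    # Two record batches whose dictionary-encoded column has differing dictionaries.
    batch1 = pa.RecordBatch.from_pydict(
        {"color": pa.array(["red", "green"]).dictionary_encode()})
    batch2 = pa.RecordBatch.from_pydict(
        {"color": pa.array(["blue", "red"]).dictionary_encode()})

    # Unify the dictionaries so every batch shares one master dictionary,
    # avoiding the "dictionary replacement" error when writing an IPC file.
    table = pa.Table.from_batches([batch1, batch2]).unify_dictionaries()

    with pa.OSFile("unified.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            for batch in table.to_batches():
                writer.write_batch(batch)

    # Alternatively, for the delta-dictionary approach Micah describes,
    # IpcWriteOptions exposes an emit_dictionary_deltas flag:
    #   opts = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
    #   writer = pa.ipc.new_file(sink, schema, options=opts)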
