Hi Sam,

On Fri, Jul 23, 2021 at 8:15 AM Sam Davis <[email protected]> wrote:
>
> Hi,
>
> We want to write out RecordBatches of data, where one or more columns in a 
> batch is a `pa.string()` column encoded as `pa.dictionary(pa.intX(), 
> pa.string())`, as the column only contains a handful of unique values.
>
> However, PyArrow seems to lack support for writing these batches out to 
> either the streaming or (non-streaming) file format.
>
> When attempting to write two distinct batches the following error message is 
> triggered:
>
> > ArrowInvalid: Dictionary replacement detected when writing IPC file format. 
> > Arrow IPC files only support a single dictionary for a given field across 
> > all batches.
>
> I believe this message is false and that support is possible based on reading 
> the spec:
>
> > Dictionaries are written in the stream and file formats as a sequence of 
> > record batches...
> > ...
> > The dictionary isDelta flag allows existing dictionaries to be expanded for 
> > future record batch materializations. A dictionary batch with isDelta set 
> > indicates that its vector should be concatenated with those of any previous 
> > batches with the same id. In a stream which encodes one column, the list of 
> > strings ["A", "B", "C", "B", "D", "C", "E", "A"], with a delta dictionary 
> > batch could take the form:
>
> ```
> <SCHEMA>
> <DICTIONARY 0>
> (0) "A"
> (1) "B"
> (2) "C"
>
> <RECORD BATCH 0>
> 0
> 1
> 2
> 1
>
> <DICTIONARY 0 DELTA>
> (3) "D"
> (4) "E"
>
> <RECORD BATCH 1>
> 3
> 2
> 4
> 0
> EOS
> ```
>
> > Alternatively, if isDelta is set to false, then the dictionary replaces the 
> > existing dictionary for the same ID. Using the same example as above, an 
> > alternate encoding could be:
>
> ```
> <SCHEMA>
> <DICTIONARY 0>
> (0) "A"
> (1) "B"
> (2) "C"
>
> <RECORD BATCH 0>
> 0
> 1
> 2
> 1
>
> <DICTIONARY 0>
> (0) "A"
> (1) "C"
> (2) "D"
> (3) "E"
>
> <RECORD BATCH 1>
> 2
> 1
> 3
> 0
> EOS
> ```
>
> It also specifies in the IPC File Format (non-streaming) section:
>
> > In the file format, there is no requirement that dictionary keys should be 
> > defined in a DictionaryBatch before they are used in a RecordBatch, as long 
> > as the keys are defined somewhere in the file. Further more, it is invalid 
> > to have more than one non-delta dictionary batch per dictionary ID (i.e. 
> > dictionary replacement is not supported). Delta dictionaries are applied in 
> > the order they appear in the file footer.
>
> So for the non-streaming format multiple non-delta dictionaries are not 
> supported but one non-delta followed by delta dictionaries should be.
>
> Is it possible to do this in PyArrow? If so, how? If not, how easy would it 
> be to add? Is it currently possible via C++ and therefore can I write a 
> Cython or similar extension that will let me do this now without waiting for 
> a release?
>

In pyarrow (3.0.0 or later), you need to opt into emitting dictionary
deltas by passing pyarrow.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
to the writer. Can you show your code?

https://github.com/apache/arrow/commit/8d76312dd397ebe07b71531f6d23b8caa76703dc

> Best,
>
> Sam
