Dictionary replacements aren't supported in the file format, only deltas. Your use case is a replacement, not a delta. You could use the stream format instead.
On Fri, Jul 23, 2021 at 8:32 AM Sam Davis <[email protected]> wrote:
>
> Hey Wes,
>
> Thanks, I had not spotted this before! It doesn't seem to change the
> behaviour with `pa.ipc.new_file` however. Maybe I'm using it incorrectly?
>
> ```
> import pandas as pd
> import pyarrow as pa
>
> print(pa.__version__)
>
> schema = pa.schema([
>     ("foo", pa.dictionary(pa.int16(), pa.string()))
> ])
>
> pd1 = pd.DataFrame({"foo": pd.Categorical(["aaaa"], categories=["a"*i for i in range(64)])})
> b1 = pa.RecordBatch.from_pandas(pd1, schema=schema)
>
> pd2 = pd.DataFrame({"foo": pd.Categorical(["aaaa"], categories=["b"*i for i in range(64)])})
> b2 = pa.RecordBatch.from_pandas(pd2, schema=schema)
>
> options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
>
> with pa.ipc.new_file("/tmp/sdavis_tmp.arrow", schema=b1.schema, options=options) as writer:
>     writer.write(b1)
>     writer.write(b2)
> ```
>
> Version printed: 4.0.1
>
> Sam
> ________________________________
> From: Wes McKinney <[email protected]>
> Sent: 23 July 2021 14:24
> To: [email protected] <[email protected]>
> Subject: Re: [PyArrow] DictionaryArray isDelta Support
>
> hi Sam
>
> On Fri, Jul 23, 2021 at 8:15 AM Sam Davis <[email protected]> wrote:
> >
> > Hi,
> >
> > We want to write out RecordBatches of data, where one or more columns in a
> > batch has a `pa.string()` column encoded as a `pa.dictionary(pa.intX(),
> > pa.string()` as the column only contains a handful of unique values.
> >
> > However, PyArrow seems to lack support for writing these batches out to
> > either the streaming or (non-streaming) file format.
> >
> > When attempting to write two distinct batches the following error message
> > is triggered:
> >
> > > ArrowInvalid: Dictionary replacement detected when writing IPC file
> > > format. Arrow IPC files only support a single dictionary for a given
> > > field across all batches.
> >
> > I believe this message is false and that support is possible based on
> > reading the spec:
> >
> > > Dictionaries are written in the stream and file formats as a sequence of
> > > record batches...
> > > ...
> > > The dictionary isDelta flag allows existing dictionaries to be expanded
> > > for future record batch materializations. A dictionary batch with isDelta
> > > set indicates that its vector should be concatenated with those of any
> > > previous batches with the same id. In a stream which encodes one column,
> > > the list of strings ["A", "B", "C", "B", "D", "C", "E", "A"], with a
> > > delta dictionary batch could take the form:
> >
> > ```
> > <SCHEMA>
> > <DICTIONARY 0>
> > (0) "A"
> > (1) "B"
> > (2) "C"
> >
> > <RECORD BATCH 0>
> > 0
> > 1
> > 2
> > 1
> >
> > <DICTIONARY 0 DELTA>
> > (3) "D"
> > (4) "E"
> >
> > <RECORD BATCH 1>
> > 3
> > 2
> > 4
> > 0
> > EOS
> > ```
> >
> > > Alternatively, if isDelta is set to false, then the dictionary replaces
> > > the existing dictionary for the same ID. Using the same example as above,
> > > an alternate encoding could be:
> >
> > ```
> > <SCHEMA>
> > <DICTIONARY 0>
> > (0) "A"
> > (1) "B"
> > (2) "C"
> >
> > <RECORD BATCH 0>
> > 0
> > 1
> > 2
> > 1
> >
> > <DICTIONARY 0>
> > (0) "A"
> > (1) "C"
> > (2) "D"
> > (3) "E"
> >
> > <RECORD BATCH 1>
> > 2
> > 1
> > 3
> > 0
> > EOS
> > ```
> >
> > It also specifies in the IPC File Format (non-streaming) section:
> >
> > > In the file format, there is no requirement that dictionary keys should
> > > be defined in a DictionaryBatch before they are used in a RecordBatch, as
> > > long as the keys are defined somewhere in the file. Further more, it is
> > > invalid to have more than one non-delta dictionary batch per dictionary
> > > ID (i.e. dictionary replacement is not supported). Delta dictionaries are
> > > applied in the order they appear in the file footer.
> >
> > So for the non-streaming format multiple non-delta dictionaries are not
> > supported but one non-delta followed by delta dictionaries should be.
> >
> > Is it possible to do this in PyArrow? If so, how? If not, how easy would it
> > be to add? Is it currently possible via C++ and therefore can I write a
> > Cython or similar extension that will let me do this now without waiting
> > for a release?
> >
>
> In pyarrow (3.0.0 or later), you need to opt into emitting dictionary
> deltas using pyarrow.ipc.IpcWriteOptions. Can you show your code?
>
> https://github.com/apache/arrow/commit/8d76312dd397ebe07b71531f6d23b8caa76703dc
>
> > Best,
> >
> > Sam
> > IMPORTANT NOTICE: The information transmitted is intended only for the
> > person or entity to which it is addressed and may contain confidential
> > and/or privileged material. Any review, re-transmission, dissemination or
> > other use of, or taking of any action in reliance upon, this information by
> > persons or entities other than the intended recipient is prohibited. If you
> > received this in error, please contact the sender and delete the material
> > from any computer. Although we routinely screen for viruses, addressees
> > should check this e-mail and any attachment for viruses. We make no
> > warranty as to absence of viruses in this e-mail or any attachments.
