Re: [PyArrow] DictionaryArray isDelta Support

Wes McKinney Fri, 23 Jul 2021 06:37:34 -0700

Dictionary replacements aren't supported in the file format, only
deltas. Your use case is a replacement, not a delta. You could use the
stream format instead.
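
To make the delta/replacement distinction concrete, here is a minimal pure-Python sketch. The helper name `classify_dictionary_change` is hypothetical and is not part of pyarrow. It only encodes the rule from the spec text quoted below, that a delta is concatenated onto the existing dictionary, so the old entries must survive unchanged as a prefix of the new one. How the library itself makes this decision is not shown here.

```python
def classify_dictionary_change(old, new):
    """Classify how `new` relates to `old` for a single dictionary ID.

    Hypothetical helper for illustration only; not a pyarrow API.
    """
    if new == old:
        return "unchanged"
    # A delta only appends entries: the old dictionary must be a strict prefix.
    if len(new) > len(old) and new[:len(old)] == old:
        return "delta"
    # Anything else changes or drops existing entries.
    return "replacement"


# The two category lists from the code quoted below.
first = ["a" * i for i in range(64)]
second = ["b" * i for i in range(64)]
print(classify_dictionary_change(first, second))  # replacement

# The example from the spec: ["A", "B", "C"] extended with "D" and "E".
print(classify_dictionary_change(["A", "B", "C"], ["A", "B", "C", "D", "E"]))  # delta
```

The two lists share only their first entry (the empty string), so the second dictionary cannot be expressed as an extension of the first.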


On Fri, Jul 23, 2021 at 8:32 AM Sam Davis <[email protected]> wrote:
>
> Hey Wes,
>
> Thanks, I had not spotted this before! It doesn't seem to change the 
> behaviour with `pa.ipc.new_file` however. Maybe I'm using it incorrectly?
>
> ```
> import pandas as pd
> import pyarrow as pa
>
> print(pa.__version__)
>
> schema = pa.schema([
>     ("foo", pa.dictionary(pa.int16(), pa.string()))
> ])
>
> pd1 = pd.DataFrame({
>     "foo": pd.Categorical(["aaaa"], categories=["a" * i for i in range(64)])
> })
> b1 = pa.RecordBatch.from_pandas(pd1, schema=schema)
>
> pd2 = pd.DataFrame({
>     "foo": pd.Categorical(["aaaa"], categories=["b" * i for i in range(64)])
> })
> b2 = pa.RecordBatch.from_pandas(pd2, schema=schema)
>
> options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
>
> with pa.ipc.new_file(
>     "/tmp/sdavis_tmp.arrow", schema=b1.schema, options=options
> ) as writer:
>     writer.write(b1)
>     writer.write(b2)
> ```
>
> Version printed: 4.0.1
>
> Sam
> ________________________________
> From: Wes McKinney <[email protected]>
> Sent: 23 July 2021 14:24
> To: [email protected] <[email protected]>
> Subject: Re: [PyArrow] DictionaryArray isDelta Support
>
> hi Sam
>
> On Fri, Jul 23, 2021 at 8:15 AM Sam Davis <[email protected]> wrote:
> >
> > Hi,
> >
> > We want to write out RecordBatches of data where one or more `pa.string()` 
> > columns in a batch are encoded as `pa.dictionary(pa.intX(), pa.string())`, 
> > because each such column contains only a handful of unique values.
> >
> > However, PyArrow seems to lack support for writing these batches out to 
> > either the streaming or (non-streaming) file format.
> >
> > When attempting to write two distinct batches the following error message 
> > is triggered:
> >
> > > ArrowInvalid: Dictionary replacement detected when writing IPC file 
> > > format. Arrow IPC files only support a single dictionary for a given 
> > > field across all batches.
> >
> > I believe this message is false and that support is possible based on 
> > reading the spec:
> >
> > > Dictionaries are written in the stream and file formats as a sequence of 
> > > record batches...
> > > ...
> > > The dictionary isDelta flag allows existing dictionaries to be expanded 
> > > for future record batch materializations. A dictionary batch with isDelta 
> > > set indicates that its vector should be concatenated with those of any 
> > > previous batches with the same id. In a stream which encodes one column, 
> > > the list of strings ["A", "B", "C", "B", "D", "C", "E", "A"], with a 
> > > delta dictionary batch could take the form:
> >
> > ```
> > <SCHEMA>
> > <DICTIONARY 0>
> > (0) "A"
> > (1) "B"
> > (2) "C"
> >
> > <RECORD BATCH 0>
> > 0
> > 1
> > 2
> > 1
> >
> > <DICTIONARY 0 DELTA>
> > (3) "D"
> > (4) "E"
> >
> > <RECORD BATCH 1>
> > 3
> > 2
> > 4
> > 0
> > EOS
> > ```
> >
> > > Alternatively, if isDelta is set to false, then the dictionary replaces 
> > > the existing dictionary for the same ID. Using the same example as above, 
> > > an alternate encoding could be:
> >
> > ```
> > <SCHEMA>
> > <DICTIONARY 0>
> > (0) "A"
> > (1) "B"
> > (2) "C"
> >
> > <RECORD BATCH 0>
> > 0
> > 1
> > 2
> > 1
> >
> > <DICTIONARY 0>
> > (0) "A"
> > (1) "C"
> > (2) "D"
> > (3) "E"
> >
> > <RECORD BATCH 1>
> > 2
> > 1
> > 3
> > 0
> > EOS
> > ```
> >
> > It also specifies in the IPC File Format (non-streaming) section:
> >
> > > In the file format, there is no requirement that dictionary keys should 
> > > be defined in a DictionaryBatch before they are used in a RecordBatch, as 
> > > long as the keys are defined somewhere in the file. Further more, it is 
> > > invalid to have more than one non-delta dictionary batch per dictionary 
> > > ID (i.e. dictionary replacement is not supported). Delta dictionaries are 
> > > applied in the order they appear in the file footer.
> >
> > So for the non-streaming format, multiple non-delta dictionaries are not 
> > supported, but one non-delta dictionary followed by delta dictionaries 
> > should be.
> >
> > Is it possible to do this in PyArrow? If so, how? If not, how easy would it 
> > be to add? Is it currently possible via C++ and therefore can I write a 
> > Cython or similar extension that will let me do this now without waiting 
> > for a release?
> >
>
> In pyarrow (3.0.0 or later), you need to opt into emitting dictionary
> deltas using pyarrow.ipc.IpcWriteOptions. Can you show your code?
>
> https://github.com/apache/arrow/commit/8d76312dd397ebe07b71531f6d23b8caa76703dc
>
> > Best,
> >
> > Sam
> > IMPORTANT NOTICE: The information transmitted is intended only for the 
> > person or entity to which it is addressed and may contain confidential 
> > and/or privileged material. Any review, re-transmission, dissemination or 
> > other use of, or taking of any action in reliance upon, this information by 
> > persons or entities other than the intended recipient is prohibited. If you 
> > received this in error, please contact the sender and delete the material 
> > from any computer. Although we routinely screen for viruses, addressees 
> > should check this e-mail and any attachment for viruses. We make no 
> > warranty as to absence of viruses in this e-mail or any attachments.
