Hi,

We want to write out RecordBatches of data, where one or more columns in each 
batch is a `pa.string()` column encoded as `pa.dictionary(pa.intX(), 
pa.string())`, since the column only contains a handful of unique values.

However, PyArrow seems to lack support for writing these batches out in either 
the streaming format or the (non-streaming) file format.

When attempting to write two distinct batches, the following error is raised:

> ArrowInvalid: Dictionary replacement detected when writing IPC file format. 
> Arrow IPC files only support a single dictionary for a given field across all 
> batches.
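
For reference, a minimal sketch along the lines of what we are doing (the 
column name and data are illustrative, and the exact behaviour may depend on 
the PyArrow version):

```
import pyarrow as pa
import pyarrow.ipc as ipc

# Two batches whose dictionary-encoded column ends up with different
# dictionaries ("A"/"B"/"C" vs "D"/"C"/"E"/"A").
batch1 = pa.RecordBatch.from_arrays(
    [pa.array(["A", "B", "C", "B"]).dictionary_encode()], names=["col"])
batch2 = pa.RecordBatch.from_arrays(
    [pa.array(["D", "C", "E", "A"]).dictionary_encode()], names=["col"])

sink = pa.BufferOutputStream()
writer = ipc.new_file(sink, batch1.schema)
writer.write_batch(batch1)
writer.write_batch(batch2)  # ArrowInvalid: Dictionary replacement detected ...
writer.close()
```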

I believe this message is incorrect and that support should be possible, based 
on my reading of the spec:

> Dictionaries are written in the stream and file formats as a sequence of 
> record batches...
> ...
> The dictionary isDelta flag allows existing dictionaries to be expanded for 
> future record batch materializations. A dictionary batch with isDelta set 
> indicates that its vector should be concatenated with those of any previous 
> batches with the same id. In a stream which encodes one column, the list of 
> strings ["A", "B", "C", "B", "D", "C", "E", "A"], with a delta dictionary 
> batch could take the form:

```
<SCHEMA>
<DICTIONARY 0>
(0) "A"
(1) "B"
(2) "C"

<RECORD BATCH 0>
0
1
2
1

<DICTIONARY 0 DELTA>
(3) "D"
(4) "E"

<RECORD BATCH 1>
3
2
4
0
EOS
```

> Alternatively, if isDelta is set to false, then the dictionary replaces the 
> existing dictionary for the same ID. Using the same example as above, an 
> alternate encoding could be:

```
<SCHEMA>
<DICTIONARY 0>
(0) "A"
(1) "B"
(2) "C"

<RECORD BATCH 0>
0
1
2
1

<DICTIONARY 0>
(0) "A"
(1) "C"
(2) "D"
(3) "E"

<RECORD BATCH 1>
2
1
3
0
EOS
```

The spec also states, in the IPC File Format (non-streaming) section:

> In the file format, there is no requirement that dictionary keys should be 
> defined in a DictionaryBatch before they are used in a RecordBatch, as long 
> as the keys are defined somewhere in the file. Furthermore, it is invalid to 
> have more than one non-delta dictionary batch per dictionary ID (i.e. 
> dictionary replacement is not supported). Delta dictionaries are applied in 
> the order they appear in the file footer.

So for the non-streaming format, multiple non-delta dictionaries are not 
supported, but one non-delta dictionary followed by delta dictionaries should be.
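
To make that concrete, here is a rough sketch in plain PyArrow (not an existing 
writer API; the variable names are just illustrative) of how a second batch's 
dictionary could be expressed as a delta over the first one, reproducing the 
first spec example above:

```
import pyarrow as pa

first = pa.array(["A", "B", "C", "B"]).dictionary_encode()
second = pa.array(["D", "C", "E", "A"]).dictionary_encode()

existing = first.dictionary.to_pylist()   # ["A", "B", "C"]
delta = [v for v in second.dictionary.to_pylist() if v not in existing]
# delta == ["D", "E"] -- only the values not already in the dictionary

# Re-encode the second batch's indices against the combined dictionary
combined = existing + delta               # ["A", "B", "C", "D", "E"]
indices = [combined.index(v) for v in second.to_pylist()]
# indices == [3, 2, 4, 0], matching RECORD BATCH 1 in the first spec example

# A writer supporting deltas would emit `delta` as a DICTIONARY DELTA message
# followed by a record batch containing `indices`.
```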

Is it possible to do this in PyArrow? If so, how? If not, how easy would it be 
to add? Is it currently possible via C++, and if so could I write a Cython or 
similar extension that would let me do this now without waiting for a release?

Best,

Sam
