GitHub user assignUser added a comment to the discussion: Does the arrow 
protocol require unique dictionary values?

From the Arrow format spec:

> Note that a dictionary is permitted to contain duplicate values or nulls:
> 
> ```
> data VarBinary (dictionary-encoded)
>    index_type: Int32
>    values: [0, 1, 3, 1, 4, 2]
> 
> dictionary
>    type: VarBinary
>    values: ['foo', 'bar', 'baz', 'foo', null]
> ```
> 
> The null count of such arrays is dictated only by the validity bitmap of its 
> indices, irrespective of any null values in the dictionary.
[Arrow Columnar Format – Dictionary-encoded Layout](https://arrow.apache.org/docs/format/Columnar.html)

(Thanks Docs chat bot!) 
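
To make the quoted example concrete, here is a minimal plain-Python sketch of how I read it. This is not the Arrow library API: `decode` and `null_count` are hypothetical helper names, and the validity bitmap is modeled as a simple list of booleans, with every index assumed valid since the spec example does not show any null indices.

```python
from typing import List, Optional, Sequence


def decode(
    indices: Sequence[int],
    validity: Sequence[bool],
    dictionary: Sequence[Optional[bytes]],
) -> List[Optional[bytes]]:
    """Look up each valid index in the dictionary; invalid slots become None."""
    return [dictionary[i] if ok else None for i, ok in zip(indices, validity)]


def null_count(validity: Sequence[bool]) -> int:
    """Count nulls from the indices' validity only, ignoring dictionary nulls."""
    return sum(1 for ok in validity if not ok)


# Values taken from the spec example quoted above.
indices = [0, 1, 3, 1, 4, 2]
dictionary = [b"foo", b"bar", b"baz", b"foo", None]  # duplicate 'foo' and a null

# Assumption: all indices are valid (no bits cleared in the validity bitmap).
validity = [True] * len(indices)

decoded = decode(indices, validity, dictionary)
print(decoded)               # [b'foo', b'bar', b'foo', b'bar', None, b'baz']
print(null_count(validity))  # 0
```

Indices 0 and 3 both decode to `foo`, so the duplicate entry is harmless for readers, and the null count stays 0 even though index 4 points at a null dictionary entry.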

So forcing de-duplication seems to go against the spec and is therefore a
bug?

GitHub link: 
https://github.com/apache/arrow/discussions/47134#discussioncomment-13809560

----
This is an automatically sent email for user@arrow.apache.org.
To unsubscribe, please send an email to: user-unsubscr...@arrow.apache.org
