GitHub user assignUser added a comment to the discussion: Does the arrow protocol require unique dictionary values?
From the Arrow format spec:

> Note that a dictionary is permitted to contain duplicate values or nulls:
>
> ```
> data VarBinary (dictionary-encoded)
>    index_type: Int32
>    values: [0, 1, 3, 1, 4, 2]
>
> dictionary
>    type: VarBinary
>    values: ['foo', 'bar', 'baz', 'foo', null]
> ```
>
> The null count of such arrays is dictated only by the validity bitmap of its
> indices, irrespective of any null values in the dictionary.

[Arrow Columnar Format – Dictionary-encoded Layout](https://arrow.apache.org/docs/format/Columnar.html) (Thanks Docs chat bot!)

So forcing de-duplication seems to go against the spec and is therefore a bug?

GitHub link: https://github.com/apache/arrow/discussions/47134#discussioncomment-13809560

----
This is an automatically sent email for user@arrow.apache.org.
To unsubscribe, please send an email to: user-unsubscr...@arrow.apache.org
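
For anyone who wants to see the quoted spec example in action, here is a minimal sketch in plain Python. It is not the Arrow or pyarrow API: the helper names `decode` and `null_count` are hypothetical, and a `None` entry in the index list is an assumed stand-in for an unset bit in the indices' validity bitmap.

```python
from typing import List, Optional, Sequence

# Hypothetical stand-ins for the spec's layout: a dictionary that holds a
# duplicate value ('foo' at positions 0 and 3) and a null (position 4).
dictionary: List[Optional[bytes]] = [b"foo", b"bar", b"baz", b"foo", None]

# Indices from the spec example. Every index is present, so the validity
# bitmap of the indices would have all bits set.
indices: List[Optional[int]] = [0, 1, 3, 1, 4, 2]


def decode(idx: Sequence[Optional[int]],
           values: Sequence[Optional[bytes]]) -> List[Optional[bytes]]:
    """Resolve each index against the dictionary; a missing index stays None."""
    return [None if i is None else values[i] for i in idx]


def null_count(idx: Sequence[Optional[int]]) -> int:
    """Count nulls from the indices only, ignoring nulls in the dictionary."""
    return sum(1 for i in idx if i is None)


decoded = decode(indices, dictionary)
print(decoded)              # [b'foo', b'bar', b'foo', b'bar', None, b'baz']
print(null_count(indices))  # 0
```

Indices 0 and 3 both decode to `foo`, and index 4 decodes to a null value that the null count does not report, which matches the last sentence of the spec excerpt.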