[ https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe L. Korn resolved ARROW-2176.
--------------------------------
       Resolution: Fixed
    Fix Version/s:     (was: 0.10.0)
                   0.9.0

Issue resolved by pull request 1629
[https://github.com/apache/arrow/pull/1629]

> [C++] Extend DictionaryBuilder to support delta dictionaries
> ------------------------------------------------------------
>
>                 Key: ARROW-2176
>                 URL: https://issues.apache.org/jira/browse/ARROW-2176
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Dimitri Vorona
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] allows sending
> additional dictionary batches with a previously seen id and an isDelta flag
> to extend an existing dictionary with new entries.
> Right now, the DictionaryBuilder, the IPC writer, and the IPC reader do not
> support generating delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder
> with delta dictionary support. Usage of the API can be seen in the dictionary
> tests (see
> [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
> The basic idea is that the user simply reuses the builder object after calling
> Finish(Array*) for the first time. Subsequent calls to Append create new
> entries only for unseen elements and reuse the ids from previous dictionaries
> for elements that were already seen.
> Some considerations:
> # The API is fairly implicit. An additional flag for Finish that explicitly
> indicates the intent to use the builder for delta dictionary generation
> might help avoid errors.
> # Right now the implementation uses an additional "overflow dictionary" to
> store the seen items. This adds a copy on each Finish call and an additional
> lookup on each GetItem or Append call. I assume we might get away with
> returning Array slices at Finish, which would remove the need for the
> additional overflow dictionary. If the gist of the PR is approved, I can look
> into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the
> DictionaryBuilder API remains basically the same.
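
The reuse-after-Finish behaviour described above can be illustrated with a small standalone sketch. It does not use Arrow: ToyDeltaDictionaryBuilder, Batch and Decode are hypothetical names invented for this illustration, and the sketch keeps a single hash map instead of the "overflow dictionary" the pull request mentions. It only shows the idea that each Finish emits the entries first seen since the previous Finish, while the indices keep referring to the cumulative dictionary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy types for illustration only; NOT Arrow's DictionaryBuilder API.
struct Batch {
  std::vector<std::string> delta_dictionary;  // entries first seen in this batch
  std::vector<int32_t> indices;               // ids into the cumulative dictionary
};

class ToyDeltaDictionaryBuilder {
 public:
  // Reuses the id of a value seen in any earlier batch; otherwise assigns the
  // next free id and records the value as part of the pending delta.
  void Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it == ids_.end()) {
      const int32_t id = static_cast<int32_t>(ids_.size());
      it = ids_.emplace(value, id).first;
      pending_.push_back(value);
    }
    indices_.push_back(it->second);
  }

  // Returns the delta and the indices collected since the last Finish.
  // The value-to-id map is kept, so the builder can be reused afterwards.
  Batch Finish() {
    Batch out{std::move(pending_), std::move(indices_)};
    pending_.clear();
    indices_.clear();
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> ids_;
  std::vector<std::string> pending_;
  std::vector<int32_t> indices_;
};

// Reader side: extend the cumulative dictionary with the delta, then decode.
std::vector<std::string> Decode(std::vector<std::string>& dictionary,
                                const Batch& batch) {
  dictionary.insert(dictionary.end(), batch.delta_dictionary.begin(),
                    batch.delta_dictionary.end());
  std::vector<std::string> values;
  for (int32_t id : batch.indices) {
    assert(id >= 0 && static_cast<std::size_t>(id) < dictionary.size());
    values.push_back(dictionary[static_cast<std::size_t>(id)]);
  }
  return values;
}
```

In this sketch, appending "a", "b", "a" and calling Finish yields the delta {"a", "b"} with indices {0, 1, 0}. Appending "b", "c" to the same builder and calling Finish again yields the delta {"c"} with indices {1, 2}, so only the unseen entry is sent the second time.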