[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889740#comment-16889740 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

I'm getting close to having something PR-worthy here. It ended up being a can of worms -- there will be a lot of follow-up issues, so I'll try to contain the scope of the work and leave polishing to subsequent PRs.

> [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
> -----------------------------------------------------------------------------------------
>
>          Key: ARROW-3772
>          URL: https://issues.apache.org/jira/browse/ARROW-3772
>      Project: Apache Arrow
>   Issue Type: Improvement
>   Components: C++
>     Reporter: Stav Nir
>     Assignee: Wes McKinney
>     Priority: Major
>       Labels: parquet
>      Fix For: 1.0.0
>
> Dictionary data is very common in Parquet. In the current implementation, parquet-cpp always decodes dictionary-encoded data before creating a plain Arrow array. This process is wasteful, since we could use Arrow's DictionaryArray directly and achieve several benefits:
> # Smaller memory footprint, both during the decoding process and in the resulting Arrow table, especially when the dictionary values are large.
> # Better decoding performance, mostly as a result of the first bullet: fewer memory fetches and fewer allocations.
> I think those benefits could yield significant improvements in runtime.
> My direction for the implementation is to read the indices (through the DictionaryDecoder, after the RLE decoding) and the values separately into two arrays, and create a DictionaryArray from them.
> There are some questions to discuss:
> # Should this be the default behavior for dictionary-encoded data?
> # Should it be controlled by a parameter in the API?
> # What should the policy be when some of the chunks are dictionary encoded and some are not?
> I started implementing this but would like to hear your opinions.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
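The direction described in the issue -- keep the dictionary values and the RLE-decoded indices as two separate arrays rather than materializing every value -- can be sketched in plain Python. This is only an illustration of the idea, not parquet-cpp's actual code; the toy run-length encoding here stands in for Parquet's real RLE/bit-packed hybrid.

```python
# A dictionary-encoded Parquet chunk is a small dictionary of distinct
# values plus an index stream, stored run-length encoded. The toy RLE
# below is just (index, run_length) pairs.

def decode_rle(runs):
    """Expand (index, count) runs into a flat list of dictionary indices."""
    indices = []
    for index, count in runs:
        indices.extend([index] * count)
    return indices

def materialize(dictionary, indices):
    """Old path: fully decode every value into a plain array."""
    return [dictionary[i] for i in indices]

def as_dictionary_array(dictionary, indices):
    """New path: keep the (dictionary, indices) pair, analogous to an
    arrow::DictionaryArray -- no per-row copies of the values."""
    return {"dictionary": dictionary, "indices": indices}

dictionary = ["apple", "banana", "cherry"]   # distinct values, stored once
runs = [(0, 3), (2, 2), (1, 1)]              # RLE-encoded index stream
indices = decode_rle(runs)

dense = materialize(dictionary, indices)
dict_array = as_dictionary_array(dictionary, indices)

print(dense)                  # ['apple', 'apple', 'apple', 'cherry', 'cherry', 'banana']
print(dict_array["indices"])  # [0, 0, 0, 2, 2, 1]
```

The memory argument in the issue is visible even in this toy: the dense path copies each (possibly large) value once per row, while the dictionary path stores each distinct value once plus one small integer per row.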
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888054#comment-16888054 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

At least after ARROW-3144 we have broken the constraint of a constant dictionary across arrays. Having a mix of dictionary-encoded and non-dictionary-encoded arrays is interesting, but regardless there's a lot of refactoring to do in the Parquet library to expose these details.
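Once the constant-dictionary constraint is gone, each chunk of a column can carry its own dictionary, and combining chunks means building a unified dictionary and remapping each chunk's indices into it. A minimal sketch of that unification, with hypothetical helper names (Arrow's real implementation lives in C++):

```python
# Sketch: two dictionary-encoded chunks whose dictionaries differ are
# combined by interning every value into one unified dictionary and
# translating each chunk's indices through a per-chunk transpose map.

def unify_chunks(chunks):
    """chunks: list of (dictionary, indices) pairs. Returns the unified
    dictionary and the remapped index list for each chunk."""
    unified = []
    position = {}          # value -> index in the unified dictionary
    remapped_chunks = []
    for dictionary, indices in chunks:
        transpose = []     # old index in this chunk -> new unified index
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            transpose.append(position[value])
        remapped_chunks.append([transpose[i] for i in indices])
    return unified, remapped_chunks

chunk_a = (["x", "y"], [0, 1, 0])
chunk_b = (["y", "z"], [1, 0])   # different dictionary, overlapping values

unified, remapped = unify_chunks([chunk_a, chunk_b])
print(unified)    # ['x', 'y', 'z']
print(remapped)   # [[0, 1, 0], [2, 1]]
```

The transpose map is small (one entry per distinct value), so the cost of unification scales with dictionary sizes rather than with row counts.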
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887610#comment-16887610 ]

Micah Kornfield commented on ARROW-3772:
----------------------------------------

"I'm looking at this. This is not a small project -- the assumption that values are fully materialized is pretty deeply baked into the library. We also have to deal with the "fallback" case where a column chunk starts out dictionary encoded and switches mid-stream because the dictionary got too big"

I don't have context on how we originally decided to designate an entire column as dictionary encoded rather than a chunk/record-batch column, but it seems like this might be another use case where the proposal on encoding/compression could make things easier to code (i.e. specify dictionary encoding only on SparseRecordBatches where it makes sense, and fall back to dense encoding where it no longer makes sense).
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887577#comment-16887577 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

I'm looking at this. This is not a small project -- the assumption that values are fully materialized is pretty deeply baked into the library. We also have to deal with the "fallback" case where a column chunk starts out dictionary encoded and switches mid-stream because the dictionary got too big. What to do in that case is ambiguous:

* One option is to dictionary-encode the additional pages, so we could end up with one big dictionary
* Another option is to optimistically leave things dictionary-encoded, and if we hit the fallback case then we fully materialize. We can always do a cast on the Arrow side after the fact in this case

FWIW, the fallback scenario is not at all esoteric, because the default dictionary page size limit in the C++ library is 1MB. I think Java is the same: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L44

I think adding an option to raise the limit to 2GB or so when writing Arrow DictionaryArray would help. Things are made a bit more complex by the code duplication between parquet/column_reader.cc and parquet/arrow/record_reader.cc. I'll see if there are some things I can do to fix that while I'm working on this.
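The second option above -- read optimistically as dictionary indices, and fully materialize only when a chunk actually fell back to plain encoding -- can be sketched like this. The page objects are simplified stand-ins, not parquet-cpp's real reader types:

```python
# Sketch of the "optimistic" fallback strategy: pages are decoded as
# dictionary indices while possible; if the writer fell back to PLAIN
# encoding mid-chunk, everything decoded so far is materialized and the
# rest of the chunk is read densely.

def read_column_chunk(dictionary, pages):
    """pages: list of ("dict", [indices]) or ("plain", [values]).
    Returns ("dictionary", dictionary, indices) if every page was
    dictionary encoded, else ("dense", values)."""
    indices = []
    for pos, (kind, payload) in enumerate(pages):
        if kind == "dict":
            indices.extend(payload)
        else:
            # Fallback hit: materialize the indices decoded so far, then
            # append this and all remaining plain-encoded pages.
            dense = [dictionary[i] for i in indices]
            for _, values in pages[pos:]:
                dense.extend(values)
            return ("dense", dense)
    return ("dictionary", dictionary, indices)

d = ["a", "b"]
all_dict = [("dict", [0, 1]), ("dict", [1])]
fell_back = [("dict", [0, 1]), ("plain", ["c", "d"])]

print(read_column_chunk(d, all_dict))   # ('dictionary', ['a', 'b'], [0, 1, 1])
print(read_column_chunk(d, fell_back))  # ('dense', ['a', 'b', 'c', 'd'])
```

This mirrors the trade-off in the comment: the common all-dictionary case stays cheap, and the fallback case degrades to exactly what the reader does today (a fully materialized array), which could then be re-encoded with a cast on the Arrow side if desired.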
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867159#comment-16867159 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

Realistically I don't think I can get this done this week (or in time for 0.14.0), and I think it would be worth giving the feature some care and attention rather than rushing it. Moving to the 1.0.0 milestone.
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780207#comment-16780207 ]

Hatem Helal commented on ARROW-3772:
------------------------------------

I'd like to take a stab at this after ARROW-3769.
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684422#comment-16684422 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

Moved issue to Arrow issue tracker.