[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2019-07-21 Thread Wes McKinney (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889740#comment-16889740 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

I'm getting close to having something PR-worthy here. It ended up being a can 
of worms -- there are going to be a lot of follow-up issues, so I'll try to 
contain the scope of the work and leave polishing to subsequent PRs.

> [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow 
> DictionaryArray
> -------------------------------------------------------------------------
>
> Key: ARROW-3772
> URL: https://issues.apache.org/jira/browse/ARROW-3772
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Stav Nir
> Assignee: Wes McKinney
> Priority: Major
> Labels: parquet
> Fix For: 1.0.0
>
>
> Dictionary data is very common in Parquet. In the current implementation, 
> parquet-cpp always decodes dictionary-encoded data before creating a plain 
> Arrow array. This process is wasteful, since we could use Arrow's 
> DictionaryArray directly and gain several benefits:
>  # Smaller memory footprint - both during decoding and in the resulting 
> Arrow table - especially when the dictionary values are large.
>  # Better decoding performance - mostly a consequence of the first bullet: 
> fewer memory fetches and fewer allocations.
> I think these benefits could yield significant runtime improvements.
> My direction for the implementation is to read the indices (through the 
> DictionaryDecoder, after the RLE decoding) and the values separately into 
> two arrays, then create a DictionaryArray from them (see the sketch below).
> There are some questions to discuss:
>  # Should this be the default behavior for dictionary-encoded data?
>  # Should it be controlled with a parameter in the API?
>  # What should the policy be when some of the chunks are dictionary 
> encoded and some are not?
> I started implementing this but would like to hear your opinions.
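
For illustration, a minimal sketch of that direction against the public Arrow 
C++ API (the builder-based setup and literal values are invented for the 
example; a real reader would hand over the decoded buffers directly):

{code:cpp}
// Sketch only: build an arrow::DictionaryArray from separately decoded
// dictionary values and indices, without materializing a value per row.
#include <arrow/api.h>

arrow::Result<std::shared_ptr<arrow::Array>> BuildDictionaryArray() {
  // Dictionary values, standing in for a decoded Parquet dictionary page.
  arrow::StringBuilder value_builder;
  ARROW_RETURN_NOT_OK(value_builder.AppendValues({"apple", "banana", "cherry"}));
  ARROW_ASSIGN_OR_RAISE(auto dictionary, value_builder.Finish());

  // Indices into the dictionary, standing in for RLE-decoded data pages.
  arrow::Int32Builder index_builder;
  ARROW_RETURN_NOT_OK(index_builder.AppendValues({0, 2, 2, 1, 0}));
  ARROW_ASSIGN_OR_RAISE(auto indices, index_builder.Finish());

  // Combine the two arrays into a single DictionaryArray.
  return arrow::DictionaryArray::FromArrays(
      arrow::dictionary(arrow::int32(), arrow::utf8()), indices, dictionary);
}
{code}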





[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2019-07-18 Thread Wes McKinney (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888054#comment-16888054 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

At least since ARROW-3144 we have broken the constraint of a constant 
dictionary across arrays (see the illustration below). Having a mix of 
dictionary-encoded and non-dictionary-encoded arrays is interesting, but 
regardless there's a lot of refactoring to do in the Parquet library to 
expose these details.
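
To make that concrete, a small illustration (assumed public Arrow C++ APIs; 
the helper and data are invented): two chunks of the same 
dictionary(int32, utf8) type can carry entirely different dictionaries 
within one ChunkedArray.

{code:cpp}
#include <arrow/api.h>

namespace {

// Hypothetical helper: build one DictionaryArray from literal data.
arrow::Result<std::shared_ptr<arrow::Array>> MakeDictChunk(
    const std::vector<std::string>& values,
    const std::vector<int32_t>& indices) {
  arrow::StringBuilder value_builder;
  ARROW_RETURN_NOT_OK(value_builder.AppendValues(values));
  ARROW_ASSIGN_OR_RAISE(auto dict, value_builder.Finish());
  arrow::Int32Builder index_builder;
  ARROW_RETURN_NOT_OK(index_builder.AppendValues(indices));
  ARROW_ASSIGN_OR_RAISE(auto idx, index_builder.Finish());
  return arrow::DictionaryArray::FromArrays(
      arrow::dictionary(arrow::int32(), arrow::utf8()), idx, dict);
}

}  // namespace

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> TwoChunksTwoDicts() {
  ARROW_ASSIGN_OR_RAISE(auto c1, MakeDictChunk({"a", "b"}, {0, 1, 0}));
  ARROW_ASSIGN_OR_RAISE(auto c2, MakeDictChunk({"x", "y", "z"}, {2, 2, 1}));
  // One logical column, two chunks, two distinct dictionaries.
  return arrow::ChunkedArray::Make({c1, c2});
}
{code}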






[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2019-07-17 Thread Micah Kornfield (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887610#comment-16887610 ]

Micah Kornfield commented on ARROW-3772:


"I'm looking at this. This is not a small project – the assumption that values 
are fully materialized is pretty deeply baked into the library. We also have to 
deal with the "fallback" case where a column chunk starts out dictionary 
encoded and switches mid-stream because the dictionary got too big"

I don't have context on how we decided originally to designate an entire column 
dictionary encoded vs a chunk/record batch column but it seems like this might 
be another use-case where the proposal on encoding/compression might make 
things easier to code (i.e. specify dictionary encoding only on 
SparseRecordBatches where it makes sense and leave the fallback to dense 
encoding where it no longer makes sense).






[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2019-07-17 Thread Wes McKinney (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887577#comment-16887577 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

I'm looking at this. This is not a small project -- the assumption that values 
are fully materialized is pretty deeply baked into the library. We also have to 
deal with the "fallback" case where a column chunk starts out dictionary 
encoded and switches mid-stream because the dictionary got too big. What to do 
in that case is ambiguous:

* One option is to dictionary-encode the additional pages as well, so we end 
up with one big dictionary
* Another option is to optimistically leave things dictionary-encoded, and if 
we hit the fallback case then fully materialize. We can always do a cast on 
the Arrow side after the fact in this case (see the sketch after this list)
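
A sketch of that second option, assuming Arrow's compute API (current-style 
signatures; this is not the reader code itself):

{code:cpp}
// Sketch: re-encode a fully materialized (dense) column back into a
// DictionaryArray on the Arrow side after a fallback.
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Result<std::shared_ptr<arrow::Array>> ReEncodeAfterFallback(
    const std::shared_ptr<arrow::Array>& dense) {
  // DictionaryEncode scans the dense values, builds a dictionary, and
  // returns a Datum backed by a DictionaryArray.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum encoded,
                        arrow::compute::DictionaryEncode(arrow::Datum(dense)));
  return encoded.make_array();
}
{code}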

FWIW, the fallback scenario is not at all esoteric, because the default 
dictionary page size limit in the C++ library is 1MB. I think Java uses the 
same default:

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L44

I think adding an option to raise the limit to 2GB or so when writing Arrow 
DictionaryArray would help (sketched below).
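
Something like the following would express that on the write path (a sketch 
against parquet-cpp's WriterProperties builder; the ~2GB figure mirrors the 
suggestion above, not an existing default):

{code:cpp}
// Sketch: raise the dictionary page size limit from the 1MB default so that
// columns written from Arrow DictionaryArrays rarely hit the fallback.
#include <parquet/properties.h>

std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  return parquet::WriterProperties::Builder()
      .enable_dictionary()
      ->dictionary_pagesize_limit(2147483646)  // ~2GB instead of 1MB
      ->build();
}
{code}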

Things are made a bit more complex by the code duplication between 
parquet/column_reader.cc and parquet/arrow/record_reader.cc. I'll see if 
there are some things I can do to fix that while I'm working on this.






[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2019-06-18 Thread Wes McKinney (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867159#comment-16867159 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

Realistically I don't think I can get this done this week (or in time for 
0.14.0), and I think it would be worth giving the feature some care and 
attention rather than rushing it. Moving it to the 1.0.0 milestone.






[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2019-02-27 Thread Hatem Helal (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780207#comment-16780207 ]

Hatem Helal commented on ARROW-3772:


I'd like to take a stab at this after ARROW-3769.






[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2018-11-12 Thread Wes McKinney (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684422#comment-16684422 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

Moved the issue to the Arrow issue tracker.



