[ https://issues.apache.org/jira/browse/ARROW-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ildar updated ARROW-4293:
-------------------------
    Description: 
Hi,

I'm trying to use per-column statistics (min/max values) to filter out row
groups while reading a Parquet file, but I don't see statistics for binary
columns. I noticed that {{ApplicationVersion::HasCorrectStatistics()}} discards
statistics that have sort order {{UNSIGNED}} and haven't been created by
{{parquet-cpp}}. As I understand it, there used to be some issues in
{{parquet-mr}}. Do they still persist?
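
To illustrate why the sort order matters for this kind of filtering, here is a minimal standalone sketch. It does not use the parquet-cpp API: {{BinaryStats}}, {{LessUnsigned}}, {{LessSigned}} and {{MayContain}} are hypothetical names, and it assumes the earlier {{parquet-mr}} problem was that binary min/max were computed with signed byte comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical holder for one row group's min/max of a binary column.
struct BinaryStats {
    std::string min;
    std::string max;
};

// Lexicographic comparison treating each byte as unsigned (0..255).
bool LessUnsigned(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char x, char y) {
            return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
        });
}

// Same comparison, but treating each byte as signed (-128..127).
bool LessSigned(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char x, char y) {
            return static_cast<signed char>(x) < static_cast<signed char>(y);
        });
}

// True if the row group may contain `value`, i.e. min <= value <= max
// under unsigned byte ordering; false means the row group can be skipped.
bool MayContain(const BinaryStats& stats, const std::string& value) {
    return !LessUnsigned(value, stats.min) && !LessUnsigned(stats.max, value);
}
```

The two comparisons disagree as soon as a byte is 0x80 or higher (any non-ASCII UTF-8 text, for instance), so min/max written under one ordering cannot safely be used to skip row groups under the other.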

For example, I have a Parquet file created with {{parquet-mr}} version 1.10, and
it seems to have correct min/max values for binary columns. {{parquet-cpp}} also
works fine for me if I remove this code from {{HasCorrectStatistics()}}:

 
{code:java}
if (SortOrder::SIGNED != sort_order && !max_equals_min) {
    return false;
}{code}
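
To make the effect of that condition concrete, here is a small standalone model of it. This is not the parquet-cpp source: the enum below is a stand-in and the function name {{PassesSortOrderCheck}} is hypothetical. Only the {{if}} itself is taken from the snippet above.

```cpp
#include <cassert>

// Stand-in for the sort orders mentioned in this issue (assumed values).
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

// Models only the quoted condition: statistics are rejected when the sort
// order is not SIGNED, unless min and max are equal (in which case the
// ordering used to compute them cannot matter).
bool PassesSortOrderCheck(SortOrder sort_order, bool max_equals_min) {
    if (SortOrder::SIGNED != sort_order && !max_equals_min) {
        return false;
    }
    return true;
}
```

In this model a binary column with {{UNSIGNED}} sort order and distinct min/max values is always rejected, which matches the missing statistics described above.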
 

  was:
Hi,

I'm trying to use per-column statistics (min/max values) to filter out row
groups while reading a Parquet file, but I don't see statistics for binary
columns. I noticed that {{ApplicationVersion::HasCorrectStatistics()}} discards
statistics that have sort order {{UNSIGNED}} and haven't been created by
{{parquet-cpp}}. As I understand it, there used to be some issues in
{{parquet-mr}}. Do they still persist?

For example, I have a Parquet file created with {{parquet-mr}} version 1.10, and
it seems to have correct min/max values for binary columns. {{parquet-cpp}} also
works fine for me if I remove this code from {{HasCorrectStatistics()}}:

{{ if (SortOrder::SIGNED != sort_order && !max_equals_min) {}}
{{    return false; }}}


> [C++] Can't access parquet statistics on binary columns
> -------------------------------------------------------
>
>                 Key: ARROW-4293
>                 URL: https://issues.apache.org/jira/browse/ARROW-4293
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Ildar
>            Priority: Major
>
> Hi,
> I'm trying to use per-column statistics (min/max values) to filter out row
> groups while reading a Parquet file, but I don't see statistics for binary
> columns. I noticed that {{ApplicationVersion::HasCorrectStatistics()}}
> discards statistics that have sort order {{UNSIGNED}} and haven't been
> created by {{parquet-cpp}}. As I understand it, there used to be some issues
> in {{parquet-mr}}. Do they still persist?
> For example, I have a Parquet file created with {{parquet-mr}} version 1.10,
> and it seems to have correct min/max values for binary columns.
> {{parquet-cpp}} also works fine for me if I remove this code from
> {{HasCorrectStatistics()}}:
>  
> {code:java}
> if (SortOrder::SIGNED != sort_order && !max_equals_min) {
>     return false;
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
