[ https://issues.apache.org/jira/browse/ARROW-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061915#comment-17061915 ]

Joris Van den Bossche commented on ARROW-5572:
----------------------------------------------

This now works correctly with the new Datasets API, since we can filter on both 
partition keys and "normal" columns. 

So once we use the datasets API under the hood in pyarrow.parquet (ARROW-8039), 
this issue will be resolved.
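
For illustration, a minimal sketch of filtering on a regular column with the pyarrow.dataset API (assuming the 'test_parquet_row_filters' dataset from the reproduction below, written with hive-style partitioning on 'a'):

{code:python}
import pyarrow.dataset as ds

# Open the partitioned directory as a dataset; "hive" partitioning parses the
# key=value directory names (here the 'a' partition column).
dataset = ds.dataset('test_parquet_row_filters', format='parquet', partitioning='hive')

# Filtering works the same way for the partition column 'a' ...
dataset.to_table(filter=ds.field('a') == 1).to_pandas()

# ... and for the regular column 'b', which the legacy code path silently ignored.
dataset.to_table(filter=ds.field('b') == 1).to_pandas()
{code}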


> [Python] raise error message when passing invalid filter in parquet reading
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-5572
>                 URL: https://issues.apache.org/jira/browse/ARROW-5572
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Joris Van den Bossche
>            Priority: Minor
>              Labels: dataset-parquet-read, parquet
>
> From 
> https://stackoverflow.com/questions/56522977/using-predicates-to-filter-rows-from-pyarrow-parquet-parquetdataset
> For example, when the filter specifies a normal column that is not a key in 
> the partitioned folder hierarchy, the filter is silently ignored. It would be 
> nice to get an error message for this.  
> Reproducible example:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1], 'c': [1, 2, 3, 4]})
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, 'test_parquet_row_filters', partition_cols=['a'])
>
> # filter on 'a' (partition column) -> works
> pq.read_table('test_parquet_row_filters', filters=[('a', '=', 1)]).to_pandas()
>
> # filter on normal column (in future could do row group filtering) -> silently does nothing
> pq.read_table('test_parquet_row_filters', filters=[('b', '=', 1)]).to_pandas()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)