[jira] [Updated] (ARROW-4076) [Python] schema validation and filters
[ https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4076:
--------------------------------
    Labels: dataset easyfix parquet pull-request-available  (was: dataset datasets easyfix parquet pull-request-available)

> [Python] schema validation and filters
> --------------------------------------
>
>                 Key: ARROW-4076
>                 URL: https://issues.apache.org/jira/browse/ARROW-4076
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: George Sakkis
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: dataset, easyfix, parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently [schema validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900] of {{ParquetDataset}} takes place before filtering. This may raise a {{ValueError}} if the schema is different in some dataset pieces, even if these pieces would be subsequently filtered out. I think validation should happen after filtering to prevent such spurious errors:
> {noformat}
> --- a/pyarrow/parquet.py
> +++ b/pyarrow/parquet.py
> @@ -878,13 +878,13 @@
>          if split_row_groups:
>              raise NotImplementedError("split_row_groups not yet implemented")
>
> -        if validate_schema:
> -            self.validate_schemas()
> -
>          if filters is not None:
>              filters = _check_filters(filters)
>              self._filter(filters)
>
> +        if validate_schema:
> +            self.validate_schemas()
> +
>      def validate_schemas(self):
>          open_file = self._get_open_file_func()
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
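The reordering proposed in the diff can be illustrated with a minimal stand-alone sketch. The `Piece` and `Dataset` classes below are hypothetical stand-ins, not pyarrow's actual implementation: they only show why a piece with a mismatched schema can trigger a spurious `ValueError` unless filtering runs before validation.

```python
class Piece:
    """One file of a partitioned dataset (hypothetical stand-in)."""
    def __init__(self, partition, schema):
        self.partition = partition  # e.g. {"year": 2018}
        self.schema = schema        # e.g. a tuple of column names


class Dataset:
    """Sketch of the proposed ordering: filter first, then validate."""
    def __init__(self, pieces, filters=None, validate_schema=True):
        self.pieces = list(pieces)
        # Filter first: pieces excluded by `filters` can then never
        # cause a spurious schema-mismatch error.
        if filters is not None:
            self._filter(filters)
        if validate_schema:
            self.validate_schemas()

    def _filter(self, filters):
        # `filters` is a list of (key, value) pairs; keep matching pieces.
        self.pieces = [p for p in self.pieces
                       if all(p.partition.get(k) == v for k, v in filters)]

    def validate_schemas(self):
        # Compare every surviving piece against the first piece's schema.
        if not self.pieces:
            return
        reference = self.pieces[0].schema
        for piece in self.pieces[1:]:
            if piece.schema != reference:
                raise ValueError(
                    f"Schema in piece was different: {piece.schema} "
                    f"vs {reference}")


pieces = [
    Piece({"year": 2018}, ("a", "b")),
    Piece({"year": 2019}, ("a", "b", "c")),  # extra column "c"
]

# With validation before filtering this would raise ValueError; because
# the mismatched 2019 piece is filtered out first, construction succeeds.
ds = Dataset(pieces, filters=[("year", 2018)])
assert len(ds.pieces) == 1
```

Constructing the same dataset without filters still raises `ValueError`, so genuine schema mismatches are still caught.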
[jira] [Updated] (ARROW-4076) [Python] schema validation and filters
[ https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques updated ARROW-4076:
------------------------------------------
    Labels: dataset datasets easyfix parquet pull-request-available  (was: datasets easyfix parquet pull-request-available)
[jira] [Updated] (ARROW-4076) [Python] schema validation and filters
[ https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4076:
--------------------------------
    Labels: datasets easyfix parquet pull-request-available  (was: easyfix parquet pull-request-available)
[jira] [Updated] (ARROW-4076) [Python] schema validation and filters
[ https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-4076:
-----------------------------------------
    Labels: easyfix parquet pull-request-available  (was: easyfix pull-request-available)
[jira] [Updated] (ARROW-4076) [Python] schema validation and filters
[ https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4076:
--------------------------------
    Fix Version/s: 0.14.0
[jira] [Updated] (ARROW-4076) [Python] schema validation and filters
[ https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Sakkis updated ARROW-4076:
---------------------------------
    Description: (updated)