[jira] [Created] (ARROW-5885) Support optional arrow components via extras_require
George Sakkis created ARROW-5885: Summary: Support optional arrow components via extras_require Key: ARROW-5885 URL: https://issues.apache.org/jira/browse/ARROW-5885 Project: Apache Arrow Issue Type: Wish Components: Python Reporter: George Sakkis Since Arrow (and pyarrow) have several independent optional components, it would be convenient if, instead of installing all of them, these could be opted in from pip like {{pip install pyarrow[gandiva,flight,plasma]}} or opted out like {{pip install pyarrow[no-gandiva,no-flight,no-plasma]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
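For illustration, a minimal sketch of how such opt-in extras could be declared with setuptools. The extra names mirror the proposal, but the requirement strings ("pyarrow-gandiva" etc.) are hypothetical placeholders; how Arrow would actually split its components into distributions is an open question.

```python
# Hedged sketch: declaring opt-in extras for setuptools' extras_require.
# The requirement strings below are hypothetical placeholders, not real
# distributions; the real split would be up to the Arrow build system.
extras_require = {
    "gandiva": ["pyarrow-gandiva"],
    "flight": ["pyarrow-flight"],
    "plasma": ["pyarrow-plasma"],
}
# Convenience extra that pulls in every optional component at once.
extras_require["all"] = sorted(
    {req for reqs in list(extras_require.values()) for req in reqs}
)
# This dict would be passed as setup(..., extras_require=extras_require),
# enabling e.g. `pip install pyarrow[gandiva,flight]`.
```

Note that pip extras are purely additive, so the opt-in form fits setuptools naturally; the opt-out form ({{pyarrow[no-gandiva]}}) has no direct extras_require equivalent.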
[jira] [Commented] (ARROW-5825) [Python] Exceptions swallowed in ParquetManifest._visit_directories
[ https://issues.apache.org/jira/browse/ARROW-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878991#comment-16878991 ] George Sakkis commented on ARROW-5825: -- Yes, in my case it was ["Found files in an intermediate directory"|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L835] two or three levels deep in the partitioned directory tree. > [Python] Exceptions swallowed in ParquetManifest._visit_directories > --- > > Key: ARROW-5825 > URL: https://issues.apache.org/jira/browse/ARROW-5825 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: George Sakkis >Priority: Major > Labels: parquet > > {{ParquetManifest._visit_directories}} uses a {{ThreadPoolExecutor}} to visit > partitioned parquet datasets concurrently; it waits for them to finish but > doesn't check whether the respective futures have failed. This is quite > tricky to detect and debug, as an exception is either raised later as a > side-effect or (perhaps worse) passes silently. > Observed on 0.12.1 but appears to be present on latest master too.
[jira] [Created] (ARROW-5825) [Python] Exceptions swallowed in ParquetManifest._visit_directories
George Sakkis created ARROW-5825: Summary: [Python] Exceptions swallowed in ParquetManifest._visit_directories Key: ARROW-5825 URL: https://issues.apache.org/jira/browse/ARROW-5825 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: George Sakkis {{ParquetManifest._visit_directories}} uses a {{ThreadPoolExecutor}} to visit partitioned parquet datasets concurrently; it waits for them to finish but doesn't check whether the respective futures have failed. This is quite tricky to detect and debug, as an exception is either raised later as a side-effect or (perhaps worse) passes silently. Observed on 0.12.1 but appears to be present on latest master too.
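The failure mode can be reproduced with a small stand-alone sketch (not pyarrow's actual code): {{concurrent.futures.wait}} returns once all futures are done but never re-raises their exceptions; only {{Future.result()}} or {{Future.exception()}} surfaces them.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def visit_directory(path):
    # Stand-in for the per-directory work done by _visit_directories.
    if "bad" in path:
        raise ValueError("Found files in an intermediate directory: %s" % path)

executor = ThreadPoolExecutor(max_workers=2)
futures = [executor.submit(visit_directory, p) for p in ["year=2019", "bad/dir"]]

# wait() blocks until all futures finish, but does NOT re-raise failures:
done, _ = wait(futures)
assert len(done) == 2  # returns normally even though one future failed

# Exceptions only surface when each future is checked explicitly:
errors = [f.exception() for f in futures if f.exception() is not None]
executor.shutdown()
```

Checking {{f.result()}} (or {{f.exception()}}) for every future after the wait would make such failures propagate instead of passing silently.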
[jira] [Commented] (ARROW-4492) [Python] Failure reading Parquet column as pandas Categorical in 0.12
[ https://issues.apache.org/jira/browse/ARROW-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833679#comment-16833679 ] George Sakkis commented on ARROW-4492: -- [~jorisvandenbossche] indeed I don't get it on pyarrow 0.12.1, only 0.12.0 is affected. Closing > [Python] Failure reading Parquet column as pandas Categorical in 0.12 > - > > Key: ARROW-4492 > URL: https://issues.apache.org/jira/browse/ARROW-4492 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 >Reporter: George Sakkis >Priority: Major > Labels: Parquet > Fix For: 0.14.0 > > Attachments: slug.pq > > > On pyarrow 0.12.0 some (but not all) columns cannot be read as category > dtype. Attached is an extracted failing sample. > {noformat} > import dask.dataframe as dd > df = dd.read_parquet('slug.pq', categories=['slug'], > engine='pyarrow').compute() > print(len(df['slug'].dtype.categories)) > {noformat} > This works on pyarrow 0.11.1 (and fastparquet 0.2.1).
[jira] [Resolved] (ARROW-4492) [Python] Failure reading Parquet column as pandas Categorical in 0.12
[ https://issues.apache.org/jira/browse/ARROW-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] George Sakkis resolved ARROW-4492. -- Resolution: Fixed > [Python] Failure reading Parquet column as pandas Categorical in 0.12 > - > > Key: ARROW-4492 > URL: https://issues.apache.org/jira/browse/ARROW-4492 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 >Reporter: George Sakkis >Priority: Major > Labels: Parquet > Fix For: 0.12.1 > > Attachments: slug.pq > > > On pyarrow 0.12.0 some (but not all) columns cannot be read as category > dtype. Attached is an extracted failing sample. > {noformat} > import dask.dataframe as dd > df = dd.read_parquet('slug.pq', categories=['slug'], > engine='pyarrow').compute() > print(len(df['slug'].dtype.categories)) > {noformat} > This works on pyarrow 0.11.1 (and fastparquet 0.2.1).
[jira] [Updated] (ARROW-4492) [Python] Failure reading Parquet column as pandas Categorical in 0.12
[ https://issues.apache.org/jira/browse/ARROW-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] George Sakkis updated ARROW-4492: - Fix Version/s: (was: 0.14.0) 0.12.1 > [Python] Failure reading Parquet column as pandas Categorical in 0.12 > - > > Key: ARROW-4492 > URL: https://issues.apache.org/jira/browse/ARROW-4492 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 >Reporter: George Sakkis >Priority: Major > Labels: Parquet > Fix For: 0.12.1 > > Attachments: slug.pq > > > On pyarrow 0.12.0 some (but not all) columns cannot be read as category > dtype. Attached is an extracted failing sample. > {noformat} > import dask.dataframe as dd > df = dd.read_parquet('slug.pq', categories=['slug'], > engine='pyarrow').compute() > print(len(df['slug'].dtype.categories)) > {noformat} > This works on pyarrow 0.11.1 (and fastparquet 0.2.1).
[jira] [Updated] (ARROW-4406) Ignore "*_$folder$" files on S3
[ https://issues.apache.org/jira/browse/ARROW-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] George Sakkis updated ARROW-4406: - Priority: Minor (was: Major) > Ignore "*_$folder$" files on S3 > --- > > Key: ARROW-4406 > URL: https://issues.apache.org/jira/browse/ARROW-4406 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: George Sakkis >Priority: Minor > Labels: easyfix, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently reading parquet files generated by Hadoop (EMR) from S3 fails with > "ValueError: Found files in an intermediate directory" because of the > [_$folder$|http://stackoverflow.com/questions/42876195/avoid-creation-of-folder-keys-in-s3-with-hadoop-emr] > empty files. > The fix should be easy: just an extra condition in > [ParquetManifest._should_silently_exclude|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L770].
[jira] [Created] (ARROW-4492) ValueError: Categorical categories must be unique
George Sakkis created ARROW-4492: Summary: ValueError: Categorical categories must be unique Key: ARROW-4492 URL: https://issues.apache.org/jira/browse/ARROW-4492 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.12.0 Reporter: George Sakkis Attachments: slug.pq On pyarrow 0.12.0 some (but not all) columns cannot be read as category dtype. Attached is an extracted failing sample. {noformat} import dask.dataframe as dd df = dd.read_parquet('slug.pq', categories=['slug'], engine='pyarrow').compute() print(len(df['slug'].dtype.categories)) {noformat} This works on pyarrow 0.11.1 (and fastparquet 0.2.1).
[jira] [Updated] (ARROW-4076) [Python] schema validation and filters
[ https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] George Sakkis updated ARROW-4076: - Description: Currently [schema validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900] of {{ParquetDataset}} takes place before filtering. This may raise a {{ValueError}} if the schema is different in some dataset pieces, even if these pieces would be subsequently filtered out. I think validation should happen after filtering to prevent such spurious errors: {noformat} --- a/pyarrow/parquet.py +++ b/pyarrow/parquet.py @@ -878,13 +878,13 @@ if split_row_groups: raise NotImplementedError("split_row_groups not yet implemented") -if validate_schema: -self.validate_schemas() - if filters is not None: filters = _check_filters(filters) self._filter(filters) +if validate_schema: +self.validate_schemas() + def validate_schemas(self): open_file = self._get_open_file_func() {noformat} was: Currently [schema validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900] of {{ParquetDataset}} takes place before filtering. This may raise a {{ValueError}} if the schema is different in some dataset pieces, even if these pieces would be subsequently filtered out.
I think validation should happen after filtering to prevent such spurious errors: {noformat} --- a/pyarrow/parquet.py +++ b/pyarrow/parquet.py @@ -878,13 +878,13 @@ if split_row_groups: raise NotImplementedError("split_row_groups not yet implemented") -if validate_schema: -self.validate_schemas() - if filters is not None: filters = _check_filters(filters) self._filter(filters) +if validate_schema: +self.validate_schemas() + def validate_schemas(self): open_file = self._get_open_file_func() {noformat} > [Python] schema validation and filters > -- > > Key: ARROW-4076 > URL: https://issues.apache.org/jira/browse/ARROW-4076 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: George Sakkis >Priority: Minor > > Currently [schema > validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900] > of {{ParquetDataset}} takes place before filtering. This may raise a > {{ValueError}} if the schema is different in some dataset pieces, even if > these pieces would be subsequently filtered out. I think validation should > happen after filtering to prevent such spurious errors: > {noformat} > --- a/pyarrow/parquet.py > +++ b/pyarrow/parquet.py > @@ -878,13 +878,13 @@ > if split_row_groups: > raise NotImplementedError("split_row_groups not yet implemented") > > -if validate_schema: > -self.validate_schemas() > - > if filters is not None: > filters = _check_filters(filters) > self._filter(filters) > > +if validate_schema: > +self.validate_schemas() > + > def validate_schemas(self): > open_file = self._get_open_file_func() > {noformat}
[jira] [Created] (ARROW-4076) [Python] schema validation and filters
George Sakkis created ARROW-4076: Summary: [Python] schema validation and filters Key: ARROW-4076 URL: https://issues.apache.org/jira/browse/ARROW-4076 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: George Sakkis Currently [schema validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900] of {{ParquetDataset}} takes place before filtering. This may raise a {{ValueError}} if the schema is different in some dataset pieces, even if these pieces would be subsequently filtered out. I think validation should happen after filtering to prevent such spurious errors: {noformat} --- a/pyarrow/parquet.py +++ b/pyarrow/parquet.py @@ -878,13 +878,13 @@ if split_row_groups: raise NotImplementedError("split_row_groups not yet implemented") -if validate_schema: -self.validate_schemas() - if filters is not None: filters = _check_filters(filters) self._filter(filters) +if validate_schema: +self.validate_schemas() + def validate_schemas(self): open_file = self._get_open_file_func() {noformat}
[jira] [Commented] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531217#comment-16531217 ] George Sakkis commented on ARROW-1956: -- +1 to bump this from minor priority; it's effectively a blocker for working with non-trivial datasets with hundreds/thousands of partitions where only a few are needed. > [Python] Support reading specific partitions from a partitioned parquet > dataset > --- > > Key: ARROW-1956 > URL: https://issues.apache.org/jira/browse/ARROW-1956 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Affects Versions: 0.8.0 > Environment: Kernel: 4.14.8-300.fc27.x86_64 > Python: 3.6.3 >Reporter: Suvayu Ali >Priority: Minor > Labels: parquet > Fix For: 0.10.0 > > Attachments: so-example.py > > > I want to read specific partitions from a partitioned parquet dataset. This > is very useful in case of large datasets. I have attached a small script > that creates a dataset and shows what is expected when reading (quoting > salient points below). > # There is no way to read specific partitions in Pandas > # In pyarrow I tried to achieve the goal by providing a list of > files/directories to ParquetDataset, but it didn't work: > # In PySpark it works if I simply do: > {code:none} > spark.read.options('basePath', 'datadir').parquet(*list_of_partitions) > {code} > I also couldn't find a way to easily write partitioned parquet files. In the > end I did it by hand by creating the directory hierarchies, and writing the > individual files myself (similar to the implementation in the attached > script). Again, in PySpark I can do > {code:none} > df.write.partitionBy(*list_of_partitions).parquet(output) > {code} > to achieve that.
[jira] [Updated] (ARROW-2124) [Python] ArrowInvalid raised if the first item of a nested list of numpy arrays is empty
[ https://issues.apache.org/jira/browse/ARROW-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] George Sakkis updated ARROW-2124: - Summary: [Python] ArrowInvalid raised if the first item of a nested list of numpy arrays is empty (was: ArrowInvalid raised if the first item of a nested list of numpy arrays is empty) > [Python] ArrowInvalid raised if the first item of a nested list of numpy > arrays is empty > > > Key: ARROW-2124 > URL: https://issues.apache.org/jira/browse/ARROW-2124 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: George Sakkis >Priority: Major > Fix For: 0.9.0 > > > See example below: > {noformat} > In [1]: import numpy as np > In [2]: import pandas as pd > In [3]: import pyarrow as pa > In [4]: num_lists = [[2,3,4], [3,6,7,8], [], [2]] > In [5]: series = pd.Series([np.array(s, dtype=float) for s in num_lists]) > In [6]: pa.array(series) > Out[6]: > > [ > [2.0, >3.0, >4.0], > [3.0, >6.0, >7.0, >8.0], > [], > [2.0] > ] > In [7]: num_lists.append([]) > In [8]: series = pd.Series([np.array(s, dtype=float) for s in num_lists]) > In [9]: pa.array(series) > Out[9]: > > [ > [2.0, >3.0, >4.0], > [3.0, >6.0, >7.0, >8.0], > [], > [2.0], > [] > ] > In [10]: num_lists.insert(0, []) > In [11]: series = pd.Series([np.array(s, dtype=float) for s in num_lists]) > In [12]: pa.array(series) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.array(series) > array.pxi in pyarrow.lib.array() > array.pxi in pyarrow.lib._ndarray_to_array() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: trying to convert NumPy type object but got float64 > {noformat}
[jira] [Created] (ARROW-2124) ArrowInvalid raised if the first item of a nested list of numpy arrays is empty
George Sakkis created ARROW-2124: Summary: ArrowInvalid raised if the first item of a nested list of numpy arrays is empty Key: ARROW-2124 URL: https://issues.apache.org/jira/browse/ARROW-2124 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: George Sakkis Fix For: 0.9.0 See example below: {noformat} In [1]: import numpy as np In [2]: import pandas as pd In [3]: import pyarrow as pa In [4]: num_lists = [[2,3,4], [3,6,7,8], [], [2]] In [5]: series = pd.Series([np.array(s, dtype=float) for s in num_lists]) In [6]: pa.array(series) Out[6]: [ [2.0, 3.0, 4.0], [3.0, 6.0, 7.0, 8.0], [], [2.0] ] In [7]: num_lists.append([]) In [8]: series = pd.Series([np.array(s, dtype=float) for s in num_lists]) In [9]: pa.array(series) Out[9]: [ [2.0, 3.0, 4.0], [3.0, 6.0, 7.0, 8.0], [], [2.0], [] ] In [10]: num_lists.insert(0, []) In [11]: series = pd.Series([np.array(s, dtype=float) for s in num_lists]) In [12]: pa.array(series) --- ArrowInvalid Traceback (most recent call last) in () > 1 pa.array(series) array.pxi in pyarrow.lib.array() array.pxi in pyarrow.lib._ndarray_to_array() error.pxi in pyarrow.lib.check_status() ArrowInvalid: trying to convert NumPy type object but got float64 {noformat}