[jira] [Created] (ARROW-5885) Support optional arrow components via extras_require

2019-07-09 Thread George Sakkis (JIRA)
George Sakkis created ARROW-5885:


 Summary: Support optional arrow components via extras_require
 Key: ARROW-5885
 URL: https://issues.apache.org/jira/browse/ARROW-5885
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Reporter: George Sakkis


Since Arrow (and pyarrow) have several independent optional components, instead 
of installing all of them it would be convenient if these could be opted into 
from pip, like 

{{pip install pyarrow[gandiva,flight,plasma]}}

or opt-out like

{{pip install pyarrow[no-gandiva,no-flight,no-plasma]}}
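
For the opt-in form, a minimal sketch of how this could be declared with 
setuptools {{extras_require}} (the extra names and the split into separate 
per-component distributions are hypothetical, not pyarrow's actual packaging):

{noformat}
# Hypothetical sketch: optional components declared as setuptools extras.
# The per-component distribution names below are illustrative only.
from setuptools import setup

setup(
    name='pyarrow',
    extras_require={
        'gandiva': ['pyarrow-gandiva'],
        'flight': ['pyarrow-flight'],
        'plasma': ['pyarrow-plasma'],
    },
)
{noformat}

Note that {{extras_require}} can only pull in extra dependencies at install 
time; it cannot toggle compiled pieces inside a single wheel, so the opt-in 
form would likely require shipping the optional components as separate 
distributions (and plain extras offer no opt-out spelling at all).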





[jira] [Commented] (ARROW-5825) [Python] Exceptions swallowed in ParquetManifest._visit_directories

2019-07-05 Thread George Sakkis (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878991#comment-16878991
 ] 

George Sakkis commented on ARROW-5825:
--

Yes, in my case it was ["Found files in an intermediate 
directory"|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L835]
 two or three levels deep in the partitioned directory tree.

 

> [Python] Exceptions swallowed in ParquetManifest._visit_directories
> ---
>
> Key: ARROW-5825
> URL: https://issues.apache.org/jira/browse/ARROW-5825
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: George Sakkis
>Priority: Major
>  Labels: parquet
>
> {{ParquetManifest._visit_directories}} uses a {{ThreadPoolExecutor}} to visit 
> partitioned parquet datasets concurrently; it waits for the futures to finish 
> but doesn't check whether any of them failed. This is quite tricky to detect 
> and debug, as an exception is either raised later as a side-effect or (perhaps 
> worse) passes silently.
> Observed on 0.12.1 but appears to be present on latest master too.





[jira] [Created] (ARROW-5825) [Python] Exceptions swallowed in ParquetManifest._visit_directories

2019-07-02 Thread George Sakkis (JIRA)
George Sakkis created ARROW-5825:


 Summary: [Python] Exceptions swallowed in 
ParquetManifest._visit_directories
 Key: ARROW-5825
 URL: https://issues.apache.org/jira/browse/ARROW-5825
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: George Sakkis


{{ParquetManifest._visit_directories}} uses a {{ThreadPoolExecutor}} to visit 
partitioned parquet datasets concurrently; it waits for the futures to finish 
but doesn't check whether any of them failed. This is quite tricky to detect 
and debug, as an exception is either raised later as a side-effect or (perhaps 
worse) passes silently.

Observed on 0.12.1 but appears to be present on latest master too.
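
A minimal standalone sketch of the failure mode and one possible fix (this is 
illustrative, not the actual parquet.py code): {{wait()}} returns once the 
futures complete, but only calling {{result()}} on each future re-raises a 
worker's exception.

{noformat}
from concurrent.futures import ThreadPoolExecutor, wait

def visit(path):
    # Stand-in for _visit_directories' worker; always fails.
    raise ValueError('Found files in an intermediate directory: %s' % path)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(visit, p) for p in ['a', 'b']]
    wait(futures)      # completes without raising: the exception is swallowed
    for f in futures:
        f.result()     # re-raises the worker's ValueError, surfacing the bug
{noformat}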





[jira] [Commented] (ARROW-4492) [Python] Failure reading Parquet column as pandas Categorical in 0.12

2019-05-06 Thread George Sakkis (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833679#comment-16833679
 ] 

George Sakkis commented on ARROW-4492:
--

[~jorisvandenbossche] Indeed, I don't get it on pyarrow 0.12.1; only 0.12.0 is 
affected. Closing.

> [Python] Failure reading Parquet column as pandas Categorical in 0.12
> -
>
> Key: ARROW-4492
> URL: https://issues.apache.org/jira/browse/ARROW-4492
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
>Reporter: George Sakkis
>Priority: Major
>  Labels: Parquet
> Fix For: 0.14.0
>
> Attachments: slug.pq
>
>
> On pyarrow 0.12.0 some (but not all) columns cannot be read as category 
> dtype. Attached is an extracted failing sample.
>  {noformat}
> import dask.dataframe as dd
> df = dd.read_parquet('slug.pq', categories=['slug'], 
> engine='pyarrow').compute()
> print(len(df['slug'].dtype.categories))
>  {noformat}
> This works on pyarrow 0.11.1 (and fastparquet 0.2.1).





[jira] [Resolved] (ARROW-4492) [Python] Failure reading Parquet column as pandas Categorical in 0.12

2019-05-06 Thread George Sakkis (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Sakkis resolved ARROW-4492.
--
Resolution: Fixed

> [Python] Failure reading Parquet column as pandas Categorical in 0.12
> -
>
> Key: ARROW-4492
> URL: https://issues.apache.org/jira/browse/ARROW-4492
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
>Reporter: George Sakkis
>Priority: Major
>  Labels: Parquet
> Fix For: 0.12.1
>
> Attachments: slug.pq
>
>
> On pyarrow 0.12.0 some (but not all) columns cannot be read as category 
> dtype. Attached is an extracted failing sample.
>  {noformat}
> import dask.dataframe as dd
> df = dd.read_parquet('slug.pq', categories=['slug'], 
> engine='pyarrow').compute()
> print(len(df['slug'].dtype.categories))
>  {noformat}
> This works on pyarrow 0.11.1 (and fastparquet 0.2.1).





[jira] [Updated] (ARROW-4492) [Python] Failure reading Parquet column as pandas Categorical in 0.12

2019-05-06 Thread George Sakkis (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Sakkis updated ARROW-4492:
-
Fix Version/s: (was: 0.14.0)
   0.12.1

> [Python] Failure reading Parquet column as pandas Categorical in 0.12
> -
>
> Key: ARROW-4492
> URL: https://issues.apache.org/jira/browse/ARROW-4492
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
>Reporter: George Sakkis
>Priority: Major
>  Labels: Parquet
> Fix For: 0.12.1
>
> Attachments: slug.pq
>
>
> On pyarrow 0.12.0 some (but not all) columns cannot be read as category 
> dtype. Attached is an extracted failing sample.
>  {noformat}
> import dask.dataframe as dd
> df = dd.read_parquet('slug.pq', categories=['slug'], 
> engine='pyarrow').compute()
> print(len(df['slug'].dtype.categories))
>  {noformat}
> This works on pyarrow 0.11.1 (and fastparquet 0.2.1).





[jira] [Updated] (ARROW-4406) Ignore "*_$folder$" files on S3

2019-02-06 Thread George Sakkis (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Sakkis updated ARROW-4406:
-
Priority: Minor  (was: Major)

> Ignore "*_$folder$" files on S3
> ---
>
> Key: ARROW-4406
> URL: https://issues.apache.org/jira/browse/ARROW-4406
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: George Sakkis
>Priority: Minor
>  Labels: easyfix, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, reading parquet files generated by Hadoop (EMR) from S3 fails with 
> "ValueError: Found files in an intermediate directory" because of the 
> [_$folder$|http://stackoverflow.com/questions/42876195/avoid-creation-of-folder-keys-in-s3-with-hadoop-emr]
>  empty marker files. 
> The fix should be easy: just an extra condition in 
> [ParquetManifest._should_silently_exclude|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L770].
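
A sketch of the kind of condition meant (illustrative only; the names mirror 
pyarrow/parquet.py of the time, and this is not necessarily the merged patch):

{noformat}
# Illustrative sketch, not the merged patch. EXCLUDED_PARQUET_PATHS
# mirrors the module-level set in pyarrow/parquet.py of this era.
EXCLUDED_PARQUET_PATHS = {'_SUCCESS'}

def _should_silently_exclude(file_name):
    return (file_name.endswith('.crc') or        # Hadoop checksum files
            file_name.endswith('_$folder$') or   # the proposed S3 marker check
            file_name in EXCLUDED_PARQUET_PATHS)
{noformat}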





[jira] [Created] (ARROW-4492) ValueError: Categorical categories must be unique

2019-02-06 Thread George Sakkis (JIRA)
George Sakkis created ARROW-4492:


 Summary: ValueError: Categorical categories must be unique
 Key: ARROW-4492
 URL: https://issues.apache.org/jira/browse/ARROW-4492
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.12.0
Reporter: George Sakkis
 Attachments: slug.pq

On pyarrow 0.12.0 some (but not all) columns cannot be read as category dtype. 
Attached is an extracted failing sample.

 {noformat}
import dask.dataframe as dd
df = dd.read_parquet('slug.pq', categories=['slug'], engine='pyarrow').compute()
print(len(df['slug'].dtype.categories))
 {noformat}

This works on pyarrow 0.11.1 (and fastparquet 0.2.1).
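
For reference, an equivalent reproduction without dask might look like the 
following; whether {{Table.to_pandas(categories=...)}} hits the same error 
path as the dask reader is an assumption.

{noformat}
import pyarrow.parquet as pq

# Read the attached sample, then ask pandas for a categorical column.
table = pq.read_table('slug.pq')
df = table.to_pandas(categories=['slug'])
print(len(df['slug'].dtype.categories))
{noformat}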





[jira] [Updated] (ARROW-4076) [Python] schema validation and filters

2018-12-19 Thread George Sakkis (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Sakkis updated ARROW-4076:
-
Description: 
Currently [schema 
validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900]
 of {{ParquetDataset}} takes place before filtering. This may raise a 
{{ValueError}} if the schema is different in some dataset pieces, even if these 
pieces would be subsequently filtered out. I think validation should happen 
after filtering to prevent such spurious errors:
{noformat}
--- a/pyarrow/parquet.py
+++ b/pyarrow/parquet.py
@@ -878,13 +878,13 @@
 if split_row_groups:
 raise NotImplementedError("split_row_groups not yet implemented")
 
-if validate_schema:
-self.validate_schemas()
-
 if filters is not None:
 filters = _check_filters(filters)
 self._filter(filters)
 
+if validate_schema:
+self.validate_schemas()
+
 def validate_schemas(self):
 open_file = self._get_open_file_func()
{noformat}

  was:
Currently [schema 
validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900]
 of {{ParquetDataset}} takes place before filtering. This may raise a 
{{ValueError}}if the schema is different in some dataset pieces, even if these 
pieces would be subsequently filtered out. I think validation should happen 
after filtering to prevent such spurious errors:
{noformat}
--- a/pyarrow/parquet.py
+++ b/pyarrow/parquet.py
@@ -878,13 +878,13 @@
 if split_row_groups:
 raise NotImplementedError("split_row_groups not yet implemented")
 
-if validate_schema:
-self.validate_schemas()
-
 if filters is not None:
 filters = _check_filters(filters)
 self._filter(filters)
 
+if validate_schema:
+self.validate_schemas()
+
 def validate_schemas(self):
 open_file = self._get_open_file_func()
{noformat}


> [Python] schema validation and filters
> --
>
> Key: ARROW-4076
> URL: https://issues.apache.org/jira/browse/ARROW-4076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: George Sakkis
>Priority: Minor
>
> Currently [schema 
> validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900]
>  of {{ParquetDataset}} takes place before filtering. This may raise a 
> {{ValueError}} if the schema is different in some dataset pieces, even if 
> these pieces would be subsequently filtered out. I think validation should 
> happen after filtering to prevent such spurious errors:
> {noformat}
> --- a/pyarrow/parquet.py  
> +++ b/pyarrow/parquet.py  
> @@ -878,13 +878,13 @@
>  if split_row_groups:
>  raise NotImplementedError("split_row_groups not yet implemented")
>  
> -if validate_schema:
> -self.validate_schemas()
> -
>  if filters is not None:
>  filters = _check_filters(filters)
>  self._filter(filters)
>  
> +if validate_schema:
> +self.validate_schemas()
> +
>  def validate_schemas(self):
>  open_file = self._get_open_file_func()
> {noformat}





[jira] [Created] (ARROW-4076) [Python] schema validation and filters

2018-12-19 Thread George Sakkis (JIRA)
George Sakkis created ARROW-4076:


 Summary: [Python] schema validation and filters
 Key: ARROW-4076
 URL: https://issues.apache.org/jira/browse/ARROW-4076
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: George Sakkis


Currently [schema 
validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900]
 of {{ParquetDataset}} takes place before filtering. This may raise a 
{{ValueError}} if the schema is different in some dataset pieces, even if these 
pieces would be subsequently filtered out. I think validation should happen 
after filtering to prevent such spurious errors:
{noformat}
--- a/pyarrow/parquet.py
+++ b/pyarrow/parquet.py
@@ -878,13 +878,13 @@
 if split_row_groups:
 raise NotImplementedError("split_row_groups not yet implemented")
 
-if validate_schema:
-self.validate_schemas()
-
 if filters is not None:
 filters = _check_filters(filters)
 self._filter(filters)
 
+if validate_schema:
+self.validate_schemas()
+
 def validate_schemas(self):
 open_file = self._get_open_file_func()
{noformat}
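
To make the spurious error concrete, a hypothetical layout and call (the 
directory names and the extra column are invented for illustration):

{noformat}
import pyarrow.parquet as pq

# Hypothetical: pieces under data/year=2018/ carry an extra column,
# while data/year=2019/ matches the common schema. Because validation
# currently runs before filtering, this raises ValueError even though
# every mismatched piece would be filtered out.
dataset = pq.ParquetDataset('data', filters=[('year', '=', 2019)])
{noformat}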





[jira] [Commented] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2018-07-03 Thread George Sakkis (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531217#comment-16531217
 ] 

George Sakkis commented on ARROW-1956:
--

+1 to bump this from minor priority; it's effectively a blocker for working 
with non-trivial datasets with hundreds/thousands of partitions where only a 
few are needed.

> [Python] Support reading specific partitions from a partitioned parquet 
> dataset
> ---
>
> Key: ARROW-1956
> URL: https://issues.apache.org/jira/browse/ARROW-1956
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Affects Versions: 0.8.0
> Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: parquet
> Fix For: 0.10.0
>
> Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned parquet dataset.  This 
> is very useful in case of large datasets.  I have attached a small script 
> that creates a dataset and shows what is expected when reading (quoting 
> salient points below).
> # There is no way to read specific partitions in Pandas
> # In pyarrow I tried to achieve the goal by providing a list of 
> files/directories to ParquetDataset, but it didn't work: 
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
> I also couldn't find a way to easily write partitioned parquet files.  In the 
> end I did it by hand by creating the directory hierarchies, and writing the 
> individual files myself (similar to the implementation in the attached 
> script).  Again, in PySpark I can do 
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.
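
A minimal sketch of the by-hand approach described above (single partition 
column; the paths and names are invented for illustration):

{noformat}
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'part': ['a', 'a', 'b'], 'value': [1.0, 2.0, 3.0]})

# Build the Hive-style directory hierarchy and write each partition's
# rows as a separate parquet file.
for key, group in df.groupby('part'):
    part_dir = os.path.join('datadir', 'part=%s' % key)
    os.makedirs(part_dir, exist_ok=True)
    table = pa.Table.from_pandas(group.drop(columns=['part']))
    pq.write_table(table, os.path.join(part_dir, 'chunk-0.parquet'))
{noformat}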





[jira] [Updated] (ARROW-2124) [Python] ArrowInvalid raised if the first item of a nested list of numpy arrays is empty

2018-02-10 Thread George Sakkis (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Sakkis updated ARROW-2124:
-
Summary: [Python] ArrowInvalid raised if the first item of a nested list of 
numpy arrays is empty  (was: ArrowInvalid raised if the first item of a nested 
list of numpy arrays is empty)

> [Python] ArrowInvalid raised if the first item of a nested list of numpy 
> arrays is empty
> 
>
> Key: ARROW-2124
> URL: https://issues.apache.org/jira/browse/ARROW-2124
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: George Sakkis
>Priority: Major
> Fix For: 0.9.0
>
>
> See example below:
> {noformat}
> In [1]: import numpy as np
> In [2]: import pandas as pd
> In [3]: import pyarrow as pa
> In [4]: num_lists = [[2,3,4], [3,6,7,8], [], [2]]
> In [5]: series = pd.Series([np.array(s, dtype=float) for s in num_lists])
> In [6]: pa.array(series)
> Out[6]: 
> 
> [
>   [2.0,
>3.0,
>4.0],
>   [3.0,
>6.0,
>7.0,
>8.0],
>   [],
>   [2.0]
> ]
> In [7]: num_lists.append([])
> In [8]: series = pd.Series([np.array(s, dtype=float) for s in num_lists])
> In [9]: pa.array(series)
> Out[9]: 
> 
> [
>   [2.0,
>3.0,
>4.0],
>   [3.0,
>6.0,
>7.0,
>8.0],
>   [],
>   [2.0],
>   []
> ]
> In [10]: num_lists.insert(0, [])
> In [11]: series = pd.Series([np.array(s, dtype=float) for s in num_lists])
> In [12]: pa.array(series)
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in ()
> > 1 pa.array(series)
> array.pxi in pyarrow.lib.array()
> array.pxi in pyarrow.lib._ndarray_to_array()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: trying to convert NumPy type object but got float64
> {noformat}





[jira] [Created] (ARROW-2124) ArrowInvalid raised if the first item of a nested list of numpy arrays is empty

2018-02-10 Thread George Sakkis (JIRA)
George Sakkis created ARROW-2124:


 Summary: ArrowInvalid raised if the first item of a nested list of 
numpy arrays is empty
 Key: ARROW-2124
 URL: https://issues.apache.org/jira/browse/ARROW-2124
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: George Sakkis
 Fix For: 0.9.0


See example below:
{noformat}
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import pyarrow as pa

In [4]: num_lists = [[2,3,4], [3,6,7,8], [], [2]]

In [5]: series = pd.Series([np.array(s, dtype=float) for s in num_lists])

In [6]: pa.array(series)
Out[6]: 

[
  [2.0,
   3.0,
   4.0],
  [3.0,
   6.0,
   7.0,
   8.0],
  [],
  [2.0]
]

In [7]: num_lists.append([])

In [8]: series = pd.Series([np.array(s, dtype=float) for s in num_lists])

In [9]: pa.array(series)
Out[9]: 

[
  [2.0,
   3.0,
   4.0],
  [3.0,
   6.0,
   7.0,
   8.0],
  [],
  [2.0],
  []
]

In [10]: num_lists.insert(0, [])

In [11]: series = pd.Series([np.array(s, dtype=float) for s in num_lists])

In [12]: pa.array(series)
---
ArrowInvalid  Traceback (most recent call last)
 in ()
> 1 pa.array(series)

array.pxi in pyarrow.lib.array()

array.pxi in pyarrow.lib._ndarray_to_array()

error.pxi in pyarrow.lib.check_status()

ArrowInvalid: trying to convert NumPy type object but got float64
{noformat}
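
A possible workaround (an assumption on my part, not verified against 0.8.0): 
pass an explicit type so the conversion doesn't depend on inferring the 
element type from the empty first item.

{noformat}
import numpy as np
import pandas as pd
import pyarrow as pa

num_lists = [[], [2, 3, 4], [3, 6, 7, 8], [], [2], []]
series = pd.Series([np.array(s, dtype=float) for s in num_lists])

# Explicit list<double> type instead of inference from the first element.
arr = pa.array(series, type=pa.list_(pa.float64()))
{noformat}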


