[jira] [Updated] (ARROW-5181) [Rust] Create Arrow File reader

2019-04-17 Thread Neville Dipale (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-5181:
--
Fix Version/s: 0.14.0

> [Rust] Create Arrow File reader
> ---
>
> Key: ARROW-5181
> URL: https://issues.apache.org/jira/browse/ARROW-5181
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Initial support for reading the Arrow File format



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5181) [Rust] Create Arrow File reader

2019-04-17 Thread Neville Dipale (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-5181:
--
Labels: pull-request-available  (was: )

> [Rust] Create Arrow File reader
> ---
>
> Key: ARROW-5181
> URL: https://issues.apache.org/jira/browse/ARROW-5181
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>
> Initial support for reading the Arrow File format



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5180) [Rust] IPC Support

2019-04-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5180:
--
Labels: pull-request-available  (was: )

> [Rust] IPC Support
> --
>
> Key: ARROW-5180
> URL: https://issues.apache.org/jira/browse/ARROW-5180
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>
> The overall ticket to keep track of initial IPC support



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-17 Thread Pearu Peterson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820504#comment-16820504
 ] 

Pearu Peterson commented on ARROW-1983:
---

Arrow [PR 4166|https://github.com/apache/arrow/pull/4166] implements approach 1 
above.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to a new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.
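As a rough illustration of what the requested feature could look like, here is a 
minimal sketch; the {{metadata_collector}} keyword and the {{pq.write_metadata}} 
helper are assumptions of the sketch, not an API this thread defines.

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays(
    [pa.array([1, 1, 2, 2]), pa.array([0.1, 0.2, 0.3, 0.4])],
    names=["a", "b"])

collected = []  # would receive one FileMetaData object per written file
pq.write_to_dataset(table, root_path="dataset_root",
                    partition_cols=["a"],
                    metadata_collector=collected)

# Combine the collected row-group metadata into a single _metadata file so a
# reader can prune row groups by statistics without opening every data file.
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=collected)
{code}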



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1983:
--
Labels: beginner parquet pull-request-available  (was: beginner parquet)

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to a new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-5144) [Python] ParquetDataset and ParquetPiece not serializable

2019-04-17 Thread Sarah Bird (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820483#comment-16820483
 ] 

Sarah Bird edited comment on ARROW-5144 at 4/17/19 8:39 PM:


I should add: I can also get the cloudpickle error that is reported in the 
original report with the same file, but it is invariant between pyarrow 0.12.1 
and 0.13.0. The above distributed error is what is changing for me.


was (Author: birdsarah):
I should add, I can also get the cloudpickle error that is reported by the 
original report with the same file, but it is invariant between pyarrow 0.12.1 
and 0.13.0. The above distributed error is what is changing for me.

> [Python] ParquetDataset and ParquetPiece not serializable
> -
>
> Key: ARROW-5144
> URL: https://issues.apache.org/jira/browse/ARROW-5144
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: osx python36/conda cloudpickle 0.8.1
> arrow-cpp 0.13.0   py36ha71616b_0conda-forge
> pyarrow   0.13.0   py36hb37e6aa_0conda-forge
>Reporter: Martin Durant
>Assignee: Krisztian Szucs
>Priority: Critical
>  Labels: pull-request-available
> Attachments: part.0.parquet
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Since 0.13.0, parquet instances are no longer serialisable, which means that 
> dask.distributed cannot pass them between processes in order to load parquet 
> in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dumps(obj, protocol)
> 893 try:
> 894 cp = CloudPickler(file, protocol=protocol)
> --> 895 cp.dump(obj)
> 896 return file.getvalue()
> 897 finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dump(self, obj)
> 266 self.inject_addons()
> 267 try:
> --> 268 return Pickler.dump(self, obj)
> 269 except RuntimeError as e:
> 270 if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
> 407 if self.proto >= 4:
> 408 self.framer.start_framing()
> --> 409 self.save(obj)
> 410 self.write(STOP)
> 411 self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 519
> 520 # Save the reduce() output and finally memoize the object
> --> 521 self.save_reduce(obj=obj, *rv)
> 522
> 523 def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, 
> state, listitems, dictitems, obj)
> 632
> 633 if state is not None:
> --> 634 save(state)
> 635 write(BUILD)
> 636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 474 f = self.dispatch.get(t)
> 475 if f is not None:
> --> 476 f(self, obj) # Call unbound method with explicit self
> 477 return
> 478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
> 819
> 820 self.memoize(obj)
> --> 821 self._batch_setitems(obj.items())
> 822
> 823 dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
> 845 for k, v in tmp:
> 846 save(k)
> --> 847 save(v)
> 848 write(SETITEMS)
> 849 elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 494 reduce = getattr(obj, "__reduce_ex__", None)
> 495 if reduce is not None:
> --> 496 rv = reduce(self.proto)
> 497 else:
> 498 reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so
>  in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated schema instance is also referenced by the ParquetDatasetPiece objects.
> ref: https://github.com/dask/distributed/issues/2597
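For background on the {{__reduce__}} error above, a minimal self-contained sketch 
(plain Python, deliberately not pyarrow internals; the class name and state are 
hypothetical) of the pattern a fix typically follows: implement {{__reduce__}} so 
the object is rebuilt from picklable state instead of failing on non-trivial 
native state.

{code:python}
import pickle

class SchemaLike:
    # Illustrative stand-in for an extension type whose real state lives
    # outside Python (e.g. parsed Parquet/Thrift metadata).
    def __init__(self, thrift_bytes):
        self.thrift_bytes = thrift_bytes

    def __reduce__(self):
        # Return a callable plus the arguments needed to reconstruct an
        # equivalent object when unpickling.
        return (SchemaLike, (self.thrift_bytes,))

restored = pickle.loads(pickle.dumps(SchemaLike(b"serialized schema")))
print(restored.thrift_bytes)
{code}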



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-5144) [Python] ParquetDataset and ParquetPiece not serializable

2019-04-17 Thread Sarah Bird (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820480#comment-16820480
 ] 

Sarah Bird edited comment on ARROW-5144 at 4/17/19 8:38 PM:


[~kszucs] this is with variations on my dataset. I have attached one piece 
which is sufficient to reproduce the error. It is web crawl data. The dtypes 
are:
{code}
argument_0                      object
argument_1                      object
argument_2                      object
argument_3                      object
argument_4                      object
argument_5                      object
argument_6                      object
argument_7                      object
arguments                       object
arguments_len                    int64
call_stack                      object
crawl_id                         int32
document_url                    object
func_name                       object
in_iframe                         bool
operation                       object
script_col                       int64
script_line                      int64
script_loc_eval                 object
script_url                      object
symbol                          object
time_stamp         datetime64[ns, UTC]
top_level_url                   object
value_1000                      object
value_len                        int64
visit_id                         int64
dtype: object
{code}
My traceback is:
{code:java}
distributed.protocol.pickle - INFO - Failed to serialize (, (, 
, 
ParquetDatasetPiece('javascript_10percent_value_1000_only.parquet/part.0.parquet',
 row_group=None, partition_keys=[]), ['argument_0', 'argument_1', 'argument_2', 
'argument_3', 'argument_4', 'argument_5'
, 'argument_6', 'argument_7', 'arguments', 'arguments_len', 'call_stack', 
'crawl_id', 'document_url', 'func_name', 'in_iframe', 'operation', 
'script_col', 'script_line', 'script_loc_eval', 'script_url', 'symbol
', 'time_stamp', 'top_level_url', 'value_1000', 'value_len', 'visit_id'], [], 
False, None, []), 5). Exception: no default __reduce__ due to non-trivial 
__cinit__
distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
  File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/core.py", line 54, in dumps
    for key, value in data.items()
  File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/core.py", line 55, in 
    if type(value) is Serialize}
  File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 164, in serialize
    raise TypeError(msg, str(x)[:1])
TypeError: ('Could not serialize object of type tuple.', "(, (, 
, 
ParquetDatasetPiece('javascript_10percent_value_1000_only.parquet/part.0.parquet',
 row_group=None, partition_keys=[]), ['argument_0', 'argument_1', 'argument_2', 
'argument_3', 'argument_4', 'argument_5
', 'argument_6', 'argument_7', 'arguments', 'arguments_len', 'call_stack', 
'crawl_id', 'document_url', 'func_name', 'in_iframe', 'operation', 
'script_col', 'script_line', 'script_loc_eval', 'script_url', 'symbo
l', 'time_stamp', 'top_level_url', 'value_1000', 'value_len', 'visit_id'], [], 
False, None, []), 5)")  
distributed.comm.utils - INFO - Unserializable Message: [{'op': 'update-graph', 
'tasks': {"('head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)": 
, 
"('read-parquet-head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)": 
, (, , 
ParquetDatasetPiece('javascript_10percent_value_1000_only.parquet/part.0.parquet',
 row_group=None, partition_keys=[]), ['argument_0', '
argument_1', 'argument_2', 'argument_3', 'argument_4', 'argument_5', 
'argument_6', 'argument_7', 'arguments', 'arguments_len', 'call_stack', 
'crawl_id', 'document_url', 'func_name', 'in_iframe', 'operation', 's
cript_col', 'script_line', 'script_loc_eval', 'script_url', 'symbol', 
'time_stamp', 'top_level_url', 'value_1000', 'value_len', 'visit_id'], [], 
False, None, []), 5)>}, 'dependencies': {"('head-1-5-read-parquet
-daaccee11e9cff29ad1ee5622ffd6c69', 0)": 
["('read-parquet-head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)"], 
"('read-parquet-head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)": 
[]}, 'keys'
: 

[jira] [Commented] (ARROW-5144) [Python] ParquetDataset and ParquetPiece not serializable

2019-04-17 Thread Sarah Bird (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820483#comment-16820483
 ] 

Sarah Bird commented on ARROW-5144:
---

I should add, I can also get the cloudpickle error that is reported by the 
original report with the same file, but it is invariant between pyarrow 0.12.1 
and 0.13.0. The above distributed error is what is changing for me.

> [Python] ParquetDataset and ParquetPiece not serializable
> -
>
> Key: ARROW-5144
> URL: https://issues.apache.org/jira/browse/ARROW-5144
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: osx python36/conda cloudpickle 0.8.1
> arrow-cpp 0.13.0   py36ha71616b_0conda-forge
> pyarrow   0.13.0   py36hb37e6aa_0conda-forge
>Reporter: Martin Durant
>Assignee: Krisztian Szucs
>Priority: Critical
>  Labels: pull-request-available
> Attachments: part.0.parquet
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Since 0.13.0, parquet instances are no longer serialisable, which means that 
> dask.distributed cannot pass them between processes in order to load parquet 
> in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dumps(obj, protocol)
> 893 try:
> 894 cp = CloudPickler(file, protocol=protocol)
> --> 895 cp.dump(obj)
> 896 return file.getvalue()
> 897 finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dump(self, obj)
> 266 self.inject_addons()
> 267 try:
> --> 268 return Pickler.dump(self, obj)
> 269 except RuntimeError as e:
> 270 if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
> 407 if self.proto >= 4:
> 408 self.framer.start_framing()
> --> 409 self.save(obj)
> 410 self.write(STOP)
> 411 self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 519
> 520 # Save the reduce() output and finally memoize the object
> --> 521 self.save_reduce(obj=obj, *rv)
> 522
> 523 def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, 
> state, listitems, dictitems, obj)
> 632
> 633 if state is not None:
> --> 634 save(state)
> 635 write(BUILD)
> 636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 474 f = self.dispatch.get(t)
> 475 if f is not None:
> --> 476 f(self, obj) # Call unbound method with explicit self
> 477 return
> 478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
> 819
> 820 self.memoize(obj)
> --> 821 self._batch_setitems(obj.items())
> 822
> 823 dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
> 845 for k, v in tmp:
> 846 save(k)
> --> 847 save(v)
> 848 write(SETITEMS)
> 849 elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 494 reduce = getattr(obj, "__reduce_ex__", None)
> 495 if reduce is not None:
> --> 496 rv = reduce(self.proto)
> 497 else:
> 498 reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so
>  in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated schema instance is also referenced by the ParquetDatasetPiece objects.
> ref: https://github.com/dask/distributed/issues/2597



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5144) [Python] ParquetDataset and ParquetPiece not serializable

2019-04-17 Thread Sarah Bird (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820480#comment-16820480
 ] 

Sarah Bird commented on ARROW-5144:
---

[~kszucs] this is with variations on my dataset. I have attached one piece 
which is sufficient to reproduce the error. It is web crawl data. The dtypes 
are:

{code:python}
argument_0                      object
argument_1                      object
argument_2                      object
argument_3                      object
argument_4                      object
argument_5                      object
argument_6                      object
argument_7                      object
arguments                       object
arguments_len                    int64
call_stack                      object
crawl_id                         int32
document_url                    object
func_name                       object
in_iframe                         bool
operation                       object
script_col                       int64
script_line                      int64
script_loc_eval                 object
script_url                      object
symbol                          object
time_stamp         datetime64[ns, UTC]
top_level_url                   object
value_1000                      object
value_len                        int64
visit_id                         int64
dtype: object
{code}

The end of my traceback is:

{code}
  File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/core.py", line 55, in 
    if type(value) is Serialize}
  File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 164, in serialize
    raise TypeError(msg, str(x)[:1])
TypeError: ('Could not serialize object of type tuple.', "(, (, 
, 
ParquetDatasetPiece('javascript_10percent_value_1000_only.parquet/part.0.parquet',
 row_group=None, partition_keys=[]), ['argument_0', 'argument_1', 'argument_2', 
'argument_3', 'argument_4', 'argument_5', 'argument_6', 'argument_7', 
'arguments', 'arguments_len', 'call_stack', 'crawl_id', 'document_url', 
'func_name', 'in_iframe', 'operation', 'script_col', 'script_line', 
'script_loc_eval', 'script_url', 'symbol', 'time_stamp', 'top_level_url', 
'value_1000', 'value_len', 'visit_id'], [], False, None, []), 5)")
{code}

[^part.0.parquet] 

> [Python] ParquetDataset and ParquetPiece not serializable
> -
>
> Key: ARROW-5144
> URL: https://issues.apache.org/jira/browse/ARROW-5144
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: osx python36/conda cloudpickle 0.8.1
> arrow-cpp 0.13.0   py36ha71616b_0conda-forge
> pyarrow   0.13.0   py36hb37e6aa_0conda-forge
>Reporter: Martin Durant
>Assignee: Krisztian Szucs
>Priority: Critical
>  Labels: pull-request-available
> Attachments: part.0.parquet
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Since 0.13.0, parquet instances are no longer serialisable, which means that 
> dask.distributed cannot pass them between processes in order to load parquet 
> in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dumps(obj, protocol)
> 893 try:
> 894 cp = CloudPickler(file, protocol=protocol)
> --> 895 cp.dump(obj)
> 896 return file.getvalue()
> 897 finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dump(self, obj)
> 266 self.inject_addons()
> 267 try:
> --> 268 return Pickler.dump(self, obj)
> 269 except RuntimeError as e:
> 270 if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
> 407 if self.proto >= 4:
> 408 self.framer.start_framing()
> --> 409 self.save(obj)
> 410 self.write(STOP)
> 411 self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 519
> 520 # Save the reduce() output and finally memoize the object
> --> 521 self.save_reduce(obj=obj, *rv)
> 522
> 523 def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, 
> state, listitems, dictitems, obj)
> 632
> 633 if state is not None:
> --> 634 save(state)
> 635 write(BUILD)
> 636
> 

[jira] [Updated] (ARROW-5144) [Python] ParquetDataset and ParquetPiece not serializable

2019-04-17 Thread Sarah Bird (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Bird updated ARROW-5144:
--
Attachment: part.0.parquet

> [Python] ParquetDataset and ParquetPiece not serializable
> -
>
> Key: ARROW-5144
> URL: https://issues.apache.org/jira/browse/ARROW-5144
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: osx python36/conda cloudpickle 0.8.1
> arrow-cpp 0.13.0   py36ha71616b_0conda-forge
> pyarrow   0.13.0   py36hb37e6aa_0conda-forge
>Reporter: Martin Durant
>Assignee: Krisztian Szucs
>Priority: Critical
>  Labels: pull-request-available
> Attachments: part.0.parquet
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Since 0.13.0, parquet instances are no longer serialisable, which means that 
> dask.distributed cannot pass them between processes in order to load parquet 
> in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dumps(obj, protocol)
> 893 try:
> 894 cp = CloudPickler(file, protocol=protocol)
> --> 895 cp.dump(obj)
> 896 return file.getvalue()
> 897 finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dump(self, obj)
> 266 self.inject_addons()
> 267 try:
> --> 268 return Pickler.dump(self, obj)
> 269 except RuntimeError as e:
> 270 if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
> 407 if self.proto >= 4:
> 408 self.framer.start_framing()
> --> 409 self.save(obj)
> 410 self.write(STOP)
> 411 self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 519
> 520 # Save the reduce() output and finally memoize the object
> --> 521 self.save_reduce(obj=obj, *rv)
> 522
> 523 def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, 
> state, listitems, dictitems, obj)
> 632
> 633 if state is not None:
> --> 634 save(state)
> 635 write(BUILD)
> 636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 474 f = self.dispatch.get(t)
> 475 if f is not None:
> --> 476 f(self, obj) # Call unbound method with explicit self
> 477 return
> 478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
> 819
> 820 self.memoize(obj)
> --> 821 self._batch_setitems(obj.items())
> 822
> 823 dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
> 845 for k, v in tmp:
> 846 save(k)
> --> 847 save(v)
> 848 write(SETITEMS)
> 849 elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 494 reduce = getattr(obj, "__reduce_ex__", None)
> 495 if reduce is not None:
> --> 496 rv = reduce(self.proto)
> 497 else:
> 498 reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so
>  in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated schema instance is also referenced by the ParquetDatasetPiece objects.
> ref: https://github.com/dask/distributed/issues/2597



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5144) [Python] ParquetDataset and ParquetPiece not serializable

2019-04-17 Thread Krisztian Szucs (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820466#comment-16820466
 ] 

Krisztian Szucs commented on ARROW-5144:


[~birdsarah] with any parquet file?

> [Python] ParquetDataset and ParquetPiece not serializable
> -
>
> Key: ARROW-5144
> URL: https://issues.apache.org/jira/browse/ARROW-5144
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: osx python36/conda cloudpickle 0.8.1
> arrow-cpp 0.13.0   py36ha71616b_0conda-forge
> pyarrow   0.13.0   py36hb37e6aa_0conda-forge
>Reporter: Martin Durant
>Assignee: Krisztian Szucs
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Since 0.13.0, parquet instances are no longer serialisable, which means that 
> dask.distributed cannot pass them between processes in order to load parquet 
> in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dumps(obj, protocol)
> 893 try:
> 894 cp = CloudPickler(file, protocol=protocol)
> --> 895 cp.dump(obj)
> 896 return file.getvalue()
> 897 finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dump(self, obj)
> 266 self.inject_addons()
> 267 try:
> --> 268 return Pickler.dump(self, obj)
> 269 except RuntimeError as e:
> 270 if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
> 407 if self.proto >= 4:
> 408 self.framer.start_framing()
> --> 409 self.save(obj)
> 410 self.write(STOP)
> 411 self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 519
> 520 # Save the reduce() output and finally memoize the object
> --> 521 self.save_reduce(obj=obj, *rv)
> 522
> 523 def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, 
> state, listitems, dictitems, obj)
> 632
> 633 if state is not None:
> --> 634 save(state)
> 635 write(BUILD)
> 636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 474 f = self.dispatch.get(t)
> 475 if f is not None:
> --> 476 f(self, obj) # Call unbound method with explicit self
> 477 return
> 478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
> 819
> 820 self.memoize(obj)
> --> 821 self._batch_setitems(obj.items())
> 822
> 823 dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
> 845 for k, v in tmp:
> 846 save(k)
> --> 847 save(v)
> 848 write(SETITEMS)
> 849 elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 494 reduce = getattr(obj, "__reduce_ex__", None)
> 495 if reduce is not None:
> --> 496 rv = reduce(self.proto)
> 497 else:
> 498 reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so
>  in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated schema instance is also referenced by the ParquetDatasetPiece objects.
> ref: https://github.com/dask/distributed/issues/2597



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-5144) [Python] ParquetDataset and ParquetPiece not serializable

2019-04-17 Thread Sarah Bird (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820463#comment-16820463
 ] 

Sarah Bird edited comment on ARROW-5144 at 4/17/19 8:05 PM:


This might not be what you were looking for [~pitrou] but this is what breaks 
consistently for me from 0.12.1 to 0.13.0: 

{code:python}
import dask.dataframe as dd
from dask.distributed import Client

Client()
df = dd.read_parquet('my_data.parquet', engine='pyarrow')
df.head()
{code}
 
(dask 1.2.0, distributed 1.27.0)

Let me know if I can better help.


was (Author: birdsarah):
This might not be what you were looking for [~pitrou] but this is what breaks 
consistently for me from 0.12.1 to 0.13.0:

 

{{import dask.dataframe as ddfrom dask.distributed import ClientClient()df = 
dd.read_parquet('my_data.parquet', engine='pyarrow')}}{{df.head()}}

 

(dask 1.2.0, distributed 1.27.0)

Let me know if I can better help.

> [Python] ParquetDataset and ParquetPiece not serializable
> -
>
> Key: ARROW-5144
> URL: https://issues.apache.org/jira/browse/ARROW-5144
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: osx python36/conda cloudpickle 0.8.1
> arrow-cpp 0.13.0   py36ha71616b_0conda-forge
> pyarrow   0.13.0   py36hb37e6aa_0conda-forge
>Reporter: Martin Durant
>Assignee: Krisztian Szucs
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Since 0.13.0, parquet instances are no longer serialisable, which means that 
> dask.distributed cannot pass them between processes in order to load parquet 
> in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dumps(obj, protocol)
> 893 try:
> 894 cp = CloudPickler(file, protocol=protocol)
> --> 895 cp.dump(obj)
> 896 return file.getvalue()
> 897 finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dump(self, obj)
> 266 self.inject_addons()
> 267 try:
> --> 268 return Pickler.dump(self, obj)
> 269 except RuntimeError as e:
> 270 if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
> 407 if self.proto >= 4:
> 408 self.framer.start_framing()
> --> 409 self.save(obj)
> 410 self.write(STOP)
> 411 self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 519
> 520 # Save the reduce() output and finally memoize the object
> --> 521 self.save_reduce(obj=obj, *rv)
> 522
> 523 def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, 
> state, listitems, dictitems, obj)
> 632
> 633 if state is not None:
> --> 634 save(state)
> 635 write(BUILD)
> 636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 474 f = self.dispatch.get(t)
> 475 if f is not None:
> --> 476 f(self, obj) # Call unbound method with explicit self
> 477 return
> 478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
> 819
> 820 self.memoize(obj)
> --> 821 self._batch_setitems(obj.items())
> 822
> 823 dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
> 845 for k, v in tmp:
> 846 save(k)
> --> 847 save(v)
> 848 write(SETITEMS)
> 849 elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 494 reduce = getattr(obj, "__reduce_ex__", None)
> 495 if reduce is not None:
> --> 496 rv = reduce(self.proto)
> 497 else:
> 498 reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so
>  in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated schema instance is also referenced by the ParquetDatasetPiece objects.
> ref: https://github.com/dask/distributed/issues/2597



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (ARROW-5144) [Python] ParquetDataset and ParquetPiece not serializable

2019-04-17 Thread Sarah Bird (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820463#comment-16820463
 ] 

Sarah Bird commented on ARROW-5144:
---

This might not be what you were looking for [~pitrou] but this is what breaks 
consistently for me from 0.12.1 to 0.13.0:

 

{{import dask.dataframe as ddfrom dask.distributed import ClientClient()df = 
dd.read_parquet('my_data.parquet', engine='pyarrow')}}{{df.head()}}

 

(dask 1.2.0, distributed 1.27.0)

Let me know if I can better help.

> [Python] ParquetDataset and ParquetPiece not serializable
> -
>
> Key: ARROW-5144
> URL: https://issues.apache.org/jira/browse/ARROW-5144
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: osx python36/conda cloudpickle 0.8.1
> arrow-cpp 0.13.0   py36ha71616b_0conda-forge
> pyarrow   0.13.0   py36hb37e6aa_0conda-forge
>Reporter: Martin Durant
>Assignee: Krisztian Szucs
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Since 0.13.0, parquet instances are no longer serialisable, which means that 
> dask.distributed cannot pass them between processes in order to load parquet 
> in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dumps(obj, protocol)
> 893 try:
> 894 cp = CloudPickler(file, protocol=protocol)
> --> 895 cp.dump(obj)
> 896 return file.getvalue()
> 897 finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py 
> in dump(self, obj)
> 266 self.inject_addons()
> 267 try:
> --> 268 return Pickler.dump(self, obj)
> 269 except RuntimeError as e:
> 270 if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
> 407 if self.proto >= 4:
> 408 self.framer.start_framing()
> --> 409 self.save(obj)
> 410 self.write(STOP)
> 411 self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 519
> 520 # Save the reduce() output and finally memoize the object
> --> 521 self.save_reduce(obj=obj, *rv)
> 522
> 523 def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, 
> state, listitems, dictitems, obj)
> 632
> 633 if state is not None:
> --> 634 save(state)
> 635 write(BUILD)
> 636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 474 f = self.dispatch.get(t)
> 475 if f is not None:
> --> 476 f(self, obj) # Call unbound method with explicit self
> 477 return
> 478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
> 819
> 820 self.memoize(obj)
> --> 821 self._batch_setitems(obj.items())
> 822
> 823 dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
> 845 for k, v in tmp:
> 846 save(k)
> --> 847 save(v)
> 848 write(SETITEMS)
> 849 elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, 
> save_persistent_id)
> 494 reduce = getattr(obj, "__reduce_ex__", None)
> 495 if reduce is not None:
> --> 496 rv = reduce(self.proto)
> 497 else:
> 498 reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so
>  in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated schema instance is also referenced by the ParquetDatasetPiece objects.
> ref: https://github.com/dask/distributed/issues/2597



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5181) [Rust] Create Arrow File reader

2019-04-17 Thread Neville Dipale (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-5181:
-

Assignee: Neville Dipale

> [Rust] Create Arrow File reader
> ---
>
> Key: ARROW-5181
> URL: https://issues.apache.org/jira/browse/ARROW-5181
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>
> Initial support for reading the Arrow File format



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5182) [Rust] Create Arrow File writer

2019-04-17 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5182:
-

 Summary: [Rust] Create Arrow File writer
 Key: ARROW-5182
 URL: https://issues.apache.org/jira/browse/ARROW-5182
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5181) [Rust] Create Arrow File reader

2019-04-17 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5181:
-

 Summary: [Rust] Create Arrow File reader
 Key: ARROW-5181
 URL: https://issues.apache.org/jira/browse/ARROW-5181
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


Initial support for reading the Arrow File format



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5180) [Rust] IPC Support

2019-04-17 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5180:
-

 Summary: [Rust] IPC Support
 Key: ARROW-5180
 URL: https://issues.apache.org/jira/browse/ARROW-5180
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Neville Dipale


The overall ticket to keep track of initial IPC support



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5178) [Python] Allow creating Table from Python dict

2019-04-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5178:
--
Labels: pull-request-available  (was: )

> [Python] Allow creating Table from Python dict
> --
>
> Key: ARROW-5178
> URL: https://issues.apache.org/jira/browse/ARROW-5178
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> There's already {{Table.to_pydict()}}, we should probably have the reverse 
> {{Table.from_pydict()}} method.
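A short sketch of how the proposed method would likely be used, mirroring the 
existing {{to_pydict()}} round trip; the exact signature is an assumption, since 
the method is only being proposed in this ticket.

{code:python}
import pyarrow as pa

data = {"id": [1, 2, 3], "name": ["a", "b", "c"]}

# Proposed: build a Table directly from a column-name -> values mapping ...
table = pa.Table.from_pydict(data)

# ... which would make this a lossless round trip with the existing method.
assert table.to_pydict() == data
{code}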



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5178) [Python] Allow creating Table from Python dict

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-5178:
-

Assignee: Antoine Pitrou

> [Python] Allow creating Table from Python dict
> --
>
> Key: ARROW-5178
> URL: https://issues.apache.org/jira/browse/ARROW-5178
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> There's already {{Table.to_pydict()}}, we should probably have the reverse 
> {{Table.from_pydict()}} method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5177) [Python] ParquetReader.read_column() doesn't check bounds

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5177:
--
Fix Version/s: 0.14.0

> [Python] ParquetReader.read_column() doesn't check bounds
> -
>
> Key: ARROW-5177
> URL: https://issues.apache.org/jira/browse/ARROW-5177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If you call {{ParquetReader.read_column()}} with an invalid column number, it 
> just crashes.
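Until bounds checking lands, a defensive wrapper at the Python level illustrates 
the missing check; the helper below is hypothetical and not part of pyarrow.

{code:python}
import pyarrow.parquet as pq

def read_column_checked(pf: pq.ParquetFile, column_index: int):
    # Validate the index up front instead of letting an out-of-range value
    # reach the native reader and crash the process.
    names = pf.schema.to_arrow_schema().names
    if not 0 <= column_index < len(names):
        raise IndexError("column index %d out of range; file has %d columns"
                         % (column_index, len(names)))
    return pf.read(columns=[names[column_index]])

# Usage (hypothetical file name):
# table = read_column_checked(pq.ParquetFile("data.parquet"), 0)
{code}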



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5091) [Flight] Rename FlightGetInfo message to FlightInfo

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5091.
---
Resolution: Fixed

Issue resolved by pull request 4143
[https://github.com/apache/arrow/pull/4143]

> [Flight] Rename FlightGetInfo message to FlightInfo
> ---
>
> Key: ARROW-5091
> URL: https://issues.apache.org/jira/browse/ARROW-5091
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Per mailing list discussion



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5177) [Python] ParquetReader.read_column() doesn't check bounds

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-5177:
-

Assignee: Antoine Pitrou

> [Python] ParquetReader.read_column() doesn't check bounds
> -
>
> Key: ARROW-5177
> URL: https://issues.apache.org/jira/browse/ARROW-5177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If you call {{ParquetReader.read_column()}} with an invalid column number, it 
> just crashes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5177) [Python] ParquetReader.read_column() doesn't check bounds

2019-04-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5177:
--
Labels: pull-request-available  (was: )

> [Python] ParquetReader.read_column() doesn't check bounds
> -
>
> Key: ARROW-5177
> URL: https://issues.apache.org/jira/browse/ARROW-5177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> If you call {{ParquetReader.read_column()}} with an invalid column number, it 
> just crashes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5177) [Python] ParquetReader.read_column() doesn't check bounds

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5177:
--
Component/s: C++

> [Python] ParquetReader.read_column() doesn't check bounds
> -
>
> Key: ARROW-5177
> URL: https://issues.apache.org/jira/browse/ARROW-5177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> If you call {{ParquetReader.read_column()}} with an invalid column number, it 
> just crashes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5179) [Python] Return plain dicts, not OrderedDict, on Python 3.7+

2019-04-17 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5179:
-

 Summary: [Python] Return plain dicts, not OrderedDict, on Python 
3.7+
 Key: ARROW-5179
 URL: https://issues.apache.org/jira/browse/ARROW-5179
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.13.0
Reporter: Antoine Pitrou
 Fix For: 0.14.0


On Python 3.7 and onwards, builtin dict is guaranteed to be insertion-ordered. 
So there's no need to return an OrderedDict anymore.

(builtin dict is slightly smaller, faster, and can have a nicer repr output)
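A quick illustration of the guarantee being relied on here (standard-library 
behaviour, nothing pyarrow-specific):

{code:python}
import sys
from collections import OrderedDict

pairs = [("b", 2), ("a", 1), ("c", 3)]

# On CPython 3.7+ a plain dict preserves insertion order, so it carries the
# same ordering information as OrderedDict while being smaller and faster.
assert sys.version_info >= (3, 7)
assert list(dict(pairs)) == list(OrderedDict(pairs)) == ["b", "a", "c"]
{code}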



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5178) [Python] Allow creating Table from Python dict

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5178:
--
Description: There's already {{Table.to_pydict()}}, we should probably have 
the reverse {{Table.from_pydict()}} method.

> [Python] Allow creating Table from Python dict
> --
>
> Key: ARROW-5178
> URL: https://issues.apache.org/jira/browse/ARROW-5178
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> There's already {{Table.to_pydict()}}, we should probably have the reverse 
> {{Table.from_pydict()}} method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5178) [Python] Allow creating Table from Python dict

2019-04-17 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5178:
-

 Summary: [Python] Allow creating Table from Python dict
 Key: ARROW-5178
 URL: https://issues.apache.org/jira/browse/ARROW-5178
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Antoine Pitrou
 Fix For: 0.14.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3396) [Java] VectorSchemaRoot.create(schema, allocator) doesn't create dictionary encoded vector correctly

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3396:
--
Component/s: Java

> [Java] VectorSchemaRoot.create(schema, allocator) doesn't create dictionary 
> encoded vector correctly
> 
>
> Key: ARROW-3396
> URL: https://issues.apache.org/jira/browse/ARROW-3396
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Li Jin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3453) [Packaging][OSX] Fix plasma test failure during release verification

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3453:
--
Component/s: Packaging

> [Packaging][OSX] Fix plasma test failure during release verification
> 
>
> Key: ARROW-3453
> URL: https://issues.apache.org/jira/browse/ARROW-3453
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
>
> Plasma tests fail for OSX with the following error: 
> /var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-0.11.0.X.Psg69q8j/apache-arrow-0.11.0/cpp/src/plasma/io.cc:136:
>  Socket pathname is too long. 
> /var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-0.11.0.X.Psg69q8j/apache-arrow-0.11.0/cpp/src/plasma/store.cc:900:
>  Check failed: socket >= 0 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2363) [Plasma] Have an automatic object-releasing Create() variant

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2363:
--
Component/s: C++ - Plasma

> [Plasma] Have an automatic object-releasing Create() variant
> 
>
> Key: ARROW-2363
> URL: https://issues.apache.org/jira/browse/ARROW-2363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Antoine Pitrou
>Priority: Major
>
> Like ARROW-2195, but for Create() instead of Get(). Need creating a new C++ 
> API and using it on the Python side.
>  * Create() currently increments the reference count twice
>  * Both Seal() and Release() decrement the reference count
>  * The returned buffer must also handle the case where Seal() wasn't called : 
> first Release() then Abort()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1664:
--
Component/s: Python

> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mitar
>Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for 
> multi-dimensional data. It would be great if one could share them between 
> processes using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1321) [C++/Python] hdfs delegation token functions

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1321:
--
Labels: HDFS  (was: )

> [C++/Python] hdfs delegation token functions
> 
>
> Key: ARROW-1321
> URL: https://issues.apache.org/jira/browse/ARROW-1321
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Major
>  Labels: HDFS
>
> HDFS can create delegation tokens for an authenticated user, so that access 
> to the file-system from other processes/machines can authenticate as that 
> same user without having to use third-party identity systems (kerberos, etc.).
> arrow-hdfs should provide the ability to accept, create, renew and cancel 
> delegation tokens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3717) [Python] Add GCSFSWrapper for DaskFileSystem

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3717:
--
Labels: FileSystem  (was: )

> [Python] Add GCSFSWrapper for DaskFileSystem
> 
>
> Key: ARROW-3717
> URL: https://issues.apache.org/jira/browse/ARROW-3717
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Emmett McQuinn
>Priority: Major
>  Labels: FileSystem
>
> Currently there is an S3FSWrapper that extends the DaskFileSystem object to 
> support functionality like isdir(...), isfile(...), and walk(...).
> Adding a GCSFSWrapper would enable using Google Cloud Storage for packages 
> depending on arrow.
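A rough sketch of what such a wrapper might look like, modelled on the existing 
S3FSWrapper in pyarrow.filesystem; the gcsfs method names used below are 
assumptions, not verified against a particular gcsfs release.

{code:python}
from pyarrow.filesystem import DaskFileSystem


class GCSFSWrapper(DaskFileSystem):
    """Hypothetical wrapper around a gcsfs.GCSFileSystem instance so that
    code expecting isdir()/isfile()/walk() keeps working on GCS paths."""

    def isdir(self, path):
        # Treat a prefix that lists more than itself as a directory.
        contents = self.fs.ls(path)
        return len(contents) > 0 and contents != [path.rstrip("/")]

    def isfile(self, path):
        return self.fs.exists(path) and not self.isdir(path)

    def walk(self, path):
        # Assumes the underlying filesystem exposes a walk() like s3fs does.
        return self.fs.walk(path)

# Usage (hypothetical): GCSFSWrapper(gcsfs.GCSFileSystem(project="my-project"))
{code}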



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1322) [C++/Python] hdfs: encryption-at-rest and secure transport

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1322:
--
Component/s: C++

> [C++/Python] hdfs: encryption-at-rest and secure transport
> --
>
> Key: ARROW-1322
> URL: https://issues.apache.org/jira/browse/ARROW-1322
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Martin Durant
>Priority: Major
>
> HDFS provides for encrypted data transfer and encryption of data on-disc 
> (e.g., via KMS records). It would be nice to see these available within 
> arrow-hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1322) [C++/Python] hdfs: encryption-at-rest and secure transport

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1322:
--
Priority: Minor  (was: Major)

> [C++/Python] hdfs: encryption-at-rest and secure transport
> --
>
> Key: ARROW-1322
> URL: https://issues.apache.org/jira/browse/ARROW-1322
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Martin Durant
>Priority: Minor
>  Labels: HDFS
>
> HDFS provides for encrypted data transfer and encryption of data on-disc 
> (e.g., via KMS records). It would be nice to see these available within 
> arrow-hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1322) [C++/Python] hdfs: encryption-at-rest and secure transport

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1322:
--
Labels: HDFS  (was: )

> [C++/Python] hdfs: encryption-at-rest and secure transport
> --
>
> Key: ARROW-1322
> URL: https://issues.apache.org/jira/browse/ARROW-1322
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Martin Durant
>Priority: Major
>  Labels: HDFS
>
> HDFS provides for encrypted data transfer and encryption of data on-disc 
> (e.g., via KMS records). It would be nice to see these available within 
> arrow-hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1321) [C++/Python] hdfs delegation token functions

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1321:
--
Priority: Minor  (was: Major)

> [C++/Python] hdfs delegation token functions
> 
>
> Key: ARROW-1321
> URL: https://issues.apache.org/jira/browse/ARROW-1321
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Minor
>  Labels: HDFS
>
> HDFS can create delegation tokens for an authenticated user, so that access 
> to the file-system from other processes/machines can authenticate as that 
> same user without having to use third-party identity systems (kerberos, etc.).
> arrow-hdfs should provide the ability to accept, create, renew and cancel 
> delegation tokens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1321) [C++/Python] hdfs delegation token functions

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1321:
--
Component/s: Python

> [C++/Python] hdfs delegation token functions
> 
>
> Key: ARROW-1321
> URL: https://issues.apache.org/jira/browse/ARROW-1321
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Major
>
> HDFS can create delegation tokens for an authenticated user, so that access 
> to the file-system from other processes/machines can authenticate as that 
> same user without having to use third-party identity systems (kerberos, etc.).
> arrow-hdfs should provide the ability to accept, create, renew and cancel 
> delegation tokens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3717) [Python] Add GCSFSWrapper for DaskFileSystem

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3717:
--
Component/s: Python

> [Python] Add GCSFSWrapper for DaskFileSystem
> 
>
> Key: ARROW-3717
> URL: https://issues.apache.org/jira/browse/ARROW-3717
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Emmett McQuinn
>Priority: Major
>
> Currently there is an S3FSWrapper that extends the DaskFileSystem object to 
> support functionality like isdir(...), isfile(...), and walk(...).
> Adding a GCSFSWrapper would enable using Google Cloud Storage for packages 
> depending on arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4648) [C++/Question] Naming/organizational inconsistencies in cpp codebase

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4648:
--
Component/s: C++

> [C++/Question] Naming/organizational inconsistencies in cpp codebase
> 
>
> Key: ARROW-4648
> URL: https://issues.apache.org/jira/browse/ARROW-4648
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.14.0
>
>
> Even after my eyes are used to the codebase, I still find the naming and/or 
> code organization inconsistent.
> h2. File Formats
> Arrow already supports a couple of file formats, namely parquet, feather, 
> json, csv and orc, but their placement in the codebase is quite odd:
> - parquet: src/parquet
> - feather: src/arrow/ipc/feather
> - orc: src/arrow/adapters/orc
> - csv: src/arrow/csv
> - json: src/arrow/json
> I might misunderstand the purpose of these sources, but I'd expect them to be 
> organized under the same roof.
> h2. Inter-Process-Communication vs. Flight
> Based on the name, I'd expect the ipc module to cover Flight's functionality. 
> Flight's placement is a bit odd too: since it has its own codename, it 
> should be placed under cpp/src, like parquet, plasma, or gandiva.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1664:
--
Priority: Minor  (was: Major)

> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Mitar
>Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for 
> multi-dimensional data. It would be great if one could share them between processes 
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3538:
--
Fix Version/s: 0.14.0

> [Python] ability to override the automated assignment of uuid for filenames 
> when writing datasets
> -
>
> Key: ARROW-3538
> URL: https://issues.apache.org/jira/browse/ARROW-3538
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Ji Xu
>Priority: Major
>  Labels: features, parquet
> Fix For: 0.14.0
>
>
> Say I have a pandas DataFrame {{df}} that I would like to store on disk as a 
> dataset using pyarrow parquet, I would do this:
> {code:java}
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_to_dataset(table, root_path=some_path, 
> partition_cols=['a',]){code}
> On disk the dataset would look something like this:
>  {color:#14892c}some_path{color}
>  {color:#14892c}├── a=1{color}
>  {color:#14892c}├── 4498704937d84fe5abebb3f06515ab2d.parquet{color}
>  {color:#14892c}├── a=2{color}
>  {color:#14892c}├── 8bcfaed8986c4bdba587aaaee532370c.parquet{color}
> *Wished Feature:* It'd be great if I could override the auto-assignment of the 
> long UUID as filename somehow during the *dataset* writing. My purpose is to 
> be able to overwrite the dataset on disk when I have a new version of {{df}}. 
> Currently if I try to write the dataset again, another new uniquely named 
> [UUID].parquet file will be placed next to the old one, with the same, 
> redundant data.
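
Until such an option exists, a possible workaround is to write each partition manually so that the file name is deterministic and gets overwritten on the next run. This is a minimal sketch, not an existing pyarrow feature; the directory layout and the fixed "part-0.parquet" name are illustrative assumptions.

{code:python}
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': [1, 1, 2], 'value': [0.1, 0.2, 0.3]})

for key, part in df.groupby('a'):
    part_dir = os.path.join('some_path', 'a={}'.format(key))
    os.makedirs(part_dir, exist_ok=True)
    table = pa.Table.from_pandas(part.drop(columns=['a']))
    # Fixed name instead of a random UUID, so re-running overwrites the file.
    pq.write_table(table, os.path.join(part_dir, 'part-0.parquet'))
{code}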



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3538:
--
Component/s: Python

> [Python] ability to override the automated assignment of uuid for filenames 
> when writing datasets
> -
>
> Key: ARROW-3538
> URL: https://issues.apache.org/jira/browse/ARROW-3538
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Ji Xu
>Priority: Major
>  Labels: features, parquet
>
> Say I have a pandas DataFrame {{df}} that I would like to store on disk as a 
> dataset using pyarrow parquet, I would do this:
> {code:java}
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_to_dataset(table, root_path=some_path, 
> partition_cols=['a',]){code}
> On disk the dataset would look something like this:
>  {color:#14892c}some_path{color}
>  {color:#14892c}├── a=1{color}
>  {color:#14892c}├── 4498704937d84fe5abebb3f06515ab2d.parquet{color}
>  {color:#14892c}├── a=2{color}
>  {color:#14892c}├── 8bcfaed8986c4bdba587aaaee532370c.parquet{color}
> *Wished Feature:* It'd be great if I could override the auto-assignment of the 
> long UUID as filename somehow during the *dataset* writing. My purpose is to 
> be able to overwrite the dataset on disk when I have a new version of {{df}}. 
> Currently if I try to write the dataset again, another new uniquely named 
> [UUID].parquet file will be placed next to the old one, with the same, 
> redundant data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3887) [Java][Gandiva] Expose Dremio build and tests as new optional container/test

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3887:
--
Component/s: Java
 C++ - Gandiva

> [Java][Gandiva] Expose Dremio build and tests as new optional container/test
> 
>
> Key: ARROW-3887
> URL: https://issues.apache.org/jira/browse/ARROW-3887
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva, Java
>Reporter: Jacques Nadeau
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>
> Dremio uses Arrow Java and Gandiva extensively and could provide additional 
> test coverage for the project. We should find a way to expose the downstream 
> build of Dremio as an optional build so major changes can better be evaluated 
> against downstream effects.
>  
> [~praveenbingo], assigning to you for now but let's figure out who at Dremio 
> can pick this up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3794) [R] Consider mapping INT8 to integer() not raw()

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3794:
--
Component/s: R

> [R] Consider mapping INT8 to integer() not raw()
> 
>
> Key: ARROW-3794
> URL: https://issues.apache.org/jira/browse/ARROW-3794
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Major
>
> The Arrow::BINARY type maps better to R's raw(), while Arrow::INT8 maps better 
> to R's integer(), since currently NAs are not supported when collecting 
> INT8s and numerical operations can't be performed against raw().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4003) [Gandiva][Java] Safeguard jvm before loading the gandiva library

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4003:
--
Component/s: Java
 C++ - Gandiva

> [Gandiva][Java] Safeguard jvm before loading the gandiva library
> 
>
> Key: ARROW-4003
> URL: https://issues.apache.org/jira/browse/ARROW-4003
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva, Java
>Reporter: Praveen Kumar Desabandu
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>
> Today we always load the gandiva library when trying to use the JNI bridge, 
> but we have run into issues causing the JVM to crash in untested paths.
> The proposal is to load the library in a separate process first and, only if 
> that works, load it in the current process.
> This would be done only once at startup/first load.
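
An illustrative sketch of the probe-in-a-child-process idea, written in Python rather than the Java/JNI code this issue targets; the library name "libgandiva_jni.so" is an assumption made only for illustration.

{code:python}
import ctypes
import subprocess
import sys

LIB = "libgandiva_jni.so"

def probe_ok(lib):
    # Exit code 0 means the throwaway child process loaded the library
    # without crashing, so it should be safe to load it here too.
    code = subprocess.call([sys.executable, "-c",
                            "import ctypes; ctypes.CDLL({!r})".format(lib)])
    return code == 0

if probe_ok(LIB):
    gandiva = ctypes.CDLL(LIB)   # load in the current process only after the probe
else:
    raise RuntimeError("Refusing to load {} in-process: probe failed".format(LIB))
{code}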



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3902) [Gandiva] [C++] Remove static c++ linked in Gandiva.

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3902:
--
Component/s: C++ - Gandiva

> [Gandiva] [C++] Remove static c++ linked in Gandiva.
> 
>
> Key: ARROW-3902
> URL: https://issues.apache.org/jira/browse/ARROW-3902
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Affects Versions: 0.12.0
>Reporter: Praveen Kumar Desabandu
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>
> Hi,
> [~wesm_impala_7e40], I am looking into switching Gandiva to the Red Hat 
> developer toolchain. We are not too familiar with it and not sure of the effort 
> required there.
> In the meanwhile, for the short term, can we get Crossbow builds to only 
> do static linking for Dremio builds (through a Travis env variable), while 
> Arrow ships Gandiva linked to libstdc++ dynamically?
> We can then move to the Red Hat toolchain for the 0.13 version of Arrow?
> Thx.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2127) [Plasma] Transfer of objects between CPUs and GPUs

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2127:
--
Component/s: C++ - Plasma

> [Plasma] Transfer of objects between CPUs and GPUs
> --
>
> Key: ARROW-2127
> URL: https://issues.apache.org/jira/browse/ARROW-2127
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Philipp Moritz
>Priority: Major
> Fix For: 0.14.0
>
>
> It should be possible to transfer an object that was created on the CPU to 
> the GPU and vice versa. One natural implementation is to introduce a flag to 
> plasma::Get that specifies where the object should end up and then transfer 
> the object under the hood and return the appropriate buffer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2444) [Python] Better handle reading empty parquet files

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2444:
--
Component/s: Python

> [Python] Better handle reading empty parquet files
> --
>
> Key: ARROW-2444
> URL: https://issues.apache.org/jira/browse/ARROW-2444
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]
>  
> Currently pyarrow reads empty parts as float64, even if the underlying 
> columns have other dtypes. This can cause problems for pandas downstream, as 
> certain operations are only valid on certain dtypes, even if the columns are 
> empty.
>  
> Copying over the comment from Uwe:
>  
> {quote}This is the expected behaviour, as an empty string column in Pandas 
> is simply an empty column of type object. Sadly, object does not tell us much 
> about the type of the column at all. We return numpy.float64 in this case as 
> it's the most efficient type to store nulls in Pandas.{quote}
> {quote}This seems unintuitive at best to me. An empty object column in pandas 
> is treated differently in many operations than an empty float64 column (str 
> accessor is available, excluded from numeric operations, etc.). Having an 
> empty file read in as a different dtype than was written could lead to errors 
> in processing code downstream. Would arrow be willing to change this 
> behavior?{quote}
> We should probably use a method other than `field.type.to_pandas_dtype()` in 
> this case. The column saved in Parquet is saved with `NA` as its type, 
> which sadly does not provide enough information. 
> We also store the original dtype in the Pandas metadata that is used for the 
> actual DataFrame reconstruction later on. If we also picked up that 
> metadata when the file was written, we should be able to correctly reconstruct 
> the dtype.
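
A minimal repro harness for the behaviour described above, assuming pandas and pyarrow are installed; the exact dtype reported back may depend on the pyarrow version.

{code:python}
import pandas as pd
import pyarrow as pa

# An empty object (string) column becomes a null-typed Arrow column; per the
# report above, it is read back as float64 rather than object.
df = pd.DataFrame({'s': pd.Series([], dtype=object)})
table = pa.Table.from_pandas(df)

print(table.schema)               # 's' is typed as null
print(table.to_pandas().dtypes)   # compare with the original dtype=object
{code}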



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2621) [Python/CI] Use pep8speaks for Python PRs

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2621:
--
Component/s: Python
 Continuous Integration

> [Python/CI] Use pep8speaks for Python PRs
> -
>
> Key: ARROW-2621
> URL: https://issues.apache.org/jira/browse/ARROW-2621
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration, Python
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: beginner
> Fix For: 0.14.0
>
>
> It would be nice if we got automated comments by 
> [https://pep8speaks.com/] on the Python PRs. This would be much more 
> readable than the current `flake8` output in the Travis logs. This issue is 
> split up into two tasks:
>  * Create an issue with INFRA kindly asking them for activating pep8speaks 
> for Arrow
>  * Setup {{.pep8speaks.yml}} to align with our {{flake8}} config. For 
> reference, see Pandas' config: 
> [https://github.com/pandas-dev/pandas/blob/master/.pep8speaks.yml] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2542) [Plasma] Refactor object notification code

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2542:
--
Component/s: C++ - Plasma

> [Plasma] Refactor object notification code
> --
>
> Key: ARROW-2542
> URL: https://issues.apache.org/jira/browse/ARROW-2542
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Replace unique_ptr with vector



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-326) [Java] ComplexWriter should initialize nested writers when container vector is already populated

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-326:
-
Component/s: Java

> [Java] ComplexWriter should initialize nested writers when container vector 
> is already populated
> 
>
> Key: ARROW-326
> URL: https://issues.apache.org/jira/browse/ARROW-326
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>Priority: Major
>
> It's possible and sometimes useful to reuse a nested vector that was 
> populated in a previous ComplexWriter. The new ComplexWriter should be aware 
> of the fields that are present in the vector.
> As it is right now, if a particular column were determined to be a specific 
> type (or a union type), but the new writer finds a new type, the original 
> type may be thrown out. What should happen is that the type should be 
> promoted to union (or have a new subtype added to the union field).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3850) [Python] Support MapType and StructType for enhanced PySpark integration

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3850:
--
Component/s: Python

> [Python] Support MapType and StructType for enhanced PySpark integration
> 
>
> Key: ARROW-3850
> URL: https://issues.apache.org/jira/browse/ARROW-3850
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Florian Wilhelm
>Priority: Major
>
> It would be great to support MapType and (nested) StructType in Arrow so that 
> PySpark can make use of it.
>  
>  Quite often, as in my use case, complex types are also stored in Hive table 
> cells. Currently it's not possible to use the new 
> {{[pandas_udf|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.pandas_udf]}}
>  decorator which internally uses Arrow to generate a UDF for columns with 
> complex types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5177) [Python] ParquetReader.read_column() doesn't check bounds

2019-04-17 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5177:
-

 Summary: [Python] ParquetReader.read_column() doesn't check bounds
 Key: ARROW-5177
 URL: https://issues.apache.org/jira/browse/ARROW-5177
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
Reporter: Antoine Pitrou






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5177) [Python] ParquetReader.read_column() doesn't check bounds

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5177:
--
Description: If you call {{ParquetReader.read_column()}} with an invalid 
column number, it just crashes.

> [Python] ParquetReader.read_column() doesn't check bounds
> -
>
> Key: ARROW-5177
> URL: https://issues.apache.org/jira/browse/ARROW-5177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> If you call {{ParquetReader.read_column()}} with an invalid column number, it 
> just crashes.
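
A hedged repro sketch: the {{.reader}} attribute and the {{read_column()}} call below refer to pyarrow's internal ParquetReader and are assumptions made only to illustrate the missing bounds check.

{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile('some.parquet')
n = pf.metadata.num_columns

# A well-behaved call would validate the index first, e.g.:
#   if not 0 <= i < n: raise IndexError(...)
# Instead, an out-of-range index reaches native code and crashes the process.
pf.reader.read_column(n + 1)
{code}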



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3861:
--
Component/s: Python

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: pyarrow, python
> Fix For: 0.14.0
>
>
> I just noticed that no matter which columns are specified on load of a 
> dataset, the partition column is always returned. This might lead to strange 
> behaviour, as the resulting dataframe has more than the expected columns:
> {code}
> import dask as da
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import os
> import numpy as np
> import shutil
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
> shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name='DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>('partition_column', pa.int32()),
>('arrays', pa.list_(pa.int32())),
>('strings', pa.string()),
>('new_column', pa.string())])
> table = pa.Table.from_pandas(df, schema=my_schema)
> pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, 
> partition_cols=['partition_column'])
> df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 
> 'strings']).to_pandas()
> # pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], 
> engine='pyarrow')
> df_pq
> {code}
> df_pq has column `partition_column`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4131) [Python] Coerce mixed columns to String

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4131:
--
Component/s: Python

> [Python] Coerce mixed columns to String
> ---
>
> Key: ARROW-4131
> URL: https://issues.apache.org/jira/browse/ARROW-4131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Leo Meyerovich
>Priority: Major
> Fix For: 0.14.0
>
>
> Continuing [https://github.com/apache/arrow/issues/3280] 
>  
> ===
>  
> I'm seeing variants of this elsewhere (e.g., 
> [wesm/feather#349|https://github.com/wesm/feather/issues/349] ) --
> Not all Pandas tables coerce to Arrow tables, and when they fail, they do not 
> fail in a way that is conducive to automation:
> Sample:
> {code:python}
> mixed_df = pd.DataFrame({'mixed': [1, 'b']})
> pa.Table.from_pandas(mixed_df)
> # => ArrowInvalid: ('Could not convert b with type str: tried to convert to double',
> #    'Conversion failed for column mixed with type object')
> {code}
> I would have expected behaviors more like the following:
>  * Coerce {{toString}} by default, with a default-off option to disallow 
> toString coercions
>  * Provide a default-off option to {{from_pandas}} to auto-coerce
>  * Name the exception so it is clear that this is a column coercion failure, 
> and include the column name(s), making this predictable and clearly 
> handleable by both library writers & users
> I lean towards:
>  * Defaults auto-coerce, improving life of early users, 
> `coerce_mixed_columns_to_strings=True`
>  * For less frequent yet more advanced library implementors, allow them to 
> override to `False`
>  * In their case, create a predictable & machine-readable exception, 
> `MixedColumnException(mixed_columns=['a', 'b', ...], msg="")`
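
As a stopgap, callers can coerce mixed object columns themselves before conversion. A minimal sketch follows; the helper name mirrors the proposed option and is purely illustrative, not an existing pyarrow parameter.

{code:python}
import pandas as pd
import pyarrow as pa

mixed_df = pd.DataFrame({'mixed': [1, 'b']})

def coerce_mixed_columns_to_strings(df):
    # Cast every object-dtyped column to str so Arrow sees a uniform type.
    out = df.copy()
    for col in out.select_dtypes(include=['object']).columns:
        out[col] = out[col].astype(str)
    return out

table = pa.Table.from_pandas(coerce_mixed_columns_to_strings(mixed_df))
print(table.schema)   # 'mixed' is now a string column
{code}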



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3779) [Python] Validate timezone passed to pa.timestamp

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3779:
--
Component/s: Python

> [Python] Validate timezone passed to pa.timestamp
> -
>
> Key: ARROW-3779
> URL: https://issues.apache.org/jira/browse/ARROW-3779
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3919) [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3919:
--
Component/s: Python

> [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize
> -
>
> Key: ARROW-3919
> URL: https://issues.apache.org/jira/browse/ARROW-3919
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> see https://github.com/modin-project/modin/issues/266



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3779) [Python] Validate timezone passed to pa.timestamp

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3779:
--
Component/s: C++

> [Python] Validate timezone passed to pa.timestamp
> -
>
> Key: ARROW-3779
> URL: https://issues.apache.org/jira/browse/ARROW-3779
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4452) [Python] Serializing sparse torch tensors

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4452:
--
Component/s: Python

> [Python] Serializing sparse torch tensors
> -
>
> Key: ARROW-4452
> URL: https://issues.apache.org/jira/browse/ARROW-4452
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Using the pytorch serialization handler on sparse Tensors:
> {code:java}
> import torch
> i = torch.LongTensor([[0, 2], [1, 0], [1, 2]])
> v = torch.FloatTensor([3,      4,      5    ])
> tensor = torch.sparse.FloatTensor(i.t(), v, torch.Size([2,3]))
> pyarrow.serialization.register_torch_serialization_handlers(pyarrow.serialization._default_serialization_context)
> s = pyarrow.serialize(tensor, 
> context=pyarrow.serialization._default_serialization_context) {code}
> Produces this result:
> {code:java}
> TypeError: can't convert sparse tensor to numpy. Use Tensor.to_dense() to 
> convert to a dense tensor first.{code}
> We should provide a way to serialize sparse torch tensors, especially now 
> that we are getting support for sparse Tensors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3471) [C++][Gandiva] Investigate caching isomorphic expressions

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3471:
--
Component/s: C++ - Gandiva

> [C++][Gandiva] Investigate caching isomorphic expressions
> -
>
> Key: ARROW-3471
> URL: https://issues.apache.org/jira/browse/ARROW-3471
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Praveen Kumar Desabandu
>Priority: Major
>  Labels: gandiva
> Fix For: 0.14.0
>
>
> Two expressions, say add(a+b) and add(c+d), could potentially be reused if the 
> only things that differ are the names.
> Test E2E.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2105) [C++] Implement take kernel functions - properly handle special indices

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2105:
--
Component/s: C++

> [C++] Implement take kernel functions - properly handle special indices
> ---
>
> Key: ARROW-2105
> URL: https://issues.apache.org/jira/browse/ARROW-2105
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Jingyuan Wang
>Priority: Major
> Fix For: 0.14.0
>
>
> Special indices include:
> - negative indices
> - out-of-bound indices



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2104) [C++] Implement take kernel functions - nested array value type

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2104:
--
Component/s: C++

> [C++] Implement take kernel functions - nested array value type
> ---
>
> Key: ARROW-2104
> URL: https://issues.apache.org/jira/browse/ARROW-2104
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Jingyuan Wang
>Priority: Major
> Fix For: 0.14.0
>
>
> Should support nested array value types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4526) [Java] Remove Netty references from ArrowBuf and move Allocator out of vector package

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4526:
--
Component/s: Java

> [Java] Remove Netty references from ArrowBuf and move Allocator out of vector 
> package
> -
>
> Key: ARROW-4526
> URL: https://issues.apache.org/jira/browse/ARROW-4526
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Jacques Nadeau
>Priority: Major
>
> Arrow currently has a hard dependency on Netty and exposes this in public 
> APIs. This shouldn't be the case. There could be many allocator 
> implementations, with Netty as one possible option. We should remove the hard 
> dependency between arrow-vector and Netty, instead creating a trivial 
> allocator. ArrowBuf should probably expose a {{<T> T unwrap(Class<T> clazz)}} 
> method instead to make inner providers available without a hard 
> reference. This should also include drastically reducing the number of 
> methods on ArrowBuf, as right now it includes every method from ByteBuf but 
> many of those are not very useful or appropriate.
> This work should come after we do the simpler ARROW-3191



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1005) [JAVA] NullableDecimalVector.setSafe(int, byte[]...) throws UnsupportedOperationException

2019-04-17 Thread Jacques Nadeau (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacques Nadeau resolved ARROW-1005.
---
Resolution: Invalid

I think this is so old and the code has gone through so many iterations that 
whatever it is, let's not worry about it.

> [JAVA] NullableDecimalVector.setSafe(int, byte[]...) throws 
> UnsupportedOperationException
> -
>
> Key: ARROW-1005
> URL: https://issues.apache.org/jira/browse/ARROW-1005
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Jacques Nadeau
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4648) [C++/Question] Naming/organizational inconsistencies in cpp codebase

2019-04-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820231#comment-16820231
 ] 

Antoine Pitrou commented on ARROW-4648:
---

I would welcome fixing filenames (standardize on either "-" or "_" in filenames 
made of multiple words). Opinions?

> [C++/Question] Naming/organizational inconsistencies in cpp codebase
> 
>
> Key: ARROW-4648
> URL: https://issues.apache.org/jira/browse/ARROW-4648
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.14.0
>
>
> Even after my eyes are used to the codebase, I still find the naming and/or 
> code organization inconsistent.
> h2. File Formats
> Arrow already supports a couple of file formats, namely parquet, feather, 
> json, csv and orc, but their placement in the codebase is quite odd:
> - parquet: src/parquet
> - feather: src/arrow/ipc/feather
> - orc: src/arrow/adapters/orc
> - csv: src/arrow/csv
> - json: src/arrow/json
> I might misunderstand the purpose of these sources, but I'd expect them to be 
> organized under the same roof.
> h2. Inter-Process-Communication vs. Flight
> Based on the name, I'd expect the ipc module to cover Flight's functionality. 
> Flight's placement is a bit odd too: since it has its own codename, it 
> should be placed under cpp/src, like parquet, plasma, or gandiva.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4570) [Gandiva] Add overflow checks for decimals

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4570:
--
Component/s: C++ - Gandiva

> [Gandiva] Add overflow checks for decimals
> --
>
> Key: ARROW-4570
> URL: https://issues.apache.org/jira/browse/ARROW-4570
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Pindikura Ravindra
>Assignee: Pindikura Ravindra
>Priority: Major
>
> For decimals, overflows can occur in two places:
>  # the input array can have values that are outside the bounds (e.g. > 38 digits)
>  # an operation can result in an overflow, e.g. adding two decimals of (38, 
> 6) can overflow if the input numbers are very large.
> In both of the above cases, just verifying that an overflow occurred can be a 
> perf overhead. We should do this based on a conf variable.
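
An illustrative sketch in Python (not Gandiva code) of the two checks described above for a decimal(38, 6); the values are made up for illustration.

{code:python}
PRECISION = 38

def fits(value_unscaled):
    # value_unscaled is the integer representation (value * 10**scale);
    # it fits a decimal(38, 6) if it has at most 38 digits.
    return len(str(abs(value_unscaled))) <= PRECISION

a = 5 * 10**37   # 38-digit unscaled value
b = 6 * 10**37   # 38-digit unscaled value

assert fits(a) and fits(b)   # check 1: inputs are within bounds
assert not fits(a + b)       # check 2: the sum has 39 digits, i.e. overflows
{code}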



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4025) [Python] TensorFlow/PyTorch arrow ThreadPool workarounds not working in some settings

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4025:
--
Component/s: Python

> [Python] TensorFlow/PyTorch arrow ThreadPool workarounds not working in some 
> settings
> -
>
> Key: ARROW-4025
> URL: https://issues.apache.org/jira/browse/ARROW-4025
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> See the bug report in [https://github.com/ray-project/ray/issues/3520]
> I wonder if we can revisit this issue and try to get rid of the workarounds 
> we tried to deploy in the past.
> See also the discussion in [https://github.com/apache/arrow/pull/2096]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1266) [Plasma] Move heap allocations to arrow memory pool

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1266:
--
Component/s: C++ - Plasma

> [Plasma] Move heap allocations to arrow memory pool
> ---
>
> Key: ARROW-1266
> URL: https://issues.apache.org/jira/browse/ARROW-1266
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Reporter: Philipp Moritz
>Priority: Major
> Fix For: 0.14.0
>
>
> At the moment we are allocating memory with std::vectors and even new in some 
> places; this should be cleaned up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1796) [Python] RowGroup filtering on file level

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1796:
--
Component/s: C++

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We can build upon the API defined in {{fastparquet}} for defining RowGroup 
> filters: 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 
> and translate them into the C++ enums we will define in 
> https://issues.apache.org/jira/browse/PARQUET-1158 . This should enable us to 
> provide the user with a simple predicate pushdown API that we can extend in 
> the background from RowGroup to Page level later on.
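
For reference, row-group-level pushdown can already be approximated by hand with existing pyarrow APIs. A minimal sketch, assuming the file carries column statistics; the file name, column index and the predicate (x > 100) are illustrative.

{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile('data.parquet')
COL = 0   # index of the filtered column, assumed for illustration

tables = []
for i in range(pf.num_row_groups):
    stats = pf.metadata.row_group(i).column(COL).statistics
    # Skip row groups whose max is <= 100: no row in them can satisfy x > 100.
    if stats is not None and stats.max is not None and stats.max <= 100:
        continue
    tables.append(pf.read_row_group(i))
{code}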



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2041) [Python] pyarrow.serialize has high overhead for list of NumPy arrays

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2041:
--
Priority: Minor  (was: Major)

> [Python] pyarrow.serialize has high overhead for list of NumPy arrays
> -
>
> Key: ARROW-2041
> URL: https://issues.apache.org/jira/browse/ARROW-2041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Richard Shin
>Priority: Minor
>  Labels: Performance
> Fix For: 0.14.0
>
>
> {{Python 2.7.12 (default, Nov 20 2017, 18:23:56)}}
> {{[GCC 5.4.0 20160609] on linux2}}
> {{Type "help", "copyright", "credits" or "license" for more information.}}
> {{>>> import pyarrow as pa, numpy as np}}
> {{>>> arrays = [np.arange(100, dtype=np.int32) for _ in range(1)]}}
> {{>>> with open('test.pyarrow', 'w') as f:}}
> {{... f.write(pa.serialize(arrays).to_buffer().to_pybytes())}}
> {{...}}
> {{>>> import cPickle as pickle}}
> {{>>> pickle.dump(arrays, open('test.pkl', 'w'), pickle.HIGHEST_PROTOCOL)}}
> test.pyarrow is 6.2 MB, while test.pkl is only 4.2 MB.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2041) [Python] pyarrow.serialize has high overhead for list of NumPy arrays

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2041:
--
Labels: Performance  (was: )

> [Python] pyarrow.serialize has high overhead for list of NumPy arrays
> -
>
> Key: ARROW-2041
> URL: https://issues.apache.org/jira/browse/ARROW-2041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Richard Shin
>Priority: Major
>  Labels: Performance
> Fix For: 0.14.0
>
>
> {{Python 2.7.12 (default, Nov 20 2017, 18:23:56)}}
> {{[GCC 5.4.0 20160609] on linux2}}
> {{Type "help", "copyright", "credits" or "license" for more information.}}
> {{>>> import pyarrow as pa, numpy as np}}
> {{>>> arrays = [np.arange(100, dtype=np.int32) for _ in range(1)]}}
> {{>>> with open('test.pyarrow', 'w') as f:}}
> {{... f.write(pa.serialize(arrays).to_buffer().to_pybytes())}}
> {{...}}
> {{>>> import cPickle as pickle}}
> {{>>> pickle.dump(arrays, open('test.pkl', 'w'), pickle.HIGHEST_PROTOCOL)}}
> test.pyarrow is 6.2 MB, while test.pkl is only 4.2 MB.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2041) [Python] pyarrow.serialize has high overhead for list of NumPy arrays

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2041:
--
Component/s: Python

> [Python] pyarrow.serialize has high overhead for list of NumPy arrays
> -
>
> Key: ARROW-2041
> URL: https://issues.apache.org/jira/browse/ARROW-2041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Richard Shin
>Priority: Major
> Fix For: 0.14.0
>
>
> {{Python 2.7.12 (default, Nov 20 2017, 18:23:56)}}
> {{[GCC 5.4.0 20160609] on linux2}}
> {{Type "help", "copyright", "credits" or "license" for more information.}}
> {{>>> import pyarrow as pa, numpy as np}}
> {{>>> arrays = [np.arange(100, dtype=np.int32) for _ in range(1)]}}
> {{>>> with open('test.pyarrow', 'w') as f:}}
> {{... f.write(pa.serialize(arrays).to_buffer().to_pybytes())}}
> {{...}}
> {{>>> import cPickle as pickle}}
> {{>>> pickle.dump(arrays, open('test.pkl', 'w'), pickle.HIGHEST_PROTOCOL)}}
> test.pyarrow is 6.2 MB, while test.pkl is only 4.2 MB.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2835:
--
Component/s: C++

> [C++] ReadAt/WriteAt are inconsistent with moving the files position
> 
>
> Key: ARROW-2835
> URL: https://issues.apache.org/jira/browse/ARROW-2835
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Dimitri Vorona
>Priority: Major
> Fix For: 0.14.0
>
>
> Right now, there is inconsistent behaviour regarding moving the file's 
> position pointer after calling ReadAt or WriteAt. For example, the default 
> implementation of ReadAt seeks to the desired offset and calls Read, which 
> moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change 
> the position. WriteableFile::WriteAt seems to move the position in the current 
> implementation, but there is no docstring which prescribes this behaviour.
> Antoine suggested that the *At methods shouldn't touch the position, which makes 
> more sense, IMHO. The change isn't huge and doesn't seem to break anything 
> internally, but it might break existing user code.
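
A language-neutral illustration (a Python sketch, not the Arrow C++ API) of the contract suggested above, where ReadAt reads at an absolute offset and leaves the stream position untouched.

{code:python}
import io

class PositionPreservingFile:
    def __init__(self, raw):
        self.raw = raw   # any seekable binary file object

    def read_at(self, offset, nbytes):
        saved = self.raw.tell()
        try:
            self.raw.seek(offset)
            return self.raw.read(nbytes)
        finally:
            self.raw.seek(saved)   # the caller's position is unchanged

# usage sketch:
# f = PositionPreservingFile(open('file.bin', 'rb'))
# f.read_at(128, 16)   # the file's position pointer stays where it was
{code}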



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4757) [C++] Nested chunked array support

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4757:
--
Component/s: C++

> [C++] Nested chunked array support
> --
>
> Key: ARROW-4757
> URL: https://issues.apache.org/jira/browse/ARROW-4757
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
> Fix For: 0.14.0
>
>
> Dear all,
> I'm currently trying to lift the 2GB limit on the python serialization. For 
> this, I implemented a chunked union builder to split the array into smaller 
> arrays.
> However, some of the children of the union array can be ListArrays, which can 
> themselves contain UnionArrays which can contain ListArrays etc. I'm at a bit 
> of a loss how to handle this. In principle I'd like to chunk the children 
> too. However, currently UnionArrays can only have children of type Array, and 
> there is no way to treat a chunked array (which is a vector of Arrays) as an 
> Array to store it as a child of a UnionArray. Any ideas how to best support 
> this use case?
> -- Philipp.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4974) Array approx equality

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4974:
--
Component/s: Go

> Array approx equality
> -
>
> Key: ARROW-4974
> URL: https://issues.apache.org/jira/browse/ARROW-4974
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Alexandre Crayssac
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4973) Slice Array equality

2019-04-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820224#comment-16820224
 ] 

Antoine Pitrou commented on ARROW-4973:
---

Can you elaborate what you are asking for here?

> Slice Array equality
> 
>
> Key: ARROW-4973
> URL: https://issues.apache.org/jira/browse/ARROW-4973
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alexandre Crayssac
>Assignee: Alexandre Crayssac
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4972) [Go] Array equality

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4972:
--
Summary: [Go] Array equality  (was: Array equality)

> [Go] Array equality
> ---
>
> Key: ARROW-4972
> URL: https://issues.apache.org/jira/browse/ARROW-4972
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Alexandre Crayssac
>Assignee: Alexandre Crayssac
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1692) [Python, Java] UnionArray round trip not working

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1692:
--
Component/s: Python
 Java

> [Python, Java] UnionArray round trip not working
> 
>
> Key: ARROW-1692
> URL: https://issues.apache.org/jira/browse/ARROW-1692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java, Python
>Reporter: Philipp Moritz
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: columnar-format-1.0
> Fix For: 0.14.0
>
> Attachments: union_array.arrow
>
>
> I'm currently working on making pyarrow.serialization data available from the 
> Java side, one problem I was running into is that it seems the Java 
> implementation cannot read UnionArrays generated from C++. To make this 
> easily reproducible I created a clean Python implementation for creating 
> UnionArrays: https://github.com/apache/arrow/pull/1216
> The data is generated with the following script:
> {code}
> import pyarrow as pa
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
> int64 = pa.array([1, 2, 3], type='int64')
> types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)
> batch = pa.RecordBatch.from_arrays([result], ["test"])
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> writer.write_batch(batch)
> sink.close()
> b = sink.get_result()
> with open("union_array.arrow", "wb") as f:
> f.write(b)
> # Sanity check: Read the batch in again
> with open("union_array.arrow", "rb") as f:
> b = f.read()
> reader = pa.RecordBatchStreamReader(pa.BufferReader(b))
> batch = reader.read_next_batch()
> print("union array is", batch.column(0))
> {code}
> I attached the file generated by that script. Then when I run the following 
> code in Java:
> {code}
> RootAllocator allocator = new RootAllocator(10);
> ByteArrayInputStream in = new 
> ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));
> ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
> reader.loadNextBatch()
> {code}
> I get the following error:
> {code}
> |  java.lang.IllegalArgumentException thrown: Could not load buffers for 
> field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error 
> message: can not truncate buffer to a larger size 7: 0
> |at VectorLoader.loadBuffers (VectorLoader.java:83)
> |at VectorLoader.load (VectorLoader.java:62)
> |at ArrowReader$1.visit (ArrowReader.java:125)
> |at ArrowReader$1.visit (ArrowReader.java:111)
> |at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
> |at ArrowReader.loadNextBatch (ArrowReader.java:137)
> |at (#7:1)
> {code}
> It seems like Java is not picking up that the UnionArray is Dense instead of 
> Sparse. After changing the default in 
> java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, 
> I get this:
> {code}
> jshell> reader.getVectorSchemaRoot().getSchema()
> $9 ==> Schema [0])<: Int(64, true)>
> {code}
> but then reading doesn't work:
> {code}
> jshell> reader.loadNextBatch()
> |  java.lang.IllegalArgumentException thrown: Could not load buffers for 
> field list: Union(Dense, [1])<: Struct Int(64, true). error message: can not truncate buffer to a larger size 1: > 0
> |at VectorLoader.loadBuffers (VectorLoader.java:83)
> |at VectorLoader.load (VectorLoader.java:62)
> |at ArrowReader$1.visit (ArrowReader.java:125)
> |at ArrowReader$1.visit (ArrowReader.java:111)
> |at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
> |at ArrowReader.loadNextBatch (ArrowReader.java:137)
> |at (#8:1)
> {code}
> Any help with this is appreciated!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4949) [CI] Add C# docker image to the docker-compose setup

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4949:
--
Component/s: Continuous Integration

> [CI] Add C# docker image to the docker-compose setup
> 
>
> Key: ARROW-4949
> URL: https://issues.apache.org/jira/browse/ARROW-4949
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> https://github.com/apache/arrow/blob/master/csharp/build/docker/Dockerfile



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4972) Array equality

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4972:
--
Component/s: Go

> Array equality
> --
>
> Key: ARROW-4972
> URL: https://issues.apache.org/jira/browse/ARROW-4972
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Alexandre Crayssac
>Assignee: Alexandre Crayssac
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4974) [Go] Array approx equality

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4974:
--
Summary: [Go] Array approx equality  (was: Array approx equality)

> [Go] Array approx equality
> --
>
> Key: ARROW-4974
> URL: https://issues.apache.org/jira/browse/ARROW-4974
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Alexandre Crayssac
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4973) [Go] Slice Array equality

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4973:
--
Summary: [Go] Slice Array equality  (was: Slice Array equality)

> [Go] Slice Array equality
> -
>
> Key: ARROW-4973
> URL: https://issues.apache.org/jira/browse/ARROW-4973
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alexandre Crayssac
>Assignee: Alexandre Crayssac
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4973) [Go] Slice Array equality

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4973:
--
Component/s: Go

> [Go] Slice Array equality
> -
>
> Key: ARROW-4973
> URL: https://issues.apache.org/jira/browse/ARROW-4973
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Alexandre Crayssac
>Assignee: Alexandre Crayssac
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5022) [C++] Implement more "Datum" types for AggregateKernel

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5022:
--
Component/s: C++

> [C++] Implement more "Datum" types for AggregateKernel
> --
>
> Key: ARROW-5022
> URL: https://issues.apache.org/jira/browse/ARROW-5022
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>
> Currently it gives the following error if the datum isn't an array:
> {code:java}
> AggregateKernel expects Array datum{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5030) [Python] read_row_group fails with Nested data conversions not implemented for chunked array outputs

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5030:
--
Component/s: Python
 C++

> [Python] read_row_group fails with Nested data conversions not implemented 
> for chunked array outputs
> 
>
> Key: ARROW-5030
> URL: https://issues.apache.org/jira/browse/ARROW-5030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.0
>Reporter: Jakub Okoński
>Priority: Major
>
> Hey, I'm trying to concatenate two files and, to avoid reading everything into 
> memory at once, I wanted to use `read_row_group` for my solution, but it 
> fails.
>  
> I think it's due to fields like these:
> {{pyarrow.Field>}}
>  
> But I'm not sure. Is this a duplicate? The issue linked in the code is 
> resolved 
> https://github.com/apache/arrow/blob/fd0b90a7f7e65fde32af04c4746004a1240914cf/cpp/src/parquet/arrow/reader.cc#L915
>  
> Stacktrace is
>  
> {{  File "/data/teftel/teftel-data/teftel_data/parquet_stream.py", line 163, 
> in read_batches}}
> {{    table = pf.read_row_group(ix, columns=self._columns)}}
> {{  File 
> "/home/kuba/.local/share/virtualenvs/teftel-o6G5iH_l/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 186, in read_row_group}}
> {{    use_threads=use_threads)}}
> {{  File "pyarrow/_parquet.pyx", line 695, in 
> pyarrow._parquet.ParquetReader.read_row_group}}
> {{  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status}}
> {{pyarrow.lib.ArrowNotImplementedError: Nested data conversions not 
> implemented for chunked array outputs}}
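
For context, a sketch of the row-group-wise concatenation being attempted (file names are illustrative); the read_row_group call is where the error above is raised for large nested (list) columns.

{code:python}
import pyarrow.parquet as pq

inputs = ['part-0.parquet', 'part-1.parquet']
writer = None
for path in inputs:
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        table = pf.read_row_group(i)   # fails here for the nested columns
        if writer is None:
            writer = pq.ParquetWriter('merged.parquet', table.schema)
        writer.write_table(table)
if writer is not None:
    writer.close()
{code}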



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4974) Array approx equality

2019-04-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820223#comment-16820223
 ] 

Antoine Pitrou commented on ARROW-4974:
---

[~alexandreyc] Can you elaborate what you are asking for here?

> Array approx equality
> -
>
> Key: ARROW-4974
> URL: https://issues.apache.org/jira/browse/ARROW-4974
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alexandre Crayssac
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3523) [JS] Assign dictionary IDs in IPC writer rather than on creation

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3523:
--
Component/s: JavaScript

> [JS] Assign dictionary IDs in IPC writer rather than on creation
> 
>
> Key: ARROW-3523
> URL: https://issues.apache.org/jira/browse/ARROW-3523
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Major
> Fix For: 0.14.0
>
>
>  Currently the JS implementation relies on the user assigning IDs for 
> dictionaries that they create; we should do something like the C++ 
> implementation, which uses a dictionary id memo to assign and retrieve 
> dictionary ids in the IPC writer 
> (https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L495).
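
A minimal sketch of the dictionary-id memo idea, in Python pseudocode rather than the JS/C++ implementation: the IPC writer asks the memo for an id, and the same dictionary object always gets the same id without the user assigning one.

{code:python}
class DictionaryMemo:
    def __init__(self):
        self._ids = {}    # id(dictionary object) -> assigned dictionary id
        self._next = 0

    def get_or_assign(self, dictionary):
        key = id(dictionary)
        if key not in self._ids:
            self._ids[key] = self._next
            self._next += 1
        return self._ids[key]

memo = DictionaryMemo()
colors = ['red', 'green', 'blue']
sizes = ['S', 'M', 'L']
assert memo.get_or_assign(colors) == 0
assert memo.get_or_assign(sizes) == 1
assert memo.get_or_assign(colors) == 0   # same dictionary, same id
{code}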



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5008) [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5008:
--
Component/s: C++

> [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist
> 
>
> Key: ARROW-5008
> URL: https://issues.apache.org/jira/browse/ARROW-5008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.0, 0.12.1
>Reporter: Keith Kraus
>Priority: Major
>
> In docker containers it's common for `/etc/localtime` not to exist, and if it 
> doesn't exist it causes a file-not-found error which is not handled in 
> PyArrow. The workaround is to install `tzdata` into the container (at least for 
> Ubuntu), but I wanted to report this upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5002) [C++] Implement GroupBy

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5002:
--
Component/s: C++

> [C++] Implement GroupBy
> ---
>
> Key: ARROW-5002
> URL: https://issues.apache.org/jira/browse/ARROW-5002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>
> Dear all,
> I wonder what the best way forward is for implementing GroupBy kernels. 
> Initially this was part of
> https://issues.apache.org/jira/browse/ARROW-4124
> but it is not contained in the current implementation as far as I can tell.
> It seems that the part of group-by that just returns group indices could be 
> conveniently implemented with the HashKernel; that seems useful in any case. 
> Is that indeed the best way forward, and should it be done?
> GroupBy + Aggregate could then be implemented either on top of that (HashKernel 
> + the Take kernel + aggregation, which involves more memory copies than 
> necessary) or directly inside the aggregate kernel. The latter is probably 
> preferred; any thoughts on that?
> Am I missing any other JIRAs related to this?
> Best, Philipp.
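
To make the two-step idea concrete, here is a small NumPy sketch of "dictionary-encode the keys into group indices, then aggregate per group"; it only illustrates the approach under discussion and is not Arrow's kernel API:

{code:python}
import numpy as np

def group_by_sum(keys, values):
    """Sketch of GroupBy + Aggregate: encode keys to group indices, then reduce."""
    # Step 1: the "hash kernel" part - map each key to a dense group index.
    unique_keys, group_indices = np.unique(keys, return_inverse=True)
    # Step 2: aggregate per group without materializing the taken values.
    sums = np.zeros(len(unique_keys), dtype=np.asarray(values).dtype)
    np.add.at(sums, group_indices, values)
    return unique_keys, sums

keys = np.array(["a", "b", "a", "c", "b"])
values = np.array([1, 2, 3, 4, 5])
print(group_by_sum(keys, values))  # (array(['a', 'b', 'c'], ...), array([4, 7, 4]))
{code}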



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5008) [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5008:
--
Labels: ORC  (was: )

> [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist
> 
>
> Key: ARROW-5008
> URL: https://issues.apache.org/jira/browse/ARROW-5008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.0, 0.12.1
>Reporter: Keith Kraus
>Priority: Major
>  Labels: ORC
>
> In docker containers it's common for `/etc/localtime` not to exist, and when it 
> is missing PyArrow hits a file-not-found error that it does not handle. A 
> workaround is to install `tzdata` into the container (at least for Ubuntu), but 
> I wanted to report this upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5106) [Packaging] [C++/Python] Add conda package verification scripts

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5106:
--
Component/s: Packaging

> [Packaging] [C++/Python] Add conda package verification scripts
> ---
>
> Key: ARROW-5106
> URL: https://issues.apache.org/jira/browse/ARROW-5106
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
>
> Following the conventions of the apt/yum verification scripts: 
> https://github.com/apache/arrow/pull/4098



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5153) [Rust] Use IntoIter trait for write_batch/write_mini_batch

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5153:
--
Component/s: Rust

> [Rust] Use IntoIter trait for write_batch/write_mini_batch
> --
>
> Key: ARROW-5153
> URL: https://issues.apache.org/jira/browse/ARROW-5153
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Xavier Lange
>Priority: Major
>
> Writing data to a parquet file requires a lot of copying and intermediate Vec 
> creation. Take a record struct like:
> {{struct MyData { name: String, address: Option<String> } }}
> Over the course of working with sets of this data, you'll have the bulk data in 
> a Vec<MyData>, the names column in a Vec<String>, and the address column in a 
> Vec<Option<String>>. This puts extra memory pressure on the system; at a 
> minimum we have to allocate a Vec the same size as the bulk data even if we 
> are using references.
> What I'm proposing is to use an IntoIter style. This would maintain backward 
> compatibility, since a slice automatically implements IntoIterator. 
> ColumnWriterImpl#write_batch would go from {{values: &[T::T]}} to accepting any 
> IntoIterator over {{T::T}} values. Then you can do things like
> {{  write_batch(bulk.iter().map(|x| x.name), None, None)}}
> {{  write_batch(bulk.iter().map(|x| x.address), Some(bulk.iter().map(|x| x.is_some())), None)}}
> and you can see there's no need for an intermediate Vec, so no short-term 
> allocations to write out the data.
> I am writing data with many columns and I think this would really help to 
> speed things up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5163) [Gandiva] Cast timestamp/date are incorrectly evaluating year 0097 to 1997

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5163:
--
Component/s: C++ - Gandiva

> [Gandiva] Cast timestamp/date are incorrectly evaluating year 0097 to 1997
> --
>
> Key: ARROW-5163
> URL: https://issues.apache.org/jira/browse/ARROW-5163
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: shyam narayan singh
>Assignee: shyam narayan singh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The cast timestamp/date functions are incorrectly evaluating the year string 
> "0097" as year 1997. It should evaluate to 0097.
> In other words, any year string of length 4 should be taken as-is and must 
> not be tampered with. The year string "97" should be interpreted as 1997.
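
Sketched in Python purely to illustrate the expected semantics (this is not Gandiva code):

{code:python}
def interpret_year(year_str):
    """Illustrative rule: 4-digit year strings are literal, 2-digit ones map to 19xx."""
    if len(year_str) == 4:
        return int(year_str)         # "0097" -> 97, left untouched
    if len(year_str) == 2:
        return 1900 + int(year_str)  # "97" -> 1997
    raise ValueError("unsupported year string: " + year_str)

assert interpret_year("0097") == 97
assert interpret_year("97") == 1997
{code}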



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5166) [Python] Statistics for uint64 columns may overflow

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5166:
--
Component/s: Python

> [Python] Statistics for uint64 columns may overflow
> ---
>
> Key: ARROW-5166
> URL: https://issues.apache.org/jira/browse/ARROW-5166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: python 3.6
> pyarrow 0.13.0
>Reporter: Marco Neumann
>Priority: Major
> Attachments: int64_statistics_overflow.parquet
>
>
> See the attached parquet file, where the statistics max value is smaller than 
> the min value.
> You can roundtrip that file through pandas and store it back to provoke the 
> same bug.
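
A minimal way to provoke and inspect this kind of overflow (the column name and values below are illustrative and not taken from the attached file):

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A uint64 value above INT64_MAX; if the statistics are handled as signed
# 64-bit integers it can wrap around and yield max < min.
df = pd.DataFrame({"x": np.array([1, 2**64 - 1], dtype=np.uint64)})
pq.write_table(pa.Table.from_pandas(df), "uint64_stats.parquet")

stats = pq.ParquetFile("uint64_stats.parquet").metadata.row_group(0).column(0).statistics
print(stats.min, stats.max)  # max may come out smaller than min if it overflowed
{code}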



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5166) [Python] Statistics for uint64 columns may overflow

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5166:
--
Component/s: C++

> [Python] Statistics for uint64 columns may overflow
> ---
>
> Key: ARROW-5166
> URL: https://issues.apache.org/jira/browse/ARROW-5166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: python 3.6
> pyarrow 0.13.0
>Reporter: Marco Neumann
>Priority: Major
> Attachments: int64_statistics_overflow.parquet
>
>
> See the attached parquet file, where the statistics max value is smaller than 
> the min value.
> You can roundtrip that file through pandas and store it back to provoke the 
> same bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2317) [Python] fix C linkage warning

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2317:
--
Component/s: Python

> [Python] fix C linkage warning
> --
>
> Key: ARROW-2317
> URL: https://issues.apache.org/jira/browse/ARROW-2317
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Viktor Gal
>Priority: Minor
>
> When using the pyarrow interface from a C++ library, one gets the following 
> compiler warning:
> {quote}{{warning: 'unwrap_table' has C-linkage specified, but returns 
> user-defined type 'arrow::Status' which is incompatible with C 
> [-Wreturn-type-c-linkage]}}
> {{ARROW_EXPORT Status unwrap_table(PyObject* table, std::shared_ptr<Table>* 
> out);}}
> {quote}
> This is due to a Cython artifact.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1391) [Python] Benchmarks for python serialization

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1391:
--
Component/s: Python
 Benchmarking

> [Python] Benchmarks for python serialization
> 
>
> Key: ARROW-1391
> URL: https://issues.apache.org/jira/browse/ARROW-1391
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Benchmarking, Python
>Reporter: Philipp Moritz
>Priority: Minor
>
> It would be great to have a suite of relevant benchmarks for the Python 
> serialization code in ARROW-759. These could be used to guide profiling and 
> performance improvements.
> Relevant use cases include:
> - dictionaries of large numpy arrays that are used to represent weights of a 
> neural network
> - long lists of primitive types like ints, floats or strings
> - lists of user defined python objects
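
A rough starting point for such a benchmark, using the pyarrow serialization entry points that existed at the time (payload sizes and names are arbitrary and just mirror the use cases listed above; user-defined objects would additionally need a custom SerializationContext):

{code:python}
import time
import numpy as np
import pyarrow as pa

payloads = {
    "dict of large numpy arrays": {"layer_%d" % i: np.random.rand(1000, 1000) for i in range(5)},
    "long list of ints": list(range(1_000_000)),
    "list of strings": ["item-%d" % i for i in range(100_000)],
}

for name, obj in payloads.items():
    start = time.perf_counter()
    buf = pa.serialize(obj).to_buffer()   # serialization path under test
    middle = time.perf_counter()
    pa.deserialize(buf)
    end = time.perf_counter()
    print("%s: serialize %.3fs, deserialize %.3fs" % (name, middle - start, end - middle))
{code}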



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-999:
-
Component/s: Java

> [Java] Minor types don't account for nullable FieldType flag
> 
>
> Key: ARROW-999
> URL: https://issues.apache.org/jira/browse/ARROW-999
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Priority: Minor
>
> Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)` 
> returns a NullableFloat4Vector instead of a Float4Vector.
> edit: Float4Vector doesn't implement FieldVector, so can't currently be a 
> top-level vector. I'm confused as to what the nullable flag is supposed to 
> represent then.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

