[jira] [Updated] (ARROW-7589) [C++][Gandiva] Calling castVarchar from java sometimes results in segmentation fault for input length 0
[ https://issues.apache.org/jira/browse/ARROW-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7589: -- Labels: pull-request-available (was: ) > [C++][Gandiva] Calling castVarchar from java sometimes results in > segmentation fault for input length 0 > --- > > Key: ARROW-7589 > URL: https://issues.apache.org/jira/browse/ARROW-7589 > Project: Apache Arrow > Issue Type: Bug >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7589) [C++][Gandiva] Calling castVarchar from java sometimes results in segmentation fault for input length 0
[ https://issues.apache.org/jira/browse/ARROW-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Projjal Chanda updated ARROW-7589: -- Summary: [C++][Gandiva] Calling castVarchar from java sometimes results in segmentation fault for input length 0 (was: [C++][Gandiva] Calling castVarchar java sometimes results in segmentation fault for input length 0) > [C++][Gandiva] Calling castVarchar from java sometimes results in > segmentation fault for input length 0 > --- > > Key: ARROW-7589 > URL: https://issues.apache.org/jira/browse/ARROW-7589 > Project: Apache Arrow > Issue Type: Bug >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7589) [C++][Gandiva] Calling castVarchar java sometimes results in segmentation fault for input length 0
Projjal Chanda created ARROW-7589: - Summary: [C++][Gandiva] Calling castVarchar java sometimes results in segmentation fault for input length 0 Key: ARROW-7589 URL: https://issues.apache.org/jira/browse/ARROW-7589 Project: Apache Arrow Issue Type: Bug Reporter: Projjal Chanda Assignee: Projjal Chanda -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7588) [Plasma] Plasma On YARN
Ferdinand Xu created ARROW-7588: --- Summary: [Plasma] Plasma On YARN Key: ARROW-7588 URL: https://issues.apache.org/jira/browse/ARROW-7588 Project: Apache Arrow Issue Type: New Feature Components: C++ - Plasma Reporter: Ferdinand Xu YARN is widely used as a resource manager. Currently the Plasma server runs as an external service for memory sharing across different clients; it is not a service managed by YARN. The resources used by Plasma should also be managed, and the Plasma service itself could be managed by YARN. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-3827) [Rust] Implement UnionArray
[ https://issues.apache.org/jira/browse/ARROW-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3827: -- Labels: pull-request-available (was: ) > [Rust] Implement UnionArray > --- > > Key: ARROW-3827 > URL: https://issues.apache.org/jira/browse/ARROW-3827 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7533) [Java] Move ArrowBufPointer out of the java memory package
[ https://issues.apache.org/jira/browse/ARROW-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7533: -- Labels: pull-request-available (was: ) > [Java] Move ArrowBufPointer out of the java memory package > -- > > Key: ARROW-7533 > URL: https://issues.apache.org/jira/browse/ARROW-7533 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Jacques Nadeau >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > The memory package is focused on memory access and management. > ArrowBufPointer should be moved to the algorithm package, as it isn't core to the > Arrow memory management primitives. I would further suggest that it is an > anti-pattern. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1571) [C++] Implement argsort kernels (sort indices) for integers using O(n) counting sort
[ https://issues.apache.org/jira/browse/ARROW-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016474#comment-17016474 ] Yibo Cai commented on ARROW-1571: - Finding a cross-over point suitable for various hardware may not be easy. I will run some tests to see whether we can find a reasonable approach. > [C++] Implement argsort kernels (sort indices) for integers using O(n) > counting sort > > > Key: ARROW-1571 > URL: https://issues.apache.org/jira/browse/ARROW-1571 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > Fix For: 2.0.0 > > > This function requires knowledge of the minimum and maximum of an array. If > it is small enough, then an array of size {{maximum - minimum}} can be > constructed and used to tabulate value frequencies and then compute the sort > indices (this is called "grade up" or "grade down" in APL languages). There > is generally a cross-over point where this function performs worse than > mergesort or quicksort due to data locality issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
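The counting-sort argsort described in the issue can be sketched as follows. This is a minimal pure-Python illustration of the technique (stable "grade up"), not the Arrow C++ kernel:

```python
def counting_argsort(values):
    """Counting-sort-based argsort: O(n + k) where k = max - min + 1.

    Only worthwhile when the value range k is small relative to n;
    past some cross-over point a comparison sort wins on data locality.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    k = hi - lo + 1
    # Tabulate value frequencies over the range [lo, hi].
    counts = [0] * k
    for v in values:
        counts[v - lo] += 1
    # Prefix-sum the counts to get each value's first output slot.
    starts = [0] * k
    total = 0
    for i in range(k):
        starts[i] = total
        total += counts[i]
    # Emit sort indices; scanning input left-to-right keeps it stable.
    indices = [0] * len(values)
    for idx, v in enumerate(values):
        indices[starts[v - lo]] = idx
        starts[v - lo] += 1
    return indices
```

For example, `counting_argsort([3, 1, 2, 1])` yields `[1, 3, 2, 0]`, the indices that read the input in sorted order.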
[jira] [Commented] (ARROW-7587) [C++][Compute] Add Top-k kernel
[ https://issues.apache.org/jira/browse/ARROW-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016470#comment-17016470 ] Yibo Cai commented on ARROW-7587: - Comments welcome > [C++][Compute] Add Top-k kernel > --- > > Key: ARROW-7587 > URL: https://issues.apache.org/jira/browse/ARROW-7587 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Compute >Reporter: Yibo Cai >Assignee: Yibo Cai >Priority: Minor > > Add a kernel to get the top k smallest or largest elements (indices). > std::partial_sort should be a better solution than sorting everything and then > picking the top k. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7587) [C++][Compute] Add Top-k kernel
Yibo Cai created ARROW-7587: --- Summary: [C++][Compute] Add Top-k kernel Key: ARROW-7587 URL: https://issues.apache.org/jira/browse/ARROW-7587 Project: Apache Arrow Issue Type: New Feature Components: C++ - Compute Reporter: Yibo Cai Assignee: Yibo Cai Add a kernel to get the top k smallest or largest elements (indices). std::partial_sort should be a better solution than sorting everything and then picking the top k. -- This message was sent by Atlassian Jira (v8.3.4#803005)
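The idea of avoiding a full sort can be illustrated in pure Python with a bounded heap (`heapq.nlargest`/`heapq.nsmallest`), which plays the same role as `std::partial_sort` in the proposed C++ kernel. This is an illustrative sketch, not Arrow code:

```python
import heapq

def top_k_indices(values, k, largest=True):
    """Return indices of the k largest (or smallest) elements.

    O(n log k) via a size-bounded heap, instead of O(n log n) for
    sorting everything and then taking the first k. Ties are broken
    by original position, since nlargest/nsmallest sort stably.
    """
    if largest:
        return heapq.nlargest(k, range(len(values)), key=values.__getitem__)
    return heapq.nsmallest(k, range(len(values)), key=values.__getitem__)
```

For example, `top_k_indices([5, 1, 9, 3], 2)` yields `[2, 0]` (the indices of 9 and 5), and passing `largest=False` yields `[1, 3]`.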
[jira] [Updated] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf
[ https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu updated ARROW-7494: -- Fix Version/s: (was: 0.16.0) 1.0.0 > [Java] Remove reader index and writer index from ArrowBuf > - > > Key: ARROW-7494 > URL: https://issues.apache.org/jira/browse/ARROW-7494 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Jacques Nadeau >Assignee: Ji Liu >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Reader and writer indexes and their functionality don't belong on a chunk of memory; > they exist only due to inheritance from ByteBuf. As part of removing ByteBuf > inheritance, we should also remove reader and writer indexes from ArrowBuf. > They waste heap memory for rare utility. In general, a slice > can be used instead of the reader/writer index pattern. -- This message was sent by Atlassian Jira (v8.3.4#803005)
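The slice-instead-of-cursor pattern mentioned in the description can be illustrated in plain Python with `memoryview`. This is a hypothetical sketch of the idea, not the ArrowBuf API:

```python
# Instead of a buffer object carrying mutable readerIndex/writerIndex
# cursors (the ByteBuf inheritance), each consumer derives a zero-copy
# slice; the "position" lives in the slice, not in shared buffer state.
buf = memoryview(bytes(range(16)))

header = buf[:4]   # first consumer takes the first 4 bytes
payload = buf[4:]  # the remainder; no index is stored on `buf` itself

assert header.tobytes() == bytes([0, 1, 2, 3])
assert payload[0] == 4
assert len(buf) == 16  # the original buffer is untouched
```

Because each slice is independent, two readers can walk the same buffer concurrently without coordinating a shared cursor, which is the heap-memory and API-surface saving the issue argues for.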
[jira] [Resolved] (ARROW-7578) [R] Add support for datasets with IPC files and with multiple sources
[ https://issues.apache.org/jira/browse/ARROW-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7578. Resolution: Fixed Issue resolved by pull request 6205 [https://github.com/apache/arrow/pull/6205] > [R] Add support for datasets with IPC files and with multiple sources > - > > Key: ARROW-7578 > URL: https://issues.apache.org/jira/browse/ARROW-7578 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6899) [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>
[ https://issues.apache.org/jira/browse/ARROW-6899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-6899: -- Assignee: Neal Richardson (was: Wes McKinney) > [Python] to_pandas() not implemented on list indices=int32> > - > > Key: ARROW-6899 > URL: https://issues.apache.org/jira/browse/ARROW-6899 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.15.0 >Reporter: Razvan Chitu >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Attachments: encoded.arrow > > Time Spent: 0.5h > Remaining Estimate: 0h > > Hi, > {{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector where the data > vector is of type "dictionary encoded string". Here is the table schema as > printed by pyarrow: > {code:java} > pyarrow.Table > encodedList: list<$data$: dictionary > not null> not null > child 0, $data$: dictionary not > null > metadata > > OrderedDict() {code} > and the data (also attached in a file to this ticket) > {code:java} > > [ > [ > -- dictionary: > [ > "a", > "b", > "c", > "d" > ] > -- indices: > [ > 0, > 1, > 2 > ], > -- dictionary: > [ > "a", > "b", > "c", > "d" > ] > -- indices: > [ > 0, > 3 > ] > ] > ] {code} > and the exception I got > {code:java} > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 df.to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi > in pyarrow.lib._PandasConvertible.to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi > in pyarrow.lib.Table._to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in table_to_blockmanager(options, table, categories, ignore_metadata) > 700 > 701 _check_data_column_metadata_consistency(all_columns) > --> 702 blocks = _table_to_blocks(options, table, categories) > 703 columns = 
_deserialize_column_index(table, all_columns, > column_indexes) > 704 > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in _table_to_blocks(options, block_table, categories) > 972 > 973 # Convert an arrow table to Block from the internal pandas API > --> 974 result = pa.lib.table_to_blocks(options, block_table, categories) > 975 > 976 # Defined above > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi > in pyarrow.lib.table_to_blocks() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: > dictionary {code} > Note that the data vector itself can be loaded successfully by to_pandas. > It'd be great if this would be addressed in the next version of pyarrow. For > now, is there anything I can do on my end to bypass this unimplemented > conversion? > Thanks, > Razvan -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf
[ https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7494: -- Assignee: Ji Liu (was: Neal Richardson) > [Java] Remove reader index and writer index from ArrowBuf > - > > Key: ARROW-7494 > URL: https://issues.apache.org/jira/browse/ARROW-7494 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Jacques Nadeau >Assignee: Ji Liu >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Reader and writer indexes and their functionality don't belong on a chunk of memory; > they exist only due to inheritance from ByteBuf. As part of removing ByteBuf > inheritance, we should also remove reader and writer indexes from ArrowBuf. > They waste heap memory for rare utility. In general, a slice > can be used instead of the reader/writer index pattern. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7096) [C++] Add options structs for concatenation-with-promotion and schema unification
[ https://issues.apache.org/jira/browse/ARROW-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7096: -- Assignee: Neal Richardson (was: Zhuo Peng) > [C++] Add options structs for concatenation-with-promotion and schema > unification > - > > Key: ARROW-7096 > URL: https://issues.apache.org/jira/browse/ARROW-7096 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Follow up to ARROW-6625 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7096) [C++] Add options structs for concatenation-with-promotion and schema unification
[ https://issues.apache.org/jira/browse/ARROW-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7096: -- Assignee: Zhuo Peng (was: Neal Richardson) > [C++] Add options structs for concatenation-with-promotion and schema > unification > - > > Key: ARROW-7096 > URL: https://issues.apache.org/jira/browse/ARROW-7096 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Zhuo Peng >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Follow up to ARROW-6625 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6899) [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>
[ https://issues.apache.org/jira/browse/ARROW-6899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-6899: -- Assignee: Wes McKinney (was: Neal Richardson) > [Python] to_pandas() not implemented on list indices=int32> > - > > Key: ARROW-6899 > URL: https://issues.apache.org/jira/browse/ARROW-6899 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.15.0 >Reporter: Razvan Chitu >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Attachments: encoded.arrow > > Time Spent: 0.5h > Remaining Estimate: 0h > > Hi, > {{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector where the data > vector is of type "dictionary encoded string". Here is the table schema as > printed by pyarrow: > {code:java} > pyarrow.Table > encodedList: list<$data$: dictionary > not null> not null > child 0, $data$: dictionary not > null > metadata > > OrderedDict() {code} > and the data (also attached in a file to this ticket) > {code:java} > > [ > [ > -- dictionary: > [ > "a", > "b", > "c", > "d" > ] > -- indices: > [ > 0, > 1, > 2 > ], > -- dictionary: > [ > "a", > "b", > "c", > "d" > ] > -- indices: > [ > 0, > 3 > ] > ] > ] {code} > and the exception I got > {code:java} > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 df.to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi > in pyarrow.lib._PandasConvertible.to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi > in pyarrow.lib.Table._to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in table_to_blockmanager(options, table, categories, ignore_metadata) > 700 > 701 _check_data_column_metadata_consistency(all_columns) > --> 702 blocks = _table_to_blocks(options, table, categories) > 703 columns = 
_deserialize_column_index(table, all_columns, > column_indexes) > 704 > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in _table_to_blocks(options, block_table, categories) > 972 > 973 # Convert an arrow table to Block from the internal pandas API > --> 974 result = pa.lib.table_to_blocks(options, block_table, categories) > 975 > 976 # Defined above > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi > in pyarrow.lib.table_to_blocks() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: > dictionary {code} > Note that the data vector itself can be loaded successfully by to_pandas. > It'd be great if this would be addressed in the next version of pyarrow. For > now, is there anything I can do on my end to bypass this unimplemented > conversion? > Thanks, > Razvan -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7567) [Java] Bump Checkstyle from 6.19 to 8.18
[ https://issues.apache.org/jira/browse/ARROW-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7567: -- Assignee: Neal Richardson (was: Fokko Driesprong) > [Java] Bump Checkstyle from 6.19 to 8.18 > > > Key: ARROW-7567 > URL: https://issues.apache.org/jira/browse/ARROW-7567 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7567) [Java] Bump Checkstyle from 6.19 to 8.18
[ https://issues.apache.org/jira/browse/ARROW-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7567: -- Assignee: Fokko Driesprong (was: Neal Richardson) > [Java] Bump Checkstyle from 6.19 to 8.18 > > > Key: ARROW-7567 > URL: https://issues.apache.org/jira/browse/ARROW-7567 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7518: -- Assignee: Neal Richardson (was: Krisztian Szucs) > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Neal Richardson >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf
[ https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7494: -- Assignee: Neal Richardson (was: Ji Liu) > [Java] Remove reader index and writer index from ArrowBuf > - > > Key: ARROW-7494 > URL: https://issues.apache.org/jira/browse/ARROW-7494 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Jacques Nadeau >Assignee: Neal Richardson >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Reader and writer indexes and their functionality don't belong on a chunk of memory; > they exist only due to inheritance from ByteBuf. As part of removing ByteBuf > inheritance, we should also remove reader and writer indexes from ArrowBuf. > They waste heap memory for rare utility. In general, a slice > can be used instead of the reader/writer index pattern. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7518: -- Assignee: Krisztian Szucs (was: Neal Richardson) > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7568) [Java] Bump Apache Avro from 1.9.0 to 1.9.1
[ https://issues.apache.org/jira/browse/ARROW-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7568: -- Assignee: Fokko Driesprong (was: Neal Richardson) > [Java] Bump Apache Avro from 1.9.0 to 1.9.1 > --- > > Key: ARROW-7568 > URL: https://issues.apache.org/jira/browse/ARROW-7568 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Apache Avro 1.9.1 contains some bugfixes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7570) [Java] Fix high severity issues reported by LGTM
[ https://issues.apache.org/jira/browse/ARROW-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7570: -- Assignee: Neal Richardson (was: Fokko Driesprong) > [Java] Fix high severity issues reported by LGTM > > > Key: ARROW-7570 > URL: https://issues.apache.org/jira/browse/ARROW-7570 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Fixes high severity issues reported by LGTM: > [https://lgtm.com/projects/g/apache/arrow/?mode=list&lang=java&severity=error] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7569) [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions
[ https://issues.apache.org/jira/browse/ARROW-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7569: -- Assignee: Joris Van den Bossche (was: Neal Richardson) > [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas > conversions > --- > > Key: ARROW-7569 > URL: https://issues.apache.org/jira/browse/ARROW-7569 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > ARROW-2428 was about adding such a mapping, and described three use cases > (see this > [comment|https://issues.apache.org/jira/browse/ARROW-2428?focusedCommentId=16914231&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16914231] > for details): > * Basic roundtrip based on the pandas_metadata (in {{to_pandas}}, we check if > the pandas_metadata specify pandas extension dtypes, and if so, use this as > the target dtype for that column) > * Conversion for pyarrow extension types that can define their equivalent > pandas extension dtype > * A way to override default conversion (eg for the built-in types, or in > absence of pandas_metadata in the schema). This would require the user to be > able to specify some mapping of pyarrow type or column name to the pandas > extension dtype to use. > The PR that closed ARROW-2428 (https://github.com/apache/arrow/pull/5512) > only covered the first two cases, and not the third case. > I think it is still interesting to also cover the third case in some way. > An example use case are the new nullable dtypes that are introduced in pandas > (eg the nullable integer dtype). Assume I want to read a parquet file into a > pandas DataFrame using this nullable integer dtype. 
The pyarrow Table has no > pandas_metadata indicating to use this dtype (unless it was created from a > pandas DataFrame that was already using this dtype, but that will often not > be the case), and the pyarrow.int64() type is also not an extension type that > can define its equivalent pandas extension dtype. > Currently, the only solution is first read it into pandas DataFrame (which > will use floats for the integers if there are nulls), and then afterwards to > convert those floats back to a nullable integer dtype. > A possible API for this could look like: > {code} > table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()}) > {code} > to indicate that you want to convert all columns of the pyarrow table with > int64 type to a pandas column using the nullable Int64 dtype. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7568) [Java] Bump Apache Avro from 1.9.0 to 1.9.1
[ https://issues.apache.org/jira/browse/ARROW-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7568: -- Assignee: Neal Richardson (was: Fokko Driesprong) > [Java] Bump Apache Avro from 1.9.0 to 1.9.1 > --- > > Key: ARROW-7568 > URL: https://issues.apache.org/jira/browse/ARROW-7568 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Apache Avro 1.9.1 contains some bugfixes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7569) [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions
[ https://issues.apache.org/jira/browse/ARROW-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7569: -- Assignee: Neal Richardson > [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas > conversions > --- > > Key: ARROW-7569 > URL: https://issues.apache.org/jira/browse/ARROW-7569 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > ARROW-2428 was about adding such a mapping, and described three use cases > (see this > [comment|https://issues.apache.org/jira/browse/ARROW-2428?focusedCommentId=16914231&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16914231] > for details): > * Basic roundtrip based on the pandas_metadata (in {{to_pandas}}, we check if > the pandas_metadata specify pandas extension dtypes, and if so, use this as > the target dtype for that column) > * Conversion for pyarrow extension types that can define their equivalent > pandas extension dtype > * A way to override default conversion (eg for the built-in types, or in > absence of pandas_metadata in the schema). This would require the user to be > able to specify some mapping of pyarrow type or column name to the pandas > extension dtype to use. > The PR that closed ARROW-2428 (https://github.com/apache/arrow/pull/5512) > only covered the first two cases, and not the third case. > I think it is still interesting to also cover the third case in some way. > An example use case are the new nullable dtypes that are introduced in pandas > (eg the nullable integer dtype). Assume I want to read a parquet file into a > pandas DataFrame using this nullable integer dtype. 
The pyarrow Table has no > pandas_metadata indicating to use this dtype (unless it was created from a > pandas DataFrame that was already using this dtype, but that will often not > be the case), and the pyarrow.int64() type is also not an extension type that > can define its equivalent pandas extension dtype. > Currently, the only solution is first read it into pandas DataFrame (which > will use floats for the integers if there are nulls), and then afterwards to > convert those floats back to a nullable integer dtype. > A possible API for this could look like: > {code} > table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()}) > {code} > to indicate that you want to convert all columns of the pyarrow table with > int64 type to a pandas column using the nullable Int64 dtype. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7570) [Java] Fix high severity issues reported by LGTM
[ https://issues.apache.org/jira/browse/ARROW-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7570: -- Assignee: Fokko Driesprong (was: Neal Richardson) > [Java] Fix high severity issues reported by LGTM > > > Key: ARROW-7570 > URL: https://issues.apache.org/jira/browse/ARROW-7570 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Fixes high severity issues reported by LGTM: > [https://lgtm.com/projects/g/apache/arrow/?mode=list&lang=java&severity=error] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7572) [Java] Enforce Maven 3.3+ as mentioned in README
[ https://issues.apache.org/jira/browse/ARROW-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7572: -- Assignee: Neal Richardson (was: Fokko Driesprong) > [Java] Enforce Maven 3.3+ as mentioned in README > --- > > Key: ARROW-7572 > URL: https://issues.apache.org/jira/browse/ARROW-7572 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7572) [Java] Enforce Maven 3.3+ as mentioned in README
[ https://issues.apache.org/jira/browse/ARROW-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7572: -- Assignee: Fokko Driesprong (was: Neal Richardson) > [Java] Enforce Maven 3.3+ as mentioned in README > --- > > Key: ARROW-7572 > URL: https://issues.apache.org/jira/browse/ARROW-7572 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7551: -- Assignee: David Li (was: Neal Richardson) > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Assignee: David Li >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7551: -- Assignee: Neal Richardson > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7586) [C++][Dataset] Read feather files
Neal Richardson created ARROW-7586: -- Summary: [C++][Dataset] Read feather files Key: ARROW-7586 URL: https://issues.apache.org/jira/browse/ARROW-7586 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Neal Richardson Assignee: Ben Kietzman Fix For: 0.16.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7551: -- Labels: pull-request-available (was: ) > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-2260) [C++][Plasma] plasma_store should show usage
[ https://issues.apache.org/jira/browse/ARROW-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016349#comment-17016349 ] Christian Hudon commented on ARROW-2260: Oups. Didn't check for duplicates before reporting ARROW-7585. I'm willing to do the work to fix this as a good first Arrow pull request. Antoine mentioned GFlags as the library to use (that would be more featureful than getopt()), so I'll use that unless someone says otherwise here... > [C++][Plasma] plasma_store should show usage > > > Key: ARROW-2260 > URL: https://issues.apache.org/jira/browse/ARROW-2260 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 2.0.0 > > > Currently the options exposed by the {{plasma_store}} executable aren't very > discoverable: > {code:bash} > $ plasma_store -h > please specify socket for incoming connections with -s switch > Abandon > (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting > *)$ plasma_store > please specify socket for incoming connections with -s switch > Abandon > (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting > *)$ plasma_store --help > plasma_store: invalid option -- '-' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016346#comment-17016346 ] David Li commented on ARROW-7551: - Yes, let's skip the test on CI for now. > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Priority: Critical > Fix For: 0.16.0 > > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`
[ https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6895: -- Labels: pull-request-available (was: ) > [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader > repeats returned values when calling `NextBatch()` > --- > > Key: ARROW-6895 > URL: https://issues.apache.org/jira/browse/ARROW-6895 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.0 > Environment: Linux 5.2.17-200.fc30.x86_64 (Docker) >Reporter: Adam Hooper >Assignee: Francois Saint-Jacques >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet > > > Given most columns, I can run a loop like: > {code:cpp} > std::unique_ptr columnReader(/*...*/); > while (nRowsRemaining > 0) { > int n = std::min(100, nRowsRemaining); > std::shared_ptr chunkedArray; > auto status = columnReader->NextBatch(n, &chunkedArray); > // ... and then use `chunkedArray` > nRowsRemaining -= n; > } > {code} > (The context is: "convert Parquet to CSV/JSON, with small memory footprint." > Used in https://github.com/CJWorkbench/parquet-to-arrow) > Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; > the second return value looks like {{val100...val199}}; and so on. > ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The > first {{NextBatch()}} return value looks like {{val0...val100}}; the second > return value looks like {{val0...val99, val100...val199}} (ChunkedArray with > two arrays); the third return value looks like {{val0...val99, > val100...val199, val200...val299}} (ChunkedArray with three arrays); and so > on. The returned arrays are never cleared. > In sum: {{NextBatch()}} on a dictionary column reader returns the wrong > values. 
> I've attached a minimal Parquet file that presents this problem with the > above code; and I've written a patch that fixes this one case, to illustrate > where things are wrong. I don't think I understand enough edge cases to > decree that my patch is a correct fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-7585) Plasma-store-server does not support --help, shows backtrace on getopt error
[ https://issues.apache.org/jira/browse/ARROW-7585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-7585. - Resolution: Duplicate Closing as duplicate. [~chrish42] Please comment on ARROW-2260. I agree it deserves fixing. GFlags is what we use for some other command-line utilities. > Plasma-store-server does not support --help, shows backtrace on getopt error > > > Key: ARROW-7585 > URL: https://issues.apache.org/jira/browse/ARROW-7585 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Christian Hudon >Priority: Minor > > I'm trying out Plasma, using plasma-store-server. The first thing I usually > do then is to run the binary without arguments, and that usually gives me a > message showing usage. However, with plasma-store-server, the initial > experience there is a backtrace: > {noformat} > $ ./debug/plasma-store-server > /Users/chrish/Code/arrow/cpp/src/plasma/store.cc:1237: please specify socket > for incoming connections with -s switch > 0 plasma-store-server 0x00010b4d7c04 > _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52 > 1 plasma-store-server 0x00010b4d7b24 > _ZN5arrow4util7CerrLogD2Ev + 100 > 2 plasma-store-server 0x00010b4d7a85 > _ZN5arrow4util7CerrLogD1Ev + 21 > 3 plasma-store-server 0x00010b4d7aa9 > _ZN5arrow4util7CerrLogD0Ev + 25 > 4 plasma-store-server 0x00010b4d7990 > _ZN5arrow4util8ArrowLogD2Ev + 80 > 5 plasma-store-server 0x00010b4d79c5 > _ZN5arrow4util8ArrowLogD1Ev + 21 > 6 plasma-store-server 0x00010b463152 main + 1122 > 7 libdyld.dylib 0x7fff7765a3d5 start + 1 > fish: './debug/plasma-store-server' terminated by signal SIGABRT (Abort) > {noformat} > Also, neither of the "h" or "help" command-line switches is supported, and so > to start plasma-store-server, you either find the doc, or iteratively add > arguments until you stop getting "please specify ..." backtraces. 
> I know it's not a big thing, but it'd be nice if that initial experience was > a little bit more user-friendly. Also submitting this because it feels like a > good first time issue, so I would be very happy to do the work, and would > like to tackle it. I'd like to 1) add --help support that shows all the > options and gives an example with the required ones, and 2) remove the > unnecessary backtraces on normal errors like these in the main() function. > Just asking beforehand here: 1) would this kind of patch be welcome, and 2) > is there a C++ library for command-line option parsing that I could be using. > I can find one on my own, but I'd rather ask here which one would be approved > for using in the Arrow codebase... or should I just stick to getopt() and do > things manually? Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7585) Plasma-store-server does not support --help, shows backtrace on getopt error
Christian Hudon created ARROW-7585: -- Summary: Plasma-store-server does not support --help, shows backtrace on getopt error Key: ARROW-7585 URL: https://issues.apache.org/jira/browse/ARROW-7585 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Reporter: Christian Hudon I'm trying out Plasma, using plasma-store-server. The first thing I usually do then is to run the binary without arguments, and that usually gives me a message showing usage. However, with plasma-store-server, the initial experience there is a backtrace: {noformat} $ ./debug/plasma-store-server /Users/chrish/Code/arrow/cpp/src/plasma/store.cc:1237: please specify socket for incoming connections with -s switch 0 plasma-store-server 0x00010b4d7c04 _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52 1 plasma-store-server 0x00010b4d7b24 _ZN5arrow4util7CerrLogD2Ev + 100 2 plasma-store-server 0x00010b4d7a85 _ZN5arrow4util7CerrLogD1Ev + 21 3 plasma-store-server 0x00010b4d7aa9 _ZN5arrow4util7CerrLogD0Ev + 25 4 plasma-store-server 0x00010b4d7990 _ZN5arrow4util8ArrowLogD2Ev + 80 5 plasma-store-server 0x00010b4d79c5 _ZN5arrow4util8ArrowLogD1Ev + 21 6 plasma-store-server 0x00010b463152 main + 1122 7 libdyld.dylib 0x7fff7765a3d5 start + 1 fish: './debug/plasma-store-server' terminated by signal SIGABRT (Abort) {noformat} Also, neither of the "h" or "help" command-line switches is supported, and so to start plasma-store-server, you either find the doc, or iteratively add arguments until you stop getting "please specify ..." backtraces. I know it's not a big thing, but it'd be nice if that initial experience was a little bit more user-friendly. Also submitting this because it feels like a good first time issue, so I would be very happy to do the work, and would like to tackle it. I'd like to 1) add --help support that shows all the options and gives an example with the required ones, and 2) remove the unnecessary backtraces on normal errors like these in the main() function. 
Just asking beforehand here: 1) would this kind of patch be welcome, and 2) is there a C++ library for command-line option parsing that I could be using. I can find one on my own, but I'd rather ask here which one would be approved for using in the Arrow codebase... or should I just stick to getopt() and do things manually? Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata
[ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016304#comment-17016304 ] Neal Richardson commented on ARROW-7063: >From my perspective, it's not a problem that the metadata isn't printed as >long as I can access it and print it if I choose. I.e. I can {{print(schema)}} >and then {{print(schema.metadata)}} if I want. > [C++] Schema print method prints too much metadata > -- > > Key: ARROW-7063 > URL: https://issues.apache.org/jira/browse/ARROW-7063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Minor > Labels: dataset, parquet > Fix For: 1.0.0 > > > I loaded some taxi data in a Dataset and printed the schema. This is what was > printed: > {code} > vendor_id: string > pickup_at: timestamp[us] > dropoff_at: timestamp[us] > passenger_count: int8 > trip_distance: float > pickup_longitude: float > pickup_latitude: float > rate_code_id: null > store_and_fwd_flag: string > dropoff_longitude: float > dropoff_latitude: float > payment_type: string > fare_amount: float > extra: float > mta_tax: float > tip_amount: float > tolls_amount: float > total_amount: float > -- metadata -- > pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, > "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, > "field_name": null, "pandas_type": "unicode", "numpy_type": "object", > "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", > "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", > "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", > "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, > {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", > "numpy_type": "datetime64[ns]", "metadata": null}, {"name": > "passenger_count", "field_name": "passenger_count", 
"pandas_type": "int8", > "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", > "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": > "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "pickup_latitude", "field_name": > "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", > "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": > "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, {"name": > "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "payment_type", "field_name": "payment_type", "pandas_type": "unicode", > "numpy_type": "object", "metadata": null}, {"name": "fare_amount", > "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "extra", "field_name": "extra", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, > {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", > "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", > "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "tolls_amount", "field_name": > "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "total_amount", "field_name": "total_amount", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], > "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": > "0.25.3"} > ARROW:schema: > 
/3gOAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwEAAgACgAAAFQKAAAEAQwIAAwABAAIAAgsCgAABB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogInZlbmRvcl9pZCIsICJmaWVsZF9uYW1lIjogInZlbmRvcl9pZCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwaWNrdXBfYXQiLCAiZmllbGRfbmFtZSI6ICJwaWNrdXBfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZG
[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata
[ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016303#comment-17016303 ] Neal Richardson commented on ARROW-7063: What I have in mind is in the output in the ticket description: everything above the line that says {{-- metadata --}} > [C++] Schema print method prints too much metadata > -- > > Key: ARROW-7063 > URL: https://issues.apache.org/jira/browse/ARROW-7063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Minor > Labels: dataset, parquet > Fix For: 1.0.0 > > > I loaded some taxi data in a Dataset and printed the schema. This is what was > printed: > {code} > vendor_id: string > pickup_at: timestamp[us] > dropoff_at: timestamp[us] > passenger_count: int8 > trip_distance: float > pickup_longitude: float > pickup_latitude: float > rate_code_id: null > store_and_fwd_flag: string > dropoff_longitude: float > dropoff_latitude: float > payment_type: string > fare_amount: float > extra: float > mta_tax: float > tip_amount: float > tolls_amount: float > total_amount: float > -- metadata -- > pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, > "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, > "field_name": null, "pandas_type": "unicode", "numpy_type": "object", > "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", > "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", > "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", > "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, > {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", > "numpy_type": "datetime64[ns]", "metadata": null}, {"name": > "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", > "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", > 
"field_name": "trip_distance", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": > "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "pickup_latitude", "field_name": > "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", > "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": > "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, {"name": > "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "payment_type", "field_name": "payment_type", "pandas_type": "unicode", > "numpy_type": "object", "metadata": null}, {"name": "fare_amount", > "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "extra", "field_name": "extra", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, > {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", > "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", > "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "tolls_amount", "field_name": > "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "total_amount", "field_name": "total_amount", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], > "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": > "0.25.3"} > ARROW:schema: > 
/3gOAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwEAAgACgAAAFQKAAAEAQwIAAwABAAIAAgsCgAABB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogInZlbmRvcl9pZCIsICJmaWVsZF9uYW1lIjogInZlbmRvcl9pZCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwaWNrdXBfYXQiLCAiZmllbGRfbmFtZSI6ICJwaWNrdXBfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9hdCIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfYXQiLCAic
[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata
[ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016295#comment-17016295 ] Joris Van den Bossche commented on ARROW-7063: -- A reason to have at least some truncated form of it (it can be short), is that two tables/schemas are not equal if their metadata is not equal. So having nothing about it in the simple pretty print can also be quite confusing. > I'm fine with writing it my way in R (i.e. schema print only prints its > fields, assuming I can iterate over the Fields in a Schema and print each), > and if y'all like how that looks, we can consider making that the C++ > behavior. Can you post an example? > [C++] Schema print method prints too much metadata > -- > > Key: ARROW-7063 > URL: https://issues.apache.org/jira/browse/ARROW-7063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Minor > Labels: dataset, parquet > Fix For: 1.0.0 > > > I loaded some taxi data in a Dataset and printed the schema. 
This is what was > printed: > {code} > vendor_id: string > pickup_at: timestamp[us] > dropoff_at: timestamp[us] > passenger_count: int8 > trip_distance: float > pickup_longitude: float > pickup_latitude: float > rate_code_id: null > store_and_fwd_flag: string > dropoff_longitude: float > dropoff_latitude: float > payment_type: string > fare_amount: float > extra: float > mta_tax: float > tip_amount: float > tolls_amount: float > total_amount: float > -- metadata -- > pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, > "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, > "field_name": null, "pandas_type": "unicode", "numpy_type": "object", > "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", > "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", > "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", > "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, > {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", > "numpy_type": "datetime64[ns]", "metadata": null}, {"name": > "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", > "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", > "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": > "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "pickup_latitude", "field_name": > "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", > "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": > "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, {"name": > "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": > 
"float32", "numpy_type": "float32", "metadata": null}, {"name": > "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "payment_type", "field_name": "payment_type", "pandas_type": "unicode", > "numpy_type": "object", "metadata": null}, {"name": "fare_amount", > "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "extra", "field_name": "extra", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, > {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", > "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", > "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "tolls_amount", "field_name": > "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "total_amount", "field_name": "total_amount", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], > "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": > "0.25.3"} > ARROW:schema: > /3gOAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwEAAgACgAAAFQKAAAEAQwIAAwABAAIAAgsCgAABB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJu
[jira] [Commented] (ARROW-7044) [Release] Create a post release script for the home-brew formulas
[ https://issues.apache.org/jira/browse/ARROW-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016290#comment-17016290 ] Neal Richardson commented on ARROW-7044: I'm not sure those instructions are 100% accurate. We're maintaining the homebrew formula (at least cpp) and running nightly crossbow on it, so when we update the upstream homebrew formula, we should be sure to include any changes from the one we maintain in {{dev/tasks/}}. What if (as is the case this time) there is a change in the CMake flags or dependencies? Where do you make those changes? > [Release] Create a post release script for the home-brew formulas > - > > Key: ARROW-7044 > URL: https://issues.apache.org/jira/browse/ARROW-7044 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Priority: Major > Fix For: 0.16.0 > > > The required steps are documented in the release management guide > https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016289#comment-17016289 ] Neal Richardson commented on ARROW-7551: If we can't reproduce this, should we skip this test for now on macOS/on CI? Right now it's just making all the builds fail, and that increases the risk that we'll merge a patch that really breaks things. > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Priority: Critical > Fix For: 0.16.0 > > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7432) [Python] Add higher-level datasets functions
[ https://issues.apache.org/jira/browse/ARROW-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7432: -- Assignee: Joris Van den Bossche (was: Neal Richardson) > [Python] Add higher-level datasets functions > > > Key: ARROW-7432 > URL: https://issues.apache.org/jira/browse/ARROW-7432 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: dataset, pull-request-available > Fix For: 0.16.0 > > Time Spent: 4h > Remaining Estimate: 0h > > From [~kszucs]: We need to define a more pythonic API for the dataset > bindings, because the current one is pretty low-level. > One option is to provide an "open_dataset" function similar to what is > available in R. > A short-cut to go from a Dataset to a Table might also be useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7432) [Python] Add higher-level datasets functions
[ https://issues.apache.org/jira/browse/ARROW-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7432: -- Assignee: Neal Richardson > [Python] Add higher-level datasets functions > > > Key: ARROW-7432 > URL: https://issues.apache.org/jira/browse/ARROW-7432 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Neal Richardson >Priority: Major > Labels: dataset, pull-request-available > Fix For: 0.16.0 > > Time Spent: 4h > Remaining Estimate: 0h > > From [~kszucs]: We need to define a more pythonic API for the dataset > bindings, because the current one is pretty low-level. > One option is to provide an "open_dataset" function similar to what is > available in R. > A short-cut to go from a Dataset to a Table might also be useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5744) [C++] Do not error in Table::CombineChunks for BinaryArray types that overflow 2GB limit
[ https://issues.apache.org/jira/browse/ARROW-5744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5744: --- Priority: Major (was: Critical) > [C++] Do not error in Table::CombineChunks for BinaryArray types that > overflow 2GB limit > > > Key: ARROW-5744 > URL: https://issues.apache.org/jira/browse/ARROW-5744 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Ben Kietzman >Priority: Major > Fix For: 0.16.0 > > > Discovered during ARROW-5635 code review -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5744) [C++] Do not error in Table::CombineChunks for BinaryArray types that overflow 2GB limit
[ https://issues.apache.org/jira/browse/ARROW-5744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5744: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Do not error in Table::CombineChunks for BinaryArray types that > overflow 2GB limit > > > Key: ARROW-5744 > URL: https://issues.apache.org/jira/browse/ARROW-5744 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Ben Kietzman >Priority: Major > Fix For: 1.0.0 > > > Discovered during ARROW-5635 code review -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7269) [C++] Fix arrow::parquet compiler warning
[ https://issues.apache.org/jira/browse/ARROW-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-7269. - Fix Version/s: 0.16.0 Assignee: Wes McKinney Resolution: Fixed Resolved by PARQUET-1769 https://github.com/apache/arrow/commit/1a3b17b8382952465d3902c3edd6252a71ef6c5b > [C++] Fix arrow::parquet compiler warning > - > > Key: ARROW-7269 > URL: https://issues.apache.org/jira/browse/ARROW-7269 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Jiajia Li >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Encountered the compiler warning when building: > [WARNING:/arrow/cpp/src/parquet/parquet.thrift:297] The "byte" type is a > compatibility alias for "i8". Use "i8" to emphasize the signedness of this > type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7063) [C++] Schema print method prints too much metadata
[ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7063: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Schema print method prints too much metadata > -- > > Key: ARROW-7063 > URL: https://issues.apache.org/jira/browse/ARROW-7063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Minor > Labels: dataset, parquet > Fix For: 1.0.0 > > > I loaded some taxi data in a Dataset and printed the schema. This is what was > printed: > {code} > vendor_id: string > pickup_at: timestamp[us] > dropoff_at: timestamp[us] > passenger_count: int8 > trip_distance: float > pickup_longitude: float > pickup_latitude: float > rate_code_id: null > store_and_fwd_flag: string > dropoff_longitude: float > dropoff_latitude: float > payment_type: string > fare_amount: float > extra: float > mta_tax: float > tip_amount: float > tolls_amount: float > total_amount: float > -- metadata -- > pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, > "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, > "field_name": null, "pandas_type": "unicode", "numpy_type": "object", > "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", > "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", > "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", > "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, > {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", > "numpy_type": "datetime64[ns]", "metadata": null}, {"name": > "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", > "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", > "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "pickup_longitude", 
"field_name": > "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "pickup_latitude", "field_name": > "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", > "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": > "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, {"name": > "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "payment_type", "field_name": "payment_type", "pandas_type": "unicode", > "numpy_type": "object", "metadata": null}, {"name": "fare_amount", > "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "extra", "field_name": "extra", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, > {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", > "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", > "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "tolls_amount", "field_name": > "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "total_amount", "field_name": "total_amount", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], > "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": > "0.25.3"} > ARROW:schema: > 
/3gOAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwEAAgACgAAAFQKAAAEAQwIAAwABAAIAAgsCgAABB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogInZlbmRvcl9pZCIsICJmaWVsZF9uYW1lIjogInZlbmRvcl9pZCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwaWNrdXBfYXQiLCAiZmllbGRfbmFtZSI6ICJwaWNrdXBfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9hdCIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAic
[jira] [Resolved] (ARROW-6863) [Java] Provide parallel searcher
[ https://issues.apache.org/jira/browse/ARROW-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6863. Fix Version/s: 0.16.0 Resolution: Fixed Issue resolved by pull request 5631 [https://github.com/apache/arrow/pull/5631] > [Java] Provide parallel searcher > > > Key: ARROW-6863 > URL: https://issues.apache.org/jira/browse/ARROW-6863 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > For scenarios where the vector is large and a low response time is > required, we need to search the vector in parallel to improve > responsiveness. > This issue tries to provide a parallel searcher for the equality semantics > (the support for ordering semantics is not ready yet, as we need a way to > distribute the comparator). > The implementation is based on multi-threading. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5914) [CI] Build bundled dependencies in docker build step
[ https://issues.apache.org/jira/browse/ARROW-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5914: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [CI] Build bundled dependencies in docker build step > > > Key: ARROW-5914 > URL: https://issues.apache.org/jira/browse/ARROW-5914 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Francois Saint-Jacques >Priority: Minor > Fix For: 1.0.0 > > > In the recently introduced ARROW-5803, some heavy dependencies (thrift, > protobuf, flatbuffers, grpc) are built at each invocation of docker-compose > build (thus each Travis test). > We should aim to build the third-party dependencies in the docker build phase > instead, to exploit caching and docker-compose pull so that the CI step > doesn't need to build said dependencies each time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-4226) [Format][C++] Add CSF sparse tensor support
[ https://issues.apache.org/jira/browse/ARROW-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-4226: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [Format][C++] Add CSF sparse tensor support > --- > > Key: ARROW-4226 > URL: https://issues.apache.org/jira/browse/ARROW-4226 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Format >Reporter: Kenta Murata >Assignee: Rok Mihevc >Priority: Minor > Labels: sparse > Fix For: 1.0.0 > > > [https://github.com/apache/arrow/pull/2546#pullrequestreview-156064172] > {quote}Perhaps in the future, if zero-copy and future-proof-ness is really > what we want, we might want to add the CSF (compressed sparse fiber) format, > a generalisation of CSR/CSC. I'm currently working on adding it to > PyData/Sparse, and I plan to make it the preferred format (COO will still be > around though). > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6528) [C++] Spurious Flight test failures (port allocation failure)
[ https://issues.apache.org/jira/browse/ARROW-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6528: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Spurious Flight test failures (port allocation failure) > - > > Key: ARROW-6528 > URL: https://issues.apache.org/jira/browse/ARROW-6528 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > Seems like our port allocation scheme inside unit tests is still not very > reliable :-/ > https://ci.ursalabs.org/#/builders/71/builds/4147/steps/8/logs/stdio > {code} > [--] 3 tests from TestMetadata > [ RUN ] TestMetadata.DoGet > E0905 12:45:40.322644527 10203 server_chttp2.cc:40] > {"created":"@1567687540.322612245","description":"No address added out of > total 1 > resolved","file":"../src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1567687540.322609844","description":"Unable > to configure > socket","fd":7,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1567687540.322602634","description":"Address > already in > use","errno":98,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address > already in use","syscall":"bind"}]}]} > ../src/arrow/flight/flight_test.cc:429: Failure > Failed > 'server->Init(options)' failed with Unknown error: Server did not start > properly > /buildbot/AMD64_Conda_Python_3_7/cpp/build-support/run-test.sh: line 97: > 10203 Segmentation fault (core dumped) $TEST_EXECUTABLE "$@" 2>&1 > 10204 Done| $ROOT/build-support/asan_symbolize.py > 10205 Done| ${CXXFILT:-c++filt} > 10206 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 10207 Done| $pipe_cmd 2>&1 > 10208 Done| tee $LOGFILE > /buildbot/AMD64_Conda_Python_3_7/cpp/build/src/arrow/flight > {code} -- This message was sent by Atlassian Jira 
(v8.3.4#803005)
[jira] [Updated] (ARROW-6501) [C++] Remove non_zero_length field from SparseIndex
[ https://issues.apache.org/jira/browse/ARROW-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6501: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Remove non_zero_length field from SparseIndex > --- > > Key: ARROW-6501 > URL: https://issues.apache.org/jira/browse/ARROW-6501 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Fix For: 1.0.0 > > > We can remove the non_zero_length field from SparseIndex because it can be > derived from the shape of the indices tensor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6393) [C++] Add EqualOptions support in SparseTensor::Equals
[ https://issues.apache.org/jira/browse/ARROW-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6393: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Add EqualOptions support in SparseTensor::Equals > - > > Key: ARROW-6393 > URL: https://issues.apache.org/jira/browse/ARROW-6393 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Fix For: 1.0.0 > > > SparseTensor::Equals should take an EqualOptions argument as Tensor::Equals does. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6312) [C++] Declare required Libs.private in arrow.pc package config
[ https://issues.apache.org/jira/browse/ARROW-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6312: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Declare required Libs.private in arrow.pc package config > -- > > Key: ARROW-6312 > URL: https://issues.apache.org/jira/browse/ARROW-6312 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.1 >Reporter: Michael Maguire >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The current arrow.pc package config file produced is deficient: it doesn't > properly declare the static library prerequisites that must be linked in > order to *statically* link in libarrow.a. > Currently it just has: > ``` > Libs: -L${libdir} -larrow > ``` > But in cases where, e.g., you enable snappy, brotli or zlib support in arrow, > our toolchains need to see an arrow.pc file more like: > ``` > Libs: -L${libdir} -larrow > Libs.private: -lsnappy -lboost_system -lz -llz4 -lbrotlidec -lbrotlienc > -lbrotlicommon -lzstd > ``` > If not, we get linkage errors. I'm told the convention is that if the .a has > an UNDEF, the Requires.private plus the Libs.private should resolve all the > undefs. See the Libs.private info in [https://linux.die.net/man/1/pkg-config] > > Note, however, as Sutou Kouhei pointed out in > [https://github.com/apache/arrow/pull/5123#issuecomment-522771452,] the > additional Libs.private need to be dynamically generated based on whether > functionality like snappy, brotli or zlib is enabled. -- This message was sent by Atlassian Jira (v8.3.4#803005)
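[Editor's note] For reference, a complete .pc file of the shape ARROW-6312 asks for might look like the fragment below. The library list is illustrative only (it would be generated from the enabled features of a particular build, as the ticket notes), and the paths are assumptions:

```
prefix=/usr/local
libdir=${prefix}/lib
includedir=${prefix}/include

Name: Apache Arrow
Description: Columnar in-memory analytics library
Version: 0.15.1
Libs: -L${libdir} -larrow
Libs.private: -lsnappy -lz -llz4 -lzstd -lbrotlienc -lbrotlidec -lbrotlicommon
Cflags: -I${includedir}
```

With such a file in place, a consumer linking statically would invoke `pkg-config --libs --static arrow`, which appends the Libs.private entries to the link line; without --static, only the public Libs line is emitted.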
[jira] [Updated] (ARROW-7121) [C++][CI][Windows] Enable more features on the windows GHA build
[ https://issues.apache.org/jira/browse/ARROW-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7121: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++][CI][Windows] Enable more features on the windows GHA build > > > Key: ARROW-7121 > URL: https://issues.apache.org/jira/browse/ARROW-7121 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > Fix For: 1.0.0 > > > Like `ARROW_GANDIVA: ON`, `ARROW_FLIGHT: ON`, `ARROW_PARQUET: ON` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7049) [C++] warnings building on mingw-w64
[ https://issues.apache.org/jira/browse/ARROW-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7049: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] warnings building on mingw-w64 > > > Key: ARROW-7049 > URL: https://issues.apache.org/jira/browse/ARROW-7049 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.1 >Reporter: Jeroen >Priority: Minor > Fix For: 1.0.0 > > > Two warnings when building libarrow 0.15.1 on mingw-w64: > {code} > [ 2%] Running thrift compiler on parquet.thrift > [WARNING:C:/msys64/home/mingw-packages/mingw-w64-arrow/src/apache-arrow-0.15.1/cpp/src/parquet/parquet.thrift:297] > The "byte" type is a compatibility alias for "i8". Use "i8" to emphasize the > signedness of this type. > {code} > And later: > {code} > [ 81%] Building CXX object > src/parquet/CMakeFiles/parquet_static.dir/column_reader.cc.obj > C:/msys64/home/mingw-packages/mingw-w64-arrow/src/apache-arrow-0.15.1/cpp/src/parquet/arrow/writer.cc: > In member function 'virtual arrow::Status > parquet::arrow::FileWriterImpl::WriteColumnChunk(const > std::shared_ptr<ChunkedArray>&, int64_t, int64_t)': > C:/msys64/home/mingw-packages/mingw-w64-arrow/src/apache-arrow-0.15.1/cpp/src/parquet/arrow/writer.cc:79:41: > warning: 'schema_field' may be used uninitialized in this function > [-Wmaybe-uninitialized] > schema_manifest_(schema_manifest) {} > ^ > C:/msys64/home/mingw-packages/mingw-w64-arrow/src/apache-arrow-0.15.1/cpp/src/parquet/arrow/writer.cc:466:24: > note: 'schema_field' was declared here > const SchemaField* schema_field; > {code} > Maybe CI with `CXXFLAGS += -Werror` ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7080) [Python][Parquet] Expose parquet field_id in Schema objects
[ https://issues.apache.org/jira/browse/ARROW-7080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7080: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [Python][Parquet] Expose parquet field_id in Schema objects > --- > > Key: ARROW-7080 > URL: https://issues.apache.org/jira/browse/ARROW-7080 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Ted Gooch >Priority: Major > Labels: parquet > Fix For: 1.0.0 > > > I'm in the process of adding parquet read support to > Iceberg([https://iceberg.apache.org/]), and we use the parquet field_ids as a > consistent id when reading a parquet file to create a map between the current > schema and the schema of the file being read. Unless I've missed something, > it appears that field_id is not exposed in the python APIs in > pyarrow._parquet.ParquetSchema nor is it available in pyarrow.lib.Schema. > Would it be possible to add this to either of those two objects? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas
[ https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7365: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [Python] Support FixedSizeList type in conversion to numpy/pandas > - > > Key: ARROW-7365 > URL: https://issues.apache.org/jira/browse/ARROW-7365 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > Follow-up on ARROW-7261, still need to add support for FixedSizeListType in > the arrow -> python conversion (arrow_to_pandas.cc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7332) [C++][Parquet] Explicitly catch status exceptions in PARQUET_CATCH_NOT_OK
[ https://issues.apache.org/jira/browse/ARROW-7332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7332: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++][Parquet] Explicitly catch status exceptions in PARQUET_CATCH_NOT_OK > - > > Key: ARROW-7332 > URL: https://issues.apache.org/jira/browse/ARROW-7332 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Minor > Fix For: 1.0.0 > > > PARQUET_THROW_NOT_OK throws a ParquetStatusException, which contains a full > Status rather than just an error string. These could be caught explicitly in > PARQUET_CATCH_NOT_OK and the original status returned rather than creating a > new status: > {code} > } catch (const ::parquet::ParquetStatusException& e) { \ > return e.status(); \ > } catch (const ::parquet::ParquetException& e) { \ > return Status::IOError(e.what()) \ > {code} > This will retain the original StatusCode rather than overwriting it with > IOError. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7499) [C++] CMake should collect libs when making static build
[ https://issues.apache.org/jira/browse/ARROW-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7499: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] CMake should collect libs when making static build > > > Key: ARROW-7499 > URL: https://issues.apache.org/jira/browse/ARROW-7499 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Neal Richardson >Assignee: Kouhei Sutou >Priority: Major > Fix For: 1.0.0 > > > From https://github.com/apache/arrow/pull/6068/files#r360672071: > {code} > # Copy the bundled static libs from the build to the install dir > find . -regex .*/.*/lib/.*\\.a\$ | xargs -I{} cp -u {} ${DEST_DIR}/lib > {code} > {quote}I think that we should do this by CMake when -DARROW_BUILD_STATIC=ON > is specified. > ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}/arrow/vendored/libXXX.a may > be better for the installed path to avoid conflict.{quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7537) [CI][R] Nightly macOS autobrew job should be more verbose if it fails
[ https://issues.apache.org/jira/browse/ARROW-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7537. Resolution: Fixed Issue resolved by pull request 6155 [https://github.com/apache/arrow/pull/6155] > [CI][R] Nightly macOS autobrew job should be more verbose if it fails > - > > Key: ARROW-7537 > URL: https://issues.apache.org/jira/browse/ARROW-7537 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Things like > https://travis-ci.org/ursa-labs/crossbow/builds/634643469#L673-L676 are hard > to debug because the installation log is not printed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7578) [R] Add support for datasets with IPC files and with multiple sources
[ https://issues.apache.org/jira/browse/ARROW-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7578: -- Labels: pull-request-available (was: ) > [R] Add support for datasets with IPC files and with multiple sources > - > > Key: ARROW-7578 > URL: https://issues.apache.org/jira/browse/ARROW-7578 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`
[ https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6895: -- Priority: Critical (was: Major) > [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader > repeats returned values when calling `NextBatch()` > --- > > Key: ARROW-6895 > URL: https://issues.apache.org/jira/browse/ARROW-6895 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.0 > Environment: Linux 5.2.17-200.fc30.x86_64 (Docker) >Reporter: Adam Hooper >Assignee: Francois Saint-Jacques >Priority: Critical > Fix For: 0.16.0 > > Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet > > > Given most columns, I can run a loop like: > {code:cpp} > std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/); > while (nRowsRemaining > 0) { > int n = std::min(100, nRowsRemaining); > std::shared_ptr<arrow::ChunkedArray> chunkedArray; > auto status = columnReader->NextBatch(n, &chunkedArray); > // ... and then use `chunkedArray` > nRowsRemaining -= n; > } > {code} > (The context is: "convert Parquet to CSV/JSON, with small memory footprint." > Used in https://github.com/CJWorkbench/parquet-to-arrow) > Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; > the second return value looks like {{val100...val199}}; and so on. > ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The > first {{NextBatch()}} return value looks like {{val0...val100}}; the second > return value looks like {{val0...val99, val100...val199}} (ChunkedArray with > two arrays); the third return value looks like {{val0...val99, > val100...val199, val200...val299}} (ChunkedArray with three arrays); and so > on. The returned arrays are never cleared. > In sum: {{NextBatch()}} on a dictionary column reader returns the wrong > values.
> I've attached a minimal Parquet file that presents this problem with the > above code; and I've written a patch that fixes this one case, to illustrate > where things are wrong. I don't think I understand enough edge cases to > decree that my patch is a correct fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7093) [R] Support creating ScalarExpressions for more data types
[ https://issues.apache.org/jira/browse/ARROW-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7093: -- Assignee: Romain Francois (was: Neal Richardson) > [R] Support creating ScalarExpressions for more data types > -- > > Key: ARROW-7093 > URL: https://issues.apache.org/jira/browse/ARROW-7093 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Romain Francois >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > See > https://github.com/apache/arrow/blob/master/r/src/expression.cpp#L93-L107. > ARROW-6340 was limited to integer/double/logical. This will let us make > dataset filter expressions with all those other types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7093) [R] Support creating ScalarExpressions for more data types
[ https://issues.apache.org/jira/browse/ARROW-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7093: -- Assignee: Neal Richardson (was: Romain Francois)
[jira] [Updated] (ARROW-7538) Clarify actual and desired size in AllocationManager
[ https://issues.apache.org/jira/browse/ARROW-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7538: -- Labels: pull-request-available (was: ) > Clarify actual and desired size in AllocationManager > > > Key: ARROW-7538 > URL: https://issues.apache.org/jira/browse/ARROW-7538 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: David Li >Priority: Major > Labels: pull-request-available > > As a follow up to the review of ARROW-7329, we should clarify the different > sizes (desired vs actual size) in AllocationManager: > https://github.com/apache/arrow/pull/5973#discussion_r354729754 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016106#comment-17016106 ] Neal Richardson commented on ARROW-7518: Both GHA cron and crossbow nightly sound redundant. Given our current state of the art, I think we should prefer crossbow (better reporting, ability to trigger on demand). > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Reporter: Wes McKinney > Assignee: Krisztian Szucs > Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This new module is not enabled in the package builds
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016063#comment-17016063 ] Antoine Pitrou commented on ARROW-7584: --- I think the plan would be to have a high-level function that takes a URI or a list of URIs and then constructs a dataset reader from them. Those URIs could point to simple files or partitioned data sources. That high-level function doesn't exist yet, though.
> [Python] Improve ergonomics of new FileSystem API
> Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python > Reporter: Fabian Höring > Priority: Major > Labels: FileSystem
> The [new Python FileSystem API|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html]
> h2. Here are some examples
> *Filesystem access:*
> Before: fs.ls() fs.mkdir() fs.rmdir()
> Now: fs.get_target_stats() fs.create_dir() fs.delete_dir()
> What is the advantage of the longer method names? The short ones seem clear and are much easier to use, so this looks like an easy change. It is also consistent with what hdfs does in the [fs api|https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem.
> *File opening:*
> Before: with fs.open(self, path, mode=u'rb', buffer_size=None)
> Now: fs.open_input_file() fs.open_input_stream() fs.open_output_stream()
> It seems more natural to follow Python's standard open function, which works for local file access as well. Not sure if this is easy to do, as there is the `_wrap_output_stream` method.
> h2. Possible solutions
> - If the current Python API is still unused we could just rename the methods
> - We could keep everything as is and add some alias methods; it would make the FileSystem class a bit messy, I think, because there would always be 2 methods to do the same work
> - Make everything compatible with fsspec and reference the spec, see https://issues.apache.org/jira/browse/ARROW-7102. I like the idea of the https://github.com/intake/filesystem_spec repo. Some comments on the solutions proposed there:
> Make an fsspec wrapper for pyarrow.fs => seems strange to me; it would mean wrapping, in yet another repo, a FileSystem that is not good enough
> Make a pyarrow.fs wrapper for fsspec => fine I think if the wrapper becomes the documented "official" pyarrow FileSystem; otherwise it would be yet another wrapper on top of the pyarrow "official" fs
> h2. Tensorflow RFC on FileSystems
> Tensorflow is also doing some standardization work on their FileSystem: https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations
> It is not clear (to me) what they will do with the Python file API, though. It seems they will also just wrap the C code back into [tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]
> h2. Other considerations on FS ergonomics
> In the long run I would also like to enhance the FileSystem API and add more methods that use the basic ones to provide new features, for example:
> - introduce put and get on top of the streams that directly upload/download files
> - introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from dask/hdfs3
> - introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] from dask/hdfs3
> - check if the selector works with globs, or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
> - be able to write strings to the file streams (instead of only bytes; already implemented by https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96), which would permit directly using Python APIs like json.dump:
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     json.dump(res, fd)
> {code}
> instead of
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     fd.write(json.dumps(res))
> {code}
> or, as currently (with the old API, which required encoding each time; untested with the new one):
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     fd.write(json.dumps(res).encode())
> {code}
> - not clear how to make this also work when reading from files
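The "write strings, not only bytes" wish at the end of the list above can be approximated today without changing the stream API: wrapping the binary stream in a text adapter lets json.dump write to it directly. A sketch, with io.BytesIO standing in for a binary stream returned by fs.open(path, "wb"):

```python
import io
import json

# io.BytesIO stands in for the binary stream a filesystem would return.
raw = io.BytesIO()

# TextIOWrapper adapts a binary stream to a text interface; write_through
# pushes each write straight to the underlying buffer.
text = io.TextIOWrapper(raw, encoding="utf-8", write_through=True)

json.dump({"a": "bc"}, text)  # no manual .encode() needed
text.flush()

assert raw.getvalue() == b'{"a": "bc"}'
```

For reading, `io.TextIOWrapper` works the same way around a readable binary stream, which addresses the final open question in the list.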
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016057#comment-17016057 ] Fabian Höring commented on ARROW-7584: -- IMO each language has its specificities. While it is a good idea to have a consistent API across languages, trying to do (exactly) the same thing in each language will also confuse people. I don't know Arrow very well, but protobuf, for example, doesn't use the same wrappers in C# and Java. Can you explain how you intend to use this for reading Parquet from Python, for example?
[jira] [Comment Edited] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016054#comment-17016054 ] Fabian Höring edited comment on ARROW-7584 at 1/15/20 2:53 PM: --- I agree about not needing to support transactions. Something lightweight would be better. was (Author: fhoering): I agree for better not to support transactions. Something lightweight would be better.
[jira] [Comment Edited] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016054#comment-17016054 ] Fabian Höring edited comment on ARROW-7584 at 1/15/20 2:53 PM: --- I agree for better not to support transactions. Something lightweight would be better. was (Author: fhoering): I agree for transactions. Something lightweight would be better.
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016054#comment-17016054 ] Fabian Höring commented on ARROW-7584: -- I agree for transactions. Something lightweight would be better.
[jira] [Commented] (ARROW-7583) [C++][Flight] Auth handler tests fragile on Windows
[ https://issues.apache.org/jira/browse/ARROW-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016039#comment-17016039 ] Antoine Pitrou commented on ARROW-7583: --- Or perhaps we can relax the test, along with an explanatory comment? > [C++][Flight] Auth handler tests fragile on Windows > --- > > Key: ARROW-7583 > URL: https://issues.apache.org/jira/browse/ARROW-7583 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Antoine Pitrou >Priority: Minor > > This occurs often on AppVeyor: > {code} > [--] 3 tests from TestAuthHandler > [ RUN ] TestAuthHandler.PassAuthenticatedCalls > [ OK ] TestAuthHandler.PassAuthenticatedCalls (4 ms) > [ RUN ] TestAuthHandler.FailUnauthenticatedCalls > ..\src\arrow\flight\flight_test.cc(1126): error: Value of: status.message() > Expected: has substring "Invalid token" > Actual: "Could not write record batch to stream: " > [ FAILED ] TestAuthHandler.FailUnauthenticatedCalls (3 ms) > [ RUN ] TestAuthHandler.CheckPeerIdentity > [ OK ] TestAuthHandler.CheckPeerIdentity (2 ms) > [--] 3 tests from TestAuthHandler (10 ms total) > [--] 3 tests from TestBasicAuthHandler > [ RUN ] TestBasicAuthHandler.PassAuthenticatedCalls > [ OK ] TestBasicAuthHandler.PassAuthenticatedCalls (4 ms) > [ RUN ] TestBasicAuthHandler.FailUnauthenticatedCalls > ..\src\arrow\flight\flight_test.cc(1224): error: Value of: status.message() > Expected: has substring "Invalid token" > Actual: "Could not write record batch to stream: " > [ FAILED ] TestBasicAuthHandler.FailUnauthenticatedCalls (4 ms) > [ RUN ] TestBasicAuthHandler.CheckPeerIdentity > [ OK ] TestBasicAuthHandler.CheckPeerIdentity (3 ms) > [--] 3 tests from TestBasicAuthHandler (11 ms total) > {code} > See e.g. > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/30110376/job/vbtd22813g5hlgfl#L2252 -- This message was sent by Atlassian Jira (v8.3.4#803005)
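The relaxation Antoine suggests could accept either of the two observed messages and document why both indicate a rejected call. A hedged Python sketch of just the assertion logic (the real tests are C++ gtest; the function name here is hypothetical):

```python
def assert_unauthenticated(message):
    """Pass if the error message indicates the unauthenticated call was rejected.

    Hypothetical helper: on Windows the client stream can fail with a generic
    write error before the server's "Invalid token" response is read, so both
    messages are treated as evidence of rejection.
    """
    accepted = ("Invalid token", "Could not write record batch to stream")
    assert any(s in message for s in accepted), message

# Both outcomes seen in the CI logs above are accepted.
assert_unauthenticated("Invalid token")
assert_unauthenticated("Could not write record batch to stream: ")
```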
[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated ARROW-7584: - Description: The [new Python FileSystem API |https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems to be very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html] h2. Here are some examples *Filesystem access:* Before: fs.ls() fs.mkdir() fs.rmdir() Now: fs.get_target_stats() fs.create_dir() fs.delete_dir() What is the advantage of having a longer method ? The short ones seem clear and are much easier to use. Seems like an easy change. Also this is consistent with what is doing hdfs in the [fs api| https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem. *File opening:* Before: with fs.open(self, path, mode=u'rb', buffer_size=None) Now: fs.open_input_file() fs.open_input_stream() fs.open_output_stream() It seems more natural to fit to Python standard open function which works for local file access as well. Not sure if this is possible to do easily as there is `_wrap_output_stream` method. h2. Possible solutions - If the current Python API is still unused we could just rename the methods - We could keep everything as is and add some alias methods, it would make the FileSystem class a bit messy I think becasue there would be always 2 methods to do the work - Make everything compatible to FSSpec and reference the Spec, see https://issues.apache.org/jira/browse/ARROW-7102, I like the idea of a https://github.com/intake/filesystem_spec repo. 
Some comments on the proposed solutions there: Make a fsspec wrapper for pyarrow.fs => seems strange to me, it would be having to wrap again a FileSystem that is not good enough in yet another repo Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the documented "official" pyarow FileSystem it is fine I think, otherwise I would be yet another wrapper on top of the pyarrow "official" fs h2. Tensorflow RFC on FileSystems Tensorflow is also doing some standardization work on their FileSystem: https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations Not clear (to me) what they will do with Python file API though. it seems like they will also just wrap the C code back to [tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile] h2. Other considerations on FS ergonomics In the long run I would also like to enhance the FileSystem API and add more methods that use the basic ones to provide new features for example: - introduce put and get on top of the streams that directly upload/download files - introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from dask/hdfs3 - introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] from dask/hdfs3 - check if selector works with globs or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349 - be able to write strings to the file streams (instead of only bytes, already implemented by https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96), it would permit to directly use some Python API's like json.dump {code} with fs.open(path, "wb") as fd: res = {"a": "bc"} json.dump(res, fd) {code} instead of {code} with fs.open(path, "wb") as fd: res = {"a": "bc"} fd.write(json.dumps(res)) {code} or like currently (with old API, which required encore each time, untested with new one) {code}with fs.open(path, "wb") as fd: res = {"a": "bc"} fd.write(json.dumps(res).encode()) {code} - not 
clear how to make this also work when reading from files
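The last point above, writing strings (and reading them back) through bytes-only streams, can already be approximated with stdlib Python's io.TextIOWrapper. A minimal sketch, using an in-memory BytesIO in place of a real pyarrow output/input stream; the wrapping approach is an assumption for illustration, not something pyarrow provides today:

```python
import io
import json

# BytesIO stands in here for a filesystem's binary output stream.
buf = io.BytesIO()

# TextIOWrapper presents the bytes stream as a str stream, so
# str-based APIs like json.dump can write to it directly.
writer = io.TextIOWrapper(buf, encoding="utf-8")
json.dump({"a": "bc"}, writer)
writer.flush()

# The same wrapping works in the read direction for json.load.
buf.seek(0)
reader = io.TextIOWrapper(buf, encoding="utf-8")
assert json.load(reader) == {"a": "bc"}
```

If the FileSystem file objects are proper io.BufferedIOBase subclasses, the same wrapping would apply to them unchanged.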
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016020#comment-17016020 ] Antoine Pitrou commented on ARROW-7584: --- Keep in mind that the filesystem API is meant mainly to be used in conjunction with other Arrow facilities, primarily the new datasets facility (which may not be documented yet?). Making it nicer to use is a respectable goal as well, but the primary goal should be kept in mind. As for fsspec, the abstract filesystem API is so big that it doesn't seem very convenient to implement (e.g. do we have to support transactions?). > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > Labels: FileSystem > -- This message was sent by Atlassian Jira (v8.3.4#803005)
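The open()-style entry point discussed in this issue could be sketched as a thin dispatcher over the new-style stream methods. `fs_open` and the stub filesystem below are hypothetical names for illustration; only `open_input_stream`/`open_output_stream` are taken from the issue text:

```python
import io

class _StubFS:
    # In-memory stand-in for a pyarrow-style filesystem; it only
    # provides the two stream methods the sketch dispatches to.
    def __init__(self):
        self._files = {}

    def open_output_stream(self, path):
        buf = self._files[path] = io.BytesIO()
        return buf

    def open_input_stream(self, path):
        return io.BytesIO(self._files[path].getvalue())

def fs_open(fs, path, mode="rb"):
    # open()-style dispatch: the mode string picks the stream method.
    if mode in ("r", "rb"):
        return fs.open_input_stream(path)
    if mode in ("w", "wb"):
        return fs.open_output_stream(path)
    raise ValueError("unsupported mode: %r" % mode)

fs = _StubFS()
fs_open(fs, "/data/x", "wb").write(b"payload")
assert fs_open(fs, "/data/x", "rb").read() == b"payload"
```

A real implementation would also have to decide how text modes and buffer_size map onto the underlying methods.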
[jira] [Updated] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7518: -- Labels: pull-request-available (was: ) > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > > This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016010#comment-17016010 ] Fabian Höring commented on ARROW-7584: -- As indicated, I can work on that if there is consensus on what needs to be done and PRs will be accepted. If I were to choose I would - rename all new methods to stick to the old ones - make open() work - add some useful methods from dask/hdfs3 (that we use on our side). I also like the idea of fsspec, but not if it will be yet another wrapper. Only if pyarrow pulls in the spec for real and implements it (it would introduce a new dependency though) > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > Labels: FileSystem > -- This message was sent by Atlassian Jira (v8.3.4#803005)
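The alias option mentioned in this thread could be kept out of the main class body with a mixin, so the long-form methods stay the single implementation. The class names below are hypothetical; only the method names (ls/mkdir/rmdir and get_target_stats/create_dir/delete_dir) come from the issue:

```python
class ShortNameMixin:
    # Short aliases delegating to the longer new-style names.
    def ls(self, path):
        return self.get_target_stats(path)

    def mkdir(self, path):
        return self.create_dir(path)

    def rmdir(self, path):
        return self.delete_dir(path)

class RecordingFS(ShortNameMixin):
    # Minimal stand-in that records which long-form method ran.
    def __init__(self):
        self.calls = []

    def get_target_stats(self, path):
        self.calls.append(("get_target_stats", path))

    def create_dir(self, path):
        self.calls.append(("create_dir", path))

    def delete_dir(self, path):
        self.calls.append(("delete_dir", path))

fs = RecordingFS()
fs.ls("/data")
fs.mkdir("/data/new")
fs.rmdir("/data/old")
assert fs.calls == [("get_target_stats", "/data"),
                    ("create_dir", "/data/new"),
                    ("delete_dir", "/data/old")]
```

The downside noted in the issue still holds: every operation ends up with two public names.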
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016007#comment-17016007 ] Antoine Pitrou commented on ARROW-7584: --- Also cc [~jorisvandenbossche] for advice. > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > Labels: FileSystem > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated ARROW-7584: - Labels: FileSystem (was: ) > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > Labels: FileSystem > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016004#comment-17016004 ] Krisztian Szucs commented on ARROW-7518: We have a GHA cron job and a crossbow nightly to test the hdfs integration. > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Blocker > Fix For: 0.16.0 > > > This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7119) [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces
[ https://issues.apache.org/jira/browse/ARROW-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-7119: -- Assignee: Krisztian Szucs > [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces > --- > > Key: ARROW-7119 > URL: https://issues.apache.org/jira/browse/ARROW-7119 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The script was previously used on Travis, we should enable it in docker and > on GitHub actions to speed up the debugging process. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7576) [C++][Dev] Improve fuzzing setup
[ https://issues.apache.org/jira/browse/ARROW-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-7576. --- Fix Version/s: 0.16.0 Resolution: Fixed Issue resolved by pull request 6195 [https://github.com/apache/arrow/pull/6195] > [C++][Dev] Improve fuzzing setup > > > Key: ARROW-7576 > URL: https://issues.apache.org/jira/browse/ARROW-7576 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Developer Tools >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015999#comment-17015999 ] Fabian Höring commented on ARROW-7584: -- [~apitrou] [~kszucs] > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7576) [C++][Dev] Improve fuzzing setup
[ https://issues.apache.org/jira/browse/ARROW-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-7576: - Assignee: Antoine Pitrou > [C++][Dev] Improve fuzzing setup > > > Key: ARROW-7576 > URL: https://issues.apache.org/jira/browse/ARROW-7576 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Developer Tools >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7119) [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces
[ https://issues.apache.org/jira/browse/ARROW-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7119: -- Labels: pull-request-available (was: ) > [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces > --- > > Key: ARROW-7119 > URL: https://issues.apache.org/jira/browse/ARROW-7119 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > The script was previously used on Travis; we should enable it in Docker and > on GitHub Actions to speed up the debugging process.
[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated ARROW-7584:
Description:
The [new Python FileSystem API|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html]
h2. Here are some examples
*Filesystem access:*
Before: fs.ls() fs.mkdir() fs.rmdir()
Now: fs.get_target_stats() fs.create_dir() fs.delete_dir()
What is the advantage of having a longer method? The short ones seem clear and are much easier to use. Seems like an easy change. Also, this is consistent with what hdfs does in the [fs api|https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem.
*File opening:*
Before: with fs.open(self, path, mode=u'rb', buffer_size=None)
Now: fs.open_input_file() fs.open_input_stream() fs.open_output_stream()
It seems more natural to fit the standard Python open function, which works for local file access as well. Not sure if this is possible to do easily as there is the `_wrap_output_stream` method.
h2. Possible solutions
- If the current Python API is still unused, we could just rename the methods
- We could keep everything as is and add some alias methods; I think it would make the FileSystem class a bit messy because there would always be 2 methods to do the same work
- Make everything compatible with FSSpec and reference the spec, see https://issues.apache.org/jira/browse/ARROW-7102. I like the idea of the https://github.com/intake/filesystem_spec repo.
Some comments on the proposed solutions there:
Make an fsspec wrapper for pyarrow.fs => seems strange to me; it would mean wrapping, in yet another repo, a FileSystem that is not good enough
Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the documented "official" pyarrow FileSystem it is fine I think; otherwise it would be yet another wrapper on top of the "official" pyarrow fs
h2. Tensorflow RFC on FileSystems
Tensorflow is also doing some standardization work on their FileSystem: https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations
Not clear (to me) what they will do with the Python file API, though. It seems like they will also just wrap the C code back to [tf.GFile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]
h2. Other considerations on FS ergonomics
In the long run I would also like to enhance the FileSystem API and add more methods that use the basic ones to provide new features, for example:
- introduce put and get on top of the streams that directly upload/download files
- introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from dask/hdfs3
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] from dask/hdfs3
- check if the selector works with globs or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes); it would permit directly using some Python APIs like json.dump
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    json.dump(res, fd)
{code}
instead of
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res))
{code}
or like currently (with the old API, which required encoding each time; untested with the new one)
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res).encode())
{code}
- not clear how to make this also work when reading from files
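The "alias methods" option mentioned in the issue can be sketched in plain Python: short POSIX-style names that are simple class-body aliases of the longer ones, so both spellings resolve to the same function object. The `FileSystem` class and its method bodies below are stand-ins for illustration, not the real pyarrow.fs implementation, and the issue itself notes the downside that two names per operation can feel messy.

```python
# Sketch of the alias approach. FileSystem is a hypothetical stand-in;
# only the naming pattern mirrors the methods quoted in the issue.
class FileSystem:
    def get_target_stats(self, paths):
        # dummy implementation for illustration
        return [{"path": p, "type": "file"} for p in paths]

    def create_dir(self, path):
        # dummy implementation for illustration
        return "created " + path

    # Class-body assignments create the short aliases without duplicating logic:
    ls = get_target_stats
    mkdir = create_dir


fs = FileSystem()
# Both spellings call the same underlying method:
assert fs.ls(["a", "b"]) == fs.get_target_stats(["a", "b"])
assert fs.mkdir("x") == "created x"
```

This keeps a single implementation per operation; whether the aliases would be documented or merely tolerated is the open design question in the issue.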
[jira] [Commented] (ARROW-7583) [C++][Flight] Auth handler tests fragile on Windows
[ https://issues.apache.org/jira/browse/ARROW-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015984#comment-17015984 ] David Li commented on ARROW-7583:
Interesting. It seems an error from a path where gRPC gives a boolean success result (rather than a full error message) is masking the actual, intended error later on. My guess is that this happens:
# We start the DoPut, but this does not actually write any data.
# We close the writer right away. {{RecordBatchPayloadWriter::Close}} calls {{CheckStarted}} first, which tries to write the schema.
# If the test server responds quickly enough, the write fails. gRPC reports errors in this path as a boolean, so we report a generic error instead of the actual error message. If the test server doesn't respond quickly enough, then the write appears to succeed (IIRC gRPC does some sort of buffering?) and then we actually close the stream, at which point we get the intended error message.
The solution might be to bypass RecordBatchPayloadWriter's close and go directly to our implementation.
> [C++][Flight] Auth handler tests fragile on Windows > --- > > Key: ARROW-7583 > URL: https://issues.apache.org/jira/browse/ARROW-7583 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Antoine Pitrou >Priority: Minor > > This occurs often on AppVeyor: > {code} > [--] 3 tests from TestAuthHandler > [ RUN ] TestAuthHandler.PassAuthenticatedCalls > [ OK ] TestAuthHandler.PassAuthenticatedCalls (4 ms) > [ RUN ] TestAuthHandler.FailUnauthenticatedCalls > ..\src\arrow\flight\flight_test.cc(1126): error: Value of: status.message() > Expected: has substring "Invalid token" > Actual: "Could not write record batch to stream: " > [ FAILED ] TestAuthHandler.FailUnauthenticatedCalls (3 ms) > [ RUN ] TestAuthHandler.CheckPeerIdentity > [ OK ] TestAuthHandler.CheckPeerIdentity (2 ms) > [--] 3 tests from TestAuthHandler (10 ms total) > [--] 3 tests from TestBasicAuthHandler > [ RUN ] TestBasicAuthHandler.PassAuthenticatedCalls > [ OK ] TestBasicAuthHandler.PassAuthenticatedCalls (4 ms) > [ RUN ] TestBasicAuthHandler.FailUnauthenticatedCalls > ..\src\arrow\flight\flight_test.cc(1224): error: Value of: status.message() > Expected: has substring "Invalid token" > Actual: "Could not write record batch to stream: " > [ FAILED ] TestBasicAuthHandler.FailUnauthenticatedCalls (4 ms) > [ RUN ] TestBasicAuthHandler.CheckPeerIdentity > [ OK ] TestBasicAuthHandler.CheckPeerIdentity (3 ms) > [--] 3 tests from TestBasicAuthHandler (11 ms total) > {code} > See e.g. > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/30110376/job/vbtd22813g5hlgfl#L2252 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated ARROW-7584: - Description: The [new Python FileSystem API |https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems to be very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html] h2. Here are some examples *Filesystem access:* Before: fs.ls() fs.mkdir() fs.rmdir() Now: fs.get_target_stats() fs.create_dir() fs.delete_dir() What is the advantage of having a longer method ? The short ones seems clear and are much easier to use. Seems like an easy change. Also this is consistent with what is doing hdfs in the [fs api| https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem. *File opening:* Before: with fs.open(self, path, mode=u'rb', buffer_size=None) Now: fs.open_input_file() fs.open_input_stream() fs.open_output_stream() It seems more natural to fit to Python standard open function which works for local file access as well. Not sure if this is possible to do easily as there is `_wrap_output_stream` method. h2. Proposed solutions - If the current Python API is still unused we could just rename the methods - We could keep everything as is and add some alias methods, it would make the FileSystem class a bit messy I think becasue there would be always 2 methods to do the work h2. Tensorflow RFC on FileSystems Tensorflow is also doing some standardization work on their FileSystem: https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations Not clear (to me) what they will do with Python file API though. it seems like they will also just wrap the C code back to [tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile] h2. 
Other considerations on FS ergonomics In the long run I would also like to enhance the FileSystem API and add more methods that use the basic ones to provide new features for example: - introduce put and get on top of the streams that directly upload/download files - introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] - introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] - check if selector works with globs or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349 - be able to write strings to the file streams (instead of only bytes), it would permit to directly use some Python API's like json.dump {code} with fs.open(path, "wb") as fd: res = {"a": "bc"} json.dump(res, fd) {code} instead of {code} with fs.open(path, "wb") as fd: res = {"a": "bc"} fd.write(json.dumps(res)) {code} or like currently (with old API, which required encore each time, untested with new one) {code}with fs.open(path, "wb") as fd: res = {"a": "bc"} fd.write(json.dumps(res).encode()) {code} - not clear how to make this also work when reading from files was: The [new Python FileSystem API |https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems to be very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html] h2. Here are some examples *Filesystem access:* Before: fs.ls() fs.mkdir() fs.rmdir() Now: fs.get_target_stats() fs.create_dir() fs.delete_dir() What is the advantage of having a longer method ? The short ones seems clear and are much easier to use. Seems like an easy change. Also this is consistent with what is doing hdfs in the [fs api| https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem. 
> [Python] Improve ergonomics of new FileSystem API > -- This message was sent by Atlassian Jira (v8.3.4#803005)
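The alias option from the proposed solutions in the issue could be prototyped roughly as follows. DemoFileSystem is a made-up stand-in for pyarrow's FileSystem class with toy bodies; only the naming pattern (short aliases bound to the new, longer method names) is the point.

```python
# Sketch: short aliases layered on top of the new FileSystem method names.
# DemoFileSystem is hypothetical; the real class would delegate to Arrow.
class DemoFileSystem:
    def __init__(self):
        self._dirs = set()

    def create_dir(self, path):
        self._dirs.add(path)

    def delete_dir(self, path):
        self._dirs.discard(path)

    def get_target_stats(self, path):
        return sorted(d for d in self._dirs if d.startswith(path))

    # Short aliases in the spirit of the old fs API: same function objects,
    # so both names always behave identically.
    mkdir = create_dir
    rmdir = delete_dir
    ls = get_target_stats


fs = DemoFileSystem()
fs.mkdir("/data/a")
fs.mkdir("/data/b")
print(fs.ls("/data"))  # ['/data/a', '/data/b']
fs.rmdir("/data/a")
print(fs.ls("/data"))  # ['/data/b']
```

Binding the alias at class-definition time (rather than writing a forwarding method) keeps the two names in lockstep, which is the "2 methods doing the same work" trade-off the issue mentions.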