[jira] [Updated] (ARROW-7589) [C++][Gandiva] Calling castVarchar from java sometimes results in segmentation fault for input length 0
[ https://issues.apache.org/jira/browse/ARROW-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7589: -- Labels: pull-request-available (was: ) > [C++][Gandiva] Calling castVarchar from java sometimes results in > segmentation fault for input length 0 > --- > > Key: ARROW-7589 > URL: https://issues.apache.org/jira/browse/ARROW-7589 > Project: Apache Arrow > Issue Type: Bug >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7589) [C++][Gandiva] Calling castVarchar from java sometimes results in segmentation fault for input length 0
[ https://issues.apache.org/jira/browse/ARROW-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Projjal Chanda updated ARROW-7589: -- Summary: [C++][Gandiva] Calling castVarchar from java sometimes results in segmentation fault for input length 0 (was: [C++][Gandiva] Calling castVarchar java sometimes results in segmentation fault for input length 0) > [C++][Gandiva] Calling castVarchar from java sometimes results in > segmentation fault for input length 0 > --- > > Key: ARROW-7589 > URL: https://issues.apache.org/jira/browse/ARROW-7589 > Project: Apache Arrow > Issue Type: Bug >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7589) [C++][Gandiva] Calling castVarchar java sometimes results in segmentation fault for input length 0
Projjal Chanda created ARROW-7589: - Summary: [C++][Gandiva] Calling castVarchar java sometimes results in segmentation fault for input length 0 Key: ARROW-7589 URL: https://issues.apache.org/jira/browse/ARROW-7589 Project: Apache Arrow Issue Type: Bug Reporter: Projjal Chanda Assignee: Projjal Chanda -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7588) [Plasma] Plasma On YARN
Ferdinand Xu created ARROW-7588: --- Summary: [Plasma] Plasma On YARN Key: ARROW-7588 URL: https://issues.apache.org/jira/browse/ARROW-7588 Project: Apache Arrow Issue Type: New Feature Components: C++ - Plasma Reporter: Ferdinand Xu YARN is widely used as a resource manager. Currently the Plasma server runs as an external service for memory sharing across different clients; it is not a service managed by YARN. The resources used by Plasma should also be managed, and the Plasma service itself could be managed by YARN. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-3827) [Rust] Implement UnionArray
[ https://issues.apache.org/jira/browse/ARROW-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3827: -- Labels: pull-request-available (was: ) > [Rust] Implement UnionArray > --- > > Key: ARROW-3827 > URL: https://issues.apache.org/jira/browse/ARROW-3827 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7533) [Java] Move ArrowBufPointer out of the java memory package
[ https://issues.apache.org/jira/browse/ARROW-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7533: -- Labels: pull-request-available (was: ) > [Java] Move ArrowBufPointer out of the java memory package > -- > > Key: ARROW-7533 > URL: https://issues.apache.org/jira/browse/ARROW-7533 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Jacques Nadeau >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > The memory package is focused on memory access and management. > ArrowBufPointer should be moved to the algorithm package, as it isn't core to the > Arrow memory management primitives. I would further suggest that it is an > anti-pattern. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1571) [C++] Implement argsort kernels (sort indices) for integers using O(n) counting sort
[ https://issues.apache.org/jira/browse/ARROW-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016474#comment-17016474 ] Yibo Cai commented on ARROW-1571: - Finding a cross-over point suitable for various hardware may not be easy. I will run some tests to see whether we can find a reasonable approach. > [C++] Implement argsort kernels (sort indices) for integers using O(n) > counting sort > > > Key: ARROW-1571 > URL: https://issues.apache.org/jira/browse/ARROW-1571 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > Fix For: 2.0.0 > > > This function requires knowledge of the minimum and maximum of an array. If > it is small enough, then an array of size {{maximum - minimum}} can be > constructed and used to tabulate value frequencies and then compute the sort > indices (this is called "grade up" or "grade down" in APL languages). There > is generally a cross-over point where this function performs worse than > mergesort or quicksort due to data locality issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
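The counting-sort argsort described in the issue can be sketched as follows. This is a minimal pure-Python illustration of the technique (stable "grade up"), not the Arrow C++ kernel:

```python
def counting_argsort(values):
    """Counting-sort-based argsort: O(n + k) where k = max - min + 1.

    Only worthwhile when the value range k is small relative to n;
    past some cross-over point a comparison sort wins on data locality.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    k = hi - lo + 1
    # Tabulate value frequencies over the range [lo, hi].
    counts = [0] * k
    for v in values:
        counts[v - lo] += 1
    # Prefix-sum the counts to get each value's first output slot.
    starts = [0] * k
    total = 0
    for i in range(k):
        starts[i] = total
        total += counts[i]
    # Emit sort indices; scanning input left-to-right keeps it stable.
    indices = [0] * len(values)
    for idx, v in enumerate(values):
        indices[starts[v - lo]] = idx
        starts[v - lo] += 1
    return indices
```

For example, `counting_argsort([3, 1, 2, 1])` yields `[1, 3, 2, 0]`, the indices that read the input in sorted order.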
[jira] [Commented] (ARROW-7587) [C++][Compute] Add Top-k kernel
[ https://issues.apache.org/jira/browse/ARROW-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016470#comment-17016470 ] Yibo Cai commented on ARROW-7587: - Comments welcome > [C++][Compute] Add Top-k kernel > --- > > Key: ARROW-7587 > URL: https://issues.apache.org/jira/browse/ARROW-7587 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Compute >Reporter: Yibo Cai >Assignee: Yibo Cai >Priority: Minor > > Add a kernel to get the top k smallest or largest elements (indices). > std::partial_sort should be a better solution than sorting everything and then > picking the top k. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7587) [C++][Compute] Add Top-k kernel
Yibo Cai created ARROW-7587: --- Summary: [C++][Compute] Add Top-k kernel Key: ARROW-7587 URL: https://issues.apache.org/jira/browse/ARROW-7587 Project: Apache Arrow Issue Type: New Feature Components: C++ - Compute Reporter: Yibo Cai Assignee: Yibo Cai Add a kernel to get the top k smallest or largest elements (indices). std::partial_sort should be a better solution than sorting everything and then picking the top k. -- This message was sent by Atlassian Jira (v8.3.4#803005)
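The idea of avoiding a full sort can be illustrated in pure Python with a bounded heap (`heapq.nlargest`/`heapq.nsmallest`), which plays the same role as `std::partial_sort` in the proposed C++ kernel. This is an illustrative sketch, not Arrow code:

```python
import heapq

def top_k_indices(values, k, largest=True):
    """Return indices of the k largest (or smallest) elements.

    O(n log k) via a size-bounded heap, instead of O(n log n) for
    sorting everything and then taking the first k. Ties are broken
    by original position, since nlargest/nsmallest sort stably.
    """
    if largest:
        return heapq.nlargest(k, range(len(values)), key=values.__getitem__)
    return heapq.nsmallest(k, range(len(values)), key=values.__getitem__)
```

For example, `top_k_indices([5, 1, 9, 3], 2)` yields `[2, 0]` (the indices of 9 and 5), and passing `largest=False` yields `[1, 3]`.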
[jira] [Updated] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf
[ https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu updated ARROW-7494: -- Fix Version/s: (was: 0.16.0) 1.0.0 > [Java] Remove reader index and writer index from ArrowBuf > - > > Key: ARROW-7494 > URL: https://issues.apache.org/jira/browse/ARROW-7494 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Jacques Nadeau >Assignee: Ji Liu >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Reader and writer indexes and their functionality don't belong on a chunk of memory; > they exist only due to inheritance from ByteBuf. As part of removing ByteBuf > inheritance, we should also remove reader and writer indexes from ArrowBuf. > They waste heap memory for rare utility. In general, a slice > can be used instead of the reader/writer index pattern. -- This message was sent by Atlassian Jira (v8.3.4#803005)
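The slice-instead-of-cursor pattern mentioned in the description can be illustrated in plain Python with `memoryview`. This is a hypothetical sketch of the idea, not the ArrowBuf API:

```python
# Instead of a buffer object carrying mutable readerIndex/writerIndex
# cursors (the ByteBuf inheritance), each consumer derives a zero-copy
# slice; the "position" lives in the slice, not in shared buffer state.
buf = memoryview(bytes(range(16)))

header = buf[:4]   # first consumer takes the first 4 bytes
payload = buf[4:]  # the remainder; no index is stored on `buf` itself

assert header.tobytes() == bytes([0, 1, 2, 3])
assert payload[0] == 4
assert len(buf) == 16  # the original buffer is untouched
```

Because each slice is independent, two readers can walk the same buffer concurrently without coordinating a shared cursor, which is the heap-memory and API-surface saving the issue argues for.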
[jira] [Resolved] (ARROW-7578) [R] Add support for datasets with IPC files and with multiple sources
[ https://issues.apache.org/jira/browse/ARROW-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7578. Resolution: Fixed Issue resolved by pull request 6205 [https://github.com/apache/arrow/pull/6205] > [R] Add support for datasets with IPC files and with multiple sources > - > > Key: ARROW-7578 > URL: https://issues.apache.org/jira/browse/ARROW-7578 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6899) [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>
[ https://issues.apache.org/jira/browse/ARROW-6899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-6899: -- Assignee: Neal Richardson (was: Wes McKinney) > [Python] to_pandas() not implemented on list indices=int32> > - > > Key: ARROW-6899 > URL: https://issues.apache.org/jira/browse/ARROW-6899 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.15.0 >Reporter: Razvan Chitu >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Attachments: encoded.arrow > > Time Spent: 0.5h > Remaining Estimate: 0h > > Hi, > {{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector where the data > vector is of type "dictionary encoded string". Here is the table schema as > printed by pyarrow: > {code:java} > pyarrow.Table > encodedList: list<$data$: dictionary > not null> not null > child 0, $data$: dictionary not > null > metadata > > OrderedDict() {code} > and the data (also attached in a file to this ticket) > {code:java} > > [ > [ > -- dictionary: > [ > "a", > "b", > "c", > "d" > ] > -- indices: > [ > 0, > 1, > 2 > ], > -- dictionary: > [ > "a", > "b", > "c", > "d" > ] > -- indices: > [ > 0, > 3 > ] > ] > ] {code} > and the exception I got > {code:java} > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 df.to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi > in pyarrow.lib._PandasConvertible.to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi > in pyarrow.lib.Table._to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in table_to_blockmanager(options, table, categories, ignore_metadata) > 700 > 701 _check_data_column_metadata_consistency(all_columns) > --> 702 blocks = _table_to_blocks(options, table, categories) > 703 columns = 
_deserialize_column_index(table, all_columns, > column_indexes) > 704 > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in _table_to_blocks(options, block_table, categories) > 972 > 973 # Convert an arrow table to Block from the internal pandas API > --> 974 result = pa.lib.table_to_blocks(options, block_table, categories) > 975 > 976 # Defined above > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi > in pyarrow.lib.table_to_blocks() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: > dictionary {code} > Note that the data vector itself can be loaded successfully by to_pandas. > It'd be great if this would be addressed in the next version of pyarrow. For > now, is there anything I can do on my end to bypass this unimplemented > conversion? > Thanks, > Razvan -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf
[ https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7494: -- Assignee: Ji Liu (was: Neal Richardson) > [Java] Remove reader index and writer index from ArrowBuf > - > > Key: ARROW-7494 > URL: https://issues.apache.org/jira/browse/ARROW-7494 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Jacques Nadeau >Assignee: Ji Liu >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Reader and writer indexes and their functionality don't belong on a chunk of memory; > they exist only due to inheritance from ByteBuf. As part of removing ByteBuf > inheritance, we should also remove reader and writer indexes from ArrowBuf. > They waste heap memory for rare utility. In general, a slice > can be used instead of the reader/writer index pattern. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7096) [C++] Add options structs for concatenation-with-promotion and schema unification
[ https://issues.apache.org/jira/browse/ARROW-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7096: -- Assignee: Neal Richardson (was: Zhuo Peng) > [C++] Add options structs for concatenation-with-promotion and schema > unification > - > > Key: ARROW-7096 > URL: https://issues.apache.org/jira/browse/ARROW-7096 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Follow up to ARROW-6625 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7096) [C++] Add options structs for concatenation-with-promotion and schema unification
[ https://issues.apache.org/jira/browse/ARROW-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7096: -- Assignee: Zhuo Peng (was: Neal Richardson) > [C++] Add options structs for concatenation-with-promotion and schema > unification > - > > Key: ARROW-7096 > URL: https://issues.apache.org/jira/browse/ARROW-7096 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Zhuo Peng >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Follow up to ARROW-6625 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6899) [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>
[ https://issues.apache.org/jira/browse/ARROW-6899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-6899: -- Assignee: Wes McKinney (was: Neal Richardson) > [Python] to_pandas() not implemented on list indices=int32> > - > > Key: ARROW-6899 > URL: https://issues.apache.org/jira/browse/ARROW-6899 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.15.0 >Reporter: Razvan Chitu >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Attachments: encoded.arrow > > Time Spent: 0.5h > Remaining Estimate: 0h > > Hi, > {{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector where the data > vector is of type "dictionary encoded string". Here is the table schema as > printed by pyarrow: > {code:java} > pyarrow.Table > encodedList: list<$data$: dictionary > not null> not null > child 0, $data$: dictionary not > null > metadata > > OrderedDict() {code} > and the data (also attached in a file to this ticket) > {code:java} > > [ > [ > -- dictionary: > [ > "a", > "b", > "c", > "d" > ] > -- indices: > [ > 0, > 1, > 2 > ], > -- dictionary: > [ > "a", > "b", > "c", > "d" > ] > -- indices: > [ > 0, > 3 > ] > ] > ] {code} > and the exception I got > {code:java} > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 df.to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi > in pyarrow.lib._PandasConvertible.to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi > in pyarrow.lib.Table._to_pandas() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in table_to_blockmanager(options, table, categories, ignore_metadata) > 700 > 701 _check_data_column_metadata_consistency(all_columns) > --> 702 blocks = _table_to_blocks(options, table, categories) > 703 columns = 
_deserialize_column_index(table, all_columns, > column_indexes) > 704 > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in _table_to_blocks(options, block_table, categories) > 972 > 973 # Convert an arrow table to Block from the internal pandas API > --> 974 result = pa.lib.table_to_blocks(options, block_table, categories) > 975 > 976 # Defined above > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi > in pyarrow.lib.table_to_blocks() > ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: > dictionary {code} > Note that the data vector itself can be loaded successfully by to_pandas. > It'd be great if this would be addressed in the next version of pyarrow. For > now, is there anything I can do on my end to bypass this unimplemented > conversion? > Thanks, > Razvan -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7567) [Java] Bump Checkstyle from 6.19 to 8.18
[ https://issues.apache.org/jira/browse/ARROW-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7567: -- Assignee: Neal Richardson (was: Fokko Driesprong) > [Java] Bump Checkstyle from 6.19 to 8.18 > > > Key: ARROW-7567 > URL: https://issues.apache.org/jira/browse/ARROW-7567 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7567) [Java] Bump Checkstyle from 6.19 to 8.18
[ https://issues.apache.org/jira/browse/ARROW-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7567: -- Assignee: Fokko Driesprong (was: Neal Richardson) > [Java] Bump Checkstyle from 6.19 to 8.18 > > > Key: ARROW-7567 > URL: https://issues.apache.org/jira/browse/ARROW-7567 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7518: -- Assignee: Neal Richardson (was: Krisztian Szucs) > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Neal Richardson >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf
[ https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7494: -- Assignee: Neal Richardson (was: Ji Liu) > [Java] Remove reader index and writer index from ArrowBuf > - > > Key: ARROW-7494 > URL: https://issues.apache.org/jira/browse/ARROW-7494 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Jacques Nadeau >Assignee: Neal Richardson >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Reader and writer indexes and their functionality don't belong on a chunk of memory; > they exist only due to inheritance from ByteBuf. As part of removing ByteBuf > inheritance, we should also remove reader and writer indexes from ArrowBuf. > They waste heap memory for rare utility. In general, a slice > can be used instead of the reader/writer index pattern. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7518: -- Assignee: Krisztian Szucs (was: Neal Richardson) > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7568) [Java] Bump Apache Avro from 1.9.0 to 1.9.1
[ https://issues.apache.org/jira/browse/ARROW-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7568: -- Assignee: Fokko Driesprong (was: Neal Richardson) > [Java] Bump Apache Avro from 1.9.0 to 1.9.1 > --- > > Key: ARROW-7568 > URL: https://issues.apache.org/jira/browse/ARROW-7568 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Apache Avro 1.9.1 contains some bugfixes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7570) [Java] Fix high severity issues reported by LGTM
[ https://issues.apache.org/jira/browse/ARROW-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7570: -- Assignee: Neal Richardson (was: Fokko Driesprong) > [Java] Fix high severity issues reported by LGTM > > > Key: ARROW-7570 > URL: https://issues.apache.org/jira/browse/ARROW-7570 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Fixes high severity issues reported by LGTM: > [https://lgtm.com/projects/g/apache/arrow/?mode=list&lang=java&severity=error] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7569) [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions
[ https://issues.apache.org/jira/browse/ARROW-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7569: -- Assignee: Joris Van den Bossche (was: Neal Richardson) > [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas > conversions > --- > > Key: ARROW-7569 > URL: https://issues.apache.org/jira/browse/ARROW-7569 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > ARROW-2428 was about adding such a mapping, and described three use cases > (see this > [comment|https://issues.apache.org/jira/browse/ARROW-2428?focusedCommentId=16914231&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16914231] > for details): > * Basic roundtrip based on the pandas_metadata (in {{to_pandas}}, we check if > the pandas_metadata specify pandas extension dtypes, and if so, use this as > the target dtype for that column) > * Conversion for pyarrow extension types that can define their equivalent > pandas extension dtype > * A way to override default conversion (eg for the built-in types, or in > absence of pandas_metadata in the schema). This would require the user to be > able to specify some mapping of pyarrow type or column name to the pandas > extension dtype to use. > The PR that closed ARROW-2428 (https://github.com/apache/arrow/pull/5512) > only covered the first two cases, and not the third case. > I think it is still interesting to also cover the third case in some way. > An example use case are the new nullable dtypes that are introduced in pandas > (eg the nullable integer dtype). Assume I want to read a parquet file into a > pandas DataFrame using this nullable integer dtype. 
The pyarrow Table has no > pandas_metadata indicating to use this dtype (unless it was created from a > pandas DataFrame that was already using this dtype, but that will often not > be the case), and the pyarrow.int64() type is also not an extension type that > can define its equivalent pandas extension dtype. > Currently, the only solution is first read it into pandas DataFrame (which > will use floats for the integers if there are nulls), and then afterwards to > convert those floats back to a nullable integer dtype. > A possible API for this could look like: > {code} > table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()}) > {code} > to indicate that you want to convert all columns of the pyarrow table with > int64 type to a pandas column using the nullable Int64 dtype. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7568) [Java] Bump Apache Avro from 1.9.0 to 1.9.1
[ https://issues.apache.org/jira/browse/ARROW-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7568: -- Assignee: Neal Richardson (was: Fokko Driesprong) > [Java] Bump Apache Avro from 1.9.0 to 1.9.1 > --- > > Key: ARROW-7568 > URL: https://issues.apache.org/jira/browse/ARROW-7568 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Apache Avro 1.9.1 contains some bugfixes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7569) [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions
[ https://issues.apache.org/jira/browse/ARROW-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7569: -- Assignee: Neal Richardson > [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas > conversions > --- > > Key: ARROW-7569 > URL: https://issues.apache.org/jira/browse/ARROW-7569 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > ARROW-2428 was about adding such a mapping, and described three use cases > (see this > [comment|https://issues.apache.org/jira/browse/ARROW-2428?focusedCommentId=16914231&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16914231] > for details): > * Basic roundtrip based on the pandas_metadata (in {{to_pandas}}, we check if > the pandas_metadata specify pandas extension dtypes, and if so, use this as > the target dtype for that column) > * Conversion for pyarrow extension types that can define their equivalent > pandas extension dtype > * A way to override default conversion (eg for the built-in types, or in > absence of pandas_metadata in the schema). This would require the user to be > able to specify some mapping of pyarrow type or column name to the pandas > extension dtype to use. > The PR that closed ARROW-2428 (https://github.com/apache/arrow/pull/5512) > only covered the first two cases, and not the third case. > I think it is still interesting to also cover the third case in some way. > An example use case are the new nullable dtypes that are introduced in pandas > (eg the nullable integer dtype). Assume I want to read a parquet file into a > pandas DataFrame using this nullable integer dtype. 
The pyarrow Table has no > pandas_metadata indicating to use this dtype (unless it was created from a > pandas DataFrame that was already using this dtype, but that will often not > be the case), and the pyarrow.int64() type is also not an extension type that > can define its equivalent pandas extension dtype. > Currently, the only solution is first read it into pandas DataFrame (which > will use floats for the integers if there are nulls), and then afterwards to > convert those floats back to a nullable integer dtype. > A possible API for this could look like: > {code} > table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()}) > {code} > to indicate that you want to convert all columns of the pyarrow table with > int64 type to a pandas column using the nullable Int64 dtype. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7570) [Java] Fix high severity issues reported by LGTM
[ https://issues.apache.org/jira/browse/ARROW-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7570: -- Assignee: Fokko Driesprong (was: Neal Richardson) > [Java] Fix high severity issues reported by LGTM > > > Key: ARROW-7570 > URL: https://issues.apache.org/jira/browse/ARROW-7570 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Fixes high severity issues reported by LGTM: > [https://lgtm.com/projects/g/apache/arrow/?mode=list&lang=java&severity=error] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7572) [Java] Enforce Maven 3.3+ as mentioned in README
[ https://issues.apache.org/jira/browse/ARROW-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7572: -- Assignee: Neal Richardson (was: Fokko Driesprong) > [Java] Enforce Maven 3.3+ as mentioned in README > --- > > Key: ARROW-7572 > URL: https://issues.apache.org/jira/browse/ARROW-7572 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7572) [Java] Enforce Maven 3.3+ as mentioned in README
[ https://issues.apache.org/jira/browse/ARROW-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7572: -- Assignee: Fokko Driesprong (was: Neal Richardson) > [Java] Enforce Maven 3.3+ as mentioned in README > --- > > Key: ARROW-7572 > URL: https://issues.apache.org/jira/browse/ARROW-7572 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 0.15.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7551: -- Assignee: David Li (was: Neal Richardson) > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Assignee: David Li >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7551: -- Assignee: Neal Richardson > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7586) [C++][Dataset] Read feather files
Neal Richardson created ARROW-7586: -- Summary: [C++][Dataset] Read feather files Key: ARROW-7586 URL: https://issues.apache.org/jira/browse/ARROW-7586 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Neal Richardson Assignee: Ben Kietzman Fix For: 0.16.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7551: -- Labels: pull-request-available (was: ) > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-2260) [C++][Plasma] plasma_store should show usage
[ https://issues.apache.org/jira/browse/ARROW-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016349#comment-17016349 ] Christian Hudon commented on ARROW-2260: Oups. Didn't check for duplicates before reporting ARROW-7585. I'm willing to do the work to fix this as a good first Arrow pull request. Antoine mentioned GFlags as the library to use (that would be more featureful than getopt()), so I'll use that unless someone says otherwise here... > [C++][Plasma] plasma_store should show usage > > > Key: ARROW-2260 > URL: https://issues.apache.org/jira/browse/ARROW-2260 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 2.0.0 > > > Currently the options exposed by the {{plasma_store}} executable aren't very > discoverable: > {code:bash} > $ plasma_store -h > please specify socket for incoming connections with -s switch > Abandon > (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting > *)$ plasma_store > please specify socket for incoming connections with -s switch > Abandon > (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting > *)$ plasma_store --help > plasma_store: invalid option -- '-' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016346#comment-17016346 ] David Li commented on ARROW-7551: - Yes, let's skip the test on CI for now. > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Priority: Critical > Fix For: 0.16.0 > > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`
[ https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6895: -- Labels: pull-request-available (was: ) > [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader > repeats returned values when calling `NextBatch()` > --- > > Key: ARROW-6895 > URL: https://issues.apache.org/jira/browse/ARROW-6895 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.0 > Environment: Linux 5.2.17-200.fc30.x86_64 (Docker) >Reporter: Adam Hooper >Assignee: Francois Saint-Jacques >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet > > > Given most columns, I can run a loop like: > {code:cpp} > std::unique_ptr columnReader(/*...*/); > while (nRowsRemaining > 0) { > int n = std::min(100, nRowsRemaining); > std::shared_ptr chunkedArray; > auto status = columnReader->NextBatch(n, &chunkedArray); > // ... and then use `chunkedArray` > nRowsRemaining -= n; > } > {code} > (The context is: "convert Parquet to CSV/JSON, with small memory footprint." > Used in https://github.com/CJWorkbench/parquet-to-arrow) > Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; > the second return value looks like {{val100...val199}}; and so on. > ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The > first {{NextBatch()}} return value looks like {{val0...val100}}; the second > return value looks like {{val0...val99, val100...val199}} (ChunkedArray with > two arrays); the third return value looks like {{val0...val99, > val100...val199, val200...val299}} (ChunkedArray with three arrays); and so > on. The returned arrays are never cleared. > In sum: {{NextBatch()}} on a dictionary column reader returns the wrong > values. 
> I've attached a minimal Parquet file that presents this problem with the > above code; and I've written a patch that fixes this one case, to illustrate > where things are wrong. I don't think I understand enough edge cases to > decree that my patch is a correct fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-7585) Plasma-store-server does not support --help, shows backtrace on getopt error
[ https://issues.apache.org/jira/browse/ARROW-7585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-7585. - Resolution: Duplicate Closing as duplicate. [~chrish42] Please comment on ARROW-2260. I agree it deserves fixing. GFlags is what we use for some other command-line utilities. > Plasma-store-server does not support --help, shows backtrace on getopt error > > > Key: ARROW-7585 > URL: https://issues.apache.org/jira/browse/ARROW-7585 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Christian Hudon >Priority: Minor > > I'm trying out Plasma, using plasma-store-server. The first thing I usually > do then is to run the binary without arguments, and that usually gives me a > message showing usage. However, with plasma-store-server, the initial > experience there is a backtrace: > {noformat} > $ ./debug/plasma-store-server > /Users/chrish/Code/arrow/cpp/src/plasma/store.cc:1237: please specify socket > for incoming connections with -s switch > 0 plasma-store-server 0x00010b4d7c04 > _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52 > 1 plasma-store-server 0x00010b4d7b24 > _ZN5arrow4util7CerrLogD2Ev + 100 > 2 plasma-store-server 0x00010b4d7a85 > _ZN5arrow4util7CerrLogD1Ev + 21 > 3 plasma-store-server 0x00010b4d7aa9 > _ZN5arrow4util7CerrLogD0Ev + 25 > 4 plasma-store-server 0x00010b4d7990 > _ZN5arrow4util8ArrowLogD2Ev + 80 > 5 plasma-store-server 0x00010b4d79c5 > _ZN5arrow4util8ArrowLogD1Ev + 21 > 6 plasma-store-server 0x00010b463152 main + 1122 > 7 libdyld.dylib 0x7fff7765a3d5 start + 1 > fish: './debug/plasma-store-server' terminated by signal SIGABRT (Abort) > {noformat} > Also, neither of the "h" or "help" command-line switches is supported, and so > to start plasma-store-server, you either find the doc, or iteratively add > arguments until you stop getting "please specify ..." backtraces. 
> I know it's not a big thing, but it'd be nice if that initial experience was > a little bit more user-friendly. Also submitting this because it feels like a > good first time issue, so I would be very happy to do the work, and would > like to tackle it. I'd like to 1) add --help support that shows all the > options and gives an example with the required ones, and 2) remove the > unnecessary backtraces on normal errors like these in the main() function. > Just asking beforehand here: 1) would this kind of patch be welcome, and 2) > is there a C++ library for command-line option parsing that I could be using. > I can find one on my own, but I'd rather ask here which one would be approved > for using in the Arrow codebase... or should I just stick to getopt() and do > things manually? Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7585) Plasma-store-server does not support --help, shows backtrace on getopt error
Christian Hudon created ARROW-7585: -- Summary: Plasma-store-server does not support --help, shows backtrace on getopt error Key: ARROW-7585 URL: https://issues.apache.org/jira/browse/ARROW-7585 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Reporter: Christian Hudon I'm trying out Plasma, using plasma-store-server. The first thing I usually do then is to run the binary without arguments, and that usually gives me a message showing usage. However, with plasma-store-server, the initial experience there is a backtrace: {noformat} $ ./debug/plasma-store-server /Users/chrish/Code/arrow/cpp/src/plasma/store.cc:1237: please specify socket for incoming connections with -s switch 0 plasma-store-server 0x00010b4d7c04 _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52 1 plasma-store-server 0x00010b4d7b24 _ZN5arrow4util7CerrLogD2Ev + 100 2 plasma-store-server 0x00010b4d7a85 _ZN5arrow4util7CerrLogD1Ev + 21 3 plasma-store-server 0x00010b4d7aa9 _ZN5arrow4util7CerrLogD0Ev + 25 4 plasma-store-server 0x00010b4d7990 _ZN5arrow4util8ArrowLogD2Ev + 80 5 plasma-store-server 0x00010b4d79c5 _ZN5arrow4util8ArrowLogD1Ev + 21 6 plasma-store-server 0x00010b463152 main + 1122 7 libdyld.dylib 0x7fff7765a3d5 start + 1 fish: './debug/plasma-store-server' terminated by signal SIGABRT (Abort) {noformat} Also, neither of the "h" or "help" command-line switches is supported, and so to start plasma-store-server, you either find the doc, or iteratively add arguments until you stop getting "please specify ..." backtraces. I know it's not a big thing, but it'd be nice if that initial experience was a little bit more user-friendly. Also submitting this because it feels like a good first time issue, so I would be very happy to do the work, and would like to tackle it. I'd like to 1) add --help support that shows all the options and gives an example with the required ones, and 2) remove the unnecessary backtraces on normal errors like these in the main() function. 
Just asking beforehand here: 1) would this kind of patch be welcome, and 2) is there a C++ library for command-line option parsing that I could be using. I can find one on my own, but I'd rather ask here which one would be approved for using in the Arrow codebase... or should I just stick to getopt() and do things manually? Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata
[ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016304#comment-17016304 ] Neal Richardson commented on ARROW-7063: >From my perspective, it's not a problem that the metadata isn't printed as >long as I can access it and print it if I choose. I.e. I can {{print(schema)}} >and then {{print(schema.metadata)}} if I want. > [C++] Schema print method prints too much metadata > -- > > Key: ARROW-7063 > URL: https://issues.apache.org/jira/browse/ARROW-7063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Minor > Labels: dataset, parquet > Fix For: 1.0.0 > > > I loaded some taxi data in a Dataset and printed the schema. This is what was > printed: > {code} > vendor_id: string > pickup_at: timestamp[us] > dropoff_at: timestamp[us] > passenger_count: int8 > trip_distance: float > pickup_longitude: float > pickup_latitude: float > rate_code_id: null > store_and_fwd_flag: string > dropoff_longitude: float > dropoff_latitude: float > payment_type: string > fare_amount: float > extra: float > mta_tax: float > tip_amount: float > tolls_amount: float > total_amount: float > -- metadata -- > pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, > "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, > "field_name": null, "pandas_type": "unicode", "numpy_type": "object", > "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", > "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", > "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", > "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, > {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", > "numpy_type": "datetime64[ns]", "metadata": null}, {"name": > "passenger_count", "field_name": "passenger_count", 
"pandas_type": "int8", > "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", > "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": > "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "pickup_latitude", "field_name": > "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", > "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": > "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, {"name": > "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "payment_type", "field_name": "payment_type", "pandas_type": "unicode", > "numpy_type": "object", "metadata": null}, {"name": "fare_amount", > "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "extra", "field_name": "extra", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, > {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", > "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", > "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "tolls_amount", "field_name": > "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "total_amount", "field_name": "total_amount", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], > "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": > "0.25.3"} > ARROW:schema: > 
/3gOAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwEAAgACgAAAFQKAAAEAQwIAAwABAAIAAgsCgAABB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogInZlbmRvcl9pZCIsICJmaWVsZF9uYW1lIjogInZlbmRvcl9pZCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwaWNrdXBfYXQiLCAiZmllbGRfbmFtZSI6ICJwaWNrdXBfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZG
[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata
[ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016303#comment-17016303 ] Neal Richardson commented on ARROW-7063: What I have in mind is in the output in the ticket description: everything above the line that says {{-- metadata --}} > [C++] Schema print method prints too much metadata > -- > > Key: ARROW-7063 > URL: https://issues.apache.org/jira/browse/ARROW-7063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Minor > Labels: dataset, parquet > Fix For: 1.0.0 > > > I loaded some taxi data in a Dataset and printed the schema. This is what was > printed: > {code} > vendor_id: string > pickup_at: timestamp[us] > dropoff_at: timestamp[us] > passenger_count: int8 > trip_distance: float > pickup_longitude: float > pickup_latitude: float > rate_code_id: null > store_and_fwd_flag: string > dropoff_longitude: float > dropoff_latitude: float > payment_type: string > fare_amount: float > extra: float > mta_tax: float > tip_amount: float > tolls_amount: float > total_amount: float > -- metadata -- > pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, > "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, > "field_name": null, "pandas_type": "unicode", "numpy_type": "object", > "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", > "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", > "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", > "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, > {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", > "numpy_type": "datetime64[ns]", "metadata": null}, {"name": > "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", > "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", > 
"field_name": "trip_distance", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": > "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "pickup_latitude", "field_name": > "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", > "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": > "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, {"name": > "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "payment_type", "field_name": "payment_type", "pandas_type": "unicode", > "numpy_type": "object", "metadata": null}, {"name": "fare_amount", > "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "extra", "field_name": "extra", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, > {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", > "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", > "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "tolls_amount", "field_name": > "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "total_amount", "field_name": "total_amount", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], > "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": > "0.25.3"} > ARROW:schema: > 
/3gOAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwEAAgACgAAAFQKAAAEAQwIAAwABAAIAAgsCgAABB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogInZlbmRvcl9pZCIsICJmaWVsZF9uYW1lIjogInZlbmRvcl9pZCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwaWNrdXBfYXQiLCAiZmllbGRfbmFtZSI6ICJwaWNrdXBfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9hdCIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfYXQiLCAic
[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata
[ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016295#comment-17016295 ] Joris Van den Bossche commented on ARROW-7063: -- A reason to have at least some truncated form of it (it can be short), is that two tables/schemas are not equal if their metadata is not equal. So having nothing about it in the simple pretty print can also be quite confusing. > I'm fine with writing it my way in R (i.e. schema print only prints its > fields, assuming I can iterate over the Fields in a Schema and print each), > and if y'all like how that looks, we can consider making that the C++ > behavior. Can you post an example? > [C++] Schema print method prints too much metadata > -- > > Key: ARROW-7063 > URL: https://issues.apache.org/jira/browse/ARROW-7063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Minor > Labels: dataset, parquet > Fix For: 1.0.0 > > > I loaded some taxi data in a Dataset and printed the schema. 
This is what was > printed: > {code} > vendor_id: string > pickup_at: timestamp[us] > dropoff_at: timestamp[us] > passenger_count: int8 > trip_distance: float > pickup_longitude: float > pickup_latitude: float > rate_code_id: null > store_and_fwd_flag: string > dropoff_longitude: float > dropoff_latitude: float > payment_type: string > fare_amount: float > extra: float > mta_tax: float > tip_amount: float > tolls_amount: float > total_amount: float > -- metadata -- > pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, > "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, > "field_name": null, "pandas_type": "unicode", "numpy_type": "object", > "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", > "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", > "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", > "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, > {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", > "numpy_type": "datetime64[ns]", "metadata": null}, {"name": > "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", > "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", > "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": > "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "pickup_latitude", "field_name": > "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", > "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": > "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, {"name": > "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": > 
"float32", "numpy_type": "float32", "metadata": null}, {"name": > "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "payment_type", "field_name": "payment_type", "pandas_type": "unicode", > "numpy_type": "object", "metadata": null}, {"name": "fare_amount", > "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "extra", "field_name": "extra", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, > {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", > "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", > "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "tolls_amount", "field_name": > "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "total_amount", "field_name": "total_amount", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], > "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": > "0.25.3"} > ARROW:schema: > /3gOAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwEAAgACgAAAFQKAAAEAQwIAAwABAAIAAgsCgAABB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJu
[jira] [Commented] (ARROW-7044) [Release] Create a post release script for the home-brew formulas
[ https://issues.apache.org/jira/browse/ARROW-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016290#comment-17016290 ] Neal Richardson commented on ARROW-7044: I'm not sure those instructions are 100% accurate. We're maintaining the homebrew formula (at least cpp) and running nightly crossbow on it, so when we update the upstream homebrew formula, we should be sure to include any changes from the one we maintain in {{dev/tasks/}}. What if (as is the case this time) there is a change in the CMake flags or dependencies? Where do you make those changes? > [Release] Create a post release script for the home-brew formulas > - > > Key: ARROW-7044 > URL: https://issues.apache.org/jira/browse/ARROW-7044 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Priority: Major > Fix For: 0.16.0 > > > The required steps are documented in the release management guide > https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7551) [C++][Flight] Flight test on macOS periodically fails on master
[ https://issues.apache.org/jira/browse/ARROW-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016289#comment-17016289 ] Neal Richardson commented on ARROW-7551: If we can't reproduce this, should we skip this test for now on macOS/on CI? Right now it's just making all the builds fail, and that increases the risk that we'll merge a patch that really breaks things. > [C++][Flight] Flight test on macOS periodically fails on master > --- > > Key: ARROW-7551 > URL: https://issues.apache.org/jira/browse/ARROW-7551 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Neal Richardson >Priority: Critical > Fix For: 0.16.0 > > > See [https://github.com/apache/arrow/runs/380443548#step:5:179] for example. > {code} > 64/96 Test #64: arrow-flight-test .***Failed0.46 > sec > Running arrow-flight-test, redirecting output into > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/build/test-logs/arrow-flight-test.txt > (attempt 1/1) > Running main() from > /Users/runner/runners/2.163.1/work/arrow/arrow/build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 42 tests from 11 test cases. > [--] Global test environment set-up. 
> [--] 2 tests from TestFlightDescriptor > [ RUN ] TestFlightDescriptor.Basics > [ OK ] TestFlightDescriptor.Basics (0 ms) > [ RUN ] TestFlightDescriptor.ToFromProto > [ OK ] TestFlightDescriptor.ToFromProto (0 ms) > [--] 2 tests from TestFlightDescriptor (0 ms total) > [--] 6 tests from TestFlight > [ RUN ] TestFlight.UnknownLocationScheme > [ OK ] TestFlight.UnknownLocationScheme (0 ms) > [ RUN ] TestFlight.ConnectUri > Server running with pid 15977 > /Users/runner/runners/2.163.1/work/arrow/arrow/cpp/build-support/run-test.sh: > line 97: 15971 Segmentation fault: 11 $TEST_EXECUTABLE "$@" 2>&1 > 15972 Done| $ROOT/build-support/asan_symbolize.py > 15973 Done| ${CXXFILT:-c++filt} > 15974 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 15975 Done| $pipe_cmd 2>&1 > 15976 Done| tee $LOGFILE > ~/runners/2.163.1/work/arrow/arrow/build/cpp/src/arrow/flight > {code} > It's not failing every time but I'm seeing it fail frequently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7432) [Python] Add higher-level datasets functions
[ https://issues.apache.org/jira/browse/ARROW-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7432: -- Assignee: Joris Van den Bossche (was: Neal Richardson) > [Python] Add higher-level datasets functions > > > Key: ARROW-7432 > URL: https://issues.apache.org/jira/browse/ARROW-7432 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: dataset, pull-request-available > Fix For: 0.16.0 > > Time Spent: 4h > Remaining Estimate: 0h > > From [~kszucs]: We need to define a more pythonic API for the dataset > bindings, because the current one is pretty low-level. > One option is to provide an "open_dataset" function similar to what is > available in R. > A short-cut to go from a Dataset to a Table might also be useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7432) [Python] Add higher-level datasets functions
[ https://issues.apache.org/jira/browse/ARROW-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7432: -- Assignee: Neal Richardson > [Python] Add higher-level datasets functions > > > Key: ARROW-7432 > URL: https://issues.apache.org/jira/browse/ARROW-7432 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Neal Richardson >Priority: Major > Labels: dataset, pull-request-available > Fix For: 0.16.0 > > Time Spent: 4h > Remaining Estimate: 0h > > From [~kszucs]: We need to define a more pythonic API for the dataset > bindings, because the current one is pretty low-level. > One option is to provide an "open_dataset" function similar to what is > available in R. > A short-cut to go from a Dataset to a Table might also be useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5744) [C++] Do not error in Table::CombineChunks for BinaryArray types that overflow 2GB limit
[ https://issues.apache.org/jira/browse/ARROW-5744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5744: --- Priority: Major (was: Critical) > [C++] Do not error in Table::CombineChunks for BinaryArray types that > overflow 2GB limit > > > Key: ARROW-5744 > URL: https://issues.apache.org/jira/browse/ARROW-5744 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Ben Kietzman >Priority: Major > Fix For: 0.16.0 > > > Discovered during ARROW-5635 code review -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5744) [C++] Do not error in Table::CombineChunks for BinaryArray types that overflow 2GB limit
[ https://issues.apache.org/jira/browse/ARROW-5744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5744: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Do not error in Table::CombineChunks for BinaryArray types that > overflow 2GB limit > > > Key: ARROW-5744 > URL: https://issues.apache.org/jira/browse/ARROW-5744 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Ben Kietzman >Priority: Major > Fix For: 1.0.0 > > > Discovered during ARROW-5635 code review -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7269) [C++] Fix arrow::parquet compiler warning
[ https://issues.apache.org/jira/browse/ARROW-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-7269. - Fix Version/s: 0.16.0 Assignee: Wes McKinney Resolution: Fixed Resolved by PARQUET-1769 https://github.com/apache/arrow/commit/1a3b17b8382952465d3902c3edd6252a71ef6c5b > [C++] Fix arrow::parquet compiler warning > - > > Key: ARROW-7269 > URL: https://issues.apache.org/jira/browse/ARROW-7269 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Jiajia Li >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Encountered the compiler warning when building: > [WARNING:/arrow/cpp/src/parquet/parquet.thrift:297] The "byte" type is a > compatibility alias for "i8". Use "i8" to emphasize the signedness of this > type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7063) [C++] Schema print method prints too much metadata
[ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7063: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Schema print method prints too much metadata > -- > > Key: ARROW-7063 > URL: https://issues.apache.org/jira/browse/ARROW-7063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Minor > Labels: dataset, parquet > Fix For: 1.0.0 > > > I loaded some taxi data in a Dataset and printed the schema. This is what was > printed: > {code} > vendor_id: string > pickup_at: timestamp[us] > dropoff_at: timestamp[us] > passenger_count: int8 > trip_distance: float > pickup_longitude: float > pickup_latitude: float > rate_code_id: null > store_and_fwd_flag: string > dropoff_longitude: float > dropoff_latitude: float > payment_type: string > fare_amount: float > extra: float > mta_tax: float > tip_amount: float > tolls_amount: float > total_amount: float > -- metadata -- > pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, > "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, > "field_name": null, "pandas_type": "unicode", "numpy_type": "object", > "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", > "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", > "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", > "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, > {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", > "numpy_type": "datetime64[ns]", "metadata": null}, {"name": > "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", > "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", > "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "pickup_longitude", 
"field_name": > "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "pickup_latitude", "field_name": > "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", > "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": > "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, {"name": > "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": > "float32", "numpy_type": "float32", "metadata": null}, {"name": > "payment_type", "field_name": "payment_type", "pandas_type": "unicode", > "numpy_type": "object", "metadata": null}, {"name": "fare_amount", > "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "extra", "field_name": "extra", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, > {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", > "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", > "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": > "float32", "metadata": null}, {"name": "tolls_amount", "field_name": > "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", > "metadata": null}, {"name": "total_amount", "field_name": "total_amount", > "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], > "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": > "0.25.3"} > ARROW:schema: > 
/3gOAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwEAAgACgAAAFQKAAAEAQwIAAwABAAIAAgsCgAABB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogInZlbmRvcl9pZCIsICJmaWVsZF9uYW1lIjogInZlbmRvcl9pZCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwaWNrdXBfYXQiLCAiZmllbGRfbmFtZSI6ICJwaWNrdXBfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9hdCIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAic
[jira] [Resolved] (ARROW-6863) [Java] Provide parallel searcher
[ https://issues.apache.org/jira/browse/ARROW-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6863. Fix Version/s: 0.16.0 Resolution: Fixed Issue resolved by pull request 5631 [https://github.com/apache/arrow/pull/5631] > [Java] Provide parallel searcher > > > Key: ARROW-6863 > URL: https://issues.apache.org/jira/browse/ARROW-6863 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > For scenarios where the vector is large and a low response time is > required, we need to search the vector in parallel to improve > responsiveness. > This issue tries to provide a parallel searcher for the equality semantics > (the support for ordering semantics is not ready yet, as we need a way to > distribute the comparator). > The implementation is based on multi-threading. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5914) [CI] Build bundled dependencies in docker build step
[ https://issues.apache.org/jira/browse/ARROW-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5914: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [CI] Build bundled dependencies in docker build step > > > Key: ARROW-5914 > URL: https://issues.apache.org/jira/browse/ARROW-5914 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Francois Saint-Jacques >Priority: Minor > Fix For: 1.0.0 > > > In the recently introduced ARROW-5803, some heavy dependencies (thrift, > protobuf, flatbuffers, grpc) are built at each invocation of docker-compose > build (thus each Travis test). > We should aim to build the third-party dependencies in the docker build phase > instead, to exploit caching and docker-compose pull so that the CI step > doesn't need to build said dependencies each time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-4226) [Format][C++] Add CSF sparse tensor support
[ https://issues.apache.org/jira/browse/ARROW-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-4226: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [Format][C++] Add CSF sparse tensor support > --- > > Key: ARROW-4226 > URL: https://issues.apache.org/jira/browse/ARROW-4226 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Format >Reporter: Kenta Murata >Assignee: Rok Mihevc >Priority: Minor > Labels: sparse > Fix For: 1.0.0 > > > [https://github.com/apache/arrow/pull/2546#pullrequestreview-156064172] > {quote}Perhaps in the future, if zero-copy and future-proof-ness is really > what we want, we might want to add the CSF (compressed sparse fiber) format, > a generalisation of CSR/CSC. I'm currently working on adding it to > PyData/Sparse, and I plan to make it the preferred format (COO will still be > around though). > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6528) [C++] Spurious Flight test failures (port allocation failure)
[ https://issues.apache.org/jira/browse/ARROW-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6528: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Spurious Flight test failures (port allocation failure) > - > > Key: ARROW-6528 > URL: https://issues.apache.org/jira/browse/ARROW-6528 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > Seems like our port allocation scheme inside unit tests is still not very > reliable :-/ > https://ci.ursalabs.org/#/builders/71/builds/4147/steps/8/logs/stdio > {code} > [--] 3 tests from TestMetadata > [ RUN ] TestMetadata.DoGet > E0905 12:45:40.322644527 10203 server_chttp2.cc:40] > {"created":"@1567687540.322612245","description":"No address added out of > total 1 > resolved","file":"../src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1567687540.322609844","description":"Unable > to configure > socket","fd":7,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1567687540.322602634","description":"Address > already in > use","errno":98,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address > already in use","syscall":"bind"}]}]} > ../src/arrow/flight/flight_test.cc:429: Failure > Failed > 'server->Init(options)' failed with Unknown error: Server did not start > properly > /buildbot/AMD64_Conda_Python_3_7/cpp/build-support/run-test.sh: line 97: > 10203 Segmentation fault (core dumped) $TEST_EXECUTABLE "$@" 2>&1 > 10204 Done| $ROOT/build-support/asan_symbolize.py > 10205 Done| ${CXXFILT:-c++filt} > 10206 Done| > $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE > 10207 Done| $pipe_cmd 2>&1 > 10208 Done| tee $LOGFILE > /buildbot/AMD64_Conda_Python_3_7/cpp/build/src/arrow/flight > {code} -- This message was sent by Atlassian Jira 
(v8.3.4#803005)
[jira] [Updated] (ARROW-6501) [C++] Remove non_zero_length field from SparseIndex
[ https://issues.apache.org/jira/browse/ARROW-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6501: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Remove non_zero_length field from SparseIndex > --- > > Key: ARROW-6501 > URL: https://issues.apache.org/jira/browse/ARROW-6501 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Fix For: 1.0.0 > > > We can remove the non_zero_length field from SparseIndex because it can be > derived from the shape of the indices tensor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6393) [C++] Add EqualOptions support in SparseTensor::Equals
[ https://issues.apache.org/jira/browse/ARROW-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6393: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Add EqualOptions support in SparseTensor::Equals > - > > Key: ARROW-6393 > URL: https://issues.apache.org/jira/browse/ARROW-6393 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Fix For: 1.0.0 > > > SparseTensor::Equals should take an EqualOptions argument as Tensor::Equals does. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6312) [C++] Declare required Libs.private in arrow.pc package config
[ https://issues.apache.org/jira/browse/ARROW-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6312: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] Declare required Libs.private in arrow.pc package config > -- > > Key: ARROW-6312 > URL: https://issues.apache.org/jira/browse/ARROW-6312 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.1 >Reporter: Michael Maguire >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The current arrow.pc package config file produced is deficient: it doesn't > properly declare the static library prerequisites that must be linked in > order to *statically* link in libarrow.a. > Currently it just has: > ``` > Libs: -L${libdir} -larrow > ``` > But in cases where, e.g., you enable snappy, brotli or zlib support in arrow, > our toolchains need to see an arrow.pc file more like: > ``` > Libs: -L${libdir} -larrow > Libs.private: -lsnappy -lboost_system -lz -llz4 -lbrotlidec -lbrotlienc > -lbrotlicommon -lzstd > ``` > If not, we get linkage errors. I'm told the convention is that if the .a has > an UNDEF, the Requires.private plus the Libs.private should resolve all the > undefs. See the Libs.private info in [https://linux.die.net/man/1/pkg-config] > > Note, however, as Sutou Kouhei pointed out in > [https://github.com/apache/arrow/pull/5123#issuecomment-522771452,] the > additional Libs.private need to be dynamically generated based on whether > functionality like snappy, brotli or zlib is enabled. -- This message was sent by Atlassian Jira (v8.3.4#803005)
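[Editor's note] For reference, a complete .pc file of the shape ARROW-6312 asks for might look like the fragment below. The library list is illustrative only (it would be generated from the enabled features of a particular build, as the ticket notes), and the paths are assumptions:

```
prefix=/usr/local
libdir=${prefix}/lib
includedir=${prefix}/include

Name: Apache Arrow
Description: Columnar in-memory analytics library
Version: 0.15.1
Libs: -L${libdir} -larrow
Libs.private: -lsnappy -lz -llz4 -lzstd -lbrotlienc -lbrotlidec -lbrotlicommon
Cflags: -I${includedir}
```

With such a file in place, a consumer linking statically would invoke `pkg-config --libs --static arrow`, which appends the Libs.private entries to the link line; without --static, only the public Libs line is emitted.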
[jira] [Updated] (ARROW-7121) [C++][CI][Windows] Enable more features on the windows GHA build
[ https://issues.apache.org/jira/browse/ARROW-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7121: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++][CI][Windows] Enable more features on the windows GHA build > > > Key: ARROW-7121 > URL: https://issues.apache.org/jira/browse/ARROW-7121 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > Fix For: 1.0.0 > > > Like `ARROW_GANDIVA: ON`, `ARROW_FLIGHT: ON`, `ARROW_PARQUET: ON` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7049) [C++] warnings building on mingw-w64
[ https://issues.apache.org/jira/browse/ARROW-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7049: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] warnings building on mingw-w64 > > > Key: ARROW-7049 > URL: https://issues.apache.org/jira/browse/ARROW-7049 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.1 >Reporter: Jeroen >Priority: Minor > Fix For: 1.0.0 > > > Two warnings when building libarrow 0.15.1 on mingw-w64: > {code} > [ 2%] Running thrift compiler on parquet.thrift > [WARNING:C:/msys64/home/mingw-packages/mingw-w64-arrow/src/apache-arrow-0.15.1/cpp/src/parquet/parquet.thrift:297] > The "byte" type is a compatibility alias for "i8". Use "i8" to emphasize the > signedness of this type. > {code} > And later: > {code} > [ 81%] Building CXX object > src/parquet/CMakeFiles/parquet_static.dir/column_reader.cc.obj > C:/msys64/home/mingw-packages/mingw-w64-arrow/src/apache-arrow-0.15.1/cpp/src/parquet/arrow/writer.cc: > In member function 'virtual arrow::Status > parquet::arrow::FileWriterImpl::WriteColumnChunk(const > std::shared_ptr<ChunkedArray>&, int64_t, int64_t)': > C:/msys64/home/mingw-packages/mingw-w64-arrow/src/apache-arrow-0.15.1/cpp/src/parquet/arrow/writer.cc:79:41: > warning: 'schema_field' may be used uninitialized in this function > [-Wmaybe-uninitialized] > schema_manifest_(schema_manifest) {} > ^ > C:/msys64/home/mingw-packages/mingw-w64-arrow/src/apache-arrow-0.15.1/cpp/src/parquet/arrow/writer.cc:466:24: > note: 'schema_field' was declared here > const SchemaField* schema_field; > {code} > Maybe CI with `CXXFLAGS += -Werror` ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7080) [Python][Parquet] Expose parquet field_id in Schema objects
[ https://issues.apache.org/jira/browse/ARROW-7080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7080: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [Python][Parquet] Expose parquet field_id in Schema objects > --- > > Key: ARROW-7080 > URL: https://issues.apache.org/jira/browse/ARROW-7080 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Ted Gooch >Priority: Major > Labels: parquet > Fix For: 1.0.0 > > > I'm in the process of adding parquet read support to > Iceberg([https://iceberg.apache.org/]), and we use the parquet field_ids as a > consistent id when reading a parquet file to create a map between the current > schema and the schema of the file being read. Unless I've missed something, > it appears that field_id is not exposed in the python APIs in > pyarrow._parquet.ParquetSchema nor is it available in pyarrow.lib.Schema. > Would it be possible to add this to either of those two objects? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas
[ https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7365: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [Python] Support FixedSizeList type in conversion to numpy/pandas > - > > Key: ARROW-7365 > URL: https://issues.apache.org/jira/browse/ARROW-7365 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > Follow-up on ARROW-7261, still need to add support for FixedSizeListType in > the arrow -> python conversion (arrow_to_pandas.cc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7332) [C++][Parquet] Explicitly catch status exceptions in PARQUET_CATCH_NOT_OK
[ https://issues.apache.org/jira/browse/ARROW-7332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7332: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++][Parquet] Explicitly catch status exceptions in PARQUET_CATCH_NOT_OK > - > > Key: ARROW-7332 > URL: https://issues.apache.org/jira/browse/ARROW-7332 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Minor > Fix For: 1.0.0 > > > PARQUET_THROW_NOT_OK throws a ParquetStatusException, which contains a full > Status rather than just an error string. These could be caught explicitly in > PARQUET_CATCH_NOT_OK and the original status returned rather than creating a > new status: > {code} > } catch (const ::parquet::ParquetStatusException& e) { \ > return e.status(); \ > } catch (const ::parquet::ParquetException& e) { \ > return Status::IOError(e.what()) \ > {code} > This will retain the original StatusCode rather than overwriting it with > IOError. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7499) [C++] CMake should collect libs when making static build
[ https://issues.apache.org/jira/browse/ARROW-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7499: --- Fix Version/s: (was: 0.16.0) 1.0.0 > [C++] CMake should collect libs when making static build > > > Key: ARROW-7499 > URL: https://issues.apache.org/jira/browse/ARROW-7499 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Neal Richardson >Assignee: Kouhei Sutou >Priority: Major > Fix For: 1.0.0 > > > From https://github.com/apache/arrow/pull/6068/files#r360672071: > {code} > # Copy the bundled static libs from the build to the install dir > find . -regex .*/.*/lib/.*\\.a\$ | xargs -I{} cp -u {} ${DEST_DIR}/lib > {code} > {quote}I think that we should do this by CMake when -DARROW_BUILD_STATIC=ON > is specified. > ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}/arrow/vendored/libXXX.a may > be better for the installed path to avoid conflict.{quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7537) [CI][R] Nightly macOS autobrew job should be more verbose if it fails
[ https://issues.apache.org/jira/browse/ARROW-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7537. Resolution: Fixed Issue resolved by pull request 6155 [https://github.com/apache/arrow/pull/6155] > [CI][R] Nightly macOS autobrew job should be more verbose if it fails > - > > Key: ARROW-7537 > URL: https://issues.apache.org/jira/browse/ARROW-7537 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Things like > https://travis-ci.org/ursa-labs/crossbow/builds/634643469#L673-L676 are hard > to debug because the installation log is not printed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7578) [R] Add support for datasets with IPC files and with multiple sources
[ https://issues.apache.org/jira/browse/ARROW-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7578: -- Labels: pull-request-available (was: ) > [R] Add support for datasets with IPC files and with multiple sources > - > > Key: ARROW-7578 > URL: https://issues.apache.org/jira/browse/ARROW-7578 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`
[ https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6895: -- Priority: Critical (was: Major) > [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader > repeats returned values when calling `NextBatch()` > --- > > Key: ARROW-6895 > URL: https://issues.apache.org/jira/browse/ARROW-6895 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.0 > Environment: Linux 5.2.17-200.fc30.x86_64 (Docker) >Reporter: Adam Hooper >Assignee: Francois Saint-Jacques >Priority: Critical > Fix For: 0.16.0 > > Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet > > > Given most columns, I can run a loop like: > {code:cpp} > std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/); > while (nRowsRemaining > 0) { > int n = std::min(100, nRowsRemaining); > std::shared_ptr<arrow::ChunkedArray> chunkedArray; > auto status = columnReader->NextBatch(n, &chunkedArray); > // ... and then use `chunkedArray` > nRowsRemaining -= n; > } > {code} > (The context is: "convert Parquet to CSV/JSON, with small memory footprint." > Used in https://github.com/CJWorkbench/parquet-to-arrow) > Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; > the second return value looks like {{val100...val199}}; and so on. > ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The > first {{NextBatch()}} return value looks like {{val0...val100}}; the second > return value looks like {{val0...val99, val100...val199}} (ChunkedArray with > two arrays); the third return value looks like {{val0...val99, > val100...val199, val200...val299}} (ChunkedArray with three arrays); and so > on. The returned arrays are never cleared. > In sum: {{NextBatch()}} on a dictionary column reader returns the wrong > values.
> I've attached a minimal Parquet file that presents this problem with the > above code; and I've written a patch that fixes this one case, to illustrate > where things are wrong. I don't think I understand enough edge cases to > decree that my patch is a correct fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7093) [R] Support creating ScalarExpressions for more data types
[ https://issues.apache.org/jira/browse/ARROW-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7093: -- Assignee: Romain Francois (was: Neal Richardson) > [R] Support creating ScalarExpressions for more data types > -- > > Key: ARROW-7093 > URL: https://issues.apache.org/jira/browse/ARROW-7093 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Romain Francois >Priority: Critical > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > See > https://github.com/apache/arrow/blob/master/r/src/expression.cpp#L93-L107. > ARROW-6340 was limited to integer/double/logical. This will let us make > dataset filter expressions with all those other types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7093) [R] Support creating ScalarExpressions for more data types
[ https://issues.apache.org/jira/browse/ARROW-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-7093: -- Assignee: Neal Richardson (was: Romain Francois)
[jira] [Updated] (ARROW-7538) Clarify actual and desired size in AllocationManager
[ https://issues.apache.org/jira/browse/ARROW-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7538: -- Labels: pull-request-available (was: ) > Clarify actual and desired size in AllocationManager > > > Key: ARROW-7538 > URL: https://issues.apache.org/jira/browse/ARROW-7538 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: David Li >Priority: Major > Labels: pull-request-available > > As a follow up to the review of ARROW-7329, we should clarify the different > sizes (desired vs actual size) in AllocationManager: > https://github.com/apache/arrow/pull/5973#discussion_r354729754 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016106#comment-17016106 ] Neal Richardson commented on ARROW-7518: Both GHA cron and crossbow nightly sound redundant. Given our current state of the art, I think we should prefer crossbow (better reporting, ability to trigger on demand). > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Reporter: Wes McKinney > Assignee: Krisztian Szucs > Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This new module is not enabled in the package builds
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016063#comment-17016063 ] Antoine Pitrou commented on ARROW-7584: --- I think the plan would be to have a high-level function that takes a URI or a list of URIs and then constructs a dataset reader from them. Those URIs could point to simple files or partitioned data sources. That high-level function doesn't exist yet, though.
> [Python] Improve ergonomics of new FileSystem API
> Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python > Reporter: Fabian Höring > Priority: Major > Labels: FileSystem
> The [new Python FileSystem API|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html]
> h2. Here are some examples
> *Filesystem access:*
> Before: fs.ls() fs.mkdir() fs.rmdir()
> Now: fs.get_target_stats() fs.create_dir() fs.delete_dir()
> What is the advantage of the longer method names? The short ones seem clear and are much easier to use, so this looks like an easy change. It is also consistent with what hdfs does in the [fs api|https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem.
> *File opening:*
> Before: with fs.open(self, path, mode=u'rb', buffer_size=None)
> Now: fs.open_input_file() fs.open_input_stream() fs.open_output_stream()
> It seems more natural to follow Python's standard open function, which works for local file access as well. Not sure if this is easy to do, as there is the `_wrap_output_stream` method.
> h2. Possible solutions
> - If the current Python API is still unused we could just rename the methods
> - We could keep everything as is and add some alias methods; it would make the FileSystem class a bit messy, I think, because there would always be 2 methods to do the same work
> - Make everything compatible with fsspec and reference the spec, see https://issues.apache.org/jira/browse/ARROW-7102. I like the idea of the https://github.com/intake/filesystem_spec repo. Some comments on the solutions proposed there:
> Make an fsspec wrapper for pyarrow.fs => seems strange to me; it would mean wrapping, in yet another repo, a FileSystem that is not good enough
> Make a pyarrow.fs wrapper for fsspec => fine I think if the wrapper becomes the documented "official" pyarrow FileSystem; otherwise it would be yet another wrapper on top of the pyarrow "official" fs
> h2. Tensorflow RFC on FileSystems
> Tensorflow is also doing some standardization work on their FileSystem: https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations
> It is not clear (to me) what they will do with the Python file API, though. It seems they will also just wrap the C code back into [tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]
> h2. Other considerations on FS ergonomics
> In the long run I would also like to enhance the FileSystem API and add more methods that use the basic ones to provide new features, for example:
> - introduce put and get on top of the streams that directly upload/download files
> - introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from dask/hdfs3
> - introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] from dask/hdfs3
> - check if the selector works with globs, or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
> - be able to write strings to the file streams (instead of only bytes; already implemented by https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96), which would permit directly using Python APIs like json.dump:
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     json.dump(res, fd)
> {code}
> instead of
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     fd.write(json.dumps(res))
> {code}
> or, as currently (with the old API, which required encoding each time; untested with the new one):
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     fd.write(json.dumps(res).encode())
> {code}
> - not clear how to make this also work when reading from files
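The "write strings, not only bytes" wish at the end of the list above can be approximated today without changing the stream API: wrapping the binary stream in a text adapter lets json.dump write to it directly. A sketch, with io.BytesIO standing in for a binary stream returned by fs.open(path, "wb"):

```python
import io
import json

# io.BytesIO stands in for the binary stream a filesystem would return.
raw = io.BytesIO()

# TextIOWrapper adapts a binary stream to a text interface; write_through
# pushes each write straight to the underlying buffer.
text = io.TextIOWrapper(raw, encoding="utf-8", write_through=True)

json.dump({"a": "bc"}, text)  # no manual .encode() needed
text.flush()

assert raw.getvalue() == b'{"a": "bc"}'
```

For reading, `io.TextIOWrapper` works the same way around a readable binary stream, which addresses the final open question in the list.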
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016057#comment-17016057 ] Fabian Höring commented on ARROW-7584: -- IMO each language has its specificities. While it is a good idea to have a consistent API across languages, trying to do (exactly) the same thing in each language will also confuse people. I don't know Arrow very well, but protobuf, for example, doesn't use the same wrappers in C# and Java. Can you explain how you intend to use this for reading Parquet from Python, for example?
[jira] [Comment Edited] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016054#comment-17016054 ] Fabian Höring edited comment on ARROW-7584 at 1/15/20 2:53 PM: --- I agree about not needing to support transactions. Something lightweight would be better. was (Author: fhoering): I agree for better not to support transactions. Something lightweight would be better.
[jira] [Comment Edited] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016054#comment-17016054 ] Fabian Höring edited comment on ARROW-7584 at 1/15/20 2:53 PM: --- I agree for better not to support transactions. Something lightweight would be better. was (Author: fhoering): I agree for transactions. Something lightweight would be better.
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016054#comment-17016054 ] Fabian Höring commented on ARROW-7584: -- I agree for transactions. Something lightweight would be better.
[jira] [Commented] (ARROW-7583) [C++][Flight] Auth handler tests fragile on Windows
[ https://issues.apache.org/jira/browse/ARROW-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016039#comment-17016039 ] Antoine Pitrou commented on ARROW-7583: --- Or perhaps we can relax the test, along with an explanatory comment? > [C++][Flight] Auth handler tests fragile on Windows > --- > > Key: ARROW-7583 > URL: https://issues.apache.org/jira/browse/ARROW-7583 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Antoine Pitrou >Priority: Minor > > This occurs often on AppVeyor: > {code} > [--] 3 tests from TestAuthHandler > [ RUN ] TestAuthHandler.PassAuthenticatedCalls > [ OK ] TestAuthHandler.PassAuthenticatedCalls (4 ms) > [ RUN ] TestAuthHandler.FailUnauthenticatedCalls > ..\src\arrow\flight\flight_test.cc(1126): error: Value of: status.message() > Expected: has substring "Invalid token" > Actual: "Could not write record batch to stream: " > [ FAILED ] TestAuthHandler.FailUnauthenticatedCalls (3 ms) > [ RUN ] TestAuthHandler.CheckPeerIdentity > [ OK ] TestAuthHandler.CheckPeerIdentity (2 ms) > [--] 3 tests from TestAuthHandler (10 ms total) > [--] 3 tests from TestBasicAuthHandler > [ RUN ] TestBasicAuthHandler.PassAuthenticatedCalls > [ OK ] TestBasicAuthHandler.PassAuthenticatedCalls (4 ms) > [ RUN ] TestBasicAuthHandler.FailUnauthenticatedCalls > ..\src\arrow\flight\flight_test.cc(1224): error: Value of: status.message() > Expected: has substring "Invalid token" > Actual: "Could not write record batch to stream: " > [ FAILED ] TestBasicAuthHandler.FailUnauthenticatedCalls (4 ms) > [ RUN ] TestBasicAuthHandler.CheckPeerIdentity > [ OK ] TestBasicAuthHandler.CheckPeerIdentity (3 ms) > [--] 3 tests from TestBasicAuthHandler (11 ms total) > {code} > See e.g. > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/30110376/job/vbtd22813g5hlgfl#L2252 -- This message was sent by Atlassian Jira (v8.3.4#803005)
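The relaxation Antoine suggests could accept either of the two observed messages and document why both indicate a rejected call. A hedged Python sketch of just the assertion logic (the real tests are C++ gtest; the function name here is hypothetical):

```python
def assert_unauthenticated(message):
    """Pass if the error message indicates the unauthenticated call was rejected.

    Hypothetical helper: on Windows the client stream can fail with a generic
    write error before the server's "Invalid token" response is read, so both
    messages are treated as evidence of rejection.
    """
    accepted = ("Invalid token", "Could not write record batch to stream")
    assert any(s in message for s in accepted), message

# Both outcomes seen in the CI logs above are accepted.
assert_unauthenticated("Invalid token")
assert_unauthenticated("Could not write record batch to stream: ")
```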
[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated ARROW-7584: - Description: The [new Python FileSystem API |https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems to be very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html] h2. Here are some examples *Filesystem access:* Before: fs.ls() fs.mkdir() fs.rmdir() Now: fs.get_target_stats() fs.create_dir() fs.delete_dir() What is the advantage of having a longer method ? The short ones seem clear and are much easier to use. Seems like an easy change. Also this is consistent with what is doing hdfs in the [fs api| https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem. *File opening:* Before: with fs.open(self, path, mode=u'rb', buffer_size=None) Now: fs.open_input_file() fs.open_input_stream() fs.open_output_stream() It seems more natural to fit to Python standard open function which works for local file access as well. Not sure if this is possible to do easily as there is `_wrap_output_stream` method. h2. Possible solutions - If the current Python API is still unused we could just rename the methods - We could keep everything as is and add some alias methods, it would make the FileSystem class a bit messy I think becasue there would be always 2 methods to do the work - Make everything compatible to FSSpec and reference the Spec, see https://issues.apache.org/jira/browse/ARROW-7102, I like the idea of a https://github.com/intake/filesystem_spec repo. 
Some comments on the proposed solutions there: Make a fsspec wrapper for pyarrow.fs => seems strange to me, it would be having to wrap again a FileSystem that is not good enough in yet another repo Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the documented "official" pyarow FileSystem it is fine I think, otherwise I would be yet another wrapper on top of the pyarrow "official" fs h2. Tensorflow RFC on FileSystems Tensorflow is also doing some standardization work on their FileSystem: https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations Not clear (to me) what they will do with Python file API though. it seems like they will also just wrap the C code back to [tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile] h2. Other considerations on FS ergonomics In the long run I would also like to enhance the FileSystem API and add more methods that use the basic ones to provide new features for example: - introduce put and get on top of the streams that directly upload/download files - introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from dask/hdfs3 - introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] from dask/hdfs3 - check if selector works with globs or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349 - be able to write strings to the file streams (instead of only bytes, already implemented by https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96), it would permit to directly use some Python API's like json.dump {code} with fs.open(path, "wb") as fd: res = {"a": "bc"} json.dump(res, fd) {code} instead of {code} with fs.open(path, "wb") as fd: res = {"a": "bc"} fd.write(json.dumps(res)) {code} or like currently (with old API, which required encore each time, untested with new one) {code}with fs.open(path, "wb") as fd: res = {"a": "bc"} fd.write(json.dumps(res).encode()) {code} - not 
clear how to make this also work when reading from files
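The last point above, writing strings (and reading them back) through bytes-only streams, can already be approximated with stdlib Python's io.TextIOWrapper. A minimal sketch, using an in-memory BytesIO in place of a real pyarrow output/input stream; the wrapping approach is an assumption for illustration, not something pyarrow provides today:

```python
import io
import json

# BytesIO stands in here for a filesystem's binary output stream.
buf = io.BytesIO()

# TextIOWrapper presents the bytes stream as a str stream, so
# str-based APIs like json.dump can write to it directly.
writer = io.TextIOWrapper(buf, encoding="utf-8")
json.dump({"a": "bc"}, writer)
writer.flush()

# The same wrapping works in the read direction for json.load.
buf.seek(0)
reader = io.TextIOWrapper(buf, encoding="utf-8")
assert json.load(reader) == {"a": "bc"}
```

If the FileSystem file objects are proper io.BufferedIOBase subclasses, the same wrapping would apply to them unchanged.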
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016020#comment-17016020 ] Antoine Pitrou commented on ARROW-7584: --- Keep in mind that the filesystem API is meant mainly to be used in conjunction with other Arrow facilities, primarily the new datasets facility (which may not be documented yet?). Making it nicer to use is a respectable goal as well, but the primary goal should be kept in mind. As for fsspec, the abstract filesystem API is so big that it doesn't seem very convenient to implement (e.g. do we have to support transactions?). > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > Labels: FileSystem > -- This message was sent by Atlassian Jira (v8.3.4#803005)
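The open()-style entry point discussed in this issue could be sketched as a thin dispatcher over the new-style stream methods. `fs_open` and the stub filesystem below are hypothetical names for illustration; only `open_input_stream`/`open_output_stream` are taken from the issue text:

```python
import io

class _StubFS:
    # In-memory stand-in for a pyarrow-style filesystem; it only
    # provides the two stream methods the sketch dispatches to.
    def __init__(self):
        self._files = {}

    def open_output_stream(self, path):
        buf = self._files[path] = io.BytesIO()
        return buf

    def open_input_stream(self, path):
        return io.BytesIO(self._files[path].getvalue())

def fs_open(fs, path, mode="rb"):
    # open()-style dispatch: the mode string picks the stream method.
    if mode in ("r", "rb"):
        return fs.open_input_stream(path)
    if mode in ("w", "wb"):
        return fs.open_output_stream(path)
    raise ValueError("unsupported mode: %r" % mode)

fs = _StubFS()
fs_open(fs, "/data/x", "wb").write(b"payload")
assert fs_open(fs, "/data/x", "rb").read() == b"payload"
```

A real implementation would also have to decide how text modes and buffer_size map onto the underlying methods.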
[jira] [Updated] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7518: -- Labels: pull-request-available (was: ) > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > > This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016010#comment-17016010 ] Fabian Höring commented on ARROW-7584: -- As indicated, I can work on that if there is consensus on what needs to be done and PRs will be accepted. If I were to choose I would - rename all new methods to stick to the old ones - make open() work - add some useful methods from dask/hdfs3 (that we use on our side). I also like the idea of fsspec, but not if it will be yet another wrapper. Only if pyarrow pulls in the spec for real and implements it (it would introduce a new dependency though) > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > Labels: FileSystem > -- This message was sent by Atlassian Jira (v8.3.4#803005)
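The alias option mentioned in this thread could be kept out of the main class body with a mixin, so the long-form methods stay the single implementation. The class names below are hypothetical; only the method names (ls/mkdir/rmdir and get_target_stats/create_dir/delete_dir) come from the issue:

```python
class ShortNameMixin:
    # Short aliases delegating to the longer new-style names.
    def ls(self, path):
        return self.get_target_stats(path)

    def mkdir(self, path):
        return self.create_dir(path)

    def rmdir(self, path):
        return self.delete_dir(path)

class RecordingFS(ShortNameMixin):
    # Minimal stand-in that records which long-form method ran.
    def __init__(self):
        self.calls = []

    def get_target_stats(self, path):
        self.calls.append(("get_target_stats", path))

    def create_dir(self, path):
        self.calls.append(("create_dir", path))

    def delete_dir(self, path):
        self.calls.append(("delete_dir", path))

fs = RecordingFS()
fs.ls("/data")
fs.mkdir("/data/new")
fs.rmdir("/data/old")
assert fs.calls == [("get_target_stats", "/data"),
                    ("create_dir", "/data/new"),
                    ("delete_dir", "/data/old")]
```

The downside noted in the issue still holds: every operation ends up with two public names.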
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016007#comment-17016007 ] Antoine Pitrou commented on ARROW-7584: --- Also cc [~jorisvandenbossche] for advice. > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > Labels: FileSystem > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated ARROW-7584: - Labels: FileSystem (was: ) > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > Labels: FileSystem > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016004#comment-17016004 ] Krisztian Szucs commented on ARROW-7518: We have a GHA cron job and a crossbow nightly to test the hdfs integration. > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Blocker > Fix For: 0.16.0 > > > This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7119) [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces
[ https://issues.apache.org/jira/browse/ARROW-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-7119: -- Assignee: Krisztian Szucs > [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces > --- > > Key: ARROW-7119 > URL: https://issues.apache.org/jira/browse/ARROW-7119 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The script was previously used on Travis, we should enable it in docker and > on GitHub actions to speed up the debugging process. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7576) [C++][Dev] Improve fuzzing setup
[ https://issues.apache.org/jira/browse/ARROW-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-7576. --- Fix Version/s: 0.16.0 Resolution: Fixed Issue resolved by pull request 6195 [https://github.com/apache/arrow/pull/6195] > [C++][Dev] Improve fuzzing setup > > > Key: ARROW-7576 > URL: https://issues.apache.org/jira/browse/ARROW-7576 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Developer Tools >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015999#comment-17015999 ] Fabian Höring commented on ARROW-7584: -- [~apitrou] [~kszucs] > [Python] Improve ergonomics of new FileSystem API > - > > Key: ARROW-7584 > URL: https://issues.apache.org/jira/browse/ARROW-7584 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Fabian Höring >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7576) [C++][Dev] Improve fuzzing setup
[ https://issues.apache.org/jira/browse/ARROW-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-7576: - Assignee: Antoine Pitrou > [C++][Dev] Improve fuzzing setup > > > Key: ARROW-7576 > URL: https://issues.apache.org/jira/browse/ARROW-7576 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Developer Tools >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7119) [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces
[ https://issues.apache.org/jira/browse/ARROW-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7119: -- Labels: pull-request-available (was: ) > [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces > --- > > Key: ARROW-7119 > URL: https://issues.apache.org/jira/browse/ARROW-7119 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > The script was previously used on Travis; we should enable it in Docker and > on GitHub Actions to speed up the debugging process.
[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated ARROW-7584:
Description:
The [new Python FileSystem API|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html]
h2. Here are some examples
*Filesystem access:*
Before: fs.ls() fs.mkdir() fs.rmdir()
Now: fs.get_target_stats() fs.create_dir() fs.delete_dir()
What is the advantage of having a longer method? The short ones seem clear and are much easier to use. Seems like an easy change. Also, this is consistent with what hdfs does in the [fs api|https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem.
*File opening:*
Before: with fs.open(self, path, mode=u'rb', buffer_size=None)
Now: fs.open_input_file() fs.open_input_stream() fs.open_output_stream()
It seems more natural to fit the standard Python open function, which works for local file access as well. Not sure if this is possible to do easily as there is the `_wrap_output_stream` method.
h2. Possible solutions
- If the current Python API is still unused, we could just rename the methods
- We could keep everything as is and add some alias methods; I think it would make the FileSystem class a bit messy because there would always be 2 methods to do the same work
- Make everything compatible with FSSpec and reference the spec, see https://issues.apache.org/jira/browse/ARROW-7102. I like the idea of the https://github.com/intake/filesystem_spec repo.
Some comments on the proposed solutions there:
Make an fsspec wrapper for pyarrow.fs => seems strange to me; it would mean wrapping, in yet another repo, a FileSystem that is not good enough
Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the documented "official" pyarrow FileSystem it is fine I think; otherwise it would be yet another wrapper on top of the "official" pyarrow fs
h2. Tensorflow RFC on FileSystems
Tensorflow is also doing some standardization work on their FileSystem: https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations
Not clear (to me) what they will do with the Python file API, though. It seems like they will also just wrap the C code back to [tf.GFile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]
h2. Other considerations on FS ergonomics
In the long run I would also like to enhance the FileSystem API and add more methods that use the basic ones to provide new features, for example:
- introduce put and get on top of the streams that directly upload/download files
- introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from dask/hdfs3
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] from dask/hdfs3
- check if the selector works with globs or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes); it would permit directly using some Python APIs like json.dump
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    json.dump(res, fd)
{code}
instead of
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res))
{code}
or like currently (with the old API, which required encoding each time; untested with the new one)
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res).encode())
{code}
- not clear how to make this also work when reading from files
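The "alias methods" option mentioned in the issue can be sketched in plain Python: short POSIX-style names that are simple class-body aliases of the longer ones, so both spellings resolve to the same function object. The `FileSystem` class and its method bodies below are stand-ins for illustration, not the real pyarrow.fs implementation, and the issue itself notes the downside that two names per operation can feel messy.

```python
# Sketch of the alias approach. FileSystem is a hypothetical stand-in;
# only the naming pattern mirrors the methods quoted in the issue.
class FileSystem:
    def get_target_stats(self, paths):
        # dummy implementation for illustration
        return [{"path": p, "type": "file"} for p in paths]

    def create_dir(self, path):
        # dummy implementation for illustration
        return "created " + path

    # Class-body assignments create the short aliases without duplicating logic:
    ls = get_target_stats
    mkdir = create_dir


fs = FileSystem()
# Both spellings call the same underlying method:
assert fs.ls(["a", "b"]) == fs.get_target_stats(["a", "b"])
assert fs.mkdir("x") == "created x"
```

This keeps a single implementation per operation; whether the aliases would be documented or merely tolerated is the open design question in the issue.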
[jira] [Commented] (ARROW-7583) [C++][Flight] Auth handler tests fragile on Windows
[ https://issues.apache.org/jira/browse/ARROW-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015984#comment-17015984 ] David Li commented on ARROW-7583:
Interesting. It seems an error from a path where gRPC gives a boolean success result (rather than a full error message) is masking the actual, intended error later on. My guess is that this happens:
# We start the DoPut, but this does not actually write any data.
# We close the writer right away. {{RecordBatchPayloadWriter::Close}} calls {{CheckStarted}} first, which tries to write the schema.
# If the test server responds quickly enough, the write fails. gRPC reports errors in this path as a boolean, so we report a generic error instead of the actual error message. If the test server doesn't respond quickly enough, then the write appears to succeed (IIRC gRPC does some sort of buffering?) and then we actually close the stream, at which point we get the intended error message.
The solution might be to bypass RecordBatchPayloadWriter's close and go directly to our implementation.
> [C++][Flight] Auth handler tests fragile on Windows > --- > > Key: ARROW-7583 > URL: https://issues.apache.org/jira/browse/ARROW-7583 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Antoine Pitrou >Priority: Minor > > This occurs often on AppVeyor: > {code} > [--] 3 tests from TestAuthHandler > [ RUN ] TestAuthHandler.PassAuthenticatedCalls > [ OK ] TestAuthHandler.PassAuthenticatedCalls (4 ms) > [ RUN ] TestAuthHandler.FailUnauthenticatedCalls > ..\src\arrow\flight\flight_test.cc(1126): error: Value of: status.message() > Expected: has substring "Invalid token" > Actual: "Could not write record batch to stream: " > [ FAILED ] TestAuthHandler.FailUnauthenticatedCalls (3 ms) > [ RUN ] TestAuthHandler.CheckPeerIdentity > [ OK ] TestAuthHandler.CheckPeerIdentity (2 ms) > [--] 3 tests from TestAuthHandler (10 ms total) > [--] 3 tests from TestBasicAuthHandler > [ RUN ] TestBasicAuthHandler.PassAuthenticatedCalls > [ OK ] TestBasicAuthHandler.PassAuthenticatedCalls (4 ms) > [ RUN ] TestBasicAuthHandler.FailUnauthenticatedCalls > ..\src\arrow\flight\flight_test.cc(1224): error: Value of: status.message() > Expected: has substring "Invalid token" > Actual: "Could not write record batch to stream: " > [ FAILED ] TestBasicAuthHandler.FailUnauthenticatedCalls (4 ms) > [ RUN ] TestBasicAuthHandler.CheckPeerIdentity > [ OK ] TestBasicAuthHandler.CheckPeerIdentity (3 ms) > [--] 3 tests from TestBasicAuthHandler (11 ms total) > {code} > See e.g. > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/30110376/job/vbtd22813g5hlgfl#L2252 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated ARROW-7584: - Description: The [new Python FileSystem API |https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems to be very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html] h2. Here are some examples *Filesystem access:* Before: fs.ls() fs.mkdir() fs.rmdir() Now: fs.get_target_stats() fs.create_dir() fs.delete_dir() What is the advantage of having a longer method ? The short ones seems clear and are much easier to use. Seems like an easy change. Also this is consistent with what is doing hdfs in the [fs api| https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem. *File opening:* Before: with fs.open(self, path, mode=u'rb', buffer_size=None) Now: fs.open_input_file() fs.open_input_stream() fs.open_output_stream() It seems more natural to fit to Python standard open function which works for local file access as well. Not sure if this is possible to do easily as there is `_wrap_output_stream` method. h2. Proposed solutions - If the current Python API is still unused we could just rename the methods - We could keep everything as is and add some alias methods, it would make the FileSystem class a bit messy I think becasue there would be always 2 methods to do the work h2. Tensorflow RFC on FileSystems Tensorflow is also doing some standardization work on their FileSystem: https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations Not clear (to me) what they will do with Python file API though. it seems like they will also just wrap the C code back to [tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile] h2. 
Other considerations on FS ergonomics In the long run I would also like to enhance the FileSystem API and add more methods that use the basic ones to provide new features for example: - introduce put and get on top of the streams that directly upload/download files - introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] - introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] - check if selector works with globs or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349 - be able to write strings to the file streams (instead of only bytes), it would permit to directly use some Python API's like json.dump {code} with fs.open(path, "wb") as fd: res = {"a": "bc"} json.dump(res, fd) {code} instead of {code} with fs.open(path, "wb") as fd: res = {"a": "bc"} fd.write(json.dumps(res)) {code} or like currently (with old API, which required encore each time, untested with new one) {code}with fs.open(path, "wb") as fd: res = {"a": "bc"} fd.write(json.dumps(res).encode()) {code} - not clear how to make this also work when reading from files was: The [new Python FileSystem API |https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems to be very verbose to use. The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html] h2. Here are some examples *Filesystem access:* Before: fs.ls() fs.mkdir() fs.rmdir() Now: fs.get_target_stats() fs.create_dir() fs.delete_dir() What is the advantage of having a longer method ? The short ones seems clear and are much easier to use. Seems like an easy change. Also this is consistent with what is doing hdfs in the [fs api| https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem. 
> [Python] Improve ergonomics of new FileSystem API > -- This message was sent by Atlassian Jira (v8.3.4#803005)
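The alias option from the proposed solutions in the issue could be prototyped roughly as follows. DemoFileSystem is a made-up stand-in for pyarrow's FileSystem class with toy bodies; only the naming pattern (short aliases bound to the new, longer method names) is the point.

```python
# Sketch: short aliases layered on top of the new FileSystem method names.
# DemoFileSystem is hypothetical; the real class would delegate to Arrow.
class DemoFileSystem:
    def __init__(self):
        self._dirs = set()

    def create_dir(self, path):
        self._dirs.add(path)

    def delete_dir(self, path):
        self._dirs.discard(path)

    def get_target_stats(self, path):
        return sorted(d for d in self._dirs if d.startswith(path))

    # Short aliases in the spirit of the old fs API: same function objects,
    # so both names always behave identically.
    mkdir = create_dir
    rmdir = delete_dir
    ls = get_target_stats


fs = DemoFileSystem()
fs.mkdir("/data/a")
fs.mkdir("/data/b")
print(fs.ls("/data"))  # ['/data/a', '/data/b']
fs.rmdir("/data/a")
print(fs.ls("/data"))  # ['/data/b']
```

Binding the alias at class-definition time (rather than writing a forwarding method) keeps the two names in lockstep, which is the "2 methods doing the same work" trade-off the issue mentions.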