[jira] [Created] (ARROW-17154) [C++] Change cmake project name from arrow_python to pyarrow_cpp
Alenka Frim created ARROW-17154:
-----------------------------------

Summary: [C++] Change cmake project name from arrow_python to pyarrow_cpp
Key: ARROW-17154
URL: https://issues.apache.org/jira/browse/ARROW-17154
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Alenka Frim
Assignee: Alenka Frim
Fix For: 10.0.0

See discussion https://github.com/apache/arrow/pull/13311#discussion_r926198302
[jira] [Created] (ARROW-17153) [CI][Homebrew] Require glib-utils
Kouhei Sutou created ARROW-17153:
-----------------------------------

Summary: [CI][Homebrew] Require glib-utils
Key: ARROW-17153
URL: https://issues.apache.org/jira/browse/ARROW-17153
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration, GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
Fix For: 9.0.0
[jira] [Created] (ARROW-17152) [Docs] Enable dark mode on documentation site
Will Jones created ARROW-17152:
-----------------------------------

Summary: [Docs] Enable dark mode on documentation site
Key: ARROW-17152
URL: https://issues.apache.org/jira/browse/ARROW-17152
Project: Apache Arrow
Issue Type: New Feature
Reporter: Will Jones
Fix For: 10.0.0
Attachments: Screen Shot 2022-07-20 at 3.10.51 PM.png, Screen Shot 2022-07-20 at 3.12.18 PM.png

pydata-sphinx-theme adds dark mode in version 0.9.0. We will need to adapt our logo ([see docs|https://pydata-sphinx-theme.readthedocs.io/en/stable/user_guide/configuring.html?highlight=dark#different-logos-for-light-and-dark-mode]). There are also some places in the docs where we may need to adjust additional CSS. See attached screenshots.
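For reference, a minimal sketch of the logo switch in the Sphinx {{conf.py}}, following the pydata-sphinx-theme configuration guide linked above (the image file names here are placeholders, not the actual Arrow assets):

{code:python}
# conf.py (sketch): pydata-sphinx-theme 0.9 can pick a logo per color
# scheme; "logo-light.png" / "logo-dark.png" are hypothetical file names
html_theme_options = {
    "logo": {
        "image_light": "logo-light.png",
        "image_dark": "logo-dark.png",
    }
}
{code}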
[jira] [Created] (ARROW-17151) [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode
Will Jones created ARROW-17151:
-----------------------------------

Summary: [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode
Key: ARROW-17151
URL: https://issues.apache.org/jira/browse/ARROW-17151
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation
Reporter: Will Jones
Fix For: 9.0.0

pydata-sphinx-theme introduced automatic dark mode. However, there are a number of changes we need to make (such as providing a dark-mode Arrow logo) before we are ready for it. For the 9.0.0 release, we should instead pin to the version of pydata-sphinx-theme just before that release.
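Concretely, the pin might look like this in the docs requirements file (a sketch; it assumes 0.8.1 is the last release before automatic dark mode landed in 0.9.0, which should be verified against PyPI):

{code}
# pin to the last pre-dark-mode release (0.8.1 assumed)
pydata-sphinx-theme==0.8.1
{code}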
[jira] [Created] (ARROW-17150) [R] Allow statically linked libcurl in GCS when building libarrow DLL in RTools
Will Jones created ARROW-17150:
-----------------------------------

Summary: [R] Allow statically linked libcurl in GCS when building libarrow DLL in RTools
Key: ARROW-17150
URL: https://issues.apache.org/jira/browse/ARROW-17150
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 9.0.0
Reporter: Will Jones
Fix For: 10.0.0

Neal's patch in ARROW-16510 enabled libcurl to be linked statically into the Google Cloud Storage dependency, but this only seems to work for static libraries on RTools (Windows). For development RTools environments we use dynamic Arrow libraries instead, and there we currently get linker errors against libcurl when ARROW_GCS is on.
[jira] [Created] (ARROW-17149) [R] Enable GCS tests for Windows
Will Jones created ARROW-17149:
-----------------------------------

Summary: [R] Enable GCS tests for Windows
Key: ARROW-17149
URL: https://issues.apache.org/jira/browse/ARROW-17149
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration, R
Affects Versions: 9.0.0
Reporter: Will Jones
Fix For: 10.0.0

In ARROW-16879, I found the GCS tests were hanging in CI, but couldn't diagnose why. We should solve that and enable the tests.
[GitHub] [arrow-adbc] lidavidm merged pull request #41: [Python] Complete minimal bindings for ADBC
lidavidm merged PR #41:
URL: https://github.com/apache/arrow-adbc/pull/41
[GitHub] [arrow-adbc] lidavidm closed issue #37: Reorganize and complete Python bindings
lidavidm closed issue #37: Reorganize and complete Python bindings
URL: https://github.com/apache/arrow-adbc/issues/37
[jira] [Created] (ARROW-17148) [R] Improve evaluation of R functions from C++
Dewey Dunnington created ARROW-17148:
-------------------------------------

Summary: [R] Improve evaluation of R functions from C++
Key: ARROW-17148
URL: https://issues.apache.org/jira/browse/ARROW-17148
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Dewey Dunnington

There are currently a few places where we call R code from C++ (and after ARROW-16444 and ARROW-16703 we will have more, where the overhead of calling into R might be greater than the time it takes to actually evaluate the function, or where the functions will be called in a tight loop). The current approach uses {{cpp11::function}}. This is totally fine and safe, but it generates some ugly backtraces on error and is potentially slower than the lean-and-mean approach of purrr (whose entire job is to call R functions in a loop and which has been heavily optimized). The purrr approach is to construct the {{call()}} and calling environment in advance and then just run {{Rf_eval(call, env)}} in the loop. This is both faster (fewer R API calls) and generates better backtraces (e.g., {{Error in fun(arg1, arg2)}} instead of {{Error in (function(a, b) { ...the whole content of the function... })(every, deparsed, argument)}}). Before optimizing that heavily we should of course benchmark to see exactly how much it matters!
[jira] [Created] (ARROW-17147) [R] parse_date_time should support locale parameter
Rok Mihevc created ARROW-17147:
-----------------------------------

Summary: [R] parse_date_time should support locale parameter
Key: ARROW-17147
URL: https://issues.apache.org/jira/browse/ARROW-17147
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Rok Mihevc

See [discussion here|https://github.com/apache/arrow/pull/13627#discussion_r924875872].
[jira] [Created] (ARROW-17146) [R] parse_date_time should support quiet = FALSE
Rok Mihevc created ARROW-17146:
-----------------------------------

Summary: [R] parse_date_time should support quiet = FALSE
Key: ARROW-17146
URL: https://issues.apache.org/jira/browse/ARROW-17146
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Rok Mihevc

See [discussion here|https://github.com/apache/arrow/pull/13627#discussion_r924875872].
[GitHub] [arrow-adbc] lidavidm opened a new pull request, #41: [Python] Complete minimal bindings for ADBC
lidavidm opened a new pull request, #41:
URL: https://github.com/apache/arrow-adbc/pull/41

Also refactors the bindings to not depend on PyArrow.
[jira] [Created] (ARROW-17145) [C++] Compilation warnings on gcc in release mode
Antoine Pitrou created ARROW-17145:
-----------------------------------

Summary: [C++] Compilation warnings on gcc in release mode
Key: ARROW-17145
URL: https://issues.apache.org/jira/browse/ARROW-17145
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Antoine Pitrou

With gcc 10.3 I get this warning in release mode.
{code}
[168/321] Building CXX object src/arrow/CMakeFiles/arrow_testing_objlib.dir/compute/exec/test_util.cc.o
In file included from /home/antoine/arrow/dev/cpp/src/arrow/compute/exec/test_util.h:28,
                 from /home/antoine/arrow/dev/cpp/src/arrow/compute/exec/test_util.cc:18:
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec.h: In member function 'R arrow::internal::FnOnce<R(A...)>::FnImpl<Fn>::invoke(A&& ...) [with Fn = arrow::Future<>::WrapResultyOnComplete::Callback::ThenOnComplete >)::, arrow::Future<>::PassthruOnFailure >):: > > >; R = void; A = {const arrow::FutureImpl&}]':
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec.h:177:21: warning: '*((void*)(&)+8).arrow::compute::ExecBatch::length' may be used uninitialized in this function [-Wmaybe-uninitialized]
  177 | struct ARROW_EXPORT ExecBatch {
      |                     ^
{code}
[jira] [Created] (ARROW-17144) Adding sqrt Function
Sahaj Gupta created ARROW-17144:
-----------------------------------

Summary: Adding sqrt Function
Key: ARROW-17144
URL: https://issues.apache.org/jira/browse/ARROW-17144
Project: Apache Arrow
Issue Type: New Feature
Reporter: Sahaj Gupta

Adding sqrt function.
[jira] [Created] (ARROW-17143) [R] Add examples working with `tidyr::unnest` and `tidyr::unnest_longer`
SHIMA Tatsuya created ARROW-17143:
-----------------------------------

Summary: [R] Add examples working with `tidyr::unnest` and `tidyr::unnest_longer`
Key: ARROW-17143
URL: https://issues.apache.org/jira/browse/ARROW-17143
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation, R
Affects Versions: 8.0.1
Reporter: SHIMA Tatsuya

Related to ARROW-8813

The arrow package can convert JSON files to data frames very easily, but {{tidyr::unnest_longer}} is needed for array expansion. Wonder if {{tidyr}} could be added as a recommended package and examples like this could be included in the documentation and test cases.

{code:r}
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
  { "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
  { "hello": 3.25, "world": null }
  { "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
', tf)

arrow::read_json_arrow(tf) |>
  tidyr::unnest(foo, names_sep = ".") |>
  tidyr::unnest_longer(foo.bar)
#> # A tibble: 6 × 3
#>   hello world foo.bar
#>   <dbl> <lgl>   <dbl>
#> 1  3.5  FALSE       1
#> 2  3.5  FALSE       2
#> 3  3.25 NA          NA
#> 4  0    TRUE        3
#> 5  0    TRUE        4
#> 6  0    TRUE        5
{code}
[jira] [Created] (ARROW-17142) `equals` method on Parquet Metadata segfaults when passed `None`
Kshiteej K created ARROW-17142:
-----------------------------------

Summary: `equals` method on Parquet Metadata segfaults when passed `None`
Key: ARROW-17142
URL: https://issues.apache.org/jira/browse/ARROW-17142
Project: Apache Arrow
Issue Type: Bug
Reporter: Kshiteej K

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3]})

# Here metadata is None
metadata = table.schema.metadata

fname = "data.parquet"
pq.write_table(table, fname)

# Get `metadata`.
r_metadata = pq.read_metadata(fname)

# Equals on Metadata segfaults when passed None
r_metadata.equals(metadata)
{code}
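Until this is fixed, a user-side guard is straightforward (a sketch; {{metadata_equals}} is a hypothetical helper, not pyarrow API):

{code:python}
import pyarrow.parquet as pq

# hypothetical helper: treat None as "not equal" instead of passing it
# through to FileMetaData.equals(), which currently segfaults
def metadata_equals(left, right):
    if right is None:
        return False
    return left.equals(right)

r_metadata = pq.read_metadata("data.parquet")  # file written by the repro above
print(metadata_equals(r_metadata, None))  # False, instead of a segfault
{code}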
[jira] [Created] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path
Rok Mihevc created ARROW-17141:
-----------------------------------

Summary: [C++] Enable selecting nested fields in StructArray with field path
Key: ARROW-17141
URL: https://issues.apache.org/jira/browse/ARROW-17141
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Rok Mihevc

Currently, selecting a nested field in a StructArray requires multiple selects or flattening the schema. It would be more user-friendly to accept a field path, e.g. {{field_in_top_struct.field_in_substruct}}; a sketch of the difference from the Python side is below.
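For illustration, a sketch assuming the index-path form of the {{struct_field}} compute function (the dotted-path call at the end is the proposal, not an existing API):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# a StructArray whose first field is itself a struct
arr = pa.array([{"outer": {"inner": 1}}, {"outer": {"inner": 2}}])

# today: walk the nesting level by level with integer indices
inner = pc.struct_field(arr, indices=[0, 0])
print(inner)  # -> [1, 2]

# proposed: address the same child with a single field path, e.g.
# something like struct_field(arr, "outer.inner")  (hypothetical)
{code}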
[jira] [Created] (ARROW-17140) Adding Floor Function
Sahaj Gupta created ARROW-17140:
-----------------------------------

Summary: Adding Floor Function
Key: ARROW-17140
URL: https://issues.apache.org/jira/browse/ARROW-17140
Project: Apache Arrow
Issue Type: New Feature
Reporter: Sahaj Gupta

Adding floor function.
[jira] [Created] (ARROW-17139) [Python] Add field() method to get field from StructType
Joris Van den Bossche created ARROW-17139:
------------------------------------------

Summary: [Python] Add field() method to get field from StructType
Key: ARROW-17139
URL: https://issues.apache.org/jira/browse/ARROW-17139
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

From ARROW-17047: We could also add a {{field()}} method to {{StructType}} that returns a field (that is more discoverable than {{[]}}, and would be consistent with Schema and with StructArray, where it gets the child array for that field).
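For context, a quick sketch of the current and proposed access patterns (the {{field()}} call is the proposal and does not exist yet at the time of this issue):

{code:python}
import pyarrow as pa

struct_type = pa.struct([("a", pa.int32()), ("b", pa.string())])

# today: child fields are reachable by index via []
print(struct_type[0])  # the field "a"

# proposed, mirroring Schema.field() and StructArray.field():
# struct_type.field("a")   # not yet available; the subject of this issue
{code}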
[jira] [Created] (ARROW-17137) [Python] Converting data frame to Table with large nested column fails `Invalid Struct child array has length smaller than expected`
Simon Weiß created ARROW-17137:
-----------------------------------

Summary: [Python] Converting data frame to Table with large nested column fails `Invalid Struct child array has length smaller than expected`
Key: ARROW-17137
URL: https://issues.apache.org/jira/browse/ARROW-17137
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Simon Weiß

Hey, I have a data frame for which one column is a nested struct array. Converting it to a `pyarrow.Table` fails if the data frame gets too big. I could reproduce the bug with a minimal example with anonymized data that is roughly similar to mine. With, e.g., `N_ROWS = 500_000` or smaller, it works fine.

```python
import pandas as pd
import pyarrow as pa

N_ROWS = 800_000

item_record = {
    "someImportantAssets": [
        {
            "square": "https://some.super.loong.link.com/withmany/lorem/upload/ipsum/stilllooonger/lorem/{someparameter}/156fdjjf644984dfdfaera648/specificLink-i15348891"
        }
    ],
    "id": "i15348891",
    "title": "Some Long Item Title i15348891",
}

user_record = {
    "userId": "faa4648-4964drf-64648fafa648-4648falj",
    "data": [item_record for _ in range(24)],
}

df = pd.DataFrame([user_record for _ in range(N_ROWS)])
table = pa.Table.from_pandas(df)
```

```python-traceback
Traceback (most recent call last):
  table = pa.Table.from_pandas(df)
  File "pyarrow/table.pxi", line 1658, in pyarrow.lib.Table.from_pandas
  File "pyarrow/table.pxi", line 1702, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1314, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #1 invalid: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (13256071 < 13256072)
```

The length is always smaller than expected by 1.

h2. Expected behavior:

Run without errors or fail with a better error message.

h2. System Info and Versions:

Apple M1 Pro, but it also happened on an amd64 Linux machine on AWS.

```
arrow-cpp  7.0.0  py39h8a997f0_8_cpu  conda-forge
pyarrow    7.0.0  py39h3a11367_8_cpu  conda-forge
python     3.9.7  h54d631c_3_cpython  conda-forge
```

I could also reproduce with `pyarrow 8.0.0`.
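A possible user-side workaround while this is open, sketched under the assumption that the failure only appears once a single conversion gets very large (`table_from_pandas_chunked` and the slice size are hypothetical, not pyarrow API):

```python
import pandas as pd
import pyarrow as pa

# hypothetical helper: convert the frame in slices and stitch the pieces
# back together, so no single from_pandas call sees all 800k rows
def table_from_pandas_chunked(df: pd.DataFrame, rows_per_chunk: int = 100_000) -> pa.Table:
    tables = [
        pa.Table.from_pandas(df.iloc[start:start + rows_per_chunk])
        for start in range(0, len(df), rows_per_chunk)
    ]
    return pa.concat_tables(tables)
```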
[jira] [Created] (ARROW-17136) open_append_stream throws an error if the file does not exist
Sagar Shinde created ARROW-17136:
-----------------------------------

Summary: open_append_stream throws an error if the file does not exist
Key: ARROW-17136
URL: https://issues.apache.org/jira/browse/ARROW-17136
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 8.0.0
Reporter: Sagar Shinde

According to the documentation, open_append_stream will create the file if it does not exist. But when I try to append to a file in HDFS, it throws a file-not-found error:

{code}
hdfsOpenFile(/tmp/xyz.json): FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
RemoteException: Failed to append to non-existent file /tmp/xyz.json for client 10.128.8.11
    at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)

java.io.FileNotFoundException: Failed to append to non-existent file /tmp/xyz.json for client 10.128.8.11
    at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
    at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1367)
    at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1424)
    at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1394)
    at org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
    at org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:419)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:431)
    at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:400)
    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1386)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Failed to append to non-existent file /tmp/xyz.json for client 10.128.8.11
    at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104) at
{code}
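A user-side workaround sketch until the reported behavior is resolved (assumes {{pyarrow.fs.HadoopFileSystem}} with default connection settings; the path is from the report, the payload line is a placeholder):

{code:python}
from pyarrow import fs

# connect using fs.defaultFS from the Hadoop configuration
hdfs = fs.HadoopFileSystem("default")
path = "/tmp/xyz.json"

# create the file when it is missing, append otherwise
if hdfs.get_file_info(path).type == fs.FileType.NotFound:
    stream = hdfs.open_output_stream(path)
else:
    stream = hdfs.open_append_stream(path)

with stream:
    stream.write(b'{"key": "value"}\n')  # placeholder payload
{code}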