[jira] [Created] (ARROW-18105) Arrow Flight SegFault
Ziheng Wang created ARROW-18105:
-----------------------------------

Summary: Arrow Flight SegFault
Key: ARROW-18105
URL: https://issues.apache.org/jira/browse/ARROW-18105
Project: Apache Arrow
Issue Type: Bug
Components: FlightRPC
Affects Versions: 9.0.0
Reporter: Ziheng Wang

A typo in the gRPC endpoint URI ("grcp" instead of "grpc") results in a segfault. It should probably result in an error or warning instead.

{noformat}
ziheng@ziheng:~$ python3
Python 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.flight
>>> flight_client = pyarrow.flight.connect("grcp://0.0.0.0:5005")
{noformat}
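A minimal sketch of a client-side guard that would surface the bad scheme as a Python error rather than a crash; the `safe_connect` helper and the accepted-scheme list are illustrative assumptions, not Arrow's actual fix:

{code}
import pyarrow.flight

# Hypothetical guard, not Arrow's real behavior: reject URI schemes that
# Flight does not understand before handing them to the C++ client.
KNOWN_SCHEMES = ("grpc", "grpc+tcp", "grpc+tls", "grpc+unix")

def safe_connect(location: str) -> pyarrow.flight.FlightClient:
    scheme = location.split("://", 1)[0]
    if scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unsupported Flight URI scheme: {scheme!r}")
    return pyarrow.flight.connect(location)

# safe_connect("grcp://0.0.0.0:5005")  # raises ValueError instead of segfaulting
{code}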
[jira] [Created] (ARROW-18104) Can't connect to HDFS in tmux?
Kyoung-Rok Jang created ARROW-18104:
-----------------------------------

Summary: Can't connect to HDFS in tmux?
Key: ARROW-18104
URL: https://issues.apache.org/jira/browse/ARROW-18104
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 9.0.0
Reporter: Kyoung-Rok Jang

Hello. I'm trying to load `.parquet` files from an HDFS path using `pyarrow.fs.HadoopFileSystem`. I'm working in my company's cluster, which relies on Kerberos. The following code works in the main shell, but strangely it doesn't work in a `tmux` session. I do `kinit` before running this code. One more strange thing is that I can run HDFS commands in tmux without any problem, e.g. `hdfs dfs -ls`. What could be the cause?

```python
from pyarrow import fs

hdfs = fs.HadoopFileSystem("default", port=0)
hdfs.create_dir("created_by_pyarrow")  # works in the main shell, doesn't work in tmux
```

The error I see is as follows:

```
2022-10-19 00:06:45,704 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]
hdfsGetPathInfo(created_by_pyarrow): getFileInfo error: KrbApErrException: Fail to create credential. (63) - No service creds
java.io.IOException: DestHost:destPort abc-nn1.bdp.bdata.ai:9020 , LocalHost:localPort asca0x0930.nfra.io/10.168.26.12:0. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
	at org.apache.hadoop.ipc.Client.call(Client.java:1457)
	at org.apache.hadoop.ipc.Client.call(Client.java:1367)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1580)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1595)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]
	at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:760)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
	at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:723)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:817)
	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:411)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1572)
	at org.apache.hadoop.ipc.Client.call(Client.java:1403)
```
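One way to narrow this down (an assumption on my part, not a confirmed diagnosis): libhdfs resolves its configuration and the Kerberos ticket cache from the environment at connect time, and tmux panes can inherit a stale environment from when the tmux server was started. Comparing these variables inside and outside tmux would rule that out:

```python
import os

# Hypothetical first check: look for environment drift between the main
# shell and the tmux pane; libhdfs reads these when connecting.
for var in ("KRB5CCNAME", "HADOOP_CONF_DIR", "HADOOP_HOME", "JAVA_HOME", "CLASSPATH"):
    print(var, "=", os.environ.get(var))
```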
[jira] [Created] (ARROW-18103) [Packaging][deb][RPM] Upload artifacts patterns are wrong
Kouhei Sutou created ARROW-18103:
-----------------------------------

Summary: [Packaging][deb][RPM] Upload artifacts patterns are wrong
Key: ARROW-18103
URL: https://issues.apache.org/jira/browse/ARROW-18103
Project: Apache Arrow
Issue Type: Bug
Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
Fix For: 10.0.0

The patterns may upload multiple artifacts that have the same base name. This causes a 422 HTTP response from GitHub, because GitHub's upload artifact API returns 422 if the base name already exists in the release.

https://app.travis-ci.com/github/ursacomputing/crossbow/builds/256830240

{noformat}
curl: (22) The requested URL returned error: 422
{noformat}
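For illustration only (these file names are made up), the failure mode is two matched artifact paths that collapse to a single release asset name:

{code}
import os

# Hypothetical example: both paths match the upload pattern, but GitHub
# release assets are keyed by base name, so the second upload returns 422.
paths = ["debian-bookworm/apache-arrow.deb", "ubuntu-jammy/apache-arrow.deb"]
base_names = [os.path.basename(p) for p in paths]
assert len(set(base_names)) < len(paths)  # duplicate base name -> 422
{code}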
[jira] [Created] (ARROW-18102) dplyr::count and dplyr::tally implementation return NA instead of 0
Adam Black created ARROW-18102:
-----------------------------------

Summary: dplyr::count and dplyr::tally implementation return NA instead of 0
Key: ARROW-18102
URL: https://issues.apache.org/jira/browse/ARROW-18102
Project: Apache Arrow
Issue Type: Bug
Components: R
Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
Reporter: Adam Black

I'm using dplyr with FileSystemDataset objects. The expected behavior is similar to (or the same as) data frame behavior. When the FileSystemDataset has zero rows, dplyr::count and dplyr::tally return NA instead of 0. I would expect the result to be 0.

``` r
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

path <- tempfile(fileext = ".feather")
zero_row_dataset <- cars %>% filter(dist < 0)

# expected behavior
zero_row_dataset %>% count()
#>   n
#> 1 0
zero_row_dataset %>% tally()
#>   n
#> 1 0
nrow(zero_row_dataset)
#> [1] 0

# now test behavior with a FileSystemDataset
write_feather(zero_row_dataset, path)
ds <- open_dataset(path, format = "feather")
ds
#> FileSystemDataset with 1 Feather file
#> speed: double
#> dist: double
#> 
#> See $metadata for additional Schema metadata

# actual behavior
ds %>% count() %>% collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    NA
ds %>% tally() %>% collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    NA
nrow(ds) # works as expected
#> [1] 0
```

Created on 2022-10-19 with [reprex v2.0.2](https://reprex.tidyverse.org)
[jira] [Created] (ARROW-18101) [R] RecordBatchReaderHead from ExecPlan with UDF cannot be read
Neal Richardson created ARROW-18101:
-----------------------------------

Summary: [R] RecordBatchReaderHead from ExecPlan with UDF cannot be read
Key: ARROW-18101
URL: https://issues.apache.org/jira/browse/ARROW-18101
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Neal Richardson

{code}
register_scalar_function(
  "times_32",
  function(context, x) x * 32.0,
  int32(),
  float64(),
  auto_convert = TRUE
)

record_batch(a = 1:1000) %>%
  dplyr::mutate(b = times_32(a)) %>%
  as_record_batch_reader() %>%
  head(11) %>%
  as_arrow_table()

# Error: NotImplemented: Call to R (resolve scalar user-defined function output data type) from a non-R thread from an unsupported context
# /arrow/cpp/src/arrow/compute/exec.cc:649  kernel_->signature->out_type().Resolve(kernel_ctx_, args.inputs)
# /arrow/cpp/src/arrow/compute/exec/expression.cc:602  executor->Init(_context, {kernel, types, options})
# /arrow/cpp/src/arrow/compute/exec/project_node.cc:91  ExecuteScalarExpression(simplified_expr, target, plan()->exec_context())
# /arrow/cpp/src/arrow/record_batch.cc:336  ReadNext()
# /arrow/cpp/src/arrow/record_batch.cc:350  ToRecordBatches()
{code}

It works fine if you don't call {{as_record_batch_reader()}} in the middle. Oddly, it also works fine if you add {{as_adq()}} (aka {{collapse()}}) after {{head()}} and before evaluating to a table; that is, if you run it through an ExecPlan again, it doesn't error.
[jira] [Created] (ARROW-18100) [C++] Intermittent failure in TestNewScanner.Backpressure
Weston Pace created ARROW-18100:
-----------------------------------

Summary: [C++] Intermittent failure in TestNewScanner.Backpressure
Key: ARROW-18100
URL: https://issues.apache.org/jira/browse/ARROW-18100
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Weston Pace

For example: https://github.com/ursacomputing/crossbow/actions/runs/3277989378/jobs/5395881371#step:5:3133
[jira] [Created] (ARROW-18099) Cannot create pandas categorical from table only with nulls
Damian Barabonkov created ARROW-18099:
-----------------------------------

Summary: Cannot create pandas categorical from table only with nulls
Key: ARROW-18099
URL: https://issues.apache.org/jira/browse/ARROW-18099
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 9.0.0
Environment: OSX 12.6 M1 silicon
Reporter: Damian Barabonkov

A pyarrow Table with only null values cannot be converted to a pandas DataFrame with that column as a category. However, pandas does support "empty" categoricals. Therefore, a simple patch would be to load the column as object first and convert it, once in pandas, to a categorical (which will be empty). However, that does not solve the pyarrow bug at its root.

Sample reproducible example:

```python
import pyarrow as pa

pylist = [{'x': None, '__index_level_0__': 2}, {'x': None, '__index_level_0__': 3}]
tbl = pa.Table.from_pylist(pylist)

# Errors
df_broken = tbl.to_pandas(categories=["x"])

# Works
df_works = tbl.to_pandas()
df_works = df_works.astype({"x": "category"})
```
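As a sanity check on the claim that pandas itself supports an all-null categorical (my own verification, not part of the report):

```python
import pandas as pd

# pandas happily builds a categorical with no observed categories,
# which is why the object-dtype detour above works.
s = pd.Series([None, None], dtype="category")
print(s.cat.categories)  # empty Index
print(s.isna().all())    # True
```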
[jira] [Created] (ARROW-18098) [C++] Vector kernel for "intersecting" two arrays (all common elements)
Joris Van den Bossche created ARROW-18098:
-----------------------------------

Summary: [C++] Vector kernel for "intersecting" two arrays (all common elements)
Key: ARROW-18098
URL: https://issues.apache.org/jira/browse/ARROW-18098
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Joris Van den Bossche

This would be similar to numpy's {{intersect1d}} (https://numpy.org/doc/stable/reference/generated/numpy.intersect1d.html).
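Until such a kernel exists, something close can be assembled from existing compute functions; a sketch of a workaround (my own, not a proposed API):

{code}
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, 2, 3, 4])
b = pa.array([3, 4, 5])

# Keep the elements of `a` that also occur in `b`, then deduplicate:
# roughly numpy.intersect1d, minus the sorted-output guarantee.
common = pc.unique(a.filter(pc.is_in(a, value_set=b)))
print(common)  # [3, 4]
{code}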
[jira] [Created] (ARROW-18097) [C++] Add a "list_contains" kernel
Joris Van den Bossche created ARROW-18097:
-----------------------------------

Summary: [C++] Add a "list_contains" kernel
Key: ARROW-18097
URL: https://issues.apache.org/jira/browse/ARROW-18097
Project: Apache Arrow
Issue Type: Task
Components: C++
Reporter: Joris Van den Bossche

Assume you have a list array:

{code}
arr = pa.array([["a", "b"], ["a", "c"], ["b", "c", "d"]])
{code}

And you want to know, for each list, whether it contains a certain value (of the same type as the list's values). A "list_contains" function (or other name) would be useful for that:

{code}
pc.list_contains(arr, "a")
# -> True, True, False
{code}

The current workaround that I found was flattening, checking equality, and then reducing again with a group-by, but this is quite tedious:

{code}
>>> temp = pa.table({'index': pc.list_parent_indices(arr),
...                  'contains_value': pc.equal(pc.list_flatten(arr), "a")})
>>> temp.group_by('index').aggregate([('contains_value', 'any')])['contains_value_any'].chunk(0)
[
  true,
  true,
  false
]
{code}

But this also only works if there are no empty or missing list values.
[jira] [Created] (ARROW-18096) [Dev] Remove github user names from merge commit message
Joris Van den Bossche created ARROW-18096:
-----------------------------------

Summary: [Dev] Remove github user names from merge commit message
Key: ARROW-18096
URL: https://issues.apache.org/jira/browse/ARROW-18096
Project: Apache Arrow
Issue Type: Task
Components: Developer Tools
Reporter: Joris Van den Bossche

We currently use the body of the top post comment of a GitHub PR as the body of the commit message. It is not uncommon to tag someone when opening a PR, but retaining those GitHub usernames in the commit message is annoying, as that can generate additional notifications for the people that were tagged. It should be straightforward to remove the GitHub user names from the message body (for example, just remove the @ so it no longer works as a user name link), as sketched below.
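A rough sketch of the kind of rewrite the merge script could apply; the regex and function name here are illustrative, not the actual dev tooling:

{code}
import re

def strip_mentions(body: str) -> str:
    # Drop the "@" from GitHub @mentions so the retained commit message
    # no longer notifies people; the bare user name is kept for context.
    return re.sub(r"(?<!\w)@([A-Za-z0-9][A-Za-z0-9-]*)", r"\1", body)

print(strip_mentions("cc @kou @jorisvandenbossche"))
# -> "cc kou jorisvandenbossche"
{code}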
[jira] [Created] (ARROW-18095) [CI][C++][MinGW] All tests exited with 0xc0000139
Kouhei Sutou created ARROW-18095:
-----------------------------------

Summary: [CI][C++][MinGW] All tests exited with 0xc0000139
Key: ARROW-18095
URL: https://issues.apache.org/jira/browse/ARROW-18095
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

https://github.com/apache/arrow/actions/runs/3261682270/jobs/5357126875

{noformat}
+ ctest --label-regex unittest --output-on-failure --parallel 2 --timeout 300 --exclude-regex 'gandiva-internals-test|gandiva-projector-test|gandiva-utf8-test|gandiva-binary-test|gandiva-boolean-expr-test|gandiva-date-time-test|gandiva-decimal-single-test|gandiva-decimal-test|gandiva-filter-project-test|gandiva-filter-test|gandiva-hash-test|gandiva-if-expr-test|gandiva-in-expr-test|gandiva-literal-test|gandiva-null-validity-test|gandiva-precompiled-test|gandiva-projector-test'
Test project D:/a/arrow/arrow/build/cpp
      Start  1: arrow-array-test
      Start  2: arrow-buffer-test
 1/67 Test  #2: arrow-buffer-test ................ Exit code 0xc0000139 ***Exception:   0.15 sec
      Start  3: arrow-extension-type-test
 2/67 Test  #1: arrow-array-test ................. Exit code 0xc0000139 ***Exception:   0.17 sec
      Start  4: arrow-misc-test
 3/67 Test  #3: arrow-extension-type-test ........ Exit code 0xc0000139 ***Exception:   0.04 sec

	 39 - arrow-dataset-discovery-test (Exit code 0xc0000139)
	 40 - arrow-dataset-file-ipc-test (Exit code 0xc0000139)
	 41 - arrow-dataset-file-test (Exit code 0xc0000139)
	 42 - arrow-dataset-partition-test (Exit code 0xc0000139)
	 43 - arrow-dataset-scanner-test (Exit code 0xc0000139)
	 44 - arrow-dataset-file-csv-test (Exit code 0xc0000139)
	 45 - arrow-dataset-file-parquet-test (Exit code 0xc0000139)
	 46 - arrow-filesystem-test (Exit code 0xc0000139)
Errors while running CTest
	 47 - arrow-gcsfs-test (Exit code 0xc0000139)
	 48 - arrow-s3fs-test (Exit code 0xc0000139)
	 49 - arrow-flight-internals-test (Exit code 0xc0000139)
	 50 - arrow-flight-test (Exit code 0xc0000139)
	 51 - arrow-flight-sql-test (Exit code 0xc0000139)
	 52 - arrow-feather-test (Exit code 0xc0000139)
	 53 - arrow-ipc-json-simple-test (Exit code 0xc0000139)
	 54 - arrow-ipc-read-write-test (Exit code 0xc0000139)
	 55 - arrow-ipc-tensor-test (Exit code 0xc0000139)
	 56 - arrow-json-test (Exit code 0xc0000139)
	 57 - parquet-internals-test (Exit code 0xc0000139)
	 58 - parquet-reader-test (Exit code 0xc0000139)
	 59 - parquet-writer-test (Exit code 0xc0000139)
	 60 - parquet-arrow-test (Exit code 0xc0000139)
	 61 - parquet-arrow-internals-test (Exit code 0xc0000139)
	 62 - parquet-encryption-test (Exit code 0xc0000139)
	 63 - parquet-encryption-key-management-test (Exit code 0xc0000139)
	 64 - parquet-file-deserialize-test (Exit code 0xc0000139)
	 65 - parquet-schema-test (Exit code 0xc0000139)
	 66 - gandiva-projector-build-validation-test (Exit code 0xc0000139)
	 67 - gandiva-to-string-test (Exit code 0xc0000139)
Error: Process completed with exit code 8.
{noformat}

The last succeeded job: https://github.com/apache/arrow/actions/runs/3256683017/jobs/5347422431