[jira] [Created] (ARROW-18105) Arrow Flight SegFault

2022-10-19 Thread Ziheng Wang (Jira)
Ziheng Wang created ARROW-18105:
---

 Summary: Arrow Flight SegFault
 Key: ARROW-18105
 URL: https://issues.apache.org/jira/browse/ARROW-18105
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC
Affects Versions: 9.0.0
Reporter: Ziheng Wang


A typo in the gRPC endpoint URI results in a segfault. It should probably result in a warning instead.

ziheng@ziheng:~$ python3
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.flight
>>> flight_client = pyarrow.flight.connect("grcp://0.0.0.0:5005")
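
A misspelled scheme like the one above should ideally surface as a Python-level error instead of crashing the interpreter. For comparison, a sketch of the same call with the correct scheme (illustrative only; no server needs to be listening yet, since the underlying gRPC channel connects lazily):

{code}
import pyarrow.flight

# Correctly spelled scheme: a FlightClient object is created without crashing.
# No Flight server has to be running at this point.
client = pyarrow.flight.connect("grpc://0.0.0.0:5005")
{code}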





[jira] [Created] (ARROW-18104) Can't connect to HDFS in tmux?

2022-10-19 Thread Kyoung-Rok Jang (Jira)
Kyoung-Rok Jang created ARROW-18104:
---

 Summary: Can't connect to HDFS in tmux?
 Key: ARROW-18104
 URL: https://issues.apache.org/jira/browse/ARROW-18104
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 9.0.0
Reporter: Kyoung-Rok Jang


Hello. I'm trying to load `.parquet` files from an HDFS path using 
`pyarrow.fs.HadoopFileSystem`. I'm working on my company's cluster, which relies 
on Kerberos. The following code works in the main shell, but strangely it doesn't 
work in a `tmux` session. I run `kinit` before running this code. One more strange 
thing is that I can run HDFS commands in tmux without any problem, e.g. 
`hdfs dfs -ls`. What could be the cause?

 

```python
from pyarrow import fs

hdfs = fs.HadoopFileSystem("default", port=0)
hdfs.create_dir("created_by_pyarrow")  # works in the main shell, doesn't work in tmux
```

The error I see is as follows:

```
2022-10-19 00:06:45,704 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]
hdfsGetPathInfo(created_by_pyarrow): getFileInfo error: KrbApErrException: Fail to create credential. (63) - No service creds
java.io.IOException: DestHost:destPort abc-nn1.bdp.bdata.ai:9020 , LocalHost:localPort asca0x0930.nfra.io/10.168.26.12:0. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    at org.apache.hadoop.ipc.Client.call(Client.java:1367)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1580)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1595)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]
    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:760)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:723)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:817)
    at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:411)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1572)
    at org.apache.hadoop.ipc.Client.call(Client.java:1403)
```

[jira] [Created] (ARROW-18103) [Packaging][deb][RPM] Upload artifacts patterns are wrong

2022-10-19 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18103:


 Summary: [Packaging][deb][RPM] Upload artifacts patterns are wrong
 Key: ARROW-18103
 URL: https://issues.apache.org/jira/browse/ARROW-18103
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 10.0.0


The upload patterns may match multiple artifacts that have the same base name. 
This causes a 422 HTTP response from GitHub, because GitHub's release asset 
upload API returns 422 if an asset with the same base name already exists in the 
release.

https://app.travis-ci.com/github/ursacomputing/crossbow/builds/256830240

{noformat}
curl: (22) The requested URL returned error: 422 
{noformat}
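
A minimal sketch (not the actual crossbow upload code) of the constraint the patterns have to respect: if the selected files contain two artifacts with the same base name, the second upload to the same release is rejected with 422, so duplicates have to be detected (or the patterns tightened) up front. The patterns below are made up for illustration.

{code}
import glob
from collections import defaultdict
from pathlib import PurePath

# Hypothetical patterns; the real ones live in the release upload task.
patterns = ["packages/*/apt/**/*.deb", "packages/*/yum/**/*.rpm"]

by_base_name = defaultdict(list)
for pattern in patterns:
    for path in glob.glob(pattern, recursive=True):
        by_base_name[PurePath(path).name].append(path)

duplicates = {name: paths for name, paths in by_base_name.items() if len(paths) > 1}
if duplicates:
    # Uploading any of these more than once would make GitHub answer with HTTP 422.
    raise SystemExit(f"duplicate release asset base names: {duplicates}")
{code}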





[jira] [Created] (ARROW-18102) dplyr::count and dplyr::tally implementation return NA instead of 0

2022-10-19 Thread Adam Black (Jira)
Adam Black created ARROW-18102:
--

 Summary: dplyr::count and dplyr::tally implementation return NA 
instead of 0
 Key: ARROW-18102
 URL: https://issues.apache.org/jira/browse/ARROW-18102
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
 Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
Reporter: Adam Black


I'm using dplyr with FileSystemDataset objects. I expect the behavior to be 
similar to (or the same as) data frame behavior. When the FileSystemDataset has 
zero rows, dplyr::count and dplyr::tally return NA instead of 0. I would expect 
the result to be 0.

 

``` r
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

path <- tempfile(fileext = ".feather")

zero_row_dataset <- cars %>% filter(dist < 0)

# expected behavior
zero_row_dataset %>% 
  count()
#>   n
#> 1 0

zero_row_dataset %>% 
  tally()
#>   n
#> 1 0

nrow(zero_row_dataset)
#> [1] 0

# now test behavior with a FileSystemDataset
write_feather(zero_row_dataset, path)
ds <- open_dataset(path, format = "feather")
ds
#> FileSystemDataset with 1 Feather file
#> speed: double
#> dist: double
#> 
#> See $metadata for additional Schema metadata

# actual behavior
ds %>% 
  count() %>% 
  collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   
#> 1    NA

ds %>% 
  tally() %>% 
  collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   
#> 1    NA

nrow(ds) # works as expected
#> [1] 0
```

Created on 2022-10-19 with [reprex 
v2.0.2](https://reprex.tidyverse.org)





[jira] [Created] (ARROW-18101) [R] RecordBatchReaderHead from ExecPlan with UDF cannot be read

2022-10-19 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-18101:
---

 Summary: [R] RecordBatchReaderHead from ExecPlan with UDF cannot 
be read
 Key: ARROW-18101
 URL: https://issues.apache.org/jira/browse/ARROW-18101
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Neal Richardson


{code}
register_scalar_function(
  "times_32",
  function(context, x) x * 32.0,
  int32(),
  float64(),
  auto_convert = TRUE
)

record_batch(a = 1:1000) %>%
  dplyr::mutate(b = times_32(a)) %>%
  as_record_batch_reader() %>%
  head(11) %>%
  as_arrow_table()

# Error: NotImplemented: Call to R (resolve scalar user-defined function output data type) from a non-R thread from an unsupported context
# /arrow/cpp/src/arrow/compute/exec.cc:649  kernel_->signature->out_type().Resolve(kernel_ctx_, args.inputs)
# /arrow/cpp/src/arrow/compute/exec/expression.cc:602  executor->Init(_context, {kernel, types, options})
# /arrow/cpp/src/arrow/compute/exec/project_node.cc:91  ExecuteScalarExpression(simplified_expr, target, plan()->exec_context())
# /arrow/cpp/src/arrow/record_batch.cc:336  ReadNext()
# /arrow/cpp/src/arrow/record_batch.cc:350  ToRecordBatches()
{code}

It works fine if you don't call {{as_record_batch_reader()}} in the middle. 
Oddly, it also works fine if you add {{as_adq()}} (aka {{collapse()}}) after 
{{head()}} and before evaluating to a table--that is, if you run it through an 
ExecPlan again, it doesn't error.





[jira] [Created] (ARROW-18100) [C++] Intermittent failure in TestNewScanner.Backpressure

2022-10-19 Thread Weston Pace (Jira)
Weston Pace created ARROW-18100:
---

 Summary: [C++] Intermittent failure in TestNewScanner.Backpressure
 Key: ARROW-18100
 URL: https://issues.apache.org/jira/browse/ARROW-18100
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


For example:

https://github.com/ursacomputing/crossbow/actions/runs/3277989378/jobs/5395881371#step:5:3133





[jira] [Created] (ARROW-18099) Cannot create pandas categorical from table only with nulls

2022-10-19 Thread Damian Barabonkov (Jira)
Damian Barabonkov created ARROW-18099:
-

 Summary: Cannot create pandas categorical from table only with 
nulls
 Key: ARROW-18099
 URL: https://issues.apache.org/jira/browse/ARROW-18099
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 9.0.0
 Environment: OSX 12.6
M1 silicon
Reporter: Damian Barabonkov


A pyarrow Table whose column contains only null values cannot be converted to a 
pandas DataFrame with that column as a category. However, pandas does support 
"empty" categoricals. A simple workaround is therefore to convert the pa.Table 
with the column as object first and then, once in pandas, cast it to a 
categorical, which will be empty. However, that does not solve the pyarrow bug 
at its root.
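
To ground the claim that pandas itself is fine with all-null categoricals, a quick check with plain pandas (no pyarrow involved):

```python
import pandas as pd

# pandas accepts an all-null categorical; the set of categories is simply empty.
s = pd.Series([None, None], dtype="category")
print(s.cat.categories)  # Index([], dtype='object')
print(s.isna().all())    # True
```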

 

Sample reproducible example
```python
import pyarrow as pa

pylist = [{'x': None, '__index_level_0__': 2}, {'x': None, '__index_level_0__': 3}]
tbl = pa.Table.from_pylist(pylist)

# Errors
df_broken = tbl.to_pandas(categories=["x"])

# Works
df_works = tbl.to_pandas()
df_works = df_works.astype({"x": "category"})
```





[jira] [Created] (ARROW-18098) [C++] Vector kernel for "intersecting" two arrays (all common elements)

2022-10-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18098:
-

 Summary: [C++] Vector kernel for "intersecting" two arrays (all 
common elements)
 Key: ARROW-18098
 URL: https://issues.apache.org/jira/browse/ARROW-18098
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Joris Van den Bossche


This would be similar to numpy's {{intersect1d}} 
(https://numpy.org/doc/stable/reference/generated/numpy.intersect1d.html)
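
Until such a kernel exists, something close can be emulated with existing compute functions. A rough sketch (illustrative only, not a proposed API; unlike numpy's intersect1d it does not sort the result):

{code}
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, 2, 3, 4])
b = pa.array([3, 4, 5])

# Keep the distinct elements of `a` that also occur in `b`.
mask = pc.is_in(a, value_set=b)
intersection = pc.unique(pc.filter(a, mask))
print(intersection)  # -> [3, 4]
{code}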





[jira] [Created] (ARROW-18097) [C++] Add a "list_contains" kernel

2022-10-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18097:
-

 Summary: [C++] Add a "list_contains" kernel
 Key: ARROW-18097
 URL: https://issues.apache.org/jira/browse/ARROW-18097
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Joris Van den Bossche


Assume you have a list array:

{code}
arr = pa.array([["a", "b"], ["a", "c"], ["b", "c", "d"]])
{code}

And you want to know for each list if it contains a certain value (of the same 
type as the list's values). A "list_contains" function (or other name) would be 
useful for that:

{code}
pc.list_contains(arr, "a")
# -> True, True, False
{code}

The current workaround that I found was flattening, checking equality, and then 
reducing again with groupby, but this is quite tedious:

{code}
>>> temp = pa.table({'index': pc.list_parent_indices(arr), 'contains_value': pc.equal(pc.list_flatten(arr), "a")})
>>> temp.group_by('index').aggregate([('contains_value', 'any')])['contains_value_any'].chunk(0)

[
  true,
  true,
  false
]
{code}

But this also only works if there are no empty or missing list values.
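
To make that limitation concrete (a hypothetical illustration): an empty or null list contributes no parent indices, so the grouped result silently drops that row.

{code}
import pyarrow as pa
import pyarrow.compute as pc

arr2 = pa.array([[], ["a", "b"], ["b", "c"]])
# The empty list at index 0 produces no parent indices, so a group-by over
# these indices yields two groups instead of three.
print(pc.list_parent_indices(arr2))  # -> [1, 1, 2, 2]
{code}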





[jira] [Created] (ARROW-18096) [Dev] Remove github user names from merge commit message

2022-10-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18096:
-

 Summary: [Dev] Remove github user names from merge commit message
 Key: ARROW-18096
 URL: https://issues.apache.org/jira/browse/ARROW-18096
 Project: Apache Arrow
  Issue Type: Task
  Components: Developer Tools
Reporter: Joris Van den Bossche


We currently use the body of the opening comment of a GitHub PR as the body of 
the merge commit message. It is not uncommon to tag someone when opening a PR, 
but retaining those GitHub usernames in the commit message is annoying, as it 
can generate additional notifications for the people who were tagged.

It should be straightforward to remove the GitHub user names from the message 
body (for example, just strip the @ so it no longer works as a user name link).
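
A minimal sketch of the idea (not the actual merge-script code): strip the leading @ from anything that looks like a GitHub handle so it no longer triggers a mention. The sample text and handle are made up.

{code}
import re

def neutralize_mentions(body: str) -> str:
    # GitHub handles are alphanumeric plus hyphens; dropping the "@" keeps the
    # name readable but stops GitHub from linking it and notifying the user.
    return re.sub(r"@([A-Za-z0-9][A-Za-z0-9-]*)", r"\1", body)

print(neutralize_mentions("Thanks @some-reviewer for the feedback!"))
# -> "Thanks some-reviewer for the feedback!"
{code}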





[jira] [Created] (ARROW-18095) [CI][C++][MinGW] All tests exited with 0xc0000139

2022-10-19 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18095:


 Summary: [CI][C++][MinGW] All tests exited with 0xc0000139
 Key: ARROW-18095
 URL: https://issues.apache.org/jira/browse/ARROW-18095
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


https://github.com/apache/arrow/actions/runs/3261682270/jobs/5357126875

{noformat}
+ ctest --label-regex unittest --output-on-failure --parallel 2 --timeout 300 
--exclude-regex 
'gandiva-internals-test|gandiva-projector-test|gandiva-utf8-test|gandiva-binary-test|gandiva-boolean-expr-test|gandiva-date-time-test|gandiva-decimal-single-test|gandiva-decimal-test|gandiva-filter-project-test|gandiva-filter-test|gandiva-hash-test|gandiva-if-expr-test|gandiva-in-expr-test|gandiva-literal-test|gandiva-null-validity-test|gandiva-precompiled-test|gandiva-projector-test'
Test project D:/a/arrow/arrow/build/cpp
  Start  1: arrow-array-test
  Start  2: arrow-buffer-test
 1/67 Test  #2: arrow-buffer-test .Exit code 0xc0000139
***Exception:   0.15 sec

  Start  3: arrow-extension-type-test
 2/67 Test  #1: arrow-array-test ..Exit code 0xc0000139
***Exception:   0.17 sec

  Start  4: arrow-misc-test
 3/67 Test  #3: arrow-extension-type-test .Exit code 0xc0000139
***Exception:   0.04 sec
 39 - arrow-dataset-discovery-test (Exit code 0xc0000139)
 40 - arrow-dataset-file-ipc-test (Exit code 0xc0000139)
 41 - arrow-dataset-file-test (Exit code 0xc0000139)
 42 - arrow-dataset-partition-test (Exit code 0xc0000139)
 43 - arrow-dataset-scanner-test (Exit code 0xc0000139)
 44 - arrow-dataset-file-csv-test (Exit code 0xc0000139)
 45 - arrow-dataset-file-parquet-test (Exit code 0xc0000139)
 46 - arrow-filesystem-test (Exit code 0xc0000139)
Errors while running CTest
 47 - arrow-gcsfs-test (Exit code 0xc0000139)
 48 - arrow-s3fs-test (Exit code 0xc0000139)
 49 - arrow-flight-internals-test (Exit code 0xc0000139)
 50 - arrow-flight-test (Exit code 0xc0000139)
 51 - arrow-flight-sql-test (Exit code 0xc0000139)
 52 - arrow-feather-test (Exit code 0xc0000139)
 53 - arrow-ipc-json-simple-test (Exit code 0xc0000139)
 54 - arrow-ipc-read-write-test (Exit code 0xc0000139)
 55 - arrow-ipc-tensor-test (Exit code 0xc0000139)
 56 - arrow-json-test (Exit code 0xc0000139)
 57 - parquet-internals-test (Exit code 0xc0000139)
 58 - parquet-reader-test (Exit code 0xc0000139)
 59 - parquet-writer-test (Exit code 0xc0000139)
 60 - parquet-arrow-test (Exit code 0xc0000139)
 61 - parquet-arrow-internals-test (Exit code 0xc0000139)
 62 - parquet-encryption-test (Exit code 0xc0000139)
 63 - parquet-encryption-key-management-test (Exit code 0xc0000139)
 64 - parquet-file-deserialize-test (Exit code 0xc0000139)
 65 - parquet-schema-test (Exit code 0xc0000139)
 66 - gandiva-projector-build-validation-test (Exit code 0xc0000139)
 67 - gandiva-to-string-test (Exit code 0xc0000139)
Error: Process completed with exit code 8.
{noformat}

The last successful job: 
https://github.com/apache/arrow/actions/runs/3256683017/jobs/5347422431


