[jira] [Created] (ARROW-10949) [Rust] Avoid clones in getting values of boolean arrays
Jorge Leitão created ARROW-10949: Summary: [Rust] Avoid clones in getting values of boolean arrays Key: ARROW-10949 URL: https://issues.apache.org/jira/browse/ARROW-10949 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Leitão Assignee: Jorge Leitão -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10948) [C++] Always use GTestConfig.cmake
Kouhei Sutou created ARROW-10948: Summary: [C++] Always use GTestConfig.cmake Key: ARROW-10948 URL: https://issues.apache.org/jira/browse/ARROW-10948 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-10947) [Rust][DataFusion] Refactor UTF8 to Date32 for Performance
Mike Seddon created ARROW-10947: Summary: [Rust][DataFusion] Refactor UTF8 to Date32 for Performance Key: ARROW-10947 URL: https://issues.apache.org/jira/browse/ARROW-10947 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon

After benchmarking capability was added to the UTF8 to Date32/Date64 CAST functions, there was an opportunity to improve their performance.
[jira] [Created] (ARROW-10946) [Rust] Make ChunkIter not depend on a buffer
Jorge Leitão created ARROW-10946: Summary: [Rust] Make ChunkIter not depend on a buffer Key: ARROW-10946 URL: https://issues.apache.org/jira/browse/ARROW-10946 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Leitão Assignee: Jorge Leitão
[jira] [Created] (ARROW-10945) [Rust] [DataFusion] Allow User Defined Aggregates to return multiple values / structs
Andrew Lamb created ARROW-10945: Summary: [Rust] [DataFusion] Allow User Defined Aggregates to return multiple values / structs Key: ARROW-10945 URL: https://issues.apache.org/jira/browse/ARROW-10945 Project: Apache Arrow Issue Type: New Feature Reporter: Andrew Lamb

Use case: I want to implement a user defined aggregate function that produces more than one column (logical value).

Specifically, I am trying to implement the InfluxDB 'selector' functions `first`, `last`, `min`, and `max` as DataFusion aggregate functions. I can't use the built-in aggregate functions in DataFusion because selector functions aren't exactly like normal aggregate functions: they return both the actual aggregate value and a timestamp. In addition, `first` and `last` pick a row in the value column based on the value in the timestamp column.

After some investigation, I realize I can't elegantly use the built-in user defined aggregate framework in DataFusion either.

As an example of what is going on here, let's take

```
 value | time
-------+------
     3 | 1000
     2 | 2000
     1 | 3000
```

The result of `last(value)` should be two columns, `1 | 3000`; however, modeling this as a DataFusion aggregate does not seem to be possible at this time. Each aggregate function can return a single columnar value, but we need to return 2 (the `.value` and `.time` fields).

Ideally I was thinking that the UDF could produce a Struct (with named fields `value` and `time`), but the evaluate function ([code](https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/mod.rs#L238)) returns a `ScalarValue`, and at the moment they [don't have support for Structs](https://github.com/apache/arrow/blob/master/rust/datafusion/src/scalar.rs#L44).

I suspect that we would also need to add support in DataFusion for selecting fields from structs.

See additional detail and context on https://github.com/influxdata/influxdb_iox/issues/448#issuecomment-744601824
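To make the selector semantics described above concrete, here is a minimal Python sketch of a `last` selector that returns both the selected value and its timestamp. It is an illustration only and not DataFusion code: the `Selected` type and the `first`/`last` helpers are hypothetical names, and the sketch assumes rows are given as two parallel sequences with the timestamp column deciding which row is picked.

```python
from typing import NamedTuple, Optional, Sequence


class Selected(NamedTuple):
    """Hypothetical two-field result: the struct-like (value, time) pair."""
    value: int
    time: int


def last(values: Sequence[int], times: Sequence[int]) -> Optional[Selected]:
    """Return the value of the row with the largest timestamp, plus that timestamp."""
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    if not times:
        return None  # no rows: nothing to select
    idx = max(range(len(times)), key=times.__getitem__)
    return Selected(values[idx], times[idx])


def first(values: Sequence[int], times: Sequence[int]) -> Optional[Selected]:
    """Return the value of the row with the smallest timestamp, plus that timestamp."""
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    if not times:
        return None
    idx = min(range(len(times)), key=times.__getitem__)
    return Selected(values[idx], times[idx])


# The table from the issue: value column and time column.
result = last([3, 2, 1], [1000, 2000, 3000])
print(result.value, result.time)
```

The point of the sketch is that the result has two named fields, which is what a single scalar return value cannot express.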
[jira] [Created] (ARROW-10944) [Rust] Implement min/max kernels for BooleanArray
Andrew Lamb created ARROW-10944: Summary: [Rust] Implement min/max kernels for BooleanArray Key: ARROW-10944 URL: https://issues.apache.org/jira/browse/ARROW-10944 Project: Apache Arrow Issue Type: New Feature Reporter: Andrew Lamb Assignee: Andrew Lamb

While this operation is of very limited utility, for completeness and uniformity I would like to have a min/max aggregation kernel that works for BooleanArrays. Currently we have ones for primitive value (e.g. numeric) arrays as well as strings, etc.
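As a rough sketch of what such a kernel could compute, the Python below treats `false < true` and skips nulls, returning null when there are no valid values. These semantics are an assumption for illustration and are not stated in the issue; `min_boolean` and `max_boolean` are hypothetical names, not the Arrow API.

```python
from typing import Optional, Sequence


def min_boolean(values: Sequence[Optional[bool]]) -> Optional[bool]:
    """Minimum over non-null booleans, assuming False < True."""
    valid = [v for v in values if v is not None]
    if not valid:
        return None  # empty or all-null input has no minimum
    # The minimum is True only if every valid value is True.
    return all(valid)


def max_boolean(values: Sequence[Optional[bool]]) -> Optional[bool]:
    """Maximum over non-null booleans, assuming False < True."""
    valid = [v for v in values if v is not None]
    if not valid:
        return None
    # The maximum is True as soon as any valid value is True.
    return any(valid)


print(min_boolean([True, None, False]), max_boolean([True, None, False]))
```

Under these assumptions the two aggregations reduce to a logical AND and a logical OR over the valid values.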
[jira] [Created] (ARROW-10943) [Rust] Intermittent build failure in parquet encoding
Andy Grove created ARROW-10943: Summary: [Rust] Intermittent build failure in parquet encoding Key: ARROW-10943 URL: https://issues.apache.org/jira/browse/ARROW-10943 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove

I saw this test failure locally:

{code:java}
encodings::encoding::tests::test_bool stdout
thread 'encodings::encoding::tests::test_bool' panicked at 'Invalid byte when reading bool', parquet/src/util/bit_util.rs:73:18
{code}

I ran "cargo test" again and it passed.
[jira] [Created] (ARROW-10942) [C++] S3FileSystem::Impl::IsEmptyDirectory fails on Amazon S3
Juan Galvez created ARROW-10942: Summary: [C++] S3FileSystem::Impl::IsEmptyDirectory fails on Amazon S3 Key: ARROW-10942 URL: https://issues.apache.org/jira/browse/ARROW-10942 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 2.0.0 Reporter: Juan Galvez

Running S3FileSystem::GetFileInfo() with a path of the form "s3://bucket-name/dir-name", where the bucket is on AWS S3, throws the following error:

"When reading information for key 'dir-name' in bucket 'bucket-name': AWS Error [code 15]: No response body."

I tracked down the issue to the IsEmptyDirectory method, and noticed that removing kSep from this line:

    req.SetKey(ToAwsString(key) + kSep);

fixes the issue. However, I don't know why kSep is needed in the first place, so I'm not sure what a good solution would be. Also, the key variable on entering IsEmptyDirectory is just the name of the directory (it doesn't have separators).
[jira] [Created] (ARROW-10941) [Doc][C++] Document supported Parquet encryption features
Antoine Pitrou created ARROW-10941: Summary: [Doc][C++] Document supported Parquet encryption features Key: ARROW-10941 URL: https://issues.apache.org/jira/browse/ARROW-10941 Project: Apache Arrow Issue Type: Task Components: C++, Documentation Reporter: Antoine Pitrou

In ARROW-10918 we started documenting the Parquet format features supported by parquet-cpp, but I left a TODO for encryption features.
[jira] [Created] (ARROW-10940) [Rust] Extend sort kernel to ListArray
Ruihang Xia created ARROW-10940: Summary: [Rust] Extend sort kernel to ListArray Key: ARROW-10940 URL: https://issues.apache.org/jira/browse/ARROW-10940 Project: Apache Arrow Issue Type: Improvement Reporter: Ruihang Xia
[jira] [Created] (ARROW-10939) [C#][FlightRPC] incompatible with java client for empty record batches
Alexander created ARROW-10939: Summary: [C#][FlightRPC] incompatible with java client for empty record batches Key: ARROW-10939 URL: https://issues.apache.org/jira/browse/ARROW-10939 Project: Apache Arrow Issue Type: Bug Components: C#, FlightRPC Reporter: Alexander

An error has been found when one sends an empty record batch from a C# server and tries to read it with the Java client.

From investigation, the Java client requires the protobuf tags to be sent in the message even when it is empty. The Java code can be seen here: [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/ArrowMessage.java]

The normal behaviour of gRPC is to exclude the entire tag if an object is empty. Example code from the generated C#:

    if (DataBody.Length != 0) {
      output.WriteRawTag(194, 62);
      output.WriteBytes(DataBody);
    }

To make the C# version compatible with the Java client, either a non-empty flight data body must be sent, or at least the tag of the body.
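The effect of the guard in the generated code can be illustrated with a small Python sketch of the protobuf wire format. It does not use the protobuf library or any Flight API; the helper names and the `force` parameter are hypothetical. The two raw tag bytes 194, 62 decode to field number 1000 with wire type 2 (length-delimited), the field that carries `DataBody` in the snippet above.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode the varint at the start of `data`."""
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            break
    return result


def decode_tag(data: bytes) -> Tuple[int, int]:
    """Split a tag into (field number, wire type)."""
    key = decode_varint(data)
    return key >> 3, key & 0x7


def encode_bytes_field(field_number: int, payload: bytes, force: bool = False) -> bytes:
    """Encode a length-delimited field; an empty payload is omitted unless `force` is set."""
    if not payload and not force:
        return b""  # default behaviour: neither tag nor length is written
    tag = encode_varint((field_number << 3) | 2)
    return tag + encode_varint(len(payload)) + payload


print(decode_tag(bytes([194, 62])))
print(list(encode_bytes_field(1000, b"")))
print(list(encode_bytes_field(1000, b"", force=True)))
```

With the default behaviour an empty body contributes no bytes at all, whereas the forced variant still writes the tag followed by a zero length, which is the "at least the tag of the body" option mentioned above.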
[jira] [Created] (ARROW-10938) [Arrow] upgrade dependency "flatbuffers" to 0.8.0
meng qingyou created ARROW-10938: Summary: [Arrow] upgrade dependency "flatbuffers" to 0.8.0 Key: ARROW-10938 URL: https://issues.apache.org/jira/browse/ARROW-10938 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: meng qingyou

[flatbuffers](https://crates.io/crates/flatbuffers) 0.8.0 was released on Dec 10, 2020, with some notable changes:

- verifier
- common rust traits to FlatBufferBuilder
- new VectorIter
- add FlatBufferBuilder::force_defaults API
- Optional Scalars
- up to 2018 edition
- possible performance speedup
- ...

and minor breaking changes to some APIs, for example: remove "get_", return Result.

Let's try this version.
[jira] [Created] (ARROW-10937) ArrowInvalid error on reading partitioned parquet files from S3 (arrow-2.0.0)
Vladimir created ARROW-10937: Summary: ArrowInvalid error on reading partitioned parquet files from S3 (arrow-2.0.0) Key: ARROW-10937 URL: https://issues.apache.org/jira/browse/ARROW-10937 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 2.0.0 Reporter: Vladimir

Hello,

It looks like pyarrow-2.0.0 cannot read partitioned Parquet datasets from S3 buckets:

{code:java}
import numpy as np
import pandas as pd
import s3fs
import pyarrow as pa
import pyarrow.parquet as pq

filesystem = s3fs.S3FileSystem()

# Build a small frame and add the column used for partitioning.
d = pd.date_range('1990-01-01', freq='D', periods=1)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year

table = pa.Table.from_pandas(x, preserve_index=True)
pq.write_to_dataset(table, root_path='s3://bucket/test_pyarrow.parquet', partition_cols=['Year'], filesystem=filesystem)
{code}

Now, reading it via pq.read_table:

{code:java}
pq.read_table('s3://bucket/test_pyarrow.parquet', filesystem=filesystem, use_pandas_metadata=True)
{code}

raises an exception:

{code:java}
ArrowInvalid: GetFileInfo() yielded path 'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet', which is outside base dir 's3://bucket/test_pyarrow.parquet'
{code}

A direct read in pandas:

{code:java}
pd.read_parquet('s3://bucket/test_pyarrow.parquet')
{code}

returns an empty DataFrame.

The issue does not exist in pyarrow-1.0.1.
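The error message suggests a mismatch between a path without a URI scheme and a base directory that still carries the `s3://` prefix. The sketch below only illustrates that mismatch with plain string handling; it is not pyarrow's implementation, and `strip_scheme` and `is_under` are hypothetical helpers. Whether passing a scheme-less path alongside `filesystem=` avoids the error in pyarrow-2.0.0 is an assumption that has not been verified here.

```python
def strip_scheme(uri: str) -> str:
    """Drop a leading 'scheme://' from a URI, if present."""
    _scheme, sep, rest = uri.partition("://")
    return rest if sep else uri


def is_under(path: str, base_dir: str) -> bool:
    """Simple prefix check: is `path` located below `base_dir`?"""
    base = base_dir.rstrip("/") + "/"
    return path.startswith(base)


# Values taken from the exception text above.
base_dir = "s3://bucket/test_pyarrow.parquet"
yielded = "bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet"

print(is_under(yielded, base_dir))                # scheme kept on the base dir
print(is_under(yielded, strip_scheme(base_dir)))  # scheme removed from the base dir
```

With the scheme left on the base directory the prefix check fails, while comparing scheme-less paths succeeds, which is consistent with the "outside base dir" wording of the exception.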