[jira] [Created] (ARROW-10949) [Rust] Avoid clones in getting values of boolean arrays

2020-12-16 Thread Jorge Leitão (Jira)
Jorge Leitão created ARROW-10949:


 Summary: [Rust] Avoid clones in getting values of boolean arrays
 Key: ARROW-10949
 URL: https://issues.apache.org/jira/browse/ARROW-10949
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10948) [C++] Always use GTestConfig.cmake

2020-12-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10948:


 Summary: [C++] Always use GTestConfig.cmake
 Key: ARROW-10948
 URL: https://issues.apache.org/jira/browse/ARROW-10948
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-10947) [Rust][DataFusion] Refactor UTF8 to Date32 for Performance

2020-12-16 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10947:
---

 Summary: [Rust][DataFusion] Refactor UTF8 to Date32 for Performance
 Key: ARROW-10947
 URL: https://issues.apache.org/jira/browse/ARROW-10947
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


After adding benchmarking capability to the UTF8 to Date32/Date64 CAST 
functions, an opportunity to improve their performance was identified.
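For context on what this cast has to compute, here is a self-contained sketch (plain Rust, not the actual DataFusion code) of parsing a `YYYY-MM-DD` UTF8 value into a Date32-style day offset from the Unix epoch, using the standard days-from-civil formula:

```rust
// Sketch: convert a "YYYY-MM-DD" string to days since 1970-01-01, the
// representation behind Date32. Error handling is minimal for brevity.
fn days_from_civil(y: i32, m: u32, d: u32) -> i32 {
    // Shift the year so the "year" starts in March; leap day lands last.
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = (y - era * 400) as i64; // year of era, [0, 399]
    let mp = ((m + 9) % 12) as i64; // Mar=0 .. Feb=11
    let doy = (153 * mp + 2) / 5 + d as i64 - 1; // day of year, [0, 365]
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    (era as i64 * 146097 + doe - 719468) as i32
}

fn utf8_to_date32(s: &str) -> Option<i32> {
    let mut parts = s.splitn(3, '-');
    let y: i32 = parts.next()?.parse().ok()?;
    let m: u32 = parts.next()?.parse().ok()?;
    let d: u32 = parts.next()?.parse().ok()?;
    Some(days_from_civil(y, m, d))
}

fn main() {
    assert_eq!(utf8_to_date32("1970-01-01"), Some(0));
    assert_eq!(utf8_to_date32("2000-01-01"), Some(10957));
    println!("ok");
}
```

A benchmark for the real kernel would compare a hand-rolled parser like this against a general-purpose date parsing routine.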





[jira] [Created] (ARROW-10946) [Rust] Make ChunkIter not depend on a buffer

2020-12-16 Thread Jorge Leitão (Jira)
Jorge Leitão created ARROW-10946:


 Summary: [Rust] Make ChunkIter not depend on a buffer
 Key: ARROW-10946
 URL: https://issues.apache.org/jira/browse/ARROW-10946
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão








[jira] [Created] (ARROW-10945) [Rust] [DataFusion] Allow User Defined Aggregates to return multiple values / structs

2020-12-16 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10945:
---

 Summary: [Rust] [DataFusion] Allow User Defined Aggregates to 
return multiple values / structs
 Key: ARROW-10945
 URL: https://issues.apache.org/jira/browse/ARROW-10945
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Andrew Lamb



Use case:
I want to implement a user-defined aggregate function that produces more than 
one column (logical values).

Specifically I am trying to implement the InfluxDB 'selector' functions 
`first`, `last`, `min`, and `max` as DataFusion aggregate functions.

I can't use the built-in aggregate functions in DataFusion, as selector 
functions aren't exactly like normal aggregate functions -- they return both 
the actual aggregate value as well as a timestamp. In addition, `first` and 
`last` pick a row in the value column based on the value in the timestamp 
column.

After some investigation, I realized I can't elegantly use the built-in 
user-defined aggregate framework in DataFusion either. As an example of what 
is going on here, let's take:

```
value | time
--+--
  3   | 1000
  2   | 2000
  1   | 3000
```

The result of `last(value)` should be two columns, `1 | 3000` -- however, 
modeling this as a DataFusion aggregate does not seem to be possible at this 
time. Each aggregate function can return a single columnar value, but we need 
to return two (the `.value` and `.time` fields).

Ideally I was thinking that the UDF could produce a Struct (with named fields 
`value` and `time`), but the [evaluate 
function](https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/mod.rs#L238) 
returns a `ScalarValue`, and at the moment ScalarValues [don't have support for 
Structs](https://github.com/apache/arrow/blob/master/rust/datafusion/src/scalar.rs#L44).

I suspect that we would also need to add support in DataFusion for selecting 
fields from structs.

See additional detail and context on 
https://github.com/influxdata/influxdb_iox/issues/448#issuecomment-744601824
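To make the desired selector semantics concrete, here is a minimal sketch in plain Rust (illustrative names, not the DataFusion API): `last` picks the row with the greatest timestamp and must return BOTH the value and the time of that row, i.e. two logical columns.

```rust
// Sketch of the InfluxDB-style `last` selector over the example table
// above: choose the row with the maximum timestamp, and return both the
// value and the timestamp of that row.
fn last(values: &[i64], times: &[i64]) -> Option<(i64, i64)> {
    values
        .iter()
        .zip(times.iter())
        .max_by_key(|(_, t)| **t)
        .map(|(v, t)| (*v, *t))
}

fn main() {
    let values = [3, 2, 1];
    let times = [1000, 2000, 3000];
    // Two logical outputs for a single aggregate -- the part that a
    // single ScalarValue cannot currently express.
    assert_eq!(last(&values, &times), Some((1, 3000)));
}
```

A `ScalarValue::Struct` variant would let this pair be returned as one value, at the cost of then needing field selection on structs as noted above.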






[jira] [Created] (ARROW-10944) [Rust] Implement min/max kernels for BooleanArray

2020-12-16 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10944:
---

 Summary: [Rust] Implement min/max kernels for BooleanArray
 Key: ARROW-10944
 URL: https://issues.apache.org/jira/browse/ARROW-10944
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Andrew Lamb
Assignee: Andrew Lamb


While this operation is of very limited utility, for completeness and uniformity 
I would like to have min/max aggregation kernels that work for BooleanArrays. 
Currently we have them for primitive (e.g. numeric) arrays as well as 
strings, etc.
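A minimal sketch of the intended semantics in plain Rust (over `Option<bool>` slices rather than the actual BooleanArray type): with `false < true` and nulls skipped, min is `false` if any non-null `false` is present, max is `true` if any non-null `true` is present, and all-null input yields `None`.

```rust
// Sketch of min/max semantics for a nullable boolean column, mirroring
// the behavior of the existing numeric min/max kernels. Not the actual
// Arrow API -- just the ordering and null-skipping logic.
fn min_boolean(values: &[Option<bool>]) -> Option<bool> {
    values.iter().flatten().copied().min()
}

fn max_boolean(values: &[Option<bool>]) -> Option<bool> {
    values.iter().flatten().copied().max()
}

fn main() {
    let col = [Some(true), None, Some(false), Some(true)];
    assert_eq!(min_boolean(&col), Some(false));
    assert_eq!(max_boolean(&col), Some(true));
    // All-null (or empty) input has no min/max.
    assert_eq!(min_boolean(&[None, None]), None);
}
```

A real kernel would of course operate on the packed validity and value bitmaps rather than on options.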






[jira] [Created] (ARROW-10943) [Rust] Intermittent build failure in parquet encoding

2020-12-16 Thread Andy Grove (Jira)
Andy Grove created ARROW-10943:
--

 Summary: [Rust] Intermittent build failure in parquet encoding
 Key: ARROW-10943
 URL: https://issues.apache.org/jira/browse/ARROW-10943
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove


I saw this test failure locally:
{code:java}
 encodings::encoding::tests::test_bool stdout 
thread 'encodings::encoding::tests::test_bool' panicked at 'Invalid byte when 
reading bool', parquet/src/util/bit_util.rs:73:18
 {code}
I ran "cargo test" again and it passed.





[jira] [Created] (ARROW-10942) [C++] S3FileSystem::Impl::IsEmptyDirectory fails on Amazon S3

2020-12-16 Thread Juan Galvez (Jira)
Juan Galvez created ARROW-10942:
---

 Summary: [C++] S3FileSystem::Impl::IsEmptyDirectory fails on 
Amazon S3
 Key: ARROW-10942
 URL: https://issues.apache.org/jira/browse/ARROW-10942
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 2.0.0
Reporter: Juan Galvez


Running S3FileSystem::GetFileInfo() where the path is in the form 
"s3://bucket-name/dir-name" and this is a bucket on AWS S3 throws the 
following error:

"When reading information for key 'dir-name' in bucket 'bucket-name': AWS Error 
[code 15]: No response body."

I tracked down the issue to the IsEmptyDirectory method, and noticed that 
removing kSep from this line:
{code:java}
req.SetKey(ToAwsString(key) + kSep);
{code}
fixes the issue.

However, I don't know why kSep is needed in the first place, so I'm not sure 
what a good solution would be.
Also, the key variable on entering IsEmptyDirectory is just the name of the 
directory (it doesn't contain separators).





[jira] [Created] (ARROW-10941) [Doc][C++] Document supported Parquet encryption features

2020-12-16 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10941:
--

 Summary: [Doc][C++] Document supported Parquet encryption features
 Key: ARROW-10941
 URL: https://issues.apache.org/jira/browse/ARROW-10941
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Documentation
Reporter: Antoine Pitrou


In ARROW-10918 we started documenting the Parquet format features supported by 
parquet-cpp, but I left a TODO for encryption features.





[jira] [Created] (ARROW-10940) [Rust] Extend sort kernel to ListArray

2020-12-16 Thread Ruihang Xia (Jira)
Ruihang Xia created ARROW-10940:
---

 Summary: [Rust] Extend sort kernel to ListArray
 Key: ARROW-10940
 URL: https://issues.apache.org/jira/browse/ARROW-10940
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ruihang Xia








[jira] [Created] (ARROW-10939) [C#][FlightRPC] incompatible with java client for empty record batches

2020-12-16 Thread Alexander (Jira)
Alexander created ARROW-10939:
-

 Summary: [C#][FlightRPC] incompatible with java client for empty 
record batches
 Key: ARROW-10939
 URL: https://issues.apache.org/jira/browse/ARROW-10939
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#, FlightRPC
Reporter: Alexander


An error has been found when one sends an empty record batch from a C# server 
and tries to read it with the Java client.

From investigation, the Java client requires the protobuf tags to be sent in 
the message even though it is empty. The Java code can be seen here:

[https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/ArrowMessage.java]

Normal functionality of gRPC is to exclude the entire tag if an object is 
empty; example code from the generated C#:
{code:java}
if (DataBody.Length != 0) {
  output.WriteRawTag(194, 62);
  output.WriteBytes(DataBody);
}
{code}

To make the C# version compatible with the Java client, either a non-empty 
flight data body must be sent, or at least the tag of the body.
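As an aside on the magic bytes above: the two bytes in WriteRawTag(194, 62) are just the varint encoding of the protobuf field key, (field_number << 3) | wire_type. A small self-contained sketch (plain Rust, illustrative names) shows that they decode to field number 1000 with wire type 2 (length-delimited), which I believe is the data_body field of FlightData in Flight.proto:

```rust
// Encode a protobuf field key (field_number << 3 | wire_type) as a
// little-endian base-128 varint -- standard protobuf wire-format math.
fn encode_key(field_number: u32, wire_type: u32) -> Vec<u8> {
    let mut key = (field_number << 3) | wire_type;
    let mut out = Vec::new();
    loop {
        let byte = (key & 0x7F) as u8;
        key >>= 7;
        if key == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // set the continuation bit
    }
    out
}

fn main() {
    // Field 1000, wire type 2 (length-delimited) => the raw tag bytes
    // written by the generated C# code above.
    assert_eq!(encode_key(1000, 2), vec![194, 62]);
}
```

So "sending at least the tag of the body" means emitting these two bytes (plus a zero length) even when the body is empty.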





[jira] [Created] (ARROW-10938) [Arrow] upgrade dependency "flatbuffers" to 0.8.0

2020-12-16 Thread meng qingyou (Jira)
meng qingyou created ARROW-10938:


 Summary: [Arrow] upgrade dependency "flatbuffers" to 0.8.0
 Key: ARROW-10938
 URL: https://issues.apache.org/jira/browse/ARROW-10938
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: meng qingyou


[flatbuffers](https://crates.io/crates/flatbuffers) 0.8.0 was released on Dec 
10, 2020, with some notable changes:

* verifier
* common Rust traits for FlatBufferBuilder
* new VectorIter
* added FlatBufferBuilder::force_defaults API
* Optional Scalars
* upgrade to the 2018 edition
* possible performance speedup

... and minor breaking changes to some APIs, for example removing "get_" 
prefixes and returning Result.

Let's try this version.





[jira] [Created] (ARROW-10937) ArrowInvalid error on reading partitioned parquet files from S3 (arrow-2.0.0)

2020-12-16 Thread Vladimir (Jira)
Vladimir created ARROW-10937:


 Summary: ArrowInvalid error on reading partitioned parquet files 
from S3 (arrow-2.0.0)
 Key: ARROW-10937
 URL: https://issues.apache.org/jira/browse/ARROW-10937
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
Reporter: Vladimir


Hello,

It looks like pyarrow-2.0.0 cannot read partitioned parquet datasets from S3 
buckets:
{code:java}
import numpy as np
import pandas as pd
import s3fs
import pyarrow as pa
import pyarrow.parquet as pq

filesystem = s3fs.S3FileSystem()

d = pd.date_range('1990-01-01', freq='D', periods=1)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year

table = pa.Table.from_pandas(x, preserve_index=True)
pq.write_to_dataset(table, root_path='s3://bucket/test_pyarrow.parquet',
                    partition_cols=['Year'], filesystem=filesystem)
{code}
 

Now, reading it via pq.read_table:
{code:java}
pq.read_table('s3://bucket/test_pyarrow.parquet', filesystem=filesystem, 
use_pandas_metadata=True)
{code}
raises the following exception:
{code:java}
ArrowInvalid: GetFileInfo() yielded path 
'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet',
 which is outside base dir 's3://bucket/test_pyarrow.parquet'
{code}
A direct read in pandas:
{code:java}
pd.read_parquet('s3://bucket/test_pyarrow.parquet')
{code}
returns an empty DataFrame.

The issue does not exist in pyarrow-1.0.1.


