[jira] [Created] (ARROW-11096) [Rust] Add FFI for [Large]Binary

2020-12-31 Thread Jira
Jorge Leitão created ARROW-11096:


 Summary: [Rust] Add FFI for [Large]Binary
 Key: ARROW-11096
 URL: https://issues.apache.org/jira/browse/ARROW-11096
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11095) [Python] Access pyarrow.RecordBatch column by name

2020-12-31 Thread Will Jones (Jira)
Will Jones created ARROW-11095:
--

 Summary: [Python] Access pyarrow.RecordBatch column by name
 Key: ARROW-11095
 URL: https://issues.apache.org/jira/browse/ARROW-11095
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Will Jones


I propose adding support for selecting a column out of a pyarrow.RecordBatch 
using both __getitem__() and .field(), like we have in pyarrow.Table.

pyarrow.RecordBatch has a pretty similar API to pyarrow.Table (e.g. both have 
filter and take methods and a schema), but I got tripped up on this difference. 
pyarrow.Table supports accessing columns by name using both __getitem__ and 
.field():
{code:python}
import pyarrow as pa

my_array = pa.array(range(10))
table = pa.Table.from_arrays([my_array], names=['my_column'])

# Both of these work on table:
table['my_column']
table.field('my_column')
{code}
Meanwhile, pyarrow.RecordBatch supports neither of those. In fact, I had a 
hard time finding a way to grab a column by name from a RecordBatch without 
first looking up its integer index.





[jira] [Created] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join

2020-12-31 Thread Andy Grove (Jira)
Andy Grove created ARROW-11094:
--

 Summary: [Rust] [DataFusion] Implement Sort-Merge Join
 Key: ARROW-11094
 URL: https://issues.apache.org/jira/browse/ARROW-11094
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 4.0.0


The current hash join works well when one side of the join can be loaded into 
memory but cannot scale beyond the available RAM.

The advantage of implementing a Sort-Merge Join (SMJ) is that we can sort the 
left and right partitions in parallel and then stream both sides of the join by 
merging the sorted partitions, without loading one side into memory. At most, 
we need to hold the batches from both sides that contain the current join key 
values.

https://en.wikipedia.org/wiki/Sort-merge_join
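
As a rough illustration of the merge step described above, here is a minimal sketch in plain Python. It is not DataFusion code: the function name, the row layout (tuples keyed on their first element), and the in-memory `sorted()` call standing in for the parallel partition sort are all assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

_DONE = object()  # sentinel marking an exhausted input


def sort_merge_join(left, right, key=itemgetter(0)):
    """Inner-join two iterables of rows on key(row), yielding (left_row, right_row)."""
    # Sort each side, then group consecutive rows that share a key.
    # (A real engine would sort partitions in parallel, spilling if needed.)
    left_groups = groupby(sorted(left, key=key), key=key)
    right_groups = groupby(sorted(right, key=key), key=key)

    l = next(left_groups, _DONE)
    r = next(right_groups, _DONE)
    while l is not _DONE and r is not _DONE:
        lk, lrows = l
        rk, rrows = r
        if lk < rk:
            l = next(left_groups, _DONE)   # left key has no match yet, advance left
        elif lk > rk:
            r = next(right_groups, _DONE)  # right key has no match yet, advance right
        else:
            # Only the right-side rows for the current key are buffered.
            right_buf = list(rrows)
            for lrow in lrows:
                for rrow in right_buf:
                    yield lrow, rrow
            l = next(left_groups, _DONE)
            r = next(right_groups, _DONE)


left = [(2, 'b'), (1, 'a'), (2, 'c'), (4, 'x')]
right = [(2, 'X'), (3, 'Y'), (2, 'Z'), (1, 'W')]
joined = list(sort_merge_join(left, right))
```

Once both inputs are sorted, the merge loop only ever holds the rows for the current key, which is the property that lets the join scale beyond available RAM.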





[jira] [Created] (ARROW-11093) [Rust] [DataFusion] RFC Roadmap for 2021

2020-12-31 Thread Andy Grove (Jira)
Andy Grove created ARROW-11093:
--

 Summary: [Rust] [DataFusion] RFC Roadmap for 2021
 Key: ARROW-11093
 URL: https://issues.apache.org/jira/browse/ARROW-11093
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove


Given the momentum and number of contributors involved in the Rust 
implementation, I think it would be useful to crowdsource a roadmap for the 
releases that we expect to ship in 2021.

We have a small number of active committers on the project currently and it is 
hard for us to keep up with all the PRs sometimes, especially when so many 
different areas are being contributed to.

It would be helpful if we can co-ordinate to prioritize work for the release.

Of course, this is open source, and anyone can contribute anything at any time, 
but it would be nice to have some areas that we all agree are the main 
priorities.

I will create a PR to kick-start this discussion.





[jira] [Created] (ARROW-11092) [CI] (Temporarily) move offending workflows to separate files

2020-12-31 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11092:
---

 Summary: [CI] (Temporarily) move offending workflows to separate 
files
 Key: ARROW-11092
 URL: https://issues.apache.org/jira/browse/ARROW-11092
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 3.0.0


Without warning, INFRA broke several of our GitHub Actions workflows, and have 
been unresponsive all week. See 
https://issues.apache.org/jira/browse/INFRA-21239. Since then, the Rust 
developers have removed their offending actions, so those are no longer 
blocked. This PR does harm reduction for C++ and R workflows, moving the 
workflows that INFRA doesn't like to their own files (temporarily, I hope, 
while this business gets sorted out). This enables the other workflows in each 
file to run, so we at least get some C++ and R tests running, and we can still 
verify on our personal forks the workflows that have been blocked on 
apache/arrow.





[jira] [Created] (ARROW-11091) [Rust][DataFusion] Fix clippy warning in rust 1.49

2020-12-31 Thread Jira
Daniël Heres created ARROW-11091:


 Summary: [Rust][DataFusion] Fix clippy warning in rust 1.49
 Key: ARROW-11091
 URL: https://issues.apache.org/jira/browse/ARROW-11091
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres








[jira] [Created] (ARROW-11090) [R] Support date + datetime arithmetic

2020-12-31 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-11090:
--

 Summary: [R] Support date + datetime arithmetic
 Key: ARROW-11090
 URL: https://issues.apache.org/jira/browse/ARROW-11090
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Jonathan Keane


[It appears that only subtraction of two datetimes is currently 
supported|https://github.com/apache/arrow/commit/dd94a5809b56b32fe2fb538f688bf568d9642e3b].
When more operations are supported, we should add support for them in R.





[jira] [Created] (ARROW-11089) [C++][Gandiva] Support list datatype for gandiva UDF

2020-12-31 Thread Jiangtao Peng (Jira)
Jiangtao Peng created ARROW-11089:
-

 Summary: [C++][Gandiva] Support list datatype for gandiva UDF 
 Key: ARROW-11089
 URL: https://issues.apache.org/jira/browse/ARROW-11089
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Jiangtao Peng


We would like to add support for the Arrow list type in Gandiva expression 
inputs and outputs.





[jira] [Created] (ARROW-11088) [Rust][DataFusion] Calculate column indices upfront in hash join

2020-12-31 Thread Jira
Daniël Heres created ARROW-11088:


 Summary: [Rust][DataFusion] Calculate column indices upfront in 
hash join
 Key: ARROW-11088
 URL: https://issues.apache.org/jira/browse/ARROW-11088
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres








[jira] [Created] (ARROW-11087) [Rust] SIMD aggregate kernel produces flawed results.

2020-12-31 Thread Ritchie (Jira)
Ritchie created ARROW-11087:
---

 Summary: [Rust] SIMD aggregate kernel produces flawed results.
 Key: ARROW-11087
 URL: https://issues.apache.org/jira/browse/ARROW-11087
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Ritchie


I don't know if this is still the case on master, but the Arrow 2.0 SIMD sum 
gives me flawed results when compiled with SIMD enabled.

When SIMD is toggled off I get correct results.

When I have more time I can provide a reproducible example if requested. Below 
is a dataset (as a NumPy array) on which the results differ.

The output of *np.nansum* is 39. The output of the SIMD kernel is 37.
  
{code:python}
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., nan, 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., -1., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -1., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2., 
1., 0., 0., 0., 0., 0., 0., 0., -2., 0., 0., 0., 0., 0., -1., 0., 0., 0., 0., 
0., 0., 1., 1., -1., 0., 0., 0., 1., 2., 0., 0., 0., 0., 0., 0., 1., 1., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 
2., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2., 
6., 0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 0., 1., 3., 0., 2., 0., 0., 1., 4., 
2., 0., 0., 0., 0., nan, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0.])
{code}
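
A cross-check of the kind described above could be sketched in plain Python. This is not Arrow's kernel: the lane width of 8, the function names, and the treatment of NaN as a skipped value (mirroring np.nansum) are all assumptions made for illustration. The point is only that a lane-wise sum and a scalar sum should agree on the same input, including the remainder elements that do not fill a whole chunk.

```python
import math

LANES = 8  # hypothetical SIMD width, chosen for illustration


def scalar_nansum(values):
    """Reference result: add every value that is not NaN."""
    return sum(v for v in values if not math.isnan(v))


def lanewise_nansum(values, lanes=LANES):
    """Sum in SIMD style: one accumulator per lane, then a scalar remainder."""
    acc = [0.0] * lanes
    full = len(values) - len(values) % lanes  # length covered by whole chunks
    for start in range(0, full, lanes):
        for lane in range(lanes):
            v = values[start + lane]
            if not math.isnan(v):
                acc[lane] += v
    total = sum(acc)  # horizontal reduction of the lane accumulators
    for v in values[full:]:  # leftover elements that do not fill a chunk
        if not math.isnan(v):
            total += v
    return total


nan = float('nan')
data = [1.0, nan, 2.0, -1.0, 0.0, 3.0, 1.0, 0.0, 6.0, nan, 4.0]
reference = scalar_nansum(data)
lanewise = lanewise_nansum(data)
```

Running both functions over the array from this report would show which of the two totals a correct lane-wise sum should reproduce.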





[jira] [Created] (ARROW-11086) [Rust] Extend take to support more index types

2020-12-31 Thread Jira
Daniël Heres created ARROW-11086:


 Summary: [Rust] Extend take to support more index types
 Key: ARROW-11086
 URL: https://issues.apache.org/jira/browse/ARROW-11086
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Daniël Heres
Assignee: Daniël Heres








[jira] [Created] (ARROW-11085) [Rust] Migrated CI away from action-rs/*

2020-12-31 Thread Jira
Jorge Leitão created ARROW-11085:


 Summary: [Rust] Migrated CI away from action-rs/*
 Key: ARROW-11085
 URL: https://issues.apache.org/jira/browse/ARROW-11085
 Project: Apache Arrow
  Issue Type: Task
Reporter: Jorge Leitão
Assignee: Jorge Leitão


The INFRA team deactivated GitHub Actions for action-rs, which caused all of 
our CI to stop working.

Since our dependency on it is really small, I propose that we just migrate our 
builds to not use it.


