
[jira] [Created] (ARROW-10786) [Packaging][RPM] Drop support for CentOS 6

2020-12-01 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10786:


 Summary: [Packaging][RPM] Drop support for CentOS 6
 Key: ARROW-10786
 URL: https://issues.apache.org/jira/browse/ARROW-10786
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


CentOS 6 reached EOL on 2020-11-30, so we should drop support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10785) Further optimize take string

2020-12-01 Thread Jira
Daniël Heres created ARROW-10785:


 Summary: Further optimize take string
 Key: ARROW-10785
 URL: https://issues.apache.org/jira/browse/ARROW-10785
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Daniël Heres
Assignee: Daniël Heres








[jira] [Created] (ARROW-10784) [Python] Loading pyarrow.compute isn't thread safe

2020-12-01 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-10784:

 Summary: [Python] Loading pyarrow.compute isn't thread safe
 Key: ARROW-10784
 URL: https://issues.apache.org/jira/browse/ARROW-10784
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
Reporter: Micah Kornfield


When using Arrow in a multithreaded environment, it is possible to trigger an 
initialization race on the pyarrow.compute module when calling Array.flatten.

Flatten calls _pc(), which imports pyarrow.compute. If two threads call 
flatten at the same time, it is possible that the global initialization of 
functions from the registry will be incomplete, which causes an 
AttributeError.





[jira] [Created] (ARROW-10783) [Rust] [DataFusion] Implement row count statistics for Parquet TableProvider

2020-12-01 Thread Andy Grove (Jira)
Andy Grove created ARROW-10783:

 Summary: [Rust] [DataFusion] Implement row count statistics for 
Parquet TableProvider
 Key: ARROW-10783
 URL: https://issues.apache.org/jira/browse/ARROW-10783
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


Following on from https://issues.apache.org/jira/browse/ARROW-10781 we should 
implement statistics for Parquet data sources.





[jira] [Created] (ARROW-10782) [Rust] [DataFusion] Optimize hash join to use smaller relation as build side

2020-12-01 Thread Andy Grove (Jira)
Andy Grove created ARROW-10782:

 Summary: [Rust] [DataFusion] Optimize hash join to use smaller 
relation as build side
 Key: ARROW-10782
 URL: https://issues.apache.org/jira/browse/ARROW-10782
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


When performing an inner join using the hash join algorithm, it is more 
efficient to load the smaller table into memory and then stream the larger 
table.

We should use the statistics made available in 
https://issues.apache.org/jira/browse/ARROW-10781 to build an optimizer rule 
that determines the smaller side of a join and uses it as the build/hash side.
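
As a rough illustration of the rule described above, here is a minimal sketch 
in Rust. It is not DataFusion code: the Statistics struct mirrors the proposal 
in ARROW-10781 with an assumed usize row count, and BuildSide and 
choose_build_side are hypothetical names introduced only for this example. 
When either row count is unknown, the sketch keeps the left side as the build 
side, which is an assumption rather than a documented default.

```rust
/// Stand-in for the statistics proposed in ARROW-10781 (row count type assumed).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Statistics {
    row_count: Option<usize>,
}

/// Hypothetical marker for which input is loaded into the hash table.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BuildSide {
    Left,
    Right,
}

/// Pick the smaller relation as the build side.
/// Falls back to `Left` when either row count is unknown or the sizes are equal.
fn choose_build_side(left: Option<Statistics>, right: Option<Statistics>) -> BuildSide {
    let left_rows = left.and_then(|s| s.row_count);
    let right_rows = right.and_then(|s| s.row_count);
    match (left_rows, right_rows) {
        (Some(l), Some(r)) if r < l => BuildSide::Right,
        _ => BuildSide::Left,
    }
}

fn main() {
    let large = Some(Statistics { row_count: Some(5_000_000) });
    let small = Some(Statistics { row_count: Some(1_000) });
    // The smaller right-hand table is chosen as the build side.
    println!("{:?}", choose_build_side(large, small));
}
```

An actual optimizer rule would then swap the join inputs (or flag the build 
side) based on this decision before the physical plan is created.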





[jira] [Created] (ARROW-10781) [Rust] [DataFusion] TableProvider should provide row count statistics

2020-12-01 Thread Andy Grove (Jira)
Andy Grove created ARROW-10781:

 Summary: [Rust] [DataFusion] TableProvider should provide row 
count statistics
 Key: ARROW-10781
 URL: https://issues.apache.org/jira/browse/ARROW-10781
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


In order to start building a cost-based optimizer, we need some statistics 
about data sources. The most basic statistic would be number of rows.

I propose that we add a Statistics struct that initially just makes a total row 
count available but that we can later extend to support more advanced 
statistics.
{code:java}
struct Statistics {
  // Total number of rows, if known (usize is an assumed type)
  row_count: Option<usize>
} {code}
We can then add a method to TableProvider:
{code:java}
trait TableProvider {
  // None when the data source cannot provide statistics
  fn statistics() -> Option<Statistics>;
} {code}
Statistics should be optional because not all data sources can provide 
statistics.
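
To make the proposal concrete, here is a minimal, self-contained sketch of how 
two data sources might implement it. The usize row count and the &self 
receiver are assumptions, and InMemorySource and StreamSource are hypothetical 
types invented for this example; they are not existing DataFusion providers.

```rust
/// Statistics as proposed above; only a total row count for now.
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    row_count: Option<usize>,
}

/// Reduced version of the trait, with an assumed `&self` receiver.
trait TableProvider {
    fn statistics(&self) -> Option<Statistics>;
}

/// Hypothetical source that holds all rows in memory and knows its size.
struct InMemorySource {
    rows: Vec<i64>,
}

impl TableProvider for InMemorySource {
    fn statistics(&self) -> Option<Statistics> {
        Some(Statistics {
            row_count: Some(self.rows.len()),
        })
    }
}

/// Hypothetical unbounded source that cannot report statistics.
struct StreamSource;

impl TableProvider for StreamSource {
    fn statistics(&self) -> Option<Statistics> {
        None
    }
}

fn main() {
    let providers: Vec<Box<dyn TableProvider>> = vec![
        Box::new(InMemorySource { rows: vec![1, 2, 3] }),
        Box::new(StreamSource),
    ];
    for provider in &providers {
        // Callers have to handle the case where no statistics are available.
        match provider.statistics().and_then(|s| s.row_count) {
            Some(n) => println!("row count: {}", n),
            None => println!("row count unknown"),
        }
    }
}
```

Keeping both the return value and row_count optional lets a source report 
"no statistics at all" separately from "statistics exist, but the row count 
is unknown", which leaves room for the more advanced statistics mentioned 
above.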





[jira] [Created] (ARROW-10780) segfault in R

2020-12-01 Thread Matt Pollock (Jira)
Matt Pollock created ARROW-10780:


 Summary: segfault in R
 Key: ARROW-10780
 URL: https://issues.apache.org/jira/browse/ARROW-10780
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 2.0.0
Reporter: Matt Pollock


Hello, I installed arrow using

{code:java}
  Sys.setenv(ARROW_R_DEV=TRUE)
  source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")
  install_arrow(binary = FALSE, use_system = FALSE)
{code}

It shouldn't matter, but I also checked that the system packages 
parquet-devel and arrow-devel are both on version 2.0.0 (I'm on CentOS 7). I'm 
using R 4.0.3. However, using any arrow function causes a segfault, for example:

{code:java}
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

timestamp

> arrow_available()
[1] TRUE
> write_parquet(iris, "~/iris4")

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: Table__from_dots(dots, schema)
 2: shared_ptr_is_null(xp)
 3: shared_ptr(Table, Table__from_dots(dots, schema))
 4: Table$create(x)
 5: write_parquet(iris, "~/iris4")
{code}

I have tried various installation methods (source/binary, system packages or 
not, even the nightly build). The only thing I've gotten to work is to revert 
to the 1.0.1 version of arrow. Any advice is appreciated.

Thanks in advance.






[jira] [Created] (ARROW-10779) [Java] writeNull method in UnionListWriter doesn't work correctly if validity at that index is already set

2020-12-01 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-10779:

 Summary: [Java] writeNull method in UnionListWriter doesn't work 
correctly if validity at that index is already set
 Key: ARROW-10779
 URL: https://issues.apache.org/jira/browse/ARROW-10779
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Projjal Chanda





