[jira] [Created] (ARROW-9605) [C++] Optimize performance for aggregate min/max compute kernels

2020-07-30 Thread Frank Du (Jira)
Frank Du created ARROW-9605:
---

 Summary: [C++] Optimize performance for aggregate min/max compute 
kernels
 Key: ARROW-9605
 URL: https://issues.apache.org/jira/browse/ARROW-9605
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Frank Du


 # Use BitBlockCounter to speed up the typical case where only about 0.01% of 
the values are null.
 # Enable compiler auto-vectorization (AVX) for integer types when there are no 
nulls. Float/Double use fmin/fmax to handle NaN, which the compiler can't 
vectorize.
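
A minimal sketch of the block-wise idea in the first item, assuming a 
little-endian (LSB-first) validity bitmap as in Arrow. It does not use Arrow's 
actual BitBlockCounter; the names MinMax, GetBit and MinMaxWithNulls are 
hypothetical and only illustrate the technique: inspect 64 validity bits at a 
time and, when all are set, run a branch-free inner loop that a compiler can 
auto-vectorize for integer types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical result type: min/max start at the identity values.
struct MinMax {
  int32_t min = std::numeric_limits<int32_t>::max();
  int32_t max = std::numeric_limits<int32_t>::min();
};

// Bit i set => values[i] is non-null (least-significant bit first).
inline bool GetBit(const std::vector<uint8_t>& bitmap, size_t i) {
  return (bitmap[i / 8] >> (i % 8)) & 1;
}

MinMax MinMaxWithNulls(const std::vector<int32_t>& values,
                       const std::vector<uint8_t>& validity) {
  const size_t n = values.size();
  assert(validity.size() * 8 >= n);  // one validity bit per value
  MinMax out;
  size_t i = 0;
  // Process 64 values (8 bitmap bytes) per block.
  for (; i + 64 <= n; i += 64) {
    bool all_valid = true;
    bool none_valid = true;
    for (size_t b = 0; b < 8; ++b) {
      const uint8_t byte = validity[i / 8 + b];
      all_valid = all_valid && (byte == 0xFF);
      none_valid = none_valid && (byte == 0x00);
    }
    if (all_valid) {
      // Fast path: no per-element null check, so the loop can be vectorized.
      for (size_t j = i; j < i + 64; ++j) {
        out.min = std::min(out.min, values[j]);
        out.max = std::max(out.max, values[j]);
      }
    } else if (!none_valid) {
      // Slow path: some nulls in this block, test each bit.
      for (size_t j = i; j < i + 64; ++j) {
        if (GetBit(validity, j)) {
          out.min = std::min(out.min, values[j]);
          out.max = std::max(out.max, values[j]);
        }
      }
    }
    // Blocks with no valid bits are skipped entirely.
  }
  // Tail: fewer than 64 values remain.
  for (; i < n; ++i) {
    if (GetBit(validity, i)) {
      out.min = std::min(out.min, values[i]);
      out.max = std::max(out.max, values[i]);
    }
  }
  return out;
}
```

With roughly 0.01% nulls, almost every block takes the fast path, which is 
where the expected speedup would come from.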



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9604) [C++] Add benchmark for aggregate min/max compute kernels

2020-07-30 Thread Frank Du (Jira)
Frank Du created ARROW-9604:
---

 Summary: [C++] Add benchmark for aggregate min/max compute kernels
 Key: ARROW-9604
 URL: https://issues.apache.org/jira/browse/ARROW-9604
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Frank Du
Assignee: Frank Du


Add benchmark for aggregate min/max compute kernels, similar to sum aggregate.





[jira] [Created] (ARROW-9603) [C++][Parquet] Write Arrow relies on unspecified behavior for nested types

2020-07-30 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9603:
--

 Summary: [C++][Parquet] Write Arrow relies on unspecified behavior 
for nested types
 Key: ARROW-9603
 URL: https://issues.apache.org/jira/browse/ARROW-9603
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Micah Kornfield


The WriteArrow implementations in parquet/column_writer.cc check null 
counts/required data at certain points and pass the null bitmap through for 
encoding. This only works for nested data types if a null slot on a parent 
implies a null slot on the leaf. This relationship is not required by the 
specifications.

 

Most paths for creating arrays follow this pattern so it would be esoteric to 
hit this bug, but we should still fix it.

 

All branches that rely on reading nullness should generate a new null bitmap 
based on definition levels if the column is nested, and decisions should be 
based off of that.
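
A minimal sketch of the proposed fix, not the actual column_writer.cc code. It 
assumes a nested column without repeated fields (structs only), so there is 
exactly one definition level per leaf slot, and it assumes a leaf slot is 
non-null only when its definition level equals the maximum definition level. 
LeafValidity and ValidityFromDefLevels are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the regenerated leaf validity information.
struct LeafValidity {
  std::vector<uint8_t> bitmap;  // bit i set => leaf slot i is non-null (LSB first)
  int64_t null_count = 0;
};

// Rebuild the leaf null bitmap from definition levels instead of trusting the
// leaf array's own bitmap, which may not reflect nulls on parent structs.
LeafValidity ValidityFromDefLevels(const std::vector<int16_t>& def_levels,
                                   int16_t max_def_level) {
  LeafValidity out;
  out.bitmap.assign((def_levels.size() + 7) / 8, 0);
  for (size_t i = 0; i < def_levels.size(); ++i) {
    assert(def_levels[i] >= 0 && def_levels[i] <= max_def_level);
    if (def_levels[i] == max_def_level) {
      // Defined at every nesting level: the leaf value is present.
      out.bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    } else {
      // Null at the leaf or at some ancestor.
      ++out.null_count;
    }
  }
  return out;
}
```

Decisions such as "does this column contain nulls" would then use the 
regenerated null_count rather than the leaf array's own null count.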





[jira] [Created] (ARROW-9602) segfault on write_parquet

2020-07-30 Thread Matt Pollock (Jira)
Matt Pollock created ARROW-9602:
---

 Summary: segfault on write_parquet
 Key: ARROW-9602
 URL: https://issues.apache.org/jira/browse/ARROW-9602
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 1.0.0
Reporter: Matt Pollock


 
{code:java}
> arrow::write_parquet(iris, "~/iris") 
*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: Table__from_dots(dots, schema)
 2: shared_ptr_is_null(xp)
 3: shared_ptr(Table, Table__from_dots(dots, schema))
 4: Table$create(x)
 5: arrow::write_parquet(iris, "~/iris")

{code}
The segfault is easy to reproduce by writing the iris data to Parquet. I have 
tried R 4.0.0 and R 4.0.2, and I have installed the arrow (R) package from 
CRAN, from source, and from a nightly build, both with and without the system 
Arrow C++ installation. When using the system Arrow, the installed version is:
{noformat}
Installed Packages 
Name        : arrow-devel 
Arch        : x86_64 
Version     : 1.0.0 
Release     : 1.el7 
Size        : 32 M 
Repo        : installed 
From repo   : apache-arrow 
Summary     : Libraries and header files for Apache Arrow C++ 
URL         : https://arrow.apache.org/ 
License     : Apache-2.0 
Description : Libraries and header files for Apache Arrow C++.

{noformat}
 I realize this is so basic that it seems improbable your CI would have missed 
it (i.e., the issue probably has to do with my local environment), but I would 
appreciate verification that version 1.0 works for others on CentOS 7.





[jira] [Created] (ARROW-9601) [C++][Flight] IpcWriteOptions do not appear to be propagated in DoGet requests

2020-07-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9601:
---

 Summary: [C++][Flight] IpcWriteOptions do not appear to be 
propagated in DoGet requests
 Key: ARROW-9601
 URL: https://issues.apache.org/jira/browse/ARROW-9601
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Reporter: Wes McKinney
 Fix For: 2.0.0


I haven't fully investigated this yet, but I have found that while compression 
(e.g. ZSTD) is respected in DoPut requests on the client side, it does not 
appear to propagate through DoGet requests. This may be a bug or by design, but 
I think it should be possible for the client to request that compression be 
employed when serving a DoGet.





[jira] [Created] (ARROW-9600) [Rust] When used as a crate dependency, arrow-flight is rebuilt on every invocation of cargo build

2020-07-30 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-9600:
--

 Summary: [Rust] When used as a crate dependency, arrow-flight is 
rebuilt on every invocation of cargo build
 Key: ARROW-9600
 URL: https://issues.apache.org/jira/browse/ARROW-9600
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 1.0.0
Reporter: Andrew Lamb


When used as a crate dependency, arrow-flight is rebuilt on every invocation of 
cargo build
h1. *Repro*:

Create a new repo, add `arrow=1.0.0` as a dependency, and then run `cargo build`

*Expected behavior:* After the first successful invocation of `cargo build`, 
arrow-flight will not recompile if no other changes are made.

*Actual behavior*: After every invocation of `cargo build`, arrow-flight is 
recompiled, even when nothing has changed
h1. Example

 
 Create a new crate
{code:java}
 alamb@ip-192-168-0-129 arrow_rebuilds % cargo new too_many_rebuilds --bin
 Created binary (application) `too_many_rebuilds` package
{code}
Add arrow as a dependency in Cargo.toml:
{code:java}
 diff --git a/Cargo.toml b/Cargo.toml
 index a239680..44ed358 100644
 --- a/Cargo.toml
 +++ b/Cargo.toml
 @@ -5,3 +5,6 @@ authors = ["alamb "]
 edition = "2018"
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
 +
 +[dependencies]
 +arrow = "1.0.0"
{code}
Now, all invocations of `cargo build` will rebuild arrow, even though nothing 
in the code has changed:
{code:java}
 alamb@ip-192-168-0-129 too_many_rebuilds % cargo build
 Compiling arrow-flight v1.0.0
 Compiling arrow v1.0.0
 Compiling too_many_rebuilds v0.1.0 (/Users/alamb/Software/bugs/arrow_rebuilds/too_many_rebuilds)
 Finished dev [unoptimized + debuginfo] target(s) in 8.70s
 alamb@ip-192-168-0-129 too_many_rebuilds % cargo build
 Compiling arrow-flight v1.0.0
 Compiling arrow v1.0.0
 Compiling too_many_rebuilds v0.1.0 (/Users/alamb/Software/bugs/arrow_rebuilds/too_many_rebuilds)
 Finished dev [unoptimized + debuginfo] target(s) in 8.65s
{code}
You can see what is happening by checking out a fresh copy of arrow/master (no 
Cargo.lock) and running `cargo build`: afterwards your local checkout shows 
changes in rust/arrow-flight/src/arrow.flight.protocol.rs:
{code:java}
 alamb@ip-192-168-0-129 arrow % cd rust/arrow
 alamb@ip-192-168-0-129 arrow % git status
 On branch master
 Your branch is up to date with 'origin/master'.

 nothing to commit, working tree clean
 alamb@ip-192-168-0-129 arrow % cargo build
 Compiling futures-task v0.3.5
 ...
 Compiling arrow v2.0.0-SNAPSHOT (/Users/alamb/Software/arrow/rust/arrow)
 Finished dev [unoptimized + debuginfo] target(s) in 21.76s
 alamb@ip-192-168-0-129 arrow % git status
 On branch master
 Your branch is up to date with 'origin/master'.

 Changes not staged for commit:
 (use "git add ..." to update what will be committed)
 (use "git restore ..." to discard changes in working directory)
 modified: ../arrow-flight/src/arrow.flight.protocol.rs

 no changes added to commit (use "git add" and/or "git commit -a")
{code}
h1. Root Cause Analysis

The issue is that the build.rs of arrow-flight calls `tonic_build` to auto 
generate `rust/arrow-flight/src/arrow.flight.protocol.rs`, which is also 
checked in (first done in 
[https://github.com/apache/arrow/commit/ec84b7b8102f227295f865c420496830c66a6281]).

This file and the version of tonic were updated on July 11, 2020, in 
[https://github.com/apache/arrow/commit/7b49cbc23f22ed99eebf85cc0b9acb1f0d3f832f].

It turns out that the output of `tonic_build` depends not only on the version 
of tonic, but also on the version of proc-macro2, and the version of 
proc-macro2 is not specifically pinned. 

`proc-macro2 = "1.0.19"` was released on July 19, 2020 
([https://crates.io/crates/proc-macro2/1.0.19]) and it appears to subtly 
change the resulting output of arrow.flight.protocol.rs, so the output no 
longer matches what is checked in. This means that anyone without a Cargo.lock 
file that pins proc-macro2 to 1.0.18 would get 1.0.19 and thus also a local 
modification during build.

h1. Workaround
If we pin proc-macro2 to exactly 1.0.18 in Cargo.toml, the local modification 
stops. Note the `=`: a bare "1.0.18" is a caret requirement and would still 
allow 1.0.19.
 {code}
 proc-macro2 = "=1.0.18"
 {code}

 





[jira] [Created] (ARROW-9599) [CI] Appveyor toolchain build fails because CMake detects different C and C++ compilers

2020-07-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9599:
--

 Summary: [CI] Appveyor toolchain build fails because CMake detects 
different C and C++ compilers
 Key: ARROW-9599
 URL: https://issues.apache.org/jira/browse/ARROW-9599
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 2.0.0


Build log: 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34377790/job/f955ccj8irpgh565#L440

Caused by the recent CMake 3.18 release.


