[jira] [Created] (ARROW-9605) [C++] Optimize performance for aggregate min/max compute kernels
Frank Du created ARROW-9605: --- Summary: [C++] Optimize performance for aggregate min/max compute kernels Key: ARROW-9605 URL: https://issues.apache.org/jira/browse/ARROW-9605 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Frank Du # Use BitBlockCounter to speed up the typical case of nullable data with about 0.01% nulls. # Enable AVX compiler auto-vectorization for the no-nulls case on int types. Float/Double use fmin/fmax to handle NaN, which the compiler can't vectorize. -- This message was sent by Atlassian Jira (v8.3.4#803005)
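The two items above can be illustrated with a minimal sketch. It is not the Arrow kernel and does not use the real BitBlockCounter API; the function names, the `std::vector<uint64_t>` bitmap layout (bit i stored LSB-first in word i / 64), and the sentinel return values are all assumptions made for illustration. The idea shown is only that a 64-slot block with every validity bit set can be reduced in a loop with no per-element null check, and that fmin/fmax skip NaN operands.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical helper, not an Arrow API. Computes {min, max} of the valid slots.
// Assumes validity.size() >= ceil(values.size() / 64).
// Returns {INT32_MAX, INT32_MIN} when no slot is valid.
std::pair<int32_t, int32_t> MinMaxWithValidity(
    const std::vector<int32_t>& values, const std::vector<uint64_t>& validity) {
  int32_t min = std::numeric_limits<int32_t>::max();
  int32_t max = std::numeric_limits<int32_t>::lowest();
  const std::size_t n = values.size();
  for (std::size_t base = 0; base < n; base += 64) {
    const std::size_t len = std::min<std::size_t>(64, n - base);
    // Mask selecting the `len` slots covered by this word.
    const uint64_t mask = len == 64 ? ~uint64_t{0} : (uint64_t{1} << len) - 1;
    const uint64_t word = validity[base / 64] & mask;
    if (word == mask) {
      // Every slot is valid: no null check inside the loop, so the compiler
      // is free to auto-vectorize it.
      for (std::size_t i = 0; i < len; ++i) {
        min = std::min(min, values[base + i]);
        max = std::max(max, values[base + i]);
      }
    } else if (word != 0) {
      // Mixed block: fall back to testing each validity bit.
      for (std::size_t i = 0; i < len; ++i) {
        if ((word >> i) & 1) {
          min = std::min(min, values[base + i]);
          max = std::max(max, values[base + i]);
        }
      }
    }
    // word == 0: the whole block is null and is skipped.
  }
  return {min, max};
}

// Hypothetical helper: fmin/fmax return the non-NaN operand when exactly one
// operand is NaN, so NaN values never become the result.
// Returns {+inf, -inf} when the input is empty or all NaN.
std::pair<double, double> MinMaxIgnoringNaN(const std::vector<double>& values) {
  double min = std::numeric_limits<double>::infinity();
  double max = -std::numeric_limits<double>::infinity();
  for (double v : values) {
    min = std::fmin(min, v);
    max = std::fmax(max, v);
  }
  return {min, max};
}
```

With only 0.01% nulls, almost every 64-slot block takes the all-valid branch, which is where the expected speedup would come from.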
[jira] [Created] (ARROW-9604) [C++] Add benchmark for aggregate min/max compute kernels
Frank Du created ARROW-9604: --- Summary: [C++] Add benchmark for aggregate min/max compute kernels Key: ARROW-9604 URL: https://issues.apache.org/jira/browse/ARROW-9604 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Frank Du Assignee: Frank Du Add benchmark for aggregate min/max compute kernels, similar to sum aggregate.
[jira] [Created] (ARROW-9603) [C++][Parquet] Write Arrow relies on unspecified behavior for nested types
Micah Kornfield created ARROW-9603: -- Summary: [C++][Parquet] Write Arrow relies on unspecified behavior for nested types Key: ARROW-9603 URL: https://issues.apache.org/jira/browse/ARROW-9603 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Micah Kornfield The WriteArrow implementations in parquet/column_writer.cc at certain points check null counts/required data and pass the null bitmap through for encoding. This only works for nested data types if a null slot on a parent implies a null slot on the leaf. This relationship is not required by the specification. Most paths for creating arrays follow this pattern, so it would be esoteric to hit this bug, but we should still fix it. All branches that rely on reading nullness should generate a new null bitmap based on definition levels if the column is nested, and decisions should be based on that bitmap.
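A minimal sketch of the proposed direction, deriving a leaf validity bitmap from definition levels rather than trusting the leaf array's own bitmap. This is not the code in parquet/column_writer.cc; the function name and signature are hypothetical, and it assumes a column with no repeated (list) ancestors, so that each definition level corresponds to exactly one leaf slot. Under that assumption a leaf value is present exactly when its definition level equals the column's maximum definition level.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not a Parquet/Arrow API. Builds an Arrow-style validity
// bitmap (LSB-first within each byte) in which slot i is valid only when
// def_levels[i] == max_def_level.
// Assumes no repeated ancestors: one definition level per leaf slot.
std::vector<uint8_t> LeafValidityFromDefLevels(
    const std::vector<int16_t>& def_levels, int16_t max_def_level) {
  std::vector<uint8_t> bitmap((def_levels.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}
```

For a nullable struct with a nullable int32 child, the maximum definition level is 2: level 0 means the struct itself is null, level 1 means the struct is present but the child is null, and level 2 means the child value is present. Definition levels {2, 1, 0, 2} therefore mark slots 0 and 3 as valid, regardless of what the child array's own bitmap says for slot 2.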
[jira] [Created] (ARROW-9602) segfault on write_parquet
Matt Pollock created ARROW-9602: --- Summary: segfault on write_parquet Key: ARROW-9602 URL: https://issues.apache.org/jira/browse/ARROW-9602 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 1.0.0 Reporter: Matt Pollock

{code:java}
> arrow::write_parquet(iris, "~/iris")

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: Table__from_dots(dots, schema)
 2: shared_ptr_is_null(xp)
 3: shared_ptr(Table, Table__from_dots(dots, schema))
 4: Table$create(x)
 5: arrow::write_parquet(iris, "~/iris")
{code}

The segfault is easy to reproduce by trying to write the iris data to Parquet. I have tried R 4.0.0 and R 4.0.2, and I have installed the arrow (R) package from CRAN, from source, and from the nightly build, both with and without the system Arrow C++ installation. When using the system Arrow, the installed version is:

{noformat}
Installed Packages
Name        : arrow-devel
Arch        : x86_64
Version     : 1.0.0
Release     : 1.el7
Size        : 32 M
Repo        : installed
From repo   : apache-arrow
Summary     : Libraries and header files for Apache Arrow C++
URL         : https://arrow.apache.org/
License     : Apache-2.0
Description : Libraries and header files for Apache Arrow C++.
{noformat}

I realize that this is so basic that it seems improbable that your CI didn't catch it (i.e., the issue probably has to do with my local environment), but I would appreciate verification that version 1.0 works for others on CentOS 7.
[jira] [Created] (ARROW-9601) [C++][Flight] IpcWriteOptions do not appear to be propagated in DoGet requests
Wes McKinney created ARROW-9601: --- Summary: [C++][Flight] IpcWriteOptions do not appear to be propagated in DoGet requests Key: ARROW-9601 URL: https://issues.apache.org/jira/browse/ARROW-9601 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Reporter: Wes McKinney Fix For: 2.0.0 I haven't fully investigated this yet, but I have found that while compression (e.g. ZSTD) is respected in DoPut requests on the client side, it does not appear to propagate through DoGet requests. This may be a bug or by design, but I think it should be possible for the client to request that compression be employed when serving a DoGet.
[jira] [Created] (ARROW-9600) [Rust] When used as a crate dependency, arrow-flight is rebuilt on every invocation of cargo build
Andrew Lamb created ARROW-9600: -- Summary: [Rust] When used as a crate dependency, arrow-flight is rebuilt on every invocation of cargo build Key: ARROW-9600 URL: https://issues.apache.org/jira/browse/ARROW-9600 Project: Apache Arrow Issue Type: Bug Components: Rust Affects Versions: 1.0.0 Reporter: Andrew Lamb

When used as a crate dependency, arrow-flight is rebuilt on every invocation of cargo build.

h1. *Repro*:

Create a new repo, add `arrow=1.0.0` as a dependency, and then run `cargo build`.

*Expected behavior:* After the first successful invocation of `cargo build`, arrow-flight will not recompile if no other changes are made.

*Actual behavior*: After every invocation of `cargo build`, arrow-flight is recompiled, even when nothing has changed.

h1. Example

Create a new crate:

{code:java}
alamb@ip-192-168-0-129 arrow_rebuilds % cargo new too_many_rebuilds --bin
cargo new too_many_rebuilds --bin
     Created binary (application) `too_many_rebuilds` package
{code}

Add arrow as a dependency in Cargo.toml:

{code:java}
diff --git a/Cargo.toml b/Cargo.toml
index a239680..44ed358 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -5,3 +5,6 @@ authors = ["alamb "]
 edition = "2018"
 
 # See more keys and their definitions at [https://doc.rust-lang.org/cargo/reference/manifest.html]
+
+[dependencies]
+arrow = "1.0.0"
{code}

Now every invocation of `cargo build` rebuilds arrow, even though nothing in the code has changed:

{code:java}
alamb@ip-192-168-0-129 too_many_rebuilds % cargo build
cargo build
   Compiling arrow-flight v1.0.0
   Compiling arrow v1.0.0
   Compiling too_many_rebuilds v0.1.0 (/Users/alamb/Software/bugs/arrow_rebuilds/too_many_rebuilds)
    Finished dev [unoptimized + debuginfo] target(s) in 8.70s
alamb@ip-192-168-0-129 too_many_rebuilds % cargo build
cargo build
   Compiling arrow-flight v1.0.0
   Compiling arrow v1.0.0
   Compiling too_many_rebuilds v0.1.0 (/Users/alamb/Software/bugs/arrow_rebuilds/too_many_rebuilds)
    Finished dev [unoptimized + debuginfo] target(s) in 8.65s
{code}

You can see what is happening by checking out a fresh copy of arrow/master (no Cargo.lock) and running `cargo build`: afterwards your local checkout has changes in rust/arrow-flight/src/arrow.flight.protocol.rs:

{code:java}
alamb@ip-192-168-0-129 arrow % cd rust/arrow
cd rust/arrow
alamb@ip-192-168-0-129 arrow % git status
git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
alamb@ip-192-168-0-129 arrow % cargo build
cargo build
   Compiling futures-task v0.3.5
...
   Compiling arrow v2.0.0-SNAPSHOT (/Users/alamb/Software/arrow/rust/arrow)
    Finished dev [unoptimized + debuginfo] target(s) in 21.76s
alamb@ip-192-168-0-129 arrow %
alamb@ip-192-168-0-129 arrow % git status
git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add ..." to update what will be committed)
  (use "git restore ..." to discard changes in working directory)
	modified:   ../arrow-flight/src/arrow.flight.protocol.rs

no changes added to commit (use "git add" and/or "git commit -a")
{code}

h1. Root Cause Analysis

The issue is that the build.rs of arrow-flight calls `tonic_build` to auto-generate `rust/arrow-flight/src/arrow.flight.protocol.rs`, which is also checked in (first done in [https://github.com/apache/arrow/commit/ec84b7b8102f227295f865c420496830c66a6281]). This file and the version of tonic were updated in [https://github.com/apache/arrow/commit/7b49cbc23f22ed99eebf85cc0b9acb1f0d3f832f] on July 11, 2020.

It turns out that the output of `tonic_build` depends not only on the version of tonic but also on the version of proc-macro2, and the version of proc-macro2 is not specifically pinned. proc-macro2 1.0.19 was released on July 19, 2020 ([https://crates.io/crates/proc-macro2/1.0.19]) and it appears to subtly change the resulting output of arrow.flight.protocol.rs, so the output no longer matches what is checked in.

This means that anyone without a Cargo.lock file that pins proc-macro2 to 1.0.18 gets 1.0.19 and therefore also a local modification during the build.

h1. Workaround

If we pin proc-macro2 to exactly 1.0.18 in Cargo.toml, the local modification stops. Note that a bare "1.0.18" requirement is a caret requirement that still allows 1.0.19, so the exact-version form is needed:

{code}
proc-macro2 = "=1.0.18"
{code}
[jira] [Created] (ARROW-9599) [CI] Appveyor toolchain build fails because CMake detects different C and C++ compilers
Krisztian Szucs created ARROW-9599: -- Summary: [CI] Appveyor toolchain build fails because CMake detects different C and C++ compilers Key: ARROW-9599 URL: https://issues.apache.org/jira/browse/ARROW-9599 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 2.0.0 Build log: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34377790/job/f955ccj8irpgh565#L440 Caused by the recent CMake 3.18 release.