[jira] [Created] (ARROW-10953) CLONE - [R] as.data.frame.Table crashes R with schema and no record batches
Neal Richardson created ARROW-10953: --- Summary: CLONE - [R] as.data.frame.Table crashes R with schema and no record batches Key: ARROW-10953 URL: https://issues.apache.org/jira/browse/ARROW-10953 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 2.0.0 Environment: > sessionInfo() R version 4.0.3 (2020-10-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 10 (buster) Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] bigrquery_1.3.2 bigrquerystorage_0.1.0 loaded via a namespace (and not attached): [1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.6 [4] compiler_4.0.3 dbplyr_2.0.0 tools_4.0.3 [7] odbc_1.3.0 getPass_0.2-2 digest_0.6.27 [10] bit_4.0.4 gargle_0.5.0 jsonlite_1.7.1 [13] memoise_1.1.0 lifecycle_0.2.0 tibble_3.0.4 [16] pkgconfig_2.0.3 rlang_0.4.8 extraw_1.8.25 [19] DBI_1.1.0 rstudioapi_0.13 curl_4.3 [22] xml2_1.3.2 dplyr_1.0.2 httr_1.4.2 [25] askpass_1.1 fs_1.5.0 generics_0.1.0 [28] vctrs_0.3.5 hms_0.5.3 bit64_4.0.5 [31] tidyselect_1.1.0 glue_1.4.2 data.table_1.13.2 [34] R6_2.5.0 readxl_1.3.1 connect.cap_0.3.19 [37] purrr_0.3.4 blob_1.2.1 magrittr_2.0.1 [40] ellipsis_0.3.1 assertthat_0.2.1 keyring_1.1.0 [43] arrow_2.0.0.20201117 openssl_1.4.3 crayon_1.3.4 Reporter: Bruno Tremblay Fix For: 3.0.0 The objective is to build a 0-row data.frame using an Arrow schema field definition {code:java} #IPC stream containing only a schema
stream<-as.raw(c(255,255,255,255,16,1,0,0,16,0,0,0,0,0,10,0,12,0,6,0,5,0,8,0,10,0,0,0,0,1,3,0,12,0,0,0,8,0,8,0,0,0,4,0,8,0,0,0,4,0,0,0,4,0,0,0,160,0,0,0,92,0,0,0,48,0,0,0,4,0,0,0,128,255,255,255,0,0,1,5,20,0,0,0,12,0,0,0,4,0,0,0,0,0,0,0,176,255,255,255,7,0,0,0,82,69,80,79,78,83,69,0,168,255,255,255,0,0,1,5,20,0,0,0,12,0,0,0,4,0,0,0,0,0,0,0,216,255,255,255,6,0,0,0,68,69,84,65,73,76,0,0,208,255,255,255,0,0,1,5,24,0,0,0,16,0,0,0,4,0,0,0,0,0,0,0,4,0,4,0,4,0,0,0,8,0,0,0,68,65,84,65,84,89,80,69,0,0,0,0,16,0,20,0,8,0,6,0,7,0,12,0,0,0,16,0,16,0,0,0,0,0,1,7,36,0,0,0,20,0,0,0,4,0,0,0,0,0,0,0,8,0,12,0,4,0,8,0,8,0,0,0,38,0,0,0,9,0,0,0,8,0,0,0,77,65,67,84,65,95,73,68,0,0,0,0,0,0,0,0))
readr <- RecordBatchStreamReader$create(stream)
readr$read_table()
# Error in Table__from_RecordBatchStreamReader(self) :
#   Invalid: Must pass at least one record batch or an explicit Schema
# Now trying to be too clever
tb <- Table$create(data.frame(), schema = readr$schema)
dtf <- as.data.frame(tb) # This will crash your R session
{code}
Tested on nightly, same behavior. This is on the border between a bug and a feature request, but to be a drop-in replacement for some DBI methods, it needs to be able to build a 0-row data.frame with the correct class for each column. Thank you and have a nice day. -- This message was sent by Atlassian Jira (v8.3.4#803005)
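For illustration only, the result being asked for (a 0-row data.frame whose columns already carry the right classes) can be sketched in base R without involving Arrow. The type labels and the `prototypes` lookup are assumptions made up for this sketch, not Arrow's actual type mapping; the two column names echo field names that appear in the stream bytes above, and their types are chosen arbitrarily.

```r
# Hypothetical lookup from a type label to an empty R vector of the matching class.
# The labels are placeholders for this sketch, not Arrow's real type mapping.
prototypes <- list(
  string = character(0),
  int32  = integer(0),
  double = double(0),
  bool   = logical(0)
)

# Build a 0-row data.frame from a named character vector of type labels.
zero_row_df <- function(col_types) {
  unknown <- setdiff(col_types, names(prototypes))
  if (length(unknown) > 0) {
    stop("no prototype for type(s): ", paste(unknown, collapse = ", "))
  }
  # lapply() keeps the names, so each column gets its field name.
  cols <- lapply(col_types, function(type) prototypes[[type]])
  as.data.frame(cols, stringsAsFactors = FALSE)
}

empty <- zero_row_df(c(REPONSE = "string", MACTA_ID = "double"))
str(empty)
```

A real fix would derive the prototypes from the schema's field types rather than from hand-written labels.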
[jira] [Created] (ARROW-11071) [R][CI] Use processx to set up minio and flight servers in tests
Neal Richardson created ARROW-11071: --- Summary: [R][CI] Use processx to set up minio and flight servers in tests Key: ARROW-11071 URL: https://issues.apache.org/jira/browse/ARROW-11071 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 3.0.0 Rather than rely on them being set up outside of the tests. processx is already a transitive test dependency (testthat uses it) so there's no reason for us not to. -- This message was sent by Atlassian Jira (v8.3.4#803005)
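A minimal sketch of what this could look like, assuming the processx package is installed. The helper name `with_background_process` is hypothetical, and a sleeping R process stands in for the real minio or flight server, whose actual command lines are not given here.

```r
# Run `code(proc)` while a background process is alive, then always stop it.
# `with_background_process` is a hypothetical helper name for this sketch.
with_background_process <- function(command, args, code) {
  proc <- processx::process$new(command, args)
  on.exit(proc$kill(), add = TRUE)  # clean up even if `code` errors
  code(proc)
}

# Stand-in for a real server: an R process that just sleeps.
if (requireNamespace("processx", quietly = TRUE)) {
  rscript <- file.path(R.home("bin"), "Rscript")
  was_alive <- with_background_process(
    rscript, c("-e", "Sys.sleep(60)"),
    function(proc) proc$is_alive()
  )
}
```

Tying the process lifetime to `on.exit()` means a failing test cannot leave a stray server running.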
[jira] [Created] (ARROW-11079) [R] Catch up on changelog since 2.0
Neal Richardson created ARROW-11079: --- Summary: [R] Catch up on changelog since 2.0 Key: ARROW-11079 URL: https://issues.apache.org/jira/browse/ARROW-11079 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11080) [C++][Dataset] Improvements to implicit casting
Neal Richardson created ARROW-11080: --- Summary: [C++][Dataset] Improvements to implicit casting Key: ARROW-11080 URL: https://issues.apache.org/jira/browse/ARROW-11080 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Assignee: Ben Kietzman Fix For: 3.0.0 Followup to ARROW-10322. In ARROW-9187, where we started making use of more compute functions in R, we found a couple of places where implicit casts weren't being inserted where they should be: * https://github.com/apache/arrow/pull/8947/commits/843ff2a39d8a4e1c92247fb672567c0b85b4f45a#diff-79100695986bbd6a63704fe9f238ce3ae9a39ddd093b7f6b213d4a722309d20aR576 "Function multiply_checked has no kernel matching input types (scalar[double], array[int32])" * https://github.com/apache/arrow/pull/8947/commits/843ff2a39d8a4e1c92247fb672567c0b85b4f45a#diff-79100695986bbd6a63704fe9f238ce3ae9a39ddd093b7f6b213d4a722309d20aR590 "Function add_checked has no kernel matching input types (array[double], array[int32])" because implicit casts are only applied to scalars, to cast them to the type of the other argument. This may speak to a need for more rules around how inputs should be cast/promoted in different contexts. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11092) [CI] (Temporarily) move offending workflows to separate files
Neal Richardson created ARROW-11092: --- Summary: [CI] (Temporarily) move offending workflows to separate files Key: ARROW-11092 URL: https://issues.apache.org/jira/browse/ARROW-11092 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 3.0.0 Without warning, INFRA broke several of our GitHub Actions workflows, and have been unresponsive all week. See https://issues.apache.org/jira/browse/INFRA-21239. Since then, the Rust developers have removed their offending actions, so those are no longer blocked. This PR does harm reduction for C++ and R workflows, moving the workflows that INFRA doesn't like to their own files (temporarily, I hope, while this business gets sorted out). This enables the other workflows in each file to run, so we at least get some C++ and R tests running, and we can still verify on our personal forks the workflows that have been blocked on apache/arrow. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11136) [R] Bindings for is.nan
Neal Richardson created ARROW-11136: --- Summary: [R] Bindings for is.nan Key: ARROW-11136 URL: https://issues.apache.org/jira/browse/ARROW-11136 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Jonathan Keane ARROW-11043 added this compute kernel in C++, so we should wire it up in R -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11152) [CI][C++] Fix Homebrew numpy installation on macOS builds
Neal Richardson created ARROW-11152: --- Summary: [CI][C++] Fix Homebrew numpy installation on macOS builds Key: ARROW-11152 URL: https://issues.apache.org/jira/browse/ARROW-11152 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Neal Richardson Fix For: 3.0.0 Numpy fails to install with homebrew because it tries to upgrade gcc and hits a {{brew link}} error. Running {{brew unlink gcc@8 gcc@9}} before {{brew install}} could work around this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11153) [C++][Packaging] Move debian/ubuntu/centos packaging off of Travis-CI
Neal Richardson created ARROW-11153: --- Summary: [C++][Packaging] Move debian/ubuntu/centos packaging off of Travis-CI Key: ARROW-11153 URL: https://issues.apache.org/jira/browse/ARROW-11153 Project: Apache Arrow Issue Type: New Feature Components: C++, Packaging Reporter: Neal Richardson Assignee: Kouhei Sutou Fix For: 3.0.0 Per mailing list discussion -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11154) [CI][C++] Move homebrew crossbow tests off of Travis-CI
Neal Richardson created ARROW-11154: --- Summary: [CI][C++] Move homebrew crossbow tests off of Travis-CI Key: ARROW-11154 URL: https://issues.apache.org/jira/browse/ARROW-11154 Project: Apache Arrow Issue Type: New Feature Components: C++, Packaging Reporter: Neal Richardson Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11155) [C++][Packaging] Move gandiva crossbow jobs off of Travis-CI
Neal Richardson created ARROW-11155: --- Summary: [C++][Packaging] Move gandiva crossbow jobs off of Travis-CI Key: ARROW-11155 URL: https://issues.apache.org/jira/browse/ARROW-11155 Project: Apache Arrow Issue Type: New Feature Components: C++, Packaging Reporter: Neal Richardson Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11176) [R] Expose memory pool name and document setting it
Neal Richardson created ARROW-11176: --- Summary: [R] Expose memory pool name and document setting it Key: ARROW-11176 URL: https://issues.apache.org/jira/browse/ARROW-11176 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Jonathan Keane Fix For: 4.0.0 Followup to ARROW-11009, which did this in C++ and added the binding in Python. This could be useful not only for debugging but also for benchmarking. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11210) [CI] Restore workflows that had been blocked by INFRA
Neal Richardson created ARROW-11210: --- Summary: [CI] Restore workflows that had been blocked by INFRA Key: ARROW-11210 URL: https://issues.apache.org/jira/browse/ARROW-11210 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 3.0.0 See INFRA-21239, ARROW-11092, ARROW-11132 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11217) [C++] Runtime SIMD check on Apple hardware missing
Neal Richardson created ARROW-11217: --- Summary: [C++] Runtime SIMD check on Apple hardware missing Key: ARROW-11217 URL: https://issues.apache.org/jira/browse/ARROW-11217 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Neal Richardson Fix For: 3.0.0 [~jeroenooms] hit a crash in the "sum" compute kernel using the R package on a new M1 machine running the Rosetta emulator: https://gist.github.com/jeroen/c60548b29ff7f6807a6554799bd01cb7 According to https://developer.apple.com/documentation/apple_silicon/about_the_rosetta_translation_environment, we should be checking sysctlbyname for AVX* capabilities, but we are not. We only use that function in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/cpu_info.cc#L350-L359 to check CPU cache size. This may also explain a crash we observed previously on a very old macOS CRAN machine. I think we should resolve this before the 3.0 release if possible, in order to avoid bug reports as more people get M1s. cc [~apitrou] [~uwe] [~kou] [~frankdu] [~yibo] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11240) [Packaging][R] Add mimalloc to R packaging
Neal Richardson created ARROW-11240: --- Summary: [Packaging][R] Add mimalloc to R packaging Key: ARROW-11240 URL: https://issues.apache.org/jira/browse/ARROW-11240 Project: Apache Arrow Issue Type: New Feature Components: Packaging, R Reporter: Neal Richardson Fix For: 3.0.0 See also ARROW-11231 Relevant scripts: * ci/scripts/PKGBUILD * dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb * r/inst/build_arrow_static.sh -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11247) [C++] Infer date32 columns in CSV
Neal Richardson created ARROW-11247: --- Summary: [C++] Infer date32 columns in CSV Key: ARROW-11247 URL: https://issues.apache.org/jira/browse/ARROW-11247 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Jared Lander Assignee: Neal Richardson Fix For: 3.0.0 See ARROW-11243 for the original report -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11277) [C++] Fix compilation error in dataset expressions on macOS 10.11
Neal Richardson created ARROW-11277: --- Summary: [C++] Fix compilation error in dataset expressions on macOS 10.11 Key: ARROW-11277 URL: https://issues.apache.org/jira/browse/ARROW-11277 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Neal Richardson Assignee: Ben Kietzman See https://github.com/autobrew/homebrew-core/pull/61#issuecomment-761605455 R binary packages for macOS are built with an old SDK, so this is needed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11338) [R] Bindings for quantile and median
Neal Richardson created ARROW-11338: --- Summary: [R] Bindings for quantile and median Key: ARROW-11338 URL: https://issues.apache.org/jira/browse/ARROW-11338 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 Following ARROW-10831 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11350) [C++] Bump dependency versions
Neal Richardson created ARROW-11350: --- Summary: [C++] Bump dependency versions Key: ARROW-11350 URL: https://issues.apache.org/jira/browse/ARROW-11350 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11392) [R] Remove ARROW_R_WITH_ARROW flags
Neal Richardson created ARROW-11392: --- Summary: [R] Remove ARROW_R_WITH_ARROW flags Key: ARROW-11392 URL: https://issues.apache.org/jira/browse/ARROW-11392 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 ARROW-10735 did the first part of this. Once we're sure that we want to fully remove the wrapping, * Remove all references to ARROW_R_WITH_ARROW * Remove arrow_available() function and all references to it (arrow must always be available) * Update docs to remove mention of the possibility that you could have a package installation that doesn't do anything * Remove all references to TEST_R_WITH_ARROW environment variable and remove the r_only() test wrapper -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11423) [R] value_counts and some StructArray methods
Neal Richardson created ARROW-11423: --- Summary: [R] value_counts and some StructArray methods Key: ARROW-11423 URL: https://issues.apache.org/jira/browse/ARROW-11423 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 Exposing value_counts() is useful for exploration, even if it is limited to counting over a single (non-struct) array. And since it returns a StructArray, I found it useful to implement some more methods on that object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11424) [C++] Add more StructType and StructArray methods
Neal Richardson created ARROW-11424: --- Summary: [C++] Add more StructType and StructArray methods Key: ARROW-11424 URL: https://issues.apache.org/jira/browse/ARROW-11424 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 A StructType is basically a Schema (vector of Fields), right? Likewise, a StructArray is pretty much the same as a RecordBatch, right? Schema and RecordBatch have many more methods than StructType/StructArray, but we should be able to do the same kinds of things to structs. Also, an observation while working on ARROW-11423: the methods to extract an Array column from a StructArray are called {{field()}} and {{GetFieldByName()}}, which is confusing since Schema/StructType is what contains Field objects, and {{StructArray::field()}} returns Array, not Field. cc [~bkietz] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11441) [R] Read CSV from character vector
Neal Richardson created ARROW-11441: --- Summary: [R] Read CSV from character vector Key: ARROW-11441 URL: https://issues.apache.org/jira/browse/ARROW-11441 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 `readr::read_csv()` lets you read in data from a character vector, useful for (e.g.) taking the results of a system call and reading it in as a data.frame.
{code}
> readr::read_csv(c("a,b", "1,2", "3,4"))
# A tibble: 2 x 2
      a     b
1     1     2
2     3     4
{code}
One solution would be similar to ARROW-9235, perhaps, treating it as a textConnection. -- This message was sent by Atlassian Jira (v8.3.4#803005)
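To make the textConnection idea concrete, here is a base R sketch of the two obvious routes. It uses `utils::read.csv()` as a stand-in reader and does not show how the arrow package would implement this.

```r
lines <- c("a,b", "1,2", "3,4")

# Route 1: wrap the character vector in a text connection, one element per line.
con <- textConnection(lines)
df_con <- utils::read.csv(con)
close(con)  # textConnection() returns an open connection, so close it ourselves

# Route 2: collapse to a single newline-delimited buffer of bytes, which is the
# form a byte-oriented reader would need, then read that back.
buf <- charToRaw(paste(lines, collapse = "\n"))
df_raw <- utils::read.csv(text = rawToChar(buf))
```

Route 2 is closer to what a reader that already accepts raw vectors would want, at the cost of one extra copy of the data.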
[jira] [Created] (ARROW-11460) [R] Use system compression libraries if present on Linux
Neal Richardson created ARROW-11460: --- Summary: [R] Use system compression libraries if present on Linux Key: ARROW-11460 URL: https://issues.apache.org/jira/browse/ARROW-11460 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson We vendor/bundle all compression libraries and have them disabled in the default build. This is reliable, but it would be nice to use system libraries if they're present. It's not as simple as setting {{ARROW_DEPENDENCY_SOURCE=AUTO}} because we have to know if we're using them in order to set the right `-lwhatever` flags in the R package build. Maybe these can be determined from the C++ build/cmake output rather than detected outside the build (but this may require ARROW-6312). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11474) [C++] Update bundled re2 version
Neal Richardson created ARROW-11474: --- Summary: [C++] Update bundled re2 version Key: ARROW-11474 URL: https://issues.apache.org/jira/browse/ARROW-11474 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 I tried increasing the re2 version to 2020-11-01 in but it failed in a few builds with
{code}
/usr/bin/ar: /root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a: No such file or directory
make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9
make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2
{code}
(or similar). My theory is that something changed in their cmake build setup so that either libre2.a is not where we expect it, or it's building a shared library instead, or something. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11475) [C++] Upgrade mimalloc
Neal Richardson created ARROW-11475: --- Summary: [C++] Upgrade mimalloc Key: ARROW-11475 URL: https://issues.apache.org/jira/browse/ARROW-11475 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 I tried this in ARROW-11350 but ran into an issue (https://github.com/microsoft/mimalloc/issues/353). That has since been resolved and we could apply a patch to bring it in. Or we can wait for it to get into a proper release. There is also now a 1.7 release, which claims to work on the Apple M1, as well as a 2.0 version, which claims better performance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11486) [Website] jekyll build fails with Ruby 3.0
Neal Richardson created ARROW-11486: --- Summary: [Website] jekyll build fails with Ruby 3.0 Key: ARROW-11486 URL: https://issues.apache.org/jira/browse/ARROW-11486 Project: Apache Arrow Issue Type: New Feature Components: Website Reporter: Neal Richardson Assignee: Kouhei Sutou See https://github.com/apache/arrow-site/runs/1786669028?check_suite_focus=true for example. This started failing when the default ruby version increased from 2.7 to 3.0. Pinning the ruby version to 2.7 fixed it (https://github.com/apache/arrow-site/pull/92/commits/b1b8c4fc9138b28ede427967e37da70e12670969); maybe that's good enough? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11499) [Packaging] Remove all use of bintray
Neal Richardson created ARROW-11499: --- Summary: [Packaging] Remove all use of bintray Key: ARROW-11499 URL: https://issues.apache.org/jira/browse/ARROW-11499 Project: Apache Arrow Issue Type: New Feature Components: Packaging Reporter: Neal Richardson Fix For: 4.0.0 Bintray is being shut down on May 1, and possibly as early as February 28 we won't be able to upload to it. https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/ Feel free to make subtasks to break out this work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11501) [C++] endianness check does not work on Solaris
Neal Richardson created ARROW-11501: --- Summary: [C++] endianness check does not work on Solaris Key: ARROW-11501 URL: https://issues.apache.org/jira/browse/ARROW-11501 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson
{code}
In file included from /export/home/XXVfZhv/Rtemp/RtmpoK9Cps/file3f4a341e5d8f/cpp/src/arrow/type_traits.h:26:0,
                 from /export/home/XXVfZhv/Rtemp/RtmpoK9Cps/file3f4a341e5d8f/cpp/src/arrow/scalar.h:36,
                 from /export/home/XXVfZhv/Rtemp/RtmpoK9Cps/file3f4a341e5d8f/cpp/src/arrow/datum.h:28,
                 from /export/home/XXVfZhv/Rtemp/RtmpoK9Cps/file3f4a341e5d8f/cpp/src/arrow/dataset/expression.h:32,
                 from /export/home/XXVfZhv/Rtemp/RtmpoK9Cps/file3f4a341e5d8f/cpp/src/arrow/dataset/dataset.h:28,
                 from /export/home/XXVfZhv/Rtemp/RtmpoK9Cps/file3f4a341e5d8f/cpp/src/arrow/dataset/dataset.cc:18:
/export/home/XXVfZhv/Rtemp/RtmpoK9Cps/file3f4a341e5d8f/cpp/src/arrow/util/bit_util.h:26:42: fatal error: endian.h: No such file or directory
{code}
Googling the error message shows some known issues and workarounds for this on Solaris, e.g.: * https://github.com/Sereal/Sereal/issues/139 * https://gitlab.torproject.org/legacy/trac/-/issues/11426 cc [~kiszk] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11500) [R] Allow bundled build script to run on Solaris
Neal Richardson created ARROW-11500: --- Summary: [R] Allow bundled build script to run on Solaris Key: ARROW-11500 URL: https://issues.apache.org/jira/browse/ARROW-11500 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 Minor changes that allow us to at least attempt a build on Solaris. Does not resolve C++ build issues -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11507) [R] Bindings for GetRuntimeInfo
Neal Richardson created ARROW-11507: --- Summary: [R] Bindings for GetRuntimeInfo Key: ARROW-11507 URL: https://issues.apache.org/jira/browse/ARROW-11507 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11513) [R] Bindings for sub/gsub
Neal Richardson created ARROW-11513: --- Summary: [R] Bindings for sub/gsub Key: ARROW-11513 URL: https://issues.apache.org/jira/browse/ARROW-11513 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11514) [R] Bindings for str_c
Neal Richardson created ARROW-11514: --- Summary: [R] Bindings for str_c Key: ARROW-11514 URL: https://issues.apache.org/jira/browse/ARROW-11514 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11515) [R] Bindings for strsplit
Neal Richardson created ARROW-11515: --- Summary: [R] Bindings for strsplit Key: ARROW-11515 URL: https://issues.apache.org/jira/browse/ARROW-11515 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 split_pattern is the C++ compute function name -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11516) [R] Allow all C++ compute functions to be called by name in dplyr
Neal Richardson created ARROW-11516: --- Summary: [R] Allow all C++ compute functions to be called by name in dplyr Key: ARROW-11516 URL: https://issues.apache.org/jira/browse/ARROW-11516 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 Followup to ARROW-9856. Use list_compute_functions (added here) to make all Arrow C++ compute functions available directly by name (in case you want to use the non-checked arithmetic, or an ascii specific kernel, or something without a natural R analogue). Will require a bit more refactoring to handle variable numbers of args, as well as some additional options handling. -- This message was sent by Atlassian Jira (v8.3.4#803005)
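A rough sketch of the idea in base R: take the list of function names, generate one wrapper per name, and put the wrappers in an environment that sits behind the data when expressions are evaluated. Everything prefixed `toy_` is a stand-in invented for this sketch (a two-function registry and a dispatcher), not the arrow package's internals.

```r
# Stand-ins for the real registry and dispatcher (hypothetical, base R only).
toy_registry <- list(
  add        = function(x, y) x + y,
  utf8_upper = function(x) toupper(x)
)
toy_list_functions <- function() names(toy_registry)
toy_call_function <- function(name, ...) do.call(toy_registry[[name]], list(...))

# One wrapper per function name; `...` forwards any number of arguments.
make_binding <- function(name) {
  force(name)  # capture the current name, not the loop variable
  function(...) toy_call_function(name, ...)
}

# Environment holding every registered function under its own name.
build_function_mask <- function(fun_names) {
  mask <- new.env(parent = baseenv())
  for (name in fun_names) {
    assign(name, make_binding(name), envir = mask)
  }
  mask
}

mask <- build_function_mask(toy_list_functions())
cols <- list2env(list(x = c("a", "b"), n = 1:3), parent = mask)
upper <- eval(quote(utf8_upper(x)), cols)
sums  <- eval(quote(add(n, 10)), cols)
```

Forwarding `...` sidesteps the variable-arity problem in this toy version; per-function options handling would still need separate treatment.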
[jira] [Created] (ARROW-11589) [R] Add methods for modifying Schemas
Neal Richardson created ARROW-11589: --- Summary: [R] Add methods for modifying Schemas Key: ARROW-11589 URL: https://issues.apache.org/jira/browse/ARROW-11589 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Carl Boettiger Fix For: 4.0.0 $<-, [[<-, and (probably) [<- methods. We have the extracting versions implemented but not the updating ones, and that would be useful. Motivating use case: schema detection for a dataset misreads a column, so take the autodetected schema, modify one field, and then re-create the dataset with the correct schema. -- This message was sent by Atlassian Jira (v8.3.4#803005)
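To show the shape of the replacement methods, here is a self-contained base R sketch on a toy S3 class. `toy_schema` is hypothetical and stores plain type-name strings; the real Schema is an R6 object wrapping C++, so the actual methods would differ.

```r
# Hypothetical stand-in for a Schema: a named list of type names.
toy_schema <- function(...) structure(list(...), class = "toy_schema")

# `[[<-`: replace (or add) one field, returning a new toy_schema.
`[[<-.toy_schema` <- function(x, i, value) {
  stopifnot(is.character(value), length(value) == 1L)
  fields <- unclass(x)
  fields[[i]] <- value
  structure(fields, class = "toy_schema")
}

# `$<-`: same operation, delegating to the `[[<-` method above.
`$<-.toy_schema` <- function(x, name, value) {
  x[[name]] <- value
  x
}

# Motivating case: autodetection got one column wrong, so fix just that field.
detected <- toy_schema(id = "int32", when = "string")
fixed <- detected
fixed$when <- "date32"
```

The corrected schema could then be passed back when re-creating the dataset.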
[jira] [Created] (ARROW-11591) [C++] Prototype version of hash aggregation
Neal Richardson created ARROW-11591: --- Summary: [C++] Prototype version of hash aggregation Key: ARROW-11591 URL: https://issues.apache.org/jira/browse/ARROW-11591 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11610) [C++] Download boost from sourceforge instead of bintray
Neal Richardson created ARROW-11610: --- Summary: [C++] Download boost from sourceforge instead of bintray Key: ARROW-11610 URL: https://issues.apache.org/jira/browse/ARROW-11610 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 e.g. https://sourceforge.net/projects/boost/files/boost/1.67.0/boost_1_67_0.tar.gz -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11611) [C++] Move third party dependency mirrors from bintray
Neal Richardson created ARROW-11611: --- Summary: [C++] Move third party dependency mirrors from bintray Key: ARROW-11611 URL: https://issues.apache.org/jira/browse/ARROW-11611 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 We added copies of these a while back to handle rate limiting to our own bintray. We should either remove them or update and move them elsewhere. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11612) [C++] Rebuild trimmed boost bundle
Neal Richardson created ARROW-11612: --- Summary: [C++] Rebuild trimmed boost bundle Key: ARROW-11612 URL: https://issues.apache.org/jira/browse/ARROW-11612 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 And host somewhere other than bintray. We can prune it further now that we've dropped boost::regex, too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11613) [R] Move nightly C++ builds off of bintray
Neal Richardson created ARROW-11613: --- Summary: [R] Move nightly C++ builds off of bintray Key: ARROW-11613 URL: https://issues.apache.org/jira/browse/ARROW-11613 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11657) [R] group_by with .drop specified errors
Neal Richardson created ARROW-11657: --- Summary: [R] group_by with .drop specified errors Key: ARROW-11657 URL: https://issues.apache.org/jira/browse/ARROW-11657 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 cf. https://github.com/tidyverse/dplyr/issues/5763 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11658) [R] Handle mutate/rename inside group_by
Neal Richardson created ARROW-11658: --- Summary: [R] Handle mutate/rename inside group_by Key: ARROW-11658 URL: https://issues.apache.org/jira/browse/ARROW-11658 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Neal Richardson Fix For: 4.0.0 Followup to ARROW-11657 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11659) [R] Preserve group_by .drop argument
Neal Richardson created ARROW-11659: --- Summary: [R] Preserve group_by .drop argument Key: ARROW-11659 URL: https://issues.apache.org/jira/browse/ARROW-11659 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11660) [C++] Move RecordBatch::SelectColumns method from R to C++ library
Neal Richardson created ARROW-11660: --- Summary: [C++] Move RecordBatch::SelectColumns method from R to C++ library Key: ARROW-11660 URL: https://issues.apache.org/jira/browse/ARROW-11660 Project: Apache Arrow Issue Type: New Feature Components: C++, R Reporter: Neal Richardson Fix For: 4.0.0 Table has a proper SelectColumns method in the C++ library but the RecordBatch one is in the R library and should be pushed down to C++ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11672) [R] Fix string function test failure on R 3.3
Neal Richardson created ARROW-11672: --- Summary: [R] Fix string function test failure on R 3.3 Key: ARROW-11672 URL: https://issues.apache.org/jira/browse/ARROW-11672 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 https://github.com/ursacomputing/crossbow/runs/1916519092#step:7:389 This test was added in ARROW-9856 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11683) [R] Support dplyr::mutate()
Neal Richardson created ARROW-11683: --- Summary: [R] Support dplyr::mutate() Key: ARROW-11683 URL: https://issues.apache.org/jira/browse/ARROW-11683 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11693) [C++] Add string length kernel
Neal Richardson created ARROW-11693: --- Summary: [C++] Add string length kernel Key: ARROW-11693 URL: https://issues.apache.org/jira/browse/ARROW-11693 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 We have "binary_length" but that doesn't handle UTF-8 the way we need for this. Example (from R):
{code}
> string <- "áéíóú"
> nchar(string)
[1] 5
> arrow:::call_function("binary_length", Scalar$create(string))
Scalar
10
{code}
cc [~maartenbreddels] [~apitrou] [~jorisvandenbossche] -- This message was sent by Atlassian Jira (v8.3.4#803005)
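The same character/byte discrepancy can be reproduced in base R alone, which may help when writing tests for the new kernel. This sketch only illustrates the two counts and says nothing about how the C++ kernel should be implemented.

```r
# "áéíóú" written with Unicode escapes so the script's file encoding is irrelevant;
# enc2utf8() makes sure the bytes we count are the UTF-8 encoding.
string <- enc2utf8("\u00e1\u00e9\u00ed\u00f3\u00fa")

n_chars <- nchar(string, type = "chars")  # what a string length kernel should report
n_bytes <- nchar(string, type = "bytes")  # what "binary_length" reports
```

Each of these five letters takes two bytes in UTF-8, which accounts for the 5 versus 10 above.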
[jira] [Created] (ARROW-11699) [R] Implement dplyr::across()
Neal Richardson created ARROW-11699: --- Summary: [R] Implement dplyr::across() Key: ARROW-11699 URL: https://issues.apache.org/jira/browse/ARROW-11699 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson It's not a generic, but because it seems only to be called inside of functions like `mutate()`, we can insert our own version of it into the NSE data mask -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11700) [R] Internationalize error handling in tidy eval
Neal Richardson created ARROW-11700: --- Summary: [R] Internationalize error handling in tidy eval Key: ARROW-11700 URL: https://issues.apache.org/jira/browse/ARROW-11700 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 We have
{code}
tryCatch(eval_tidy(expr, mask), error = function(e) {
  # Look for the cases where bad input was given, i.e. this would fail
  # in regular dplyr anyway, and let those raise as errors;
  # else, for things not supported by Arrow return a "try-error",
  # which we'll handle differently
  msg <- conditionMessage(e)
  # TODO: internationalization?
  if (grepl("object '.*'.not.found", msg)) {
    stop(e)
  }
  if (grepl('could not find function ".*"', msg)) {
    stop(e)
  }
  invisible(structure(msg, class = "try-error", condition = e))
})
{code}
and tests for this behavior, but the tests are skipped because the patterns only match in an English locale: these base R messages are translated. We can generate these regular expressions dynamically by triggering the R errors on a known nonexistent object:
{code}
> tryCatch(X_X, error = function(e) conditionMessage(e))
[1] "object 'X_X' not found"
> tryCatch(X_X(), error = function(e) conditionMessage(e))
[1] "could not find function \"X_X\""
> sub("X_X", ".*", tryCatch(X_X, error = function(e) conditionMessage(e)))
[1] "object '.*' not found"
{code}
And this will respect i18n:
{code}
> Sys.setenv(LANGUAGE="FR_fr")
> sub("X_X", ".*", tryCatch(X_X, error = function(e) conditionMessage(e)))
[1] "objet '.*' introuvable"
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
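The approach described in ARROW-11700 could be wrapped in a small helper. The sketch below is only an illustration in base R: i18n_error_pattern() is a hypothetical name, not an existing arrow function, and it assumes that no object or function called X_X exists in the session.

```r
# Hypothetical helper (not part of arrow): trigger a real R error on a
# placeholder name, then turn the translated message into a pattern by
# replacing the placeholder with ".*".
i18n_error_pattern <- function(trigger, placeholder = "X_X") {
  msg <- tryCatch(trigger(), error = function(e) conditionMessage(e))
  sub(placeholder, ".*", msg, fixed = TRUE)
}

# Assumes X_X is defined nowhere, so both calls raise the base R errors.
object_not_found <- i18n_error_pattern(function() X_X)
function_not_found <- i18n_error_pattern(function() X_X())

# Check the pattern against a message produced for a different missing object.
msg <- tryCatch(no_such_object_here, error = function(e) conditionMessage(e))
matches <- grepl(object_not_found, msg)
```

One caveat with this design: a translated message might contain regular-expression metacharacters, in which case the generated pattern would need escaping before use.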
[jira] [Created] (ARROW-11701) [R] Implement dplyr::relocate()
Neal Richardson created ARROW-11701: --- Summary: [R] Implement dplyr::relocate() Key: ARROW-11701 URL: https://issues.apache.org/jira/browse/ARROW-11701 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 Is a generic so we can support it properly. Allows for column reordering, callable directly or with the .before/.after args to mutate(). This is something we can implement with the current C++ backend support. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11702) [R] Enable ungrouped aggregations in non-Dataset expressions
Neal Richardson created ARROW-11702: --- Summary: [R] Enable ungrouped aggregations in non-Dataset expressions Key: ARROW-11702 URL: https://issues.apache.org/jira/browse/ARROW-11702 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Things like {{mutate(table, x_norm = x / mean(x, na.rm = TRUE))}} could be supported for queries on Table/RecordBatch (but not yet on Dataset), but even so there are lots of gotchas, such as order of evaluation when building up a lazy query (i.e. evaluating aggregation before or after a filter expression that may change the value of the aggregation result). -- This message was sent by Atlassian Jira (v8.3.4#803005)
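The order-of-evaluation gotcha in ARROW-11702 can be shown with a plain data.frame. This is a base R sketch with made-up data, not Arrow code: the same normalization gives different results depending on whether the filter runs before or after the aggregation.

```r
df <- data.frame(x = c(1, 2, 3, 10))

# Aggregate first, then filter: the mean (4) is taken over all four rows.
agg_then_filter <- transform(df, x_norm = x / mean(x, na.rm = TRUE))
agg_then_filter <- agg_then_filter[agg_then_filter$x < 10, ]

# Filter first, then aggregate: the mean (2) is taken over the three kept rows.
filter_then_agg <- df[df$x < 10, , drop = FALSE]
filter_then_agg <- transform(filter_then_agg, x_norm = x / mean(x, na.rm = TRUE))

print(agg_then_filter$x_norm)  # 0.25 0.50 0.75
print(filter_then_agg$x_norm)  # 0.5 1.0 1.5
```

A lazy query has to commit to one of these two orderings, which is the design question the issue raises.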
[jira] [Created] (ARROW-11703) [R] Implement dplyr::arrange()
Neal Richardson created ARROW-11703: --- Summary: [R] Implement dplyr::arrange() Key: ARROW-11703 URL: https://issues.apache.org/jira/browse/ARROW-11703 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 Only for Table/RecordBatch for now. There are sorting functions in the compute module now (https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions) and I think they have Python bindings already. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11704) [R] Wire up dplyr::mutate() for datasets
Neal Richardson created ARROW-11704: --- Summary: [R] Wire up dplyr::mutate() for datasets Key: ARROW-11704 URL: https://issues.apache.org/jira/browse/ARROW-11704 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11705) [R] Support scalar value recycling in RecordBatch/Table$create()
Neal Richardson created ARROW-11705: --- Summary: [R] Support scalar value recycling in RecordBatch/Table$create() Key: ARROW-11705 URL: https://issues.apache.org/jira/browse/ARROW-11705 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 Compare:
{code}
> tibble::tibble(a=1:5, b = 42)
# A tibble: 5 x 2
      a     b
1     1    42
2     2    42
3     3    42
4     4    42
5     5    42
> arrow::record_batch(a=1:5, b = 42)
Error: Invalid: All arrays must have the same length
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
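The recycling rule requested in ARROW-11705 might look roughly like the following. This is a base R sketch under the assumption that only length-1 values are recycled, as tibble does; recycle_columns() is a hypothetical helper, not an arrow function, and its error message is made up.

```r
# Hypothetical helper: recycle length-1 columns to the common length and
# reject any other length mismatch.
recycle_columns <- function(...) {
  cols <- list(...)
  lens <- lengths(cols)
  non_scalar <- unique(lens[lens != 1L])
  if (length(non_scalar) > 1L) {
    stop("All columns must have the same length or be length 1")
  }
  # If every column is length 1 there is nothing to recycle to.
  target <- if (length(non_scalar) == 1L) non_scalar else 1L
  lapply(cols, function(col) if (length(col) == 1L) rep(col, target) else col)
}

cols <- recycle_columns(a = 1:5, b = 42)
str(cols)
```

The recycled list could then be handed to a constructor that insists on equal lengths.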
[jira] [Created] (ARROW-11734) [C++] vendored safe-math.h does not compile on Solaris
Neal Richardson created ARROW-11734: --- Summary: [C++] vendored safe-math.h does not compile on Solaris Key: ARROW-11734 URL: https://issues.apache.org/jira/browse/ARROW-11734 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson Assignee: Neal Richardson -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11735) [R] Allow parquet to be an optional component like S3
Neal Richardson created ARROW-11735: --- Summary: [R] Allow parquet to be an optional component like S3 Key: ARROW-11735 URL: https://issues.apache.org/jira/browse/ARROW-11735 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Neal Richardson Fix For: 4.0.0 Parquet requires thrift and it seems that thrift (at least as of version 0.12) does not compile on Solaris. We could debug that, or we could also make Parquet an optional feature in the R bindings. That might have some value anyway so that one could build a lighter/minimal R package, if that were helpful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11736) [R] Allow string compute functions to be optional
Neal Richardson created ARROW-11736: --- Summary: [R] Allow string compute functions to be optional Key: ARROW-11736 URL: https://issues.apache.org/jira/browse/ARROW-11736 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Neal Richardson Fix For: 4.0.0 The Solaris build fails to build {{libarrow_bundled_dependencies.a}} because of some mismatch of arguments to the {{ar}} command:
{code}
[ 19%] Bundling /export/home/XnknpBn/Rtemp/RtmpBOhxfH/file66df7a592ae4/release/libarrow_bundled_dependencies.a
gmake[2]: Entering directory '/export/home/XnknpBn/Rtemp/RtmpBOhxfH/file66df7a592ae4'
usage: ar -d[-SvV] archive file ...
       ar -m[-abiSvV] [posname] archive file ...
       ar -p[-vV][-sS] archive [file ...]
       ar -q[-cuvSV] [-abi] [posname] [file ...]
       ar -r[-cuvSV] [-abi] [posname] [file ...]
       ar -t[-vV][-sS] archive [file ...]
       ar -x[-vV][-sSCT] archive [file ...]
gmake[2]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/build.make:61: release/libarrow_bundled_dependencies.a] Error 1
{code}
If ARROW_PARQUET=OFF (ARROW-11735), the only dependencies to bundle are re2 and utf8proc. So we could either fix the {{ar}} invocation, or we could make re2 and utf8proc optional. Build-wise, they are optional, but we have some tests that call the string kernels, and we'd need to know that they should be skipped (i.e. another option in {{skip_if_not_available()}}).
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11737) [C++] Patch vendored xxhash for Solaris
Neal Richardson created ARROW-11737: --- Summary: [C++] Patch vendored xxhash for Solaris Key: ARROW-11737 URL: https://issues.apache.org/jira/browse/ARROW-11737 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 It fails to compile, but interestingly just as I was looking into the error, I see that the issue has been fixed _today_ in xxhash: https://github.com/Cyan4973/xxHash/pull/498 So I think we just need to apply this patch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11740) [C++] posix_memalign not declared in scope on Solaris
Neal Richardson created ARROW-11740: --- Summary: [C++] posix_memalign not declared in scope on Solaris Key: ARROW-11740 URL: https://issues.apache.org/jira/browse/ARROW-11740 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson
{code}
[ 27%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/memory_pool.cc.o
/export/home/X4HzInm/Rtemp/Rtmp1Zx7Xc/file1f6372fd66ce/cpp/src/arrow/memory_pool.cc: In static member function ‘static arrow::Status arrow::{anonymous}::SystemAllocator::AllocateAligned(int64_t, uint8_t**)’:
/export/home/X4HzInm/Rtemp/Rtmp1Zx7Xc/file1f6372fd66ce/cpp/src/arrow/memory_pool.cc:187:64: error: ‘posix_memalign’ was not declared in this scope
 static_cast(size));
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11752) [R] Replace usage of testthat::expect_is()
Neal Richardson created ARROW-11752: --- Summary: [R] Replace usage of testthat::expect_is() Key: ARROW-11752 URL: https://issues.apache.org/jira/browse/ARROW-11752 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Fix For: 4.0.0 Per https://testthat.r-lib.org/reference/expect_is.html it has been superseded. We have ~180 instances of it in our tests that should be upgraded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11754) [R] Support dplyr::compute()
Neal Richardson created ARROW-11754: --- Summary: [R] Support dplyr::compute() Key: ARROW-11754 URL: https://issues.apache.org/jira/browse/ARROW-11754 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson See discussion at https://github.com/apache/arrow/pull/9521#discussion_r581367505 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11755) [R] Add tests from dplyr/test-mutate.r
Neal Richardson created ARROW-11755: --- Summary: [R] Add tests from dplyr/test-mutate.r Key: ARROW-11755 URL: https://issues.apache.org/jira/browse/ARROW-11755 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Fix For: 4.0.0 Review https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r and port tests over to arrow as needed to see if there are edge cases we aren't covering appropriately. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11785) [R] Fallback when filtering Table with if_any() expression fails
Neal Richardson created ARROW-11785: --- Summary: [R] Fallback when filtering Table with if_any() expression fails Key: ARROW-11785 URL: https://issues.apache.org/jira/browse/ARROW-11785 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Neal Richardson Fix For: 4.0.0
{code}
> iris %>% record_batch() %>%
+   filter(if_any(ends_with("Width"), ~ . > 4))
Warning: Filter expression not implemented in Arrow: if_any(ends_with("Width"), ~. > 4); pulling data into R
Error: Cannot extract rows with an object of class NULL
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11832) [R] Handle conversion of extra nested struct column
Neal Richardson created ARROW-11832: --- Summary: [R] Handle conversion of extra nested struct column Key: ARROW-11832 URL: https://issues.apache.org/jira/browse/ARROW-11832 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Neal Richardson Assignee: Romain Francois Fix For: 4.0.0 Followup to ARROW-10570. See https://github.com/apache/arrow/pull/8650/#issuecomment-788404473 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11864) [R] Document arrow.int64_downcast option
Neal Richardson created ARROW-11864: --- Summary: [R] Document arrow.int64_downcast option Key: ARROW-11864 URL: https://issues.apache.org/jira/browse/ARROW-11864 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Matthew Summersgill Fix For: 4.0.0 See ARROW-9083 and discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11878) [C++] Improve Converter API to support chunking
Neal Richardson created ARROW-11878: --- Summary: [C++] Improve Converter API to support chunking Key: ARROW-11878 URL: https://issues.apache.org/jira/browse/ARROW-11878 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 We would like to be able to chunk a data frame when converting to Arrow Table in R (see ARROW-9293). Apparently this is also not supported in pyarrow. [~romainfrancois] says two things need to happen:
- The Converter API needs to be able to Extend() a range of values. The current API only has the {{Status Extend(SEXP x, int64_t size)}} override, which says: ingest that vector x, and by the way it has this many elements.
- Chunker, or perhaps another/new class, would sit on top of that, and perhaps {{Chunker::Extend(x)}} would call {{Converter$Extend(x, start, size)}} multiple times (once for each chunk).
The current chunker solves, I believe, a different problem and is rooted in a Converter that deals with elements one by one so that:
- if the element can be Append() that’s fine
- if not, then create a new chunk and try again
The current chunker has a multiple-element method, but it’s all or nothing:
{code}
// we could get bit smarter here since the whole batch of appendable values
// will be rejected if a capacity error is raised
Status Extend(InputType values, int64_t size) {
  auto status = converter_->Extend(values, size);
  if (ARROW_PREDICT_FALSE(status.IsCapacityError())) {
    if (converter_->builder()->length() == 0) {
      return status;
    }
    ARROW_RETURN_NOT_OK(FinishChunk());
    return Extend(values, size);
  }
  length_ += size;
  return status;
}
{code}
This does not give a way to say e.g. take this vector and chunk it into arrays of this size, which is what we want. cc [~kszucs] [~bkietz]
-- This message was sent by Atlassian Jira (v8.3.4#803005)
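The "take this vector and chunk it into arrays of this size" behavior asked for in ARROW-11878 comes down to computing (start, size) ranges. The sketch below does this in base R as an illustration only: chunk_ranges() is a hypothetical helper, the 0-based starts are an assumption about how the proposed Extend(x, start, size) would index, and none of this is the Converter API.

```r
# Hypothetical helper: (start, size) ranges covering n elements in chunks of
# at most chunk_size. Starts are 0-based (an assumption, see above).
chunk_ranges <- function(n, chunk_size) {
  stopifnot(n >= 0, chunk_size >= 1)
  n_chunks <- ceiling(n / chunk_size)
  starts <- (seq_len(n_chunks) - 1) * chunk_size
  data.frame(start = starts, size = pmin(chunk_size, n - starts))
}

x <- 1:10
ranges <- chunk_ranges(length(x), chunk_size = 4)

# Stand-in for calling Extend(x, start, size) once per chunk.
chunks <- lapply(seq_len(nrow(ranges)), function(i) {
  x[ranges$start[i] + seq_len(ranges$size[i])]
})
print(ranges)
```

For ten elements and a chunk size of four this gives ranges of sizes 4, 4 and 2, with the last chunk simply shorter than the rest.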
[jira] [Created] (ARROW-11912) [R] Remove args from FeatherReader$create
Neal Richardson created ARROW-11912: --- Summary: [R] Remove args from FeatherReader$create Key: ARROW-11912 URL: https://issues.apache.org/jira/browse/ARROW-11912 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 They aren't used anymore because FeatherReader$create() now requires that you provide it a file connection. (We leaked connections before when it accepted a string file path and opened a connection if needed.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11921) [R] Set LC_COLLATE in r/data-raw/codegen.R
Neal Richardson created ARROW-11921: --- Summary: [R] Set LC_COLLATE in r/data-raw/codegen.R Key: ARROW-11921 URL: https://issues.apache.org/jira/browse/ARROW-11921 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 So that the sort order of the generated wrapping code is stable across machines. Otherwise we'll keep thrashing on arrowExports.cpp whenever different people rebuild things (cf. https://github.com/apache/arrow/commit/21999ecd3cf2b9141e182c648eb13ab3836500d0#diff-f6ded32632f8b1516f0e852b8e648af02be39e60010c546a17502d1830245076). -- This message was sent by Atlassian Jira (v8.3.4#803005)
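Why the collation locale matters for ARROW-11921 can be seen with a short base R example. This is a sketch only, with made-up names standing in for generated function names; it is not the actual change to codegen.R.

```r
x <- c("b", "A", "a", "B", "_z")

# Locale-dependent: the order can differ from machine to machine.
locale_sorted <- sort(x)

# Pin the collation locale to "C" so the order is plain byte order everywhere,
# then restore the previous setting.
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
c_sorted <- sort(x)
invisible(Sys.setlocale("LC_COLLATE", old_collate))

print(c_sorted)  # "A" "B" "_z" "a" "b"
```

In the "C" locale uppercase letters sort before the underscore, which sorts before lowercase letters, so the generated file no longer depends on who ran the script.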
[jira] [Created] (ARROW-11950) [C++] Add unary negative kernel
Neal Richardson created ARROW-11950: --- Summary: [C++] Add unary negative kernel Key: ARROW-11950 URL: https://issues.apache.org/jira/browse/ARROW-11950 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Assignee: Eduardo Ponce Fix For: 4.0.0 Related to ARROW-11945. So that you can have an expression like {{-col}}. You can approximate this with doing {{0 - col}}, but I would guess it could be done more efficiently. cc [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11954) [C++] arrow/util/io_util.cc does not compile on Solaris
Neal Richardson created ARROW-11954: --- Summary: [C++] arrow/util/io_util.cc does not compile on Solaris Key: ARROW-11954 URL: https://issues.apache.org/jira/browse/ARROW-11954 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson Looks similar to ARROW-11740
{code}
/export/home/XI4sjNd/Rtemp/RtmpvN4Lx2/fileef105d2909/cpp/src/arrow/util/io_util.cc: In function ‘arrow::Status arrow::internal::MemoryMapRemap(void*, std::size_t, std::size_t, int, void**)’:
/export/home/XI4sjNd/Rtemp/RtmpvN4Lx2/fileef105d2909/cpp/src/arrow/util/io_util.cc:1089:48: error: ‘MREMAP_MAYMOVE’ was not declared in this scope
  *new_addr = mremap(addr, old_size, new_size, MREMAP_MAYMOVE);
                                               ^
/export/home/XI4sjNd/Rtemp/RtmpvN4Lx2/fileef105d2909/cpp/src/arrow/util/io_util.cc:1089:62: error: ‘mremap’ was not declared in this scope
  *new_addr = mremap(addr, old_size, new_size, MREMAP_MAYMOVE);
                                                             ^
/export/home/XI4sjNd/Rtemp/RtmpvN4Lx2/fileef105d2909/cpp/src/arrow/util/io_util.cc: In function ‘arrow::Status arrow::internal::MemoryAdviseWillNeed(const std::vector&)’:
/export/home/XI4sjNd/Rtemp/RtmpvN4Lx2/fileef105d2909/cpp/src/arrow/util/io_util.cc:1144:59: error: ‘POSIX_MADV_WILLNEED’ was not declared in this scope
      int err = posix_madvise(aligned.addr, aligned.size, POSIX_MADV_WILLNEED);
                                                          ^
/export/home/XI4sjNd/Rtemp/RtmpvN4Lx2/fileef105d2909/cpp/src/arrow/util/io_util.cc:1144:78: error: ‘posix_madvise’ was not declared in this scope
      int err = posix_madvise(aligned.addr, aligned.size, POSIX_MADV_WILLNEED);
                                                                             ^
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11993) [C++] Don't download xsimd if ARROW_SIMD_LEVEL=NONE
Neal Richardson created ARROW-11993: --- Summary: [C++] Don't download xsimd if ARROW_SIMD_LEVEL=NONE Key: ARROW-11993 URL: https://issues.apache.org/jira/browse/ARROW-11993 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson It doesn't get used if SIMD level is NONE, so we shouldn't bother downloading it. cc [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11994) [R] Build fails if dataset enabled but parquet is not
Neal Richardson created ARROW-11994: --- Summary: [R] Build fails if dataset enabled but parquet is not Key: ARROW-11994 URL: https://issues.apache.org/jira/browse/ARROW-11994 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Neal Richardson Following ARROW-11735; discovered while working on ARROW-10734. The arrow::dataset::ParquetFileFormat and related classes require both dataset and parquet. The {{#if defined}} logic in r/src/dataset.cpp is right in requiring both, but the wrapping generated for arrowExports.cpp is driven only by the annotation on the functions, {{[[dataset::export]]}}. So the ParquetFileFormat methods in arrowExports.cpp are guarded only by {{#if defined(ARROW_R_WITH_DATASET)}} and fail if parquet is not available. Not a priority to fix (for Solaris I can turn off ARROW_DATASET and avoid this), just wanted to note it in case we need to revisit this wrapping logic later anyway. cc [~icook] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11996) [R] Make r/configure run successfully on Solaris
Neal Richardson created ARROW-11996: --- Summary: [R] Make r/configure run successfully on Solaris Key: ARROW-11996 URL: https://issues.apache.org/jira/browse/ARROW-11996 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Neal Richardson Assignee: Neal Richardson Replace some {{$()}} with backticks and use {{sed}} in a safe way -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12081) [R] Bindings for utf8_length
Neal Richardson created ARROW-12081: --- Summary: [R] Bindings for utf8_length Key: ARROW-12081 URL: https://issues.apache.org/jira/browse/ARROW-12081 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 Following ARROW-11693 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12085) [R] Installation on ppc64le
Neal Richardson created ARROW-12085: --- Summary: [R] Installation on ppc64le Key: ARROW-12085 URL: https://issues.apache.org/jira/browse/ARROW-12085 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson From https://github.com/apache/arrow/issues/9747 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12094) [C++][R] Fix/workaround re2 building on clang/libc++
Neal Richardson created ARROW-12094: --- Summary: [C++][R] Fix/workaround re2 building on clang/libc++ Key: ARROW-12094 URL: https://issues.apache.org/jira/browse/ARROW-12094 Project: Apache Arrow Issue Type: New Feature Components: C++, R Reporter: Neal Richardson Fix For: 4.0.0 See https://github.com/apache/arrow/pull/8468#issuecomment-807807284. We either need to fix the build (maybe there's something not getting passed through to build_re2 correctly in cmake) or figure out the conditions under which the C++ build should turn off re2. See also ARROW-11736 to make regex compute functions optional in R tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12095) [CI][C++] Add nightly job to test offline build
Neal Richardson created ARROW-12095: --- Summary: [CI][C++] Add nightly job to test offline build Key: ARROW-12095 URL: https://issues.apache.org/jira/browse/ARROW-12095 Project: Apache Arrow Issue Type: New Feature Components: C++, Continuous Integration Reporter: Neal Richardson Fix For: 5.0.0 See discussion on https://github.com/apache/arrow/pull/9803 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12128) [CI][Crossbow] Remove (or fix) test-ubuntu-16.04-cpp job
Neal Richardson created ARROW-12128: --- Summary: [CI][Crossbow] Remove (or fix) test-ubuntu-16.04-cpp job Key: ARROW-12128 URL: https://issues.apache.org/jira/browse/ARROW-12128 Project: Apache Arrow Issue Type: New Feature Components: C++, Continuous Integration Reporter: Neal Richardson Fix For: 4.0.0 ARROW-8049 increased the minimum cmake version required for bundled thrift to 3.10, which is not what 16.04 ships. We removed the 16.04 packaging jobs in ARROW-11910 because 16.04 reaches EOL in April 2021, but we still have a nightly job that is failing and other related materials (Dockerfile etc.) for 16.04. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12134) [C++] Add regex string match kernel
Neal Richardson created ARROW-12134: --- Summary: [C++] Add regex string match kernel Key: ARROW-12134 URL: https://issues.apache.org/jira/browse/ARROW-12134 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 We have a basic {{match_substring}} kernel already but not a regular expression one. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12137) [R] New/improved vignette on dplyr features
Neal Richardson created ARROW-12137: --- Summary: [R] New/improved vignette on dplyr features Key: ARROW-12137 URL: https://issues.apache.org/jira/browse/ARROW-12137 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Ian Cook Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12141) [R] Bindings for grepl
Neal Richardson created ARROW-12141: --- Summary: [R] Bindings for grepl Key: ARROW-12141 URL: https://issues.apache.org/jira/browse/ARROW-12141 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 Depends on ARROW-12134. Use {{match_substring_regex}} for regular expressions and {{match_substring}} for the {{fixed = TRUE}} version. Also map to {{stringr::str_detect}} as appropriate. -- This message was sent by Atlassian Jira (v8.3.4#803005)
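For reference, the two grepl() behaviors that the bindings in ARROW-12141 need to cover differ as follows. This is plain base R with made-up inputs; it does not call the Arrow kernels.

```r
x <- c("abc", "a.c")

# Default: the pattern is a regular expression, so "." matches any character.
regex_match <- grepl("a.c", x)

# fixed = TRUE: the pattern is a literal substring, so only "a.c" matches.
fixed_match <- grepl("a.c", x, fixed = TRUE)

print(regex_match)  # TRUE TRUE
print(fixed_match)  # FALSE TRUE
```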
[jira] [Created] (ARROW-12198) [R] bindings for strptime
Neal Richardson created ARROW-12198: --- Summary: [R] bindings for strptime Key: ARROW-12198 URL: https://issues.apache.org/jira/browse/ARROW-12198 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12199) [R] bindings for stddev, variance
Neal Richardson created ARROW-12199: --- Summary: [R] bindings for stddev, variance Key: ARROW-12199 URL: https://issues.apache.org/jira/browse/ARROW-12199 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12197) [R] dplyr bindings for cast, dictionary_encode
Neal Richardson created ARROW-12197: --- Summary: [R] dplyr bindings for cast, dictionary_encode Key: ARROW-12197 URL: https://issues.apache.org/jira/browse/ARROW-12197 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12200) [R] Export and document list_compute_functions
Neal Richardson created ARROW-12200: --- Summary: [R] Export and document list_compute_functions Key: ARROW-12200 URL: https://issues.apache.org/jira/browse/ARROW-12200 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 4.0.0 Since compute functions can be called in dplyr now, we should export and document {{list_compute_functions()}}. Note that not all compute functions are suitable to work in filter/mutate, and some will require custom C++ wiring for the FunctionOptions. But many/most just work now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12212) [R][CI] Test nightly on solaris
Neal Richardson created ARROW-12212: --- Summary: [R][CI] Test nightly on solaris Key: ARROW-12212 URL: https://issues.apache.org/jira/browse/ARROW-12212 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration, R Reporter: Neal Richardson Followup to ARROW-10734. Setting up a Solaris VM on GitHub Actions may be possible. We can try to set up https://github.com/vmactions/solaris-vm with R from https://files.r-hub.io/opencsw/. A temporary solution could be a nightly r-hub build kicked off by the arrow-r-nightly CI; it would email me with the results. Not ideal, but it would at least alert us to issues closer to when they are merged and not just at release time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12213) [R] copy_files doesn't make it easy to copy a single file
Neal Richardson created ARROW-12213: --- Summary: [R] copy_files doesn't make it easy to copy a single file Key: ARROW-12213 URL: https://issues.apache.org/jira/browse/ARROW-12213 Project: Apache Arrow Issue Type: New Feature Components: C++, R Reporter: Neal Richardson copy_files (i.e. fs::CopyFiles) makes it trivial to recursively copy a directory/bucket to or from S3, but I'm having a hard time downloading a single file. cc [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12236) [R][CI] Add check that all docs pages are listed in _pkgdown.yml
Neal Richardson created ARROW-12236: --- Summary: [R][CI] Add check that all docs pages are listed in _pkgdown.yml Key: ARROW-12236 URL: https://issues.apache.org/jira/browse/ARROW-12236 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration, R Reporter: Neal Richardson Our (external) nightly R packaging and docs build is failing to render the pkgdown site: https://github.com/ursa-labs/arrow-r-nightly/runs/2266551062?check_suite_focus=true#step:9:55 This is due to (1) a [new-ish change in pkgdown|https://github.com/r-lib/pkgdown/pull/1395] that errors if topics are not included and (2) the recent addition of FragmentScanOptions, which did not get added to _pkgdown.yml. We should validate this on our regular CI in order to prevent future issues like this. We often have to add things to _pkgdown.yml right at release time, and it would be better to keep up as we go. Some ideas for how:
* Add a step to an existing R workflow (e.g. https://github.com/apache/arrow/blob/master/.github/workflows/r.yml#L60) that does this check
* Add a new workflow that is triggered only on changes to `r/man` and `r/_pkgdown.yml`
* In either case, this could be done as a bash script, a python script, or an R script. If using R, note that the docker-based CI jobs won't have R installed, so you might want to tack it onto one of the windows jobs (which uses the setup-r action), but then you're in windows.
* You could install pkgdown and try to build the site, but that's a lot of dependency to download and install just to essentially compare some lines in a yaml file with a directory listing (i.e., make sure that all {{r/man/*.Rd}} have corresponding entries in the reference part of the yml), so python or even a bash script might be more efficient to run. And since this is going to run a lot, it's worth considering how to keep runtime down even if that means more work to set it up.
* If you're scripting this standalone, I think you'll need to filter out Rd files that have {{\keyword{internal}}}, as pkgdown excludes those from the reference list.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
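The comparison described in ARROW-12236 could be sketched as below, here in base R with no yaml parser. Everything in it is an assumption for illustration: missing_pkgdown_topics() is a hypothetical helper, topics are taken from Rd file names, and the yml is scanned line by line for plain "- topic" entries, so selector entries such as starts_with() and aliases are not handled. The demo files are created in a temporary directory and are not the real r/man contents.

```r
# Hypothetical helper: Rd topics (by file name) that are not listed as plain
# "- topic" items in the yml. Rd files marked \keyword{internal} are skipped.
missing_pkgdown_topics <- function(man_dir, yml_path) {
  rd_files <- list.files(man_dir, pattern = "\\.Rd$", full.names = TRUE)
  is_internal <- vapply(rd_files, function(f) {
    any(grepl("\\keyword{internal}", readLines(f, warn = FALSE), fixed = TRUE))
  }, logical(1))
  topics <- sub("\\.Rd$", "", basename(rd_files[!is_internal]))

  # Crude scan: keep list items that consist of a single token.
  item <- "^[[:space:]]*-[[:space:]]+([^[:space:]]+)[[:space:]]*$"
  yml <- readLines(yml_path, warn = FALSE)
  listed <- sub(item, "\\1", grep(item, yml, value = TRUE))
  setdiff(topics, listed)
}

# Demo on throwaway files (made-up contents).
man_dir <- file.path(tempdir(), "man-demo")
dir.create(man_dir, showWarnings = FALSE)
writeLines("\\name{read_parquet}", file.path(man_dir, "read_parquet.Rd"))
writeLines("\\name{FragmentScanOptions}", file.path(man_dir, "FragmentScanOptions.Rd"))
writeLines(c("\\name{helper}", "\\keyword{internal}"), file.path(man_dir, "helper.Rd"))
yml_path <- file.path(tempdir(), "_pkgdown-demo.yml")
writeLines(c("reference:", "- title: Demo", "  contents:", "  - read_parquet"), yml_path)

unlisted <- missing_pkgdown_topics(man_dir, yml_path)
print(unlisted)
```

A CI step built on this idea would fail whenever the returned vector is non-empty. The issue notes that a python or bash script may be cheaper to run on CI than anything requiring R.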
[jira] [Created] (ARROW-12304) [R] Update news and polish docs for 4.0
Neal Richardson created ARROW-12304: --- Summary: [R] Update news and polish docs for 4.0 Key: ARROW-12304 URL: https://issues.apache.org/jira/browse/ARROW-12304 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12316) [C++] Switch default memory allocator from jemalloc to mimalloc
Neal Richardson created ARROW-12316: --- Summary: [C++] Switch default memory allocator from jemalloc to mimalloc Key: ARROW-12316 URL: https://issues.apache.org/jira/browse/ARROW-12316 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 Benchmarking shows that mimalloc seems to be faster on real workflows (at least on macOS, still collecting data on Ubuntu). We could switch the default memory pool cases so that mimalloc is preferred. cc [~jonkeane] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12356) [Website] Update install page instructions to point to artifactory
Neal Richardson created ARROW-12356: --- Summary: [Website] Update install page instructions to point to artifactory Key: ARROW-12356 URL: https://issues.apache.org/jira/browse/ARROW-12356 Project: Apache Arrow Issue Type: Sub-task Components: Website Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 Looks like packages for old versions have been moved over, even if we can't upload new ones yet. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12370) [R] Bindings for power kernel
Neal Richardson created ARROW-12370: --- Summary: [R] Bindings for power kernel Key: ARROW-12370 URL: https://issues.apache.org/jira/browse/ARROW-12370 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 5.0.0 C++ implemented in ARROW-11070. There is a TODO in expression.R that references this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12571) [R][CI] Run nightly R with valgrind
Neal Richardson created ARROW-12571: --- Summary: [R][CI] Run nightly R with valgrind Key: ARROW-12571 URL: https://issues.apache.org/jira/browse/ARROW-12571 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration, R Reporter: Neal Richardson Fix For: 5.0.0 The wch/r-debug container in which we run the ASAN/UBSAN sanitizer job also has a valgrind version of R: https://github.com/wch/r-debug#docker-image-for-debugging-r-memory-problems According to https://www.stats.ox.ac.uk/pub/bdr/memtests/README.txt, we possibly also should run R CMD check with --use-valgrind. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12575) [R] Use unary negative kernel
Neal Richardson created ARROW-12575: --- Summary: [R] Use unary negative kernel Key: ARROW-12575 URL: https://issues.apache.org/jira/browse/ARROW-12575 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 5.0.0 Followup to ARROW-11950. Grep for that issue number in the r directory to see where to make changes. https://github.com/apache/arrow/pull/10113/files#diff-ce5b94577014735990903d3d03bd4ea4b8c8e6d32f5227592e60b7dd6a912d59 shows what the new compute function is called. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12620) [C++] Dataset writing can only include projected columns if input columns are also included
Neal Richardson created ARROW-12620: --- Summary: [C++] Dataset writing can only include projected columns if input columns are also included Key: ARROW-12620 URL: https://issues.apache.org/jira/browse/ARROW-12620 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 4.0.0 Reporter: Neal Richardson

I discovered this while working on https://github.com/apache/arrow/pull/10191. You can project new columns when writing a dataset, but only if they are derived from columns that are included in the output. Here's an R-based example:

{code}
library(arrow)
library(dplyr)

# ds_dir was not defined in the original report; a temporary directory is assumed
ds_dir <- tempfile()

# Simple function to write and re-open the new dataset
write_then_open <- function(ds, path, ...) {
  write_dataset(ds, path, ...)
  open_dataset(path)
}

tab <- Table$create(a = 1:5)

tab %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       a
#
# 1     1
# 2     2
# 3     3
# 4     4
# 5     5

# If you rename a column, it's all nulls
tab %>%
  select(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       b
#
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA

# If you derive a new column and keep the original, it works
tab %>%
  mutate(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 2
#       a     b
#
# 1     1     1
# 2     2     2
# 3     3     3
# 4     4     4
# 5     5     5

# transmute() only keeps the added columns, so it also illustrates the failure
tab %>%
  transmute(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       b
#
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12633) [C++] Query engine v0 umbrella issue
Neal Richardson created ARROW-12633: --- Summary: [C++] Query engine v0 umbrella issue Key: ARROW-12633 URL: https://issues.apache.org/jira/browse/ARROW-12633 Project: Apache Arrow Issue Type: New Feature Components: C++, Python, R Reporter: Neal Richardson Fix For: 5.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12689) [R] Implement ArrowArrayStream C interface
Neal Richardson created ARROW-12689: --- Summary: [R] Implement ArrowArrayStream C interface Key: ARROW-12689 URL: https://issues.apache.org/jira/browse/ARROW-12689 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Fix For: 5.0.0 See https://github.com/apache/arrow/commit/97879eb970bac52d93d2247200b9ca7acf6f3f93, which adds it and also adds Python bindings. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12688) [R] Use DuckDB to query an Arrow Dataset
Neal Richardson created ARROW-12688: --- Summary: [R] Use DuckDB to query an Arrow Dataset Key: ARROW-12688 URL: https://issues.apache.org/jira/browse/ARROW-12688 Project: Apache Arrow Issue Type: New Feature Components: C++, R Reporter: Neal Richardson DuckDB can read data from an Arrow C-interface stream. Once we can provide that struct from R, presumably DuckDB could query on that stream. A first step is just connecting the pieces. A second step would be to handle parts of the DuckDB query and push down filtering/projection to Arrow. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12694) [R] rtools35 job failing on 32-bit build tests
Neal Richardson created ARROW-12694: --- Summary: [R] rtools35 job failing on 32-bit build tests Key: ARROW-12694 URL: https://issues.apache.org/jira/browse/ARROW-12694 Project: Apache Arrow Issue Type: New Feature Components: C++, R Reporter: Neal Richardson See https://github.com/apache/arrow/actions/workflows/r.yml?query=branch%3Amaster. This started when ARROW-9697 (CountRows for Scanner) merged. It's only failing on rtools35 (aka gcc 4.9), and only on the 32-bit build (i386). Since there's no output about what failed, it's probably a segfault. The easiest way to get more information is to flip this {{if: false}} to true and let it print detailed output about where it was when it died: https://github.com/apache/arrow/blob/master/.github/workflows/r.yml#L186 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12731) [R] Use InMemoryDataset for Table/RecordBatch in dplyr code
Neal Richardson created ARROW-12731: --- Summary: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code Key: ARROW-12731 URL: https://issues.apache.org/jira/browse/ARROW-12731 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 5.0.0 This lets us consolidate our Expression handling code and prepares us for more query evaluation in the near future. As a bonus, it should also simplify our dplyr NSE function definition and make it easier to add and test them going forward. -- This message was sent by Atlassian Jira (v8.3.4#803005)