[jira] [Created] (ARROW-12505) [Python] Reconcile LICENSE.txt with top-level LICENSE.txt
Antoine Pitrou created ARROW-12505: -- Summary: [Python] Reconcile LICENSE.txt with top-level LICENSE.txt Key: ARROW-12505 URL: https://issues.apache.org/jira/browse/ARROW-12505 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Antoine Pitrou Fix For: 5.0.0 The {{python}} directory has a {{LICENSE.txt}} file that seems intermittently maintained. Instead, PyArrow should always refer to the top-level {{LICENSE.txt}} (i.e. remove {{python/LICENSE.txt}}?). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12504) [Rust] Buffer::from_slice_ref incorrect capacity
Raphael Taylor-Davies created ARROW-12504: - Summary: [Rust] Buffer::from_slice_ref incorrect capacity Key: ARROW-12504 URL: https://issues.apache.org/jira/browse/ARROW-12504 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies Buffer::from_slice_ref sets the capacity without taking into account the size of the slice elements -- This message was sent by Atlassian Jira (v8.3.4#803005)
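The element-count vs. byte-count distinction at issue can be shown with a short Python stdlib sketch (an illustration only, not Arrow code):

```python
from array import array

# A buffer's byte capacity is element count times element size; the issue
# reports that Buffer::from_slice_ref uses the element count directly,
# skipping the scaling by size_of::<T>().
values = array("d", [1.0, 2.0, 3.0])  # three 64-bit floats
assert values.itemsize == 8
byte_capacity = len(values) * values.itemsize  # 24 bytes, not 3
```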
[jira] [Created] (ARROW-12506) [Python] Improve modularity of pyarrow codebase to speedup compile time
Alessandro Molina created ARROW-12506: - Summary: [Python] Improve modularity of pyarrow codebase to speedup compile time Key: ARROW-12506 URL: https://issues.apache.org/jira/browse/ARROW-12506 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Alessandro Molina Some modules in pyarrow end up being fairly big to compile because they are mostly built by including other `pxi` / `pxd` files. As a result, whenever one of those files changes, a big module has to be recompiled, which slows down development when experimenting (it's not uncommon for recompiling `libarrow` to take less time than recompiling `pyarrow`). It would be convenient to split these into separate modules producing separate object files, so the compiler can recompile smaller chunks at a time: when a change is made, we wouldn't have to recompile the whole `lib.pyx`, only the module the change is isolated to. The goal is faster iteration on pyarrow by reducing the time spent waiting for Cython compilation on each change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12507) [CI] Remove duplicated cron/nightly builds
Krisztian Szucs created ARROW-12507: --- Summary: [CI] Remove duplicated cron/nightly builds Key: ARROW-12507 URL: https://issues.apache.org/jira/browse/ARROW-12507 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 5.0.0 There are builds duplicated between the GHA cron jobs and crossbow nightlies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12509) [C++] More fine-grained control of file creation in filesystem layer
Antoine Pitrou created ARROW-12509: -- Summary: [C++] More fine-grained control of file creation in filesystem layer Key: ARROW-12509 URL: https://issues.apache.org/jira/browse/ARROW-12509 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou {{FileSystem::OpenOutputStream}} silently truncates an existing file. It would be better to give more control to the user. Ideally, one could choose between several options: "overwrite, failing if the file doesn't exist", "overwrite if it exists, otherwise create", "create if it doesn't exist, otherwise fail". One should research whether e.g. S3 supports such control. -- This message was sent by Atlassian Jira (v8.3.4#803005)
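The three policies correspond to standard POSIX open(2) flag combinations; a minimal Python sketch of the distinction (the mode names here are illustrative, not a proposed Arrow API):

```python
import os

def open_output(path, mode):
    """Open `path` for writing under one of three creation policies.

    The flag combinations follow POSIX open(2); the mode names are
    made up for this sketch.
    """
    flags = os.O_WRONLY
    if mode == "overwrite_existing":      # fail if the file doesn't exist
        flags |= os.O_TRUNC
    elif mode == "create_or_overwrite":   # create if missing, else truncate
        flags |= os.O_CREAT | os.O_TRUNC
    elif mode == "create_only":           # fail if the file already exists
        flags |= os.O_CREAT | os.O_EXCL
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return os.fdopen(os.open(path, flags), "wb")
```

For what it's worth, a plain S3 PUT overwrites unconditionally, so the "create only" case would likely need extra care there.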
[jira] [Created] (ARROW-12508) [R] expect_as_vector implementation causes test failure on R <= 3.3
Nic Crane created ARROW-12508: - Summary: [R] expect_as_vector implementation causes test failure on R <= 3.3 Key: ARROW-12508 URL: https://issues.apache.org/jira/browse/ARROW-12508 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nic Crane Assignee: Nic Crane See [https://github.com/ursacomputing/crossbow/runs/2407283789] for details; it only causes issues on R 3.3, not later versions, and a quick search suggests it has to do with the use of `ifelse`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12510) [C++][Python][CSV] Allow quoted values to be null
Antoine Pitrou created ARROW-12510: -- Summary: [C++][Python][CSV] Allow quoted values to be null Key: ARROW-12510 URL: https://issues.apache.org/jira/browse/ARROW-12510 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Antoine Pitrou Fix For: 5.0.0 We should add an option such that quoted CSV values also undergo null detection. -- This message was sent by Atlassian Jira (v8.3.4#803005)
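To illustrate the behavior such an option would control, here is a toy parser (nothing like Arrow's actual reader) that remembers whether each field was quoted; the hypothetical `quoted_can_be_null` flag decides whether quoted values are also null candidates:

```python
def split_fields(line):
    """Split one comma-separated line (no embedded commas or escapes),
    recording whether each field was quoted."""
    fields = []
    for raw in line.split(","):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            fields.append((raw[1:-1], True))   # quoted field
        else:
            fields.append((raw, False))        # unquoted field
    return fields

def to_value(text, quoted, null_values, quoted_can_be_null):
    """Unquoted matches against null_values are always null; quoted
    matches become null only when the option is enabled."""
    if text in null_values and (quoted_can_be_null or not quoted):
        return None
    return text
```

For the line `a,,""` with the empty string in the null set, the unquoted empty field is always null, while the quoted one becomes null only when the flag is on.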
[jira] [Created] (ARROW-12511) [R] na.omit test error on Array and ChunkedArray
Mauricio 'Pachá' Vargas Sepúlveda created ARROW-12511: - Summary: [R] na.omit test error on Array and ChunkedArray Key: ARROW-12511 URL: https://issues.apache.org/jira/browse/ARROW-12511 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 3.0.0 Reporter: Mauricio 'Pachá' Vargas Sepúlveda _This is linked to https://github.com/apache/arrow/pull/10056._ *R 3.3 nightly* See https://github.com/ursacomputing/crossbow/runs/2407283789#step:7:11574, which is the nightly build for R 3.3. Please note that R 3.4 and 3.5 pass the build on bionic. One of the errors is: {code:java} ── Error (test-na-omit.R:32:3): na.omit on Array and ChunkedArray ── Error: attempt to replicate an object of type 'closure' Backtrace: █ 1. └─arrow:::expect_vector_equal(na.omit(input), data_na, ignore_attr = TRUE) test-na-omit.R:32:2 2. └─arrow:::expect_as_vector(via_array, expected, ignore_attr, ...) helper-expectation.R:170:4 3. └─base::ifelse(ignore_attr, expect_equivalent, expect_equal) helper-expectation.R:19:2 ── Error (test-na-omit.R:37:3): na.exclude on Array and ChunkedArray ─── {code} *R without Arrow* See https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=4117=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=532. This is a different error which happens to appear with test-na-omit. In this case the error is: {code:java} ── Error (test-na-omit.R:20:1): (code run outside of `test_that()`) Error: Cannot call vec_to_arrow(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries. Backtrace: █ 1. └─Scalar$create(NA) test-na-omit.R:20:0 2. ├─arrow:::Array__GetScalar(Array$create(x, type = type), 0) 3. └─Array$create(x, type = type) 4. └─arrow:::vec_to_arrow(x, type) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12512) [C++][Dataset] Implement CSV writing support
David Li created ARROW-12512: Summary: [C++][Dataset] Implement CSV writing support Key: ARROW-12512 URL: https://issues.apache.org/jira/browse/ARROW-12512 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: David Li Now that there's a CSV writer, we should hook it up to Datasets. It seems some refactoring will be needed to expose a full writer class for CSV so that Datasets can write batches incrementally. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12513) Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls
David Beach created ARROW-12513: --- Summary: Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls Key: ARROW-12513 URL: https://issues.apache.org/jira/browse/ARROW-12513 Project: Apache Arrow Issue Type: Bug Components: C++, Parquet, Python Affects Versions: 3.0.0, 2.0.0, 1.0.1 Environment: RHEL6 Reporter: David Beach When writing a Table to Parquet, columns represented as dictionary-encoded arrays show an incorrect null_count of 0 in the Parquet metadata. If the same data is saved without dictionary-encoding the array, then the null_count is correct. Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0. NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ implementation of the Arrow/Parquet writer. h3. Setup {code:python} import pyarrow as pa from pyarrow import parquet{code} h3. Bug (writes a dictionary encoded Arrow array to parquet) {code:python} array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string()) assert array1.null_count == 5 array1dict = array1.dictionary_encode() assert array1dict.null_count == 5 table = pa.Table.from_arrays([array1dict], ["mycol"]) parquet.write_table(table, "testtable.parquet") meta = parquet.read_metadata("testtable.parquet") meta.row_group(0).column(0).statistics.null_count # RESULT: 0 (WRONG!){code} h3. Correct (writes same data without dictionary encoding the Arrow array) {code:python} array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string()) assert array1.null_count == 5 table = pa.Table.from_arrays([array1], ["mycol"]) parquet.write_table(table, "testtable.parquet") meta = parquet.read_metadata("testtable.parquet") meta.row_group(0).column(0).statistics.null_count # RESULT: 5 (CORRECT) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12515) [Dev][Wiki][Release] Fix and update Windows RC verify script
Ian Cook created ARROW-12515: Summary: [Dev][Wiki][Release] Fix and update Windows RC verify script Key: ARROW-12515 URL: https://issues.apache.org/jira/browse/ARROW-12515 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools, Wiki Reporter: Ian Cook Assignee: Ian Cook Fix For: 5.0.0 There are some small issues with {{dev/release/verify-release-candidate.bat}}: * Uses VS 2017 (2019 is current) * Uses Python 3.6 (others use 3.8) * {{conda create}} command uses relative paths to YML files; these cannot be found Fix these and update the instructions on the Confluence wiki accordingly: [https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates] But first fix ARROW-11675 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12514) [Release] Don't run Gandiva related Ruby test with ARROW_GANDIVA=OFF
Kouhei Sutou created ARROW-12514: Summary: [Release] Don't run Gandiva related Ruby test with ARROW_GANDIVA=OFF Key: ARROW-12514 URL: https://issues.apache.org/jira/browse/ARROW-12514 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12516) [C++][Gandiva] Implements castINTERVALDAY(varchar) and castINTERVALYEAR(varchar) functions
Anthony Louis Gotlib Ferreira created ARROW-12516: - Summary: [C++][Gandiva] Implements castINTERVALDAY(varchar) and castINTERVALYEAR(varchar) functions Key: ARROW-12516 URL: https://issues.apache.org/jira/browse/ARROW-12516 Project: Apache Arrow Issue Type: New Feature Components: C++ - Gandiva Reporter: Anthony Louis Gotlib Ferreira Assignee: Anthony Louis Gotlib Ferreira The functions take a string, which can be a number or a [duration in ISO 8601 format|https://en.wikipedia.org/wiki/ISO_8601#Durations], and return the corresponding time interval. -- This message was sent by Atlassian Jira (v8.3.4#803005)
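As a rough illustration of the input shapes described (a simplified Python sketch, not Gandiva's implementation; in particular, treating a bare number as a count of days is an assumption here, and only the day-time case is shown):

```python
import re

DAY_MS = 86_400_000  # milliseconds per day

def parse_interval_day(text):
    """Parse a day-time interval into total milliseconds, accepting either
    a bare number (assumed here to mean days) or an ISO 8601 duration
    such as 'P1DT2H'."""
    if re.fullmatch(r"\d+", text):
        return int(text) * DAY_MS
    m = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?", text)
    if m is None or text == "P":
        raise ValueError(f"not a day-time interval: {text!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60_000 + seconds * 1000
```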
[jira] [Created] (ARROW-12517) [Go] Expose App Metadata in Flight client
Paul Whalen created ARROW-12517: --- Summary: [Go] Expose App Metadata in Flight client Key: ARROW-12517 URL: https://issues.apache.org/jira/browse/ARROW-12517 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC, Go Reporter: Paul Whalen There isn't a convenient way to access the App Metadata from a Flight stream via the Go client, because the `ipc.Reader` returned by `flight.NewRecordReader()` only exposes the `array.Record` as you read data from it. Perhaps Flight should expose a Flight-specific reader so the client can also access the metadata. A modified `record_batch_reader.go` workaround/idea is [here|https://gist.github.com/pgwhalen/ed768e18917610b2de7942144068f205]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12503) Cannot call io___MemoryMappedFile__Open()
bers created ARROW-12503: Summary: Cannot call io___MemoryMappedFile__Open() Key: ARROW-12503 URL: https://issues.apache.org/jira/browse/ARROW-12503 Project: Apache Arrow Issue Type: Bug Environment: R4.0.5 openSUSE Leap 15.2 Reporter: bers I have checked [https://arrow.apache.org/docs/r/articles/install.html#package-installed-without-c-dependencies] and none of the known issues apply to me. It then tells me to run [arrow::install_arrow(verbose = TRUE)|https://arrow.apache.org/docs/r/reference/install_arrow.html], which I did. Here's the output: ``` > arrow::install_arrow(verbose = TRUE) Installing package into ‘/data2/bers/opt/R/4.0/library’ (as ‘lib’ is unspecified) trying URL 'https://cran.r-project.org/src/contrib/arrow_3.0.0.tar.gz' Content type 'application/x-gzip' length 344814 bytes (336 KB) == downloaded 336 KB * installing *source* package ‘arrow’ ... ** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation trying URL 'https://dl.bintray.com/ursalabs/arrow-r/libarrow/bin/opensuse-15/arrow-3.0.0.zip' Error in download.file(from_url, to_file, quiet = quietly) : cannot open URL 'https://dl.bintray.com/ursalabs/arrow-r/libarrow/bin/opensuse-15/arrow-3.0.0.zip' *** No C++ binaries found for opensuse-15 trying URL 'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-3.0.0.zip' Error in download.file(from_url, to_file, quiet = quietly) : cannot open URL 'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-3.0.0.zip' trying URL 'https://www.apache.org/dyn/closer.lua?action=download=arrow/arrow-3.0.0/apache-arrow-3.0.0.tar.gz' Content type 'application/x-gzip' length 8200790 bytes (7.8 MB) == downloaded 7.8 MB *** Successfully retrieved C++ source *** Building C++ libraries *** Building with MAKEFLAGS= -j2 arrow with SOURCE_DIR="/tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp" 
BUILD_DIR="/tmp/RtmpOXZGhl/file1568178b21a6" DEST_DIR="libarrow/arrow-3.0.0" CMAKE="/data2/bers/opt/cmake/bin/cmake" CC="gcc" CXX="g++ -std=gnu++11" LDFLAGS="-L/usr/local/lib64" ARROW_S3=ON ARROW_MIMALLOC=ON ++ pwd + : /tmp/RtmppXbaGR/R.INSTALL155322509007/arrow + : /tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp + : /tmp/RtmpOXZGhl/file1568178b21a6 + : libarrow/arrow-3.0.0 + : /data2/bers/opt/cmake/bin/cmake ++ cd /tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp ++ pwd + SOURCE_DIR=/tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp ++ mkdir -p libarrow/arrow-3.0.0 ++ cd libarrow/arrow-3.0.0 ++ pwd + DEST_DIR=/tmp/RtmppXbaGR/R.INSTALL155322509007/arrow/libarrow/arrow-3.0.0 + '[' '' = '' ']' + which ninja + '[' FALSE = false ']' + ARROW_DEFAULT_PARAM=OFF + mkdir -p /tmp/RtmpOXZGhl/file1568178b21a6 + pushd /tmp/RtmpOXZGhl/file1568178b21a6 /tmp/RtmpOXZGhl/file1568178b21a6 /tmp/RtmppXbaGR/R.INSTALL155322509007/arrow + /data2/bers/opt/cmake/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=ON -DARROW_MIMALLOC=ON -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=ON -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=OFF -DARROW_WITH_SNAPPY=OFF -DARROW_WITH_UTF8PROC=OFF -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_PREFIX=/tmp/RtmppXbaGR/R.INSTALL155322509007/arrow/libarrow/arrow-3.0.0 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON -G 'Unix Makefiles' /tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp -- Building using CMake version: 3.19.5 -- The C compiler identification is GNU 7.5.0 -- The CXX compiler identification is GNU 7.5.0 -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C 
compiler: /usr/bin/gcc - skipped -- Detecting C compile features -- Detecting C compile features - done -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/bin/g++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- Arrow version: 3.0.0 (full: '3.0.0') -- Arrow SO version: 300 (full: 300.0.0) -- clang-tidy not found -- clang-format not found -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) -- infer not found fatal: not a git repository (or any of the parent directories): .git -- Could NOT find Python3 (missing: Python3_EXECUTABLE Interpreter) Reason given by package: Interpreter: Cannot use the interpreter
```
[jira] [Created] (ARROW-12502) [R] Download of C++ sources is broken
Roland Weber created ARROW-12502: Summary: [R] Download of C++ sources is broken Key: ARROW-12502 URL: https://issues.apache.org/jira/browse/ARROW-12502 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 3.0.0 Reporter: Roland Weber I'm installing Arrow 3.0.0 from CRAN on RedHat UBI. On 2021-04-21, my post-installation unit tests for Arrow started to fail. I found this error message in the build logs: {code} *** Successfully retrieved C++ source /bin/gtar: This does not look like a tar archive /bin/gtar: Skipping to next header /bin/gtar: Exiting with failure status due to previous errors *** Proceeding without C++ dependencies Warning message: In untar(tf1, exdir = src_dir) : ‘/bin/gtar -xf '/tmp/RtmpNhfLVX/file23a66db9da04' -C '/tmp/RtmpNhfLVX/file23a640ab53bd'’ returned error code 2 {code} Other installation steps and downloads are working, so I don't think this is a network connectivity issue. My guess is that the mirror selection logic changed on the server side, so the source download now saves an HTML error page instead of the source archive. [https://github.com/apache/arrow/blob/maint-3.0.x/r/tools/linuxlibs.R#L221-L224] I'm fixing my build break by switching to binary downloads, but I thought you might want to have a look at that source download logic. -- This message was sent by Atlassian Jira (v8.3.4#803005)