[jira] [Created] (ARROW-12505) [Python] Reconcile LICENSE.txt with top-level LICENSE.txt

2021-04-22 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-12505:
--

 Summary: [Python] Reconcile LICENSE.txt with top-level LICENSE.txt
 Key: ARROW-12505
 URL: https://issues.apache.org/jira/browse/ARROW-12505
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Antoine Pitrou
 Fix For: 5.0.0


The {{python}} directory has a {{LICENSE.txt}} file that seems intermittently 
maintained.
Instead, PyArrow should always refer to the top-level {{LICENSE.txt}} (i.e. 
remove {{python/LICENSE.txt}}?).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12504) [Rust] Buffer::from_slice_ref incorrect capacity

2021-04-22 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-12504:
-

 Summary: [Rust] Buffer::from_slice_ref incorrect capacity
 Key: ARROW-12504
 URL: https://issues.apache.org/jira/browse/ARROW-12504
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Raphael Taylor-Davies
Assignee: Raphael Taylor-Davies


Buffer::from_slice_ref sets the capacity without taking into account the size 
of the slice elements
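
To illustrate the distinction (sketched in Python rather than Rust, and not taken from the Arrow implementation): the byte capacity needed for a typed slice is the element count multiplied by the element size. The helper name `required_capacity_bytes` is hypothetical.

```python
import struct


def required_capacity_bytes(num_elements: int, type_code: str) -> int:
    """Bytes needed to hold `num_elements` values of the given struct type code."""
    # struct.calcsize gives the size in bytes of one element of that type.
    return num_elements * struct.calcsize(type_code)


values = [1, 2, 3, 4]
# "<i" is a 4-byte little-endian signed integer, so four of them need 16 bytes.
packed = struct.pack("<4i", *values)
assert required_capacity_bytes(len(values), "<i") == len(packed)
```

Using the element count alone (4 here) as the capacity would under-report the storage the slice actually needs.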





[jira] [Created] (ARROW-12506) [Python] Improve modularity of pyarrow codebase to speedup compile time

2021-04-22 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-12506:
-

 Summary: [Python] Improve modularity of pyarrow codebase to 
speedup compile time
 Key: ARROW-12506
 URL: https://issues.apache.org/jira/browse/ARROW-12506
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Alessandro Molina


There are some modules in pyarrow that end up being fairly big to compile 
because they are mostly based on including other `pxi` / `pxd` files.

That means that when one of those files changes, a big module has to be 
recompiled, which slows down development when experimenting (it seems not 
uncommon for a change to take less time to recompile in `libarrow` than in 
`pyarrow`).

It would be convenient to divide those into separate modules that compile to 
separate object files, allowing the compiler to recompile smaller chunks at a 
time. When a change is made, we would not have to recompile the whole 
`lib.pyx`, only the module the change is isolated to.

The goal is to allow faster iteration over pyarrow by reducing time spent on 
waiting for cython compilation on each change.





[jira] [Created] (ARROW-12507) [CI] Remove duplicated cron/nightly builds

2021-04-22 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-12507:
---

 Summary: [CI] Remove duplicated cron/nightly builds
 Key: ARROW-12507
 URL: https://issues.apache.org/jira/browse/ARROW-12507
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 5.0.0


There are builds duplicated between the GHA cron jobs and crossbow nightlies.






[jira] [Created] (ARROW-12509) [C++] More fine-grained control of file creation in filesystem layer

2021-04-22 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-12509:
--

 Summary: [C++] More fine-grained control of file creation in 
filesystem layer
 Key: ARROW-12509
 URL: https://issues.apache.org/jira/browse/ARROW-12509
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou


{{FileSystem::OpenOutputStream}} silently truncates an existing file.

It would be better to give more control to the user. Ideally, one could choose 
between several options: "overwrite, and fail if the file doesn't exist", 
"overwrite if it exists, otherwise create", "create if it doesn't exist, 
otherwise fail".

One should research whether e.g. S3 supports such control.
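
The three behaviours listed above can be sketched with Python's built-in `open()` modes. This only illustrates the semantics on a local filesystem and is not the Arrow C++ API; the function name `open_output` and the policy strings are hypothetical.

```python
def open_output(path, policy):
    """Open `path` for binary writing according to a creation policy."""
    if policy == "overwrite_only":
        # "r+b" requires the file to exist (FileNotFoundError otherwise);
        # truncate explicitly so the old contents are discarded.
        f = open(path, "r+b")
        f.truncate(0)
        return f
    if policy == "overwrite_or_create":
        # "wb" truncates an existing file or creates a new one.
        return open(path, "wb")
    if policy == "create_only":
        # "xb" raises FileExistsError if the file is already there.
        return open(path, "xb")
    raise ValueError(f"unknown policy: {policy!r}")
```

Whether an object store such as S3 can offer the same guarantees is a separate question, as the issue notes.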





[jira] [Created] (ARROW-12508) [R] expect_as_vector implementation causes test failure on R <= 3.3

2021-04-22 Thread Nic Crane (Jira)
Nic Crane created ARROW-12508:
-

 Summary: [R] expect_as_vector implementation causes test failure 
on R <= 3.3 
 Key: ARROW-12508
 URL: https://issues.apache.org/jira/browse/ARROW-12508
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nic Crane
Assignee: Nic Crane


See [https://github.com/ursacomputing/crossbow/runs/2407283789] for details; it 
only causes issues for R 3.3 but not later versions, and a quick search implies 
that it's to do with the use of `ifelse`





[jira] [Created] (ARROW-12510) [C++][Python][CSV] Allow quoted values to be null

2021-04-22 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-12510:
--

 Summary: [C++][Python][CSV] Allow quoted values to be null
 Key: ARROW-12510
 URL: https://issues.apache.org/jira/browse/ARROW-12510
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Antoine Pitrou
 Fix For: 5.0.0


We should add an option such that quoted CSV values also undergo null detection.
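
A minimal sketch of the intended semantics, in plain Python rather than the Arrow CSV reader. The function, the option name `quoted_can_be_null`, and the set of null spellings are all hypothetical, chosen only for illustration.

```python
# Strings treated as null; an illustrative set, not Arrow's actual defaults.
NULL_SPELLINGS = {"", "NA", "NULL"}


def convert_field(raw, was_quoted, quoted_can_be_null=False):
    """Return None for null spellings, honouring the hypothetical option."""
    # Unquoted values always undergo null detection; quoted values only
    # do so when the option is switched on.
    if raw in NULL_SPELLINGS and (not was_quoted or quoted_can_be_null):
        return None
    return raw
```

With the option off, a quoted `"NA"` stays the string `NA`; with it on, the same field becomes null.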





[jira] [Created] (ARROW-12511) [R] na.omit test error on Array and ChunkedArray

2021-04-22 Thread Jira
Mauricio 'Pachá' Vargas Sepúlveda created ARROW-12511:
-

 Summary: [R] na.omit test error on Array and ChunkedArray
 Key: ARROW-12511
 URL: https://issues.apache.org/jira/browse/ARROW-12511
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 3.0.0
Reporter: Mauricio 'Pachá' Vargas Sepúlveda


_This is linked to https://github.com/apache/arrow/pull/10056._

*R 3.3 nightly*

See https://github.com/ursacomputing/crossbow/runs/2407283789#step:7:11574, 
which is the nightly build for R 3.3. Please note that R 3.4 and 3.5 pass the 
build on bionic.

One of the errors is:

{code:java}
── Error (test-na-omit.R:32:3): na.omit on Array and ChunkedArray ──
Error: attempt to replicate an object of type 'closure'
Backtrace:
█
 1. └─arrow:::expect_vector_equal(na.omit(input), data_na, ignore_attr = TRUE) 
test-na-omit.R:32:2
 2.   └─arrow:::expect_as_vector(via_array, expected, ignore_attr, ...) 
helper-expectation.R:170:4
 3. └─base::ifelse(ignore_attr, expect_equivalent, expect_equal) 
helper-expectation.R:19:2
── Error (test-na-omit.R:37:3): na.exclude on Array and ChunkedArray ───
{code}

R without Arrow

See 
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=4117=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=532.
 This is a different error which happens to appear with test-na-omit. In this 
case the error is:

{code:java}
── Error (test-na-omit.R:20:1): (code run outside of `test_that()`) 
Error: Cannot call vec_to_arrow(). See 
https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow 
C++ libraries. 
Backtrace:
█
 1. └─Scalar$create(NA) test-na-omit.R:20:0
 2.   ├─arrow:::Array__GetScalar(Array$create(x, type = type), 0)
 3.   └─Array$create(x, type = type)
 4. └─arrow:::vec_to_arrow(x, type)
{code}






[jira] [Created] (ARROW-12512) [C++][Dataset] Implement CSV writing support

2021-04-22 Thread David Li (Jira)
David Li created ARROW-12512:


 Summary: [C++][Dataset] Implement CSV writing support
 Key: ARROW-12512
 URL: https://issues.apache.org/jira/browse/ARROW-12512
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li


Now that there's a CSV writer, we should hook it up to Datasets.

It seems some refactoring will be needed to expose a full writer class for CSV 
so that Datasets can write batches incrementally.
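
As a rough illustration of what "writing batches incrementally" means, here is a sketch using Python's standard `csv` module. It is not the Arrow C++ CSV writer; the class name `IncrementalCsvWriter` and its methods are hypothetical.

```python
import csv
import io


class IncrementalCsvWriter:
    """Writes a header once, then appends rows batch by batch."""

    def __init__(self, sink, column_names):
        self._writer = csv.writer(sink, lineterminator="\n")
        self._column_names = list(column_names)
        self._header_written = False

    def write_batch(self, rows):
        # Emit the header lazily so an unused writer produces no output.
        if not self._header_written:
            self._writer.writerow(self._column_names)
            self._header_written = True
        self._writer.writerows(rows)


sink = io.StringIO()
writer = IncrementalCsvWriter(sink, ["id", "name"])
writer.write_batch([(1, "a")])
writer.write_batch([(2, "b"), (3, "c")])
```

The point of the stateful object is that the caller never needs the whole table in memory: each batch is serialized as it arrives.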





[jira] [Created] (ARROW-12513) Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls

2021-04-22 Thread David Beach (Jira)
David Beach created ARROW-12513:
---

 Summary: Parquet Writer always puts null_count=0 in Parquet 
statistics for dictionary-encoded array with nulls
 Key: ARROW-12513
 URL: https://issues.apache.org/jira/browse/ARROW-12513
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Parquet, Python
Affects Versions: 3.0.0, 2.0.0, 1.0.1
 Environment: RHEL6
Reporter: David Beach


When a Table is written as Parquet and contains columns represented as 
dictionary-encoded arrays, those columns show an incorrect null_count of 0 in 
the Parquet metadata.  If the same data is saved without dictionary-encoding 
the array, the null_count is correct.

Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.

NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ 
implementation of the Arrow/Parquet writer.
h3. Setup
{code:python}
import pyarrow as pa
from pyarrow import parquet{code}
h3. Bug

(writes a dictionary encoded Arrow array to parquet)
{code:python}
array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
assert array1.null_count == 5
array1dict = array1.dictionary_encode()
assert array1dict.null_count == 5
table = pa.Table.from_arrays([array1dict], ["mycol"])
parquet.write_table(table, "testtable.parquet")
meta = parquet.read_metadata("testtable.parquet")
meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!){code}
h3. Correct

(writes same data without dictionary encoding the Arrow array)
{code:python}
array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
assert array1.null_count == 5
table = pa.Table.from_arrays([array1], ["mycol"])
parquet.write_table(table, "testtable.parquet")
meta = parquet.read_metadata("testtable.parquet")
meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
{code}
 





[jira] [Created] (ARROW-12515) [Dev][Wiki][Release] Fix and update Windows RC verify script

2021-04-22 Thread Ian Cook (Jira)
Ian Cook created ARROW-12515:


 Summary: [Dev][Wiki][Release] Fix and update Windows RC verify 
script
 Key: ARROW-12515
 URL: https://issues.apache.org/jira/browse/ARROW-12515
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools, Wiki
Reporter: Ian Cook
Assignee: Ian Cook
 Fix For: 5.0.0


There are some small issues with {{dev/release/verify-release-candidate.bat}}:
 * Uses VS 2017 (2019 is current)
 * Uses Python 3.6 (others use 3.8)
 * {{conda create}} command uses relative paths to YML files; these cannot be 
found

Fix these and update the instructions on the Confluence wiki accordingly: 
[https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates]

But first fix ARROW-11675





[jira] [Created] (ARROW-12514) [Release] Don't run Gandiva related Ruby test with ARROW_GANDIVA=OFF

2021-04-22 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-12514:


 Summary: [Release] Don't run Gandiva related Ruby test with 
ARROW_GANDIVA=OFF
 Key: ARROW-12514
 URL: https://issues.apache.org/jira/browse/ARROW-12514
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-12516) [C++][Gandiva] Implements castINTERVALDAY(varchar) and castINTERVALYEAR(varchar) functions

2021-04-22 Thread Anthony Louis Gotlib Ferreira (Jira)
Anthony Louis Gotlib Ferreira created ARROW-12516:
-

 Summary: [C++][Gandiva] Implements castINTERVALDAY(varchar) and 
castINTERVALYEAR(varchar) functions
 Key: ARROW-12516
 URL: https://issues.apache.org/jira/browse/ARROW-12516
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Anthony Louis Gotlib Ferreira
Assignee: Anthony Louis Gotlib Ferreira


The functions take a string, which can be a number or a [period in the ISO 8601 
format|https://en.wikipedia.org/wiki/ISO_8601#Durations], and return the 
respective time interval.
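
A minimal Python sketch of parsing the day-time part of an ISO 8601 duration. It is not the Gandiva implementation: the function name, the accepted subset (whole days, hours, minutes, seconds), and the `(days, milliseconds)` return shape are assumptions made for illustration, and the plain-number input form is not covered.

```python
import re

# Matches e.g. "P1D", "PT2H", "P1DT2H30M5S" (integers only, no fractions).
_DAY_TIME_DURATION = re.compile(
    r"P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)


def parse_day_time_duration(text):
    """Return (days, milliseconds) for a day-time ISO 8601 duration string."""
    match = _DAY_TIME_DURATION.fullmatch(text)
    # Reject non-matching input and bare "P" / "PT" with no components.
    if match is None or all(v is None for v in match.groupdict().values()):
        raise ValueError(f"not an ISO 8601 day-time duration: {text!r}")
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    seconds = (parts["hours"] * 60 + parts["minutes"]) * 60 + parts["seconds"]
    return parts["days"], seconds * 1000
```

For example, `parse_day_time_duration("P1DT2H")` yields one day plus two hours expressed in milliseconds.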





[jira] [Created] (ARROW-12517) [Go] Expose App Metadata in Flight client

2021-04-22 Thread Paul Whalen (Jira)
Paul Whalen created ARROW-12517:
---

 Summary: [Go] Expose App Metadata in Flight client
 Key: ARROW-12517
 URL: https://issues.apache.org/jira/browse/ARROW-12517
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Go
Reporter: Paul Whalen


There isn't a convenient way to access the App Metadata from a Flight stream 
via the Go client, because the `ipc.Reader` returned from calling 
`flight.NewRecordReader()` only exposes the `array.Record` as you read data 
from it.  Perhaps this should expose a Flight-specific reader so the client can 
also access the metadata.

Modified `record_batch_reader.go` workaround/idea 
[here|https://gist.github.com/pgwhalen/ed768e18917610b2de7942144068f205].







[jira] [Created] (ARROW-12503) Cannot call io___MemoryMappedFile__Open()

2021-04-22 Thread bers (Jira)
bers created ARROW-12503:


 Summary: Cannot call io___MemoryMappedFile__Open()
 Key: ARROW-12503
 URL: https://issues.apache.org/jira/browse/ARROW-12503
 Project: Apache Arrow
  Issue Type: Bug
 Environment: R4.0.5
openSUSE Leap 15.2
Reporter: bers


I have checked 
[https://arrow.apache.org/docs/r/articles/install.html#package-installed-without-c-dependencies], 
and none of the known issues apply to me.

So then it's telling me to issue `[arrow::install_arrow(verbose = 
TRUE)|https://arrow.apache.org/docs/r/reference/install_arrow.html]`, which I 
did. Here's the output:

 

 

```

> arrow::install_arrow(verbose = TRUE)
Installing package into ‘/data2/bers/opt/R/4.0/library’
(as ‘lib’ is unspecified)
trying URL 'https://cran.r-project.org/src/contrib/arrow_3.0.0.tar.gz'
Content type 'application/x-gzip' length 344814 bytes (336 KB)
==
downloaded 336 KB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
trying URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/bin/opensuse-15/arrow-3.0.0.zip'
Error in download.file(from_url, to_file, quiet = quietly) : 
 cannot open URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/bin/opensuse-15/arrow-3.0.0.zip'
*** No C++ binaries found for opensuse-15
trying URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-3.0.0.zip'
Error in download.file(from_url, to_file, quiet = quietly) : 
 cannot open URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-3.0.0.zip'
trying URL 
'https://www.apache.org/dyn/closer.lua?action=download=arrow/arrow-3.0.0/apache-arrow-3.0.0.tar.gz'
Content type 'application/x-gzip' length 8200790 bytes (7.8 MB)
==
downloaded 7.8 MB

*** Successfully retrieved C++ source
*** Building C++ libraries
*** Building with MAKEFLAGS= -j2 
 arrow with 
SOURCE_DIR="/tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp" 
BUILD_DIR="/tmp/RtmpOXZGhl/file1568178b21a6" DEST_DIR="libarrow/arrow-3.0.0" 
CMAKE="/data2/bers/opt/cmake/bin/cmake" CC="gcc" CXX="g++ -std=gnu++11" 
LDFLAGS="-L/usr/local/lib64" ARROW_S3=ON ARROW_MIMALLOC=ON 
++ pwd
+ : /tmp/RtmppXbaGR/R.INSTALL155322509007/arrow
+ : /tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp
+ : /tmp/RtmpOXZGhl/file1568178b21a6
+ : libarrow/arrow-3.0.0
+ : /data2/bers/opt/cmake/bin/cmake
++ cd /tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp
++ pwd
+ SOURCE_DIR=/tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp
++ mkdir -p libarrow/arrow-3.0.0
++ cd libarrow/arrow-3.0.0
++ pwd
+ DEST_DIR=/tmp/RtmppXbaGR/R.INSTALL155322509007/arrow/libarrow/arrow-3.0.0
+ '[' '' = '' ']'
+ which ninja
+ '[' FALSE = false ']'
+ ARROW_DEFAULT_PARAM=OFF
+ mkdir -p /tmp/RtmpOXZGhl/file1568178b21a6
+ pushd /tmp/RtmpOXZGhl/file1568178b21a6
/tmp/RtmpOXZGhl/file1568178b21a6 /tmp/RtmppXbaGR/R.INSTALL155322509007/arrow
+ /data2/bers/opt/cmake/bin/cmake -DARROW_BOOST_USE_SHARED=OFF 
-DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON 
-DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON 
-DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=ON 
-DARROW_MIMALLOC=ON -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=ON 
-DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=OFF 
-DARROW_WITH_SNAPPY=OFF -DARROW_WITH_UTF8PROC=OFF -DARROW_WITH_ZLIB=OFF 
-DARROW_WITH_ZSTD=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_LIBDIR=lib 
-DCMAKE_INSTALL_PREFIX=/tmp/RtmppXbaGR/R.INSTALL155322509007/arrow/libarrow/arrow-3.0.0
 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON -G 'Unix 
Makefiles' /tmp/RtmpOXZGhl/file156868fe52ec/apache-arrow-3.0.0/cpp
-- Building using CMake version: 3.19.5
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 3.0.0 (full: '3.0.0')
-- Arrow SO version: 300 (full: 300.0.0)
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
-- infer not found
fatal: not a git repository (or any of the parent directories): .git
-- Could NOT find Python3 (missing: Python3_EXECUTABLE Interpreter) 
 Reason given by package: 
 Interpreter: Cannot use the interpreter 
```

[jira] [Created] (ARROW-12502) [R] Download of C++ sources is broken

2021-04-22 Thread Roland Weber (Jira)
Roland Weber created ARROW-12502:


 Summary: [R] Download of C++ sources is broken
 Key: ARROW-12502
 URL: https://issues.apache.org/jira/browse/ARROW-12502
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 3.0.0
Reporter: Roland Weber


I'm installing Arrow 3.0.0 from CRAN on RedHat UBI. On 2021-04-21, my 
post-installation unit tests for Arrow started to fail. I found this error 
message in the build logs:

*** Successfully retrieved C++ source
/bin/gtar: This does not look like a tar archive
/bin/gtar: Skipping to next header
/bin/gtar: Exiting with failure status due to previous errors
***
 Proceeding without C++ dependencies
Warning message:
In untar(tf1, exdir = src_dir) :
  ‘/bin/gtar -xf '/tmp/RtmpNhfLVX/file23a66db9da04' -C 
'/tmp/RtmpNhfLVX/file23a640ab53bd'’ returned error code 2

 

Other installation steps and downloads are working, so I don't think this is 
a network connectivity issue. My guess is that the mirror selection logic 
changed on the server side, so that the source download now saves an HTML error 
page instead of the source archive.
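
One way to detect this failure mode early is to check the downloaded bytes for the gzip magic number before attempting to untar. The sketch below is in Python and does not reproduce the logic in `linuxlibs.R`; the function name `looks_like_gzip` is hypothetical.

```python
# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if `data` begins with the gzip magic number."""
    return data[:2] == GZIP_MAGIC
```

An HTML error page starts with text such as `<!DOCTYPE html>` or `<html>`, so it fails this check and the download can be reported as broken instead of being handed to `tar`.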

[https://github.com/apache/arrow/blob/maint-3.0.x/r/tools/linuxlibs.R#L221-L224]

I'm fixing my build break by switching to binary downloads. But I thought you 
might want to have a look at that source download logic.


