[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


romainfrancois commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r446483934



##
File path: r/tests/testthat/test-Array.R
##
@@ -18,16 +18,16 @@
 context("Array")
 
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)

Review comment:
   ✅ 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-26 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-650474759


   RE: @jorisvandenbossche 
   > Same question as in the other PR: does setting the batch size also 
influence existing methods like `read` or `read_row_group` ? Should we add that 
keyword there as well?
   
   My one hesitation about adding them is that it's not clear to me what the 
effect on execution would be. For `iter_batches()` the effect of 
`batch_size` is quite obvious, but I'm not sure about these other methods. 
   
   After a quick search of the Apache Arrow docs, the only explanation I found of 
the batch size parameter was this:
   
   >  [The maximum row count for scanned record batches. If scanned record 
batches are overflowing memory then this method can be called to reduce their 
size.](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html?highlight=batch_size)
   
   
   
   If I can find a good explanation for it or if you have one, I'd be happy to 
add the `batch_size` parameter to the `read()` and `read_row_group()` methods.
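   The `batch_size` semantics being discussed can be illustrated with a plain-Python 
sketch (a hedged illustration only, not the pyarrow implementation): the reader caps 
each yielded batch at `batch_size` rows and emits a final short batch for the remainder.

   ```python
# Hypothetical sketch of batch_size semantics (not pyarrow itself):
# yield chunks of at most batch_size rows, preserving order.
def iter_fixed_batches(rows, batch_size):
    """Yield lists of at most batch_size rows from an iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch, like the tail of a row group
        yield batch

batches = list(iter_fixed_batches(range(10), 4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
   ```

   This also hints at why the keyword is natural for an iterator but less obvious for 
`read()`/`read_row_group()`, which return a single materialized table rather than a 
stream of bounded chunks.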







[GitHub] [arrow] kou closed pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer

2020-06-26 Thread GitBox


kou closed pull request #7553:
URL: https://github.com/apache/arrow/pull/7553


   







[GitHub] [arrow] kou commented on pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer

2020-06-26 Thread GitBox


kou commented on pull request #7553:
URL: https://github.com/apache/arrow/pull/7553#issuecomment-650473985


   +1







[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-26 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-650472498


   Apologies, I've been away for a bit. I thought I had invited @sonthonaxrk as 
a collaborator on my fork, but perhaps that didn't go through.
   
   Addressed the minor feedback.
   
   I also retested the issue I was having where tests would fail if 
`batch_size` and `chunk_size` weren't aligned, and I can't reproduce that error 
any more. Hopefully that means it was fixed and I'm not forgetting something 
important.







[GitHub] [arrow] wjones1 commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-26 Thread GitBox


wjones1 commented on a change in pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#discussion_r446469923



##
File path: python/pyarrow/parquet.py
##
@@ -310,6 +310,44 @@ def read_row_groups(self, row_groups, columns=None, use_threads=True,
                                column_indices=column_indices,
                                use_threads=use_threads)
 
+    def iter_batches(self, batch_size, row_groups=None, columns=None,

Review comment:
   Agreed. I added the same default value for `iter_batches()`

##
File path: python/pyarrow/_parquet.pyx
##
@@ -1077,6 +1078,54 @@ cdef class ParquetReader:
     def set_use_threads(self, bint use_threads):
         self.reader.get().set_use_threads(use_threads)
 
+    def set_batch_size(self, int64_t batch_size):
+        self.reader.get().set_batch_size(batch_size)
+
+    def iter_batches(self, int64_t batch_size, row_groups, column_indices=None,
+                     bint use_threads=True):
+        cdef:
+            vector[int] c_row_groups
+            vector[int] c_column_indices
+            shared_ptr[CRecordBatch] record_batch
+            shared_ptr[TableBatchReader] batch_reader
+            unique_ptr[CRecordBatchReader] recordbatchreader
+
+        self.set_batch_size(batch_size)
+
+        if use_threads:
+            self.set_use_threads(use_threads)
+
+        for row_group in row_groups:
+            c_row_groups.push_back(row_group)
+
+        if column_indices is not None:
+            for index in column_indices:
+                c_column_indices.push_back(index)
+            check_status(
+                self.reader.get().GetRecordBatchReader(
+                    c_row_groups, c_column_indices, &recordbatchreader
+                )
+            )
+        else:
+            check_status(
+                self.reader.get().GetRecordBatchReader(
+                    c_row_groups, &recordbatchreader
+                )
+            )
+
+        while True:
+            with nogil:
+                check_status(
+                    recordbatchreader.get().ReadNext(&record_batch)
+                )
+
+            if record_batch.get() == NULL:
+                break
+
+            py_record_batch = pyarrow_wrap_batch(record_batch)

Review comment:
    









[GitHub] [arrow] github-actions[bot] commented on pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer

2020-06-26 Thread GitBox


github-actions[bot] commented on pull request #7553:
URL: https://github.com/apache/arrow/pull/7553#issuecomment-650466293


   Revision: 6d42269b47130742abe2719a469410a214b87de6
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-361](https://github.com/ursa-labs/crossbow/branches/all?query=actions-361)
   
   |Task|Status|
   |----|------|
   |centos-6-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-centos-6-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-centos-6-amd64)|
   |centos-7-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-centos-7-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |centos-7-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-centos-7-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-centos-7-amd64)|
   |centos-8-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-centos-8-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |centos-8-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-centos-8-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-centos-8-amd64)|
   |debian-buster-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-debian-buster-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-debian-buster-amd64)|
   |debian-buster-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-debian-buster-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |debian-stretch-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-debian-stretch-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-debian-stretch-amd64)|
   |debian-stretch-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-debian-stretch-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-bionic-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-ubuntu-bionic-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-ubuntu-bionic-amd64)|
   |ubuntu-bionic-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-ubuntu-bionic-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-eoan-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-ubuntu-eoan-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-ubuntu-eoan-amd64)|
   |ubuntu-eoan-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-ubuntu-eoan-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-focal-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-ubuntu-focal-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-ubuntu-focal-amd64)|
   |ubuntu-focal-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-ubuntu-focal-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-xenial-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-ubuntu-xenial-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-ubuntu-xenial-amd64)|
   |ubuntu-xenial-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-ubuntu-xenial-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|







[GitHub] [arrow] kou commented on pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer

2020-06-26 Thread GitBox


kou commented on pull request #7553:
URL: https://github.com/apache/arrow/pull/7553#issuecomment-650466094


   @github-actions crossbow submit -g linux







[GitHub] [arrow] scampi commented on a change in pull request #6402: ARROW-7831: [Java] do not allocate a new offset buffer if the slice starts at 0 since the relative offset pointer would be unchanged

2020-06-26 Thread GitBox


scampi commented on a change in pull request #6402:
URL: https://github.com/apache/arrow/pull/6402#discussion_r446429063



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
##
@@ -751,55 +757,57 @@ private void splitAndTransferOffsetBuffer(int startIndex, int length, BaseVariab
    */
   private void splitAndTransferValidityBuffer(int startIndex, int length,
       BaseVariableWidthVector target) {
-    int firstByteSource = BitVectorHelper.byteIndex(startIndex);
-    int lastByteSource = BitVectorHelper.byteIndex(valueCount - 1);
-    int byteSizeTarget = getValidityBufferSizeFromCount(length);
-    int offset = startIndex % 8;
+    if (length <= 0) {

Review comment:
   >  It seems like you're leaving this function with a different number of 
references than the other exit points.
   
   I'm not sure I understand.
   
   Before, the body of the function was wrapped in one big `if`; I thought it was 
clearer to return early and avoid declaring variables like 
`firstByteSource` or `lastByteSource` unnecessarily.
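   The guard-clause style described here can be sketched in a short, hypothetical 
Python example (the names and byte arithmetic are illustrative, not the Java code 
from the PR): the trivial case returns immediately, so the helper variables are 
only declared when they are actually needed.

   ```python
# Hypothetical sketch of the early-return (guard clause) style:
# bail out on the trivial case instead of wrapping the body in one big if.
def split_and_transfer(start_index, length, source, target):
    """Copy the bytes covering [start_index, start_index + length) bits."""
    if length <= 0:
        return  # nothing to copy; skip declaring helper variables entirely
    first_byte = start_index // 8                 # analogous to firstByteSource
    last_byte = (start_index + length - 1) // 8   # analogous to lastByteSource
    target.extend(source[first_byte:last_byte + 1])

out = []
split_and_transfer(0, 0, [1, 2, 3], out)        # guard fires, out untouched
split_and_transfer(0, 16, list(range(10)), out) # copies bytes 0 and 1
   ```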









[GitHub] [arrow] github-actions[bot] commented on pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer

2020-06-26 Thread GitBox


github-actions[bot] commented on pull request #7553:
URL: https://github.com/apache/arrow/pull/7553#issuecomment-650412393


   https://issues.apache.org/jira/browse/ARROW-9234







[GitHub] [arrow] kou opened a new pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer

2020-06-26 Thread GitBox


kou opened a new pull request #7553:
URL: https://github.com/apache/arrow/pull/7553


   This is a follow up task for https://github.com/apache/arrow/pull/7263 .







[GitHub] [arrow] fsaintjacques closed pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-26 Thread GitBox


fsaintjacques closed pull request #7526:
URL: https://github.com/apache/arrow/pull/7526


   







[GitHub] [arrow] wesm closed pull request #6725: ARROW-8226: [Go] Implement 64 bit offsets binary builder

2020-06-26 Thread GitBox


wesm closed pull request #6725:
URL: https://github.com/apache/arrow/pull/6725


   







[GitHub] [arrow] wesm commented on pull request #6725: ARROW-8226: [Go] Implement 64 bit offsets binary builder

2020-06-26 Thread GitBox


wesm commented on pull request #6725:
URL: https://github.com/apache/arrow/pull/6725#issuecomment-650356376


   Well, that is definitely a bummer. I hope to see some growth in the Go 
developer community here in the future. I'll close this PR for now, then.







[GitHub] [arrow] nealrichardson commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-26 Thread GitBox


nealrichardson commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-650289073


   I'll merge after Appveyor passes, ignoring Travis







[GitHub] [arrow] kszucs commented on pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-26 Thread GitBox


kszucs commented on pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#issuecomment-650267609


   @ursabot build







[GitHub] [arrow] wesm merged pull request #7552: [CI] Set allow_failures on Travis CI jobs until they stop being broken

2020-06-26 Thread GitBox


wesm merged pull request #7552:
URL: https://github.com/apache/arrow/pull/7552


   







[GitHub] [arrow] wesm commented on pull request #7552: [CI] Set allow_failures on Travis CI jobs until they stop being broken

2020-06-26 Thread GitBox


wesm commented on pull request #7552:
URL: https://github.com/apache/arrow/pull/7552#issuecomment-650260460


   +1







[GitHub] [arrow] github-actions[bot] commented on pull request #7552: [CI] Set allow_failures on Travis CI jobs until they stop being broken

2020-06-26 Thread GitBox


github-actions[bot] commented on pull request #7552:
URL: https://github.com/apache/arrow/pull/7552#issuecomment-650259300


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


romainfrancois commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r446269132



##
File path: r/tests/testthat/test-Array.R
##
@@ -18,16 +18,16 @@
 context("Array")
 
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)
   expect_type_equal(a$type, type)
   expect_identical(length(a), length(x))
-  if (!inherits(type, "ListType")) {
+  if (!inherits(type, "ListType") && !inherits(type, "LargeListType")) {

Review comment:
   I don't think there is inheritance down at the C++ level.









[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


romainfrancois commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r446269445



##
File path: r/tests/testthat/test-Array.R
##
@@ -18,16 +18,16 @@
 context("Array")
 
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)

Review comment:
   Fair enough. I'll update 









[GitHub] [arrow] wesm opened a new pull request #7552: [CI] Set allow_failures on Travis CI jobs until they stop being broken

2020-06-26 Thread GitBox


wesm opened a new pull request #7552:
URL: https://github.com/apache/arrow/pull/7552


   Travis CI has been flaky for days. This is adding a lot of noise to our 
workflows, so I think we should allow these jobs to fail until they become more 
consistently green.







[GitHub] [arrow] kszucs commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-26 Thread GitBox


kszucs commented on a change in pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#discussion_r446266184



##
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##
@@ -399,15 +434,59 @@ class TestNumericMinMaxKernel : public ::testing::Test {
 };
 
 template <typename ArrowType>
-class TestFloatingMinMaxKernel : public TestNumericMinMaxKernel<ArrowType> {};
+class TestBooleanMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestIntegerMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestFloatingMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+TYPED_TEST_SUITE(TestBooleanMinMaxKernel, BooleanArrowType);

Review comment:
   Right.









[GitHub] [arrow] kszucs commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-26 Thread GitBox


kszucs commented on a change in pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#discussion_r446264863



##
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##
@@ -397,24 +452,26 @@ struct MinMaxImpl : public ScalarAggregator {
 
     ArrayType arr(batch[0].array());
 
-    local.has_nulls = arr.null_count() > 0;
+    const auto null_count = arr.null_count();
+    local.has_nulls = null_count > 0;
+    local.has_values = (arr.length() - null_count) > 0;
+
     if (local.has_nulls && options.null_handling == MinMaxOptions::OUTPUT_NULL) {
       this->state = local;
       return;
     }
 
-    const auto values = arr.raw_values();
-    if (arr.null_count() > 0) {
+    if (local.has_nulls) {
       BitmapReader reader(arr.null_bitmap_data(), arr.offset(), arr.length());
       for (int64_t i = 0; i < arr.length(); i++) {
         if (reader.IsSet()) {
-          local.MergeOne(values[i]);
+          local.MergeOne(arr.Value(i));
         }
         reader.Next();
       }
     } else {
       for (int64_t i = 0; i < arr.length(); i++) {
-        local.MergeOne(values[i]);
+        local.MergeOne(arr.Value(i));
       }

Review comment:
   We could optimize it further by early returning on the first occurrence 
of true/false values. Going to create a jira once it's merged.
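   The early-exit idea mentioned here can be sketched in plain Python (a hedged 
illustration of the optimization, not the Arrow C++ kernel): for booleans, once 
both `False` and `True` have been observed, the min is necessarily `False` and the 
max is `True`, so the rest of the array can be skipped.

   ```python
# Sketch of an early-exit boolean min/max scan: the result is fixed
# as soon as both truth values have been seen.
def boolean_min_max(values):
    """Return (min, max) over an iterable of booleans, or None if empty."""
    seen_false = seen_true = False
    for v in values:
        seen_false |= not v
        seen_true |= v
        if seen_false and seen_true:
            break  # result can no longer change; skip the remaining values
    if not (seen_false or seen_true):
        return None  # empty input: no min/max defined
    return (not seen_false, seen_true)  # min is False iff any False was seen
   ```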









[GitHub] [arrow] nealrichardson commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-26 Thread GitBox


nealrichardson commented on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-650249655


   @romainfrancois regarding tests, I think a fixture something like 
   
   ```r
   df <- tibble::tibble(
 a = structure("one", class = "special_string"),
 b = 2,
 c = tibble::tibble(
   c1 = structure("inner", extra_attr = "something"),
   c2 = 4
 )
   )
   ```
   
   could be sufficient. 
   
   ```r
   expect_identical(as.data.frame(Table$create(df)), df)
   expect_identical(as.data.frame(record_batch(df)), df)
   ```
   
   and then also confirm that it's identical round-tripping to Feather and 
Parquet.







[GitHub] [arrow] wesm commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-26 Thread GitBox


wesm commented on a change in pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#discussion_r446262302



##
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##
@@ -399,15 +434,59 @@ class TestNumericMinMaxKernel : public ::testing::Test {
 };
 
 template <typename ArrowType>
-class TestFloatingMinMaxKernel : public TestNumericMinMaxKernel<ArrowType> {};
+class TestBooleanMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestIntegerMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestFloatingMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+TYPED_TEST_SUITE(TestBooleanMinMaxKernel, BooleanArrowType);

Review comment:
   Just define
   
   ```
   class TestBooleanMinMaxKernel : public TestPrimitiveMinMaxKernel<BooleanType> {};
   ```









[GitHub] [arrow] paddyhoran commented on a change in pull request #7500: ARROW-9191: [Rust] Do not panic when milliseconds is less than zero as chrono can handle…

2020-06-26 Thread GitBox


paddyhoran commented on a change in pull request #7500:
URL: https://github.com/apache/arrow/pull/7500#discussion_r446258243



##
File path: rust/parquet/src/record/api.rs
##
@@ -893,16 +893,6 @@ mod tests {
         assert_eq!(row, Field::TimestampMillis(123854406));
     }
 
-    #[test]
-    #[should_panic(expected = "Expected non-negative milliseconds when converting Int96")]
-    fn test_row_convert_int96_invalid() {
-        // INT96 value does not depend on logical type
-        let descr = make_column_descr![PhysicalType::INT96, LogicalType::NONE];
-
-        let value = Int96::from(vec![0, 0, 0]);
-        Field::convert_int96(&descr, value);
-    }
-

Review comment:
   Yea, I agree with the direction.  I was just wondering if we were 
relying on this assumption (non-negative milliseconds) elsewhere in the 
codebase.  I'll merge this soon if there are no other comments.









[GitHub] [arrow] nealrichardson commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-26 Thread GitBox


nealrichardson commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-650242207


   This has a python lint failure: 
https://github.com/apache/arrow/pull/7526/checks?check_run_id=811470274#step:7:1377







[GitHub] [arrow] nealrichardson closed pull request #7550: ARROW-9219: [R] coerce_timestamps in Parquet write options does not work

2020-06-26 Thread GitBox


nealrichardson closed pull request #7550:
URL: https://github.com/apache/arrow/pull/7550


   







[GitHub] [arrow] nealrichardson closed pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion

2020-06-26 Thread GitBox


nealrichardson closed pull request #7527:
URL: https://github.com/apache/arrow/pull/7527


   







[GitHub] [arrow] nealrichardson commented on pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion

2020-06-26 Thread GitBox


nealrichardson commented on pull request #7527:
URL: https://github.com/apache/arrow/pull/7527#issuecomment-650237185


   Thanks. I think we should just revisit further work once `cpp11` is 
available and see how much of that is handled.







[GitHub] [arrow] nealrichardson commented on a change in pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion

2020-06-26 Thread GitBox


nealrichardson commented on a change in pull request #7527:
URL: https://github.com/apache/arrow/pull/7527#discussion_r446247176



##
File path: r/R/schema.R
##
@@ -83,16 +83,21 @@ Schema <- R6Class("Schema",
 }
   ),
   active = list(
-    names = function() Schema__field_names(self),
+    names = function() {
+      out <- Schema__field_names(self)
+      # Hack: Rcpp should set the encoding

Review comment:
   Yes, I did a little dance when I saw 
https://cpp11.r-lib.org/articles/motivations.html#utf-8-everywhere









[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


nealrichardson commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r446237067



##
File path: r/tests/testthat/test-Array.R
##
@@ -18,16 +18,16 @@
 context("Array")
 
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)

Review comment:
   I don't think we want to do this in general because it means we're not 
testing the automatic type detection code. Instead, how about changing the 
function signature to:
   
   ```r
   expect_array_roundtrip <- function(x, type, as = NULL) {
   ```
   and then here
   
   ```suggestion
 a <- Array$create(x, type = as)
   ```
   
   then changing the Large* tests to pass `as` as well.

##
File path: r/tests/testthat/test-Array.R
##
@@ -36,10 +36,10 @@ expect_array_roundtrip <- function(x, type) {
 x_sliced <- x[-1]
 expect_type_equal(a_sliced$type, type)
 expect_identical(length(a_sliced), length(x_sliced))
-if (!inherits(type, "ListType")) {
+if (!inherits(type, "ListType") && !inherits(type, "LargeListType")) {

Review comment:
   ```suggestion
   if (!inherits(type, c("ListType", "LargeListType"))) {
   ```

##
File path: r/tests/testthat/test-Array.R
##
@@ -18,16 +18,16 @@
 context("Array")
 
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)
   expect_type_equal(a$type, type)
   expect_identical(length(a), length(x))
-  if (!inherits(type, "ListType")) {
+  if (!inherits(type, "ListType") && !inherits(type, "LargeListType")) {

Review comment:
   Can/should LargeListType inherit from ListType?
   
   ```suggestion
 if (!inherits(type, c("ListType", "LargeListType"))) {
   ```









[GitHub] [arrow] jacques-n commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-06-26 Thread GitBox


jacques-n commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-650227890


   
   > 1. We are package-hacking `org.apache.arrow.memory` in module 
`arrow-dataset`.
   >Yes `Ownerships.java` uses some hacks, so I've removed the class in 
latest commits (the whole class is not necessary for current use case). Now 
only `NativeUnderlyingMemory.java` is still in the package 
`org.apache.arrow.memory`.
   
   We need to avoid this entirely. If you need some functionality, let's figure 
out what should be exposed. BaseAllocator should not be referenced.
   
   > 2. We lose java native heap caps.
   >Can we clarify this? I think the explanation can be either we lose caps 
from JVM direct memory, or we lose caps from BufferAllocator. For the latter 
one I think buffers are already registered to allocator.
   
   JVM should respect the max direct memory setting. I added more comments on 
your other comment to that end.







[GitHub] [arrow] jacques-n commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-06-26 Thread GitBox


jacques-n commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r446237545



##
File path: 
java/dataset/src/main/java/org/apache/arrow/memory/NativeUnderlingMemory.java
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.memory;
+
+import org.apache.arrow.dataset.jni.JniWrapper;
+
+/**
+ * AllocationManager implementation for Native allocated memory.
+ */
+public class NativeUnderlingMemory extends AllocationManager {

Review comment:
   In reality, this is not at all the case. There are two ways that you run 
out of memory in Java: you run out of available allocation space OR you hit the 
limit at the JVM level. The limit at the JVM level can be influenced by things 
like fragmentation so in nearly all cases you hit that limit before the 
allocator managed limit. In your patch, you only are respecting the allocator 
set limit, not the jvm limit. This means a user could configure java to allow X 
of memory and the process actually will use 2X of memory, leading to issues in 
many systems.
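The failure mode described above can be modeled with a small sketch (a toy 
Python model invented for illustration; this is not the actual Java allocator 
API, and the class and method names here are made up):

```python
class ProcessMemoryModel:
    """Toy model of the leak being pointed out: the JVM enforces its
    direct-memory cap only on allocations it performs itself. Native
    (JNI/C++) buffers that are tracked only by an Arrow-style allocator
    escape that cap, so the process can use up to twice the configured
    amount."""

    def __init__(self, jvm_cap):
        self.jvm_cap = jvm_cap
        self.jvm_used = 0     # direct memory the JVM accounts for
        self.native_used = 0  # native memory the JVM never sees

    def allocate_direct(self, size):
        # The JVM-side check: respects a MaxDirectMemorySize-style cap.
        if self.jvm_used + size > self.jvm_cap:
            raise MemoryError("JVM direct-memory cap exceeded")
        self.jvm_used += size

    def allocate_native(self, size):
        # No JVM-side check happens here: only the allocator's own
        # accounting applies, which is exactly the problem described.
        self.native_used += size

    def total_process_usage(self):
        return self.jvm_used + self.native_used
```

With a cap of 100, filling direct memory and then allocating 100 more 
natively leaves the process at 200 — double what the user configured.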









[GitHub] [arrow] romainfrancois commented on a change in pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion

2020-06-26 Thread GitBox


romainfrancois commented on a change in pull request #7527:
URL: https://github.com/apache/arrow/pull/7527#discussion_r446220256



##
File path: r/src/array_from_vector.cpp
##
@@ -159,6 +159,9 @@ struct VectorToArrayConverter {
   if (s == NA_STRING) {
 RETURN_NOT_OK(binary_builder->AppendNull());
 continue;
+  } else {
+// Make sure we're ingesting UTF-8
+s = Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8);

Review comment:
   yeah maybe some sort of `Rf_mkCharUtf8()` or `Rf_mkUtf8()`

##
File path: r/src/recordbatch.cpp
##
@@ -246,6 +246,7 @@ std::shared_ptr<arrow::RecordBatch> RecordBatch__from_arrays__known_schema(
   SEXP names = Rf_getAttrib(lst, R_NamesSymbol);
 
   auto fill_array = [, ](int j, SEXP x, SEXP name) {
+name = Rf_mkCharCE(Rf_translateCharUTF8(name), CE_UTF8);

Review comment:
   I don't think R offers an API for this

##
File path: r/R/schema.R
##
@@ -83,16 +83,21 @@ Schema <- R6Class("Schema",
 }
   ),
   active = list(
-names = function() Schema__field_names(self),
+names = function() {
+  out <- Schema__field_names(self)
+  # Hack: Rcpp should set the encoding

Review comment:
   I believe `cpp11` will rescue us from that sort of trouble. 









[GitHub] [arrow] kszucs commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-26 Thread GitBox


kszucs commented on a change in pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#discussion_r446219965



##
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##
@@ -399,15 +434,59 @@ class TestNumericMinMaxKernel : public ::testing::Test {
 };
 
template <typename ArrowType>
-class TestFloatingMinMaxKernel : public TestNumericMinMaxKernel<ArrowType> {};
+class TestBooleanMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestIntegerMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestFloatingMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+TYPED_TEST_SUITE(TestBooleanMinMaxKernel, BooleanArrowType);

Review comment:
   I chose it in order to reuse `AssertMinMaxIsNull` and `AssertMinMaxIs` 
methods. 
   @bkietz can I instantiate `TestPrimitiveMinMaxKernel` without the 
TYPED_TEST macro?









[GitHub] [arrow] romainfrancois commented on pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


romainfrancois commented on pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#issuecomment-650204663


   Added support for LargeList: 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   a <- Array$create(list(integer()), type = large_list_of(int32()))
   a
   #> LargeListArray
   #> <large_list<item: int32>>
   #> [
   #>   []
   #> ]
   a$as_vector()
   #> <list_of<integer>[1]>
   #> [[1]]
   #> integer(0)
   ```
   
   Created on 2020-06-26 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0.9001)







[GitHub] [arrow] github-actions[bot] commented on pull request #7551: ARROW-9132: [C++] Support Unique and ValueCounts on dictionary data with non-changing dictionaries, add ChunkedArray::Make validat

2020-06-26 Thread GitBox


github-actions[bot] commented on pull request #7551:
URL: https://github.com/apache/arrow/pull/7551#issuecomment-650203303


   https://issues.apache.org/jira/browse/ARROW-9132







[GitHub] [arrow] wesm commented on a change in pull request #7551: ARROW-9132: [C++] Support Unique and ValueCounts on dictionary data with non-changing dictionaries, add ChunkedArray::Make validating

2020-06-26 Thread GitBox


wesm commented on a change in pull request #7551:
URL: https://github.com/apache/arrow/pull/7551#discussion_r446206568



##
File path: cpp/src/arrow/chunked_array.cc
##
@@ -64,6 +64,24 @@ ChunkedArray::ChunkedArray(ArrayVector chunks, std::shared_ptr<DataType> type)
   }
 }
 
+Result<std::shared_ptr<ChunkedArray>> ChunkedArray::Make(
+    ArrayVector chunks, std::shared_ptr<DataType> type) {

Review comment:
   @pitrou if you have thoughts/opinions about this









[GitHub] [arrow] wesm opened a new pull request #7551: ARROW-9132: [C++] Support Unique and ValueCounts on dictionary data with non-changing dictionaries, add ChunkedArray::Make validating constructor

2020-06-26 Thread GitBox


wesm opened a new pull request #7551:
URL: https://github.com/apache/arrow/pull/7551


   This dispatches to the hash function for the indices while checking that the 
dictionaries stay the same on each processed chunk
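A rough sketch of that strategy (illustrative Python invented for this note, 
not the actual C++ kernel; the function name and chunk representation are 
made up):

```python
def unique_dictionary_chunks(chunks):
    """Compute the unique values of a dictionary-encoded chunked array by
    hashing only the integer indices, while verifying that every chunk
    carries the same dictionary (the non-changing-dictionary case).

    `chunks` is a list of (indices, dictionary) pairs."""
    first_dict = chunks[0][1]
    seen = set()
    order = []
    for indices, dictionary in chunks:
        if dictionary != first_dict:
            # A changing dictionary would require unifying them first.
            raise ValueError("dictionaries differ between chunks")
        for idx in indices:
            if idx not in seen:  # the "hash function for the indices"
                seen.add(idx)
                order.append(idx)
    return [first_dict[i] for i in order]
```

Because only small integers are hashed, the dictionary values themselves are 
touched once, at the end, to materialize the result.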







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-26 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r446160460



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,704 @@
 # under the License.
 
 
-_NULL = NA = None
-
-
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
 
-self.type = null()
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
 
 def __eq__(self, other):
-return NA
+# TODO(kszucs): use c++ Equals
+if isinstance(other, Scalar):
+other = other.as_py()
+return self.as_py() == other
 
+def __hash__(self):
+# TODO(kszucs): use C++ hash if implemented for the type
+return hash(self.as_py())
+
+def as_py(self):
+raise NotImplementedError()
 
-_NULL = NA = NullType()
+
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
-
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __init__(self):
+pass
 
-def __repr__(self):
-if hasattr(self, 'as_py'):
-return repr(self.as_py())
-else:
-return super(Scalar, self).__repr__()
+def __eq__(self, other):
+return NA
 
-def __str__(self):
-if hasattr(self, 'as_py'):
-return str(self.as_py())
-else:
-return super(Scalar, self).__str__()
+def as_py(self):
+"""
+Return this value as a Python None.
+"""
+return None
 
-def __eq__(self, other):
-if hasattr(self, 'as_py'):
-if isinstance(other, ArrayValue):
-other = other.as_py()
-return self.as_py() == other
-else:
-raise NotImplementedError(
-"Cannot compare Arrow values that don't support as_py()")
 
-def __hash__(self):
-return hash(self.as_py())
+_NULL = NA = NullScalar()
 
 
-cdef class BooleanValue(ArrayValue):
+cdef class BooleanScalar(Scalar):
 """
-Concrete class for boolean array elements.
+Concrete class for boolean scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python bool.
 """
-cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CBooleanScalar* sp = <CBooleanScalar*> self.wrapped.get()
+return sp.value if sp.is_valid else None
 
 
-cdef class Int8Value(ArrayValue):
+cdef class UInt8Scalar(Scalar):
 """
-Concrete class for int8 array elements.
+Concrete class for uint8 scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python int.
 """
-cdef CInt8Array* ap = <CInt8Array*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CUInt8Scalar* sp = <CUInt8Scalar*> self.wrapped.get()
+return sp.value if sp.is_valid else None

[GitHub] [arrow] kszucs commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-26 Thread GitBox


kszucs commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-650158531


   I'm considering applying 
[cython.freelist](https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#fast-instantiation)
 to the scalar extension classes.
   
   @wesm @pitrou @jorisvandenbossche do you have any positive experience with 
it?
   
   I assume we should measure its impact before using it, so I can defer it to 
a follow-up.







[GitHub] [arrow] wesm commented on pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


wesm commented on pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#issuecomment-650151220


   > @wesm I don't know what you mean by the `BitBlockCounter` treatment.
   
   Ah sorry, welcome back =) We created some facilities to improve the 
performance of processing validity bitmaps
   
   
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit_block_counter.h
   
   The idea is that you can scan forward 64 bits at a time and determine 
efficiently (using hardware popcount) whether a "block" of 64 values has any 
nulls or not (or if the block is all null, too). For those 64 values you can 
skip null checking all together and use the fast path. This speeds up 
processing _significantly_ for data that is mostly not null or mostly null
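The same idea in a plain-Python sketch (invented for illustration; Arrow's 
real implementation is the C++ `BitBlockCounter` linked above): the validity 
bitmap is packed into 64-bit words, and a popcount per word selects between 
the fast all-valid path, skipping an all-null block, and the mixed slow path.

```python
def sum_valid(values, validity_words):
    """Sum the non-null entries of `values`, whose validity bitmap is
    packed little-endian into 64-bit integers in `validity_words`."""
    total = 0
    for w, word in enumerate(validity_words):
        base = w * 64
        n = min(64, len(values) - base)
        popcount = bin(word & ((1 << n) - 1)).count("1")
        if popcount == n:
            # Block is all valid: no per-element null checks needed.
            total += sum(values[base:base + n])
        elif popcount == 0:
            # Block is all null: skip it entirely.
            continue
        else:
            # Mixed block: fall back to per-element bit tests.
            for i in range(n):
                if (word >> i) & 1:
                    total += values[base + i]
    return total
```

For mostly-valid or mostly-null data, almost every block takes one of the two 
branch-free paths, which is where the speedup comes from.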







[GitHub] [arrow] wesm closed pull request #7541: ARROW-9224: [Dev][Archery] clone local source with --shared

2020-06-26 Thread GitBox


wesm closed pull request #7541:
URL: https://github.com/apache/arrow/pull/7541


   







[GitHub] [arrow] wesm commented on pull request #7541: ARROW-9224: [Dev][Archery] clone local source with --shared

2020-06-26 Thread GitBox


wesm commented on pull request #7541:
URL: https://github.com/apache/arrow/pull/7541#issuecomment-650150564


   thanks!







[GitHub] [arrow] kszucs commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-26 Thread GitBox


kszucs commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-650149745


   > Could we add a `is_valid` attribute to the python scalar as well? Now the 
only way to check for a null value is to do `.as_py() is None` ?
   
   Added.







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-26 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r446148546



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,704 @@
 # under the License.
 
 
-_NULL = NA = None
-
-
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
 
-self.type = null()
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
 
 def __eq__(self, other):
-return NA
+# TODO(kszucs): use c++ Equals
+if isinstance(other, Scalar):
+other = other.as_py()
+return self.as_py() == other
 
+def __hash__(self):
+# TODO(kszucs): use C++ hash if implemented for the type
+return hash(self.as_py())
+
+def as_py(self):
+raise NotImplementedError()
 
-_NULL = NA = NullType()
+
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
-
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __init__(self):
+pass
 
-def __repr__(self):
-if hasattr(self, 'as_py'):
-return repr(self.as_py())
-else:
-return super(Scalar, self).__repr__()
+def __eq__(self, other):
+return NA
 
-def __str__(self):
-if hasattr(self, 'as_py'):
-return str(self.as_py())
-else:
-return super(Scalar, self).__str__()
+def as_py(self):
+"""
+Return this value as a Python None.
+"""
+return None
 
-def __eq__(self, other):
-if hasattr(self, 'as_py'):
-if isinstance(other, ArrayValue):
-other = other.as_py()
-return self.as_py() == other
-else:
-raise NotImplementedError(
-"Cannot compare Arrow values that don't support as_py()")
 
-def __hash__(self):
-return hash(self.as_py())
+_NULL = NA = NullScalar()
 
 
-cdef class BooleanValue(ArrayValue):
+cdef class BooleanScalar(Scalar):
 """
-Concrete class for boolean array elements.
+Concrete class for boolean scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python bool.
 """
-cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CBooleanScalar* sp = <CBooleanScalar*> self.wrapped.get()
+return sp.value if sp.is_valid else None
 
 
-cdef class Int8Value(ArrayValue):
+cdef class UInt8Scalar(Scalar):
 """
-Concrete class for int8 array elements.
+Concrete class for uint8 scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python int.
 """
-cdef CInt8Array* ap = <CInt8Array*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CUInt8Scalar* sp = <CUInt8Scalar*> self.wrapped.get()
+return sp.value if sp.is_valid else None

[GitHub] [arrow] mrkn edited a comment on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-26 Thread GitBox


mrkn edited a comment on pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#issuecomment-650148002


   > > Without the canonical flag, we need to make a copy and sort the data of 
non-canonical sparse tensor when serializing it because the current 
SparseCOOIndex has the constraint that the indices matrix is sorted.
   > 
   > Sorry, but I don't understand this sentence. If SparseCOOIndex has the 
constraint that the indices matrix is sorted, then it is canonical already. Why 
do you need to add a flag?
   
   Oh, I didn’t explain that this pull request removes that constraint. I’m 
sorry.
   







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-26 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r446148105



##
File path: python/pyarrow/tests/test_parquet.py
##
@@ -2028,7 +2028,7 @@ def test_filters_invalid_pred_op(tempdir, 
use_legacy_dataset):
 use_legacy_dataset=use_legacy_dataset)
 assert dataset.read().num_rows == 0
 
-with pytest.raises(ValueError if use_legacy_dataset else TypeError):
+with pytest.raises(ValueError if use_legacy_dataset else pa.ArrowInvalid):
 # dataset API returns TypeError when trying create invalid comparison

Review comment:
   Nice, updated.









[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-26 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r446147913



##
File path: python/pyarrow/_dataset.pyx
##
@@ -216,22 +216,18 @@ cdef class Expression:
 @staticmethod
 def _scalar(value):
 cdef:
-shared_ptr[CScalar] scalar
-
-if value is None:
-scalar.reset(new CNullScalar())
-elif isinstance(value, bool):
-scalar = MakeScalar(value)
-elif isinstance(value, float):
-scalar = MakeScalar(value)
-elif isinstance(value, int):
-scalar = MakeScalar(value)
-elif isinstance(value, (bytes, str)):
-scalar = MakeStringScalar(tobytes(value))

Review comment:
   Removed it.









[GitHub] [arrow] mrkn commented on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-26 Thread GitBox


mrkn commented on pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#issuecomment-650148002


   > > Without the canonical flag, we need to make a copy and sort the data of 
non-canonical sparse tensor when serializing it because the current 
SparseCOOIndex has the constraint that the indices matrix is sorted.
   > 
   > Sorry, but I don't understand this sentence. If SparseCOOIndex has the 
constraint that the indices matrix is sorted, then it is canonical already. Why 
do you need to add a flag?
   
   Oh, I forgot to explain that this pull request removes that constraint.
   







[GitHub] [arrow] romainfrancois commented on pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


romainfrancois commented on pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#issuecomment-650122091


   Also deals with LargeBinary now, e.g. 
https://issues.apache.org/jira/browse/ARROW-6543
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   a <- Array$create(list(raw()), type = large_binary())
   a
   #> Array
   #> <large_binary>
   #> [
   #>   
   #> ]
   a$as_vector()
   #> <list_of<raw>[1]>
   #> [[1]]
   #> raw(0)
   ```
   
   Created on 2020-06-26 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0.9001)







[GitHub] [arrow] rymurr commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-06-26 Thread GitBox


rymurr commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r446103501



##
File path: java/dataset/src/test/resources/avroschema/user.avsc
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+{

Review comment:
   I personally agree with @emkornfield on this one. While the code is harder 
to read than JSON, it's more compact and isolated.









[GitHub] [arrow] pitrou commented on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-26 Thread GitBox


pitrou commented on pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#issuecomment-650091038


   > Without the canonical flag, we need to make a copy and sort the data of 
non-canonical sparse tensor when serializing it because the current 
SparseCOOIndex has the constraint that the indices matrix is sorted. 
   
   Sorry, but I don't understand this sentence. If SparseCOOIndex has the 
constraint that the indices matrix is sorted, then it is canonical already. Why 
do you need to add a flag?







[GitHub] [arrow] mrkn commented on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-26 Thread GitBox


mrkn commented on pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#issuecomment-650089924


   @pitrou Without the canonical flag, we need to make a copy and sort the data 
of non-canonical sparse tensor when serializing it because the current 
SparseCOOIndex has the constraint that the indices matrix is sorted.  Moreover, 
we lose the information that the original sparse tensor is not canonical.
   
   In contrast, with the canonical flag, we can preserve the canonicality of 
the original sparse tensor and can share the data without a copy and sort.
   
   Unfortunately, SciPy's coo_matrix manages its indices by separated arrays, 
we need to make a copy of the indices for serialization even if with the 
canonical flag. This issue can be solved by SparseSplitCOOTensor proposed in 
#7044.
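The copy-and-sort that the flag lets serialization skip can be sketched as 
follows (hypothetical Python helpers invented for illustration, not the 
Arrow C++ API; coordinates are row-major tuples):

```python
def is_canonical_coo(coords):
    """A COO index is canonical when its coordinates are strictly
    increasing in lexicographic (row-major) order, which also rules
    out duplicate entries."""
    return all(coords[i] < coords[i + 1] for i in range(len(coords) - 1))

def canonicalize_coo(coords, data):
    """Sort a non-canonical COO tensor and merge duplicate coordinates
    by summing their values; returns new (coords, data) lists.
    This is the copy + sort that a canonical flag makes avoidable."""
    pairs = sorted(zip(coords, data))
    out_coords, out_data = [], []
    for c, v in pairs:
        if out_coords and out_coords[-1] == c:
            out_data[-1] += v  # duplicate coordinate: accumulate
        else:
            out_coords.append(c)
            out_data.append(v)
    return out_coords, out_data
```

With the flag, an already-canonical tensor can be serialized zero-copy, and a 
non-canonical one can be passed through unchanged while recording that fact.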







[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


romainfrancois commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r446041588



##
File path: r/src/array_from_vector.cpp
##
@@ -1067,12 +1110,22 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector(SEXP x) {
   if (Rf_inherits(x, "data.frame")) {
     return InferArrowTypeFromDataFrame(x);
   } else {
-    if (XLENGTH(x) == 0) {
-      Rcpp::stop(
-          "Requires at least one element to infer the values' type of a list vector");
-    }
+    SEXP ptype = Rf_getAttrib(x, symbols::ptype);
+    if (ptype == R_NilValue) {
+      if (XLENGTH(x) == 0) {
+        Rcpp::stop(
+            "Requires at least one element to infer the values' type of a list vector");
+      }
 
-    return arrow::list(InferArrowType(VECTOR_ELT(x, 0)));
+      return arrow::list(InferArrowType(VECTOR_ELT(x, 0)));
+    } else {
+      // special case list(raw()) -> BinaryArray
+      if (TYPEOF(ptype) == RAWSXP) {
+        return arrow::binary();
+      }
+
+      return arrow::list(InferArrowType(ptype));

Review comment:
   Done. I had to modify the roundtrip checks to use `expect_equivalent()` 
because a roundtrip might add information: 
   
   ```
   list() -> List Array -> list_of( ptype = ) 
   ```
   









[GitHub] [arrow] rdettai commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-26 Thread GitBox


rdettai commented on pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#issuecomment-650027175


   Great! AWS is going to be surprised to see its worldwide S3 HEAD request 
rate drop by half overnight! 







[GitHub] [arrow] rdettai commented on a change in pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-26 Thread GitBox


rdettai commented on a change in pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#discussion_r446015106



##
File path: cpp/src/arrow/filesystem/s3fs.cc
##
@@ -397,9 +399,17 @@ class ObjectInputFile : public io::RandomAccessFile {
   ObjectInputFile(Aws::S3::S3Client* client, const S3Path& path)
   : client_(client), path_(path) {}
 
+  ObjectInputFile(Aws::S3::S3Client* client, const S3Path& path, int64_t size)
+  : client_(client), path_(path), content_length_(size) {}
+
   Status Init() {
 // Issue a HEAD Object to get the content-length and ensure any
 // errors (e.g. file not found) don't wait until the first Read() call.

Review comment:
   This comment got a little bit out of sync with the code below ;-)
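   The idea in the diff above — skip the HEAD request entirely when the caller 
already knows the object size, e.g. from a prior directory listing — can be 
sketched in plain Python. All names here are illustrative, not Arrow's or the 
AWS SDK's API:

   ```python
   class ObjectInputFile:
       """Illustrative sketch: defer the HEAD request when the size is known."""

       def __init__(self, client, path, size=None):
           self.client = client
           self.path = path
           self.content_length = size  # None means "not yet known"

       def init(self):
           # Only issue a HEAD request if the content length was not
           # supplied up front; otherwise trust the caller's value.
           if self.content_length is None:
               self.content_length = self.client.head_object(self.path)
           return self.content_length


   class FakeClient:
       """Counts HEAD requests so we can observe the saving."""

       def __init__(self):
           self.head_calls = 0

       def head_object(self, path):
           self.head_calls += 1
           return 1024


   client = FakeClient()
   ObjectInputFile(client, "bucket/key").init()        # size unknown: one HEAD
   ObjectInputFile(client, "bucket/key", 1024).init()  # size known: no HEAD
   print(client.head_calls)  # 1
   ```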













[GitHub] [arrow] houqp commented on a change in pull request #7500: ARROW-9191: [Rust] Do not panic when milliseconds is less than zero as chrono can handle…

2020-06-26 Thread GitBox


houqp commented on a change in pull request #7500:
URL: https://github.com/apache/arrow/pull/7500#discussion_r445998007



##
File path: rust/parquet/src/record/api.rs
##
@@ -893,16 +893,6 @@ mod tests {
 assert_eq!(row, Field::TimestampMillis(123854406));
 }
 
-    #[test]
-    #[should_panic(expected = "Expected non-negative milliseconds when converting Int96")]
-    fn test_row_convert_int96_invalid() {
-        // INT96 value does not depend on logical type
-        let descr = make_column_descr![PhysicalType::INT96, LogicalType::NONE];
-
-        let value = Int96::from(vec![0, 0, 0]);
-        Field::convert_int96(&descr, value);
-    }
-
-

Review comment:
   I agree that it's better to implement timestamps as signed: not only does 
that add support for times before 1970, it also makes subtraction much easier 
to work with.
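   A small illustration of the point in plain Python, rather than the Rust 
code under review: with a signed milliseconds representation, pre-1970 
instants are simply negative offsets from the epoch, and subtraction works 
without underflow:

   ```python
   from datetime import datetime, timedelta, timezone

   EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

   def to_millis(dt):
       """Signed milliseconds since the Unix epoch."""
       return round((dt - EPOCH) / timedelta(milliseconds=1))

   before_epoch = datetime(1969, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
   after_epoch = datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)

   # Negative values naturally encode times before 1970.
   print(to_millis(before_epoch))                           # -1000
   # Differences are plain signed arithmetic, no underflow possible.
   print(to_millis(after_epoch) - to_millis(before_epoch))  # 2000
   ```

   With an unsigned representation, the pre-epoch value above would have no 
encoding at all, and the subtraction order would matter.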









[GitHub] [arrow] romainfrancois commented on pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


romainfrancois commented on pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#issuecomment-650002855


   @wesm I don't know what you mean by the `BitBlockCounter` treatment. 







[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-26 Thread GitBox


romainfrancois commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r445994055



##
File path: r/src/array_from_vector.cpp
##
@@ -915,6 +924,39 @@ class Time64Converter : public TimeConverter {
   }
 };
 
+class BinaryVectorConverter : public VectorConverter {
+ public:
+  ~BinaryVectorConverter() {}
+
+  Status Init(ArrayBuilder* builder) {
+    typed_builder_ = checked_cast<arrow::BinaryBuilder*>(builder);
+return Status::OK();
+  }
+
+  Status Ingest(SEXP obj) {
+ARROW_RETURN_IF(TYPEOF(obj) != VECSXP, Status::RError("Expecting a list"));
+R_xlen_t n = XLENGTH(obj);
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP obj_i = VECTOR_ELT(obj, i);
+  if (Rf_isNull(obj_i)) {
+RETURN_NOT_OK(typed_builder_->AppendNull());
+  } else {
+ARROW_RETURN_IF(TYPEOF(obj_i) != RAWSXP,
+Status::RError("Expecting a raw vector"));
+RETURN_NOT_OK(typed_builder_->Append(RAW(obj_i), XLENGTH(obj_i)));

Review comment:
   Thanks. 









[GitHub] [arrow] romainfrancois commented on a change in pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-26 Thread GitBox


romainfrancois commented on a change in pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#discussion_r445988639



##
File path: r/src/table.cpp
##
@@ -172,11 +172,20 @@ std::shared_ptr Table__from_dots(SEXP lst, 
SEXP schema_sxp) {
   std::shared_ptr schema;
 
   if (Rf_isNull(schema_sxp)) {
-// infer the schema from the ...
+// infer the schema from the `...`

Review comment:
   Thanks. They share some code now. 




