[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
romainfrancois commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r446483934

## File path: r/tests/testthat/test-Array.R

```diff
@@ -18,16 +18,16 @@ context("Array")
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)
```

Review comment: ✅

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-650474759
RE: @jorisvandenbossche

> Same question as in the other PR: does setting the batch size also influence existing methods like `read` or `read_row_group`? Should we add that keyword there as well?

My one hesitation about adding them is that it's not clear to me what the effect would be on the execution. For `iter_batches()` the effect of `batch_size` is quite obvious, but I'm not sure about these other methods. After a quick search of the Apache Arrow docs, the only explanation I found of the batch size parameter was this:

> [The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html?highlight=batch_size)

If I can find a good explanation for it, or if you have one, I'd be happy to add the `batch_size` parameter to the `read()` and `read_row_group()` methods.
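The `batch_size` semantics under discussion can be illustrated with a plain-Python sketch. This is a simplified stand-in for the proposed `ParquetFile.iter_batches()`, not the actual pyarrow implementation: rows are yielded in chunks of at most `batch_size`.

```python
def iter_batches(rows, batch_size):
    """Yield successive chunks of at most batch_size rows.

    Toy stand-in for the proposed ParquetFile.iter_batches():
    every chunk has batch_size rows except possibly the last one.
    """
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

sizes = [len(b) for b in iter_batches(list(range(10)), batch_size=4)]
print(sizes)  # [4, 4, 2]
```

In the real API a batch would be a `pyarrow.RecordBatch` rather than a list slice, but the chunking contract is the same.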
[GitHub] [arrow] kou closed pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer
kou closed pull request #7553: URL: https://github.com/apache/arrow/pull/7553
[GitHub] [arrow] kou commented on pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer
kou commented on pull request #7553: URL: https://github.com/apache/arrow/pull/7553#issuecomment-650473985 +1
[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-650472498 Apologies, I've been away for a bit. I thought I had invited @sonthonaxrk as a collaborator on my fork, but perhaps that didn't go through. I've addressed the minor feedback. I also retested the issue I was having where tests would fail if `batch_size` and `chunk_size` weren't aligned, and I can't reproduce that error any more. Hopefully that means it was fixed and I'm not forgetting something important.
[GitHub] [arrow] wjones1 commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
wjones1 commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r446469923

## File path: python/pyarrow/parquet.py

```diff
@@ -310,6 +310,44 @@ def read_row_groups(self, row_groups, columns=None, use_threads=True,
                                    column_indices=column_indices,
                                    use_threads=use_threads)

+    def iter_batches(self, batch_size, row_groups=None, columns=None,
```

Review comment: Agreed. I added the same default value for `iter_batches()`.

## File path: python/pyarrow/_parquet.pyx

```diff
@@ -1077,6 +1078,54 @@ cdef class ParquetReader:
     def set_use_threads(self, bint use_threads):
         self.reader.get().set_use_threads(use_threads)

+    def set_batch_size(self, int64_t batch_size):
+        self.reader.get().set_batch_size(batch_size)
+
+    def iter_batches(self, int64_t batch_size, row_groups, column_indices=None,
+                     bint use_threads=True):
+        cdef:
+            vector[int] c_row_groups
+            vector[int] c_column_indices
+            shared_ptr[CRecordBatch] record_batch
+            shared_ptr[TableBatchReader] batch_reader
+            unique_ptr[CRecordBatchReader] recordbatchreader
+
+        self.set_batch_size(batch_size)
+
+        if use_threads:
+            self.set_use_threads(use_threads)
+
+        for row_group in row_groups:
+            c_row_groups.push_back(row_group)
+
+        if column_indices is not None:
+            for index in column_indices:
+                c_column_indices.push_back(index)
+            check_status(
+                self.reader.get().GetRecordBatchReader(
+                    c_row_groups, c_column_indices, &recordbatchreader))
+        else:
+            check_status(
+                self.reader.get().GetRecordBatchReader(
+                    c_row_groups, &recordbatchreader))
+
+        while True:
+            with nogil:
+                check_status(
+                    recordbatchreader.get().ReadNext(&record_batch))
+
+            if record_batch.get() == NULL:
+                break
+
+            py_record_batch = pyarrow_wrap_batch(record_batch)
```

Review comment:
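The reader loop in the Cython snippet above follows a common drain pattern: call `ReadNext` repeatedly until the reader hands back a null batch. A minimal Python sketch of the same pattern (the `read_next` callable here is hypothetical, standing in for the C++ record batch reader):

```python
def drain(read_next):
    """Yield batches from read_next() until it returns None (end of stream)."""
    while True:
        batch = read_next()
        if batch is None:
            break
        yield batch

# Toy source: two pending "record batches", then exhaustion.
pending = [[1, 2], [3]]
batches = list(drain(lambda: pending.pop(0) if pending else None))
print(batches)  # [[1, 2], [3]]
```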
[GitHub] [arrow] github-actions[bot] commented on pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer
github-actions[bot] commented on pull request #7553: URL: https://github.com/apache/arrow/pull/7553#issuecomment-650466293
Revision: 6d42269b47130742abe2719a469410a214b87de6
Submitted crossbow builds: [ursa-labs/crossbow @ actions-361](https://github.com/ursa-labs/crossbow/branches/all?query=actions-361)

|Task|Status|
|----|------|
|centos-6-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-centos-6-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-centos-6-amd64)|
|centos-7-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-centos-7-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|centos-7-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-centos-7-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-centos-7-amd64)|
|centos-8-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-centos-8-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|centos-8-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-centos-8-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-centos-8-amd64)|
|debian-buster-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-debian-buster-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-debian-buster-amd64)|
|debian-buster-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-debian-buster-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|debian-stretch-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-debian-stretch-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-debian-stretch-amd64)|
|debian-stretch-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-debian-stretch-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|ubuntu-bionic-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-ubuntu-bionic-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-ubuntu-bionic-amd64)|
|ubuntu-bionic-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-ubuntu-bionic-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|ubuntu-eoan-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-ubuntu-eoan-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-ubuntu-eoan-amd64)|
|ubuntu-eoan-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-ubuntu-eoan-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|ubuntu-focal-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-ubuntu-focal-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-ubuntu-focal-amd64)|
|ubuntu-focal-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-ubuntu-focal-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|ubuntu-xenial-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-361-github-ubuntu-xenial-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-361-github-ubuntu-xenial-amd64)|
|ubuntu-xenial-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-361-travis-ubuntu-xenial-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
[GitHub] [arrow] kou commented on pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer
kou commented on pull request #7553: URL: https://github.com/apache/arrow/pull/7553#issuecomment-650466094 @github-actions crossbow submit -g linux
[GitHub] [arrow] github-actions[bot] commented on pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer
github-actions[bot] commented on pull request #7553: URL: https://github.com/apache/arrow/pull/7553#issuecomment-650412393 https://issues.apache.org/jira/browse/ARROW-9234
[GitHub] [arrow] kou opened a new pull request #7553: ARROW-9234: [GLib][CUDA] Add support for dictionary memo on reading record batch from buffer
kou opened a new pull request #7553: URL: https://github.com/apache/arrow/pull/7553 This is a follow-up task for https://github.com/apache/arrow/pull/7263.
[GitHub] [arrow] fsaintjacques closed pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema
fsaintjacques closed pull request #7526: URL: https://github.com/apache/arrow/pull/7526
[GitHub] [arrow] wesm closed pull request #6725: ARROW-8226: [Go] Implement 64 bit offsets binary builder
wesm closed pull request #6725: URL: https://github.com/apache/arrow/pull/6725
[GitHub] [arrow] wesm commented on pull request #6725: ARROW-8226: [Go] Implement 64 bit offsets binary builder
wesm commented on pull request #6725: URL: https://github.com/apache/arrow/pull/6725#issuecomment-650356376 Well, that is definitely a bummer. I hope to see some growth in the Go developer community here in the future. I'll close this PR for now, then.
[GitHub] [arrow] nealrichardson commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema
nealrichardson commented on pull request #7526: URL: https://github.com/apache/arrow/pull/7526#issuecomment-650289073 I'll merge after Appveyor passes, ignoring Travis
[GitHub] [arrow] kszucs commented on pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type
kszucs commented on pull request #7478: URL: https://github.com/apache/arrow/pull/7478#issuecomment-650267609 @ursabot build
[GitHub] [arrow] wesm merged pull request #7552: [CI] Set allow_failures on Travis CI jobs until they stop being broken
wesm merged pull request #7552: URL: https://github.com/apache/arrow/pull/7552
[GitHub] [arrow] wesm commented on pull request #7552: [CI] Set allow_failures on Travis CI jobs until they stop being broken
wesm commented on pull request #7552: URL: https://github.com/apache/arrow/pull/7552#issuecomment-650260460 +1
[GitHub] [arrow] github-actions[bot] commented on pull request #7552: [CI] Set allow_failures on Travis CI jobs until they stop being broken
github-actions[bot] commented on pull request #7552: URL: https://github.com/apache/arrow/pull/7552#issuecomment-650259300
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Then could you also rename the pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
See also:
* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
romainfrancois commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r446269132

## File path: r/tests/testthat/test-Array.R

```diff
@@ -18,16 +18,16 @@ context("Array")
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)
   expect_type_equal(a$type, type)
   expect_identical(length(a), length(x))
-  if (!inherits(type, "ListType")) {
+  if (!inherits(type, "ListType") && !inherits(type, "LargeListType")) {
```

Review comment: I don't think there is inheritance down in the C++ code
[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
romainfrancois commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r446269445

## File path: r/tests/testthat/test-Array.R

```diff
@@ -18,16 +18,16 @@ context("Array")
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)
```

Review comment: Fair enough. I'll update
[GitHub] [arrow] wesm opened a new pull request #7552: [CI] Set allow_failures on Travis CI jobs until they stop being broken
wesm opened a new pull request #7552: URL: https://github.com/apache/arrow/pull/7552 Travis CI has been flaky for days. This is adding a lot of noise to our workflows so I think we should allow them to fail until they become more consistently happy.
[GitHub] [arrow] kszucs commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type
kszucs commented on a change in pull request #7478: URL: https://github.com/apache/arrow/pull/7478#discussion_r446266184

## File path: cpp/src/arrow/compute/kernels/aggregate_test.cc

```diff
@@ -399,15 +434,59 @@ class TestNumericMinMaxKernel : public ::testing::Test {
 };

 template <typename ArrowType>
-class TestFloatingMinMaxKernel : public TestNumericMinMaxKernel<ArrowType> {};
+class TestBooleanMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestIntegerMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestFloatingMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+TYPED_TEST_SUITE(TestBooleanMinMaxKernel, BooleanArrowType);
```

Review comment: Right.
[GitHub] [arrow] kszucs commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type
kszucs commented on a change in pull request #7478: URL: https://github.com/apache/arrow/pull/7478#discussion_r446264863

## File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc

```diff
@@ -397,24 +452,26 @@ struct MinMaxImpl : public ScalarAggregator {
     ArrayType arr(batch[0].array());

-    local.has_nulls = arr.null_count() > 0;
+    const auto null_count = arr.null_count();
+    local.has_nulls = null_count > 0;
+    local.has_values = (arr.length() - null_count) > 0;
+
     if (local.has_nulls && options.null_handling == MinMaxOptions::OUTPUT_NULL) {
       this->state = local;
       return;
     }

-    const auto values = arr.raw_values();
-    if (arr.null_count() > 0) {
+    if (local.has_nulls) {
       BitmapReader reader(arr.null_bitmap_data(), arr.offset(), arr.length());
       for (int64_t i = 0; i < arr.length(); i++) {
         if (reader.IsSet()) {
-          local.MergeOne(values[i]);
+          local.MergeOne(arr.Value(i));
         }
         reader.Next();
       }
     } else {
       for (int64_t i = 0; i < arr.length(); i++) {
-        local.MergeOne(values[i]);
+        local.MergeOne(arr.Value(i));
       }
```

Review comment: We could optimize it further by early-returning on the first occurrence of true/false values. Going to create a JIRA once it's merged.
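The early-exit idea mentioned in the review can be sketched in Python. This is an illustrative sketch of the optimization, not the Arrow C++ kernel: once both a true and a false value have been observed, the min/max of a boolean array is fully determined, so the scan can stop.

```python
def boolean_min_max(values):
    """Min/max over nullable booleans with an early exit.

    Once both True and False have been seen, min must be False and
    max must be True, so we can stop scanning immediately.
    """
    seen_true = seen_false = False
    for v in values:
        if v is None:
            continue  # skip nulls (analogous to null-skipping in the kernel)
        if v:
            seen_true = True
        else:
            seen_false = True
        if seen_true and seen_false:
            break  # early exit: result fully determined
    if not (seen_true or seen_false):
        return (None, None)  # no valid values at all
    # min is True only if no False was seen; max is True if any True was seen.
    return (not seen_false, seen_true)

print(boolean_min_max([True, None, False, True]))  # (False, True)
```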
[GitHub] [arrow] nealrichardson commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
nealrichardson commented on pull request #7524: URL: https://github.com/apache/arrow/pull/7524#issuecomment-650249655
@romainfrancois regarding tests, I think a fixture something like

```r
df <- tibble::tibble(
  a = structure("one", class = "special_string"),
  b = 2,
  c = tibble::tibble(
    c1 = structure("inner", extra_attr = "something"),
    c2 = 4
  )
)
```

could be sufficient.

```r
expect_identical(as.data.frame(Table$create(df)), df)
expect_identical(as.data.frame(record_batch(df)), df)
```

and then also confirm that it's identical round-tripping to Feather and Parquet.
[GitHub] [arrow] wesm commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type
wesm commented on a change in pull request #7478: URL: https://github.com/apache/arrow/pull/7478#discussion_r446262302

## File path: cpp/src/arrow/compute/kernels/aggregate_test.cc

```diff
@@ -399,15 +434,59 @@ class TestNumericMinMaxKernel : public ::testing::Test {
 };

 template <typename ArrowType>
-class TestFloatingMinMaxKernel : public TestNumericMinMaxKernel<ArrowType> {};
+class TestBooleanMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestIntegerMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+template <typename ArrowType>
+class TestFloatingMinMaxKernel : public TestPrimitiveMinMaxKernel<ArrowType> {};
+
+TYPED_TEST_SUITE(TestBooleanMinMaxKernel, BooleanArrowType);
```

Review comment: Just define

```cpp
class TestBooleanMinMaxKernel : public TestPrimitiveMinMaxKernel {};
```
[GitHub] [arrow] paddyhoran commented on a change in pull request #7500: ARROW-9191: [Rust] Do not panic when milliseconds is less than zero as chrono can handle…
paddyhoran commented on a change in pull request #7500: URL: https://github.com/apache/arrow/pull/7500#discussion_r446258243

## File path: rust/parquet/src/record/api.rs

```diff
@@ -893,16 +893,6 @@ mod tests {
         assert_eq!(row, Field::TimestampMillis(123854406));
     }

-    #[test]
-    #[should_panic(expected = "Expected non-negative milliseconds when converting Int96")]
-    fn test_row_convert_int96_invalid() {
-        // INT96 value does not depend on logical type
-        let descr = make_column_descr![PhysicalType::INT96, LogicalType::NONE];
-
-        let value = Int96::from(vec![0, 0, 0]);
-        Field::convert_int96(&descr, value);
-    }
-
```

Review comment: Yea, I agree with the direction. I was just wondering if we were relying on this assumption (non-negative milliseconds) elsewhere in the codebase. I'll merge this soon if there are no other comments.
[GitHub] [arrow] nealrichardson commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema
nealrichardson commented on pull request #7526: URL: https://github.com/apache/arrow/pull/7526#issuecomment-650242207 This has a python lint failure: https://github.com/apache/arrow/pull/7526/checks?check_run_id=811470274#step:7:1377
[GitHub] [arrow] nealrichardson closed pull request #7550: ARROW-9219: [R] coerce_timestamps in Parquet write options does not work
nealrichardson closed pull request #7550: URL: https://github.com/apache/arrow/pull/7550
[GitHub] [arrow] nealrichardson closed pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion
nealrichardson closed pull request #7527: URL: https://github.com/apache/arrow/pull/7527
[GitHub] [arrow] nealrichardson commented on pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion
nealrichardson commented on pull request #7527: URL: https://github.com/apache/arrow/pull/7527#issuecomment-650237185 Thanks, I think we should just revisit further work once `cpp11` is available and see how much of that is handled.
[GitHub] [arrow] nealrichardson commented on a change in pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion
nealrichardson commented on a change in pull request #7527: URL: https://github.com/apache/arrow/pull/7527#discussion_r446247176

## File path: r/R/schema.R

```diff
@@ -83,16 +83,21 @@ Schema <- R6Class("Schema",
   ),
   active = list(
-    names = function() Schema__field_names(self),
+    names = function() {
+      out <- Schema__field_names(self)
+      # Hack: Rcpp should set the encoding
```

Review comment: Yes, I did a little dance when I saw https://cpp11.r-lib.org/articles/motivations.html#utf-8-everywhere
[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
nealrichardson commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r446237067

## File path: r/tests/testthat/test-Array.R

```diff
@@ -18,16 +18,16 @@ context("Array")
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)
```

Review comment: I don't think we want to do this in general because it means we're not testing the automatic type detection code. Instead, how about changing the function signature to:

```r
expect_array_roundtrip <- function(x, type, as = NULL) {
```

and then here

```suggestion
  a <- Array$create(x, type = as)
```

then changing the Large* tests to pass `as` as well.

## File path: r/tests/testthat/test-Array.R

```diff
@@ -36,10 +36,10 @@ expect_array_roundtrip <- function(x, type) {
   x_sliced <- x[-1]
   expect_type_equal(a_sliced$type, type)
   expect_identical(length(a_sliced), length(x_sliced))
-  if (!inherits(type, "ListType")) {
+  if (!inherits(type, "ListType") && !inherits(type, "LargeListType")) {
```

Review comment:

```suggestion
  if (!inherits(type, c("ListType", "LargeListType"))) {
```

## File path: r/tests/testthat/test-Array.R

```diff
@@ -18,16 +18,16 @@ context("Array")
 expect_array_roundtrip <- function(x, type) {
-  a <- Array$create(x)
+  a <- Array$create(x, type = type)
   expect_type_equal(a$type, type)
   expect_identical(length(a), length(x))
-  if (!inherits(type, "ListType")) {
+  if (!inherits(type, "ListType") && !inherits(type, "LargeListType")) {
```

Review comment: Can/should LargeListType inherit from ListType?

```suggestion
  if (!inherits(type, c("ListType", "LargeListType"))) {
```
[GitHub] [arrow] jacques-n commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
jacques-n commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-650227890

> 1. We are package-hacking `org.apache.arrow.memory` in module `arrow-dataset`.
>
> Yes, `Ownerships.java` uses some hacks, so I've removed the class in the latest commits (the whole class is not necessary for the current use case). Now only `NativeUnderlyingMemory.java` is still in the package `org.apache.arrow.memory`.

We need to avoid this entirely. If you need some functionality, let's figure out what should be exposed. BaseAllocator should not be referenced.

> 2. We lose java native heap caps.
>
> Can we clarify this? I think the explanation can be either we lose caps from JVM direct memory, or we lose caps from BufferAllocator. For the latter one I think buffers are already registered to the allocator.

The JVM should respect the max direct memory setting. I added more comments on your other comment to that end.
[GitHub] [arrow] jacques-n commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
jacques-n commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r446237545 ## File path: java/dataset/src/main/java/org/apache/arrow/memory/NativeUnderlingMemory.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.memory; + +import org.apache.arrow.dataset.jni.JniWrapper; + +/** + * AllocationManager implementation for Native allocated memory. + */ +public class NativeUnderlingMemory extends AllocationManager { Review comment: In reality, this is not at all the case. There are two ways that you run out of memory in Java: you run out of available allocation space OR you hit the limit at the JVM level. The limit at the JVM level can be influenced by things like fragmentation so in nearly all cases you hit that limit before the allocator managed limit. In your patch, you only are respecting the allocator set limit, not the jvm limit. This means a user could configure java to allow X of memory and the process actually will use 2X of memory, leading to issues in many systems. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
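The two-limit point above can be sketched in a few lines. This is a hypothetical pure-Python analogue, not Arrow's Java API: each allocator enforces its own limit, but every allocation is also charged against a shared process-wide budget (playing the role of the JVM-level limit), so two allocators cannot jointly exceed the process budget.

```python
class ProcessBudget:
    """Shared, process-wide memory budget (analogous to the JVM-level limit)."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def reserve(self, size):
        if self.used + size > self.limit:
            raise MemoryError("process-wide limit exceeded")
        self.used += size


class Allocator:
    """Per-allocator accounting (analogous to the allocator-set limit)."""

    def __init__(self, limit, budget):
        self.limit = limit
        self.budget = budget
        self.used = 0

    def allocate(self, size):
        if self.used + size > self.limit:
            raise MemoryError("allocator limit exceeded")
        # Also charge the shared budget, so total usage across all
        # allocators stays under the process-wide limit.
        self.budget.reserve(size)
        self.used += size
```

With a shared budget of 100, two allocators each limited to 100 cannot both allocate 80: the second reservation fails at the shared budget even though each allocator is under its own limit.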
[GitHub] [arrow] romainfrancois commented on a change in pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion
romainfrancois commented on a change in pull request #7527: URL: https://github.com/apache/arrow/pull/7527#discussion_r446220256

```diff
## File path: r/src/array_from_vector.cpp ##
@@ -159,6 +159,9 @@ struct VectorToArrayConverter {
       if (s == NA_STRING) {
         RETURN_NOT_OK(binary_builder->AppendNull());
         continue;
+      } else {
+        // Make sure we're ingesting UTF-8
+        s = Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8);
```

Review comment: yeah, maybe some sort of `Rf_mkCharUtf8()` or `Rf_mkUtf8()`

```diff
## File path: r/src/recordbatch.cpp ##
@@ -246,6 +246,7 @@ std::shared_ptr RecordBatch__from_arrays__known_schema(
   SEXP names = Rf_getAttrib(lst, R_NamesSymbol);

   auto fill_array = [, ](int j, SEXP x, SEXP name) {
+    name = Rf_mkCharCE(Rf_translateCharUTF8(name), CE_UTF8);
```

Review comment: I don't think R offers an API for this

```diff
## File path: r/R/schema.R ##
@@ -83,16 +83,21 @@ Schema <- R6Class("Schema",
     }
   ),
   active = list(
-    names = function() Schema__field_names(self),
+    names = function() {
+      out <- Schema__field_names(self)
+      # Hack: Rcpp should set the encoding
```

Review comment: I believe `cpp11` will rescue us from that sort of trouble.
[GitHub] [arrow] kszucs commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type
kszucs commented on a change in pull request #7478: URL: https://github.com/apache/arrow/pull/7478#discussion_r446219965

```diff
## File path: cpp/src/arrow/compute/kernels/aggregate_test.cc ##
@@ -399,15 +434,59 @@ class TestNumericMinMaxKernel : public ::testing::Test {
 };
 
 template
-class TestFloatingMinMaxKernel : public TestNumericMinMaxKernel {};
+class TestBooleanMinMaxKernel : public TestPrimitiveMinMaxKernel {};
+
+template
+class TestIntegerMinMaxKernel : public TestPrimitiveMinMaxKernel {};
+
+template
+class TestFloatingMinMaxKernel : public TestPrimitiveMinMaxKernel {};
+
+TYPED_TEST_SUITE(TestBooleanMinMaxKernel, BooleanArrowType);
```

Review comment: I chose it in order to reuse the `AssertMinMaxIsNull` and `AssertMinMaxIs` methods. @bkietz, can I instantiate `TestPrimitiveMinMaxKernel` without the TYPED_TEST macro?
[GitHub] [arrow] romainfrancois commented on pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
romainfrancois commented on pull request #7514: URL: https://github.com/apache/arrow/pull/7514#issuecomment-650204663

Added support for LargeList:

``` r
library(arrow, warn.conflicts = FALSE)
a <- Array$create(list(integer()), type = large_list_of(int32()))
a
#> LargeListArray
#> >
#> [
#>   []
#> ]
a$as_vector()
#> [1]>
#> [[1]]
#> integer(0)
```

Created on 2020-06-26 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)
[GitHub] [arrow] github-actions[bot] commented on pull request #7551: ARROW-9132: [C++] Support Unique and ValueCounts on dictionary data with non-changing dictionaries, add ChunkedArray::Make validat
github-actions[bot] commented on pull request #7551: URL: https://github.com/apache/arrow/pull/7551#issuecomment-650203303 https://issues.apache.org/jira/browse/ARROW-9132
[GitHub] [arrow] wesm commented on a change in pull request #7551: ARROW-9132: [C++] Support Unique and ValueCounts on dictionary data with non-changing dictionaries, add ChunkedArray::Make validating
wesm commented on a change in pull request #7551: URL: https://github.com/apache/arrow/pull/7551#discussion_r446206568

```diff
## File path: cpp/src/arrow/chunked_array.cc ##
@@ -64,6 +64,24 @@ ChunkedArray::ChunkedArray(ArrayVector chunks, std::shared_ptr type)
   }
 }
 
+Result> ChunkedArray::Make(ArrayVector chunks,
+                           std::shared_ptr type) {
```

Review comment: @pitrou if you have thoughts/opinions about this
[GitHub] [arrow] wesm opened a new pull request #7551: ARROW-9132: [C++] Support Unique and ValueCounts on dictionary data with non-changing dictionaries, add ChunkedArray::Make validating constructor
wesm opened a new pull request #7551: URL: https://github.com/apache/arrow/pull/7551

This dispatches to the hash function for the indices while checking that the dictionaries stay the same on each processed chunk.
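The dispatch-and-check idea can be sketched in pure Python. This is an illustrative analogue, not Arrow's API: each chunk is modeled as a `(dictionary, indices)` pair, the dictionary is verified to be unchanged between chunks, and only the integer indices are hashed; values are materialized through the dictionary at the end.

```python
def chunked_dict_unique(chunks):
    """Unique values of dictionary-encoded chunks.

    chunks: iterable of (dictionary, indices) pairs, where dictionary is a
    list of values and indices is a list of ints (None marks a null).
    """
    dictionary = None
    seen = {}  # insertion-ordered set of index values
    for chunk_dict, indices in chunks:
        if dictionary is None:
            dictionary = chunk_dict
        elif chunk_dict != dictionary:
            # mirrors the "non-changing dictionaries" precondition
            raise ValueError("dictionary changed between chunks")
        for i in indices:
            if i is not None:
                seen[i] = None  # hash the index, not the decoded value
    # map the unique indices back through the shared dictionary
    return [dictionary[i] for i in seen]
```

Hashing indices instead of decoded values is what makes this cheap: the expensive value comparisons happen at most once per dictionary entry, not once per row.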
[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r446160460

```diff
## File path: python/pyarrow/scalar.pxi ##
@@ -16,1198 +16,704 @@
 # under the License.
 
-_NULL = NA = None
-
-
 cdef class Scalar:
     """
-    The base class for all array elements.
+    The base class for scalars.
     """
 
+    def __init__(self):
+        raise TypeError("Do not call {}'s constructor directly, use "
+                        "pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-    """
-    Singleton for null array elements.
-    """
-    # TODO rename this NullValue?
+    cdef void init(self, const shared_ptr[CScalar]& wrapped):
+        self.wrapped = wrapped
 
-    def __cinit__(self):
-        global NA
-        if NA is not None:
-            raise Exception('Cannot create multiple NAType instances')
+    @staticmethod
+    cdef wrap(const shared_ptr[CScalar]& wrapped):
+        cdef:
+            Scalar self
+            Type type_id = wrapped.get().type.get().id()
+
+        if type_id == _Type_NA:
+            return _NULL
+
+        typ = _scalar_classes[type_id]
+        self = typ.__new__(typ)
+        self.init(wrapped)
+
+        return self
+
+    cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+        return self.wrapped
 
-        self.type = null()
+    @property
+    def type(self):
+        return pyarrow_wrap_data_type(self.wrapped.get().type)
 
     def __repr__(self):
-        return 'NULL'
+        return ''.format(
+            self.__class__.__name__, self.as_py()
+        )
 
-    def as_py(self):
-        """
-        Return None
-        """
-        return None
+    def __str__(self):
+        return str(self.as_py())
 
     def __eq__(self, other):
-        return NA
+        # TODO(kszucs): use c++ Equals
+        if isinstance(other, Scalar):
+            other = other.as_py()
+        return self.as_py() == other
 
+    def __hash__(self):
+        # TODO(kszucs): use C++ hash if implemented for the type
+        return hash(self.as_py())
+
+    def as_py(self):
+        raise NotImplementedError()
 
-_NULL = NA = NullType()
+
+_NULL = NA = None
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
     """
-    The base class for non-null array elements.
+    Concrete class for null scalars.
     """
 
-    def __init__(self):
-        raise TypeError("Do not call {}'s constructor directly, use array "
-                        "subscription instead."
-                        .format(self.__class__.__name__))
-
-    cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-                   int64_t index):
-        self.type = type
-        self.index = index
-        self._set_array(sp_array)
+    def __cinit__(self):
+        global NA
+        if NA is not None:
+            raise Exception('Cannot create multiple NAType instances')
+        self.init(shared_ptr[CScalar](new CNullScalar()))
 
-    cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-        self.sp_array = sp_array
+    def __init__(self):
+        pass
 
-    def __repr__(self):
-        if hasattr(self, 'as_py'):
-            return repr(self.as_py())
-        else:
-            return super(Scalar, self).__repr__()
+    def __eq__(self, other):
+        return NA
 
-    def __str__(self):
-        if hasattr(self, 'as_py'):
-            return str(self.as_py())
-        else:
-            return super(Scalar, self).__str__()
+    def as_py(self):
+        """
+        Return this value as a Python None.
+        """
+        return None
 
-    def __eq__(self, other):
-        if hasattr(self, 'as_py'):
-            if isinstance(other, ArrayValue):
-                other = other.as_py()
-            return self.as_py() == other
-        else:
-            raise NotImplementedError(
-                "Cannot compare Arrow values that don't support as_py()")
 
-    def __hash__(self):
-        return hash(self.as_py())
+_NULL = NA = NullScalar()
 
-cdef class BooleanValue(ArrayValue):
+cdef class BooleanScalar(Scalar):
     """
-    Concrete class for boolean array elements.
+    Concrete class for boolean scalars.
     """
 
     def as_py(self):
         """
         Return this value as a Python bool.
         """
-        cdef CBooleanArray* ap = self.sp_array.get()
-        return ap.Value(self.index)
+        cdef CBooleanScalar* sp = self.wrapped.get()
+        return sp.value if sp.is_valid else None
 
-cdef class Int8Value(ArrayValue):
+cdef class UInt8Scalar(Scalar):
     """
-    Concrete class for int8 array elements.
+    Concrete class for uint8 scalars.
     """
 
     def as_py(self):
         """
         Return this value as a Python int.
         """
-        cdef CInt8Array* ap = self.sp_array.get()
-        return ap.Value(self.index)
+        cdef CUInt8Scalar* sp = self.wrapped.get()
+        return sp.value if sp.is_valid
```
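The `wrap` staticmethod in the diff above dispatches on the C++ type id to pick a concrete scalar class, then bypasses `__init__` with `__new__` so the guarded constructor cannot be called directly. A plain-Python analogue of that pattern (the names and the string type id here are illustrative, not pyarrow's actual registry) looks like:

```python
_scalar_classes = {}


def register(type_id):
    """Class decorator recording which scalar class handles a type id."""
    def decorator(cls):
        _scalar_classes[type_id] = cls
        return cls
    return decorator


class Scalar:
    def __init__(self):
        raise TypeError("Do not call {}'s constructor directly".format(
            self.__class__.__name__))

    @staticmethod
    def wrap(type_id, value):
        typ = _scalar_classes[type_id]
        inst = typ.__new__(typ)  # bypass __init__, like the Cython wrap()
        inst.value = value
        return inst


@register("bool")
class BooleanScalar(Scalar):
    def as_py(self):
        return self.value
```

The factory is the only sanctioned way to build instances: direct construction raises, while `wrap` returns the registered concrete subclass for the type id.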
[GitHub] [arrow] kszucs commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on pull request #7519: URL: https://github.com/apache/arrow/pull/7519#issuecomment-650158531

I'm considering applying [cython.freelist](https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#fast-instantiation) to the scalar extension classes. @wesm @pitrou @jorisvandenbossche do you have any positive experience with it? I assume we should measure its impact before using it, so I can defer it to a follow-up.
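For context, `@cython.freelist(N)` keeps up to N freed instances on a per-class free list so instantiation can recycle them instead of hitting the allocator. A rough pure-Python analogue of the idea is sketched below; unlike Cython, which hooks deallocation automatically, recycling here is explicit via `release()`, and the class and attribute names are invented for illustration.

```python
class PooledScalar:
    _freelist = []      # recycled instances awaiting reuse
    _freelist_max = 8   # like the N in @cython.freelist(N)

    def __new__(cls):
        if cls._freelist:
            # reuse a previously released instance, skipping allocation
            return cls._freelist.pop()
        return super().__new__(cls)

    def release(self):
        # cap the pool so it cannot grow without bound
        if len(type(self)._freelist) < self._freelist_max:
            type(self)._freelist.append(self)
```

The win is the same one the Cython docs describe: for small, short-lived objects created in tight loops (like scalars yielded while iterating an array), reusing freed instances avoids repeated allocator round-trips.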
[GitHub] [arrow] wesm commented on pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
wesm commented on pull request #7514: URL: https://github.com/apache/arrow/pull/7514#issuecomment-650151220

> @wesm I don't know what you mean by the `BitBlockCounter` treatment.

Ah sorry, welcome back =) We created some facilities to improve the performance of processing validity bitmaps: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit_block_counter.h

The idea is that you can scan forward 64 bits at a time and determine efficiently (using hardware popcount) whether a "block" of 64 values has any nulls or not (or whether the block is all null, too). For those 64 values you can skip null checking altogether and use the fast path. This speeds up processing _significantly_ for data that is mostly not null or mostly null.
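That block-scanning idea can be sketched in pure Python (function names are made up for the sketch; Arrow's C++ `BitBlockCounter` is more elaborate): popcount each 64-bit validity word once, then choose the all-valid fast path, the all-null skip, or the per-value slow path.

```python
def popcount64(word):
    # stands in for a hardware popcount instruction
    return bin(word & ((1 << 64) - 1)).count("1")


def masked_sum(values, validity_words):
    """Sum values, where bit i of validity_words[b] marks values[b*64+i] valid.

    Assumes bits beyond len(values) in the last word are zero.
    """
    total = 0
    for block, word in enumerate(validity_words):
        chunk = values[block * 64:(block + 1) * 64]
        pop = popcount64(word)
        if pop == len(chunk):
            # no nulls in this block: fast path, no per-value bit checks
            total += sum(chunk)
        elif pop == 0:
            # all null: skip the block entirely
            continue
        else:
            # mixed block: fall back to per-value null checks
            total += sum(v for i, v in enumerate(chunk) if word >> i & 1)
    return total
```

The per-value bit test only runs for mixed blocks, so mostly-valid or mostly-null data takes one popcount per 64 values instead of 64 bit checks.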
[GitHub] [arrow] wesm closed pull request #7541: ARROW-9224: [Dev][Archery] clone local source with --shared
wesm closed pull request #7541: URL: https://github.com/apache/arrow/pull/7541
[GitHub] [arrow] wesm commented on pull request #7541: ARROW-9224: [Dev][Archery] clone local source with --shared
wesm commented on pull request #7541: URL: https://github.com/apache/arrow/pull/7541#issuecomment-650150564 thanks!
[GitHub] [arrow] kszucs commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on pull request #7519: URL: https://github.com/apache/arrow/pull/7519#issuecomment-650149745

> Could we add a `is_valid` attribute to the python scalar as well? Now the only way to check for a null value is to do `.as_py() is None`?

Added.
[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r446148546 ## File path: python/pyarrow/scalar.pxi ## (quotes the same `scalar.pxi` diff hunk reproduced above)
[GitHub] [arrow] mrkn edited a comment on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index
mrkn edited a comment on pull request #7477: URL: https://github.com/apache/arrow/pull/7477#issuecomment-650148002

> > Without the canonical flag, we need to make a copy and sort the data of non-canonical sparse tensor when serializing it because the current SparseCOOIndex has the constraint that the indices matrix is sorted.
>
> Sorry, but I don't understand this sentence. If SparseCOOIndex has the constraint that the indices matrix is sorted, then it is canonical already. Why do you need to add a flag?

Oh, I didn't explain that this pull request removes that constraint. I'm sorry.
[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r446148105

```diff
## File path: python/pyarrow/tests/test_parquet.py ##
@@ -2028,7 +2028,7 @@ def test_filters_invalid_pred_op(tempdir, use_legacy_dataset):
                                use_legacy_dataset=use_legacy_dataset)
     assert dataset.read().num_rows == 0
 
-    with pytest.raises(ValueError if use_legacy_dataset else TypeError):
+    with pytest.raises(ValueError if use_legacy_dataset else pa.ArrowInvalid):
         # dataset API returns TypeError when trying create invalid comparison
```

Review comment: Nice, updated.
[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r446147913

```diff
## File path: python/pyarrow/_dataset.pyx ##
@@ -216,22 +216,18 @@ cdef class Expression:
 
     @staticmethod
     def _scalar(value):
         cdef:
-            shared_ptr[CScalar] scalar
-
-        if value is None:
-            scalar.reset(new CNullScalar())
-        elif isinstance(value, bool):
-            scalar = MakeScalar(value)
-        elif isinstance(value, float):
-            scalar = MakeScalar(value)
-        elif isinstance(value, int):
-            scalar = MakeScalar(value)
-        elif isinstance(value, (bytes, str)):
-            scalar = MakeStringScalar(tobytes(value))
```

Review comment: Removed it.
[GitHub] [arrow] mrkn commented on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index
mrkn commented on pull request #7477: URL: https://github.com/apache/arrow/pull/7477#issuecomment-650148002

> > Without the canonical flag, we need to make a copy and sort the data of non-canonical sparse tensor when serializing it because the current SparseCOOIndex has the constraint that the indices matrix is sorted.
>
> Sorry, but I don't understand this sentence. If SparseCOOIndex has the constraint that the indices matrix is sorted, then it is canonical already. Why do you need to add a flag?

Oh, I neglected to explain that this pull request removes that constraint.
[GitHub] [arrow] romainfrancois commented on pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
romainfrancois commented on pull request #7514: URL: https://github.com/apache/arrow/pull/7514#issuecomment-650122091

Also deals with LargeBinary now, e.g. https://issues.apache.org/jira/browse/ARROW-6543

``` r
library(arrow, warn.conflicts = FALSE)
a <- Array$create(list(raw()), type = large_binary())
a
#> Array
#>
#> [
#>
#> ]
a$as_vector()
#> [1]>
#> [[1]]
#> raw(0)
```

Created on 2020-06-26 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)
[GitHub] [arrow] rymurr commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
rymurr commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r446103501

```diff
## File path: java/dataset/src/test/resources/avroschema/user.avsc ##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+{
```

Review comment: I personally agree with @emkornfield on this one. While the code is harder to read than the JSON, it's more compact and isolated.
[GitHub] [arrow] pitrou commented on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index
pitrou commented on pull request #7477: URL: https://github.com/apache/arrow/pull/7477#issuecomment-650091038

> Without the canonical flag, we need to make a copy and sort the data of non-canonical sparse tensor when serializing it because the current SparseCOOIndex has the constraint that the indices matrix is sorted.

Sorry, but I don't understand this sentence. If SparseCOOIndex has the constraint that the indices matrix is sorted, then it is canonical already. Why do you need to add a flag?
[GitHub] [arrow] mrkn commented on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index
mrkn commented on pull request #7477: URL: https://github.com/apache/arrow/pull/7477#issuecomment-650089924

@pitrou Without the canonical flag, we need to copy and sort the data of a non-canonical sparse tensor when serializing it, because the current SparseCOOIndex has the constraint that the indices matrix is sorted. Moreover, we lose the information that the original sparse tensor is not canonical. In contrast, with the canonical flag, we can preserve the canonicality of the original sparse tensor and can share the data without a copy and sort. Unfortunately, because SciPy's coo_matrix manages its indices as separate arrays, we need to make a copy of the indices for serialization even with the canonical flag. This issue can be solved by the SparseSplitCOOTensor proposed in #7044.
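The distinction being discussed can be sketched in a few lines of Python (illustrative helpers, not Arrow's API): a COO index is canonical when its coordinate rows are lexicographically sorted with no duplicates, and a non-canonical one must be copied and sorted before it satisfies the old constraint — exactly the copy the flag avoids.

```python
def is_canonical(indices):
    """indices: list of coordinate tuples, e.g. [(0, 0), (0, 1), (2, 1)].

    Canonical means strictly increasing in lexicographic order, which also
    rules out duplicate coordinates.
    """
    return all(a < b for a, b in zip(indices, indices[1:]))


def canonicalize(indices, data):
    """The copy-and-sort that serialization had to do without the flag."""
    order = sorted(range(len(indices)), key=lambda i: indices[i])
    return [indices[i] for i in order], [data[i] for i in order]
```

With a canonical flag carried alongside the index, a producer that already knows `is_canonical(...)` is true can serialize its buffers as-is, and a consumer can skip re-checking or re-sorting.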
[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
romainfrancois commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r446041588

```diff
## File path: r/src/array_from_vector.cpp ##
@@ -1067,12 +1110,22 @@ std::shared_ptr InferArrowTypeFromVector(SEXP x) {
   if (Rf_inherits(x, "data.frame")) {
     return InferArrowTypeFromDataFrame(x);
   } else {
-    if (XLENGTH(x) == 0) {
-      Rcpp::stop(
-          "Requires at least one element to infer the values' type of a list vector");
-    }
+    SEXP ptype = Rf_getAttrib(x, symbols::ptype);
+    if (ptype == R_NilValue) {
+      if (XLENGTH(x) == 0) {
+        Rcpp::stop(
+            "Requires at least one element to infer the values' type of a list vector");
+      }
 
-    return arrow::list(InferArrowType(VECTOR_ELT(x, 0)));
+      return arrow::list(InferArrowType(VECTOR_ELT(x, 0)));
+    } else {
+      // special case list(raw()) -> BinaryArray
+      if (TYPEOF(ptype) == RAWSXP) {
+        return arrow::binary();
+      }
+
+      return arrow::list(InferArrowType(ptype));
```

Review comment: Done. I had to modify the roundtrip checks to use `expect_equivalent()` because a roundtrip might add information:

```
list() -> List Array -> list_of( ptype = )
```
[GitHub] [arrow] rdettai commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem
rdettai commented on pull request #7547: URL: https://github.com/apache/arrow/pull/7547#issuecomment-650027175

Great! AWS is going to be surprised to see its worldwide S3 HEAD request rate drop by half overnight!
[GitHub] [arrow] rdettai commented on a change in pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem
rdettai commented on a change in pull request #7547: URL: https://github.com/apache/arrow/pull/7547#discussion_r446015106

```diff
## File path: cpp/src/arrow/filesystem/s3fs.cc ##
@@ -397,9 +399,17 @@ class ObjectInputFile : public io::RandomAccessFile {
   ObjectInputFile(Aws::S3::S3Client* client, const S3Path& path)
       : client_(client), path_(path) {}
 
+  ObjectInputFile(Aws::S3::S3Client* client, const S3Path& path, int64_t size)
+      : client_(client), path_(path), content_length_(size) {}
+
   Status Init() {
     // Issue a HEAD Object to get the content-length and ensure any
     // errors (e.g. file not found) don't wait until the first Read() call.
```

Review comment: This comment got a little bit out of sync with the code below ;-)
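The effect of the second constructor can be sketched in Python (the fake client below is invented for the sketch, not the AWS SDK): when the object size is already known, e.g. from a directory listing, initialization no longer needs a HEAD request.

```python
class FakeS3Client:
    """Stand-in client that counts HEAD requests."""

    def __init__(self):
        self.head_calls = 0

    def head_object(self, path):
        self.head_calls += 1
        return 123  # pretend content length


class ObjectInputFile:
    def __init__(self, client, path, size=None):
        self.client = client
        self.path = path
        self.content_length = size  # None means "unknown, must ask S3"

    def init(self):
        # Only issue a HEAD when the caller did not supply the size.
        if self.content_length is None:
            self.content_length = self.client.head_object(self.path)
```

Opening many files discovered through a single LIST call then costs zero HEAD requests, which is where the rate drop mentioned above comes from.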
[GitHub] [arrow] houqp commented on a change in pull request #7500: ARROW-9191: [Rust] Do not panic when milliseconds is less than zero as chrono can handle…
houqp commented on a change in pull request #7500: URL: https://github.com/apache/arrow/pull/7500#discussion_r445998007

```diff
## File path: rust/parquet/src/record/api.rs ##
@@ -893,16 +893,6 @@ mod tests {
         assert_eq!(row, Field::TimestampMillis(123854406));
     }
 
-    #[test]
-    #[should_panic(expected = "Expected non-negative milliseconds when converting Int96")]
-    fn test_row_convert_int96_invalid() {
-        // INT96 value does not depend on logical type
-        let descr = make_column_descr![PhysicalType::INT96, LogicalType::NONE];
-
-        let value = Int96::from(vec![0, 0, 0]);
-        Field::convert_int96(, value);
-    }
-
```

Review comment: I agree that it's better to implement timestamp as signed; not only does it address support for times before 1970, it also makes subtractions much easier to work with.
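For reference, signed milliseconds relative to the Unix epoch represent pre-1970 instants naturally; negative values need no special-casing. A minimal Python illustration (not the Rust crate's code):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_datetime(ms):
    """Signed milliseconds since the Unix epoch; negative means before 1970."""
    return EPOCH + timedelta(milliseconds=ms)
```

With signed arithmetic, `millis_to_datetime(-86400000)` is simply one day before the epoch, and subtracting two timestamps is ordinary integer subtraction rather than a panic-prone unsigned operation.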
[GitHub] [arrow] romainfrancois commented on pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
romainfrancois commented on pull request #7514: URL: https://github.com/apache/arrow/pull/7514#issuecomment-650002855 @wesm I don't know what you mean by the `BitBlockCounter` treatment.
[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
romainfrancois commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r445994055

```diff
## File path: r/src/array_from_vector.cpp ##
@@ -915,6 +924,39 @@ class Time64Converter : public TimeConverter {
   }
 };
 
+class BinaryVectorConverter : public VectorConverter {
+ public:
+  ~BinaryVectorConverter() {}
+
+  Status Init(ArrayBuilder* builder) {
+    typed_builder_ = checked_cast(builder);
+    return Status::OK();
+  }
+
+  Status Ingest(SEXP obj) {
+    ARROW_RETURN_IF(TYPEOF(obj) != VECSXP, Status::RError("Expecting a list"));
+    R_xlen_t n = XLENGTH(obj);
+    for (R_xlen_t i = 0; i < n; i++) {
+      SEXP obj_i = VECTOR_ELT(obj, i);
+      if (Rf_isNull(obj_i)) {
+        RETURN_NOT_OK(typed_builder_->AppendNull());
+      } else {
+        ARROW_RETURN_IF(TYPEOF(obj_i) != RAWSXP,
+                        Status::RError("Expecting a raw vector"));
+        RETURN_NOT_OK(typed_builder_->Append(RAW(obj_i), XLENGTH(obj_i)));
```

Review comment: Thanks.
[GitHub] [arrow] romainfrancois commented on a change in pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
romainfrancois commented on a change in pull request #7524: URL: https://github.com/apache/arrow/pull/7524#discussion_r445988639

```diff
## File path: r/src/table.cpp ##
@@ -172,11 +172,20 @@ std::shared_ptr Table__from_dots(SEXP lst, SEXP schema_sxp) {
   std::shared_ptr schema;
 
   if (Rf_isNull(schema_sxp)) {
-    // infer the schema from the ...
+    // infer the schema from the `...`
```

Review comment: Thanks. They share some code now.