[GitHub] [arrow] rok commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string
rok commented on a change in pull request #9606:
URL: https://github.com/apache/arrow/pull/9606#discussion_r587790806

## File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc

```
@@ -231,6 +252,25 @@ TEST_F(TestIsInKernel, Decimal) {
                       /*skip_nulls=*/true);
 }

+TEST_F(TestIsInKernel, DictionaryArray) {
+  for (auto index_ty : all_dictionary_index_types()) {
+    CheckIsInDictionary(/*type=*/utf8(),
+                        /*index_type=*/index_ty,
+                        /*input_dictionary_json=*/R"(["A", "B", "C", "D"])",
+                        /*input_index_json=*/"[1, 2, null, 0]",
+                        /*value_set_json=*/R"(["A", "B", "C"])",
+                        /*expected_json=*/"[true, true, false, true]",
+                        /*skip_nulls=*/false);
+    CheckIsInDictionary(/*type=*/float32(),
```

Review comment: Added.
[GitHub] [arrow] rok commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string
rok commented on a change in pull request #9606:
URL: https://github.com/apache/arrow/pull/9606#discussion_r587790691

## File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc

```
@@ -72,6 +72,27 @@ void CheckIsInChunked(const std::shared_ptr<ChunkedArray>& input,
   AssertChunkedEqual(*expected, *actual);
 }

+void CheckIsInDictionary(const std::shared_ptr<DataType>& type,
+                         const std::shared_ptr<DataType>& index_type,
+                         const std::string& input_dictionary_json,
+                         const std::string& input_index_json,
+                         const std::string& value_set_json,
+                         const std::string& expected_json, bool skip_nulls = false) {
+  auto dict_type = dictionary(index_type, type);
+  auto indices = ArrayFromJSON(index_type, input_index_json);
+  auto dict = ArrayFromJSON(type, value_set_json);
```

Review comment: Ugh, thanks for catching that. Fixed.
[GitHub] [arrow] github-actions[bot] commented on pull request #9630: ARROW-11860: [Rust] [DataFusion] Add DataFusion logos
github-actions[bot] commented on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790895297

https://issues.apache.org/jira/browse/ARROW-11860
[GitHub] [arrow] ericwburden commented on pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
ericwburden commented on pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#issuecomment-790893609

Sorry, just learned that the "Close" button on my phone didn't mean the window...
[GitHub] [arrow] ericwburden closed pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
ericwburden closed pull request #9624:
URL: https://github.com/apache/arrow/pull/9624
[GitHub] [arrow] alamb commented on a change in pull request #9494: ARROW-11626: [Rust][DataFusion] Move [DataFusion] examples to own project
alamb commented on a change in pull request #9494:
URL: https://github.com/apache/arrow/pull/9494#discussion_r58078

## File path: rust/datafusion-examples/Cargo.toml

```
@@ -0,0 +1,38 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "datafusion-examples"
```

Review comment: FYI @andygrove -- I am not sure what you think about possibly releasing the examples as a crate of their own. I personally think it would be necessary. However, perhaps the right way would be to eventually somehow link to the example code from the docs.rs page https://docs.rs/datafusion/3.0.0/datafusion/

I filed this JIRA https://issues.apache.org/jira/browse/ARROW-11863 to track that

## File path: rust/datafusion/Cargo.toml

```
@@ -71,9 +71,6 @@ unicode-segmentation = "^1.7.1"
 rand = "0.8"
 criterion = "0.3"
 tempfile = "3"
-prost = "0.7"
```

Review comment:
[GitHub] [arrow] ianmcook commented on pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#issuecomment-790887608

@github-actions crossbow submit -g r
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587765098

## File path: r/tools/autobrew

```
@@ -48,7 +48,8 @@ fi
 # Hardcode this for my custom autobrew build
 rm -f $BREWDIR/lib/*.dylib
 AWS_LIBS="-laws-cpp-sdk-config -laws-cpp-sdk-transfer -laws-cpp-sdk-identity-management -laws-cpp-sdk-cognito-identity -laws-cpp-sdk-sts -laws-cpp-sdk-s3 -laws-cpp-sdk-core -laws-c-event-stream -laws-checksums -laws-c-common -lpthread -lcurl"
-PKG_LIBS="-L$BREWDIR/lib -lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_LIBS="-lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_DIRS="-L$BREWDIR/lib"
```

Review comment: I created ARROW-11861 to remind us of this.
[GitHub] [arrow] alamb commented on a change in pull request #9588: ARROW-11799: [Rust] fix len of string and binary arrays created from unbound iterator
alamb commented on a change in pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#discussion_r587764593

## File path: rust/arrow/src/array/array_binary.rs

```
@@ -258,6 +258,8 @@ where
     }
 }

+// calculate actual data_len, which may be different from the iterator's upper bound
+let data_len = offsets.len() - 1;
```

Review comment: After some more study I agree (that further improvements could be made, but this PR makes the situation better so it is good as it is)
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587761603

## File path: r/R/dataset-partition.R

```
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {
```

Review comment: Done in 26753e4e0ed08739467588cfb3ddcfe09f5aebcc. I will change the `\dontrun`s back to `\donttest` (or remove them altogether if possible) when I implement `@examplesIf`, as noted in ARROW-11849
[GitHub] [arrow] alamb commented on a change in pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex
alamb commented on a change in pull request #9625:
URL: https://github.com/apache/arrow/pull/9625#discussion_r587742457

## File path: rust/datafusion/tests/sql.rs

```
@@ -2051,13 +2054,19 @@ async fn test_string_expressions() -> Result<()> {
     test_expression!("character_length('chars')", "5");
     test_expression!("character_length('josé')", "4");
     test_expression!("character_length(NULL)", "NULL");
+    test_expression!("chr(CAST(120 AS int))", "x");
+    test_expression!("chr(CAST(128175 AS int))", "💯");
```

Review comment: indeed!

## File path: rust/datafusion/src/physical_plan/functions.rs

```
@@ -1033,6 +1163,54 @@ mod tests {

     #[test]
     fn test_functions() -> Result<()> {
+        test_function!(
+            Ascii,
+            &[lit(ScalarValue::Utf8(Some("x".to_string())))],
+            Ok(Some(120)),
+            i32,
+            Int32,
+            Int32Array
+        );
+        test_function!(
+            Ascii,
+            &[lit(ScalarValue::Utf8(Some("ésoj".to_string())))],
+            Ok(Some(233)),
+            i32,
+            Int32,
+            Int32Array
+        );
+        test_function!(
+            Ascii,
+            &[lit(ScalarValue::Utf8(Some("💯".to_string())))],
+            Ok(Some(128175)),
```

Review comment: It is strange to me that a function called `ascii` can return something larger than `127`. However, it seems that quirkiness / awesomeness came from `postgres` :)

```
alamb=# select ascii('💯');
 ascii
--------
 128175
(1 row)
```

## File path: rust/datafusion/src/physical_plan/string_expressions.rs

```
@@ -658,32 +709,9 @@ pub fn rpad(args: &[ArrayRef]) -> Result<ArrayRef> {
         Ok(Arc::new(result) as ArrayRef)
     }
     3 => {
-        let string_array: &GenericStringArray<T> = args[0]
-            .as_any()
-            .downcast_ref::<GenericStringArray<T>>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast string to StringArray".to_string(),
-                )
-            })?;
-
-        let length_array: &Int64Array = args[1]
-            .as_any()
-            .downcast_ref::<Int64Array>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast length to Int64Array".to_string(),
-                )
-            })?;
-
-        let fill_array: &GenericStringArray<T> = args[2]
-            .as_any()
-            .downcast_ref::<GenericStringArray<T>>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast fill to StringArray".to_string(),
-                )
-            })?;
+        let string_array = downcast_string_arg!(args[0], "string", T);
```

Review comment: these macros certainly make the code nicer to read.
[GitHub] [arrow] pitrou commented on a change in pull request #9633: ARROW-11560: [C++][FlightRPC] fix mutex error on SIGINT
pitrou commented on a change in pull request #9633:
URL: https://github.com/apache/arrow/pull/9633#discussion_r587752941

## File path: cpp/src/arrow/flight/server.cc

```
@@ -899,6 +921,13 @@ Status FlightServerBase::Serve() {
     ARROW_ASSIGN_OR_RAISE(old_handler, SetSignalHandler(signum, new_handler));
     impl_->old_signal_handlers_.push_back(std::move(old_handler));
   }
+#ifndef _WIN32
+  if (sem_init(&impl_->waiting_sem_, /*pshared=*/0, /*value=*/0) != 0) {
```

Review comment: Unfortunately I don't think macOS supports anonymous semaphores. An alternative would be to use a non-blocking anonymous pipe. `write` is async-signal-safe.
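For context, a minimal sketch of the self-pipe alternative pitrou suggests (not the code in this PR): the handler does nothing but `write` one byte, which is async-signal-safe, while an ordinary thread blocks in `read` and performs the shutdown. The names here are hypothetical and error handling is trimmed; assumes a POSIX system.

```
// Self-pipe trick: signal handler writes a byte to a non-blocking pipe;
// a normal thread blocks in read() and then does the heavy shutdown work.
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static int g_wakeup_pipe[2];  // [0] = read end, [1] = write end

extern "C" void HandleSigint(int) {
  const char byte = 0;
  // write() is async-signal-safe; if the pipe is full we can ignore EAGAIN,
  // since one pending byte is enough to wake the waiter.
  (void)write(g_wakeup_pipe[1], &byte, 1);
}

int main() {
  (void)pipe(g_wakeup_pipe);
  // Make the write end non-blocking so the handler can never block.
  fcntl(g_wakeup_pipe[1], F_SETFL, O_NONBLOCK);
  signal(SIGINT, HandleSigint);

  char byte;
  // Block here (retrying on EINTR) until the handler pokes the pipe.
  while (read(g_wakeup_pipe[0], &byte, 1) < 0) {
  }
  // ... shut down the server here, outside signal-handler context ...
  return 0;
}
```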
[GitHub] [arrow] lidavidm opened a new pull request #9633: ARROW-11560: [C++][FlightRPC] fix mutex error on SIGINT
lidavidm opened a new pull request #9633:
URL: https://github.com/apache/arrow/pull/9633

Recently, interrupting a Python Flight server started aborting the process instead, because gRPC on Linux started using some code which is not signal-safe. This PR fixes that by spawning a separate thread that waits on a semaphore, then shuts down the server; the semaphore is notified by the signal handler (which is a signal-safe operation). This is a little janky, but it fixes the issue.

This still builds on Windows in my tests, but I couldn't actually import the resulting PyArrow to test it. However, PyArrow 3.0 via Conda still works on Windows. I'm not able to test on macOS myself.
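A rough, self-contained sketch of the scheme described above, using placeholder names rather than the actual Flight classes, and assuming anonymous POSIX semaphores are available (the macOS caveat is discussed in the review above). The key property is that the handler only calls `sem_post`, which is async-signal-safe; the non-signal-safe gRPC shutdown runs on an ordinary thread.

```
// Signal handler posts a semaphore; a helper thread waits on it and then
// performs the actual (non-signal-safe) server shutdown.
#include <semaphore.h>
#include <signal.h>
#include <cstdio>
#include <thread>

static sem_t g_shutdown_sem;

extern "C" void HandleSigint(int) {
  sem_post(&g_shutdown_sem);  // the only async-signal-safe step taken here
}

int main() {
  sem_init(&g_shutdown_sem, /*pshared=*/0, /*value=*/0);
  signal(SIGINT, HandleSigint);

  std::thread waiter([] {
    // Retry on EINTR; returns once the handler posts the semaphore.
    while (sem_wait(&g_shutdown_sem) != 0) {
    }
    std::puts("interrupted; shutting down");
    // server->Shutdown();  // heavy, non-signal-safe work happens here
  });

  // server->Wait();  // would block until Shutdown() is called by the waiter
  waiter.join();      // this demo simply exits after Ctrl-C
  return 0;
}
```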
[GitHub] [arrow] lidavidm commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete
lidavidm commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587733815

## File path: cpp/src/arrow/io/caching.cc

```
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }

+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;
```

Review comment: Future<T> is already convertible to Future<>. So what you suggest would work, but we'd need the fail-fast behavior, yes. A flag like that, though, does change the semantics rather dramatically; I'd rather keep them as separate functions.
[GitHub] [arrow] alamb commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]
alamb commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-790849774

switching to draft so it is clear this is not being merged as is and instead is being broken up
[GitHub] [arrow] alamb closed pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
alamb closed pull request #9595:
URL: https://github.com/apache/arrow/pull/9595
[GitHub] [arrow] westonpace commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete
westonpace commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587731880

## File path: cpp/src/arrow/io/caching.cc

```
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }

+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;
```

Review comment: I wonder if it might be better to add `fail_fast` as an optional flag to the existing method. We may someday want `Future<std::vector<Result<T>>>` but also want the `fail_fast` behavior.

Also, just speculation, but I wonder if it would be possible to have a cast (not sure if explicit or implicit) from `Future<T>` -> `Future<>`. Under the hood it could be `base.Then([] (const Result<T>& result) -> Status { return Status::OK(); })` (or perhaps there is a more efficient way to achieve the same thing). Then it could just be `static_cast<Future<>>(All(futures, Future::FailFast))`
[GitHub] [arrow] alamb closed pull request #9630: ARROW-11860: [Rust] [DataFusion] Add DataFusion logos
alamb closed pull request #9630:
URL: https://github.com/apache/arrow/pull/9630
[GitHub] [arrow] pitrou commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations
pitrou commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790847276

We do use AWS async APIs in a couple of places, but you're right that this is not mandatory. OTOH, if we don't use them then we may have a double indirection (first our own executor, then AWS would use its own executor? or are the sync AWS APIs really sync?).
[GitHub] [arrow] alamb edited a comment on pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos
alamb edited a comment on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790846558

I think we can get this logo in and then add it to the docs as a follow-on PR, as @Dandandan suggests
[GitHub] [arrow] alamb commented on pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos
alamb commented on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790845573

> I don't think this needs a JIRA?

@andygrove I will apply my new JIRA making script (https://github.com/apache/arrow/pull/9598) to make one anyways :) so there is no question :)
[GitHub] [arrow] westonpace commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations
westonpace commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790845365

@pitrou Looks good to me. I wonder if we will even need to use AWS' async mode? It seems that calling into S3 this way is pretty much equivalent to overriding their executor with our own. Callers will have to take care to know that these futures will complete on the IOContext and transfer them off quickly, but looking at it now I agree that the transfer doesn't belong in here. Looks great.
[GitHub] [arrow] alamb commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
alamb commented on a change in pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#discussion_r587725662

## File path: rust/arrow/src/datatypes/native.rs

```
@@ -50,6 +50,12 @@ pub trait ArrowNativeType:
         None
     }

+    /// Convert native type to isize.
```

Review comment: I think this is a good pattern and follows the existing `to_usize` etc. that convert the arrow types to rust types.
[GitHub] [arrow] lidavidm commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete
lidavidm commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587706050

## File path: cpp/src/arrow/io/caching.cc

```
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }

+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;
```

Review comment: `All` gives `Future<std::vector<Result<T>>>` and never marks the future failed; this gives `Future<>` and marks the future failed if any individual result failed. Additionally, this 'fails fast': if any fail, you immediately get a result, instead of waiting for the rest of the futures.
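For contrast, a self-contained model of the "fail fast" behavior lidavidm describes, using plain std:: types rather than `arrow::Future` (the `FailFastWaiter` name and `Status` struct are illustrative stand-ins): the first failure completes the aggregate immediately, otherwise the last success does, mirroring the `fetch_sub` countdown in the diff.

```
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <utility>

struct Status {
  bool ok;
  std::string message;
};

class FailFastWaiter {
 public:
  FailFastWaiter(size_t n, std::function<void(Status)> done)
      : remaining_(n), done_(std::move(done)) {}

  void ReportResult(Status st) {
    if (!st.ok) {
      // Fail fast: the first error resolves the aggregate without waiting.
      if (!completed_.exchange(true)) done_(std::move(st));
      return;
    }
    // Mirrors `state->remaining.fetch_sub(1) != 1` above: only the last
    // successful completion resolves the aggregate.
    if (remaining_.fetch_sub(1) == 1 && !completed_.exchange(true)) {
      done_(Status{true, ""});
    }
  }

 private:
  std::atomic<size_t> remaining_;
  std::atomic<bool> completed_{false};
  std::function<void(Status)> done_;
};

int main() {
  FailFastWaiter waiter(3, [](Status st) {
    std::printf("aggregate: %s\n", st.ok ? "ok" : st.message.c_str());
  });
  waiter.ReportResult({true, ""});
  waiter.ReportResult({false, "read error"});  // resolves immediately
  waiter.ReportResult({true, ""});             // no effect; already completed
  return 0;
}
```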
[GitHub] [arrow] westonpace commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete
westonpace commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587704738

## File path: cpp/src/arrow/io/caching.cc

```
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }

+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;
```

Review comment: What is the difference between this and `All` inside `future.h`?
[GitHub] [arrow] Dandandan edited a comment on pull request #9494: ARROW-11626: [Rust][DataFusion] Move [DataFusion] examples to own project
Dandandan edited a comment on pull request #9494:
URL: https://github.com/apache/arrow/pull/9494#issuecomment-790819802

PR is resurrected! (I think) @alamb
[GitHub] [arrow] pitrou commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations
pitrou commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790818363

I tested HDFS here: https://github.com/ursacomputing/crossbow/runs/2033412067
[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API
pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587691144

## File path: cpp/src/arrow/util/cancel.cc

```
@@ -0,0 +1,167 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/cancel.h"
+
+#include <atomic>
+#include <mutex>
+#include <utility>
+#include <vector>
+
+#include "arrow/result.h"
+#include "arrow/util/atomic_shared_ptr.h"
+#include "arrow/util/io_util.h"
+#include "arrow/util/logging.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+
+#if ATOMIC_INT_LOCK_FREE != 2
+#error Lock-free atomic int required for signal safety
+#endif
+
+using internal::ReinstateSignalHandler;
+using internal::SetSignalHandler;
+using internal::SignalHandler;
+
+// NOTE: We care mainly about making the common case (not cancelled) fast.
+
+struct StopSourceImpl {
+  std::atomic<int> requested_{0};  // will be -1 or signal number if requested
+  std::mutex mutex_;
+  Status cancel_error_;
+};
+
+StopSource::StopSource() : impl_(new StopSourceImpl) {}
+
+StopSource::~StopSource() = default;
+
+void StopSource::RequestStop() { RequestStop(Status::Cancelled("Operation cancelled")); }
+
+void StopSource::RequestStop(Status st) {
+  std::lock_guard<std::mutex> lock(impl_->mutex_);
+  DCHECK(!st.ok());
+  if (!impl_->requested_) {
+    impl_->requested_ = -1;
+    impl_->cancel_error_ = std::move(st);
+  }
+}
+
+void StopSource::RequestStopFromSignal(int signum) {
+  // Only async-signal-safe code allowed here
+  impl_->requested_.store(signum);
+}
+
+StopToken StopSource::token() { return StopToken(impl_); }
+
+bool StopToken::IsStopRequested() {
+  if (!impl_) {
+    return false;
+  }
+  return impl_->requested_.load() != 0;
+}
+
+Status StopToken::Poll() {
+  if (!impl_) {
+    return Status::OK();
+  }
+  if (!impl_->requested_.load()) {
+    return Status::OK();
+  }
+
+  std::lock_guard<std::mutex> lock(impl_->mutex_);
+  if (impl_->cancel_error_.ok()) {
+    auto signum = impl_->requested_.load();
+    DCHECK_GT(signum, 0);
+    impl_->cancel_error_ = internal::CancelledFromSignal(signum, "Operation cancelled");
+  }
```

Review comment: If you call Poll() twice, then the cancel_error corresponding to the signal will have been cached. So you can have both a cancel_error and a positive signal number.
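To make the polling contract concrete, here is a tiny self-contained model of how a task would consume this API: a cheap atomic check per iteration, with `Poll()` materializing the error status. The `Status` and `StopToken` stand-ins below are simplified mock-ups of the classes in this diff, not the real Arrow types.

```
#include <atomic>
#include <cstdio>
#include <string>
#include <utility>

struct Status {
  bool ok;
  std::string message;
  static Status OK() { return {true, ""}; }
  static Status Cancelled(std::string msg) { return {false, std::move(msg)}; }
};

class StopToken {
 public:
  explicit StopToken(std::atomic<int>* requested) : requested_(requested) {}

  // Common case stays fast: a single atomic load, no locking.
  bool IsStopRequested() const { return requested_->load() != 0; }

  // Turns the raw flag into an error status, as the PR's Poll() does.
  Status Poll() const {
    return IsStopRequested() ? Status::Cancelled("Operation cancelled")
                             : Status::OK();
  }

 private:
  std::atomic<int>* requested_;
};

// A long-running task checks the cheap flag each iteration and bails out
// with the materialized error once a stop has been requested.
Status RunLoop(StopToken token, int n) {
  for (int i = 0; i < n; ++i) {
    if (token.IsStopRequested()) return token.Poll();
    // ... one unit of work ...
  }
  return Status::OK();
}

int main() {
  std::atomic<int> requested{0};
  StopToken token(&requested);
  requested.store(-1);  // simulates StopSource::RequestStop()
  Status st = RunLoop(token, 1000);
  std::printf("%s\n", st.ok ? "ok" : st.message.c_str());
  return 0;
}
```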
[GitHub] [arrow] Dandandan commented on pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos
Dandandan commented on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790812860

LGTM! Maybe we can also add it to /rust/datafusion/README.md like here: https://github.com/andygrove/datafusion ?
[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API
pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587690293

## File path: cpp/src/arrow/util/thread_pool_test.cc

```
@@ -137,30 +136,47 @@ class TestThreadPool : public ::testing::Test {
     return *ThreadPool::Make(threads);
   }

-  void SpawnAdds(ThreadPool* pool, int nadds, AddTaskFunc add_func) {
-    AddTester add_tester(nadds);
+  void SpawnAdds(ThreadPool* pool, int nadds, AddTaskFunc add_func,
+                 StopToken stop_token = StopToken::Unstoppable(),
+                 StopSource* stop_source = nullptr) {
```

Review comment: Passing the token without the source tests a source that's not stopped, so it's deliberate. I could try to add overloads though.
[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API
pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587693382

## File path: cpp/src/arrow/csv/reader.cc

```
@@ -833,9 +843,10 @@ class AsyncThreadedTableReader
   AsyncThreadedTableReader(MemoryPool* pool, std::shared_ptr<io::InputStream> input,
                            const ReadOptions& read_options, const ParseOptions& parse_options,
-                           const ConvertOptions& convert_options, Executor* cpu_executor,
-                           Executor* io_executor)
-      : BaseTableReader(pool, input, read_options, parse_options, convert_options),
+                           const ConvertOptions& convert_options, StopToken stop_token,
+                           Executor* cpu_executor, Executor* io_executor)
```

Review comment: We must support the legacy `TableReader::Make` taking both a `MemoryPool*` and an `IOContext`.
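The compatibility constraint pitrou mentions is the usual "legacy factory" shape. A hypothetical sketch (stand-in types, not the real `csv::TableReader` API) of keeping the old `MemoryPool*` entry point while the constructor itself takes an `IOContext`:

```
#include <memory>

struct MemoryPool {};

// Stand-in for arrow::io::IOContext: bundles the pool (and, in the real
// class, executor and stop token) behind one parameter.
struct IOContext {
  explicit IOContext(MemoryPool* pool) : pool(pool) {}
  MemoryPool* pool;
};

class TableReader {
 public:
  // New-style constructor: all I/O-related state arrives via IOContext.
  explicit TableReader(IOContext io_context) : io_context_(io_context) {}

  // Legacy entry point: adapt the bare MemoryPool* into an IOContext so old
  // callers keep working while the implementation only handles IOContext.
  static std::unique_ptr<TableReader> Make(MemoryPool* pool) {
    return std::unique_ptr<TableReader>(new TableReader(IOContext(pool)));
  }

 private:
  IOContext io_context_;
};
```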
[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API
pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587692069

## File path: cpp/src/arrow/util/cancel.cc

```
@@ -0,0 +1,167 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/cancel.h"
+
+#include <atomic>
+#include <mutex>
+#include <utility>
+#include <vector>
+
+#include "arrow/result.h"
+#include "arrow/util/atomic_shared_ptr.h"
+#include "arrow/util/io_util.h"
+#include "arrow/util/logging.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+
+#if ATOMIC_INT_LOCK_FREE != 2
+#error Lock-free atomic int required for signal safety
+#endif
+
+using internal::ReinstateSignalHandler;
+using internal::SetSignalHandler;
+using internal::SignalHandler;
+
+// NOTE: We care mainly about making the common case (not cancelled) fast.
+
+struct StopSourceImpl {
+  std::atomic<int> requested_{0};  // will be -1 or signal number if requested
+  std::mutex mutex_;
+  Status cancel_error_;
+};
+
+StopSource::StopSource() : impl_(new StopSourceImpl) {}
+
+StopSource::~StopSource() = default;
+
+void StopSource::RequestStop() { RequestStop(Status::Cancelled("Operation cancelled")); }
+
+void StopSource::RequestStop(Status st) {
+  std::lock_guard<std::mutex> lock(impl_->mutex_);
+  DCHECK(!st.ok());
+  if (!impl_->requested_) {
+    impl_->requested_ = -1;
+    impl_->cancel_error_ = std::move(st);
+  }
+}
+
+void StopSource::RequestStopFromSignal(int signum) {
+  // Only async-signal-safe code allowed here
+  impl_->requested_.store(signum);
+}
+
+StopToken StopSource::token() { return StopToken(impl_); }
+
+bool StopToken::IsStopRequested() {
+  if (!impl_) {
+    return false;
+  }
+  return impl_->requested_.load() != 0;
+}
+
+Status StopToken::Poll() {
+  if (!impl_) {
+    return Status::OK();
+  }
+  if (!impl_->requested_.load()) {
+    return Status::OK();
+  }
+
+  std::lock_guard<std::mutex> lock(impl_->mutex_);
+  if (impl_->cancel_error_.ok()) {
+    auto signum = impl_->requested_.load();
+    DCHECK_GT(signum, 0);
+    impl_->cancel_error_ = internal::CancelledFromSignal(signum, "Operation cancelled");
+  }
+  return impl_->cancel_error_;
+}
+
+namespace {
+
+void HandleSignal(int signum);
+
+struct SignalStopState {
+  struct SavedSignalHandler {
+    int signum;
+    SignalHandler handler;
+  };
+
+  Status RegisterHandlers(const std::vector<int>& signals) {
+    if (!saved_handlers.empty()) {
+      return Status::Invalid("Signal handlers already registered");
+    }
+    for (int signum : signals) {
+      ARROW_ASSIGN_OR_RAISE(auto handler,
+                            SetSignalHandler(signum, SignalHandler{&HandleSignal}));
+      saved_handlers.push_back({signum, handler});
+    }
+    return Status::OK();
+  }
+
+  void UnregisterHandlers() {
+    auto handlers = std::move(saved_handlers);
+    for (const auto& h : handlers) {
+      ARROW_CHECK_OK(SetSignalHandler(h.signum, h.handler).status());
+    }
+  }
+
+  ~SignalStopState() { UnregisterHandlers(); }
+
+  StopSource stop_source;
+  std::vector<SavedSignalHandler> saved_handlers;
+};
+
+std::shared_ptr<SignalStopState> g_signal_stop_state;
+
+void HandleSignal(int signum) {
+  ReinstateSignalHandler(signum, &HandleSignal);
+  std::shared_ptr<SignalStopState> state = internal::atomic_load(&g_signal_stop_state);
+  if (state) {
+    state->stop_source.RequestStopFromSignal(signum);
+  }
+}
+
+}  // namespace
+
+Result<StopSource*> SetSignalStopSource() {
+  if (g_signal_stop_state) {
+    return Status::Invalid("Signal stop source already set up");
+  }
+  internal::atomic_store(&g_signal_stop_state, std::make_shared<SignalStopState>());
+  return &g_signal_stop_state->stop_source;
+}
+
+void ResetSignalStopSource() {
+  internal::atomic_store(&g_signal_stop_state, std::shared_ptr<SignalStopState>{});
```

Review comment: Good point, will change.
[GitHub] [arrow] github-actions[bot] commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations
github-actions[bot] commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790802618

https://issues.apache.org/jira/browse/ARROW-10846
[GitHub] [arrow] pitrou commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations
pitrou commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790802568

@westonpace Would you like to review this?
[GitHub] [arrow] pitrou opened a new pull request #9632: ARROW-10846: [C++] Add async filesystem operations
pitrou opened a new pull request #9632:
URL: https://github.com/apache/arrow/pull/9632
[GitHub] [arrow] bkietz commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API
bkietz commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587606004

## File path: cpp/src/arrow/util/thread_pool_test.cc

```
@@ -192,32 +208,50 @@ TEST_F(TestThreadPool, StressSpawnThreaded) {

 TEST_F(TestThreadPool, SpawnSlow) {
   // This checks that Shutdown() waits for all tasks to finish
   auto pool = this->MakeThreadPool(2);
-  SpawnAdds(pool.get(), 7, [](int x, int y, int* out) {
-    return task_slow_add(0.02 /* seconds */, x, y, out);
-  });
+  SpawnAdds(pool.get(), 7, task_slow_add{0.02 /* seconds */});
```

Review comment:
```suggestion
  SpawnAdds(pool.get(), 7, task_slow_add{ /*seconds=*/0.02});
```

## File path: cpp/src/arrow/util/io_util.cc

```
@@ -1608,6 +1640,39 @@ Result<SignalHandler> SetSignalHandler(int signum, const SignalHandler& handler)
   return Status::OK();
 }

+void ReinstateSignalHandler(int signum, SignalHandler::Callback handler) {
+#if !ARROW_HAVE_SIGACTION
+  // Cannot report any errors from signal() (but there shouldn't be any)
+  signal(signum, handler);
+#endif
+}
+
+Status SendSignal(int signum) {
+  if (raise(signum) == 0) {
+    return Status::OK();
+  }
+  if (errno == EINVAL) {
+    return Status::Invalid("Invalid signal number");
+  }
+  return IOErrorFromErrno(errno, "Failed to raise signal");
+}
+
+Status SendSignalToThread(int signum, uint64_t thread_id) {
+#ifdef _WIN32
+  return Status::NotImplemented("Cannot send signal to specific thread on Windows");
+#else
+  // Have to use a C-style cast because pthread_t can be a pointer *or* integer type
+  int r = pthread_kill((pthread_t)thread_id, signum);
+  if (r == 0) {
+    return Status::OK();
+  }
+  if (r == EINVAL) {
+    return Status::Invalid("Invalid signal number");
```

Review comment:
```suggestion
    return Status::Invalid("Invalid signal number ", signum);
```

## File path: cpp/src/arrow/csv/reader.cc

```
@@ -833,9 +843,10 @@ class AsyncThreadedTableReader
   AsyncThreadedTableReader(MemoryPool* pool, std::shared_ptr<io::InputStream> input,
                            const ReadOptions& read_options, const ParseOptions& parse_options,
-                           const ConvertOptions& convert_options, Executor* cpu_executor,
-                           Executor* io_executor)
-      : BaseTableReader(pool, input, read_options, parse_options, convert_options),
+                           const ConvertOptions& convert_options, StopToken stop_token,
+                           Executor* cpu_executor, Executor* io_executor)
```

Review comment: Seems we might prefer to rewrite this constructor to take an IOContext

## File path: cpp/src/arrow/util/io_util_test.cc

```
@@ -623,5 +655,46 @@ TEST(FileUtils, LongPaths) {
 }
 #endif

+static std::atomic<int> signal_received;
+
+static void handle_signal(int signum) {
+  ReinstateSignalHandler(signum, &handle_signal);
+  signal_received.store(signum);
+}
+
+TEST(SendSignal, Generic) {
+  signal_received.store(0);
+  SignalHandlerGuard guard(SIGINT, &handle_signal);
+
+  ASSERT_EQ(signal_received.load(), 0);
+  ASSERT_OK(SendSignal(SIGINT));
+  BusyWait(1.0, [&]() { return signal_received.load() != 0; });
+  ASSERT_EQ(signal_received.load(), SIGINT);
+
+  // Re-try (exercise ReinstateSignalHandler)
+  signal_received.store(0);
+  ASSERT_OK(SendSignal(SIGINT));
+  BusyWait(1.0, [&]() { return signal_received.load() != 0; });
+  ASSERT_EQ(signal_received.load(), SIGINT);
+}
+
+TEST(SendSignal, ToThread) {
+#ifdef _WIN32
+  uint64_t dummy_thread_id = 42;
+  ASSERT_RAISES(NotImplemented, SendSignalToThread(SIGINT, dummy_thread_id));
+#else
+  // Have to use a C-style cast because pthread_t can be a pointer *or* integer type
+  uint64_t thread_id = (uint64_t)(pthread_self());
```

Review comment:
```suggestion
  uint64_t thread_id = (uint64_t)(pthread_self());  // NOLINT readability-casting
```

## File path: cpp/src/arrow/util/cancel.cc

```
@@ -0,0 +1,167 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/cancel.h"
+
+#include <atomic>
+#include <mutex>
+#include <utility>
+#include <vector>
+
+#include "arrow/result.h"
+#include "arrow/util/atomic_shared_ptr.h"
```
[GitHub] [arrow] github-actions[bot] commented on pull request #9631: ARROW-11644: [Python][Parquet] Low-level Parquet encryption in Python, initial sketch for feedback
github-actions[bot] commented on pull request #9631:
URL: https://github.com/apache/arrow/pull/9631#issuecomment-790779100

https://issues.apache.org/jira/browse/ARROW-11644
[GitHub] [arrow] itamarst opened a new pull request #9631: ARROW-11644: [Python][Parquet] Low-level Parquet encryption in Python, initial sketch for feedback
itamarst opened a new pull request #9631:
URL: https://github.com/apache/arrow/pull/9631

This is still a work in progress, but some feedback would be helpful.
[GitHub] [arrow] emkornfield commented on pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#issuecomment-790754904

I thought there should be assert scalar equals, but I was expecting it to be a macro, which is why I think I missed it
[GitHub] [arrow] nealrichardson commented on a change in pull request #9579: ARROW-11774: [R] macos one line install
nealrichardson commented on a change in pull request #9579:
URL: https://github.com/apache/arrow/pull/9579#discussion_r587611515

## File path: r/tools/nixlibs.R

```
@@ -396,6 +401,7 @@ cmake_version <- function(cmd = "cmake") {

 with_s3_support <- function(env_vars) {
   arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false"
   if (arrow_s3) {
+    # TODO: add in brew versions of these?
```

Review comment: If it's trouble then don't bother, this is a rather esoteric case isn't it? Also it seems pretty likely that you would already have openssl and curl (I do and I didn't install them myself).
[GitHub] [arrow] pitrou closed pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
pitrou closed pull request #9582:
URL: https://github.com/apache/arrow/pull/9582
[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587609001

## File path: r/R/dataset-partition.R

```
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {
```

Review comment: Thanks, please also remove the conditionals from the other examples. I don't feel strongly about `\donttest` vs `\dontrun` (CRAN complains about dontrun examples on initial submission, which is why they were switched to donttest in the first place, but that's not our concern now) so that's fine, but FWIW `\donttest` is safe on Solaris: [we currently "pass" Solaris checks](https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/arrow-00check.html) with the fake arrow-without-arrow build.
[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587597922

## File path: r/tools/autobrew

```
@@ -48,7 +48,8 @@ fi
 # Hardcode this for my custom autobrew build
 rm -f $BREWDIR/lib/*.dylib
 AWS_LIBS="-laws-cpp-sdk-config -laws-cpp-sdk-transfer -laws-cpp-sdk-identity-management -laws-cpp-sdk-cognito-identity -laws-cpp-sdk-sts -laws-cpp-sdk-s3 -laws-cpp-sdk-core -laws-c-event-stream -laws-checksums -laws-c-common -lpthread -lcurl"
-PKG_LIBS="-L$BREWDIR/lib -lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_LIBS="-lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_DIRS="-L$BREWDIR/lib"
```

Review comment: It's based on an earlier version of Jeroen's autobrew, and it's closest now to how he actually builds the C++ dependencies. It looks like the new autobrew script is based on some artifacts that are bundled up in a separate process (see https://github.com/autobrew/bundler/blob/master/apache-arrow.sh).
[GitHub] [arrow] jonkeane commented on a change in pull request #9579: ARROW-11774: [R] macos one line install
jonkeane commented on a change in pull request #9579:
URL: https://github.com/apache/arrow/pull/9579#discussion_r587591793

## File path: r/tools/nixlibs.R

```
@@ -396,6 +401,7 @@ cmake_version <- function(cmd = "cmake") {

 with_s3_support <- function(env_vars) {
   arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false"
   if (arrow_s3) {
+    # TODO: add in brew versions of these?
```

Review comment: I'm working on this, it ends up being a bit more complicated than simply `brew install`ing, but I'll get something working.
[GitHub] [arrow] github-actions[bot] commented on pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos
github-actions[bot] commented on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790718123

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

    ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] andygrove opened a new pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos
andygrove opened a new pull request #9630: URL: https://github.com/apache/arrow/pull/9630 I don't think this needs a JIRA? https://user-images.githubusercontent.com/934084/109990656-d55ddf80-7cc6-11eb-8bbc-f21946fd1dfc.png https://user-images.githubusercontent.com/934084/109990665-d68f0c80-7cc6-11eb-891c-bf367cb5f447.png This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r587557839 ## File path: r/R/dataset-partition.R ## @@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning #' calling `hive_partition()` with no arguments. #' @examples #' \donttest{ -#' hive_partition(year = int16(), month = int8()) +#' if (arrow_with_dataset()) { Review comment: Done in 6918441bba2b0ca09f2103eaf5631adc0a076a6c This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r587557393 ## File path: r/configure ## @@ -26,13 +26,13 @@ # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib' # Library settings -PKG_CONFIG_NAME="arrow parquet arrow-dataset" +PKG_CONFIG_NAME="arrow" PKG_DEB_NAME="(unsuppored)" PKG_RPM_NAME="(unsuppored)" PKG_BREW_NAME="apache-arrow" PKG_TEST_HEADER="" -# These must be the same order as $(pkg-config --libs arrow-dataset) -PKG_LIBS="-larrow_dataset -lparquet -larrow" +PKG_LIBS="-larrow" +BUNDLED_LIBS="" Review comment: Done in 13edc1478e5003e3a2b3dc27612c6d85f2aa9269 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on pull request #9610: URL: https://github.com/apache/arrow/pull/9610#issuecomment-790693193 > Oh, one other thing: in `inst/build_arrow_static.sh`, you'll want to make ARROW_PARQUET and ARROW_DATASET configurable by env var but default ON, just like ARROW_JEMALLOC. That way, we can make a CI job that sets them to OFF in the env so we can test this. Done in 8c5b235dddac47833b229323b4a216049eae5fea This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r587550424 ## File path: r/tools/autobrew ## @@ -48,7 +48,8 @@ fi # Hardcode this for my custom autobrew build rm -f $BREWDIR/lib/*.dylib AWS_LIBS="-laws-cpp-sdk-config -laws-cpp-sdk-transfer -laws-cpp-sdk-identity-management -laws-cpp-sdk-cognito-identity -laws-cpp-sdk-sts -laws-cpp-sdk-s3 -laws-cpp-sdk-core -laws-c-event-stream -laws-checksums -laws-c-common -lpthread -lcurl" -PKG_LIBS="-L$BREWDIR/lib -lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS" +PKG_LIBS="-lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS" +PKG_DIRS="-L$BREWDIR/lib" Review comment: What is the relationship of this script (`r/tools/autobrew`) to https://github.com/autobrew/scripts/blob/master/apache-arrow? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r587526925 ## File path: r/R/dataset-partition.R ## @@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning #' calling `hive_partition()` with no arguments. #' @examples #' \donttest{ -#' hive_partition(year = int16(), month = int8()) +#' if (arrow_with_dataset()) { Review comment: Gotcha. For now, I'll remove the conditional in the example and I'll also change the `\donttest` to `\dontrun`. That way the example checks will pass on Solaris. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
pitrou commented on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-790619759 That said, it's just a matter of implementing said generation properly :-) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
pitrou commented on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-790619135 > Shall random decimal128 generation with precision > 18 be impossible? It used to be perfectly fine. Did it actually work? I think it would use random fixed size binary generation, which wouldn't guarantee the generated values fit in the given precision... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] mathyingzhou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
mathyingzhou commented on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-790615411 @pitrou It is due to some new issue in Decimal128s. Starting from https://github.com/apache/arrow/pull/8648/commits/68fd76fc5eba3350f341eb0fd7e4f83f7c83c51e we suddenly began to get "NotImplemented: random decimal128 generation with precision > 18". I reduced the precision to 18, which caused ORC to misbehave. Is random decimal128 generation with precision > 18 supposed to be impossible? It used to be perfectly fine. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete
lidavidm commented on a change in pull request #9613: URL: https://github.com/apache/arrow/pull/9613#discussion_r587452478 ## File path: cpp/src/arrow/io/caching.h ## @@ -99,6 +100,9 @@ class ARROW_EXPORT ReadRangeCache { /// \brief Read a range previously given to Cache(). Result<std::shared_ptr<Buffer>> Read(ReadRange range); + /// \brief Wait until all ranges added so far have been cached. + Future<> Wait(); Review comment: The idea here is to cleanly separate out the I/O operations so you can schedule them on different executors. With a ReadAsync, you wouldn't be able to accomplish that unless you went and refactored the entire Parquet reader so that it was fully async-aware. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
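For context on the API being discussed, here is a minimal sketch of how a caller might use the proposed `Wait()` to schedule all the I/O up front and then read synchronously. It assumes the `ReadRangeCache` constructor and `CacheOptions::Defaults()` as declared in `arrow/io/caching.h` around this time; the byte ranges are made up for illustration:

```c++
#include <memory>

#include "arrow/buffer.h"
#include "arrow/io/caching.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Schedule both reads, block once, then serve them from memory.
arrow::Result<std::shared_ptr<arrow::Buffer>> ReadBodyAfterCaching(
    std::shared_ptr<arrow::io::RandomAccessFile> file) {
  arrow::io::ReadRangeCache cache(std::move(file), arrow::io::IOContext(),
                                  arrow::io::CacheOptions::Defaults());
  // Kick off the I/O for every range we know we will need.
  ARROW_RETURN_NOT_OK(cache.Cache({{0, 1024}, {4096, 512}}));
  // The new Wait(): block until all ranges cached so far have arrived.
  ARROW_RETURN_NOT_OK(cache.Wait().status());
  // From here on, Read() is a purely in-memory lookup.
  return cache.Read({4096, 512});
}
```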
[GitHub] [arrow] lidavidm commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete
lidavidm commented on a change in pull request #9613: URL: https://github.com/apache/arrow/pull/9613#discussion_r587451127 ## File path: cpp/src/arrow/io/caching.cc ## @@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) { return Status::Invalid("ReadRangeCache did not find matching cache entry"); } +Future<> ReadRangeCache::Wait() { + struct State { +explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f) +: futures(std::move(f)), remaining(futures.size()) {} +std::vector<Future<std::shared_ptr<Buffer>>> futures; +std::atomic<size_t> remaining; + }; + + std::vector<Future<std::shared_ptr<Buffer>>> futures; + for (const auto& entry : impl_->entries) { +futures.push_back(entry.future); + } + auto out = Future<>::Make(); + auto state = std::make_shared<State>(std::move(futures)); + for (const auto& future : state->futures) { +future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable { + if (state->remaining.fetch_sub(1) != 1) return; Review comment: Agreed. I can move this to future.h/future.cc since there's already a very similar helper there. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jmgpeeters commented on pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.
jmgpeeters commented on pull request #9629: URL: https://github.com/apache/arrow/pull/9629#issuecomment-790592214 To be clear, these test failures should be due to https://github.com/apache/arrow-testing/pull/59 not being merged yet, which provides the associated materialised test data. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
pitrou commented on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-790587488 It seems there was a crash on the "AMD64 Conda C++" CI build. Can you take a look? Note that you should be able to reproduce this locally using our docker-compose setup. One way to do this is to first install the `archery` utility: https://arrow.apache.org/docs/developers/archery.html and then run `archery docker run conda-cpp`. (or you may also directly run `docker-compose run --rm conda-cpp`) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
pitrou commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587428039 ## File path: cpp/src/parquet/statistics.cc ## @@ -133,56 +136,95 @@ struct CompareHelper { static T Max(int type_length, const T& a, const T& b) { return Compare(0, a, b) ? b : a; } -}; // namespace parquet - -template <bool is_signed> -struct ByteLikeCompareHelperBase { - using T = ByteArrayType::c_type; - using PtrType = typename std::conditional<is_signed, int8_t, uint8_t>::type; +}; - static T DefaultMin() { return {}; } - static T DefaultMax() { return {}; } - static T Coalesce(T val, T fallback) { return val; } +template <typename T, bool is_signed> +struct BinaryLikeComparer {}; - static inline bool Compare(int type_length, const T& a, const T& b) { -const auto* aptr = reinterpret_cast<const PtrType*>(a.ptr); -const auto* bptr = reinterpret_cast<const PtrType*>(b.ptr); -return std::lexicographical_compare(aptr, aptr + a.len, bptr, bptr + b.len); +template <typename T> +struct BinaryLikeComparer<T, /*is_signed=*/false> { + static bool Compare(int type_length, const T& a, const T& b) { +int a_length = value_length(type_length, a); +int b_length = value_length(type_length, b); +// Unsigned comparison is used for non-numeric types so straight +// lexicographic comparison makes sense. (a.ptr is always unsigned) +return std::lexicographical_compare(a.ptr, a.ptr + a_length, b.ptr, b.ptr + b_length); } +}; - static T Min(int type_length, const T& a, const T& b) { -if (a.ptr == nullptr) return b; -if (b.ptr == nullptr) return a; -return Compare(type_length, a, b) ? a : b; - } +template <typename T> +struct BinaryLikeComparer<T, /*is_signed=*/true> { + static bool Compare(int type_length, const T& a, const T& b) { +// Signed comparison is used for integers encoded as big-endian twos +// complement integers (e.g. decimals). +int a_length = value_length(type_length, a); +int b_length = value_length(type_length, b); + +// At least one of the lengths is zero. +if (a_length == 0 || b_length == 0) { + return a_length == 0 && b_length > 0; +} - static T Max(int type_length, const T& a, const T& b) { -if (a.ptr == nullptr) return b; -if (b.ptr == nullptr) return a; -return Compare(type_length, a, b) ? b : a; +int8_t first_a = *a.ptr; +int8_t first_b = *b.ptr; +// We can short circuit for differently signed numbers or +// for equal length byte arrays that have different first bytes. +if ((0x80 & first_a) != (0x80 & first_b) || +(a_length == b_length && first_a != first_b)) { + return first_a < first_b; +} +// When the lengths are unequal and the numbers are of the same +// sign we need to extend the digits. +const uint8_t* a_start = a.ptr; +const uint8_t* b_start = b.ptr; +if (a_length != b_length) { + const uint8_t* lead_start = nullptr; + const uint8_t* lead_end = nullptr; + if (a_length > b_length) { +int lead_length = a_length - b_length; +lead_start = a.ptr; +lead_end = a.ptr + lead_length; +a_start += lead_length; + } else { +DCHECK_LT(a_length, b_length); +int lead_length = b_length - a_length; +lead_start = b.ptr; +lead_end = b.ptr + lead_length; +b_start += lead_length; + } + // Compare extra bytes to the sign extension of the first + // byte of the other number. + uint8_t extension = first_a < 0 ? 0xFF : 0; + for (; lead_start != lead_end; lead_start++) { +if (*lead_start < extension) { + // The first bytes of the long value are less + // than the extended short one. So if a is the long value + // we can return true. + return a_length > b_length; +} else if (*lead_start > extension) { Review comment: Yes, thank you. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
pitrou commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587423364 ## File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc ## @@ -2673,6 +2673,36 @@ TEST(ArrowReadWrite, Decimal256) { CheckSimpleRoundtrip(table, 2, props_store_schema); } +TEST(ArrowReadWrite, DecimalStats) { + using ::arrow::Decimal128; + using ::arrow::field; + + auto type = ::arrow::decimal128(/*precision=*/8, /*scale=*/0); + + const char* json = R"(["255", "128", null, "0", "1", "-127", "-128", "-129", "-255"])"; + auto array = ::arrow::ArrayFromJSON(type, json); + auto table = ::arrow::Table::Make(::arrow::schema({field("root", type)}), {array}); + + std::shared_ptr<Buffer> buffer; + ASSERT_NO_FATAL_FAILURE(WriteTableToBuffer(table, /*row_group_size=*/100, + default_arrow_writer_properties(), &buffer)); + + std::unique_ptr<FileReader> reader; + ASSERT_OK_NO_THROW(OpenFile(std::make_shared<BufferReader>(buffer), + ::arrow::default_memory_pool(), &reader)); + + std::shared_ptr<Scalar> min, max; + ReadSingleColumnFileStatistics(std::move(reader), &min, &max); + + std::shared_ptr<Scalar> expected_min, expected_max; + ASSERT_OK_AND_ASSIGN(expected_min, array->GetScalar(array->length() - 1)); + ASSERT_OK_AND_ASSIGN(expected_max, array->GetScalar(0)); + ASSERT_TRUE(expected_min->Equals(*min)) Review comment: I think we have `AssertScalarsEqual`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.
github-actions[bot] commented on pull request #9629: URL: https://github.com/apache/arrow/pull/9629#issuecomment-790547752 https://issues.apache.org/jira/browse/ARROW-11838 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jmgpeeters opened a new pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.
jmgpeeters opened a new pull request #9629: URL: https://github.com/apache/arrow/pull/9629 The only code change required, AFAICT, is the calculation of num_dicts, which is no longer simply the number of fields, but rather the number of unique dictionary ids they point to. I'm calculating this on demand, as it's quite cheap and not frequently called, but we could also (p)re-compute it on every addField. For now, I've added tests that read materialised data generated from Java, as we don't yet support writing IPC with shared dictionaries in C++ either (and that is out of scope here). Down the line, I would like full read & write support for shared dictionaries across at least C++, Python, Java and Julia, so I'll be coming back to this with follow-up PRs where needed. As part of that, I'll also change the tests to no longer rely on materialised files, but use the round-trip mechanism. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
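To make the num_dicts change above concrete, here is a small sketch of the counting it describes. The helper name and the flat id vector are illustrative only; in the real reader the ids come from the field-to-dictionary-id mapping held by the `DictionaryMemo`:

```c++
#include <cstdint>
#include <unordered_set>
#include <vector>

// With shared dictionaries, several dictionary-encoded fields can point at
// the same dictionary id, so the number of dictionary batches to expect is
// the number of unique ids, not the number of encoded fields.
int NumRequiredDictionaries(const std::vector<int64_t>& field_dictionary_ids) {
  std::unordered_set<int64_t> unique_ids(field_dictionary_ids.begin(),
                                         field_dictionary_ids.end());
  return static_cast<int>(unique_ids.size());
}
```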
[GitHub] [arrow] pitrou commented on pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string
pitrou commented on pull request #9606: URL: https://github.com/apache/arrow/pull/9606#issuecomment-790546124 Also note that the current implementation (in `scalar_set_lookup.cc`) is suboptimal as it decodes the dictionary array and runs the set lookup for each decoded dictionary value. It would be more efficient to compute and cache lookup results for distinct dictionary elements. That can be for a later JIRA, though. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
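The caching pitrou sketches could be prototyped on top of the public compute API along these lines. This is a sketch, not the kernel in `scalar_set_lookup.cc`: it leans on `CallFunction` and `Take`, and it glosses over null-index handling:

```c++
#include <memory>

#include "arrow/array.h"
#include "arrow/compute/api.h"
#include "arrow/result.h"

namespace cp = arrow::compute;

// Run the set lookup once per *distinct* dictionary value, then map the
// per-dictionary booleans back through the indices, instead of decoding
// the dictionary array and probing the hash table once per row.
arrow::Result<arrow::Datum> IsInDictionary(
    const std::shared_ptr<arrow::DictionaryArray>& input,
    const cp::SetLookupOptions& options) {
  // One boolean per dictionary entry.
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum dict_mask,
      cp::CallFunction("is_in", {input->dictionary()}, &options));
  // Gather the per-entry results by the dictionary indices.
  return cp::Take(dict_mask, arrow::Datum(input->indices()));
}
```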
[GitHub] [arrow] pitrou commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string
pitrou commented on a change in pull request #9606: URL: https://github.com/apache/arrow/pull/9606#discussion_r587386055 ## File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc ## @@ -280,6 +320,28 @@ class TestIndexInKernel : public ::testing::Test { ASSERT_OK(actual.chunked_array()->ValidateFull()); AssertChunkedEqual(*expected, *actual.chunked_array()); } + + void CheckIndexInDictionary(const std::shared_ptr<DataType>& type, + const std::shared_ptr<DataType>& index_type, + const std::string& input_dictionary_json, + const std::string& input_index_json, + const std::string& value_set_json, + const std::string& expected_json, bool skip_nulls = false) { +auto dict_type = dictionary(index_type, type); +auto indices = ArrayFromJSON(index_type, input_index_json); +auto dict = ArrayFromJSON(type, value_set_json); Review comment: Should be `input_dictionary_json` as well. ## File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc ## @@ -72,6 +72,27 @@ void CheckIsInChunked(const std::shared_ptr<ChunkedArray>& input, AssertChunkedEqual(*expected, *actual); } +void CheckIsInDictionary(const std::shared_ptr<DataType>& type, + const std::shared_ptr<DataType>& index_type, + const std::string& input_dictionary_json, + const std::string& input_index_json, + const std::string& value_set_json, + const std::string& expected_json, bool skip_nulls = false) { + auto dict_type = dictionary(index_type, type); + auto indices = ArrayFromJSON(index_type, input_index_json); + auto dict = ArrayFromJSON(type, value_set_json); Review comment: This should use `input_dictionary_json`, not `value_set_json`. ## File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc ## @@ -231,6 +252,25 @@ TEST_F(TestIsInKernel, Decimal) { /*skip_nulls=*/true); } +TEST_F(TestIsInKernel, DictionaryArray) { + for (auto index_ty : all_dictionary_index_types()) { +CheckIsInDictionary(/*type=*/utf8(), +/*index_type=*/index_ty, +/*input_dictionary_json=*/R"(["A", "B", "C", "D"])", +/*input_index_json=*/"[1, 2, null, 0]", +/*value_set_json=*/R"(["A", "B", "C"])", +/*expected_json=*/"[true, true, false, true]", +/*skip_nulls=*/false); +CheckIsInDictionary(/*type=*/float32(), Review comment: Consider adding the following tests: ```c++ // With nulls and skip_nulls=false CheckIsInDictionary(/*type=*/utf8(), /*index_type=*/index_ty, /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", /*input_index_json=*/"[1, 3, null, 0, 1]", /*value_set_json=*/R"(["C", "B", "A", null])", /*expected_json=*/"[true, false, true, true, true]", /*skip_nulls=*/false); CheckIsInDictionary(/*type=*/utf8(), /*index_type=*/index_ty, /*input_dictionary_json=*/R"(["A", null, "C", "D"])", /*input_index_json=*/"[1, 3, null, 0, 1]", /*value_set_json=*/R"(["C", "B", "A", null])", /*expected_json=*/"[true, false, true, true, true]", /*skip_nulls=*/false); CheckIsInDictionary(/*type=*/utf8(), /*index_type=*/index_ty, /*input_dictionary_json=*/R"(["A", null, "C", "D"])", /*input_index_json=*/"[1, 3, null, 0, 1]", /*value_set_json=*/R"(["C", "B", "A"])", /*expected_json=*/"[false, false, false, true, false]", /*skip_nulls=*/false); // With nulls and skip_nulls=true CheckIsInDictionary(/*type=*/utf8(), /*index_type=*/index_ty, /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", /*input_index_json=*/"[1, 3, null, 0, 1]", /*value_set_json=*/R"(["C", "B", "A", null])", /*expected_json=*/"[true, false, false, true, true]", /*skip_nulls=*/true); CheckIsInDictionary(/*type=*/utf8(), /*index_type=*/index_ty, /*input_dictionary_json=*/R"(["A", null, "C", "D"])", /*input_index_json=*/"[1, 3, null, 0, 1]", 
/*value_set_json=*/R"(["C", "B", "A", null])",
[GitHub] [arrow] pitrou commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete
pitrou commented on a change in pull request #9613: URL: https://github.com/apache/arrow/pull/9613#discussion_r587361262 ## File path: cpp/src/arrow/io/caching.cc ## @@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) { return Status::Invalid("ReadRangeCache did not find matching cache entry"); } +Future<> ReadRangeCache::Wait() { + struct State { +explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f) +: futures(std::move(f)), remaining(futures.size()) {} +std::vector<Future<std::shared_ptr<Buffer>>> futures; +std::atomic<size_t> remaining; + }; + + std::vector<Future<std::shared_ptr<Buffer>>> futures; + for (const auto& entry : impl_->entries) { +futures.push_back(entry.future); + } + auto out = Future<>::Make(); + auto state = std::make_shared<State>(std::move(futures)); + for (const auto& future : state->futures) { +future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable { + if (state->remaining.fetch_sub(1) != 1) return; Review comment: Ideally this primitive would be globally available, instead of reimplemented here. So you could write for example: ```c++ return Future::Gather(std::move(futures)); ``` Also there may be different kinds of "gather". By default you probably want the first error to immediately mark the resulting future errored. cc @westonpace This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
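For illustration, a generic version of the countdown in the diff above might look like the sketch below. It mirrors the wait-for-all semantics of the PR (errors are ignored, matching `Wait()`'s contract); the name `WhenAllComplete` and the empty-input guard are assumptions here, not the helper that eventually landed in `future.h`:

```c++
#include <atomic>
#include <memory>
#include <vector>

#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/future.h"

// Completes once every input future has completed, regardless of whether
// individual futures succeeded or failed.
template <typename T>
arrow::Future<> WhenAllComplete(std::vector<arrow::Future<T>> futures) {
  if (futures.empty()) {
    return arrow::Future<>::MakeFinished(arrow::Status::OK());
  }
  struct State {
    explicit State(size_t n) : remaining(n), out(arrow::Future<>::Make()) {}
    std::atomic<size_t> remaining;
    arrow::Future<> out;
  };
  auto state = std::make_shared<State>(futures.size());
  for (auto& future : futures) {
    future.AddCallback([state](const arrow::Result<T>&) {
      // The last callback to fire completes the gathered future.
      if (state->remaining.fetch_sub(1) == 1) {
        state->out.MarkFinished(arrow::Status::OK());
      }
    });
  }
  return state->out;
}
```

As pitrou notes, a production "gather" would probably also want a variant that propagates the first error immediately instead of waiting for all inputs.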
[GitHub] [arrow] pitrou commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete
pitrou commented on a change in pull request #9613: URL: https://github.com/apache/arrow/pull/9613#discussion_r587359365 ## File path: cpp/src/arrow/io/caching.h ## @@ -99,6 +100,9 @@ class ARROW_EXPORT ReadRangeCache { /// \brief Read a range previously given to Cache(). Result<std::shared_ptr<Buffer>> Read(ReadRange range); + /// \brief Wait until all ranges added so far have been cached. + Future<> Wait(); Review comment: Hmm... Wouldn't it be better to expose a `ReadAsync` equivalent to `Read`? ```c++ Future<std::shared_ptr<Buffer>> Read(ReadRange range); ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #9528: ARROW-8732: [C++] Add basic cancellation API
pitrou commented on pull request #9528: URL: https://github.com/apache/arrow/pull/9528#issuecomment-790509864 @westonpace @bkietz Feel free to review. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li commented on a change in pull request #9602: URL: https://github.com/apache/arrow/pull/9602#discussion_r587330687 ## File path: rust/arrow/src/compute/kernels/sort.rs ## @@ -517,20 +650,32 @@ where }, ); -if !options.descending { -valids.sort_by(|a, b| cmp_array(a.1.as_ref(), b.1.as_ref())) +let mut len = values.len(); +let descending = options.descending; + +if let Some(size) = limit { +len = size.min(len); +} + +// we are not using partial_sort here, because the array is ArrayRef; something is not working well with it. Review comment: Sorry, there is a bug in the previous version. It's fixed in the newest [commit](https://github.com/sundy-li/partial_sort/commit/372778d0011b4d647ed4ee13d1dacb0865eb266d#diff-b1a35a68f14e696205874893c07fd24fdb2b47c23cc0e0c80a30c7d53759R103-R141). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter
pitrou closed pull request #9627: URL: https://github.com/apache/arrow/pull/9627 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9626: ARROW-11855: [C++][Python] Memory leak in to_pandas when converting chunked struct array
pitrou commented on a change in pull request #9626: URL: https://github.com/apache/arrow/pull/9626#discussion_r587322650 ## File path: python/pyarrow/tests/test_pandas.py ## @@ -2272,6 +2272,30 @@ def test_to_pandas(self): series = pd.Series(arr.to_pandas()) tm.assert_series_equal(series, expected) +def test_to_pandas_multiple_chunks(self): +# ARROW-11855 +bytes_start = pa.total_allocated_bytes() Review comment: Probably want to call `gc.collect()` just before this, to avoid false positives. ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -689,7 +691,8 @@ Status ConvertStruct(PandasOptions options, const ChunkedArray& data, auto name = array_type->field(static_cast<int>(field_idx))->name(); if (!arr->field(static_cast<int>(field_idx))->IsNull(i)) { // Value exists in child array, obtain it -auto array = reinterpret_cast<PyArrayObject*>(fields_data[field_idx].obj()); +auto array = reinterpret_cast<PyArrayObject*>( +fields_data[field_idx + fields_data_offset].obj()); Review comment: Does this mean that conversion could give the wrong results (in addition to being leaky)? If so, can you add a test showcasing that? (I believe you need the different chunks to be unequal...). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
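To see why the `fields_data_offset` in the fix matters, here is a self-contained toy model of the bookkeeping (illustrative only; in the real code the flat vector holds one converted column per (chunk, field) pair):

```c++
#include <cstdio>
#include <vector>

int main() {
  const int num_chunks = 2, num_fields = 3;
  // One entry per (chunk, field) pair, laid out chunk-major, standing in
  // for the converted per-chunk columns in fields_data.
  std::vector<int> fields_data;
  for (int c = 0; c < num_chunks; c++) {
    for (int f = 0; f < num_fields; f++) {
      fields_data.push_back(c * 100 + f);
    }
  }
  int fields_data_offset = 0;
  for (int c = 0; c < num_chunks; c++) {
    for (int f = 0; f < num_fields; f++) {
      // The buggy version indexed fields_data[f], so every chunk read (and
      // kept alive) chunk 0's entries -- hence the leak and wrong values.
      std::printf("chunk %d field %d -> %d\n", c, f,
                  fields_data[f + fields_data_offset]);
    }
    fields_data_offset += num_fields;
  }
  return 0;
}
```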
[GitHub] [arrow] westonpace commented on pull request #9626: ARROW-11855: [C++][Python] Memory leak in to_pandas when converting chunked struct array
westonpace commented on pull request #9626: URL: https://github.com/apache/arrow/pull/9626#issuecomment-790458923 Failing check appears to be ARROW-11717 / unrelated to PR. Please rerun or ignore. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yoonsikp commented on pull request #8491: ARROW-10349: [Python] linux aarch64 wheels
yoonsikp commented on pull request #8491: URL: https://github.com/apache/arrow/pull/8491#issuecomment-790429618 Would truly appreciate having aarch64 wheel support for apache arrow! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
Dandandan commented on pull request #9595: URL: https://github.com/apache/arrow/pull/9595#issuecomment-790407618 @alamb ha, thanks, fixed! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on pull request #9582: URL: https://github.com/apache/arrow/pull/9582#issuecomment-790405647 @pitrou I think I addressed all of your comments, would you mind taking another look? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587235073 ## File path: cpp/src/parquet/statistics.cc ## @@ -133,56 +136,95 @@ struct CompareHelper { static T Max(int type_length, const T& a, const T& b) { return Compare(0, a, b) ? b : a; } -}; // namespace parquet - -template <bool is_signed> -struct ByteLikeCompareHelperBase { - using T = ByteArrayType::c_type; - using PtrType = typename std::conditional<is_signed, int8_t, uint8_t>::type; +}; - static T DefaultMin() { return {}; } - static T DefaultMax() { return {}; } - static T Coalesce(T val, T fallback) { return val; } +template <typename T, bool is_signed> +struct BinaryLikeComparer {}; - static inline bool Compare(int type_length, const T& a, const T& b) { -const auto* aptr = reinterpret_cast<const PtrType*>(a.ptr); -const auto* bptr = reinterpret_cast<const PtrType*>(b.ptr); -return std::lexicographical_compare(aptr, aptr + a.len, bptr, bptr + b.len); +template <typename T> +struct BinaryLikeComparer<T, /*is_signed=*/false> { + static bool Compare(int type_length, const T& a, const T& b) { +int a_length = value_length(type_length, a); +int b_length = value_length(type_length, b); +// Unsigned comparison is used for non-numeric types so straight +// lexicographic comparison makes sense. (a.ptr is always unsigned) +return std::lexicographical_compare(a.ptr, a.ptr + a_length, b.ptr, b.ptr + b_length); } +}; - static T Min(int type_length, const T& a, const T& b) { -if (a.ptr == nullptr) return b; -if (b.ptr == nullptr) return a; -return Compare(type_length, a, b) ? a : b; - } +template <typename T> +struct BinaryLikeComparer<T, /*is_signed=*/true> { + static bool Compare(int type_length, const T& a, const T& b) { +// Signed comparison is used for integers encoded as big-endian twos +// complement integers (e.g. decimals). +int a_length = value_length(type_length, a); +int b_length = value_length(type_length, b); + +// At least one of the lengths is zero. +if (a_length == 0 || b_length == 0) { + return a_length == 0 && b_length > 0; +} - static T Max(int type_length, const T& a, const T& b) { -if (a.ptr == nullptr) return b; -if (b.ptr == nullptr) return a; -return Compare(type_length, a, b) ? b : a; +int8_t first_a = *a.ptr; +int8_t first_b = *b.ptr; +// We can short circuit for differently signed numbers or +// for equal length byte arrays that have different first bytes. +if ((0x80 & first_a) != (0x80 & first_b) || +(a_length == b_length && first_a != first_b)) { + return first_a < first_b; +} +// When the lengths are unequal and the numbers are of the same +// sign we need to extend the digits. +const uint8_t* a_start = a.ptr; +const uint8_t* b_start = b.ptr; +if (a_length != b_length) { + const uint8_t* lead_start = nullptr; + const uint8_t* lead_end = nullptr; + if (a_length > b_length) { +int lead_length = a_length - b_length; +lead_start = a.ptr; +lead_end = a.ptr + lead_length; +a_start += lead_length; + } else { +DCHECK_LT(a_length, b_length); +int lead_length = b_length - a_length; +lead_start = b.ptr; +lead_end = b.ptr + lead_length; +b_start += lead_length; + } + // Compare extra bytes to the sign extension of the first + // byte of the other number. + uint8_t extension = first_a < 0 ? 0xFF : 0; + for (; lead_start != lead_end; lead_start++) { +if (*lead_start < extension) { + // The first bytes of the long value are less + // than the extended short one. So if a is the long value + // we can return true. + return a_length > b_length; +} else if (*lead_start > extension) { Review comment: I did a slightly different factoring; let me know if you think that is clearer? This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587234738 ## File path: cpp/src/parquet/statistics_test.cc ## @@ -111,21 +117,28 @@ TEST(Comparison, UnsignedByteArray) { } TEST(Comparison, SignedFLBA) { - int size = 10; + int size = 4; auto comparator = MakeComparator<FLBAType>(Type::FIXED_LEN_BYTE_ARRAY, SortOrder::SIGNED, size); - std::string s1 = "Anti123456"; - std::string s2 = "Bunkd123456"; - FLBA s1flba = FLBAFromString(s1); - FLBA s2flba = FLBAFromString(s2); - ASSERT_TRUE(comparator->Compare(s1flba, s2flba)); + std::vector<uint8_t> byte_values[] = { + {0x80, 0, 0, 0}, {0xFF, 0xFF, 0x01, 0},{0xFF, 0xFF, 0x80, 0}, + {0xFF, 0xFF, 0xFF, 0x80}, {0xFF, 0xFF, 0xFF, 0xFF}, {0, 0, 0x01, 0x01}, + {0, 0x01, 0x01, 0}, {0x01, 0x01, 0, 0}}; + std::vector<FLBA> values_to_compare; + for (auto& bytes : byte_values) { +values_to_compare.emplace_back(FLBA(bytes.data())); + } - s1 = "Bünk123456"; - s2 = "Bunk123456"; - s1flba = FLBAFromString(s1); - s2flba = FLBAFromString(s2); - ASSERT_TRUE(comparator->Compare(s1flba, s2flba)); + for (size_t x = 0; x < values_to_compare.size(); x++) { +EXPECT_FALSE(comparator->Compare(values_to_compare[x], values_to_compare[x])) << x; +for (size_t y = x + 1; y < values_to_compare.size(); y++) { + EXPECT_TRUE(comparator->Compare(values_to_compare[x], values_to_compare[y])) + << x << " " << y; + EXPECT_FALSE(comparator->Compare(values_to_compare[y], values_to_compare[x])) + << y << " " << x; +} + } Review comment: You probably shouldn't trust me. I added a test when writing a column of Decimal128 values to arrow_reader_writer_test which failed without this change. This required fixing some cases with extracting scalar values and there are still some broken aspects to the extraction. I'll open up some JIRAs to fix these: 1. Decimals encoded as Ints aren't extracted correctly. 2. The decimal type chosen might not correspond to the decoded decimal type because we don't pass the arrow schema information through to the decoding. (i.e. we depend solely on precision for deciding Decimal128Scalar or Decimal256Scalar). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587233240 ## File path: cpp/src/parquet/statistics.cc ## @@ -133,56 +136,95 @@ struct CompareHelper { static T Max(int type_length, const T& a, const T& b) { return Compare(0, a, b) ? b : a; } -}; // namespace parquet - -template <bool is_signed> -struct ByteLikeCompareHelperBase { - using T = ByteArrayType::c_type; - using PtrType = typename std::conditional<is_signed, int8_t, uint8_t>::type; +}; - static T DefaultMin() { return {}; } - static T DefaultMax() { return {}; } - static T Coalesce(T val, T fallback) { return val; } +template <typename T, bool is_signed> +struct BinaryLikeComparer {}; - static inline bool Compare(int type_length, const T& a, const T& b) { -const auto* aptr = reinterpret_cast<const PtrType*>(a.ptr); -const auto* bptr = reinterpret_cast<const PtrType*>(b.ptr); -return std::lexicographical_compare(aptr, aptr + a.len, bptr, bptr + b.len); +template <typename T> +struct BinaryLikeComparer<T, /*is_signed=*/false> { + static bool Compare(int type_length, const T& a, const T& b) { +int a_length = value_length(type_length, a); +int b_length = value_length(type_length, b); +// Unsigned comparison is used for non-numeric types so straight +// lexicographic comparison makes sense. (a.ptr is always unsigned) +return std::lexicographical_compare(a.ptr, a.ptr + a_length, b.ptr, b.ptr + b_length); } +}; - static T Min(int type_length, const T& a, const T& b) { -if (a.ptr == nullptr) return b; -if (b.ptr == nullptr) return a; -return Compare(type_length, a, b) ? a : b; - } +template <typename T> +struct BinaryLikeComparer<T, /*is_signed=*/true> { + static bool Compare(int type_length, const T& a, const T& b) { +// Signed comparison is used for integers encoded as big-endian twos +// complement integers (e.g. decimals). +int a_length = value_length(type_length, a); +int b_length = value_length(type_length, b); + +// At least one of the lengths is zero. +if (a_length == 0 || b_length == 0) { + return a_length == 0 && b_length > 0; +} - static T Max(int type_length, const T& a, const T& b) { -if (a.ptr == nullptr) return b; -if (b.ptr == nullptr) return a; -return Compare(type_length, a, b) ? b : a; +int8_t first_a = *a.ptr; +int8_t first_b = *b.ptr; +// We can short circuit for differently signed numbers or +// for equal length byte arrays that have different first bytes. +if ((0x80 & first_a) != (0x80 & first_b) || +(a_length == b_length && first_a != first_b)) { + return first_a < first_b; +} Review comment: Added the example above. I can add more of the explanation if you think it is worthwhile. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
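As a worked illustration of the sign-extension logic in this hunk, the following standalone snippet compares variable-length big-endian two's-complement values by extending the shorter operand first. It is a sketch mirroring the comparer above, not the parquet implementation itself:

```c++
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// True if a < b, treating both as big-endian two's-complement integers and
// sign-extending the shorter one to the longer length.
bool SignedBigEndianLess(const std::vector<uint8_t>& a,
                         const std::vector<uint8_t>& b) {
  const bool a_neg = !a.empty() && (a[0] & 0x80) != 0;
  const bool b_neg = !b.empty() && (b[0] & 0x80) != 0;
  if (a_neg != b_neg) return a_neg;  // differing signs decide immediately
  const size_t n = std::max(a.size(), b.size());
  auto extend = [n](const std::vector<uint8_t>& v, bool neg) {
    std::vector<uint8_t> out(n - v.size(), neg ? 0xFF : 0x00);
    out.insert(out.end(), v.begin(), v.end());
    return out;
  };
  // With equal lengths and equal sign bits, plain lexicographic byte
  // comparison orders two's-complement values correctly.
  const std::vector<uint8_t> ea = extend(a, a_neg), eb = extend(b, b_neg);
  return std::lexicographical_compare(ea.begin(), ea.end(), eb.begin(), eb.end());
}

int main() {
  // -128 in one byte equals -128 in two bytes after sign extension.
  assert(!SignedBigEndianLess({0x80}, {0xFF, 0x80}));
  assert(!SignedBigEndianLess({0xFF, 0x80}, {0x80}));
  // 1 < 255, even though 0x01 > 0x00 on the first raw byte.
  assert(SignedBigEndianLess({0x01}, {0x00, 0xFF}));
  // Any negative value sorts before any non-negative one.
  assert(SignedBigEndianLess({0xFF}, {0x00}));
  return 0;
}
```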
[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587230891 ## File path: cpp/src/parquet/statistics_test.cc ## @@ -71,19 +71,25 @@ static FLBA FLBAFromString(const std::string& s) { TEST(Comparison, SignedByteArray) { auto comparator = MakeComparator<ByteArrayType>(Type::BYTE_ARRAY, SortOrder::SIGNED); - std::string s1 = "12345"; - std::string s2 = "12345678"; - ByteArray s1ba = ByteArrayFromString(s1); - ByteArray s2ba = ByteArrayFromString(s2); - ASSERT_TRUE(comparator->Compare(s1ba, s2ba)); + std::vector<uint8_t> byte_values[] = { + {0x80, 0x80, 0, 0}, {/*0xFF,*/ 0x80, 0, 0}, {/*0xFF,*/ 0xFF, 0x01, 0}, + {/*0xFF,0xFF,*/ 0x80, 0}, {/*0xFF,0xFF,0xFF,*/ 0x80}, {/*0xFF, 0xFF, 0xFF,*/ 0xFF}, + {/*0, 0,*/ 0x01, 0x01}, {/*0,*/ 0x01, 0x01, 0}, {0x01, 0x01, 0, 0}}; Review comment: Did something similar to this. I don't think ByteArray supports string insertion? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] dmyersturnbull opened a new issue #9628: write_feather incorrectly deletes files
dmyersturnbull opened a new issue #9628: URL: https://github.com/apache/arrow/issues/9628 I'm happy to fill this out on Jira and/or to submit a pull request. I just can't log in to the Jira right now -- as soon as I log in it "forgets". Two related issues about [write_feather](https://github.com/apache/arrow/blob/b8b44197866242aa1fec6a7f76e3e1005ac3c6de/python/pyarrow/feather.py#L119). First, pathlib: `write_feather` could accept a `pathlib.PurePath` since Python 3.6+ is already required. Second, files are deleted: on error, write_feather will delete a file that already existed. Suppose some code is supposed to read `myfile.feather`, update it, and write it back. If the writer throws an error, the original copy of `myfile.feather` is deleted. [Here's the culprit](https://github.com/apache/arrow/blob/b8b44197866242aa1fec6a7f76e3e1005ac3c6de/python/pyarrow/feather.py#L185).

```python
except Exception:
    if isinstance(dest, str):
        try:
            os.remove(dest)
        except os.error:
            pass
    raise
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter
github-actions[bot] commented on pull request #9627: URL: https://github.com/apache/arrow/pull/9627#issuecomment-790353990 https://issues.apache.org/jira/browse/ARROW-11856 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace opened a new pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter
westonpace opened a new pull request #9627: URL: https://github.com/apache/arrow/pull/9627 The type RecordBatchStreamWriter was in a type_fwd but never implemented anywhere. The proper type would be RecordBatchWriter. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] mathyingzhou edited a comment on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
mathyingzhou edited a comment on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-790297527 @pitrou Yes now it is ready for another review. I have fixed all the issues you mentioned, found and fixed a previously hidden bug and shortened the tests to about 650 lines (with more tests!) It should be a lot neater this time. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] mathyingzhou edited a comment on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
mathyingzhou edited a comment on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-790297527 @pitrou Yes now it is ready for another review. I have fixed all the issues you mentioned and shortened the tests to about 650 lines (with more tests!) It should be a lot neater this time. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] mathyingzhou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
mathyingzhou commented on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-790297527 @pitrou Yes now it is ready for another review. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou closed pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.
kou closed pull request #9622: URL: https://github.com/apache/arrow/pull/9622 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-790239103 > Congratulations @liyafan82 ! Do you have an idea how hard it will be to add zstd support? @pitrou Support for zstd should be much easier: as you can see, most of the effort here was devoted to framework changes and library selection, and that effort can be reused when supporting zstd. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sighingnow commented on a change in pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.
sighingnow commented on a change in pull request #9622: URL: https://github.com/apache/arrow/pull/9622#discussion_r586993756 ## File path: cpp/src/arrow/ArrowConfig.cmake.in ## @@ -71,19 +71,21 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static)) get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION) get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY) -add_library(arrow_bundled_dependencies STATIC IMPORTED) -set_target_properties( - arrow_bundled_dependencies - PROPERTIES -IMPORTED_LOCATION - "${arrow_lib_dir}/${CMAKE_STATIC_LIBRARY_PREFIX}arrow_bundled_dependencies${CMAKE_STATIC_LIBRARY_SUFFIX}" - ) +if(@ARROW_BUNDLED_STATIC_LIBS@) Review comment: @kou Fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sighingnow commented on a change in pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.
sighingnow commented on a change in pull request #9622: URL: https://github.com/apache/arrow/pull/9622#discussion_r586992064 ## File path: cpp/src/arrow/ArrowConfig.cmake.in ## @@ -71,19 +71,21 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static)) get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION) get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY) -add_library(arrow_bundled_dependencies STATIC IMPORTED) -set_target_properties( - arrow_bundled_dependencies - PROPERTIES -IMPORTED_LOCATION - "${arrow_lib_dir}/${CMAKE_STATIC_LIBRARY_PREFIX}arrow_bundled_dependencies${CMAKE_STATIC_LIBRARY_SUFFIX}" - ) +if(@ARROW_BUNDLED_STATIC_LIBS@) Review comment: Thanks! That definitely works! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9626: ARROW-11855: Memory leak in to_pandas when converting chunked struct array
github-actions[bot] commented on pull request #9626: URL: https://github.com/apache/arrow/pull/9626#issuecomment-790226986 https://issues.apache.org/jira/browse/ARROW-11855 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li commented on a change in pull request #9602: URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774 ## File path: rust/arrow/src/compute/kernels/sort.rs ## @@ -278,12 +347,27 @@ fn sort_boolean( let valids_len = valids.len(); let nulls_len = nulls.len(); -if !descending { -valids.sort_by(|a, b| a.1.cmp(&b.1)); -} else { -valids.sort_by(|a, b| a.1.cmp(&b.1).reverse()); -// reverse to keep a stable ordering -nulls.reverse(); +let mut len = values.len(); +match limit { +Some(limit) => { +len = limit.min(len); +if !descending { +valids.partial_sort(len, |a, b| cmp(a.1, b.1)); Review comment: > Sorry not "fast path" but would the performance be the same as `sort_by`? I'm afraid not. The sort function in Rust is a merge sort; it is slightly faster than the heap sort when sorting all elements of larger sets, but it requires twice the memory of the heap sort because of the auxiliary array.

```
sort 2^12                   time: [799.70 us 806.41 us 814.15 us]
sort 2^12 limit 2^12        time: [1.2848 ms 1.3012 ms 1.3229 ms]
sort nulls 2^12             time: [647.20 us 649.27 us 651.61 us]
sort nulls 2^12 limit 2^12  time: [780.17 us 788.48 us 798.04 us]
```

We can make a new function to reduce the repeated patterns, or make partial_sort fall back to the default sort when the limit equals the length. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace opened a new pull request #9626: ARROW-11855: Memory leak in to_pandas when converting chunked struct array
westonpace opened a new pull request #9626: URL: https://github.com/apache/arrow/pull/9626 When converting a chunked struct array to Python, the ownership of the Arrow arrays was not being properly tracked, so deleting the resulting pandas DataFrame would leave some buffers behind. I believe it was something like this (pseudo-code)...

```
array_tracker** refs[num_fields];
for chunk in chunks:
    *refs[chunk.index] = convert(...)
    for row in rows:
        row.owns.append(refs[chunk.index])
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li commented on a change in pull request #9602: URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774 ## File path: rust/arrow/src/compute/kernels/sort.rs ## @@ -278,12 +347,27 @@ fn sort_boolean( let valids_len = valids.len(); let nulls_len = nulls.len(); -if !descending { -valids.sort_by(|a, b| a.1.cmp(&b.1)); -} else { -valids.sort_by(|a, b| a.1.cmp(&b.1).reverse()); -// reverse to keep a stable ordering -nulls.reverse(); +let mut len = values.len(); +match limit { +Some(limit) => { +len = limit.min(len); +if !descending { +valids.partial_sort(len, |a, b| cmp(a.1, b.1)); Review comment: > Sorry not "fast path" but would the performance be the same as `sort_by`? I'm afraid not. The sort function in Rust is a merge sort; it is slightly faster than the heap sort when sorting all elements of larger sets, but it requires twice the memory of the heap sort because of the auxiliary array.

```
sort 2^12                   time: [799.70 us 806.41 us 814.15 us]
sort 2^12 limit 2^12        time: [1.2848 ms 1.3012 ms 1.3229 ms]
sort nulls 2^12             time: [647.20 us 649.27 us 651.61 us]
sort nulls 2^12 limit 2^12  time: [780.17 us 788.48 us 798.04 us]
```

We can make a new function to reduce the repeated patterns. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li commented on a change in pull request #9602: URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774 ## File path: rust/arrow/src/compute/kernels/sort.rs ## @@ -278,12 +347,27 @@ fn sort_boolean( let valids_len = valids.len(); let nulls_len = nulls.len(); -if !descending { -valids.sort_by(|a, b| a.1.cmp(&b.1)); -} else { -valids.sort_by(|a, b| a.1.cmp(&b.1).reverse()); -// reverse to keep a stable ordering -nulls.reverse(); +let mut len = values.len(); +match limit { +Some(limit) => { +len = limit.min(len); +if !descending { +valids.partial_sort(len, |a, b| cmp(a.1, b.1)); Review comment: > Sorry not "fast path" but would the performance be the same as `sort_by`? I'm afraid not. The sort function in Rust is a merge sort; it is slightly faster than the heap sort for larger sets, but it requires twice the memory of the heap sort because of the auxiliary array.

```
sort 2^12                   time: [799.70 us 806.41 us 814.15 us]
sort 2^12 limit 2^12        time: [1.2848 ms 1.3012 ms 1.3229 ms]
sort nulls 2^12             time: [647.20 us 649.27 us 651.61 us]
sort nulls 2^12 limit 2^12  time: [780.17 us 788.48 us 798.04 us]
```

We can make a new function to reduce the repeated patterns. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
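For readers unfamiliar with the primitive discussed above: the Rust `partial_sort` crate models C++'s `std::partial_sort`, which orders only the first `limit` elements and leaves the tail in unspecified order; that is what makes it cheaper than a full sort when `limit` is much smaller than the input. A minimal C++ illustration:

```c++
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  std::vector<int> values = {9, 4, 7, 1, 8, 3, 6};
  const std::size_t limit = 3;
  // Heap-select the smallest `limit` elements into sorted order at the
  // front; the remaining elements stay in unspecified order.
  std::partial_sort(values.begin(), values.begin() + limit, values.end());
  for (std::size_t i = 0; i < limit; i++) {
    std::printf("%d ", values[i]);  // prints: 1 3 4
  }
  std::printf("\n");
  return 0;
}
```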
[GitHub] [arrow] seddonm1 commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex
seddonm1 commented on pull request #9625: URL: https://github.com/apache/arrow/pull/9625#issuecomment-790222729 Thanks @andygrove . I have a few more PRs to do to finish this first phase of work. Then I think it's time to tackle type coercion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org