[GitHub] [arrow] rok commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string

2021-03-04 Thread GitBox


rok commented on a change in pull request #9606:
URL: https://github.com/apache/arrow/pull/9606#discussion_r587790806



##
File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc
##
@@ -231,6 +252,25 @@ TEST_F(TestIsInKernel, Decimal) {
 /*skip_nulls=*/true);
 }
 
+TEST_F(TestIsInKernel, DictionaryArray) {
+  for (auto index_ty : all_dictionary_index_types()) {
+CheckIsInDictionary(/*type=*/utf8(),
+/*index_type=*/index_ty,
+/*input_dictionary_json=*/R"(["A", "B", "C", "D"])",
+/*input_index_json=*/"[1, 2, null, 0]",
+/*value_set_json=*/R"(["A", "B", "C"])",
+/*expected_json=*/"[true, true, false, true]",
+/*skip_nulls=*/false);
+CheckIsInDictionary(/*type=*/float32(),

Review comment:
   Added.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rok commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string

2021-03-04 Thread GitBox


rok commented on a change in pull request #9606:
URL: https://github.com/apache/arrow/pull/9606#discussion_r587790691



##
File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc
##
@@ -72,6 +72,27 @@ void CheckIsInChunked(const std::shared_ptr<ChunkedArray>& input,
   AssertChunkedEqual(*expected, *actual);
 }
 
+void CheckIsInDictionary(const std::shared_ptr<DataType>& type,
+                         const std::shared_ptr<DataType>& index_type,
+                         const std::string& input_dictionary_json,
+                         const std::string& input_index_json,
+                         const std::string& value_set_json,
+                         const std::string& expected_json, bool skip_nulls = false) {
+  auto dict_type = dictionary(index_type, type);
+  auto indices = ArrayFromJSON(index_type, input_index_json);
+  auto dict = ArrayFromJSON(type, value_set_json);

Review comment:
   Ugh, thanks for catching that. Fixed.









[GitHub] [arrow] github-actions[bot] commented on pull request #9630: ARROW-11860: [Rust] [DataFusion] Add DataFusion logos

2021-03-04 Thread GitBox


github-actions[bot] commented on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790895297


   https://issues.apache.org/jira/browse/ARROW-11860







[GitHub] [arrow] ericwburden commented on pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-04 Thread GitBox


ericwburden commented on pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#issuecomment-790893609


   Sorry, just learned that the "Close" button on my phone didn't mean the window...







[GitHub] [arrow] ericwburden closed pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-04 Thread GitBox


ericwburden closed pull request #9624:
URL: https://github.com/apache/arrow/pull/9624


   







[GitHub] [arrow] alamb commented on a change in pull request #9494: ARROW-11626: [Rust][DataFusion] Move [DataFusion] examples to own project

2021-03-04 Thread GitBox


alamb commented on a change in pull request #9494:
URL: https://github.com/apache/arrow/pull/9494#discussion_r58078



##
File path: rust/datafusion-examples/Cargo.toml
##
@@ -0,0 +1,38 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "datafusion-examples"

Review comment:
   FYI @andygrove  -- I am not sure what you think about possibly releasing the examples as a crate of their own; I am not sure it would be necessary. However, perhaps the right way would be to eventually link to the example code from the docs.rs page https://docs.rs/datafusion/3.0.0/datafusion/
   
   I filed this JIRA https://issues.apache.org/jira/browse/ARROW-11863 to track that

##
File path: rust/datafusion/Cargo.toml
##
@@ -71,9 +71,6 @@ unicode-segmentation = "^1.7.1"
 rand = "0.8"
 criterion = "0.3"
 tempfile = "3"
-prost = "0.7"

Review comment:
    









[GitHub] [arrow] ianmcook commented on pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


ianmcook commented on pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#issuecomment-790887608


   @github-actions crossbow submit -g r







[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587765098



##
File path: r/tools/autobrew
##
@@ -48,7 +48,8 @@ fi
 # Hardcode this for my custom autobrew build
 rm -f $BREWDIR/lib/*.dylib
 AWS_LIBS="-laws-cpp-sdk-config -laws-cpp-sdk-transfer -laws-cpp-sdk-identity-management -laws-cpp-sdk-cognito-identity -laws-cpp-sdk-sts -laws-cpp-sdk-s3 -laws-cpp-sdk-core -laws-c-event-stream -laws-checksums -laws-c-common -lpthread -lcurl"
-PKG_LIBS="-L$BREWDIR/lib -lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_LIBS="-lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_DIRS="-L$BREWDIR/lib"

Review comment:
   I created ARROW-11861 to remind us of this.









[GitHub] [arrow] alamb commented on a change in pull request #9588: ARROW-11799: [Rust] fix len of string and binary arrays created from unbound iterator

2021-03-04 Thread GitBox


alamb commented on a change in pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#discussion_r587764593



##
File path: rust/arrow/src/array/array_binary.rs
##
@@ -258,6 +258,8 @@ where
 }
 }
 
+// calculate actual data_len, which may be different from the iterator's upper bound
+let data_len = offsets.len() - 1;

Review comment:
   After some more study I agree (that further improvements could be made, 
but this PR makes the situation better so it is good as it is)









[GitHub] [arrow] alamb commented on a change in pull request #9588: ARROW-11799: [Rust] fix len of string and binary arrays created from unbound iterator

2021-03-04 Thread GitBox


alamb commented on a change in pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#discussion_r587764593



##
File path: rust/arrow/src/array/array_binary.rs
##
@@ -258,6 +258,8 @@ where
 }
 }
 
+// calculate actual data_len, which may be different from the iterator's upper bound
+let data_len = offsets.len() - 1;

Review comment:
   After some more study I agree









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587761603



##
File path: r/R/dataset-partition.R
##
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment:
   Done in 26753e4e0ed08739467588cfb3ddcfe09f5aebcc. I will change the 
`\dontrun`s back to `\donttest` (or remove them altogether if possible) when I 
implement `@examplesIf`, as noted in ARROW-11849









[GitHub] [arrow] alamb commented on a change in pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex

2021-03-04 Thread GitBox


alamb commented on a change in pull request #9625:
URL: https://github.com/apache/arrow/pull/9625#discussion_r587742457



##
File path: rust/datafusion/tests/sql.rs
##
@@ -2051,13 +2054,19 @@ async fn test_string_expressions() -> Result<()> {
 test_expression!("character_length('chars')", "5");
 test_expression!("character_length('josé')", "4");
 test_expression!("character_length(NULL)", "NULL");
+test_expression!("chr(CAST(120 AS int))", "x");
+test_expression!("chr(CAST(128175 AS int))", "");

Review comment:
     indeed!

##
File path: rust/datafusion/src/physical_plan/functions.rs
##
@@ -1033,6 +1163,54 @@ mod tests {
 
 #[test]
 fn test_functions() -> Result<()> {
+        test_function!(
+            Ascii,
+            &[lit(ScalarValue::Utf8(Some("x".to_string())))],
+            Ok(Some(120)),
+            i32,
+            Int32,
+            Int32Array
+        );
+        test_function!(
+            Ascii,
+            &[lit(ScalarValue::Utf8(Some("ésoj".to_string())))],
+            Ok(Some(233)),
+            i32,
+            Int32,
+            Int32Array
+        );
+        test_function!(
+            Ascii,
+            &[lit(ScalarValue::Utf8(Some("💯".to_string())))],
+            Ok(Some(128175)),

Review comment:
   It is strange to me that a function called `ascii` can return something 
larger than `127`. However, it seems that quirkiness / awesomeness came from 
`postgres` :)
   
   ```
   alamb=# select ascii('💯');
    ascii  
   --------
    128175
   (1 row)
   ```
    

##
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##
@@ -658,32 +709,9 @@ pub fn rpad<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {
             Ok(Arc::new(result) as ArrayRef)
         }
         3 => {
-            let string_array: &GenericStringArray<T> = args[0]
-                .as_any()
-                .downcast_ref::<GenericStringArray<T>>()
-                .ok_or_else(|| {
-                    DataFusionError::Internal(
-                        "could not cast string to StringArray".to_string(),
-                    )
-                })?;
-
-            let length_array: &Int64Array = args[1]
-                .as_any()
-                .downcast_ref::<Int64Array>()
-                .ok_or_else(|| {
-                    DataFusionError::Internal(
-                        "could not cast length to Int64Array".to_string(),
-                    )
-                })?;
-
-            let fill_array: &GenericStringArray<T> = args[2]
-                .as_any()
-                .downcast_ref::<GenericStringArray<T>>()
-                .ok_or_else(|| {
-                    DataFusionError::Internal(
-                        "could not cast fill to StringArray".to_string(),
-                    )
-                })?;
+            let string_array = downcast_string_arg!(args[0], "string", T);

Review comment:
   these macros certainly make the code nicer to read.  









[GitHub] [arrow] pitrou commented on a change in pull request #9633: ARROW-11560: [C++][FlightRPC] fix mutex error on SIGINT

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9633:
URL: https://github.com/apache/arrow/pull/9633#discussion_r587752941



##
File path: cpp/src/arrow/flight/server.cc
##
@@ -899,6 +921,13 @@ Status FlightServerBase::Serve() {
 ARROW_ASSIGN_OR_RAISE(old_handler, SetSignalHandler(signum, new_handler));
 impl_->old_signal_handlers_.push_back(std::move(old_handler));
   }
+#ifndef _WIN32
+  if (sem_init(&impl_->waiting_sem_, /*pshared=*/0, /*value=*/0) != 0) {

Review comment:
   Unfortunately I don't think macOS supports anonymous semaphores.
   An alternative would be to use a non-blocking anonymous pipe. `write` is 
async-signal-safe.
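
   The self-pipe idea can be sketched as below. This is an illustrative stand-alone model, not Arrow's actual code; all names (`InitWakeupPipe`, `NotifyFromSignalHandler`, `WaitForWakeup`) are hypothetical. The handler does a single async-signal-safe `write()` to a non-blocking pipe, and an ordinary thread blocks in `read()` until the wakeup byte arrives.

   ```cpp
   #include <cassert>
   #include <cerrno>
   #include <fcntl.h>
   #include <unistd.h>

   static int pipe_fds[2];  // [0] = read end, [1] = write end

   // Create the pipe and make the write end non-blocking, so the signal
   // handler can never block even if the pipe buffer is full.
   bool InitWakeupPipe() {
     if (pipe(pipe_fds) != 0) return false;
     int flags = fcntl(pipe_fds[1], F_GETFL);
     return fcntl(pipe_fds[1], F_SETFL, flags | O_NONBLOCK) == 0;
   }

   // Async-signal-safe: calls only write(). EAGAIN means a wakeup is
   // already pending, which is fine.
   void NotifyFromSignalHandler() {
     char byte = 0;
     (void)write(pipe_fds[1], &byte, 1);
   }

   // Runs on an ordinary thread; blocks until the handler writes a byte.
   bool WaitForWakeup() {
     char byte;
     ssize_t n;
     do {
       n = read(pipe_fds[0], &byte, 1);
     } while (n == -1 && errno == EINTR);
     return n == 1;
   }

   int main() {
     assert(InitWakeupPipe());
     NotifyFromSignalHandler();  // simulate the SIGINT handler firing
     assert(WaitForWakeup());    // the waiting thread wakes up
     return 0;
   }
   ```

   Unlike unnamed semaphores, this works on macOS as well, which is the point of the suggestion.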









[GitHub] [arrow] lidavidm opened a new pull request #9633: ARROW-11560: [C++][FlightRPC] fix mutex error on SIGINT

2021-03-04 Thread GitBox


lidavidm opened a new pull request #9633:
URL: https://github.com/apache/arrow/pull/9633


   Recently, interrupting a Python Flight server started aborting the process 
instead, because gRPC on Linux started using some code which is not 
signal-safe. This PR fixes that by spawning a separate thread that waits on a 
semaphore, then shuts down the server; the semaphore is notified by the signal 
handler (which is a signal-safe operation).
   
   This is a little janky, but fixes the issue. This still builds on Windows in 
my tests, but I couldn't actually import the resulting PyArrow to test it. 
However, PyArrow 3.0 via Conda still works on Windows. I'm not able to test on 
MacOS myself.
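
   The mechanism described above can be modelled in a few lines. This is a sketch, not Arrow's implementation; the names (`OnSignal`, `RunServerWithShutdownThread`, the `server_stopped` flag standing in for the real server shutdown) are placeholders, and it is Linux-oriented since macOS lacks unnamed semaphores. The handler only calls `sem_post()`, one of the few operations POSIX permits in a signal handler, while a dedicated thread blocks in `sem_wait()` and then runs the non-signal-safe shutdown logic.

   ```cpp
   #include <atomic>
   #include <cassert>
   #include <cerrno>
   #include <semaphore.h>
   #include <thread>

   static sem_t waiting_sem;
   static std::atomic<bool> server_stopped{false};

   // Would be installed as the SIGINT handler; sem_post is async-signal-safe.
   extern "C" void OnSignal(int) { sem_post(&waiting_sem); }

   void RunServerWithShutdownThread() {
     assert(sem_init(&waiting_sem, /*pshared=*/0, /*value=*/0) == 0);
     std::thread waiter([] {
       // Block until the handler posts the semaphore, retrying on EINTR.
       while (sem_wait(&waiting_sem) == -1 && errno == EINTR) {}
       server_stopped.store(true);  // stand-in for the real server shutdown
     });
     OnSignal(2);  // simulate SIGINT being delivered
     waiter.join();
     sem_destroy(&waiting_sem);
   }

   int main() {
     RunServerWithShutdownThread();
     assert(server_stopped.load());
     return 0;
   }
   ```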







[GitHub] [arrow] lidavidm commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete

2021-03-04 Thread GitBox


lidavidm commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587733815



##
File path: cpp/src/arrow/io/caching.cc
##
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }
 
+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;

Review comment:
   `Future<T>` is already convertible to `Future<>`. So what you suggest would work, but we'd need the fail-fast behavior, yes.
   
   A flag like that, though, does change the semantics rather dramatically; I'd rather keep them as separate functions.









[GitHub] [arrow] alamb commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

2021-03-04 Thread GitBox


alamb commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-790849774


   switching to draft so it is clear this is not being merged as is and instead 
is being broken up







[GitHub] [arrow] alamb closed pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-04 Thread GitBox


alamb closed pull request #9595:
URL: https://github.com/apache/arrow/pull/9595


   







[GitHub] [arrow] westonpace commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete

2021-03-04 Thread GitBox


westonpace commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587731880



##
File path: cpp/src/arrow/io/caching.cc
##
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }
 
+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;

Review comment:
   I wonder if it might be better to add `fail_fast` as an optional flag to the existing method.  We may someday want `Future<std::vector<Result<T>>>` but also want the `fail_fast` behavior.
   
   Also, just speculation, but I wonder if it would be possible to have a cast (not sure if explicit or implicit) from `Future<T>` -> `Future<>`. Under the hood it could be `base.Then([] (const Result<T>& result) -> Status { return Status::OK(); })` (or perhaps there is a more efficient way to achieve the same thing).
   
   Then it could just be `static_cast<Future<>>(All(futures, Future::FailFast))`









[GitHub] [arrow] alamb closed pull request #9630: ARROW-11860: [Rust] [DataFusion] Add DataFusion logos

2021-03-04 Thread GitBox


alamb closed pull request #9630:
URL: https://github.com/apache/arrow/pull/9630


   







[GitHub] [arrow] pitrou commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations

2021-03-04 Thread GitBox


pitrou commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790847276


   We do use AWS async APIs at a couple places, but you're right this is not 
mandatory.
   OTOH, if we don't use them then we may have a double indirection (first our 
own executor, then AWS would use its own executor? or are the sync AWS APIs 
really sync?).







[GitHub] [arrow] alamb edited a comment on pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos

2021-03-04 Thread GitBox


alamb edited a comment on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790846558


   I think we can get this logo in and then add it to the docs as a follow on 
PR as @Dandandan  suggests







[GitHub] [arrow] alamb commented on pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos

2021-03-04 Thread GitBox


alamb commented on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790846558


   I think we can get this logo in and then add it to the docs as a follow on 
PR 







[GitHub] [arrow] alamb commented on pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos

2021-03-04 Thread GitBox


alamb commented on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790845573


   > I don't think this needs a JIRA?
   
   @andygrove  I will apply my new JIRA making script 
(https://github.com/apache/arrow/pull/9598)  to make one anyways :) so there is 
no question :)







[GitHub] [arrow] westonpace commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations

2021-03-04 Thread GitBox


westonpace commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790845365


   @pitrou Looks good to me.  I wonder if we will even need to use AWS' async 
mode?  It seems that calling into S3 this way is pretty much equivalent to 
overriding their executor with our own.
   
   Callers will have to take care to know that these futures will complete on 
the IOContext and transfer them off quickly but looking at it now I agree that 
the transfer doesn't belong in here. Looks great.







[GitHub] [arrow] alamb commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-04 Thread GitBox


alamb commented on a change in pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#discussion_r587725662



##
File path: rust/arrow/src/datatypes/native.rs
##
@@ -50,6 +50,12 @@ pub trait ArrowNativeType:
 None
 }
 
+/// Convert native type to isize.

Review comment:
   I think this is a good pattern and follows the existing `to_usize` etc 
that convert the arrow types to rust types.  









[GitHub] [arrow] lidavidm commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete

2021-03-04 Thread GitBox


lidavidm commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587706050



##
File path: cpp/src/arrow/io/caching.cc
##
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }
 
+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;

Review comment:
   `All` gives `Future<std::vector<Result<T>>>` and never marks the future failed; this gives `Future<>` and marks the future failed if any individual result failed. Additionally this 'fails fast' in that if any fail, you immediately get a result, instead of waiting for the rest of the futures.
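
   Both behaviours can be seen in a standalone sketch of the countdown from the diff, using `std::promise` as a stand-in for `arrow::Future` (this is not Arrow's implementation; `WaitState` and `WaitAll` are illustrative names). `remaining.fetch_sub(1) == 1` detects the *last* completion, while the fail-fast twist resolves the wait as soon as the *first* failure arrives.

   ```cpp
   #include <atomic>
   #include <cassert>
   #include <future>
   #include <vector>

   struct WaitState {
     std::atomic<size_t> remaining;
     std::atomic<bool> fulfilled{false};
     std::promise<bool> done;  // resolves to true iff every input succeeded

     explicit WaitState(size_t n) : remaining(n) {}

     // Invoked once per completing input, possibly from different threads.
     void OnComplete(bool ok) {
       if (!ok) {
         // Fail fast: the first failure resolves the wait immediately.
         if (!fulfilled.exchange(true)) done.set_value(false);
         return;
       }
       // fetch_sub returns the previous value, so 1 means "last one out".
       if (remaining.fetch_sub(1) == 1 && !fulfilled.exchange(true)) {
         done.set_value(true);
       }
     }
   };

   // Drive the state machine synchronously for demonstration purposes.
   bool WaitAll(const std::vector<bool>& results) {
     WaitState state(results.size());
     auto fut = state.done.get_future();
     for (bool ok : results) state.OnComplete(ok);
     return fut.get();
   }

   int main() {
     assert(WaitAll({true, true, true}) == true);    // all succeed
     assert(WaitAll({true, false, true}) == false);  // resolves at the failure
     return 0;
   }
   ```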









[GitHub] [arrow] westonpace commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete

2021-03-04 Thread GitBox


westonpace commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587704738



##
File path: cpp/src/arrow/io/caching.cc
##
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }
 
+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;

Review comment:
   What is the difference between this and `All` inside `future.h`?









[GitHub] [arrow] Dandandan edited a comment on pull request #9494: ARROW-11626: [Rust][DataFusion] Move [DataFusion] examples to own project

2021-03-04 Thread GitBox


Dandandan edited a comment on pull request #9494:
URL: https://github.com/apache/arrow/pull/9494#issuecomment-790819802


   PR is resurrected! (I think) @alamb 







[GitHub] [arrow] Dandandan commented on pull request #9494: ARROW-11626: [Rust][DataFusion] Move [DataFusion] examples to own project

2021-03-04 Thread GitBox


Dandandan commented on pull request #9494:
URL: https://github.com/apache/arrow/pull/9494#issuecomment-790819802


   PR is resurrected! (I think)







[GitHub] [arrow] pitrou commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations

2021-03-04 Thread GitBox


pitrou commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790818363


   I tested HDFS here: https://github.com/ursacomputing/crossbow/runs/2033412067







[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587691144



##
File path: cpp/src/arrow/util/cancel.cc
##
@@ -0,0 +1,167 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/cancel.h"
+
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/result.h"
+#include "arrow/util/atomic_shared_ptr.h"
+#include "arrow/util/io_util.h"
+#include "arrow/util/logging.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+
+#if ATOMIC_INT_LOCK_FREE != 2
+#error Lock-free atomic int required for signal safety
+#endif
+
+using internal::ReinstateSignalHandler;
+using internal::SetSignalHandler;
+using internal::SignalHandler;
+
+// NOTE: We care mainly about making the common case (not cancelled) fast.
+
+struct StopSourceImpl {
+  std::atomic<int> requested_{0};  // will be -1 or signal number if requested
+  std::mutex mutex_;
+  Status cancel_error_;
+};
+
+StopSource::StopSource() : impl_(new StopSourceImpl) {}
+
+StopSource::~StopSource() = default;
+
+void StopSource::RequestStop() { RequestStop(Status::Cancelled("Operation cancelled")); }
+
+void StopSource::RequestStop(Status st) {
+  std::lock_guard<std::mutex> lock(impl_->mutex_);
+  DCHECK(!st.ok());
+  if (!impl_->requested_) {
+impl_->requested_ = -1;
+impl_->cancel_error_ = std::move(st);
+  }
+}
+
+void StopSource::RequestStopFromSignal(int signum) {
+  // Only async-signal-safe code allowed here
+  impl_->requested_.store(signum);
+}
+
+StopToken StopSource::token() { return StopToken(impl_); }
+
+bool StopToken::IsStopRequested() {
+  if (!impl_) {
+return false;
+  }
+  return impl_->requested_.load() != 0;
+}
+
+Status StopToken::Poll() {
+  if (!impl_) {
+return Status::OK();
+  }
+  if (!impl_->requested_.load()) {
+return Status::OK();
+  }
+
+  std::lock_guard lock(impl_->mutex_);
+  if (impl_->cancel_error_.ok()) {
+auto signum = impl_->requested_.load();
+DCHECK_GT(signum, 0);
+impl_->cancel_error_ = internal::CancelledFromSignal(signum, "Operation 
cancelled");
+  }

Review comment:
   If you call Poll() twice, then the cancel_error corresponding to the 
signal will have been cached. So you can have both a cancel_error and a 
positive signal number.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] Dandandan commented on pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos

2021-03-04 Thread GitBox


Dandandan commented on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790812860


   LGTM! maybe we can also add it to /rust/datafusion/README.md like here 
https://github.com/andygrove/datafusion ?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587690293



##
File path: cpp/src/arrow/util/thread_pool_test.cc
##
@@ -137,30 +136,47 @@ class TestThreadPool : public ::testing::Test {
 return *ThreadPool::Make(threads);
   }
 
-  void SpawnAdds(ThreadPool* pool, int nadds, AddTaskFunc add_func) {
-AddTester add_tester(nadds);
+  void SpawnAdds(ThreadPool* pool, int nadds, AddTaskFunc add_func,
+ StopToken stop_token = StopToken::Unstoppable(),
+ StopSource* stop_source = nullptr) {

Review comment:
   Passing the token without the source tests a source that's not stopped, 
so it's deliberate. I could try to add overloads though.









[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587693382



##
File path: cpp/src/arrow/csv/reader.cc
##
@@ -833,9 +843,10 @@ class AsyncThreadedTableReader
   AsyncThreadedTableReader(MemoryPool* pool, std::shared_ptr<io::InputStream> input,
                            const ReadOptions& read_options,
                            const ParseOptions& parse_options,
-                           const ConvertOptions& convert_options, Executor* cpu_executor,
-                           Executor* io_executor)
-      : BaseTableReader(pool, input, read_options, parse_options, convert_options),
+                           const ConvertOptions& convert_options, StopToken stop_token,
+                           Executor* cpu_executor, Executor* io_executor)

Review comment:
   We must support the legacy `TableReader::Make` taking both a 
`MemoryPool*` and an `IOContext`.









[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587692069



##
File path: cpp/src/arrow/util/cancel.cc
##
@@ -0,0 +1,167 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/cancel.h"
+
+#include <atomic>
+#include <memory>
+#include <mutex>
+#include <utility>
+#include <vector>
+
+#include "arrow/result.h"
+#include "arrow/util/atomic_shared_ptr.h"
+#include "arrow/util/io_util.h"
+#include "arrow/util/logging.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+
+#if ATOMIC_INT_LOCK_FREE != 2
+#error Lock-free atomic int required for signal safety
+#endif
+
+using internal::ReinstateSignalHandler;
+using internal::SetSignalHandler;
+using internal::SignalHandler;
+
+// NOTE: We care mainly about making the common case (not cancelled) fast.
+
+struct StopSourceImpl {
+  std::atomic<int> requested_{0};  // will be -1 or signal number if requested
+  std::mutex mutex_;
+  Status cancel_error_;
+};
+
+StopSource::StopSource() : impl_(new StopSourceImpl) {}
+
+StopSource::~StopSource() = default;
+
+void StopSource::RequestStop() { RequestStop(Status::Cancelled("Operation cancelled")); }
+
+void StopSource::RequestStop(Status st) {
+  std::lock_guard<std::mutex> lock(impl_->mutex_);
+  DCHECK(!st.ok());
+  if (!impl_->requested_) {
+    impl_->requested_ = -1;
+    impl_->cancel_error_ = std::move(st);
+  }
+}
+
+void StopSource::RequestStopFromSignal(int signum) {
+  // Only async-signal-safe code allowed here
+  impl_->requested_.store(signum);
+}
+
+StopToken StopSource::token() { return StopToken(impl_); }
+
+bool StopToken::IsStopRequested() {
+  if (!impl_) {
+    return false;
+  }
+  return impl_->requested_.load() != 0;
+}
+
+Status StopToken::Poll() {
+  if (!impl_) {
+    return Status::OK();
+  }
+  if (!impl_->requested_.load()) {
+    return Status::OK();
+  }
+
+  std::lock_guard<std::mutex> lock(impl_->mutex_);
+  if (impl_->cancel_error_.ok()) {
+    auto signum = impl_->requested_.load();
+    DCHECK_GT(signum, 0);
+    impl_->cancel_error_ = internal::CancelledFromSignal(signum, "Operation cancelled");
+  }
+  return impl_->cancel_error_;
+}
+
+namespace {
+
+void HandleSignal(int signum);
+
+struct SignalStopState {
+  struct SavedSignalHandler {
+    int signum;
+    SignalHandler handler;
+  };
+
+  Status RegisterHandlers(const std::vector<int>& signals) {
+    if (!saved_handlers.empty()) {
+      return Status::Invalid("Signal handlers already registered");
+    }
+    for (int signum : signals) {
+      ARROW_ASSIGN_OR_RAISE(auto handler,
+                            SetSignalHandler(signum, SignalHandler{&HandleSignal}));
+      saved_handlers.push_back({signum, handler});
+    }
+    return Status::OK();
+  }
+
+  void UnregisterHandlers() {
+    auto handlers = std::move(saved_handlers);
+    for (const auto& h : handlers) {
+      ARROW_CHECK_OK(SetSignalHandler(h.signum, h.handler).status());
+    }
+  }
+
+  ~SignalStopState() { UnregisterHandlers(); }
+
+  StopSource stop_source;
+  std::vector<SavedSignalHandler> saved_handlers;
+};
+
+std::shared_ptr<SignalStopState> g_signal_stop_state;
+
+void HandleSignal(int signum) {
+  ReinstateSignalHandler(signum, &HandleSignal);
+  std::shared_ptr<SignalStopState> state = internal::atomic_load(&g_signal_stop_state);
+  if (state) {
+    state->stop_source.RequestStopFromSignal(signum);
+  }
+}
+
+}  // namespace
+
+Result<StopSource*> SetSignalStopSource() {
+  if (g_signal_stop_state) {
+    return Status::Invalid("Signal stop source already set up");
+  }
+  internal::atomic_store(&g_signal_stop_state, std::make_shared<SignalStopState>());
+  return &g_signal_stop_state->stop_source;
+}
+
+void ResetSignalStopSource() {
+  internal::atomic_store(&g_signal_stop_state, std::shared_ptr<SignalStopState>{});

Review comment:
   Good point, will change.









[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587691144



##
File path: cpp/src/arrow/util/cancel.cc
##
@@ -0,0 +1,167 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/cancel.h"
+
+#include <atomic>
+#include <memory>
+#include <mutex>
+#include <utility>
+#include <vector>
+
+#include "arrow/result.h"
+#include "arrow/util/atomic_shared_ptr.h"
+#include "arrow/util/io_util.h"
+#include "arrow/util/logging.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+
+#if ATOMIC_INT_LOCK_FREE != 2
+#error Lock-free atomic int required for signal safety
+#endif
+
+using internal::ReinstateSignalHandler;
+using internal::SetSignalHandler;
+using internal::SignalHandler;
+
+// NOTE: We care mainly about making the common case (not cancelled) fast.
+
+struct StopSourceImpl {
+  std::atomic<int> requested_{0};  // will be -1 or signal number if requested
+  std::mutex mutex_;
+  Status cancel_error_;
+};
+
+StopSource::StopSource() : impl_(new StopSourceImpl) {}
+
+StopSource::~StopSource() = default;
+
+void StopSource::RequestStop() { RequestStop(Status::Cancelled("Operation cancelled")); }
+
+void StopSource::RequestStop(Status st) {
+  std::lock_guard<std::mutex> lock(impl_->mutex_);
+  DCHECK(!st.ok());
+  if (!impl_->requested_) {
+    impl_->requested_ = -1;
+    impl_->cancel_error_ = std::move(st);
+  }
+}
+
+void StopSource::RequestStopFromSignal(int signum) {
+  // Only async-signal-safe code allowed here
+  impl_->requested_.store(signum);
+}
+
+StopToken StopSource::token() { return StopToken(impl_); }
+
+bool StopToken::IsStopRequested() {
+  if (!impl_) {
+    return false;
+  }
+  return impl_->requested_.load() != 0;
+}
+
+Status StopToken::Poll() {
+  if (!impl_) {
+    return Status::OK();
+  }
+  if (!impl_->requested_.load()) {
+    return Status::OK();
+  }
+
+  std::lock_guard<std::mutex> lock(impl_->mutex_);
+  if (impl_->cancel_error_.ok()) {
+    auto signum = impl_->requested_.load();
+    DCHECK_GT(signum, 0);
+    impl_->cancel_error_ = internal::CancelledFromSignal(signum, "Operation cancelled");
+  }

Review comment:
   If you call Poll() twice, then the cancel_error corresponding to the 
signal will have been cached.









[GitHub] [arrow] pitrou commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587690293



##
File path: cpp/src/arrow/util/thread_pool_test.cc
##
@@ -137,30 +136,47 @@ class TestThreadPool : public ::testing::Test {
 return *ThreadPool::Make(threads);
   }
 
-  void SpawnAdds(ThreadPool* pool, int nadds, AddTaskFunc add_func) {
-AddTester add_tester(nadds);
+  void SpawnAdds(ThreadPool* pool, int nadds, AddTaskFunc add_func,
+ StopToken stop_token = StopToken::Unstoppable(),
+ StopSource* stop_source = nullptr) {

Review comment:
   Passing the token without the source tests a source that's not stopped, 
so it's deliberate. I could try to add overloads though.









[GitHub] [arrow] github-actions[bot] commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations

2021-03-04 Thread GitBox


github-actions[bot] commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790802618


   https://issues.apache.org/jira/browse/ARROW-10846







[GitHub] [arrow] pitrou commented on pull request #9632: ARROW-10846: [C++] Add async filesystem operations

2021-03-04 Thread GitBox


pitrou commented on pull request #9632:
URL: https://github.com/apache/arrow/pull/9632#issuecomment-790802568


   @westonpace Would you like to review this?







[GitHub] [arrow] pitrou opened a new pull request #9632: ARROW-10846: [C++] Add async filesystem operations

2021-03-04 Thread GitBox


pitrou opened a new pull request #9632:
URL: https://github.com/apache/arrow/pull/9632


   







[GitHub] [arrow] bkietz commented on a change in pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-04 Thread GitBox


bkietz commented on a change in pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#discussion_r587606004



##
File path: cpp/src/arrow/util/thread_pool_test.cc
##
@@ -192,32 +208,50 @@ TEST_F(TestThreadPool, StressSpawnThreaded) {
 TEST_F(TestThreadPool, SpawnSlow) {
   // This checks that Shutdown() waits for all tasks to finish
   auto pool = this->MakeThreadPool(2);
-  SpawnAdds(pool.get(), 7, [](int x, int y, int* out) {
-return task_slow_add(0.02 /* seconds */, x, y, out);
-  });
+  SpawnAdds(pool.get(), 7, task_slow_add{0.02 /* seconds */});

Review comment:
   ```suggestion
 SpawnAdds(pool.get(), 7, task_slow_add{ /*seconds=*/0.02});
   ```

##
File path: cpp/src/arrow/util/io_util.cc
##
@@ -1608,6 +1640,39 @@ Result<SignalHandler> SetSignalHandler(int signum, const SignalHandler& handler)
   return Status::OK();
 }
 
+void ReinstateSignalHandler(int signum, SignalHandler::Callback handler) {
+#if !ARROW_HAVE_SIGACTION
+  // Cannot report any errors from signal() (but there shouldn't be any)
+  signal(signum, handler);
+#endif
+}
+
+Status SendSignal(int signum) {
+  if (raise(signum) == 0) {
+return Status::OK();
+  }
+  if (errno == EINVAL) {
+return Status::Invalid("Invalid signal number");
+  }
+  return IOErrorFromErrno(errno, "Failed to raise signal");
+}
+
+Status SendSignalToThread(int signum, uint64_t thread_id) {
+#ifdef _WIN32
+  return Status::NotImplemented("Cannot send signal to specific thread on Windows");
+#else
+  // Have to use a C-style cast because pthread_t can be a pointer *or* integer type
+  int r = pthread_kill((pthread_t)thread_id, signum);
+  if (r == 0) {
+return Status::OK();
+  }
+  if (r == EINVAL) {
+return Status::Invalid("Invalid signal number");

Review comment:
   ```suggestion
   return Status::Invalid("Invalid signal number ", signum);
   ```

##
File path: cpp/src/arrow/csv/reader.cc
##
@@ -833,9 +843,10 @@ class AsyncThreadedTableReader
   AsyncThreadedTableReader(MemoryPool* pool, std::shared_ptr<io::InputStream> input,
                            const ReadOptions& read_options,
                            const ParseOptions& parse_options,
-                           const ConvertOptions& convert_options, Executor* cpu_executor,
-                           Executor* io_executor)
-      : BaseTableReader(pool, input, read_options, parse_options, convert_options),
+                           const ConvertOptions& convert_options, StopToken stop_token,
+                           Executor* cpu_executor, Executor* io_executor)

Review comment:
   Seems we might prefer to rewrite this constructor to take an IOContext

##
File path: cpp/src/arrow/util/io_util_test.cc
##
@@ -623,5 +655,46 @@ TEST(FileUtils, LongPaths) {
 }
 #endif
 
+static std::atomic<int> signal_received;
+
+static void handle_signal(int signum) {
+  ReinstateSignalHandler(signum, &handle_signal);
+  signal_received.store(signum);
+}
+
+TEST(SendSignal, Generic) {
+  signal_received.store(0);
+  SignalHandlerGuard guard(SIGINT, &handle_signal);
+
+  ASSERT_EQ(signal_received.load(), 0);
+  ASSERT_OK(SendSignal(SIGINT));
+  BusyWait(1.0, [&]() { return signal_received.load() != 0; });
+  ASSERT_EQ(signal_received.load(), SIGINT);
+
+  // Re-try (exercise ReinstateSignalHandler)
+  signal_received.store(0);
+  ASSERT_OK(SendSignal(SIGINT));
+  BusyWait(1.0, [&]() { return signal_received.load() != 0; });
+  ASSERT_EQ(signal_received.load(), SIGINT);
+}
+
+TEST(SendSignal, ToThread) {
+#ifdef _WIN32
+  uint64_t dummy_thread_id = 42;
+  ASSERT_RAISES(NotImplemented, SendSignalToThread(SIGINT, dummy_thread_id));
+#else
+  // Have to use a C-style cast because pthread_t can be a pointer *or* integer type
+  uint64_t thread_id = (uint64_t)(pthread_self());

Review comment:
   ```suggestion
  uint64_t thread_id = (uint64_t)(pthread_self());  // NOLINT readability-casting
   ```

##
File path: cpp/src/arrow/util/cancel.cc
##
@@ -0,0 +1,167 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/cancel.h"
+
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/result.h"
+#include "arrow/util/atomic_shared_ptr.h"

[GitHub] [arrow] github-actions[bot] commented on pull request #9631: ARROW-11644: [Python][Parquet] Low-level Parquet encryption in Python, initial sketch for feedback

2021-03-04 Thread GitBox


github-actions[bot] commented on pull request #9631:
URL: https://github.com/apache/arrow/pull/9631#issuecomment-790779100


   https://issues.apache.org/jira/browse/ARROW-11644







[GitHub] [arrow] itamarst opened a new pull request #9631: ARROW-11644: [Python][Parquet] Low-level Parquet encryption in Python, initial sketch for feedback

2021-03-04 Thread GitBox


itamarst opened a new pull request #9631:
URL: https://github.com/apache/arrow/pull/9631


   This is still a work in progress, but some feedback would be helpful.







[GitHub] [arrow] emkornfield commented on pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-04 Thread GitBox


emkornfield commented on pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#issuecomment-790754904


   I thought there should be an assert-scalar-equals helper, but I was expecting 
it to be a macro, which is why I think I missed it.







[GitHub] [arrow] nealrichardson commented on a change in pull request #9579: ARROW-11774: [R] macos one line install

2021-03-04 Thread GitBox


nealrichardson commented on a change in pull request #9579:
URL: https://github.com/apache/arrow/pull/9579#discussion_r587611515



##
File path: r/tools/nixlibs.R
##
@@ -396,6 +401,7 @@ cmake_version <- function(cmd = "cmake") {
 with_s3_support <- function(env_vars) {
   arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || 
tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false"
   if (arrow_s3) {
+# TODO: add in brew versions of these?

Review comment:
   If it's trouble then don't bother, this is a rather esoteric case isn't 
it? Also it seems pretty likely that you would already have openssl and curl (I 
do and I didn't install them myself).









[GitHub] [arrow] pitrou closed pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-04 Thread GitBox


pitrou closed pull request #9582:
URL: https://github.com/apache/arrow/pull/9582


   







[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587609001



##
File path: r/R/dataset-partition.R
##
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment:
   Thanks, please also remove the conditionals from the other examples. I 
don't feel strongly about `\donttest` vs `\dontrun` (CRAN complains about 
dontrun examples on initial submission, which is why they were switched to 
donttest in the first place, but that's not our concern now) so that's fine, 
but FWIW `\donttest` is safe on Solaris: [we currently "pass" Solaris 
checks](https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/arrow-00check.html)
 with the fake arrow-without-arrow build.









[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587597922



##
File path: r/tools/autobrew
##
@@ -48,7 +48,8 @@ fi
 # Hardcode this for my custom autobrew build
 rm -f $BREWDIR/lib/*.dylib
 AWS_LIBS="-laws-cpp-sdk-config -laws-cpp-sdk-transfer 
-laws-cpp-sdk-identity-management -laws-cpp-sdk-cognito-identity 
-laws-cpp-sdk-sts -laws-cpp-sdk-s3 -laws-cpp-sdk-core -laws-c-event-stream 
-laws-checksums -laws-c-common -lpthread -lcurl"
-PKG_LIBS="-L$BREWDIR/lib -lparquet -larrow_dataset -larrow 
-larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_LIBS="-lparquet -larrow_dataset -larrow -larrow_bundled_dependencies 
-lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_DIRS="-L$BREWDIR/lib"

Review comment:
   It's based on an earlier version of Jeroen's autobrew, and it's closest 
now to how he actually builds the C++ dependencies. It looks like the new 
autobrew script is based on some artifacts that are bundled up in a separate 
process (see https://github.com/autobrew/bundler/blob/master/apache-arrow.sh). 









[GitHub] [arrow] jonkeane commented on a change in pull request #9579: ARROW-11774: [R] macos one line install

2021-03-04 Thread GitBox


jonkeane commented on a change in pull request #9579:
URL: https://github.com/apache/arrow/pull/9579#discussion_r587591793



##
File path: r/tools/nixlibs.R
##
@@ -396,6 +401,7 @@ cmake_version <- function(cmd = "cmake") {
 with_s3_support <- function(env_vars) {
   arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || 
tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false"
   if (arrow_s3) {
+# TODO: add in brew versions of these?

Review comment:
   I'm working on this, it ends up being a bit more complicated than simply 
`brew install`ing, but I'll get something working. 









[GitHub] [arrow] github-actions[bot] commented on pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos

2021-03-04 Thread GitBox


github-actions[bot] commented on pull request #9630:
URL: https://github.com/apache/arrow/pull/9630#issuecomment-790718123


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] andygrove opened a new pull request #9630: [MINOR] [Rust] [DataFusion] Add DataFusion logos

2021-03-04 Thread GitBox


andygrove opened a new pull request #9630:
URL: https://github.com/apache/arrow/pull/9630


   I don't think this needs a JIRA?
   
   https://user-images.githubusercontent.com/934084/109990656-d55ddf80-7cc6-11eb-8bbc-f21946fd1dfc.png
   
   https://user-images.githubusercontent.com/934084/109990665-d68f0c80-7cc6-11eb-891c-bf367cb5f447.png
   







[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587557839



##
File path: r/R/dataset-partition.R
##
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment:
   Done in 6918441bba2b0ca09f2103eaf5631adc0a076a6c









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587557393



##
File path: r/configure
##
@@ -26,13 +26,13 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'
 
 # Library settings
-PKG_CONFIG_NAME="arrow parquet arrow-dataset"
+PKG_CONFIG_NAME="arrow"
 PKG_DEB_NAME="(unsuppored)"
 PKG_RPM_NAME="(unsuppored)"
 PKG_BREW_NAME="apache-arrow"
 PKG_TEST_HEADER=""
-# These must be the same order as $(pkg-config --libs arrow-dataset)
-PKG_LIBS="-larrow_dataset -lparquet -larrow"
+PKG_LIBS="-larrow"
+BUNDLED_LIBS=""

Review comment:
   Done in 13edc1478e5003e3a2b3dc27612c6d85f2aa9269









[GitHub] [arrow] ianmcook commented on pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


ianmcook commented on pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#issuecomment-790693193


   > Oh, one other thing: in `inst/build_arrow_static.sh`, you'll want to make 
ARROW_PARQUET and ARROW_DATASET configurable by env var but default ON, just 
like ARROW_JEMALLOC. That way, we can make a CI job that sets them to OFF in 
the env so we can test this.
   
   Done in 8c5b235dddac47833b229323b4a216049eae5fea







[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587550424



##
File path: r/tools/autobrew
##
@@ -48,7 +48,8 @@ fi
 # Hardcode this for my custom autobrew build
 rm -f $BREWDIR/lib/*.dylib
 AWS_LIBS="-laws-cpp-sdk-config -laws-cpp-sdk-transfer 
-laws-cpp-sdk-identity-management -laws-cpp-sdk-cognito-identity 
-laws-cpp-sdk-sts -laws-cpp-sdk-s3 -laws-cpp-sdk-core -laws-c-event-stream 
-laws-checksums -laws-c-common -lpthread -lcurl"
-PKG_LIBS="-L$BREWDIR/lib -lparquet -larrow_dataset -larrow 
-larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_LIBS="-lparquet -larrow_dataset -larrow -larrow_bundled_dependencies 
-lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_DIRS="-L$BREWDIR/lib"

Review comment:
   What is the relationship of this script (`r/tools/autobrew`) to 
https://github.com/autobrew/scripts/blob/master/apache-arrow?









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-04 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r587526925



##
File path: r/R/dataset-partition.R
##
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment:
   Gotcha. For now, I'll remove the conditional in the example and I'll 
also change the `\donttest` to `\dontrun`. That way the example checks will 
pass on Solaris.









[GitHub] [arrow] pitrou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-03-04 Thread GitBox


pitrou commented on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-790619759


   That said, it's just a matter of implementing said generation properly :-)







[GitHub] [arrow] pitrou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-03-04 Thread GitBox


pitrou commented on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-790619135


   > Shall random decimal128 generation with precision > 18 be impossible? It 
used to be perfectly fine.
   
   Did it actually work? I think it would use random fixed size binary 
generation, which wouldn't guarantee the generated values fit in the given 
precision...
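For context on why a random byte pattern may not fit: a decimal with precision p holds at most p significant digits, while decimal128 spans up to 38 digits, so most random 16-byte values overflow precision 18. A quick illustrative check (plain Python, not the Arrow test utilities):

```python
def fits_precision(value: int, precision: int) -> bool:
    # A decimal of the given precision can represent at most
    # `precision` decimal digits of magnitude.
    return abs(value) < 10 ** precision
```

So a generator based on random fixed-size binary would routinely produce values that violate the declared precision.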
   







[GitHub] [arrow] mathyingzhou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-03-04 Thread GitBox


mathyingzhou commented on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-790615411


   @pitrou It is due to some new issue in Decimal128s.  Starting from 
https://github.com/apache/arrow/pull/8648/commits/68fd76fc5eba3350f341eb0fd7e4f83f7c83c51e
 we suddenly began to get "NotImplemented: random decimal128 generation with 
precision > 18". I reduced precision to 18 which caused ORC to misbehave. Shall 
random decimal128 generation with precision > 18 be impossible? It used to be 
perfectly fine.
   







[GitHub] [arrow] lidavidm commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete

2021-03-04 Thread GitBox


lidavidm commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587452478



##
File path: cpp/src/arrow/io/caching.h
##
@@ -99,6 +100,9 @@ class ARROW_EXPORT ReadRangeCache {
   /// \brief Read a range previously given to Cache().
   Result<std::shared_ptr<Buffer>> Read(ReadRange range);
 
+  /// \brief Wait until all ranges added so far have been cached.
+  Future<> Wait();

Review comment:
   The idea here is to cleanly separate out the I/O operations so you can 
schedule them on different executors. With a ReadAsync, you wouldn't be able to 
accomplish that unless you went and refactored the entire Parquet reader so 
that it was fully async-aware.









[GitHub] [arrow] lidavidm commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete

2021-03-04 Thread GitBox


lidavidm commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587451127



##
File path: cpp/src/arrow/io/caching.cc
##
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }
 
+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;

Review comment:
   Agreed. I can move this to future.h/future.cc since there's already a 
very similar helper there. 









[GitHub] [arrow] jmgpeeters commented on pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.

2021-03-04 Thread GitBox


jmgpeeters commented on pull request #9629:
URL: https://github.com/apache/arrow/pull/9629#issuecomment-790592214


   To be clear, these test failures should be due to 
https://github.com/apache/arrow-testing/pull/59 not being merged yet - which 
provides the associated materialised test data.







[GitHub] [arrow] pitrou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-03-04 Thread GitBox


pitrou commented on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-790587488


   It seems there was a crash on the "AMD64 Conda C++" CI build. Can you take a 
look?
   Note that you should be able to reproduce this locally using our 
docker-compose setup.
   One way to do this is to first install the `archery` utility: 
https://arrow.apache.org/docs/developers/archery.html
   and then run `archery docker run conda-cpp`.
   
   (or you may also directly run `docker-compose run --rm conda-cpp`)







[GitHub] [arrow] pitrou commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587428039



##
File path: cpp/src/parquet/statistics.cc
##
@@ -133,56 +136,95 @@ struct CompareHelper {
   static T Max(int type_length, const T& a, const T& b) {
 return Compare(0, a, b) ? b : a;
   }
-};  // namespace parquet
-
-template <bool is_signed>
-struct ByteLikeCompareHelperBase {
-  using T = ByteArrayType::c_type;
-  using PtrType = typename std::conditional<is_signed, int8_t, uint8_t>::type;
+};
 
-  static T DefaultMin() { return {}; }
-  static T DefaultMax() { return {}; }
-  static T Coalesce(T val, T fallback) { return val; }
+template <typename T, bool is_signed>
+struct BinaryLikeComparer {};
 
-  static inline bool Compare(int type_length, const T& a, const T& b) {
-const auto* aptr = reinterpret_cast<const PtrType*>(a.ptr);
-const auto* bptr = reinterpret_cast<const PtrType*>(b.ptr);
-return std::lexicographical_compare(aptr, aptr + a.len, bptr, bptr + b.len);
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/false> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+int a_length = value_length(type_length, a);
+int b_length = value_length(type_length, b);
+// Unsigned comparison is used for non-numeric types, so straight
+// lexicographic comparison makes sense. (a.ptr is always unsigned)
+return std::lexicographical_compare(a.ptr, a.ptr + a_length, b.ptr, b.ptr + b_length);
   }
+};
 
-  static T Min(int type_length, const T& a, const T& b) {
-if (a.ptr == nullptr) return b;
-if (b.ptr == nullptr) return a;
-return Compare(type_length, a, b) ? a : b;
-  }
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/true> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+// Is signed is used for integers encoded as big-endian twos
+// complement integers. (e.g. decimals).
+int a_length = value_length(type_length, a);
+int b_length = value_length(type_length, b);
+
+// At least one of the lengths is zero.
+if (a_length == 0 || b_length == 0) {
+  return a_length == 0 && b_length > 0;
+}
 
-  static T Max(int type_length, const T& a, const T& b) {
-if (a.ptr == nullptr) return b;
-if (b.ptr == nullptr) return a;
-return Compare(type_length, a, b) ? b : a;
+int8_t first_a = *a.ptr;
+int8_t first_b = *b.ptr;
+// We can short circuit for different signed numbers or
+// for equal length bytes arrays that have different first bytes.
+if ((0x80 & first_a) != (0x80 & first_b) ||
+(a_length == b_length && first_a != first_b)) {
+  return first_a < first_b;
+}
+// When the lengths are unequal and the numbers are of the same
+// sign we need to extend the digits.
+const uint8_t* a_start = a.ptr;
+const uint8_t* b_start = b.ptr;
+if (a_length != b_length) {
+  const uint8_t* lead_start = nullptr;
+  const uint8_t* lead_end = nullptr;
+  if (a_length > b_length) {
+int lead_length = a_length - b_length;
+lead_start = a.ptr;
+lead_end = a.ptr + lead_length;
+a_start += lead_length;
+  } else {
+DCHECK_LT(a_length, b_length);
+int lead_length = b_length - a_length;
+lead_start = b.ptr;
+lead_end = b.ptr + lead_length;
+b_start += lead_length;
+  }
+  // Compare extra bytes to the sign extension of the first
+  // byte of the other number.
+  uint8_t extension = first_a < 0 ? 0xFF : 0;
+  for (; lead_start != lead_end; lead_start++) {
+if (*lead_start < extension) {
+  // The first bytes of the long value are less
+  // than the extended short one.  So if a is the long value
+  // we can return true.
+  return a_length > b_length;
+} else if (*lead_start > extension) {

Review comment:
   Yes, thank you.
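The sign-extended, byte-wise comparison in the diff above can be cross-checked against Python's arbitrary-precision integers, which decode big-endian two's complement directly. This is a reference model for the ordering, not the Parquet code; the empty-buffer branch mirrors the zero-length handling in the C++ comparer:

```python
def compare_be_signed(a: bytes, b: bytes) -> bool:
    """Return True if a sorts before b as big-endian two's-complement ints."""
    # Zero-length values sort before any non-empty value, as in the diff.
    if len(a) == 0 or len(b) == 0:
        return len(a) == 0 and len(b) > 0
    return int.from_bytes(a, "big", signed=True) < int.from_bytes(b, "big", signed=True)
```

Feeding byte patterns of unequal length into both implementations is a cheap way to sanity-check the manual sign-extension logic.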









[GitHub] [arrow] pitrou commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587423364



##
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##
@@ -2673,6 +2673,36 @@ TEST(ArrowReadWrite, Decimal256) {
   CheckSimpleRoundtrip(table, 2, props_store_schema);
 }
 
+TEST(ArrowReadWrite, DecimalStats) {
+  using ::arrow::Decimal128;
+  using ::arrow::field;
+
+  auto type = ::arrow::decimal128(/*precision=*/8, /*scale=*/0);
+
+  const char* json = R"(["255", "128", null, "0", "1", "-127", "-128", "-129", 
"-255"])";
+  auto array = ::arrow::ArrayFromJSON(type, json);
+  auto table = ::arrow::Table::Make(::arrow::schema({field("root", type)}), 
{array});
+
+  std::shared_ptr<Buffer> buffer;
+  ASSERT_NO_FATAL_FAILURE(WriteTableToBuffer(table, /*row_group_size=*/100,
+ default_arrow_writer_properties(), &buffer));
+
+  std::unique_ptr<FileReader> reader;
+  ASSERT_OK_NO_THROW(OpenFile(std::make_shared<BufferReader>(buffer),
+ ::arrow::default_memory_pool(), &reader));
+
+  std::shared_ptr<::arrow::Scalar> min, max;
+  ReadSingleColumnFileStatistics(std::move(reader), &min, &max);
+
+  std::shared_ptr<::arrow::Scalar> expected_min, expected_max;
+  ASSERT_OK_AND_ASSIGN(expected_min, array->GetScalar(array->length() - 1));
+  ASSERT_OK_AND_ASSIGN(expected_max, array->GetScalar(0));
+  ASSERT_TRUE(expected_min->Equals(*min))

Review comment:
   I think we have `AssertScalarsEqual`?









[GitHub] [arrow] github-actions[bot] commented on pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.

2021-03-04 Thread GitBox


github-actions[bot] commented on pull request #9629:
URL: https://github.com/apache/arrow/pull/9629#issuecomment-790547752


   https://issues.apache.org/jira/browse/ARROW-11838







[GitHub] [arrow] jmgpeeters opened a new pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.

2021-03-04 Thread GitBox


jmgpeeters opened a new pull request #9629:
URL: https://github.com/apache/arrow/pull/9629


   The only code change required, AFAICT, is the calculation of num_dicts, 
which is no longer simply the number of fields, but rather the number of 
unique dictionary ids they point to. I'm calculating this on-demand, as it's 
quite cheap and not frequently called, but it could also be (p)re-computed on 
every addField.
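The num_dicts computation described above amounts to counting unique dictionary ids rather than fields; a minimal sketch of the idea (illustrative names, not the C++ IPC reader API):

```python
def num_dictionaries(field_dictionary_ids):
    # Several fields may share one dictionary id, so the number of
    # dictionary batches to expect is the number of unique ids,
    # not the number of dictionary-encoded fields.
    return len(set(field_dictionary_ids))
```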
   
   For now, I've added tests that read materialised data generated from Java, 
as we don't yet support writing IPC with shared dictionaries in C++ either 
(that is out of scope here). 
   
   Down the line, I would like full read & write support for shared 
dictionaries across at least C++, Python, Java and Julia, so I'll be coming 
back to this with follow-up PR's where needed. As part of that, I'll also 
change the tests to no longer rely on materialised files, but use the 
round-trip mechanism.







[GitHub] [arrow] pitrou commented on pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string

2021-03-04 Thread GitBox


pitrou commented on pull request #9606:
URL: https://github.com/apache/arrow/pull/9606#issuecomment-790546124


   Also note that the current implementation (in `scalar_set_lookup.cc`) is 
suboptimal as it decodes the dictionary array and runs the set lookup for each 
decoded dictionary value. It would be more efficient to compute and cache 
lookup results for distinct dictionary elements. That can be for a later JIRA, 
though.
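The caching strategy suggested above can be sketched in Python, with plain lists standing in for Arrow arrays (illustrative only, not the Arrow API): run the set lookup once per distinct dictionary value, then map the result through the indices.

```python
def is_in_dictionary(dictionary, indices, value_set):
    # Run the set lookup once per distinct dictionary value...
    value_set = set(value_set)
    per_value = [v in value_set for v in dictionary]
    # ...then map each index through the precomputed results.
    # A null (None) index yields False here, matching the expected
    # output in the tests above with skip_nulls=false.
    return [False if i is None else per_value[i] for i in indices]
```

For dictionary ["A", "B", "C", "D"], indices [1, 2, None, 0] and value set ["A", "B", "C"] this yields [True, True, False, True], matching the first test case in the diff.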







[GitHub] [arrow] pitrou commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9606:
URL: https://github.com/apache/arrow/pull/9606#discussion_r587386055



##
File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc
##
@@ -280,6 +320,28 @@ class TestIndexInKernel : public ::testing::Test {
 ASSERT_OK(actual.chunked_array()->ValidateFull());
 AssertChunkedEqual(*expected, *actual.chunked_array());
   }
+
+  void CheckIndexInDictionary(const std::shared_ptr<DataType>& type,
+  const std::shared_ptr<DataType>& index_type,
+  const std::string& input_dictionary_json,
+  const std::string& input_index_json,
+  const std::string& value_set_json,
+  const std::string& expected_json, bool skip_nulls = false) {
+auto dict_type = dictionary(index_type, type);
+auto indices = ArrayFromJSON(index_type, input_index_json);
+auto dict = ArrayFromJSON(type, value_set_json);

Review comment:
   Should be `input_dictionary_json` as well.

##
File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc
##
@@ -72,6 +72,27 @@ void CheckIsInChunked(const std::shared_ptr<ChunkedArray>& input,
   AssertChunkedEqual(*expected, *actual);
 }
 
+void CheckIsInDictionary(const std::shared_ptr<DataType>& type,
+ const std::shared_ptr<DataType>& index_type,
+ const std::string& input_dictionary_json,
+ const std::string& input_index_json,
+ const std::string& value_set_json,
+ const std::string& expected_json, bool skip_nulls = false) {
+  auto dict_type = dictionary(index_type, type);
+  auto indices = ArrayFromJSON(index_type, input_index_json);
+  auto dict = ArrayFromJSON(type, value_set_json);

Review comment:
   This should use `input_dictionary_json`, not `value_set_json`.

##
File path: cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc
##
@@ -231,6 +252,25 @@ TEST_F(TestIsInKernel, Decimal) {
 /*skip_nulls=*/true);
 }
 
+TEST_F(TestIsInKernel, DictionaryArray) {
+  for (auto index_ty : all_dictionary_index_types()) {
+CheckIsInDictionary(/*type=*/utf8(),
+/*index_type=*/index_ty,
+/*input_dictionary_json=*/R"(["A", "B", "C", "D"])",
+/*input_index_json=*/"[1, 2, null, 0]",
+/*value_set_json=*/R"(["A", "B", "C"])",
+/*expected_json=*/"[true, true, false, true]",
+/*skip_nulls=*/false);
+CheckIsInDictionary(/*type=*/float32(),

Review comment:
   Consider adding the following tests:
   ```c++
   // With nulls and skip_nulls=false
   CheckIsInDictionary(/*type=*/utf8(),
   /*index_type=*/index_ty,
   /*input_dictionary_json=*/R"(["A", "B", "C", "D"])",
   /*input_index_json=*/"[1, 3, null, 0, 1]",
   /*value_set_json=*/R"(["C", "B", "A", null])",
   /*expected_json=*/"[true, false, true, true, true]",
   /*skip_nulls=*/false);
   CheckIsInDictionary(/*type=*/utf8(),
   /*index_type=*/index_ty,
   /*input_dictionary_json=*/R"(["A", null, "C", "D"])",
   /*input_index_json=*/"[1, 3, null, 0, 1]",
   /*value_set_json=*/R"(["C", "B", "A", null])",
   /*expected_json=*/"[true, false, true, true, true]",
   /*skip_nulls=*/false);
   CheckIsInDictionary(/*type=*/utf8(),
   /*index_type=*/index_ty,
   /*input_dictionary_json=*/R"(["A", null, "C", "D"])",
   /*input_index_json=*/"[1, 3, null, 0, 1]",
   /*value_set_json=*/R"(["C", "B", "A"])",
   /*expected_json=*/"[false, false, false, true, 
false]",
   /*skip_nulls=*/false);
   
   // With nulls and skip_nulls=true
   CheckIsInDictionary(/*type=*/utf8(),
   /*index_type=*/index_ty,
   /*input_dictionary_json=*/R"(["A", "B", "C", "D"])",
   /*input_index_json=*/"[1, 3, null, 0, 1]",
   /*value_set_json=*/R"(["C", "B", "A", null])",
   /*expected_json=*/"[true, false, false, true, true]",
   /*skip_nulls=*/true);
   CheckIsInDictionary(/*type=*/utf8(),
   /*index_type=*/index_ty,
   /*input_dictionary_json=*/R"(["A", null, "C", "D"])",
   /*input_index_json=*/"[1, 3, null, 0, 1]",
   /*value_set_json=*/R"(["C", "B", "A", null])",
   

[GitHub] [arrow] pitrou commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587361262



##
File path: cpp/src/arrow/io/caching.cc
##
@@ -193,6 +195,36 @@ Result<std::shared_ptr<Buffer>> ReadRangeCache::Read(ReadRange range) {
   return Status::Invalid("ReadRangeCache did not find matching cache entry");
 }
 
+Future<> ReadRangeCache::Wait() {
+  struct State {
+    explicit State(std::vector<Future<std::shared_ptr<Buffer>>> f)
+        : futures(std::move(f)), remaining(futures.size()) {}
+    std::vector<Future<std::shared_ptr<Buffer>>> futures;
+    std::atomic<size_t> remaining;
+  };
+
+  std::vector<Future<std::shared_ptr<Buffer>>> futures;
+  for (const auto& entry : impl_->entries) {
+    futures.push_back(entry.future);
+  }
+  auto out = Future<>::Make();
+  auto state = std::make_shared<State>(std::move(futures));
+  for (const auto& future : state->futures) {
+    future.AddCallback([state, out](const Result<std::shared_ptr<Buffer>>&) mutable {
+      if (state->remaining.fetch_sub(1) != 1) return;

Review comment:
   Ideally this primitive would be globally available, instead of 
reimplemented here. So you could write for example:
   ```c++
 return Future::Gather(std::move(futures));
   ```
   
   Also there may be different kinds of "gather". By default you probably want 
to first error to immediately mark the resulting future errored.
   
   cc @westonpace 
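The countdown pattern being discussed (N input futures resolving one shared "gather" future) looks like this in Python's concurrent.futures, as a rough analogue of the C++ sketch; this is illustrative, not Arrow's Future API:

```python
import threading
from concurrent.futures import Future

def gather(futures):
    """Return a future that completes once every input future completes."""
    out = Future()
    remaining = [len(futures)]
    lock = threading.Lock()

    def on_done(_):
        # Last callback to fire resolves the gathered future.
        with lock:
            remaining[0] -= 1
            last = remaining[0] == 0
        if last:
            out.set_result(None)

    if not futures:
        out.set_result(None)
        return out
    for f in futures:
        f.add_done_callback(on_done)
    return out
```

As noted above, a production version would likely also want an error-short-circuiting variant that fails the gathered future on the first errored input.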









[GitHub] [arrow] pitrou commented on a change in pull request #9613: PARQUET-1993: [C++] expose way to wait for I/O to complete

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9613:
URL: https://github.com/apache/arrow/pull/9613#discussion_r587359365



##
File path: cpp/src/arrow/io/caching.h
##
@@ -99,6 +100,9 @@ class ARROW_EXPORT ReadRangeCache {
   /// \brief Read a range previously given to Cache().
   Result<std::shared_ptr<Buffer>> Read(ReadRange range);
 
+  /// \brief Wait until all ranges added so far have been cached.
+  Future<> Wait();

Review comment:
   Hmm... Wouldn't it be better to expose a `ReadAsync` equivalent to 
`Read`?
   ```c++
   Future<std::shared_ptr<Buffer>> Read(ReadRange range);
   ```
   









[GitHub] [arrow] pitrou commented on pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-04 Thread GitBox


pitrou commented on pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#issuecomment-790509864


   @westonpace @bkietz Feel free to review.







[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-04 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r587330687



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -517,20 +650,32 @@ where
 },
 );
 
-if !options.descending {
-valids.sort_by(|a, b| cmp_array(a.1.as_ref(), b.1.as_ref()))
+let mut len = values.len();
+let descending = options.descending;
+
+if let Some(size) = limit {
+len = size.min(len);
+}
+
+// We are not using partial_sort here, because array is ArrayRef;
+// something is not working well with that.

Review comment:
   Sorry, there is a bug in the previous version. It's fixed in the newest 
[commit](https://github.com/sundy-li/partial_sort/commit/372778d0011b4d647ed4ee13d1dacb0865eb266d#diff-b1a35a68f14e696205874893c07fd24fdb2b47c23cc0e0c80a30c7d53759R103-R141)
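The limit-aware sort in the diff (sort, then keep only `min(limit, len)` results) behaves like a top-k selection; a Python model using heapq (illustrative, not the Rust kernel):

```python
import heapq

def sorted_limit(values, limit=None, descending=False):
    # With no limit this is a full sort; with a limit only the first
    # `limit` results in sort order are produced, like a partial sort.
    if limit is None:
        return sorted(values, reverse=descending)
    if descending:
        return heapq.nlargest(limit, values)
    return heapq.nsmallest(limit, values)
```

heapq's nsmallest/nlargest avoid fully sorting the input when limit is much smaller than the array, which is the motivation for a partial sort in the kernel.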









[GitHub] [arrow] pitrou closed pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter

2021-03-04 Thread GitBox


pitrou closed pull request #9627:
URL: https://github.com/apache/arrow/pull/9627


   







[GitHub] [arrow] pitrou commented on a change in pull request #9626: ARROW-11855: [C++][Python] Memory leak in to_pandas when converting chunked struct array

2021-03-04 Thread GitBox


pitrou commented on a change in pull request #9626:
URL: https://github.com/apache/arrow/pull/9626#discussion_r587322650



##
File path: python/pyarrow/tests/test_pandas.py
##
@@ -2272,6 +2272,30 @@ def test_to_pandas(self):
 series = pd.Series(arr.to_pandas())
 tm.assert_series_equal(series, expected)
 
+def test_to_pandas_multiple_chunks(self):
+# ARROW-11855
+bytes_start = pa.total_allocated_bytes()

Review comment:
   Probably want to call `gc.collect()` just before this, to avoid false 
positives.
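The reason for collecting first: allocation-counter leak checks compare before/after readings, and uncollected garbage from earlier tests can hold allocations alive and skew the baseline. A sketch with a stand-in counter (`pa.total_allocated_bytes()` would play this role in the real test):

```python
import gc

def check_no_leak(allocated_bytes, fn):
    # Collect before taking the baseline so garbage from earlier work
    # doesn't inflate it, then again before the final reading so any
    # objects fn dropped are actually freed.
    gc.collect()
    before = allocated_bytes()
    fn()
    gc.collect()
    return allocated_bytes() == before
```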

##
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##
@@ -689,7 +691,8 @@ Status ConvertStruct(PandasOptions options, const 
ChunkedArray& data,
   auto name = array_type->field(static_cast<int>(field_idx))->name();
   if (!arr->field(static_cast(field_idx))->IsNull(i)) {
 // Value exists in child array, obtain it
-auto array = reinterpret_cast<PyArrayObject*>(fields_data[field_idx].obj());
+auto array = reinterpret_cast<PyArrayObject*>(
+fields_data[field_idx + fields_data_offset].obj());

Review comment:
   Does this mean that conversion could give the wrong results (in addition 
to being leaky)? If so, can you add a test showcasing that? (I believe you need 
the different chunks to be unequal...).









[GitHub] [arrow] westonpace commented on pull request #9626: ARROW-11855: [C++][Python] Memory leak in to_pandas when converting chunked struct array

2021-03-04 Thread GitBox


westonpace commented on pull request #9626:
URL: https://github.com/apache/arrow/pull/9626#issuecomment-790458923


   Failing check appears to be ARROW-11717 / unrelated to PR.  Please rerun or 
ignore.







[GitHub] [arrow] yoonsikp commented on pull request #8491: ARROW-10349: [Python] linux aarch64 wheels

2021-03-04 Thread GitBox


yoonsikp commented on pull request #8491:
URL: https://github.com/apache/arrow/pull/8491#issuecomment-790429618


   Would truly appreciate having aarch64 wheel support for apache arrow!







[GitHub] [arrow] Dandandan commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-03 Thread GitBox


Dandandan commented on pull request #9595:
URL: https://github.com/apache/arrow/pull/9595#issuecomment-790407618


   @alamb ha, thanks, fixed!







[GitHub] [arrow] emkornfield commented on pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#issuecomment-790405647


   @pitrou I think I addressed all of your comments, would you mind taking 
another look?







[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587235073



##
File path: cpp/src/parquet/statistics.cc
##
@@ -133,56 +136,95 @@ struct CompareHelper {
   static T Max(int type_length, const T& a, const T& b) {
 return Compare(0, a, b) ? b : a;
   }
-};  // namespace parquet
-
-template <bool is_signed>
-struct ByteLikeCompareHelperBase {
-  using T = ByteArrayType::c_type;
-  using PtrType = typename std::conditional<is_signed, int8_t, uint8_t>::type;
+};
 
-  static T DefaultMin() { return {}; }
-  static T DefaultMax() { return {}; }
-  static T Coalesce(T val, T fallback) { return val; }
+template <typename T, bool is_signed>
+struct BinaryLikeComparer {};
 
-  static inline bool Compare(int type_length, const T& a, const T& b) {
-const auto* aptr = reinterpret_cast<const PtrType*>(a.ptr);
-const auto* bptr = reinterpret_cast<const PtrType*>(b.ptr);
-return std::lexicographical_compare(aptr, aptr + a.len, bptr, bptr + b.len);
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/false> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+int a_length = value_length(type_length, a);
+int b_length = value_length(type_length, b);
+// Unsigned comparison is used for non-numeric types, so straight
+// lexicographic comparison makes sense. (a.ptr is always unsigned)
+return std::lexicographical_compare(a.ptr, a.ptr + a_length, b.ptr, b.ptr + b_length);
   }
+};
 
-  static T Min(int type_length, const T& a, const T& b) {
-if (a.ptr == nullptr) return b;
-if (b.ptr == nullptr) return a;
-return Compare(type_length, a, b) ? a : b;
-  }
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/true> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+// Is signed is used for integers encoded as big-endian twos
+// complement integers. (e.g. decimals).
+int a_length = value_length(type_length, a);
+int b_length = value_length(type_length, b);
+
+// At least one of the lengths is zero.
+if (a_length == 0 || b_length == 0) {
+  return a_length == 0 && b_length > 0;
+}
 
-  static T Max(int type_length, const T& a, const T& b) {
-if (a.ptr == nullptr) return b;
-if (b.ptr == nullptr) return a;
-return Compare(type_length, a, b) ? b : a;
+int8_t first_a = *a.ptr;
+int8_t first_b = *b.ptr;
+// We can short circuit for different signed numbers or
+// for equal length byte arrays that have different first bytes.
+if ((0x80 & first_a) != (0x80 & first_b) ||
+(a_length == b_length && first_a != first_b)) {
+  return first_a < first_b;
+}
+// When the lengths are unequal and the numbers are of the same
+// sign we need to extend the digits.
+const uint8_t* a_start = a.ptr;
+const uint8_t* b_start = b.ptr;
+if (a_length != b_length) {
+  const uint8_t* lead_start = nullptr;
+  const uint8_t* lead_end = nullptr;
+  if (a_length > b_length) {
+int lead_length = a_length - b_length;
+lead_start = a.ptr;
+lead_end = a.ptr + lead_length;
+a_start += lead_length;
+  } else {
+DCHECK_LT(a_length, b_length);
+int lead_length = b_length - a_length;
+lead_start = b.ptr;
+lead_end = b.ptr + lead_length;
+b_start += lead_length;
+  }
+  // Compare extra bytes to the sign extension of the first
+  // byte of the other number.
+  uint8_t extension = first_a < 0 ? 0xFF : 0;
+  for (; lead_start != lead_end; lead_start++) {
+if (*lead_start < extension) {
+  // The first bytes of the long value are less
+  // than the extended short one.  So if a is the long value
+  // we can return true.
+  return a_length > b_length;
+} else if (*lead_start > extension) {

Review comment:
   I did a slightly different factoring; let me know if you think that is 
clearer?
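
   The sign-extension comparison in the hunk above can be cross-checked
against a straightforward signed big-endian decode. Below is a minimal
Python sketch (illustrative only, not the C++ code under review; the
`compare_signed_be` name is hypothetical). It mirrors the diff's semantics:
an empty value sorts before any non-empty one, and shorter values compare
as if sign-extended.

```python
def compare_signed_be(a: bytes, b: bytes) -> bool:
    """Return True iff a < b, interpreting both byte strings as
    big-endian two's-complement integers (empty sorts first)."""
    # At least one of the lengths is zero: empty < non-empty.
    if len(a) == 0 or len(b) == 0:
        return len(a) == 0 and len(b) > 0
    # Python's int.from_bytes performs the sign extension for us.
    return int.from_bytes(a, "big", signed=True) < int.from_bytes(
        b, "big", signed=True)

# A shorter negative number compares as its sign-extended form:
assert compare_signed_be(b"\x80\x00", b"\xff\x80")   # -32768 < -128
assert not compare_signed_be(b"\x01\x01", b"\x01")   # 257 > 1
```

   Exhaustively comparing such a reference against the C++ comparator over
small random byte strings is one way to gain confidence in the lead-byte
loop above.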





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587234738



##
File path: cpp/src/parquet/statistics_test.cc
##
@@ -111,21 +117,28 @@ TEST(Comparison, UnsignedByteArray) {
 }
 
 TEST(Comparison, SignedFLBA) {
-  int size = 10;
+  int size = 4;
   auto comparator =
   MakeComparator(Type::FIXED_LEN_BYTE_ARRAY, SortOrder::SIGNED, 
size);
 
-  std::string s1 = "Anti123456";
-  std::string s2 = "Bunkd123456";
-  FLBA s1flba = FLBAFromString(s1);
-  FLBA s2flba = FLBAFromString(s2);
-  ASSERT_TRUE(comparator->Compare(s1flba, s2flba));
+  std::vector byte_values[] = {
+  {0x80, 0, 0, 0},  {0xFF, 0xFF, 0x01, 0},{0xFF, 0xFF, 0x80, 
0},
+  {0xFF, 0xFF, 0xFF, 0x80}, {0xFF, 0xFF, 0xFF, 0xFF}, {0, 0, 0x01, 0x01},
+  {0, 0x01, 0x01, 0},   {0x01, 0x01, 0, 0}};
+  std::vector values_to_compare;
+  for (auto& bytes : byte_values) {
+values_to_compare.emplace_back(FLBA(bytes.data()));
+  }
 
-  s1 = "Bünk123456";
-  s2 = "Bunk123456";
-  s1flba = FLBAFromString(s1);
-  s2flba = FLBAFromString(s2);
-  ASSERT_TRUE(comparator->Compare(s1flba, s2flba));
+  for (size_t x = 0; x < values_to_compare.size(); x++) {
+EXPECT_FALSE(comparator->Compare(values_to_compare[x], 
values_to_compare[x])) << x;
+for (size_t y = x + 1; y < values_to_compare.size(); y++) {
+  EXPECT_TRUE(comparator->Compare(values_to_compare[x], 
values_to_compare[y]))
+  << x << " " << y;
+  EXPECT_FALSE(comparator->Compare(values_to_compare[y], 
values_to_compare[x]))
+  << y << " " << x;
+}
+  }

Review comment:
   you probably shouldn't trust me.  I added a test when writing a column 
of Decimal128 values to arrow_reader_writer_test which failed without this 
change.  This required fixing some cases with extracting scalar values and 
there are still some broken aspects to the extraction.  I'll open up some JIRAs 
to fix these:
   1. Decimals encoded as Ints aren't extracted correctly.
   2. The decimal type chosen might not correspond to the decoded decimal type 
because we don't pass the arrow schema information through to the decoding. 
(i.e. we depend solely on precision for deciding Decimal128Scalar or 
Decimal256Scalar).









[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587233240



##
File path: cpp/src/parquet/statistics.cc
##
@@ -133,56 +136,95 @@ struct CompareHelper {
   static T Max(int type_length, const T& a, const T& b) {
 return Compare(0, a, b) ? b : a;
   }
-};  // namespace parquet
-
-template 
-struct ByteLikeCompareHelperBase {
-  using T = ByteArrayType::c_type;
-  using PtrType = typename std::conditional::type;
+};
 
-  static T DefaultMin() { return {}; }
-  static T DefaultMax() { return {}; }
-  static T Coalesce(T val, T fallback) { return val; }
+template 
+struct BinaryLikeComparer {};
 
-  static inline bool Compare(int type_length, const T& a, const T& b) {
-const auto* aptr = reinterpret_cast(a.ptr);
-const auto* bptr = reinterpret_cast(b.ptr);
-return std::lexicographical_compare(aptr, aptr + a.len, bptr, bptr + 
b.len);
+template 
+struct BinaryLikeComparer {
+  static bool Compare(int type_length, const T& a, const T& b) {
+int a_length = value_length(type_length, a);
+int b_length = value_length(type_length, b);
+// Unsigned comparison is used for non-numeric types so straight
+// lexicographic comparison makes sense. (a.ptr is always unsigned)
+return std::lexicographical_compare(a.ptr, a.ptr + a_length, b.ptr, b.ptr 
+ b_length);
   }
+};
 
-  static T Min(int type_length, const T& a, const T& b) {
-if (a.ptr == nullptr) return b;
-if (b.ptr == nullptr) return a;
-return Compare(type_length, a, b) ? a : b;
-  }
+template 
+struct BinaryLikeComparer {
+  static bool Compare(int type_length, const T& a, const T& b) {
+// Signed comparison is used for integers encoded as big-endian two's
+// complement integers (e.g. decimals).
+int a_length = value_length(type_length, a);
+int b_length = value_length(type_length, b);
+
+// At least one of the lengths is zero.
+if (a_length == 0 || b_length == 0) {
+  return a_length == 0 && b_length > 0;
+}
 
-  static T Max(int type_length, const T& a, const T& b) {
-if (a.ptr == nullptr) return b;
-if (b.ptr == nullptr) return a;
-return Compare(type_length, a, b) ? b : a;
+int8_t first_a = *a.ptr;
+int8_t first_b = *b.ptr;
+// We can short circuit for different signed numbers or
+// for equal length byte arrays that have different first bytes.
+if ((0x80 & first_a) != (0x80 & first_b) ||
+(a_length == b_length && first_a != first_b)) {
+  return first_a < first_b;
+}

Review comment:
   Added the example above.  I can add more of the explanation if you think 
it is worthwhile.









[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587230891



##
File path: cpp/src/parquet/statistics_test.cc
##
@@ -71,19 +71,25 @@ static FLBA FLBAFromString(const std::string& s) {
 TEST(Comparison, SignedByteArray) {
   auto comparator = MakeComparator(Type::BYTE_ARRAY, 
SortOrder::SIGNED);
 
-  std::string s1 = "12345";
-  std::string s2 = "12345678";
-  ByteArray s1ba = ByteArrayFromString(s1);
-  ByteArray s2ba = ByteArrayFromString(s2);
-  ASSERT_TRUE(comparator->Compare(s1ba, s2ba));
+  std::vector byte_values[] = {
+  {0x80, 0x80, 0, 0},   {/*0xFF,*/ 0x80, 0, 0}, {/*0xFF,*/ 0xFF, 
0x01, 0},
+  {/*0xFF,0xFF,*/ 0x80, 0}, {/*0xFF,0xFF,0xFF,*/ 0x80}, {/*0xFF, 0xFF, 
0xFF,*/ 0xFF},
+  {/*0, 0,*/ 0x01, 0x01},   {/*0,*/ 0x01, 0x01, 0}, {0x01, 0x01, 0, 
0}};

Review comment:
   did something similar to this.  I don't think ByteArray supports string 
insertion?









[GitHub] [arrow] dmyersturnbull opened a new issue #9628: write_feather incorrectly deletes files

2021-03-03 Thread GitBox


dmyersturnbull opened a new issue #9628:
URL: https://github.com/apache/arrow/issues/9628


   
   I'm happy to fill this out on Jira and/or to submit a pull request.
   I just can't log in to the Jira right now -- as soon as I log in it 
"forgets".
   
   Two related issues about 
[write_feather](https://github.com/apache/arrow/blob/b8b44197866242aa1fec6a7f76e3e1005ac3c6de/python/pyarrow/feather.py#L119).
   
    pathlib
   `write_feather` could accept a `pathlib.PurePath` since Python 3.6+ is 
already required.
   
    files are deleted
   
   On error, write_feather will delete a file that already existed.
   
   Suppose some code is supposed to read `myfile.feather`, update it, and write 
it back.
   If the writer throws an error, the original copy of `myfile.feather` is 
deleted.
   [Here's the 
culprit](https://github.com/apache/arrow/blob/b8b44197866242aa1fec6a7f76e3e1005ac3c6de/python/pyarrow/feather.py#L185).
   
   
   ```python
   except Exception:
   if isinstance(dest, str):
   try:
   os.remove(dest)
   except os.error:
   pass
   raise
   ```
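
   One standard way to avoid clobbering a pre-existing file is to write to a
temporary sibling file and atomically replace the destination only on
success. This is a sketch of that pattern, not the actual pyarrow fix;
`safe_write` and `write_fn` are hypothetical names.

```python
import os
import tempfile

def safe_write(dest, write_fn):
    """Invoke write_fn(path) against a temp file in dest's directory,
    then atomically replace dest only if writing succeeded.
    On error, the original dest (if any) is left untouched."""
    dir_ = os.path.dirname(os.path.abspath(dest))
    fd, tmp = tempfile.mkstemp(dir=dir_, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp)
        os.replace(tmp, dest)  # atomic rename on POSIX and Windows
    except Exception:
        try:
            os.remove(tmp)     # clean up only the temp file
        except OSError:
            pass
        raise
```

   With this shape, a failed update leaves `myfile.feather` exactly as it
was, instead of deleting it.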
   







[GitHub] [arrow] github-actions[bot] commented on pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter

2021-03-03 Thread GitBox


github-actions[bot] commented on pull request #9627:
URL: https://github.com/apache/arrow/pull/9627#issuecomment-790353990


   https://issues.apache.org/jira/browse/ARROW-11856







[GitHub] [arrow] westonpace opened a new pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter

2021-03-03 Thread GitBox


westonpace opened a new pull request #9627:
URL: https://github.com/apache/arrow/pull/9627


   The type RecordBatchStreamWriter was in a type_fwd but never implemented 
anywhere.  The proper type would be RecordBatchWriter.







[GitHub] [arrow] mathyingzhou edited a comment on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-03-03 Thread GitBox


mathyingzhou edited a comment on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-790297527


   @pitrou  Yes now it is ready for another review. I have fixed all the issues 
you mentioned, found and fixed a previously hidden bug and shortened the tests 
to about 650 lines (with more tests!) It should be a lot neater this time.















[GitHub] [arrow] kou closed pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.

2021-03-03 Thread GitBox


kou closed pull request #9622:
URL: https://github.com/apache/arrow/pull/9622


   







[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-03 Thread GitBox


liyafan82 commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-790239103


   > Congratulations @liyafan82 ! Do you have an idea how hard it will be to 
add zstd support?
   
   @pitrou Support for zstd should be much easier, as you can see, most of the 
effort is devoted to framework change and library selection. Such effort can be 
saved when supporting zstd. 







[GitHub] [arrow] sighingnow commented on a change in pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.

2021-03-03 Thread GitBox


sighingnow commented on a change in pull request #9622:
URL: https://github.com/apache/arrow/pull/9622#discussion_r586993756



##
File path: cpp/src/arrow/ArrowConfig.cmake.in
##
@@ -71,19 +71,21 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static))
 get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION)
 get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY)
 
-add_library(arrow_bundled_dependencies STATIC IMPORTED)
-set_target_properties(
-  arrow_bundled_dependencies
-  PROPERTIES
-IMPORTED_LOCATION
-
"${arrow_lib_dir}/${CMAKE_STATIC_LIBRARY_PREFIX}arrow_bundled_dependencies${CMAKE_STATIC_LIBRARY_SUFFIX}"
-  )
+if(@ARROW_BUNDLED_STATIC_LIBS@)

Review comment:
   @kou Fixed.









[GitHub] [arrow] sighingnow commented on a change in pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.

2021-03-03 Thread GitBox


sighingnow commented on a change in pull request #9622:
URL: https://github.com/apache/arrow/pull/9622#discussion_r586992064



##
File path: cpp/src/arrow/ArrowConfig.cmake.in
##
@@ -71,19 +71,21 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static))
 get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION)
 get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY)
 
-add_library(arrow_bundled_dependencies STATIC IMPORTED)
-set_target_properties(
-  arrow_bundled_dependencies
-  PROPERTIES
-IMPORTED_LOCATION
-
"${arrow_lib_dir}/${CMAKE_STATIC_LIBRARY_PREFIX}arrow_bundled_dependencies${CMAKE_STATIC_LIBRARY_SUFFIX}"
-  )
+if(@ARROW_BUNDLED_STATIC_LIBS@)

Review comment:
   Thanks! That definitely works!









[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-03 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -278,12 +347,27 @@ fn sort_boolean(
 let valids_len = valids.len();
 let nulls_len = nulls.len();
 
-if !descending {
-valids.sort_by(|a, b| a.1.cmp());
-} else {
-valids.sort_by(|a, b| a.1.cmp().reverse());
-// reverse to keep a stable ordering
-nulls.reverse();
+let mut len = values.len();
+match limit {
+Some(limit) => {
+len = limit.min(len);
+if !descending {
+valids.partial_sort(len, |a, b| cmp(a.1, b.1));

Review comment:
   > Sorry not "fast path" but would the performance be the same as 
`sort_by`?
   
   I'm afraid not. Rust's sort is a merge sort; it is slightly faster than a 
heap sort when sorting all elements of larger sets, but it requires twice 
the memory because of the auxiliary array. 
   
   ```
   sort 2^12   time:   [799.70 us 806.41 us 814.15 us]
   sort 2^12 limit 2^12 time:   [1.2848 ms 1.3012 ms 1.3229 ms]
   
   sort nulls 2^12 time:   [647.20 us 649.27 us 651.61 us]
   sort nulls 2^12 limit 2^12   time:   [780.17 us 788.48 us 798.04 us]
   ```
   
   We can make a new function to reduce the repeated patterns, or make 
partial_sort fall back to the default sort when limit equals len.
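
   The trade-off being discussed (full sort vs. heap-based top-k when a
limit is given) can be sketched in Python; this is illustrative only, not
the Arrow kernel, and `sort_with_limit` is a hypothetical helper.

```python
import heapq

def sort_with_limit(values, limit=None, descending=False):
    """Return the first `limit` elements of the sorted order.
    A heap-based partial selection avoids sorting the whole
    array when limit is much smaller than len(values)."""
    if limit is None or limit >= len(values):
        # Fall back to a full sort when the limit covers everything.
        return sorted(values, reverse=descending)
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(limit, values)  # returns its result already sorted

assert sort_with_limit([5, 1, 4, 2, 3], limit=2) == [1, 2]
assert sort_with_limit([5, 1, 4, 2, 3], limit=2, descending=True) == [5, 4]
```

   As with the benchmarks above, the heap approach only pays off when the
limit is well below the input length; at limit == len the full sort wins.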









[GitHub] [arrow] github-actions[bot] commented on pull request #9626: ARROW-11855: Memory leak in to_pandas when converting chunked struct array

2021-03-03 Thread GitBox


github-actions[bot] commented on pull request #9626:
URL: https://github.com/apache/arrow/pull/9626#issuecomment-790226986


   https://issues.apache.org/jira/browse/ARROW-11855











[GitHub] [arrow] westonpace opened a new pull request #9626: ARROW-11855: Memory leak in to_pandas when converting chunked struct array

2021-03-03 Thread GitBox


westonpace opened a new pull request #9626:
URL: https://github.com/apache/arrow/pull/9626


   When converting a struct with chunks to python the ownership of the arrow 
arrays was not being properly tracked and the deletion of the resulting pandas 
dataframe would leave some buffers behind.  I believe it was something like 
this (pseudo-code)...
   
   ```
   array_tracker** refs[num_fields];
   for chunk in chunks:
 *refs[chunk.index] = convert(...)
 for row in rows:
   row.owns.append(refs[chunk.index])
   ```















[GitHub] [arrow] seddonm1 commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex

2021-03-03 Thread GitBox


seddonm1 commented on pull request #9625:
URL: https://github.com/apache/arrow/pull/9625#issuecomment-790222729


   Thanks @andygrove . I have a few more PRs to do to finish this first phase 
of work. Then I think it's time to tackle type coercion.






