[GitHub] [arrow] Dandandan commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-03 Thread GitBox


Dandandan commented on pull request #9595:
URL: https://github.com/apache/arrow/pull/9595#issuecomment-790407618


   @alamb ha, thanks, fixed!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#issuecomment-790405647


   @pitrou I think I addressed all of your comments, would you mind taking 
another look?







[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587235073



##
File path: cpp/src/parquet/statistics.cc
##
@@ -133,56 +136,95 @@ struct CompareHelper {
   static T Max(int type_length, const T& a, const T& b) {
 return Compare(0, a, b) ? b : a;
   }
-};  // namespace parquet
-
-template <bool is_signed>
-struct ByteLikeCompareHelperBase {
-  using T = ByteArrayType::c_type;
-  using PtrType = typename std::conditional<is_signed, int8_t, uint8_t>::type;
+};
 
-  static T DefaultMin() { return {}; }
-  static T DefaultMax() { return {}; }
-  static T Coalesce(T val, T fallback) { return val; }
+template <typename T, bool is_signed>
+struct BinaryLikeComparer {};
 
-  static inline bool Compare(int type_length, const T& a, const T& b) {
-    const auto* aptr = reinterpret_cast<const PtrType*>(a.ptr);
-    const auto* bptr = reinterpret_cast<const PtrType*>(b.ptr);
-    return std::lexicographical_compare(aptr, aptr + a.len, bptr, bptr + b.len);
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/false> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+    int a_length = value_length(type_length, a);
+    int b_length = value_length(type_length, b);
+    // Unsigned comparison is used for non-numeric types so straight
+    // lexicographic comparison makes sense. (a.ptr is always unsigned.)
+    return std::lexicographical_compare(a.ptr, a.ptr + a_length, b.ptr,
+                                        b.ptr + b_length);
   }
+};
 
-  static T Min(int type_length, const T& a, const T& b) {
-if (a.ptr == nullptr) return b;
-if (b.ptr == nullptr) return a;
-return Compare(type_length, a, b) ? a : b;
-  }
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/true> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+    // is_signed is used for integers encoded as big-endian two's
+    // complement integers (e.g. decimals).
+    int a_length = value_length(type_length, a);
+    int b_length = value_length(type_length, b);
+
+    // At least one of the lengths is zero.
+    if (a_length == 0 || b_length == 0) {
+      return a_length == 0 && b_length > 0;
+    }
 
-  static T Max(int type_length, const T& a, const T& b) {
-    if (a.ptr == nullptr) return b;
-    if (b.ptr == nullptr) return a;
-    return Compare(type_length, a, b) ? b : a;
+    int8_t first_a = *a.ptr;
+    int8_t first_b = *b.ptr;
+    // We can short circuit for different signed numbers or
+    // for equal length byte arrays that have different first bytes.
+    if ((0x80 & first_a) != (0x80 & first_b) ||
+        (a_length == b_length && first_a != first_b)) {
+      return first_a < first_b;
+    }
+    // When the lengths are unequal and the numbers are of the same
+    // sign we need to extend the digits.
+    const uint8_t* a_start = a.ptr;
+    const uint8_t* b_start = b.ptr;
+    if (a_length != b_length) {
+      const uint8_t* lead_start = nullptr;
+      const uint8_t* lead_end = nullptr;
+      if (a_length > b_length) {
+        int lead_length = a_length - b_length;
+        lead_start = a.ptr;
+        lead_end = a.ptr + lead_length;
+        a_start += lead_length;
+      } else {
+        DCHECK_LT(a_length, b_length);
+        int lead_length = b_length - a_length;
+        lead_start = b.ptr;
+        lead_end = b.ptr + lead_length;
+        b_start += lead_length;
+      }
+      // Compare extra bytes to the sign extension of the first
+      // byte of the other number.
+      uint8_t extension = first_a < 0 ? 0xFF : 0;
+      for (; lead_start != lead_end; lead_start++) {
+        if (*lead_start < extension) {
+          // The first bytes of the long value are less
+          // than the extended short one, so if a is the long value
+          // we can return true.
+          return a_length > b_length;
+        } else if (*lead_start > extension) {

Review comment:
   I did a slightly different factoring; let me know if you think that is clearer?
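The sign-extension comparison being reviewed above can be sketched in Python as a reference. `signed_be_less` is a hypothetical name, and Python's arbitrary-precision integers are used only to cross-check the byte-wise algorithm:

```python
# Compare two big-endian two's complement byte strings of possibly
# different lengths, mirroring the signed BinaryLikeComparer logic:
# empty values sort first, then sign, then sign-extended bytes.
import itertools

def signed_be_less(a: bytes, b: bytes) -> bool:
    """Return True if a < b, both read as big-endian two's complement."""
    if not a or not b:
        return len(a) == 0 and len(b) > 0  # empty sorts before non-empty
    sign_a, sign_b = a[0] & 0x80, b[0] & 0x80
    if sign_a != sign_b:
        return sign_a > sign_b  # the negative value (0x80 set) is smaller
    # Same sign: sign-extend the shorter value, then compare byte-wise.
    width = max(len(a), len(b))
    ext = b"\xff" if sign_a else b"\x00"
    a_ext = ext * (width - len(a)) + a
    b_ext = ext * (width - len(b)) + b
    return a_ext < b_ext  # bytes compare lexicographically in Python

# Cross-check against Python's signed integer interpretation.
values = [b"\x80\x00", b"\xff\x7f", b"\x00\x01", b"\x7f\xff", b"\x01"]
for x, y in itertools.permutations(values, 2):
    expected = int.from_bytes(x, "big", signed=True) < int.from_bytes(y, "big", signed=True)
    assert signed_be_less(x, y) == expected
```

Within a single sign, sign-extended bytes compare correctly as plain unsigned bytes, which is why the C++ code only needs the extra lead-byte walk when the lengths differ.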









[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587234738



##
File path: cpp/src/parquet/statistics_test.cc
##
@@ -111,21 +117,28 @@ TEST(Comparison, UnsignedByteArray) {
 }
 
 TEST(Comparison, SignedFLBA) {
-  int size = 10;
+  int size = 4;
   auto comparator =
       MakeComparator<FLBAType>(Type::FIXED_LEN_BYTE_ARRAY, SortOrder::SIGNED, size);
 
-  std::string s1 = "Anti123456";
-  std::string s2 = "Bunkd123456";
-  FLBA s1flba = FLBAFromString(s1);
-  FLBA s2flba = FLBAFromString(s2);
-  ASSERT_TRUE(comparator->Compare(s1flba, s2flba));
+  std::vector<uint8_t> byte_values[] = {
+      {0x80, 0, 0, 0},          {0xFF, 0xFF, 0x01, 0},    {0xFF, 0xFF, 0x80, 0},
+      {0xFF, 0xFF, 0xFF, 0x80}, {0xFF, 0xFF, 0xFF, 0xFF}, {0, 0, 0x01, 0x01},
+      {0, 0x01, 0x01, 0},       {0x01, 0x01, 0, 0}};
+  std::vector<FLBA> values_to_compare;
+  for (auto& bytes : byte_values) {
+    values_to_compare.emplace_back(FLBA(bytes.data()));
+  }
 
-  s1 = "Bünk123456";
-  s2 = "Bunk123456";
-  s1flba = FLBAFromString(s1);
-  s2flba = FLBAFromString(s2);
-  ASSERT_TRUE(comparator->Compare(s1flba, s2flba));
+  for (size_t x = 0; x < values_to_compare.size(); x++) {
+    EXPECT_FALSE(comparator->Compare(values_to_compare[x], values_to_compare[x])) << x;
+    for (size_t y = x + 1; y < values_to_compare.size(); y++) {
+      EXPECT_TRUE(comparator->Compare(values_to_compare[x], values_to_compare[y]))
+          << x << " " << y;
+      EXPECT_FALSE(comparator->Compare(values_to_compare[y], values_to_compare[x]))
+          << y << " " << x;
+    }
+  }

Review comment:
   you probably shouldn't trust me.  I added a test when writing a column 
of Decimal128 values to arrow_reader_writer_test which failed without this 
change.  This required fixing some cases with extracting scalar values and 
there are still some broken aspects to the extraction.  I'll open up some JIRAs 
to fix these:
   1. Decimals encoded as Ints aren't extracted correctly.
   2. The decimal type chosen might not correspond to the decoded decimal type 
because we don't pass the arrow schema information through to the decoding. 
(i.e. we depend solely on precision for deciding Decimal128Scalar or 
Decimal256Scalar).









[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587233240



##
File path: cpp/src/parquet/statistics.cc
##
@@ -133,56 +136,95 @@ struct CompareHelper {
   static T Max(int type_length, const T& a, const T& b) {
 return Compare(0, a, b) ? b : a;
   }
-};  // namespace parquet
-
-template <bool is_signed>
-struct ByteLikeCompareHelperBase {
-  using T = ByteArrayType::c_type;
-  using PtrType = typename std::conditional<is_signed, int8_t, uint8_t>::type;
+};
 
-  static T DefaultMin() { return {}; }
-  static T DefaultMax() { return {}; }
-  static T Coalesce(T val, T fallback) { return val; }
+template <typename T, bool is_signed>
+struct BinaryLikeComparer {};
 
-  static inline bool Compare(int type_length, const T& a, const T& b) {
-    const auto* aptr = reinterpret_cast<const PtrType*>(a.ptr);
-    const auto* bptr = reinterpret_cast<const PtrType*>(b.ptr);
-    return std::lexicographical_compare(aptr, aptr + a.len, bptr, bptr + b.len);
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/false> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+    int a_length = value_length(type_length, a);
+    int b_length = value_length(type_length, b);
+    // Unsigned comparison is used for non-numeric types so straight
+    // lexicographic comparison makes sense. (a.ptr is always unsigned.)
+    return std::lexicographical_compare(a.ptr, a.ptr + a_length, b.ptr,
+                                        b.ptr + b_length);
   }
+};
 
-  static T Min(int type_length, const T& a, const T& b) {
-if (a.ptr == nullptr) return b;
-if (b.ptr == nullptr) return a;
-return Compare(type_length, a, b) ? a : b;
-  }
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/true> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+    // is_signed is used for integers encoded as big-endian two's
+    // complement integers (e.g. decimals).
+    int a_length = value_length(type_length, a);
+    int b_length = value_length(type_length, b);
+
+    // At least one of the lengths is zero.
+    if (a_length == 0 || b_length == 0) {
+      return a_length == 0 && b_length > 0;
+    }
 
-  static T Max(int type_length, const T& a, const T& b) {
-    if (a.ptr == nullptr) return b;
-    if (b.ptr == nullptr) return a;
-    return Compare(type_length, a, b) ? b : a;
+    int8_t first_a = *a.ptr;
+    int8_t first_b = *b.ptr;
+    // We can short circuit for different signed numbers or
+    // for equal length byte arrays that have different first bytes.
+    if ((0x80 & first_a) != (0x80 & first_b) ||
+        (a_length == b_length && first_a != first_b)) {
+      return first_a < first_b;
+    }

Review comment:
   Added the example above.  I can add more of the explanation if you think 
it is worthwhile.









[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics

2021-03-03 Thread GitBox


emkornfield commented on a change in pull request #9582:
URL: https://github.com/apache/arrow/pull/9582#discussion_r587230891



##
File path: cpp/src/parquet/statistics_test.cc
##
@@ -71,19 +71,25 @@ static FLBA FLBAFromString(const std::string& s) {
 TEST(Comparison, SignedByteArray) {
   auto comparator = MakeComparator<ByteArrayType>(Type::BYTE_ARRAY, SortOrder::SIGNED);
 
-  std::string s1 = "12345";
-  std::string s2 = "12345678";
-  ByteArray s1ba = ByteArrayFromString(s1);
-  ByteArray s2ba = ByteArrayFromString(s2);
-  ASSERT_TRUE(comparator->Compare(s1ba, s2ba));
+  std::vector<uint8_t> byte_values[] = {
+      {0x80, 0x80, 0, 0},       {/*0xFF,*/ 0x80, 0, 0},     {/*0xFF,*/ 0xFF, 0x01, 0},
+      {/*0xFF,0xFF,*/ 0x80, 0}, {/*0xFF,0xFF,0xFF,*/ 0x80}, {/*0xFF, 0xFF, 0xFF,*/ 0xFF},
+      {/*0, 0,*/ 0x01, 0x01},   {/*0,*/ 0x01, 0x01, 0},     {0x01, 0x01, 0, 0}};

Review comment:
   did something similar to this.  I don't think ByteArray supports string 
insertion?









[GitHub] [arrow] dmyersturnbull opened a new issue #9628: write_feather incorrectly deletes files

2021-03-03 Thread GitBox


dmyersturnbull opened a new issue #9628:
URL: https://github.com/apache/arrow/issues/9628


   
   I'm happy to fill this out on Jira and/or to submit a pull request.
   I just can't log in to the Jira right now -- as soon as I log in it 
"forgets".
   
   Two related issues about 
[write_feather](https://github.com/apache/arrow/blob/b8b44197866242aa1fec6a7f76e3e1005ac3c6de/python/pyarrow/feather.py#L119).
   
   **pathlib**
   `write_feather` could accept a `pathlib.PurePath` since Python 3.6+ is already required.
   
   **files are deleted**
   
   On error, `write_feather` will delete a file that already existed.
   
   Suppose some code is supposed to read `myfile.feather`, update it, and write 
it back.
   If the writer throws an error, the original copy of `myfile.feather` is 
deleted.
   [Here's the 
culprit](https://github.com/apache/arrow/blob/b8b44197866242aa1fec6a7f76e3e1005ac3c6de/python/pyarrow/feather.py#L185).
   
   
   ```python
   except Exception:
       if isinstance(dest, str):
           try:
               os.remove(dest)
           except os.error:
               pass
       raise
   ```
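One way to avoid this class of bug, sketched below with a hypothetical `safe_write` helper (this is not the pyarrow implementation), is to write to a temporary file and replace the destination only on success, so a failure can never remove a pre-existing file:

```python
# Write payload to dest without ever deleting a pre-existing dest:
# write to a sibling temp file, then atomically replace on success.
import os
import tempfile

def safe_write(dest, payload: bytes):
    dirname = os.path.dirname(os.path.abspath(dest))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, dest)  # atomic on POSIX and Windows
    except Exception:
        os.remove(tmp)  # only the temp file is removed, never dest
        raise
```

Cleaning up only the file the writer itself created sidesteps the "read, update, write back" failure mode described above.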
   







[GitHub] [arrow] github-actions[bot] commented on pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter

2021-03-03 Thread GitBox


github-actions[bot] commented on pull request #9627:
URL: https://github.com/apache/arrow/pull/9627#issuecomment-790353990


   https://issues.apache.org/jira/browse/ARROW-11856







[GitHub] [arrow] westonpace opened a new pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter

2021-03-03 Thread GitBox


westonpace opened a new pull request #9627:
URL: https://github.com/apache/arrow/pull/9627


   The type RecordBatchStreamWriter was declared in a type_fwd header but never implemented anywhere. The proper type to use is RecordBatchWriter.







[GitHub] [arrow] mathyingzhou edited a comment on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-03-03 Thread GitBox


mathyingzhou edited a comment on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-790297527


   @pitrou Yes, it is now ready for another review. I have fixed all the issues you mentioned, found and fixed a previously hidden bug, and shortened the tests to about 650 lines (with more tests!). It should be a lot neater this time.







[GitHub] [arrow] kou closed pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.

2021-03-03 Thread GitBox


kou closed pull request #9622:
URL: https://github.com/apache/arrow/pull/9622


   







[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-03 Thread GitBox


liyafan82 commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-790239103


   > Congratulations @liyafan82 ! Do you have an idea how hard it will be to 
add zstd support?
   
   @pitrou Support for zstd should be much easier. As you can see, most of the effort went into framework changes and library selection, and that work can be reused when adding zstd support.







[GitHub] [arrow] sighingnow commented on a change in pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.

2021-03-03 Thread GitBox


sighingnow commented on a change in pull request #9622:
URL: https://github.com/apache/arrow/pull/9622#discussion_r586993756



##
File path: cpp/src/arrow/ArrowConfig.cmake.in
##
@@ -71,19 +71,21 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static))
 get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION)
 get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY)
 
-    add_library(arrow_bundled_dependencies STATIC IMPORTED)
-    set_target_properties(
-      arrow_bundled_dependencies
-      PROPERTIES
-        IMPORTED_LOCATION
-          "${arrow_lib_dir}/${CMAKE_STATIC_LIBRARY_PREFIX}arrow_bundled_dependencies${CMAKE_STATIC_LIBRARY_SUFFIX}"
-      )
+    if(@ARROW_BUNDLED_STATIC_LIBS@)

Review comment:
   @kou Fixed.









[GitHub] [arrow] sighingnow commented on a change in pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.

2021-03-03 Thread GitBox


sighingnow commented on a change in pull request #9622:
URL: https://github.com/apache/arrow/pull/9622#discussion_r586992064



##
File path: cpp/src/arrow/ArrowConfig.cmake.in
##
@@ -71,19 +71,21 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static))
 get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION)
 get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY)
 
-    add_library(arrow_bundled_dependencies STATIC IMPORTED)
-    set_target_properties(
-      arrow_bundled_dependencies
-      PROPERTIES
-        IMPORTED_LOCATION
-          "${arrow_lib_dir}/${CMAKE_STATIC_LIBRARY_PREFIX}arrow_bundled_dependencies${CMAKE_STATIC_LIBRARY_SUFFIX}"
-      )
+    if(@ARROW_BUNDLED_STATIC_LIBS@)

Review comment:
   Thanks! That definitely works!









[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-03 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -278,12 +347,27 @@ fn sort_boolean(
 let valids_len = valids.len();
 let nulls_len = nulls.len();
 
-        if !descending {
-            valids.sort_by(|a, b| a.1.cmp(&b.1));
-        } else {
-            valids.sort_by(|a, b| a.1.cmp(&b.1).reverse());
-            // reverse to keep a stable ordering
-            nulls.reverse();
+        let mut len = values.len();
+        match limit {
+            Some(limit) => {
+                len = limit.min(len);
+                if !descending {
+                    valids.partial_sort(len, |a, b| cmp(a.1, b.1));

Review comment:
   > Sorry not "fast path" but would the performance be the same as 
`sort_by`?
   
   I'm afraid not. Rust's standard sort is a merge sort; it is slightly faster than a heap sort when sorting all elements of larger sets, but it requires twice the memory of a heap sort because of the auxiliary array.
   
   ```
   sort 2^12   time:   [799.70 us 806.41 us 814.15 us]
   sort 2^12 limit 2^12 time:   [1.2848 ms 1.3012 ms 1.3229 ms]
   
   sort nulls 2^12 time:   [647.20 us 649.27 us 651.61 us]
   sort nulls 2^12 limit 2^12   time:   [780.17 us 788.48 us 798.04 us]
   ```
   
   We can add a helper function to reduce the repeated patterns, or make `partial_sort` fall back to the default sort when `limit` equals `len`.
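The trade-off described here, a heap-based partial sort versus a full merge sort when only `limit` rows are needed, can be illustrated with Python's `heapq` (a sketch of the idea, not the Arrow kernel):

```python
# A heap-based selection touches only the top `limit` elements
# (O(n log limit)), while a full stable sort orders everything
# (O(n log n)); both agree on the first `limit` results.
import heapq
import random

random.seed(42)
data = [random.randint(0, 1 << 20) for _ in range(4096)]  # 2^12 values
limit = 16

partial = heapq.nsmallest(limit, data)  # heap-based partial sort
full = sorted(data)                     # stable merge-style sort (Timsort)

assert partial == full[:limit]
```

The benchmark numbers quoted above match this picture: when `limit` equals the input length the heap approach loses its advantage, which is why falling back to the default sort in that case makes sense.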









[GitHub] [arrow] github-actions[bot] commented on pull request #9626: ARROW-11855: Memory leak in to_pandas when converting chunked struct array

2021-03-03 Thread GitBox


github-actions[bot] commented on pull request #9626:
URL: https://github.com/apache/arrow/pull/9626#issuecomment-790226986


   https://issues.apache.org/jira/browse/ARROW-11855







[GitHub] [arrow] westonpace opened a new pull request #9626: ARROW-11855: Memory leak in to_pandas when converting chunked struct array

2021-03-03 Thread GitBox


westonpace opened a new pull request #9626:
URL: https://github.com/apache/arrow/pull/9626


   When converting a struct with chunks to python the ownership of the arrow 
arrays was not being properly tracked and the deletion of the resulting pandas 
dataframe would leave some buffers behind.  I believe it was something like 
this (pseudo-code)...
   
   ```
   array_tracker** refs[num_fields];
   for chunk in chunks:
 *refs[chunk.index] = convert(...)
 for row in rows:
   row.owns.append(refs[chunk.index])
   ```
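The ownership problem can be illustrated by analogy with NumPy views (an analogy only, not the pyarrow internals): every converted object must hold a reference to whatever owns its buffer, which is roughly what the per-chunk conversion described above was failing to arrange:

```python
# A view must keep its owning object alive; NumPy does this via .base.
# If only a raw pointer were stored, deleting `owner` could free the
# buffer while `view` still points into it.
import numpy as np

owner = np.arange(8)
view = owner[2:6]          # shares owner's buffer
assert view.base is owner  # the view references its owner
del owner                  # buffer survives: view still holds a reference
assert int(view.sum()) == 2 + 3 + 4 + 5
```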







[GitHub] [arrow] seddonm1 commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex

2021-03-03 Thread GitBox


seddonm1 commented on pull request #9625:
URL: https://github.com/apache/arrow/pull/9625#issuecomment-790222729


   Thanks @andygrove . I have a few more PRs to do to finish this first phase 
of work. Then I think it's time to tackle type coercion.







[GitHub] [arrow] nealrichardson closed pull request #9561: ARROW-11649: [R] Add support for null_fallback to R

2021-03-03 Thread GitBox


nealrichardson closed pull request #9561:
URL: https://github.com/apache/arrow/pull/9561


   







[GitHub] [arrow] nealrichardson commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string

2021-03-03 Thread GitBox


nealrichardson commented on a change in pull request #9606:
URL: https://github.com/apache/arrow/pull/9606#discussion_r586922616



##
File path: r/tests/testthat/test-Array.R
##
@@ -723,6 +723,17 @@ test_that("[ accepts Arrays and otherwise handles bad input", {
   )
 })
 
+test_that("[ %in% looks up string key in dictionary", {

Review comment:
   ```suggestion
   test_that("%in% works on dictionary arrays", {
   ```

##
File path: r/tests/testthat/test-Array.R
##
@@ -723,6 +723,17 @@ test_that("[ accepts Arrays and otherwise handles bad input", {
   )
 })
 
+test_that("[ %in% looks up string key in dictionary", {
+  a1 <- Array$create(as.factor(c("A", "B", "C")))
+  a2 <- DictionaryArray$create(c(0L, 1L, 2L), c(4.5, 3.2, 1.1))
+  b1 <- Array$create(c(FALSE, TRUE, FALSE))
+  b2 <- Array$create(c(FALSE, FALSE, FALSE))
+  expect_equal(b1, arrow:::call_function("is_in_meta_binary", a1, Array$create("B")))
+  expect_equal(b2, arrow:::call_function("is_in_meta_binary", a1, Array$create("D")))
+  expect_equal(b1, arrow:::call_function("is_in_meta_binary", a2, Array$create(3.2)))
+  expect_error(arrow:::call_function("is_in_meta_binary", a2, Array$create("B")))

Review comment:
   I would write these as e.g. `expect_equal(a1 %in% "B", b1)` etc.









[GitHub] [arrow] kou closed pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask

2021-03-03 Thread GitBox


kou closed pull request #9616:
URL: https://github.com/apache/arrow/pull/9616


   







[GitHub] [arrow] kou commented on pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask

2021-03-03 Thread GitBox


kou commented on pull request #9616:
URL: https://github.com/apache/arrow/pull/9616#issuecomment-790161292


   CI is green. I'll merge this.
   
   Thanks for updating GLib and Ruby parts too!
   







[GitHub] [arrow] rok commented on pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string

2021-03-03 Thread GitBox


rok commented on pull request #9606:
URL: https://github.com/apache/arrow/pull/9606#issuecomment-790159420


   > Thanks for doing this. Can you add tests on the C++ side? I expect tests 
for `IsIn` and `IndexIn`, with non-trivial dictionary arrays, also in cases 
where not all dictionary values are referred by indices.
   
   Added for string and float32 types. Should we check more?







[GitHub] [arrow] rok commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string

2021-03-03 Thread GitBox


rok commented on a change in pull request #9606:
URL: https://github.com/apache/arrow/pull/9606#discussion_r586896681



##
File path: r/tests/testthat/test-Array.R
##
@@ -727,6 +727,15 @@ test_that("[ accepts Arrays and otherwise handles bad 
input", {
   )
 })
 
+test_that("[ %in% looks up string key in dictionary", {
+  a1 <- Array$create(as.factor(c("A", "B", "C")))
+  a2 <- DictionaryArray$create(c(0L, 1L, 2L), c(4.5, 3.2, 1.1))
+  arrow:::call_function("is_in_meta_binary", a1, Array$create("B"))
+  arrow:::call_function("is_in_meta_binary", a1, Array$create("D"))
+  arrow:::call_function("is_in_meta_binary", a2, Array$create(3.2))

Review comment:
   I was looking to test for errors but the added check is better. Adding.









[GitHub] [arrow] seddonm1 commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex

2021-03-03 Thread GitBox


seddonm1 commented on pull request #9625:
URL: https://github.com/apache/arrow/pull/9625#issuecomment-790153156


   @andygrove no problem. I will create a macro for that. 







[GitHub] [arrow] alamb commented on pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-03 Thread GitBox


alamb commented on pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#issuecomment-790144154


   Thanks for this @ericwburden  -- I hope to find time to review this tomorrow







[GitHub] [arrow] alamb commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex

2021-03-03 Thread GitBox


alamb commented on pull request #9625:
URL: https://github.com/apache/arrow/pull/9625#issuecomment-790143072


   Thanks @seddonm1  -- I plan to review this tomorrow







[GitHub] [arrow] nealrichardson commented on a change in pull request #9579: ARROW-11774: [R] macos one line install

2021-03-03 Thread GitBox


nealrichardson commented on a change in pull request #9579:
URL: https://github.com/apache/arrow/pull/9579#discussion_r586867436



##
File path: r/tools/nixlibs.R
##
@@ -396,6 +401,7 @@ cmake_version <- function(cmd = "cmake") {
 with_s3_support <- function(env_vars) {
   arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false"
   if (arrow_s3) {
+# TODO: add in brew versions of these?

Review comment:
   Please do









[GitHub] [arrow] alamb commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-03 Thread GitBox


alamb commented on pull request #9595:
URL: https://github.com/apache/arrow/pull/9595#issuecomment-790142862


   @Dandandan oh no! It now seems to have failed `rustfmt` linting







[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586859667



##
File path: r/R/dataset-partition.R
##
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment:
   It does but I think CRAN doesn't do `--run-donttest` everywhere. Anyway 
we won't be turning off these features on CRAN except for Solaris so I think 
it's fine, or worth the risk so as not to pollute the examples with potentially 
confusing (to users) wrapping.









[GitHub] [arrow] jonkeane commented on pull request #9561: ARROW-11649: [R] Add support for null_fallback to R

2021-03-03 Thread GitBox


jonkeane commented on pull request #9561:
URL: https://github.com/apache/arrow/pull/9561#issuecomment-790126766


   Ok, I've accepted, re-documented, and pushed. This is good to go once CI 
passes







[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586830050



##
File path: r/R/dataset-partition.R
##
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment:
   I initially did that, but apparently as of R 4.0.0:
   >`R CMD check --as-cran` now runs `\donttest` examples
   
   (from https://cran.r-project.org/doc/manuals/r-devel/NEWS.html)









[GitHub] [arrow] lidavidm commented on pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask

2021-03-03 Thread GitBox


lidavidm commented on pull request #9616:
URL: https://github.com/apache/arrow/pull/9616#issuecomment-790099377


   Thank you for the review and for patching up my very poor bindings 
:sweat_smile: 







[GitHub] [arrow] github-actions[bot] commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex

2021-03-03 Thread GitBox


github-actions[bot] commented on pull request #9625:
URL: https://github.com/apache/arrow/pull/9625#issuecomment-790096651


   https://issues.apache.org/jira/browse/ARROW-11653







[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586819763



##
File path: r/tools/autobrew
##
@@ -48,7 +48,8 @@ fi
 # Hardcode this for my custom autobrew build
 rm -f $BREWDIR/lib/*.dylib
 AWS_LIBS="-laws-cpp-sdk-config -laws-cpp-sdk-transfer -laws-cpp-sdk-identity-management -laws-cpp-sdk-cognito-identity -laws-cpp-sdk-sts -laws-cpp-sdk-s3 -laws-cpp-sdk-core -laws-c-event-stream -laws-checksums -laws-c-common -lpthread -lcurl"
-PKG_LIBS="-L$BREWDIR/lib -lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_LIBS="-lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_DIRS="-L$BREWDIR/lib"

Review comment:
   We will need to remember to upstream this change when we release to CRAN









[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586818947



##
File path: r/configure
##
@@ -26,13 +26,13 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'
 
 # Library settings
-PKG_CONFIG_NAME="arrow parquet arrow-dataset"
+PKG_CONFIG_NAME="arrow"
 PKG_DEB_NAME="(unsuppored)"
 PKG_RPM_NAME="(unsuppored)"
 PKG_BREW_NAME="apache-arrow"
 PKG_TEST_HEADER=""
-# These must be the same order as $(pkg-config --libs arrow-dataset)
-PKG_LIBS="-larrow_dataset -lparquet -larrow"
+PKG_LIBS="-larrow"
+BUNDLED_LIBS=""

Review comment:
   Likewise I don't think you need to set this here, "" is what it will be 
anyway if it's not set.









[GitHub] [arrow] seddonm1 opened a new pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex

2021-03-03 Thread GitBox


seddonm1 opened a new pull request #9625:
URL: https://github.com/apache/arrow/pull/9625


   @alamb This is the second-to-last of the current string functions, but I think 
there may be one more after that with new code.
   
   This implements some of the miscellaneous string functions `ascii`, `chr`, 
`initcap`, `repeat`, `reverse`, `to_hex`. The next PR will have more useful 
functions (including regex).
   
   A little bit of tidying for consistency to the other functions was applied.
   
   This PR is a child of #9243 







[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586818018



##
File path: r/R/dataset-partition.R
##
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment:
   I don't feel strongly about not requiring dev roxygen, but it's fine 
either way.
   
   If you don't want to use examplesIf here, I think you can leave off the `if (arrow_with_dataset())` checks everywhere since it's already inside a `\donttest`
   









[GitHub] [arrow] kou commented on a change in pull request #9622: ARROW-11836: [C++] Aovid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.

2021-03-03 Thread GitBox


kou commented on a change in pull request #9622:
URL: https://github.com/apache/arrow/pull/9622#discussion_r586790211



##
File path: cpp/src/arrow/ArrowConfig.cmake.in
##
@@ -71,19 +71,21 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static))
 get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION)
 get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY)
 
-add_library(arrow_bundled_dependencies STATIC IMPORTED)
-set_target_properties(
-  arrow_bundled_dependencies
-  PROPERTIES
-IMPORTED_LOCATION
-"${arrow_lib_dir}/${CMAKE_STATIC_LIBRARY_PREFIX}arrow_bundled_dependencies${CMAKE_STATIC_LIBRARY_SUFFIX}"
-  )
+if(@ARROW_BUNDLED_STATIC_LIBS@)

Review comment:
   This doesn't work when we have two or more bundled static libraries. 
e.g.:
   
   ```cmake
   if(jemalloc::jemalloc;mimalloc::mimalloc)
 # ...
   endif()
   ```
   
   doesn't work:
   
   ```text
   CMake Error at /tmp/local/lib/cmake/arrow/ArrowConfig.cmake:99 (if):
 if given arguments:
   
   "jemalloc::jemalloc" "mimalloc::mimalloc"
   
 Unknown arguments specified
   Call Stack (most recent call first):
 /tmp/local/lib/cmake/arrow/FindArrow.cmake:213 (find_package)
  /tmp/local/lib/cmake/arrow/FindArrow.cmake:320 (arrow_find_package_cmake_package_configuration)
 /tmp/local/lib/cmake/arrow/FindArrow.cmake:357 (arrow_find_package)
 CMakeLists.txt:24 (find_package)
   ```
   
   How about using a variable?
   
   ```diff
    diff --git a/cpp/src/arrow/ArrowConfig.cmake.in b/cpp/src/arrow/ArrowConfig.cmake.in
   index a5a3619eb..6209baeec 100644
   --- a/cpp/src/arrow/ArrowConfig.cmake.in
   +++ b/cpp/src/arrow/ArrowConfig.cmake.in
   @@ -37,6 +37,7 @@ set(ARROW_FULL_SO_VERSION "@ARROW_FULL_SO_VERSION@")
set(ARROW_LIBRARY_PATH_SUFFIXES "@ARROW_LIBRARY_PATH_SUFFIXES@")
set(ARROW_INCLUDE_PATH_SUFFIXES "@ARROW_INCLUDE_PATH_SUFFIXES@")
set(ARROW_SYSTEM_DEPENDENCIES "@ARROW_SYSTEM_DEPENDENCIES@")
   +set(ARROW_BUNDLED_STATIC_LIBS "@ARROW_BUNDLED_STATIC_LIBS@")

include("${CMAKE_CURRENT_LIST_DIR}/ArrowOptions.cmake")

   @@ -71,7 +72,7 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static))
get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION)
get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY)

   -if(@ARROW_BUNDLED_STATIC_LIBS@)
   +if(ARROW_BUNDLED_STATIC_LIBS)
  add_library(arrow_bundled_dependencies STATIC IMPORTED)
  set_target_properties(
arrow_bundled_dependencies
   ```









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586785240



##
File path: r/configure
##
@@ -145,7 +150,7 @@ else
 # TODO: what about non-bundled deps?
 BUNDLED_LIBS=`cd $LIB_DIR && ls *.a`
    BUNDLED_LIBS=`echo $BUNDLED_LIBS | sed -E "s/lib(.*)\.a/-l\1/" | sed -e "s/\\.a lib/ -l/g"`
-PKG_LIBS="-L$(pwd)/$LIB_DIR $PKG_LIBS $BUNDLED_LIBS"
+PKG_DIRS="-L$(pwd)/$LIB_DIR $PKG_DIRS"

Review comment:
   Oh, gotcha. I just went through and checked the control flow more 
carefully, and I agree—`PKG_DIRS` will never be set here. I removed it in 
31bd5e9c12023b793a7ef30d6d14f2290698ffc1









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586783964



##
File path: r/configure
##
@@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then
   # Always build with arrow on macOS
   PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW"
   # Check for features
-  LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'`
-  grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 2>&1
+  LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'`
+  ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake"
+  # Check for Arrow Dataset subcomponent

Review comment:
   Fixed in b7cdbe26ef6ab9496b2a1f810e84469db9620fe1









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586783690



##
File path: r/configure
##
@@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then
   # Always build with arrow on macOS
   PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW"
   # Check for features
-  LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'`
-  grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 2>&1
+  LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'`
+  ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake"
+  # Check for Arrow Dataset subcomponent
+  grep 'set(ARROW_DATASET "ON")' $ARROW_OPTS_CMAKE >/dev/null 2>&1
+  if [ $? -eq 0 ]; then
+PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_DATASET"
+PKG_LIBS="-larrow_dataset $PKG_LIBS"
+# TODO: what if arrow-dataset has a different -L location than arrow?

Review comment:
   Done in 3dc2061abc13949249e689c64cbdab810948ec65
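
   The feature probe quoted above (grep ArrowOptions.cmake for a `set(... "ON")` 
line, then append a compile flag on a hit) can be exercised in isolation. A 
minimal sketch, using a temporary file in place of the real `$LIB_DIR` path 
(an assumption for illustration):

   ```shell
   # Stand-in for $LIB_DIR/cmake/arrow/ArrowOptions.cmake
   ARROW_OPTS_CMAKE=$(mktemp)
   echo 'set(ARROW_DATASET "ON")' > "$ARROW_OPTS_CMAKE"

   PKG_CFLAGS=""
   # Same probe as in r/configure: grep for the option, check the exit status
   grep 'set(ARROW_DATASET "ON")' "$ARROW_OPTS_CMAKE" >/dev/null 2>&1
   if [ $? -eq 0 ]; then
     PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_DATASET"
   fi
   rm -f "$ARROW_OPTS_CMAKE"
   echo "$PKG_CFLAGS"
   ```

   The same pattern extends to each optional component: one grep per option, 
one flag appended per hit.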









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586783164



##
File path: r/configure
##
@@ -26,13 +26,14 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'
 
 # Library settings
-PKG_CONFIG_NAME="arrow parquet arrow-dataset"
+PKG_CONFIG_NAME="arrow"
 PKG_DEB_NAME="(unsuppored)"
 PKG_RPM_NAME="(unsuppored)"
 PKG_BREW_NAME="apache-arrow"
 PKG_TEST_HEADER=""
-# These must be the same order as $(pkg-config --libs arrow-dataset)
-PKG_LIBS="-larrow_dataset -lparquet -larrow"
+PKG_LIBS="-larrow"
+PKG_DIRS=""

Review comment:
   I confirmed this. Removed in edbfb7e3732cf63e63ba2a7380771a2c56e722f2









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586763543



##
File path: r/configure
##
@@ -26,13 +26,14 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'
 
 # Library settings
-PKG_CONFIG_NAME="arrow parquet arrow-dataset"
+PKG_CONFIG_NAME="arrow"
 PKG_DEB_NAME="(unsuppored)"
 PKG_RPM_NAME="(unsuppored)"
 PKG_BREW_NAME="apache-arrow"
 PKG_TEST_HEADER=""
-# These must be the same order as $(pkg-config --libs arrow-dataset)
-PKG_LIBS="-larrow_dataset -lparquet -larrow"
+PKG_LIBS="-larrow"

Review comment:
   Correction: the order was wrong, but this was corrected in 
b7cdbe26ef6ab9496b2a1f810e84469db9620fe1









[GitHub] [arrow] lidavidm commented on pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask

2021-03-03 Thread GitBox


lidavidm commented on pull request #9616:
URL: https://github.com/apache/arrow/pull/9616#issuecomment-790032276


   Looks like things finally pass after fixing the GLib and Ruby bindings. I 
opted not to bind Fragment::Scan since the API there is being reworked at the 
moment anyways.







[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586750914



##
File path: r/configure
##
@@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then
   # Always build with arrow on macOS
   PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW"
   # Check for features
-  LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'`
-  grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 2>&1
+  LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'`
+  ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake"
+  # Check for Arrow Dataset subcomponent

Review comment:
   D'oh, I switched from appending to prepending and forgot to change the 
order. Thank you for catching that.









[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586743366



##
File path: r/configure
##
@@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then
   # Always build with arrow on macOS
   PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW"
   # Check for features
-  LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'`
-  grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 2>&1
+  LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'`
+  ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake"
+  # Check for Arrow Dataset subcomponent

Review comment:
   Sorry, my note was ambiguous. The order should be `-larrow_dataset 
-lparquet`, correct. But if we're prepending them one at a time, we should 
first prepend parquet, then arrow_dataset. If I'm following correctly, the code 
here first does if ARROW_DATASET, do `PKG_LIBS="-larrow_dataset $PKG_LIBS"`, 
then if ARROW_PARQUET, do `PKG_LIBS="-lparquet $PKG_LIBS"`, which would result 
in `-lparquet -larrow_dataset`
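
   The prepend order described above can be sketched directly (variable values 
are illustrative, not the exact configure contents): prepending `-lparquet` 
first and `-larrow_dataset` second leaves Dataset in front of Parquet on the 
final link line, which is the required order.

   ```shell
   # Start from the base Arrow link flag, then prepend one component at a time.
   PKG_LIBS="-larrow"
   PKG_LIBS="-lparquet $PKG_LIBS"        # prepend Parquet first
   PKG_LIBS="-larrow_dataset $PKG_LIBS"  # then prepend Dataset on top
   echo "$PKG_LIBS"   # -larrow_dataset -lparquet -larrow
   ```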









[GitHub] [arrow] codecov-io commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-03 Thread GitBox


codecov-io commented on pull request #9595:
URL: https://github.com/apache/arrow/pull/9595#issuecomment-790021228


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9595?src=pr=h1) Report
   > Merging 
[#9595](https://codecov.io/gh/apache/arrow/pull/9595?src=pr=desc) (2869040) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/0f647261058892289b1722e5883f8610cc5dd3d9?el=desc)
 (0f64726) will **increase** coverage by `0.04%`.
   > The diff coverage is `78.72%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9595/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9595?src=pr=tree)
   
   ```diff
    @@            Coverage Diff             @@
    ##           master    #9595      +/-   ##
    ==========================================
    + Coverage   82.33%   82.38%   +0.04%     
    ==========================================
      Files         245      245              
      Lines       56407    57134     +727     
    ==========================================
    + Hits        46443    47068     +625     
    - Misses       9964    10066     +102     
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9595?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==)
 | `84.16% <77.77%> (+0.64%)` | :arrow_up: |
   | 
[rust/datafusion/src/physical\_plan/repartition.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3JlcGFydGl0aW9uLnJz)
 | `81.21% <100.00%> (-0.14%)` | :arrow_down: |
   | 
[...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==)
 | `73.38% <0.00%> (-3.62%)` | :arrow_down: |
   | 
[rust/arrow/src/array/equal/utils.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdXRpbHMucnM=)
 | `75.49% <0.00%> (-0.99%)` | :arrow_down: |
   | 
[rust/arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzL2ZpZWxkLnJz)
 | `55.47% <0.00%> (-0.66%)` | :arrow_down: |
   | 
[rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=)
 | `87.83% <0.00%> (-0.22%)` | :arrow_down: |
   | 
[rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==)
 | `38.33% <0.00%> (ø)` | |
   | 
[...datafusion/src/physical\_plan/crypto\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NyeXB0b19leHByZXNzaW9ucy5ycw==)
 | `52.45% <0.00%> (ø)` | |
   | 
[...integration-testing/src/flight\_server\_scenarios.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9mbGlnaHRfc2VydmVyX3NjZW5hcmlvcy5ycw==)
 | `0.00% <0.00%> (ø)` | |
   | 
[...-testing/src/flight\_server\_scenarios/middleware.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9mbGlnaHRfc2VydmVyX3NjZW5hcmlvcy9taWRkbGV3YXJlLnJz)
 | `0.00% <0.00%> (ø)` | |
   | ... and [10 
more](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9595?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9595?src=pr=footer). Last 
update 
[0f64726...2869040](https://codecov.io/gh/apache/arrow/pull/9595?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586741670



##
File path: r/configure
##
@@ -145,7 +150,7 @@ else
 # TODO: what about non-bundled deps?
 BUNDLED_LIBS=`cd $LIB_DIR && ls *.a`
    BUNDLED_LIBS=`echo $BUNDLED_LIBS | sed -E "s/lib(.*)\.a/-l\1/" | sed -e "s/\\.a lib/ -l/g"`
-PKG_LIBS="-L$(pwd)/$LIB_DIR $PKG_LIBS $BUNDLED_LIBS"
+PKG_DIRS="-L$(pwd)/$LIB_DIR $PKG_DIRS"

Review comment:
   Yeah, I followed that. I just meant that I don't think it's possible for 
`PKG_DIRS` to be set to anything other than "" before here, so you don't need 
to append it, do you? Just set `PKG_DIRS="-L$(pwd)/$LIB_DIR"`?









[GitHub] [arrow] github-actions[bot] commented on pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-03 Thread GitBox


github-actions[bot] commented on pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#issuecomment-790018580


   Revision: 848e020cfaf234411d3a0167b58dc39c030823a4
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-174](https://github.com/ursacomputing/crossbow/branches/all?query=actions-174)
   
   |Task|Status|
   ||--|
   |test-conda-python-3.6|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.6)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.6)|
   |test-conda-python-3.6-pandas-0.23|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.6-pandas-0.23)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.6-pandas-0.23)|
   |test-conda-python-3.7|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7)|
   |test-conda-python-3.7-dask-latest|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-dask-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-dask-latest)|
   |test-conda-python-3.7-hdfs-3.2|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-hdfs-3.2)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-hdfs-3.2)|
   |test-conda-python-3.7-kartothek-latest|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-kartothek-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-kartothek-latest)|
   |test-conda-python-3.7-kartothek-master|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-kartothek-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-kartothek-master)|
   |test-conda-python-3.7-pandas-latest|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-pandas-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-pandas-latest)|
   |test-conda-python-3.7-pandas-master|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-pandas-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-pandas-master)|
   |test-conda-python-3.7-spark-branch-3.0|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-spark-branch-3.0)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-spark-branch-3.0)|
   |test-conda-python-3.7-turbodbc-latest|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-turbodbc-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-turbodbc-latest)|
   |test-conda-python-3.7-turbodbc-master|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-turbodbc-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-turbodbc-master)|
   |test-conda-python-3.8|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.8)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.8)|
   |test-conda-python-3.8-dask-master|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.8-dask-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.8-dask-master)|
   |test-conda-python-3.8-hypothesis|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.8-hypothesis)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.8-hypothesis)|
   

[GitHub] [arrow] ovr commented on a change in pull request #9232: ARROW-10818: [Rust] Implement DecimalType

2021-03-03 Thread GitBox


ovr commented on a change in pull request #9232:
URL: https://github.com/apache/arrow/pull/9232#discussion_r586722375



##
File path: rust/datafusion/src/physical_plan/group_scalar.rs
##
@@ -22,10 +22,12 @@ use std::convert::{From, TryFrom};
 
 use crate::error::{DataFusionError, Result};
 use crate::scalar::ScalarValue;
+use arrow::datatypes::Decimal128Type;
 
 /// Enumeration of types that can be used in a GROUP BY expression
 #[derive(Debug, PartialEq, Eq, Hash, Clone)]
 pub(crate) enum GroupByScalar {
+Decimal128(i128, usize, usize),

Review comment:
   Change it to 
   
   ```
   // Box is needed to reduce memory usage
   Decimal128(Box),
   ```
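A quick way to see why boxing helps here: every value of an enum occupies the size of its largest variant. The sketch below uses hypothetical `InlineScalar`/`BoxedScalar` enums (stand-ins, not the actual `GroupByScalar`) to compare the two layouts:

```rust
use std::mem::size_of;

// Payload stored inline: the wide i128 + two usize fields inflate
// every variant of the enum, even tiny ones like UInt8.
#[allow(dead_code)]
enum InlineScalar {
    Decimal128(i128, usize, usize),
    UInt8(u8),
}

// Payload behind a Box: the variant only stores one pointer, so the
// whole enum shrinks to pointer-plus-tag size.
#[allow(dead_code)]
enum BoxedScalar {
    Decimal128(Box<(i128, usize, usize)>),
    UInt8(u8),
}

fn main() {
    println!("inline: {} bytes", size_of::<InlineScalar>());
    println!("boxed:  {} bytes", size_of::<BoxedScalar>());
    assert!(size_of::<BoxedScalar>() < size_of::<InlineScalar>());
}
```

The trade-off is an extra heap allocation and pointer chase when the boxed variant is actually used.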









[GitHub] [arrow] pitrou commented on pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-03 Thread GitBox


pitrou commented on pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#issuecomment-789996441


   @github-actions crossbow submit -g python







[GitHub] [arrow] pitrou commented on pull request #9528: ARROW-8732: [C++] Add basic cancellation API

2021-03-03 Thread GitBox


pitrou commented on pull request #9528:
URL: https://github.com/apache/arrow/pull/9528#issuecomment-789995923


   @ursabot crossbow submit -g python







[GitHub] [arrow] alamb commented on a change in pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-03 Thread GitBox


alamb commented on a change in pull request #9595:
URL: https://github.com/apache/arrow/pull/9595#discussion_r586715897



##
File path: rust/datafusion/src/physical_plan/hash_join.rs
##
@@ -311,6 +319,7 @@ fn update_hash(
 hash:  JoinHashMap,
 offset: usize,
 random_state: ,
+hashes_buffer:  Vec,

Review comment:
   :thinking: definitely not for this PR









[GitHub] [arrow] pitrou commented on pull request #7179: ARROW-8732: [C++] Add basic cancellation API

2021-03-03 Thread GitBox


pitrou commented on pull request #7179:
URL: https://github.com/apache/arrow/pull/7179#issuecomment-789990634


   Closed, superseded by #9528.







[GitHub] [arrow] pitrou closed pull request #7179: ARROW-8732: [C++] Add basic cancellation API

2021-03-03 Thread GitBox


pitrou closed pull request #7179:
URL: https://github.com/apache/arrow/pull/7179


   







[GitHub] [arrow] Dandandan commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-03 Thread GitBox


Dandandan commented on pull request #9595:
URL: https://github.com/apache/arrow/pull/9595#issuecomment-789968391


   Thanks @alamb, resolved the inconsistent naming.







[GitHub] [arrow] Dandandan commented on a change in pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-03 Thread GitBox


Dandandan commented on a change in pull request #9595:
URL: https://github.com/apache/arrow/pull/9595#discussion_r586686916



##
File path: rust/datafusion/src/physical_plan/hash_join.rs
##
@@ -311,6 +319,7 @@ fn update_hash(
 hash:  JoinHashMap,
 offset: usize,
 random_state: ,
+hashes_buffer:  Vec,

Review comment:
   Indeed, this change is for reusing the allocated `Vec`.
   
   Yes, makes sense to group them in a struct. There are some opportunities in 
other functions `build_join_indexes`  `build_batch`, etc. for this as well. Not 
sure if it makes sense they all receive the same struct, or maybe all of them a 
subset of the most commonly needed parts :thinking: 
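The buffer-reuse pattern being discussed can be sketched as follows; `hash_values` and the multiplier constant are illustrative stand-ins, not DataFusion's actual hashing code:

```rust
// The caller owns one Vec and passes it to every call, so the
// allocation is amortized across batches instead of repeated.
fn hash_values(values: &[u64], hashes_buffer: &mut Vec<u64>) {
    hashes_buffer.clear(); // drops old contents, keeps capacity
    hashes_buffer.extend(
        values
            .iter()
            .map(|v| v.wrapping_mul(0x9E37_79B9_7F4A_7C15)), // placeholder hash
    );
}

fn main() {
    let mut hashes_buffer = Vec::with_capacity(8);
    let ptr_before = hashes_buffer.as_ptr();
    for batch in &[[1u64, 2, 3], [4, 5, 6]] {
        hash_values(batch, &mut hashes_buffer);
        assert_eq!(hashes_buffer.len(), 3);
    }
    // Capacity was never exceeded, so the backing allocation is unchanged.
    assert_eq!(hashes_buffer.as_ptr(), ptr_before);
    println!("hashes: {:?}", hashes_buffer);
}
```

Grouping several such reusable buffers into one struct, as suggested above, keeps the function signatures from growing with each new buffer.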









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586648630



##
File path: r/configure
##
@@ -69,19 +70,22 @@ fi
 if [ "$INCLUDE_DIR" ] || [ "$LIB_DIR" ]; then
   echo "*** Using INCLUDE_DIR/LIB_DIR"
   PKG_CFLAGS="-I$INCLUDE_DIR $PKG_CFLAGS"
-  PKG_LIBS="-L$LIB_DIR $PKG_LIBS"
+  PKG_DIRS="-L$LIB_DIR $PKG_DIRS"
 else
   # Use pkg-config if available and allowed
   pkg-config --version >/dev/null 2>&1
   if [ "$ARROW_USE_PKG_CONFIG" != "false" ] && [ $? -eq 0 ]; then
 PKGCONFIG_CFLAGS=`pkg-config --cflags --silence-errors ${PKG_CONFIG_NAME}`
-PKGCONFIG_LIBS=`pkg-config --libs --silence-errors ${PKG_CONFIG_NAME}`
+PKGCONFIG_LIBS=`pkg-config --libs-only-l --silence-errors 
${PKG_CONFIG_NAME}`
+PKGCONFIG_DIRS=`pkg-config --libs-only-L --silence-errors 
${PKG_CONFIG_NAME}`
+# TODO: what about --libs-only-other?

Review comment:
   Cool, thanks for pointing that out.









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586643234



##
File path: r/configure
##
@@ -145,7 +150,7 @@ else
 # TODO: what about non-bundled deps?
 BUNDLED_LIBS=`cd $LIB_DIR && ls *.a`
 BUNDLED_LIBS=`echo $BUNDLED_LIBS | sed -E "s/lib(.*)\.a/-l\1/" | sed 
-e "s/\\.a lib/ -l/g"`
-PKG_LIBS="-L$(pwd)/$LIB_DIR $PKG_LIBS $BUNDLED_LIBS"
+PKG_DIRS="-L$(pwd)/$LIB_DIR $PKG_DIRS"

Review comment:
   `PKG_DIRS` contains only the `-L` flags, which are directories. 
`PKG_LIBS` contains only the `-l` flags, which are library names. Previously 
these were combined in `PKG_LIBS`; I separated them because it makes the 
appending/prepending much more straightforward.
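As a hypothetical illustration of the split (the path and the combined flags string are made up; the real script gets the two parts from `pkg-config --libs-only-L` and `--libs-only-l`):

```shell
# A combined linker line, as the old PKG_LIBS held it, split into the
# -L directory flags and the -l library flags.
ALL_FLAGS="-L/opt/arrow/lib -larrow_dataset -lparquet -larrow"
PKG_DIRS=$(echo "$ALL_FLAGS" | tr ' ' '\n' | grep '^-L' | tr '\n' ' ')
PKG_LIBS=$(echo "$ALL_FLAGS" | tr ' ' '\n' | grep '^-l' | tr '\n' ' ')
echo "PKG_DIRS: $PKG_DIRS"
echo "PKG_LIBS: $PKG_LIBS"
```

With the two kinds of flags in separate variables, prepending a new `-L` directory or appending a bundled `-l` library never disturbs the other list.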









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586639762



##
File path: r/configure
##
@@ -145,7 +150,7 @@ else
 # TODO: what about non-bundled deps?
 BUNDLED_LIBS=`cd $LIB_DIR && ls *.a`
 BUNDLED_LIBS=`echo $BUNDLED_LIBS | sed -E "s/lib(.*)\.a/-l\1/" | sed 
-e "s/\\.a lib/ -l/g"`
-PKG_LIBS="-L$(pwd)/$LIB_DIR $PKG_LIBS $BUNDLED_LIBS"
+PKG_DIRS="-L$(pwd)/$LIB_DIR $PKG_DIRS"

Review comment:
   `$(command)` is apparently not a bashism: 
https://stackoverflow.com/questions/3789894/is-command-substitution-foo-bashism









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586615537



##
File path: r/R/dataset-partition.R
##
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment:
   Right, roxygen2 does not run on CRAN, but I think we should avoid 
requiring R package contributors to use dev versions of devtools, roxygen2, 
etc.  









[GitHub] [arrow] nealrichardson commented on pull request #9598: ARROW-11804: [Developer] Offer to create JIRA issue

2021-03-03 Thread GitBox


nealrichardson commented on pull request #9598:
URL: https://github.com/apache/arrow/pull/9598#issuecomment-789884959


   > In general, I felt the approach to adding an optional, manually 
acknowledged step
   
   I agree. What I'm envisioning is that the existing bot that says "Thanks for 
opening a pull request! Please name your PR with ARROW-: Issue Title, or go 
to JIRA and make an issue" would instead say "Thanks for opening a pull 
request! Please name your PR with ARROW-: Issue Title, or comment `please 
make a JIRA` on the PR to have me make one for you". And a similar message for 
new GitHub Issues. That way, the contributor doesn't have to leave GitHub if 
they don't want, but we get the JIRA issue up front too. 







[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


nealrichardson commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586587025



##
File path: r/R/dataset-partition.R
##
@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment:
   We don't need roxygen2 to be on CRAN in order to use it, the Rd it 
generates is valid.









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586568597



##
File path: r/configure
##
@@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then
   # Always build with arrow on macOS
   PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW"
   # Check for features
-  LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'`
-  grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 
2>&1
+  LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'`
+  ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake"
+  # Check for Arrow Dataset subcomponent

Review comment:
   I thought so too at first, but when I run
   ```
   pkg-config --libs-only-l arrow-dataset
   ```
   I get `-larrow_dataset` first and `-lparquet` second.









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586564929



##
File path: r/configure
##
@@ -26,13 +26,14 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'
 
 # Library settings
-PKG_CONFIG_NAME="arrow parquet arrow-dataset"
+PKG_CONFIG_NAME="arrow"
 PKG_DEB_NAME="(unsuppored)"
 PKG_RPM_NAME="(unsuppored)"
 PKG_BREW_NAME="apache-arrow"
 PKG_TEST_HEADER=""
-# These must be the same order as $(pkg-config --libs arrow-dataset)
-PKG_LIBS="-larrow_dataset -lparquet -larrow"
+PKG_LIBS="-larrow"

Review comment:
   Yep, the order in which the code below prepends to `PKG_LIBS` makes its 
value match the output of
   ```
   pkg-config --libs-only-l arrow-dataset
   ```
(at least on my Mac) which is:
   ```
   -larrow_dataset -lparquet -larrow
   ```









[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components

2021-03-03 Thread GitBox


ianmcook commented on a change in pull request #9610:
URL: https://github.com/apache/arrow/pull/9610#discussion_r586560582



##
File path: r/data-raw/codegen.R
##
@@ -194,10 +194,16 @@ arrow_exports_cpp <- glue::glue('
 
 {feature_available("arrow")}
 
+{feature_available("dataset")}

Review comment:
   Done in c8d390d









[GitHub] [arrow] andygrove closed pull request #9619: ARROW-11842: [Rust][Parquet] Use clone_from in get_batch_with_dict

2021-03-03 Thread GitBox


andygrove closed pull request #9619:
URL: https://github.com/apache/arrow/pull/9619


   







[GitHub] [arrow] romainfrancois commented on pull request #9615: ARROW-3316: [R] Multi-threaded conversion from R data.frame to Arrow table / record batch

2021-03-03 Thread GitBox


romainfrancois commented on pull request #9615:
URL: https://github.com/apache/arrow/pull/9615#issuecomment-789772368


   ```cpp
   bool CanExtendParallel(SEXP x, const std::shared_ptr& type) 
{
 // TODO: identify when it's ok to do things in parallel
 return false;
   }
   ```
   
   turning this to `true` outright makes everything fail dramatically, so 
we'll need to pay careful attention to what can and cannot be done 
concurrently. 
We might have to dial down the use of the `cpp11` package because I believe 
when we create a `cpp11` vector to shell a `SEXP` this uses a central resource 
for the protection. 
   
   It would probably be better to move the "can this be done in parallel" to a 
virtual method of `RConverter`, but we would need to move capture `converter` 
in the `task` lambda or at least figure out some way for the converter to still 
be alive when the task is run. 







[GitHub] [arrow] ericwburden commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-03 Thread GitBox


ericwburden commented on a change in pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#discussion_r586483501



##
File path: rust/arrow/src/array/array_binary.rs
##
@@ -89,7 +89,7 @@ impl 
GenericBinaryArray {
 // pointer alignment & location is ensured by RawPtrBox
 // buffer bounds/offset is ensured by the value_offset invariants
 std::slice::from_raw_parts(
-self.value_data.as_ptr().offset(start.to_isize()),
+self.value_data.as_ptr().offset(start.to_isize().unwrap()),

Review comment:
   I do think it would be good form to add the explanatory comments, 
though. I'll just hang out and see if anyone corrects my logic, then commit the 
comments.









[GitHub] [arrow] ericwburden commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-03 Thread GitBox


ericwburden commented on a change in pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#discussion_r586482079



##
File path: rust/arrow/src/array/array_binary.rs
##
@@ -89,7 +89,7 @@ impl 
GenericBinaryArray {
 // pointer alignment & location is ensured by RawPtrBox
 // buffer bounds/offset is ensured by the value_offset invariants
 std::slice::from_raw_parts(
-self.value_data.as_ptr().offset(start.to_isize()),
+self.value_data.as_ptr().offset(start.to_isize().unwrap()),

Review comment:
   Actually, now that I consider it, the implementation of `to_isize()` 
that I removed from OffsetSizeTrait already had the unwraps, just inside the 
`to_isize()` function instead of on the result of `to_isize()`. So, at the very 
least, we shouldn't be adding any risks we weren't already taking.









[GitHub] [arrow] ericwburden commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-03 Thread GitBox


ericwburden commented on a change in pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#discussion_r586473814



##
File path: rust/arrow/src/array/array_binary.rs
##
@@ -89,7 +89,7 @@ impl 
GenericBinaryArray {
 // pointer alignment & location is ensured by RawPtrBox
 // buffer bounds/offset is ensured by the value_offset invariants
 std::slice::from_raw_parts(
-self.value_data.as_ptr().offset(start.to_isize()),
+self.value_data.as_ptr().offset(start.to_isize().unwrap()),

Review comment:
   Here's my rationale:
   
   > `start` is an `OffsetSize`, which is a generic type that implements the 
   > OffsetSizeTrait. Currently, only i32 and i64 implement OffsetSizeTrait,
   > both of which should cleanly cast to isize on an architecture that supports
   > 32/64-bit offsets
   
   I *think* that is true. Can anyone think of a situation in which that would 
not hold? (For reference, I'm primarily an R programmer, so questions about 
memory address size are a bit arcane for me).
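A small check of that rationale (the `try_from` behavior is standard Rust; the note about 32-bit targets is the only assumption being tested):

```rust
// Both offset widths Arrow uses (i32 for Binary, i64 for LargeBinary)
// convert to isize infallibly on 64-bit targets, which is why
// `to_isize().unwrap()` is sound there.
fn main() {
    let small: i32 = i32::MAX;
    let large: i64 = i64::MAX;

    // TryFrom surfaces the only failure mode: an i64 offset wider than
    // the target's pointer size, which can only happen on 32-bit targets.
    assert!(isize::try_from(small).is_ok());
    #[cfg(target_pointer_width = "64")]
    assert!(isize::try_from(large).is_ok());
    #[cfg(target_pointer_width = "32")]
    assert!(isize::try_from(large).is_err());

    println!("pointer width: {} bits", usize::BITS);
}
```

So on 32-bit platforms the `unwrap()` would panic for out-of-range i64 offsets rather than silently truncate, which matches the "shouldn't be adding any risks we weren't already taking" observation above.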









[GitHub] [arrow] andygrove commented on pull request #9573: ARROW-11783: [Rust] Proposal for RFCs in Rust Arrow

2021-03-03 Thread GitBox


andygrove commented on pull request #9573:
URL: https://github.com/apache/arrow/pull/9573#issuecomment-789746539


   > I've been looking for ways to link the mdbook that could be generated for 
the RFCs to the apache arrow website and to be honest, I don't think it is 
possible. Also, I think it will become very cumbersome when people want to add 
a new RFC to the repo if we keep the RFCs within the Arrow repository.
   > 
   > I propose that we create another repo in the main Apache group and from 
there we can work with the RFCs. Something similar to what they have done to 
the [Arrow website](https://github.com/apache/arrow-site). I also think that by 
keeping it as a separate repository it will make it easier to add and maintain 
all the RFCs that may be received.
   > 
   > What do you think? @andygrove @alamb @jorgecarleitao @nevi-me
   
   Personally, I think we should have the RFCs in the main repo. We already 
publish documentation from the main repo to the Arrow web site (for example, 
the content under `docs/source`). Sorry, I haven't been following this closely 
so need to catch up on the discussions as to why you are running into issues 
with publishing the mdbook. I'll try and do that over the next couple days.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ovr commented on a change in pull request #9232: ARROW-10818: [Rust] Implement DecimalType

2021-03-03 Thread GitBox


ovr commented on a change in pull request #9232:
URL: https://github.com/apache/arrow/pull/9232#discussion_r586437638



##
File path: rust/datafusion/src/physical_plan/group_scalar.rs
##
@@ -22,10 +22,12 @@ use std::convert::{From, TryFrom};
 
 use crate::error::{DataFusionError, Result};
 use crate::scalar::ScalarValue;
+use arrow::datatypes::Decimal128Type;
 
 /// Enumeration of types that can be used in a GROUP BY expression
 #[derive(Debug, PartialEq, Eq, Hash, Clone)]
 pub(crate) enum GroupByScalar {
+Decimal128(i128, usize, usize),

Review comment:
   @alamb Should I store it as `(i128, usize, usize)` or `Decimal128Type`
(which is a structure) for `ScalarValue::Decimal128`?
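The two options under discussion can be sketched side by side (a hypothetical illustration, not the actual DataFusion definitions; field names are assumptions):

```rust
// Option B's struct: names the fields, so call sites stay readable.
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
struct Decimal128Type {
    value: i128,
    precision: usize,
    scale: usize,
}

#[derive(Debug, PartialEq, Eq, Hash, Clone)]
enum GroupByScalar {
    // Option A: positional tuple fields, compact but unlabeled.
    Decimal128(i128, usize, usize),
    // Option B: a named struct, self-documenting at the cost of a wrapper.
    Decimal128Struct(Decimal128Type),
}

fn main() {
    let a = GroupByScalar::Decimal128(12345, 10, 2);
    let b = GroupByScalar::Decimal128Struct(Decimal128Type {
        value: 12345,
        precision: 10,
        scale: 2,
    });
    // Different variants, same underlying data.
    assert_ne!(a, b);
}
```

The tuple variant keeps the enum flat; the struct makes precision and scale impossible to swap accidentally.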





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] andygrove commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-03 Thread GitBox


andygrove commented on a change in pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#discussion_r586449677



##
File path: rust/arrow/src/array/array_binary.rs
##
@@ -89,7 +89,7 @@ impl<OffsetSize: BinaryOffsetSizeTrait> GenericBinaryArray<OffsetSize> {
         // pointer alignment & location is ensured by RawPtrBox
         // buffer bounds/offset is ensured by the value_offset invariants
         std::slice::from_raw_parts(
-            self.value_data.as_ptr().offset(start.to_isize()),
+            self.value_data.as_ptr().offset(start.to_isize().unwrap()),

Review comment:
   If these calls to `unwrap` can't be avoided, could we at least document 
why they are needed/safe?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou edited a comment on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-03-03 Thread GitBox


pitrou edited a comment on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-789737670


   @mathyingzhou What is the status of this PR? Are you waiting for another 
review? (feel free to ping)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-03-03 Thread GitBox


pitrou commented on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-789737670


   @mathyingzhou What is the status of this PR? Are you waiting for another 
review? (feel free to ping)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #9614: ARROW-11839: [C++] Use xsimd for generation of accelerated bit-unpacking

2021-03-03 Thread GitBox


pitrou commented on pull request #9614:
URL: https://github.com/apache/arrow/pull/9614#issuecomment-789736290


   Interesting, thank you.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on pull request #9213: ARROW-11266: [Rust][DataFusion] Implement vectorized hashing for hash aggregate [WIP]

2021-03-03 Thread GitBox


alamb commented on pull request #9213:
URL: https://github.com/apache/arrow/pull/9213#issuecomment-789733069


   Closing this PR for now; We can reopen it if need be



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on pull request #9493: ARROW-11624: [Rust] Move Arrow benchmarks to its own crate

2021-03-03 Thread GitBox


alamb commented on pull request #9493:
URL: https://github.com/apache/arrow/pull/9493#issuecomment-789732796


   Closing this PR for now



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb closed pull request #9493: ARROW-11624: [Rust] Move Arrow benchmarks to its own crate

2021-03-03 Thread GitBox


alamb closed pull request #9493:
URL: https://github.com/apache/arrow/pull/9493


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ovr commented on a change in pull request #9232: ARROW-10818: [Rust] Implement DecimalType

2021-03-03 Thread GitBox


ovr commented on a change in pull request #9232:
URL: https://github.com/apache/arrow/pull/9232#discussion_r586437638



##
File path: rust/datafusion/src/physical_plan/group_scalar.rs
##
@@ -22,10 +22,12 @@ use std::convert::{From, TryFrom};
 
 use crate::error::{DataFusionError, Result};
 use crate::scalar::ScalarValue;
+use arrow::datatypes::Decimal128Type;
 
 /// Enumeration of types that can be used in a GROUP BY expression
 #[derive(Debug, PartialEq, Eq, Hash, Clone)]
 pub(crate) enum GroupByScalar {
+Decimal128(i128, usize, usize),

Review comment:
   @alamb Should I store it as (i128, usize, usize) or `Decimal128Type` 
(which is a structure)?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ovr commented on pull request #9232: ARROW-10818: [Rust] Implement DecimalType

2021-03-03 Thread GitBox


ovr commented on pull request #9232:
URL: https://github.com/apache/arrow/pull/9232#issuecomment-789732012


   > @ovr what do you think we should do with this PR? Is it worth keeping open 
as a Draft or shall we close it for now?
   
   It's time to finish it. So, I've rebased the PR and dropped `Decimal256` from it
(will send it in another PR). Planning to finish it over the weekend.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on a change in pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-03 Thread GitBox


alamb commented on a change in pull request #9595:
URL: https://github.com/apache/arrow/pull/9595#discussion_r586420039



##
File path: rust/datafusion/src/physical_plan/hash_join.rs
##
@@ -699,13 +726,11 @@ macro_rules! hash_array {
 }
 
 /// Creates hash values for every element in the row based on the values in 
the columns
-pub fn create_hashes(
+pub fn create_hashes<'a>(
     arrays: &[ArrayRef],
     random_state: &RandomState,
-) -> Result<Vec<u64>> {
-    let rows = arrays[0].len();
-    let mut hashes = vec![0; rows];
-
+    hashes: &'a mut Vec<u64>,

Review comment:
   I think it would help readability to name this  `hashes_buffer ` for 
consistency

##
File path: rust/datafusion/src/physical_plan/hash_join.rs
##
@@ -311,6 +319,7 @@ fn update_hash(
     hash: &mut JoinHashMap,
     offset: usize,
     random_state: &RandomState,
+    hashes_buffer: &mut Vec<u64>,

Review comment:
   This is effectively allowing `hashes_buffer ` to be reused, right?
   
   It may eventually make sense to make some struct that holds all the relevant 
state (`on`, `random_state`, `hash_buf`, etc). 
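The buffer-reuse pattern being discussed can be sketched as follows (a minimal hypothetical illustration, not the PR's code — the stand-in hash is an assumption): the caller owns `hashes_buffer` and each call overwrites it in place, so repeated batches share one allocation instead of allocating a fresh `Vec` per call.

```rust
// Caller-owned buffer is cleared and refilled on every call, so its
// allocation is reused across batches instead of reallocated each time.
fn create_hashes<'a>(
    values: &[u64],
    hashes_buffer: &'a mut Vec<u64>,
) -> &'a mut Vec<u64> {
    hashes_buffer.clear();
    hashes_buffer.resize(values.len(), 0);
    for (hash, v) in hashes_buffer.iter_mut().zip(values) {
        // Stand-in for the real per-column hashing logic.
        *hash = v.wrapping_mul(0x9E3779B97F4A7C15);
    }
    hashes_buffer
}

fn main() {
    let mut hashes_buffer = Vec::new();
    for batch in [&[1u64, 2, 3][..], &[4u64, 5][..]] {
        let hashes = create_hashes(batch, &mut hashes_buffer);
        assert_eq!(hashes.len(), batch.len());
    }
}
```

Bundling `hashes_buffer` with `on` and `random_state` into one state struct, as suggested, would keep the growing parameter lists manageable.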

##
File path: rust/datafusion/src/physical_plan/hash_join.rs
##
@@ -476,15 +485,16 @@ fn build_join_indexes(
 .into_array(left_data.1.num_rows()))
 })
         .collect::<Result<Vec<_>>>()?;
-
-    let hash_values = create_hashes(&keys_values, &random_state)?;
+    let hash_buff = &mut vec![0; keys_values[0].len()];

Review comment:
   I recommend naming this `hashes_buffer` for consistency with its use 
elsewhere





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices

2021-03-03 Thread GitBox


alamb commented on pull request #9595:
URL: https://github.com/apache/arrow/pull/9595#issuecomment-789714615


   The integration test failure in 
https://github.com/apache/arrow/pull/9595/checks?check_run_id=1998235390 seems 
to be the same as was fixed in https://github.com/apache/arrow/pull/9593
   
   I also pulled this branch locally, re-ran the tests, and everything
looks good to me



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-03 Thread GitBox


github-actions[bot] commented on pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#issuecomment-789710271


   https://issues.apache.org/jira/browse/ARROW-11845



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-03 Thread GitBox


pitrou commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-789701493


   Congratulations @liyafan82 ! Do you have an idea how hard it will be to add 
zstd support?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] lidavidm commented on pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask

2021-03-03 Thread GitBox


lidavidm commented on pull request #9616:
URL: https://github.com/apache/arrow/pull/9616#issuecomment-789698947


   Ah, this is broken until I fully define GADInMemoryFragment...let me fix 
that up



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes

2021-03-03 Thread GitBox


github-actions[bot] commented on pull request #9624:
URL: https://github.com/apache/arrow/pull/9624#issuecomment-789694328


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-03 Thread GitBox


liyafan82 commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-789684969


   The integration tests have passed. Please take another look when you
have time, dear reviewers. (Maybe just review the last three commits.)
   Thanks a lot.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] Dandandan commented on pull request #9605: ARROW-11802: [Rust][DataFusion] Remove use of crossbeam channels to avoid potential deadlocks

2021-03-03 Thread GitBox


Dandandan commented on pull request #9605:
URL: https://github.com/apache/arrow/pull/9605#issuecomment-789670134


   I also tested partitioning parquet & reading parquet from datafusion - all 
worked OK :+1: 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



