[GitHub] [arrow] liyafan82 commented on pull request #8210: ARROW-10031: [CI][Java] Support Java benchmark in Ursabot
liyafan82 commented on pull request #8210: URL: https://github.com/apache/arrow/pull/8210#issuecomment-725281667

> At my end, I can generate the following JSON file by `archery benchmark diff --language=java ...`
>
> @liyafan82 any comments regarding format and parameters are appreciated.
>
> ```
> benchmark baseline contender change % counters
> 0 org.apache.arrow.vector.IntBenchmarks.setIntDirectly 22.500 us/op 11.209 us/op -50.182 {'mode': 'avgt', 'threads': 1, 'warmups': 5, 'warmupTime': '10 s', 'measurements': 4, 'measurementTime': '10 s', 'jvmArgs': ['-Darrow.enable_null_check_for_get=false -Darrow.enable_unsafe_memory_access=true']}
> 2 org.apache.arrow.vector.IntBenchmarks.setWithValueHolder 19.031 us/op 6.627 us/op -65.179 {'mode': 'avgt', 'threads': 1, 'warmups': 5, 'warmupTime': '10 s', 'measurements': 4, 'measurementTime': '10 s', 'jvmArgs': ['-Darrow.enable_null_check_for_get=false -Darrow.enable_unsafe_memory_access=true']}
> 1 org.apache.arrow.vector.IntBenchmarks.setWithWriter 32.626 us/op 10.246 us/op -68.594 {'mode': 'avgt', 'threads': 1, 'warmups': 5, 'warmupTime': '10 s', 'measurements': 4, 'measurementTime': '10 s', 'jvmArgs': ['-Darrow.enable_null_check_for_get=false -Darrow.enable_unsafe_memory_access=true']}
> ```

@kiszk Thanks for your effort. Generally, it looks great! Some minor comments:
1. Would it be clearer to rename the title 'counters' to 'configuration'?
2. I am curious how the benchmarks are sorted. By 'change %'?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
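For reference, the `change %` column in the table above is the relative difference between contender and baseline, which is consistent with the numbers shown (22.500 → 11.209 us/op gives about -50.18%). A minimal sketch; the function name is ours, not archery's:

```rust
// Relative change of contender vs. baseline, in percent.
// Negative means the contender is faster for time-based units such as us/op.
fn change_pct(baseline: f64, contender: f64) -> f64 {
    (contender - baseline) / baseline * 100.0
}

fn main() {
    // Matches the first row above: 22.500 us/op -> 11.209 us/op.
    println!("{:.3}", change_pct(22.500, 11.209));
}
```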
[GitHub] [arrow] github-actions[bot] commented on pull request #8637: ARROW-10021: [C++][Compute] Return top-n modes in mode kernel
github-actions[bot] commented on pull request #8637: URL: https://github.com/apache/arrow/pull/8637#issuecomment-725320345 https://issues.apache.org/jira/browse/ARROW-10021
[GitHub] [arrow] vertexclique commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725349668 Answered
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
jorisvandenbossche commented on a change in pull request #8621: URL: https://github.com/apache/arrow/pull/8621#discussion_r521311397

## File path: cpp/src/arrow/compute/kernels/scalar_string.cc

## @@ -1231,6 +1251,305 @@ Result StrptimeResolve(KernelContext* ctx, const std::vector

+template <typename Type, bool left, bool right, typename Derived>
+struct UTF8TrimWhitespaceBase : StringTransform<Type, Derived> {
+  using Base = StringTransform<Type, Derived>;
+  using offset_type = typename Base::offset_type;
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    const uint8_t* begin = input;
+    const uint8_t* end = input + input_string_ncodeunits;
+    const uint8_t* end_trimmed = end;
+    const uint8_t* begin_trimmed = begin;
+
+    auto predicate = [](uint32_t c) { return !IsSpaceCharacterUnicode(c); };
+    if (left && !ARROW_PREDICT_TRUE(
+                    arrow::util::UTF8FindIf(begin, end, predicate, &begin_trimmed))) {
+      return false;
+    }
+    if (right && (begin_trimmed < end)) {
+      if (!ARROW_PREDICT_TRUE(arrow::util::UTF8FindIfReverse(begin_trimmed, end,
+                                                             predicate, &end_trimmed))) {
+        return false;
+      }
+    }
+    std::copy(begin_trimmed, end_trimmed, output);
+    *output_written = static_cast<offset_type>(end_trimmed - begin_trimmed);
+    return true;
+  }
+  void Execute(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    EnsureLookupTablesFilled();
+    Base::Execute(ctx, batch, out);
+  }
+};
+
+template <typename Type>
+struct UTF8TrimWhitespace
+    : UTF8TrimWhitespaceBase<Type, true, true, UTF8TrimWhitespace<Type>> {};
+
+template <typename Type>
+struct UTF8LTrimWhitespace
+    : UTF8TrimWhitespaceBase<Type, true, false, UTF8LTrimWhitespace<Type>> {};
+
+template <typename Type>
+struct UTF8RTrimWhitespace
+    : UTF8TrimWhitespaceBase<Type, false, true, UTF8RTrimWhitespace<Type>> {};
+
+template <typename Type, bool left, bool right, typename Derived>
+struct UTF8TrimBase : StringTransform<Type, Derived> {
+  using Base = StringTransform<Type, Derived>;
+  using offset_type = typename Base::offset_type;
+  using State = OptionsWrapper<TrimOptions>;
+  TrimOptions options;
+  std::vector<bool> codepoints;
+
+  explicit UTF8TrimBase(TrimOptions options) : options(options) {
+    // TODO: check return / can we raise an exception here?
+    arrow::util::UTF8ForEach(options.characters, [&](uint32_t c) {
+      codepoints.resize(std::max(c + 1, static_cast<uint32_t>(codepoints.size())));
+      codepoints.at(c) = true;
+    });
+  }
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    TrimOptions options = State::Get(ctx);
+    Derived(options).Execute(ctx, batch, out);
+  }
+
+  void Execute(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    EnsureLookupTablesFilled();
+    Base::Execute(ctx, batch, out);
+  }
+
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    const uint8_t* begin = input;
+    const uint8_t* end = input + input_string_ncodeunits;
+    const uint8_t* end_trimmed = end;
+    const uint8_t* begin_trimmed = begin;
+
+    auto predicate = [&](uint32_t c) {
+      bool contains = codepoints[c];
+      return !contains;
+    };
+    if (left && !ARROW_PREDICT_TRUE(
+                    arrow::util::UTF8FindIf(begin, end, predicate, &begin_trimmed))) {
+      return false;
+    }
+    if (right && (begin_trimmed < end)) {
+      if (!ARROW_PREDICT_TRUE(arrow::util::UTF8FindIfReverse(begin_trimmed, end,
+                                                             predicate, &end_trimmed))) {
+        return false;
+      }
+    }
+    std::copy(begin_trimmed, end_trimmed, output);
+    *output_written = static_cast<offset_type>(end_trimmed - begin_trimmed);
+    return true;
+  }
+};
+template <typename Type>
+struct UTF8Trim : UTF8TrimBase<Type, true, true, UTF8Trim<Type>> {
+  using Base = UTF8TrimBase<Type, true, true, UTF8Trim<Type>>;
+  using Base::Base;
+};
+
+template <typename Type>
+struct UTF8LTrim : UTF8TrimBase<Type, true, false, UTF8LTrim<Type>> {
+  using Base = UTF8TrimBase<Type, true, false, UTF8LTrim<Type>>;
+  using Base::Base;
+};
+
+template <typename Type>
+struct UTF8RTrim : UTF8TrimBase<Type, false, true, UTF8RTrim<Type>> {
+  using Base = UTF8TrimBase<Type, false, true, UTF8RTrim<Type>>;
+  using Base::Base;
+};
+
+#endif
+
+template <typename Type, bool left, bool right, typename Derived>
+struct AsciiTrimWhitespaceBase : StringTransform<Type, Derived> {
+  using offset_type = typename Type::offset_type;
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    const uint8_t* begin = input;
+    const uint8_t* end = input + input_string_ncodeunits;
+    const uint8_t* end_trimmed = end;
+
+    auto predicate = [](unsigned char c) { return !IsSpaceCharacterAscii(c); };
+    const uint8_t* begin_trimmed = left ? std::find_if(begin, end, predicate) : begin;
+    if (right & (begin_trimmed < end)) {
+      std::reverse_iterator<const uint8_t*> rbegin(end);
+      std::reverse_iterator<const uint8_t*> rend(begin_trimmed);
+      end_trimmed = std::find_if(rbegin, rend, predicate).base();
+    }
+    std::copy(begin_trimmed, end_trimmed, output);
+    *output_written = static_cast<offset_type>(end_trimmed - begin_trimmed);
+    return true;
+  }
+};
+
+template <typename Type>
+struct AsciiTrimWhitespace
+    : AsciiTrimWhitespaceBase<Type, true, true, AsciiTrimWhitespace<Type>> {};
+
+template <typename Type>
+struct AsciiLTrimWhitespace
+    : AsciiTrimWhitespaceBase<
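The ASCII path in the diff above is essentially a `find_if` from the front plus a `find_if` over reverse iterators from the back. A rough Rust translation of that shape (the function name is ours, not Arrow's):

```rust
// Trim ASCII whitespace from both ends of a byte slice, mirroring the
// find_if / reverse find_if structure of the kernel above.
fn ascii_trim(input: &[u8]) -> &[u8] {
    let not_space = |b: &u8| !b.is_ascii_whitespace();
    // First byte that is not whitespace (or len() if all bytes are whitespace).
    let begin = input.iter().position(not_space).unwrap_or(input.len());
    // One past the last non-whitespace byte; empty slice if all whitespace.
    let end = input.iter().rposition(not_space).map_or(begin, |i| i + 1);
    &input[begin..end]
}

fn main() {
    assert_eq!(ascii_trim(b"  hello \t"), &b"hello"[..]);
    assert_eq!(ascii_trim(b"   "), &b""[..]);
}
```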
[GitHub] [arrow] alamb commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
alamb commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725397069 I am checking / testing this PR out locally.
[GitHub] [arrow] kiszk commented on pull request #8210: ARROW-10031: [CI][Java] Support Java benchmark in Ursabot
kiszk commented on pull request #8210: URL: https://github.com/apache/arrow/pull/8210#issuecomment-725290440 For 1, I will rename the title for cpp and Java. For 2, you are right; they are sorted [here](https://github.com/apache/arrow/blob/master/dev/archery/archery/cli.py#L595).
[GitHub] [arrow] cyb70289 opened a new pull request #8637: ARROW-10021: [C++][Compute] Return top-n modes in mode kernel
cyb70289 opened a new pull request #8637: URL: https://github.com/apache/arrow/pull/8637 This patch generalizes the mode kernel to return the top-n modes. There is no performance difference for normal benchmarks, and about a 20% performance drop for 100%-null benchmarks.
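For intuition, "top-n modes" means the n most frequent values together with their counts. A hash-count-and-sort sketch in Rust (ours, not the C++ kernel's implementation; ties are broken here by the smaller value):

```rust
use std::collections::HashMap;

// Return up to n (value, count) pairs, most frequent first;
// ties between equally frequent values go to the smaller value.
fn top_n_modes(values: &[i64], n: usize) -> Vec<(i64, i64)> {
    let mut counts: HashMap<i64, i64> = HashMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    let mut pairs: Vec<(i64, i64)> = counts.into_iter().collect();
    // Descending by count, then ascending by value for a deterministic order.
    pairs.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    pairs.truncate(n);
    pairs
}

fn main() {
    assert_eq!(top_n_modes(&[1, 1, 2, 2, 2, 3], 2), vec![(2, 3), (1, 2)]);
}
```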
[GitHub] [arrow] pitrou commented on pull request #8626: ARROW-10545: [C++] Fix crash on invalid Parquet file (OSS-Fuzz)
pitrou commented on pull request #8626: URL: https://github.com/apache/arrow/pull/8626#issuecomment-725298582 As I said above, this will break CI until #8617 is merged.
[GitHub] [arrow] maartenbreddels commented on pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
maartenbreddels commented on pull request #8621: URL: https://github.com/apache/arrow/pull/8621#issuecomment-725345905

The `std::vector<bool>` was a good idea, and indeed because of its bit packing, the memory usage for Unicode isn't that heavy (most extreme: `0x10FFFF bits = 140kb` in case of a contiguous array implementation).

Benchmarks:

```
set:
TrimManyAscii_median     28346892 ns   28345125 ns   25    558.956MB/s   35.2794M items/s
TrimManyUtf8_median      28302644 ns   28294883 ns   25    559.949MB/s   35.3421M items/s

unordered_set:
TrimManyAscii_median     32017530 ns   32014024 ns   22    494.898MB/s   31.2363M items/s
TrimManyUtf8_median      (not run)

vector<bool>:
TrimManyAscii_median     14911543 ns   14910620 ns   47    1062.58MB/s   67.0663M items/s
TrimManyUtf8_median      16148001 ns   16146053 ns   44    981.273MB/s   61.9346M items/s

bitset<256>:
TrimManyAscii_median     14304925 ns   14304010 ns   49    1107.64MB/s   69.9105M items/s
```

`vector<bool>` is good enough I think; the bitset is consistently faster (5%), but I'd rather have similar code for both solutions.
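The lookup-table idea being benchmarked can be sketched in Rust as follows. All names here are ours; note that Rust's `Vec<bool>` stores one byte per entry, unlike C++ `std::vector<bool>`, which is bit-packed, so a bit-vector would be needed to match the 140kb figure:

```rust
// Build a per-code-point membership table for the trim characters,
// growing lazily like the C++ codepoints vector.
fn make_table(chars: &str) -> Vec<bool> {
    let mut table = Vec::new();
    for c in chars.chars() {
        let idx = c as usize;
        if table.len() <= idx {
            table.resize(idx + 1, false);
        }
        table[idx] = true;
    }
    table
}

// Trim code points listed in the table from both ends of a string.
fn trim_with_table<'a>(s: &'a str, table: &[bool]) -> &'a str {
    let in_table = |c: char| table.get(c as usize).copied().unwrap_or(false);
    // First char not in the table (or len() if everything is trimmed away).
    let start = s
        .char_indices()
        .find(|&(_, c)| !in_table(c))
        .map(|(i, _)| i)
        .unwrap_or(s.len());
    // One past the last char not in the table.
    let end = s
        .char_indices()
        .rev()
        .find(|&(_, c)| !in_table(c))
        .map(|(i, c)| i + c.len_utf8())
        .unwrap_or(start);
    &s[start..end]
}

fn main() {
    let table = make_table("xy");
    assert_eq!(trim_with_table("xxyhiyx", &table), "hi");
}
```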
[GitHub] [arrow] alamb closed pull request #8636: ARROW-10552: [Rust] Removed un-used Result
alamb closed pull request #8636: URL: https://github.com/apache/arrow/pull/8636
[GitHub] [arrow] pitrou commented on a change in pull request #8637: ARROW-10021: [C++][Compute] Return top-n modes in mode kernel
pitrou commented on a change in pull request #8637: URL: https://github.com/apache/arrow/pull/8637#discussion_r521255222

## File path: docs/source/cpp/compute.rst

## @@ -140,7 +140,7 @@ Aggregations

 +----------+--------+----------+-------------------+-------------------------+
 | min_max  | Unary  | Numeric  | Scalar Struct (1) | :struct:`MinMaxOptions` |
 +----------+--------+----------+-------------------+-------------------------+
-| mode     | Unary  | Numeric  | Scalar Struct (2) |                         |
+| mode     | Unary  | Numeric  | Scalar Struct (2) | :struct:`ModeOptions`   |

Review comment: Replace "Scalar Struct" with "Struct".

## File path: cpp/src/arrow/compute/kernels/aggregate_test.cc

## @@ -735,71 +735,63 @@ class TestPrimitiveModeKernel : public ::testing::Test {
 public:
  using ArrowType = T;
  using Traits = TypeTraits<ArrowType>;
- using c_type = typename ArrowType::c_type;
- using ModeType = typename Traits::ScalarType;
- using CountType = typename TypeTraits<Int64Type>::ScalarType;
-
- void AssertModeIs(const Datum& array, c_type expected_mode, int64_t expected_count) {
-   ASSERT_OK_AND_ASSIGN(Datum out, Mode(array));
-   const StructScalar& value = out.scalar_as<StructScalar>();
+ using CType = typename ArrowType::c_type;
+
+ void AssertModesAre(const Datum& array, const int n,
+                     const std::vector<CType>& expected_modes,
+                     const std::vector<int64_t>& expected_counts) {
+   ASSERT_OK_AND_ASSIGN(Datum out, Mode(array, ModeOptions{n}));
+   const StructArray out_array(out.array());
+   ASSERT_EQ(out_array.length(), expected_modes.size());
+   ASSERT_EQ(out_array.num_fields(), 2);
+
+   const CType* out_modes = out_array.field(0)->data()->GetValues<CType>(1);
+   const int64_t* out_counts = out_array.field(1)->data()->GetValues<int64_t>(1);
+   for (int i = 0; i < out_array.length(); ++i) {
+     // equal or nan equal
+     ASSERT_TRUE(
+         (expected_modes[i] == out_modes[i]) ||
+         (expected_modes[i] != expected_modes[i] && out_modes[i] != out_modes[i]));
+     ASSERT_EQ(expected_counts[i], out_counts[i]);
+   }
+ }

-   const auto& out_mode = checked_cast<const ModeType&>(*value.value[0]);
-   ASSERT_EQ(expected_mode, out_mode.value);
+ void AssertModesAre(const std::string& json, const int n,
+                     const std::vector<CType>& expected_modes,
+                     const std::vector<int64_t>& expected_counts) {
+   auto array = ArrayFromJSON(type_singleton(), json);
+   AssertModesAre(array, n, expected_modes, expected_counts);
+ }

-   const auto& out_count = checked_cast<const CountType&>(*value.value[1]);
-   ASSERT_EQ(expected_count, out_count.value);
+ void AssertModeIs(const Datum& array, CType expected_mode, int64_t expected_count) {
+   AssertModesAre(array, 1, {expected_mode}, {expected_count});
  }

- void AssertModeIs(const std::string& json, c_type expected_mode,
+ void AssertModeIs(const std::string& json, CType expected_mode,
                    int64_t expected_count) {
    auto array = ArrayFromJSON(type_singleton(), json);
    AssertModeIs(array, expected_mode, expected_count);
  }

- void AssertModeIs(const std::vector<std::string>& json, c_type expected_mode,
+ void AssertModeIs(const std::vector<std::string>& json, CType expected_mode,
                    int64_t expected_count) {
    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
    AssertModeIs(chunked, expected_mode, expected_count);
  }

- void AssertModeIsNull(const Datum& array) {
-   ASSERT_OK_AND_ASSIGN(Datum out, Mode(array));
-   const StructScalar& value = out.scalar_as<StructScalar>();
-
-   for (const auto& val : value.value) {
-     ASSERT_FALSE(val->is_valid);
-   }
+ void AssertModeIsNull(const Datum& array, int n) {

Review comment: This should be renamed, since it's not null anymore. `AssertModesEmpty`?

## File path: cpp/src/arrow/compute/api_aggregate.h

## @@ -76,6 +76,18 @@ struct ARROW_EXPORT MinMaxOptions : public FunctionOptions {
   enum Mode null_handling;
 };

+/// \brief Control Mode kernel behavior
+///
+/// Returns top-n common values and counts.
+/// By default, returns the most common value and count.
+struct ARROW_EXPORT ModeOptions : public FunctionOptions {
+  explicit ModeOptions(int n = 1) : n(n) {}

Review comment: Can we use `int64_t` everywhere for this?
[GitHub] [arrow] alamb commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
alamb commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725416573 So my personal suggestion is to change all the benches to use `seed_from_u64` or equivalent. I don't think there is any need for a new additional dependency -- https://crates.io/crates/rand seems much more popular than https://crates.io/crates/bastion-utils.
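The reproducibility fix suggested here is simply fixed seeding: with the rand crate, `StdRng::seed_from_u64(42)` yields the same sequence on every run. To keep this sketch dependency-free, a tiny hand-rolled SplitMix64 generator stands in for a seeded `StdRng`; the generator itself is ours and purely illustrative:

```rust
// A minimal deterministic PRNG (SplitMix64). In actual benches one would use
// rand's StdRng::seed_from_u64 instead; this stand-in just demonstrates why a
// fixed seed makes benchmark inputs reproducible across runs.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    // Two generators with the same seed produce identical benchmark data...
    let (mut a, mut b) = (SplitMix64::new(42), SplitMix64::new(42));
    let xs: Vec<u64> = (0..4).map(|_| a.next_u64()).collect();
    let ys: Vec<u64> = (0..4).map(|_| b.next_u64()).collect();
    assert_eq!(xs, ys);
    // ...while a different seed gives a different, but equally reproducible, run.
    let mut c = SplitMix64::new(7);
    assert_ne!(xs[0], c.next_u64());
}
```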
[GitHub] [arrow] pitrou commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
pitrou commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521349565

## File path: cpp/src/arrow/util/basic_decimal.cc

## @@ -395,33 +395,49 @@ BasicDecimal128& BasicDecimal128::operator*=(const BasicDecimal128& right) {
   return *this;
 }

-/// Expands the given value into an array of ints so that we can work on
-/// it. The array will be converted to an absolute value and the wasNegative
+/// Expands the given little endian array of uint64_t into a big endian array of
+/// uint32_t. The value of input array is expected to be non-negative. The result_array
+/// will remove leading zeros from the input array.
+/// \param value_array a little endian array to represent the value
+/// \param result_array a big endian array of length N*2 to set with the value
+/// \result the output length of the array
+template <size_t N>
+static int64_t FillInArray(const std::array<uint64_t, N>& value_array,
+                           uint32_t* result_array) {
+  int64_t next_index = 0;
+  for (int64_t i = N - 1; i >= 0; i--) {
+    if (value_array[i] != 0) {
+      if (value_array[i] <= std::numeric_limits<uint32_t>::max()) {
+        result_array[next_index++] = static_cast<uint32_t>(value_array[i]);
+        i--;
+      }
+      for (int64_t j = i; j >= 0; j--) {
+        result_array[next_index++] = static_cast<uint32_t>(value_array[j] >> 32);
+        result_array[next_index++] = static_cast<uint32_t>(value_array[j]);
+      }
+      break;
+    }
+  }
+  return next_index;
+}
+
+/// Expands the given value into a big endian array of ints so that we can work on
+/// it. The array will be converted to an absolute value and the was_negative
 /// flag will be set appropriately. The array will remove leading zeros from
 /// the value.
-/// \param array an array of length 4 to set with the value
+/// \param array a big endian array of length 4 to set with the value
 /// \param was_negative a flag for whether the value was original negative
 /// \result the output length of the array
 static int64_t FillInArray(const BasicDecimal128& value, uint32_t* array,
                            bool& was_negative) {
-  uint64_t high;
-  uint64_t low;
-  const int64_t highbits = value.high_bits();
-  const uint64_t lowbits = value.low_bits();
-
-  if (highbits < 0) {
-    low = ~lowbits + 1;
-    high = static_cast<uint64_t>(~highbits);
-    if (low == 0) {
-      ++high;
-    }
-    was_negative = true;
-  } else {
-    low = lowbits;
-    high = static_cast<uint64_t>(highbits);
-    was_negative = false;
-  }
-
+  BasicDecimal128 abs_value = BasicDecimal128::Abs(value);
+  was_negative = value.high_bits() < 0;
+  uint64_t high = static_cast<uint64_t>(abs_value.high_bits());
+  uint64_t low = abs_value.low_bits();
+
+  // FillInArray(std::array<uint64_t, N>& value_array, uint32_t* result_array) is not
+  // called here as the following code has better performance, to avoid regression on
+  // BasicDecimal128 Division.
   if (high != 0) {
     if (high > std::numeric_limits<uint32_t>::max()) {

Review comment: I'm afraid that doesn't answer my question about non-strict inequality, does it?
[GitHub] [arrow] pitrou commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
pitrou commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521349186

## File path: cpp/src/arrow/util/basic_decimal.cc

## @@ -490,49 +527,60 @@ static void FixDivisionSigns(BasicDecimal128* result, BasicDecimal128* remainder
   }
 }

-/// \brief Build a BasicDecimal128 from a list of ints.
+/// \brief Build a little endian array of uint64_t from a big endian array of uint32_t.
+template <size_t N>
+static DecimalStatus BuildFromArray(std::array<uint64_t, N>* result_array,
+                                    uint32_t* array, int64_t length) {
+  for (int64_t i = length - 2 * N - 1; i >= 0; i--) {
+    if (array[i] != 0) {
+      return DecimalStatus::kOverflow;
+    }
+  }
+  int64_t next_index = length - 1;
+  for (size_t i = 0; i < N; i++) {
+    uint64_t lower_bits = (next_index < 0) ? 0 : array[next_index--];
+    (*result_array)[i] =
+        (next_index < 0)

Review comment: Why is there still a test for `next_index < 0` here? Also, why are some indices `int64_t` and others `size_t`?
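`BuildFromArray` goes the other way: big-endian u32 digits back into N little-endian u64 limbs, reporting overflow if significant digits remain beyond 2*N slots. The `next_index < 0` check pitrou asks about covers digit arrays shorter than 2*N. A Rust sketch with our own names:

```rust
// Rebuild n little-endian u64 limbs from big-endian u32 digits.
// Returns None on overflow (significant digits beyond 2*n slots).
fn build_from_array(digits: &[u32], n: usize) -> Option<Vec<u64>> {
    if digits.len() > 2 * n && digits[..digits.len() - 2 * n].iter().any(|&d| d != 0) {
        return None;
    }
    let mut next: i64 = digits.len() as i64 - 1;
    // Pops the least significant remaining digit, or 0 when the digit
    // array is shorter than 2*n -- the `next_index < 0` case.
    let mut pop = || -> u64 {
        if next < 0 {
            0
        } else {
            let d = digits[next as usize] as u64;
            next -= 1;
            d
        }
    };
    let mut limbs = vec![0u64; n];
    for limb in limbs.iter_mut() {
        let lo = pop();
        let hi = pop();
        *limb = (hi << 32) | lo;
    }
    Some(limbs)
}

fn main() {
    // Digits [1, 2] (big-endian) -> value 0x1_0000_0002 in limb 0, limb 1 zero.
    assert_eq!(build_from_array(&[1, 2], 2), Some(vec![0x1_0000_0002, 0]));
}
```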
[GitHub] [arrow] vertexclique edited a comment on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique edited a comment on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725422469

> If the argument is that two runs of the same benchmark do not agree due to randomness, then I do not understand how having the same seed per thread or different seeds per thread changes this behavior: the seed will still be set when the process starts and will be different per run.

I don't think that it should be different per run per benchmark; that's why it varies. https://github.com/bheisler/criterion.rs/blob/4499ee5e32f59c0c15eb046dd7a2dfd0c15a96f7/src/stats/univariate/sample.rs#L210

> Seeding the random number generators can be done using the existing rand crate, in the following way (this took me longer to figure out from the rand crate's documentation than I would like to admit):

Yeah, I didn't want to write a const compiled seed, and people forget to use the same seed every time. My implementation also does some things faster than rand. Here: https://docs.rs/bastion-utils/0.3.2/src/bastion_utils/math.rs.html#4-28 But I am OK with that too.
[GitHub] [arrow] alamb closed pull request #8619: ARROW-10531: [Rust][DataFusion]: Add schema and graphviz formatting for LogicalPlans and a PlanVisitor
alamb closed pull request #8619: URL: https://github.com/apache/arrow/pull/8619
[GitHub] [arrow] maartenbreddels commented on a change in pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
maartenbreddels commented on a change in pull request #8621: URL: https://github.com/apache/arrow/pull/8621#discussion_r521422359

## File path: cpp/src/arrow/compute/kernels/scalar_string.cc
[GitHub] [arrow] pitrou commented on pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
pitrou commented on pull request #8621: URL: https://github.com/apache/arrow/pull/8621#issuecomment-725493506 > I guess we still need to manually add content to compute.rst? Yes, you do :-) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao edited a comment on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725500367 Thanks a lot, @alamb and @vertexclique . I agree with the naming issues here, and great insight into those crates. I do not have strong feelings about naming; I tried to be close to what I am familiar with from Rust (e.g. `Vec::extend`). A bit of background: @yordan-pavlov did an impressive job on #7798 to make filtering performant; It is _really_ hard to get that performance with a generalization. I had to re-write this code at least 3 times to get it to a stage with comparable performance and even then, you can see that it depends on the filter conditions. But there is a (IMO good) reason for generalizing the code in `filter`. Specifically, the use-case I am looking at is to remove most code from the `take` kernel, since it is doing what this struct is doing (and what `filter` was doing). There are implementation details that are different (`take`'s indices can be null and `filter` does not support `ArrayStruct` atm), but they are fundamentally doing the same thing: constructing an `ArrayData` by memcopying chunks of another `ArrayData`. Also, note that this goes beyond masking: it supports taking values repeatedly. I see it as a "builder" bound to a specific `ArrayData` (`ArrayDataBuilder` is already taken, though). It is not a builder like the builders in `builder.rs` because those memcopy data from rust native types, not Arrow types. The reason this performance remains comparable(-ish) to master is that when it binds to an `ArrayData`, it inspects its `DataType` to initialize functions (the `type Extend`) bound to that `ArrayData` that are performant. 
For example, ```rust let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; Box::new(move |mutable: &mut _MutableArrayData, start: usize, len: usize| { ``` instead of ```rust Box::new(move |mutable: &mut _MutableArrayData, array: &ArrayData, start: usize, len: usize| { let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; ``` has a meaningful performance impact because it avoids checks on both `buffers()[0]` and `[something..]` on every call of `extend`. This check may seem small, but it significantly impacts performance because `extend` can be as small as: `copy 2 bits` (in the case of a boolean array). When there are many calls of this operation (e.g. filter every other slot), these checks are as or more expensive than the `memcopy` themselves. There is a further boost possible that I explored but did not finish, but it requires a bit of `unsafe` code: we know that these functions never outlive `MutableDataArray`. Therefore, we could actually unsafely pass `&mut _MutableArrayData` to their initialization and avoid all costs associated with accessing `mutable.buffers()` and the like inside `extend`. In this case, we would use ```rust type Extend<'a> = Box<Fn(usize, usize) -> () + 'a>; ``` and bind the `_MutableArrayData` on `build_primitive_extend<'a, T>(array, data)`. This would be the closest to the current filter implementation that remains generic for other use-cases. My point is that there are non-trivial optimizations done in this proposal that were borrowed from the excellent work from @yordan-pavlov and that are required to keep up with the very high bar set by #7798. This draft is trying to leverage that work on two of our kernels, `take` and `filter`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
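The bind-once pattern described above — hoisting buffer lookups out of the per-call closure so each `extend` call is a bare memcopy — can be sketched in isolation. The names below (`build_extend`, the `Vec<u8>` target) are illustrative stand-ins, not the actual arrow-rs API:

```rust
// Sketch of the "bind once, call many times" specialization discussed above.
// The specialized closure captures the values slice at creation time, so each
// call avoids re-indexing `buffers()[0]` and re-slicing by the array offset.
fn build_extend<'a>(
    values: &'a [u8],
    offset: usize,
) -> Box<dyn Fn(&mut Vec<u8>, usize, usize) + 'a> {
    // work done once, at bind time
    let values = &values[offset..];
    Box::new(move |out: &mut Vec<u8>, start: usize, len: usize| {
        // per-call body is a bare memcopy
        out.extend_from_slice(&values[start..start + len]);
    })
}

fn main() {
    let data = vec![1u8, 2, 3, 4, 5, 6];
    let extend = build_extend(&data, 2); // binds to &data[2..] = [3, 4, 5, 6]
    let mut out = Vec::new();
    extend(&mut out, 0, 2); // copies [3, 4]
    extend(&mut out, 3, 1); // copies [6]
    assert_eq!(out, vec![3, 4, 6]);
    println!("{:?}", out);
}
```

When each call copies only a couple of bytes (or bits), moving the slicing and bounds set-up out of the hot path is exactly where the comment argues the savings come from.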
[GitHub] [arrow] alamb commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
alamb commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725414703 In my measurements, the `sum` compute kernels do appear to have significant variability from run to run on my machine (details below). The variability still exists with this PR though it appears to be less. I think one issue, as @vertexclique has explained, is that different random numbers are being used between runs (and between threads) and thus the actual computations performed from run to run are changing. Seeding the random number generator so it always produces the same sequence of random numbers is definitely a classic way to reduce such variability and seems like a good idea to me. Seeding the random number generators *can* be done using the existing `rand` crate, in the following way (this took me longer to figure out from the `rand` crate's documentation than I would like to admit): Instead of ``` let mut rng = rand::thread_rng(); ``` Use ``` use rand::{ Rng, SeedableRng, rngs::StdRng }; let mut rng = StdRng::seed_from_u64(42); ``` Here are some measurements I ran on my machine: Master @ f7027b43d10bab4d8ca9397a753dc3553d88f146 ``` sum 512 time: [452.64 ns 456.16 ns 459.94 ns] sum 512 time: [459.08 ns 462.78 ns 466.55 ns] sum 512 time: [457.80 ns 461.87 ns 466.06 ns] sum nulls 512 time: [246.69 ns 248.65 ns 250.87 ns] sum nulls 512 time: [269.41 ns 271.79 ns 274.21 ns] sum nulls 512 time: [247.98 ns 250.42 ns 252.92 ns] ``` ARROW-10551-fix-unreproducible-benches ``` sum 512 time: [476.98 ns 482.19 ns 488.17 ns] sum 512 time: [470.75 ns 474.96 ns 479.42 ns] sum 512 time: [506.47 ns 508.00 ns 509.59 ns] sum nulls 512 time: [268.39 ns 270.32 ns 272.56 ns] sum nulls 512 time: [272.20 ns 274.46 ns 276.81 ns] sum nulls 512 time: [266.60 ns 269.12 ns 272.28 ns] ``` Master @ f7027b43d10bab4d8ca9397a753dc3553d88f146 w/ `StdRng::seed_from_u64`: ``` sum 512 time: 
[466.96 ns 471.60 ns 476.52 ns] sum nulls 512 time: [236.96 ns 238.34 ns 239.76 ns] sum nulls 512 time: [241.24 ns 243.24 ns 245.50 ns] sum nulls 512 time: [246.68 ns 248.60 ns 250.61 ns] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
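Seeding, as suggested in the discussion above, makes every benchmark run consume an identical input sequence. The PR itself would use the `rand` crate's `StdRng::seed_from_u64(42)` as shown; as a dependency-free illustration of *why* seeding makes runs reproducible, here is a minimal splitmix64 generator (an assumption of this sketch — not the generator `rand` or the PR uses):

```rust
// Minimal seeded PRNG (splitmix64) illustrating reproducible benchmark data:
// two generators built from the same seed emit identical sequences, so the
// arrays fed to the kernels are identical across `cargo bench` runs.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next(&mut self) -> u64 {
        // standard splitmix64 constants
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    let mut a = SplitMix64::new(42);
    let mut b = SplitMix64::new(42);
    let xs: Vec<u64> = (0..512).map(|_| a.next()).collect();
    let ys: Vec<u64> = (0..512).map(|_| b.next()).collect();
    // same seed -> same benchmark input, run after run
    assert_eq!(xs, ys);
    println!("first value: {}", xs[0]);
}
```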
[GitHub] [arrow] jorgecarleitao commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
jorgecarleitao commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725414955 > Current, code is reseeding after 32 MB of data, and every thread has different randomness. So no data is same as another data and totally different in different use cases. What I am trying to understand is why this is a concern: in the benchmarks, the data is generated only once, during the setup and not on every iteration of the benchmark. Thus, any randomness will be frozen for every iteration of the benchmark. If the argument is that two runs of the same benchmark do not agree due to randomness, then I do not understand how having the same seed per thread or different seeds per thread changes this behavior: the seed will still be set when the process starts and will be different per run. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
alamb commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725438655 > Can you use https://github.com/vertexclique/zor/blob/master/zor if you are on Linux? I am running on mac osx -- I looked at zor and it looked like a good tool to try and make reproducible results by clearing kernel caches, etc. I wonder if you can try using the seeded random approach with zor to see if that is as reproducible. > Yeah, I didn't want to write a const compiled seed and people forget to use the same seed every time. That is a good point -- maybe we can make a utility function or something for the tests to use. I don't think the actual value of the seed (e.g. `42`) is important to keep consistent in all the .rs sources - just that it is a constant number between invocations of `cargo bench ...` across runs > My implementation does some extra faster stuff compared to rand too. Here: https://docs.rs/bastion-utils/0.3.2/src/bastion_utils/math.rs.html#4-28 That looks cool, but I am not enough of a numerical algorithms expert to evaluate the randomness properties of that algorithm This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521407494 ## File path: rust/arrow/src/array/transform/primitive.rs ## @@ -0,0 +1,37 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of}; + +use crate::{array::ArrayData, datatypes::ArrowNativeType}; + +use super::{Extend, _MutableArrayData}; + +pub(super) fn build_extend<T: ArrowNativeType>(array: &ArrayData) -> Extend { +let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; +Box::new( +move |mutable: &mut _MutableArrayData, start: usize, len: usize| { +let start = start * size_of::<T>(); Review comment: I try to use generics for things that are known at compile time and arguments for things that are only known at runtime. Maybe that is not the correct way of thinking? In this case in particular, the difference would be between `build_extend::<T>(array)` and `build_extend(array, size_of::<T>())`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorisvandenbossche opened a new pull request #8638: ARROW-10558: [Python] Fix python S3 filesystem tests interdependence
jorisvandenbossche opened a new pull request #8638: URL: https://github.com/apache/arrow/pull/8638 Follow-up on https://github.com/apache/arrow/pull/8573, where I introduced a test that was only passing because of state from other S3 tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] vertexclique commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725419093 Can you use https://github.com/vertexclique/zor/blob/master/zor if you are on Linux? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao commented on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725500367 Thanks a lot, @alamb and @vertexclique . I agree with the naming issues here, and great insight into those crates. I do not have strong feelings about naming; I tried to be close to what I am familiar with from Rust (e.g. `Vec::extend`). A bit of background: @yordan-pavlov did an impressive job on #7798 to make filtering performant; It is _really_ hard to get that performance with a generalization. I had to re-write this code at least 3 times to get it to a stage with comparable performance and even then, you can see that it depends on the filter conditions. But there is a (IMO good) reason for generalizing the code in `filter`. Specifically, the use-case I am looking at is to remove most code from the `take` kernel, since it is doing what this struct is doing (and what `filter` was doing). There are implementation details that are different (`take`'s indices can be null and `filter` does not support `ArrayStruct` atm), but they are fundamentally doing the same thing: constructing an `ArrayData` by memcopying chunks of another `ArrayData`. Also, note that this goes beyond masking: it supports taking values repeatedly. I see it as a "builder" bound to a specific `ArrayData` (`ArrayDataBuilder` is already taken, though). It is not a builder like the builders in `builder.rs` because those memcopy data from rust native types, not Arrow types. The reason this performance remains comparable(-ish) to master is that when it binds to an `ArrayData`, it inspects its `DataType` to initialize functions (the `type Extend`) bound to that `ArrayData` that are performant. 
For example, ```rust let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; Box::new(move |mutable: &mut _MutableArrayData, start: usize, len: usize| { ``` instead of ```rust Box::new(move |mutable: &mut _MutableArrayData, array: &ArrayData, start: usize, len: usize| { let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; ``` has a meaningful performance impact because it avoids checks on both `buffers()[0]` and `[something..]` on every call of `extend`. This check may seem small, but it significantly impacts performance because `extend` can be as small as: `copy 2 bits` (in the case of a boolean array). When there are many calls of this operation (e.g. filter every other slot), these checks are as or more expensive than the `memcopy` themselves. There is a further boost possible that I explored but did not finish, but it requires a bit of `unsafe` code: we know that these functions never outlive `MutableDataArray`. Therefore, we could actually unsafely pass `&mut _MutableArrayData` to their initialization and avoid all costs associated with accessing `mutable.buffers()` and the like inside `extend`. In this case, we would use ```rust type Extend<'a> = Box<Fn(usize, usize) -> () + 'a>; ``` and bind the `_MutableArrayData` on `build_primitive_extend<'a, T>(array, data)`. This would be the closest to the current filter implementation that remains generic for other use-cases. My point is that there are really non-trivial optimizations done in this proposal that were borrowed from the excellent work from @yordan-pavlov and that are required to keep up with the very high bar set by #7798. This draft is trying to leverage that work on two of our kernels, `take` and `filter`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] vertexclique commented on a change in pull request #8636: ARROW-10552: [Rust] Removed un-used Result
vertexclique commented on a change in pull request #8636: URL: https://github.com/apache/arrow/pull/8636#discussion_r521339244 ## File path: rust/arrow/src/compute/kernels/filter.rs ## @@ -112,7 +112,7 @@ impl<'a> CopyNullBit for NullBitSetter<'a> { } fn null_buffer(&mut self) -> Buffer { -self.target_buffer.resize(self.target_index).unwrap(); +self.target_buffer.resize(self.target_index); Review comment: These are most probably generating clippy warnings. ## File path: rust/arrow/src/buffer.rs ## @@ -676,9 +676,9 @@ impl MutableBuffer { /// `new_len` will be zeroed out. /// /// If `new_len` is less than `len`, the buffer will be truncated. -pub fn resize(&mut self, new_len: usize) -> Result<()> { +pub fn resize(&mut self, new_len: usize) -> () { Review comment: ```suggestion pub fn resize(&mut self, new_len: usize) { ``` It is the same in other positions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
jorgecarleitao edited a comment on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725414955 > Current, code is reseeding after 32 MB of data, and every thread has different randomness. So no data is same as another data and totally different in different use cases. What I am trying to understand is why this is a concern: in the benchmarks, the data is generated only once, during the setup and not on every iteration of the benchmark. Thus, any randomness will be frozen for every iteration of the benchmark. If the argument is that two runs of the same benchmark do not agree due to randomness, then I do not understand how having the same seed per thread or different seeds per thread changes this behavior: the seed will still be set when the process starts and will be different per run. EDIT: @alamb beat me to it by seconds (and evidence) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
alamb commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521354110 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array<T, F>(size: usize, value_fn: F) -> PrimitiveArray<T> +fn create_primitive_array<T>(size: usize, null_density: f32) -> PrimitiveArray<T> where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution<T::Native>, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); let mut builder = PrimitiveArray::<T>::builder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::<f32>() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::<f32>() < null_density { builder.append_null().unwrap(); +} else { +let value = rng.sample_iter(&Alphanumeric).take(10).collect::<String>(); +builder.append_value(&value).unwrap(); } } builder.finish() } -fn create_bool_array<F>(size: usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = rand::thread_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::<f32>() < trues_density; Review comment: As @vertexclique mentions, he has provided a good PR related to this conversation https://github.com/apache/arrow/pull/8635 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #8619: ARROW-10531: [Rust][DataFusion]: Add schema and graphviz formatting for LogicalPlans and a PlanVisitor
alamb commented on pull request #8619: URL: https://github.com/apache/arrow/pull/8619#issuecomment-725433951 Rebased and will merge when it passes CI This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] maartenbreddels commented on pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
maartenbreddels commented on pull request #8621: URL: https://github.com/apache/arrow/pull/8621#issuecomment-725476725 I've opened an issue at https://issues.apache.org/jira/browse/ARROW-10556 I guess we still need to manually add content to compute.rst? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8636: ARROW-10552: [Rust] Removed un-used Result
alamb commented on a change in pull request #8636: URL: https://github.com/apache/arrow/pull/8636#discussion_r521368094 ## File path: rust/arrow/src/compute/kernels/filter.rs ## @@ -112,7 +112,7 @@ impl<'a> CopyNullBit for NullBitSetter<'a> { } fn null_buffer(&mut self) -> Buffer { -self.target_buffer.resize(self.target_index).unwrap(); +self.target_buffer.resize(self.target_index); Review comment: You are likely correct -- we should eventually turn on clippy for the CI checks in this crate This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jhorstmann commented on pull request #8598: ARROW-10500: [Rust] Refactor bit slice, bit view iterator for array buffers
jhorstmann commented on pull request #8598: URL: https://github.com/apache/arrow/pull/8598#issuecomment-725477741 > > I think we should address jhorstmann's measurements of performance regressions before this PR is merged. > > I measured the performance. 🙃 It is in the PR description. That's exactly where I took the benchmark results from. But yes, the regression in `buffer_bit_ops and` does not seem to have any big effect. I had one other comment about the separate testcases for big-endian architectures, or restricting tests to little-endian, that was not yet addressed: > I'm wondering whether this is really correct, the way I understood it is that little/big endian only affect the layout of bytes in memory, not how individual bits are accessed in a number. In this testcase the least significant bit of the first byte is zero and would be considered the first value if this was a boolean array or null bitmap. Same for the 4th least significant bit, which is where the slice here should start. This means the least significant bit of the chunk should be zero. Consider the following buffer of u8, used as bit-packed data, with the indices of bytes and bits written below ``` 0001 01 76543210 76543210 ``` To get the value of the 12th bit we would check bit (12%8) of byte (12/8). Viewing this as a larger type (u16 for simplification): ``` 0001 0 11 5432109876543210 ``` To check the same bit we would need to check bit (12%16) of word (12/16). So the value as u16 would be 4096 and this should be independent of the machine-endianness. Endianness only influences how the u16 would be stored in memory, but our underlying data consists of u8 in memory. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
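The index arithmetic in the comment above — bit `i` is bit `(i % 8)` of byte `(i / 8)`, and the same bit is bit `(i % 16)` of word `(i / 16)` after a little-endian reinterpretation — can be checked directly. This is a small stand-alone sketch, not the Arrow code under review:

```rust
// Bit i of a bit-packed buffer is bit (i % 8) of byte (i / 8).
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] >> (i % 8)) & 1 == 1
}

fn main() {
    // byte 0 = 0b0000_0000, byte 1 = 0b0001_0000: only bit 12 of the buffer is set
    let bytes = [0b0000_0000u8, 0b0001_0000];
    assert!(get_bit(&bytes, 12));
    assert!(!get_bit(&bytes, 11));

    // the same two bytes viewed as a little-endian u16 are 4096 = 1 << 12,
    // so checking bit (12 % 16) of word (12 / 16) agrees with the byte view
    let word = u16::from_le_bytes(bytes);
    assert_eq!(word, 4096);
    assert_eq!((word >> 12) & 1, 1);
    println!("word = {}", word);
}
```

This matches the comment's claim that the u16 value is 4096: the widening only changes how many bits each "digit" holds, not which bit is addressed, as long as the bytes are assembled in little-endian order.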
[GitHub] [arrow] vertexclique commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725422469 > If the argument is that two runs of the same benchmark do not agree due to randomness, then I do not understand how having the same seed per thread or different seeds per thread changes this behavior: the seed will still be set when the process starts and will be different per run. I don't think that it should be different per run per benchmark, that's why it is varying. https://github.com/bheisler/criterion.rs/blob/4499ee5e32f59c0c15eb046dd7a2dfd0c15a96f7/src/stats/univariate/sample.rs#L210 > Seeding the random number generators can be done using the existing rand crate, in the following way (this took me longer to figure out from the rand crate's documentation than I would like to admit): Yeah, I didn't want to write a const compiled seed and people forget to use the same seed every time. My implementation does some extra faster stuff compared to rand too. Here: https://docs.rs/bastion-utils/0.3.2/src/bastion_utils/math.rs.html#4-28 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] vertexclique commented on pull request #8630: ARROW-10540 [Rust] Improve filtering
vertexclique commented on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725436304 ``` Naming: I have seen similar concepts called "Masks" (as they are similar to bit masks) -- so perhaps ArrayDataMask or MaskedArrayData. Or perhaps ArrayRowSet When I actually read the code about MutableArrayData I realize that it isn't quite the mask concept, but it is similar (intermediate results want to represent "what indexes pass a certain test" and then eventually copying only those indexes to a new array" This type of structure might also be useful for performing multiple boolean operations (eg. when doing (A > 5) AND A < 10 you can compute the row ids/indexes that pass A > 5 and then rather than actually copying those rows to then compare them less than B you can operate directly on the original copy of A (check only the rows where the mask is true) ``` I found the whole block of this comment to be true. There are other approaches to do this but the main approach is that. I believe a scratchpad implementation can also solve this problem from a different looking glass. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
pitrou commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521362458 ## File path: cpp/src/arrow/util/basic_decimal.cc ## @@ -549,24 +603,27 @@ static DecimalStatus SingleDivide(const uint32_t* dividend, int64_t dividend_len return DecimalStatus::kSuccess; } -DecimalStatus BasicDecimal128::Divide(const BasicDecimal128& divisor, - BasicDecimal128* result, - BasicDecimal128* remainder) const { +/// \brief Do a division where the divisor fits into a single 32 bit value. +template <typename DecimalClass> +static inline DecimalStatus DecimalDivide(const DecimalClass& dividend, + const DecimalClass& divisor, + DecimalClass* result, DecimalClass* remainder) { + constexpr int64_t kDecimalArrayLength = sizeof(DecimalClass) / sizeof(uint32_t); Review comment: Use of `sizeof(DecimalClass)` here seems a bit fragile. Perhaps add/use `DecimalClass::bit_width`? ## File path: cpp/src/arrow/util/decimal_test.cc ## @@ -1296,6 +1296,56 @@ TEST(Decimal256Test, Multiply) { } } +TEST(Decimal256Test, Divide) { + ASSERT_EQ(Decimal256(33), Decimal256(100) / Decimal256(3)); + ASSERT_EQ(Decimal256(66), Decimal256(200) / Decimal256(3)); + ASSERT_EQ(Decimal256(66), Decimal256(20100) / Decimal256(301)); + ASSERT_EQ(Decimal256(-66), Decimal256(-20100) / Decimal256(301)); + ASSERT_EQ(Decimal256(-66), Decimal256(20100) / Decimal256(-301)); + ASSERT_EQ(Decimal256(66), Decimal256(-20100) / Decimal256(-301)); + ASSERT_EQ(Decimal256("-5192296858534827628530496329343552"), + Decimal256("-269599466671506397946670150910580797473777870509761363" + "24636208709184") / +Decimal256("5192296858534827628530496329874417")); + ASSERT_EQ(Decimal256("5192296858534827628530496329343552"), + Decimal256("-269599466671506397946670150910580797473777870509761363" + "24636208709184") / +Decimal256("-5192296858534827628530496329874417")); + ASSERT_EQ(Decimal256("5192296858534827628530496329343552"), + Decimal256("2695994666715063979466701509105807974737778705097613632" + "4636208709184") / 
+Decimal256("5192296858534827628530496329874417")); + ASSERT_EQ(Decimal256("-5192296858534827628530496329343552"), + Decimal256("2695994666715063979466701509105807974737778705097613632" + "4636208709184") / +Decimal256("-5192296858534827628530496329874417")); + + // Test some random numbers. + for (auto x : GetRandomNumbers(16)) { +for (auto y : GetRandomNumbers(16)) { + if (y == 0) { +continue; + } + + Decimal256 result = Decimal256(x) / Decimal256(y); + ASSERT_EQ(Decimal256(static_cast<int64_t>(x) / y), result) + << " x: " << x << " y: " << y; +} + } + + // Test some edge cases + for (auto x : std::vector<int64_t>{-INT64_MAX, -INT32_MAX, 0, INT32_MAX, INT64_MAX}) { +for (auto y : std::vector<int32_t>{-INT32_MAX, -32, -2, -1, 1, 2, 32, INT32_MAX}) { Review comment: I would expect those additional values to also be part of `y`. ## File path: cpp/src/arrow/util/basic_decimal.cc ## @@ -549,24 +603,27 @@ static DecimalStatus SingleDivide(const uint32_t* dividend, int64_t dividend_len return DecimalStatus::kSuccess; } -DecimalStatus BasicDecimal128::Divide(const BasicDecimal128& divisor, - BasicDecimal128* result, - BasicDecimal128* remainder) const { +/// \brief Do a division where the divisor fits into a single 32 bit value. Review comment: The comment here seems wrong (the divisor doesn't fit into a single 32bit value). ## File path: cpp/src/arrow/util/decimal_test.cc ## @@ -40,6 +40,9 @@ namespace arrow { using internal::int128_t; using internal::uint128_t; +constexpr int128_t kInt128Max = (static_cast<int128_t>(INT64_MAX) << 64) - 1 + +2 * (static_cast<int128_t>(INT64_MAX) + 1); Review comment: Why not `(static_cast<int128_t>(INT64_MAX) << 64) + static_cast<int128_t>(UINT64_MAX)` ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
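The `SingleDivide` helper that this review discusses implements schoolbook long division of a multi-limb integer by a single 32-bit divisor. A sketch of the technique, in Rust rather than C++ and with an illustrative most-significant-first limb order that is not necessarily Arrow's actual layout:

```rust
// Schoolbook division of a multi-limb integer by a single 32-bit divisor.
// Limbs are most-significant first; the quotient replaces `limbs` in place
// and the remainder is returned. A 64-bit intermediate holds the partial
// remainder shifted up by one 32-bit "digit".
fn single_divide(limbs: &mut [u32], divisor: u32) -> u32 {
    let mut rem: u64 = 0;
    for limb in limbs.iter_mut() {
        // bring down the next 32-bit digit
        let cur = (rem << 32) | (*limb as u64);
        *limb = (cur / divisor as u64) as u32; // quotient digit (always fits u32)
        rem = cur % divisor as u64;
    }
    rem as u32
}

fn main() {
    // 20100 / 301 = 66 remainder 234, matching the Decimal256 test values above
    let mut limbs = [0u32, 20100];
    let rem = single_divide(&mut limbs, 301);
    assert_eq!(limbs, [0, 66]);
    assert_eq!(rem, 234);
    println!("quotient limbs = {:?}, remainder = {}", limbs, rem);
}
```

Because the running remainder is always less than the divisor, each intermediate `(rem << 32) | limb` stays below `divisor << 32`, so every quotient digit fits back into a `u32`.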
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521392544 ## File path: rust/arrow/src/array/transform/mod.rs ## @@ -0,0 +1,532 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of, sync::Arc}; + +use crate::{buffer::MutableBuffer, datatypes::DataType, util::bit_util}; + +use super::{ArrayData, ArrayDataRef}; + +mod boolean; +mod list; +mod primitive; +mod utils; +mod variable_size; + +type ExtendNullBits<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>; +// function that extends `[start..start+len]` to the mutable array. +// this is dynamic because different data_types influence how buffers and childs are extended. +type Extend<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>; Review comment: That unfortunately would only work for primitive buffers. For string arrays, extending an array data requires a complex operation that is fundamentally different from extending a single buffer. For nested types, the operation is recursive on the child data. This is fundamentally a dynamic operation: we only know what to do when we see which `DataType` the user wants to build an `ArrayData` from. 
We can see that the `Builders` use a similar approach: they use `dyn Builder` for the same reason. The builders have an extra complexity associated with the fact that their input type is not uniform: i.e. their API supports extending from a `&[T]` (e.g. `i32` or `i16`), which is the reason why they need to be implemented via a dynamic type, whose each implementation has methods for each type. In the `MutableArrayData`, the only "thing" that we extend from is an `ArrayData`, which has a uniform (rust) type, but requires a different behavior based on its `data_type` => function pointer per data-type. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
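The "function pointer per data-type" idea can be sketched in miniature. The following is an illustrative toy, not the actual arrow crate API (plain `Vec<u8>` stands in for the real `_MutableArrayData`, and the `DataType` enum is invented): inspect the `DataType` once, return a closure specialized for it, and pay no per-call type dispatch on the hot path.

```rust
// Illustrative sketch only: a miniature "mutable array data" that picks its
// `extend` strategy once, based on the data type, then calls it many times.
type Extend<'a> = Box<dyn Fn(&mut Vec<u8>, usize, usize) + 'a>;

enum DataType {
    Int32,
    Utf8,
}

// Inspect the DataType once; return a closure bound to the source bytes.
fn build_extend<'a>(data_type: &DataType, src: &'a [u8]) -> Extend<'a> {
    match data_type {
        // Fixed-width: copy `len` 4-byte values starting at element `start`.
        DataType::Int32 => Box::new(move |dst: &mut Vec<u8>, start: usize, len: usize| {
            dst.extend_from_slice(&src[start * 4..(start + len) * 4]);
        }),
        // Variable-width types would need an offsets buffer; simplified to raw bytes here.
        DataType::Utf8 => Box::new(move |dst: &mut Vec<u8>, start: usize, len: usize| {
            dst.extend_from_slice(&src[start..start + len]);
        }),
    }
}

fn main() {
    let src: Vec<u8> = (0u8..16).collect();

    let extend = build_extend(&DataType::Int32, &src);
    let mut dst = Vec::new();
    extend(&mut dst, 1, 2); // copy 4-byte values 1 and 2, i.e. bytes 4..12
    assert_eq!(dst, (4u8..12).collect::<Vec<u8>>());

    let extend_utf8 = build_extend(&DataType::Utf8, &src);
    let mut dst2 = Vec::new();
    extend_utf8(&mut dst2, 0, 3);
    assert_eq!(dst2, vec![0, 1, 2]);
}
```

The dispatch cost is paid once in `build_extend`; every subsequent call goes straight to a monomorphic copy loop.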
[GitHub] [arrow] pitrou commented on a change in pull request #8612: ARROW-8199: [C++] Add support for multi-column sort indices on Table
pitrou commented on a change in pull request #8612: URL: https://github.com/apache/arrow/pull/8612#discussion_r521405090 ## File path: cpp/src/arrow/compute/api_vector.h ## @@ -58,6 +58,34 @@ struct ARROW_EXPORT TakeOptions : public FunctionOptions { static TakeOptions Defaults() { return BoundsCheck(); } }; +enum SortOrder { Review comment: Make this a `enum class`? ## File path: cpp/src/arrow/compute/api_vector.h ## @@ -58,6 +58,34 @@ struct ARROW_EXPORT TakeOptions : public FunctionOptions { static TakeOptions Defaults() { return BoundsCheck(); } }; +enum SortOrder { + ASCENDING, + DESCENDING, Review comment: Nit, but sometimes all-uppercase values conflict with some C macros (especially on Windows). `Ascending` perhaps? ## File path: cpp/src/arrow/compute/api_vector.h ## @@ -165,8 +194,50 @@ Result> NthToIndices(const Array& values, int64_t n, /// \param[in] ctx the function execution context, optional /// \return offsets indices that would sort an array ARROW_EXPORT -Result> SortToIndices(const Array& values, - ExecContext* ctx = NULLPTR); +Result> SortIndices(const Array& values, + ExecContext* ctx = NULLPTR); + +/// \brief Returns the indices that would sort an array in the +/// specified order. +/// +/// Perform an indirect sort of array. The output array will contain +/// indices that would sort an array, which would be the same length +/// as input. Nulls will be stably partitioned to the end of the output. Review comment: Add "regardless of order"? 
## File path: cpp/src/arrow/compute/kernels/vector_sort.cc ## @@ -346,14 +374,396 @@ void AddSortingKernels(VectorKernel base, VectorFunction* func) { } } +class TableSorter : public TypeVisitor { + private: + struct ResolvedSortKey { +ResolvedSortKey(const ChunkedArray& chunked_array, const SortOrder order) +: order(order) { + type = chunked_array.type().get(); + null_count = chunked_array.null_count(); + num_chunks = chunked_array.num_chunks(); + for (const auto& chunk : chunked_array.chunks()) { +chunks.push_back(chunk.get()); + } +} + +template <typename ArrayType> +ArrayType* ResolveChunk(int64_t index, int64_t& chunk_index) const { + if (num_chunks == 1) { +chunk_index = index; +return static_cast<ArrayType*>(chunks[0]); + } else { +int64_t offset = 0; +for (size_t i = 0; i < num_chunks; ++i) { + if (index < offset + chunks[i]->length()) { +chunk_index = index - offset; +return static_cast<ArrayType*>(chunks[i]); + } + offset += chunks[i]->length(); +} +return nullptr; + } +} + +SortOrder order; +DataType* type; +int64_t null_count; +size_t num_chunks; Review comment: `int64_t`? 
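Separately from the integer-type question, the `ResolveChunk` logic reduces to "walk the chunk lengths until the global index falls inside one, then return the chunk number and the index within it". A toy model of just that index arithmetic (a sketch, not the PR's code):

```rust
// Sketch of ResolveChunk: map a logical row index over a chunked array
// to (chunk number, index within that chunk). Returns None when out of range.
fn resolve_chunk(chunk_lengths: &[i64], index: i64) -> Option<(usize, i64)> {
    let mut offset = 0;
    for (i, &len) in chunk_lengths.iter().enumerate() {
        if index < offset + len {
            return Some((i, index - offset));
        }
        offset += len;
    }
    None
}

fn main() {
    let chunks = [3, 5, 2]; // three chunks, 10 rows total
    assert_eq!(resolve_chunk(&chunks, 0), Some((0, 0)));
    assert_eq!(resolve_chunk(&chunks, 4), Some((1, 1)));
    assert_eq!(resolve_chunk(&chunks, 9), Some((2, 1)));
    assert_eq!(resolve_chunk(&chunks, 10), None);
}
```

The single-chunk fast path in the C++ version simply skips this scan because the logical index and the in-chunk index coincide.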
## File path: cpp/src/arrow/compute/kernels/vector_sort.cc ## @@ -346,14 +374,396 @@ void AddSortingKernels(VectorKernel base, VectorFunction* func) { } } +class TableSorter : public TypeVisitor { + private: + struct ResolvedSortKey { +ResolvedSortKey(const ChunkedArray& chunked_array, const SortOrder order) +: order(order) { + type = chunked_array.type().get(); + null_count = chunked_array.null_count(); + num_chunks = chunked_array.num_chunks(); + for (const auto& chunk : chunked_array.chunks()) { +chunks.push_back(chunk.get()); + } +} + +template +ArrayType* ResolveChunk(int64_t index, int64_t& chunk_index) const { + if (num_chunks == 1) { +chunk_index = index; +return static_cast(chunks[0]); + } else { +int64_t offset = 0; +for (size_t i = 0; i < num_chunks; ++i) { + if (index < offset + chunks[i]->length()) { +chunk_index = index - offset; +return static_cast(chunks[i]); + } + offset += chunks[i]->length(); +} +return nullptr; + } +} + +SortOrder order; +DataType* type; +int64_t null_count; +size_t num_chunks; +std::vector chunks; + }; + + class Comparer : public TypeVisitor { + public: +Comparer(const Table& table, const std::vector& sort_keys) +: TypeVisitor(), status_(Status::OK()) { + for (const auto& sort_key : sort_keys) { +const auto& chunked_array = table.GetColumnByName(sort_key.name); +if (!chunked_array) { + status_ = Status::Invalid("Nonexistent sort key column: ", sort_key.name); + return; +} +sort_keys_.emplace_back(*chunked_array, sort_key.order); + } +} + +Status status() { return status_; } + +const std::vector& sort_keys() { return sort_keys_; } + +bool Compare(uint64_t left, uint64_t right, size_t start_sort_key_index) { + current_left_ = left; + current_right_ = right; +
[GitHub] [arrow] alamb opened a new pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb opened a new pull request #8639: URL: https://github.com/apache/arrow/pull/8639 The module has gotten fairly large and so refactoring it into smaller chunks will improve readability – as suggested by Jorge https://github.com/apache/arrow/pull/8619#pullrequestreview-527391221 This PR just moves code around -- it is not intended to change any semantics Reviewing it commit-by-commit might be helpful to see how each piece went I can also break it up into a sequence of smaller PRs if that would help review This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
pitrou commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521350486 ## File path: cpp/src/arrow/util/basic_decimal.cc ## @@ -490,49 +527,60 @@ static void FixDivisionSigns(BasicDecimal128* result, BasicDecimal128* remainder } } -/// \brief Build a BasicDecimal128 from a list of ints. +/// \brief Build a little endian array of uint64_t from a big endian array of uint32_t. +template +static DecimalStatus BuildFromArray(std::array* result_array, +uint32_t* array, int64_t length) { + for (int64_t i = length - 2 * N - 1; i >= 0; i--) { +if (array[i] != 0) { + return DecimalStatus::kOverflow; +} + } + int64_t next_index = length - 1; + for (size_t i = 0; i < N; i++) { +uint64_t lower_bits = (next_index < 0) ? 0 : array[next_index--]; +(*result_array)[i] = +(next_index < 0) Review comment: Hmm, I hadn't noticed the `next_index--` above... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
alamb commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521362589 ## File path: rust/arrow/src/array/transform/mod.rs ## @@ -0,0 +1,532 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of, sync::Arc}; + +use crate::{buffer::MutableBuffer, datatypes::DataType, util::bit_util}; + +use super::{ArrayData, ArrayDataRef}; + +mod boolean; +mod list; +mod primitive; +mod utils; +mod variable_size; + +type ExtendNullBits<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>; +// function that extends `[start..start+len]` to the mutable array. +// this is dynamic because different data_types influence how buffers and childs are extended. +type Extend<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>; Review comment: Rather than a dynamic function pointer to extend such a structure, I wonder if you could get by with `element_length` and `number_of_elements` ## File path: rust/arrow/src/array/transform/primitive.rs ## @@ -0,0 +1,37 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of}; + +use crate::{array::ArrayData, datatypes::ArrowNativeType}; + +use super::{Extend, _MutableArrayData}; + +pub(super) fn build_extend<T: ArrowNativeType>(array: &ArrayData) -> Extend { +let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; +Box::new( +move |mutable: &mut _MutableArrayData, start: usize, len: usize| { +let start = start * size_of::<T>(); Review comment: it seems to me that `size_of::<T>()` is the part here that is dependent on type -- maybe you could just use that as a number This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao edited a comment on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725500367 Thanks a lot, @alamb and @vertexclique . I agree with the naming issues here, and great insight into those crates. I do not have strong feelings about naming; I tried to be close to what I am familiar with from Rust (e.g. `Vec::extend`). A bit of background: @yordan-pavlov did an impressive job on #7798 to make filtering performant; It is _really_ hard to get that performance with a generalization. I had to re-write this code at least 3 times to get it to a stage with comparable performance and even then, you can see that it depends on the filter conditions. But there is a (IMO good) reason for generalizing the code in `filter`. Specifically, the use-case I am looking at is to remove most code from the `take` kernel, since it is doing what this struct is doing (and what `filter` was doing). There are implementation details that are different (`take`'s indices can be null and `filter` does not support `ArrayStruct` atm), but they are fundamentally doing the same thing: constructing an `ArrayData` by memcopying chunks of another `ArrayData`. Also, note that this goes beyond masking: it supports taking values repeatedly. I see it as a "builder" bound to a specific `ArrayData` (`ArrayDataBuilder` is already taken, though). It is not a builder like the builders in `builder.rs` because those memcopy data from rust native types, not Arrow types. The reason this performance remains comparable(-ish) to master is that when it binds to an `ArrayData`, it inspects its `DataType` to initialize functions (the `type Extend`) bound to that `ArrayData` that are performant. 
For example,

```rust
let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..];
Box::new(move |mutable: &mut _MutableArrayData, start: usize, len: usize| {
```

instead of

```rust
Box::new(move |mutable: &mut _MutableArrayData, array: &ArrayData, start: usize, len: usize| {
    let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..];
```

has a meaningful performance impact because it avoids checks on both `buffers()[0]` and `[something..]` on every call of `extend`. This check may seem small, but it significantly impacts performance because `extend` can be as small as: `copy 2 bits` (in the case of a boolean array). When there are many calls of this operation (e.g. filter every other slot), these checks are as or more expensive than the `memcopy` themselves. There is a further boost possible that I explored but did not finish, but it requires a bit of `unsafe` code: we know that these functions never outlive `MutableArrayData`. Therefore, we could actually unsafely pass `` to their initialization and avoid all costs associated with accessing `mutable.buffers()` and the like inside `extend`. In this case, we would use

```rust
type Extend<'a> = Box<dyn FnMut(usize, usize) + 'a>;
```

and bind the `_MutableArrayData` on `build_primitive_extend<'a, T>(array, data)`. This would be the closest to the current filter implementation that remains generic for other use-cases. My point is that there are really non-trivial optimizations done in this proposal that were borrowed from the excellent work from @yordan-pavlov and that are required to keep up with the very high bar set by #7798 This draft is trying to leverage that work on two of our kernels, `take` and `filter`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
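The hoisting described above can be demonstrated with plain byte slices. This is only an illustration with hypothetical helper names, not the PR's code: the hoisted variant slices (and bounds-checks) once when the closure is built, while the naive variant repeats that work on every call, even though both produce the same bytes.

```rust
// Hoisted variant: the slice is computed and bounds-checked once, at build time.
fn build_extend_hoisted<'a>(
    values: &'a [u8],
    offset: usize,
) -> impl Fn(&mut Vec<u8>, usize, usize) + 'a {
    let values = &values[offset..]; // checked once, here
    move |dst: &mut Vec<u8>, start: usize, len: usize| {
        dst.extend_from_slice(&values[start..start + len])
    }
}

// Naive variant: re-slices (and re-checks) on every single call.
fn build_extend_naive<'a>(
    values: &'a [u8],
    offset: usize,
) -> impl Fn(&mut Vec<u8>, usize, usize) + 'a {
    move |dst: &mut Vec<u8>, start: usize, len: usize| {
        let values = &values[offset..]; // repeated per call
        dst.extend_from_slice(&values[start..start + len])
    }
}

fn main() {
    let data = [10u8, 11, 12, 13, 14, 15];
    let mut a = Vec::new();
    let mut b = Vec::new();
    build_extend_hoisted(&data, 2)(&mut a, 1, 3);
    build_extend_naive(&data, 2)(&mut b, 1, 3);
    assert_eq!(a, b);
    assert_eq!(a, vec![13, 14, 15]);
}
```

When each call copies only a handful of bytes, the hoisted checks are where the savings come from.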
[GitHub] [arrow] alamb commented on pull request #8619: ARROW-10531: [Rust][DataFusion]: Add schema and graphviz formatting for LogicalPlans and a PlanVisitor
alamb commented on pull request #8619: URL: https://github.com/apache/arrow/pull/8619#issuecomment-725433229 Thanks @andygrove and @jorgecarleitao -- I plan to merge this PR and then make a new one to break the code into smaller modules This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] vertexclique commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725439351 > I wonder if you can try using the seeded random approach with zor to see if that is as reproducible. will do. > That looks cool, but I am not enough of a numerical algorithms expert to evaluate the randomness properties of that algorithm dw, I will check if that's like how you reported, I will change to a constant seed with seededrng. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521407494 ## File path: rust/arrow/src/array/transform/primitive.rs ## @@ -0,0 +1,37 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of}; + +use crate::{array::ArrayData, datatypes::ArrowNativeType}; + +use super::{Extend, _MutableArrayData}; + +pub(super) fn build_extend<T: ArrowNativeType>(array: &ArrayData) -> Extend { +let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; +Box::new( +move |mutable: &mut _MutableArrayData, start: usize, len: usize| { +let start = start * size_of::<T>(); Review comment: I try to use generics for things that are known at compile time and arguments for things that are only known at runtime. Maybe that is not the correct way of thinking? In this case in particular, the difference would be between `build_extend::<T>(array)` and `build_extend(array, size_of::<T>())`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
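Both shapes work and produce identical bytes; a minimal sketch of the two signatures under discussion, using plain byte slices and illustrative names (not the crate's actual API):

```rust
use std::mem::size_of;

// Compile-time variant: the element width comes from the type parameter
// and is baked in during monomorphization.
fn build_extend_generic<'a, T>(values: &'a [u8]) -> Box<dyn Fn(usize, usize) -> Vec<u8> + 'a> {
    Box::new(move |start, len| {
        values[start * size_of::<T>()..(start + len) * size_of::<T>()].to_vec()
    })
}

// Runtime variant: the element width is passed as a plain number.
fn build_extend_runtime<'a>(
    values: &'a [u8],
    elem_size: usize,
) -> Box<dyn Fn(usize, usize) -> Vec<u8> + 'a> {
    Box::new(move |start, len| values[start * elem_size..(start + len) * elem_size].to_vec())
}

fn main() {
    let bytes = [0u8, 1, 2, 3, 4, 5, 6, 7];
    let a = build_extend_generic::<i32>(&bytes); // size_of::<i32>() == 4
    let b = build_extend_runtime(&bytes, 4);
    assert_eq!(a(1, 1), b(1, 1)); // both select bytes 4..8
    assert_eq!(a(1, 1), vec![4, 5, 6, 7]);
}
```

The generic form lets the compiler fold the multiplication by a constant; the runtime form keeps one non-generic function at the cost of carrying the width as data.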
[GitHub] [arrow] github-actions[bot] commented on pull request #8638: ARROW-10558: [Python] Fix python S3 filesystem tests interdependence
github-actions[bot] commented on pull request #8638: URL: https://github.com/apache/arrow/pull/8638#issuecomment-725498272 https://issues.apache.org/jira/browse/ARROW-10558 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521484903 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -1264,31 +1050,6 @@ mod tests { Ok(()) } -#[test] -fn test_visitor() { -let schema = Schema::new(vec![]); -assert_eq!("[]", format!("{}", display_schema(&schema))); -} - -#[test] -fn test_display_empty_schema() { -let schema = Schema::new(vec![]); -assert_eq!("[]", format!("{}", display_schema(&schema))); -} - -#[test] -fn test_display_schema() { -let schema = Schema::new(vec![ -Field::new("id", DataType::Int32, false), -Field::new("first_name", DataType::Utf8, true), -]); - -assert_eq!( -"[id:Int32, first_name:Utf8;N]", -format!("{}", display_schema(&schema)) -); -} Review comment: Good. I removed ``` fn test_visitor() { let schema = Schema::new(vec![]); assert_eq!("[]", format!("{}", display_schema(&schema))); } ``` Which is redundant, except for the name, with the test immediately below: ``` #[test] fn test_display_empty_schema() { let schema = Schema::new(vec![]); assert_eq!("[]", format!("{}", display_schema(&schema))); } ``` Aka it was removing a copy/paste bug (that I added) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sweb opened a new pull request #8640: WIP: ARROW-4193: [Rust] Add support for decimal data type
sweb opened a new pull request #8640: URL: https://github.com/apache/arrow/pull/8640 This is very much work in progress. The idea is to implement `DecimalArray` based on the existing `FixedSizeBinaryArray`. At the current state this is mostly C The values are returned as `i128`. The underlying fixed size is calculated from the precision. ``` let values: [u8; 20] = [0, 0, 0, 0, 0, 2, 17, 180, 219, 192, 255, 255, 255, 255, 255, 253, 238, 75, 36, 64]; let array_data = ArrayData::builder(DataType::Decimal(23, 6)) .len(2) .add_buffer(Buffer::from([..])) .build(); let fixed_size_binary_array = DecimalArray::from(array_data); assert_eq!( 8_887_000_000, fixed_size_binary_array.value(0) ); assert_eq!( -8_887_000_000, fixed_size_binary_array.value(1) ); assert_eq!(10, fixed_size_binary_array.value_length()); ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
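The decoding sketched above (a fixed-width big-endian two's-complement value widened to `i128` by sign extension) can be illustrated standalone, reusing the example's own byte values. This is an assumption about the decoding scheme consistent with the example, not the PR's actual implementation:

```rust
// Decode a big-endian two's-complement byte slice (at most 16 bytes) into i128
// by sign-extending into a 16-byte buffer.
fn be_bytes_to_i128(bytes: &[u8]) -> i128 {
    assert!(bytes.len() <= 16);
    // Sign-extend: fill with 0xFF if the value is negative, 0x00 otherwise.
    let fill = if bytes.first().map_or(false, |b| b & 0x80 != 0) { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    i128::from_be_bytes(buf)
}

fn main() {
    // The two 10-byte values from the example above (Decimal(23, 6) => 10 bytes each).
    let values: [u8; 20] = [
        0, 0, 0, 0, 0, 2, 17, 180, 219, 192, // 8_887_000_000
        255, 255, 255, 255, 255, 253, 238, 75, 36, 64, // -8_887_000_000
    ];
    assert_eq!(be_bytes_to_i128(&values[0..10]), 8_887_000_000);
    assert_eq!(be_bytes_to_i128(&values[10..20]), -8_887_000_000);
}
```

The second value is simply the two's complement of the first within the 10-byte width, which is why the sign-extension step matters.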
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
jorgecarleitao commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521494620 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ //! Logical query plans can then be optimized and executed directly, or translated into //! physical query plans and executed. -use std::fmt::{self, Debug, Display}; -use std::{any::Any, collections::HashMap, collections::HashSet, sync::Arc}; - -use aggregates::{AccumulatorFunctionImplementation, StateTypeFunction}; -use arrow::{ -compute::can_cast_types, -datatypes::{DataType, Field, Schema, SchemaRef}, -}; - -use crate::datasource::parquet::ParquetTable; -use crate::datasource::TableProvider; -use crate::error::{DataFusionError, Result}; -use crate::{ -datasource::csv::{CsvFile, CsvReadOptions}, -physical_plan::udaf::AggregateUDF, -scalar::ScalarValue, -}; -use crate::{ -physical_plan::{ -aggregates, expressions::binary_operator_data_type, functions, udf::ScalarUDF, -}, -sql::parser::FileType, -}; -use arrow::record_batch::RecordBatch; -use functions::{ReturnTypeFunction, ScalarFunctionImplementation, Signature}; - +mod builder; +mod display; +mod expr; +mod extension; mod operators; +mod plan; +mod registry; + +pub use builder::LogicalPlanBuilder; +pub use display::display_schema; +pub use expr::{ +and, array, avg, binary_expr, col, concat, count, create_udaf, create_udf, +exprlist_to_fields, length, lit, max, min, sum, Expr, Literal, +}; +pub use extension::UserDefinedLogicalNode; pub use operators::Operator; - -fn create_function_name( -fun: , -distinct: bool, -args: &[Expr], -input_schema: , -) -> Result { -let names: Vec = args -.iter() -.map(|e| create_name(e, input_schema)) -.collect::>()?; -let distinct_str = match distinct { -true => "DISTINCT ", -false => "", -}; -Ok(format!("{}({}{})", fun, distinct_str, names.join(","))) -} - -/// Returns a readable name of an expression based on the input schema. 
-/// This function recursively transverses the expression for names such as "CAST(a > 2)". -fn create_name(e: , input_schema: ) -> Result { -match e { -Expr::Alias(_, name) => Ok(name.clone()), -Expr::Column(name) => Ok(name.clone()), -Expr::ScalarVariable(variable_names) => Ok(variable_names.join(".")), -Expr::Literal(value) => Ok(format!("{:?}", value)), -Expr::BinaryExpr { left, op, right } => { -let left = create_name(left, input_schema)?; -let right = create_name(right, input_schema)?; -Ok(format!("{} {:?} {}", left, op, right)) -} -Expr::Cast { expr, data_type } => { -let expr = create_name(expr, input_schema)?; -Ok(format!("CAST({} AS {:?})", expr, data_type)) -} -Expr::Not(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("NOT {}", expr)) -} -Expr::IsNull(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("{} IS NULL", expr)) -} -Expr::IsNotNull(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("{} IS NOT NULL", expr)) -} -Expr::ScalarFunction { fun, args, .. } => { -create_function_name(_string(), false, args, input_schema) -} -Expr::ScalarUDF { fun, args, .. } => { -create_function_name(, false, args, input_schema) -} -Expr::AggregateFunction { -fun, -distinct, -args, -.. -} => create_function_name(_string(), *distinct, args, input_schema), -Expr::AggregateUDF { fun, args } => { -let mut names = Vec::with_capacity(args.len()); -for e in args { -names.push(create_name(e, input_schema)?); -} -Ok(format!("{}({})", fun.name, names.join(","))) -} -other => Err(DataFusionError::NotImplemented(format!( -"Physical plan does not support logical expression {:?}", -other -))), -} -} - -/// Create field meta-data from an expression, for use in a result set schema -pub fn exprlist_to_fields(expr: &[Expr], input_schema: ) -> Result> { -expr.iter().map(|e| e.to_field(input_schema)).collect() -} - -/// `Expr` is a logical expression. A logical expression is something like `1 + 1`, or `CAST(c1 AS int)`. 
-/// Logical expressions know how to compute its [arrow::datatypes::DataType] and nullability. -/// `Expr` is a central struct of DataFusion's query API. -/// -/// # Examples -/// -/// ``` -/// # use datafusion::logical_plan::Expr; -/// # use datafusion::error::Result; -/// # fn main() -> Result<()> { -/// let expr = Expr::Column("c1".to_string()) + Expr::Column("c2".to_string()); -/// println!("{:?}", expr);
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521518379 ## File path: cpp/src/arrow/type.h ## @@ -1604,13 +1605,26 @@ class ARROW_EXPORT FieldRef { // -- // Schema +enum class Endianness { + LITTLE = 0, + BIG = 1, +#if ARROW_LITTLE_ENDIAN + NATIVE = LITTLE +#else + NATIVE = BIG +#endif +}; Review comment: @wesm Or, would it be possible to create a small PR to add an endian field to `DataType`? Then, I will continue to work for this PR after merging your PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
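For comparison, the `NATIVE` alias idea maps naturally onto Rust's `cfg!(target_endian = ...)`; a small illustrative sketch (not part of the PR):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Endianness {
    Little,
    Big,
}

impl Endianness {
    // Resolve the native byte order at compile time, mirroring the
    // NATIVE = LITTLE / BIG alias in the C++ enum above.
    const NATIVE: Endianness = if cfg!(target_endian = "big") {
        Endianness::Big
    } else {
        Endianness::Little
    };
}

fn main() {
    // Cross-check the compile-time alias against a runtime byte-order observation.
    let observed = if u16::to_ne_bytes(1)[0] == 1 {
        Endianness::Little
    } else {
        Endianness::Big
    };
    assert_eq!(Endianness::NATIVE, observed);
}
```

Like the C++ `#if ARROW_LITTLE_ENDIAN` branch, the choice is fixed per target at compile time rather than probed at runtime.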
[GitHub] [arrow] lidavidm edited a comment on pull request #8585: ARROW-10475: [C++][FlightRPC] handle IPv6 hosts
lidavidm edited a comment on pull request #8585: URL: https://github.com/apache/arrow/pull/8585#issuecomment-725554055 I've added a method on the Uri class now to format the host, though the implementation is essentially the naive one still (if ipv6 add brackets else pass through the host). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #8585: ARROW-10475: [C++][FlightRPC] handle IPv6 hosts
lidavidm commented on pull request #8585: URL: https://github.com/apache/arrow/pull/8585#issuecomment-725554055 I've added a method on the Uri class now to format the host, though the implementation is essentially the naive one still. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou opened a new pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
pitrou opened a new pull request #8642: URL: https://github.com/apache/arrow/pull/8642 Implement all flavours of binary-to-binary casting, except for fixed-size binary. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
github-actions[bot] commented on pull request #8642: URL: https://github.com/apache/arrow/pull/8642#issuecomment-725595691 https://issues.apache.org/jira/browse/ARROW-6071 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz closed pull request #8617: ARROW-10525: [C++] Fix crash on unsupported IPC stream
bkietz closed pull request #8617: URL: https://github.com/apache/arrow/pull/8617 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
github-actions[bot] commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725640851 Revision: 2c3dd4279cd5fd9d403bc8aa1f4e0ec3323d081c Submitted crossbow builds: [ursa-labs/crossbow @ actions-702](https://github.com/ursa-labs/crossbow/branches/all?query=actions-702)

|Task|Status|
|----|------|
|test-r-linux-as-cran|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-702-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-702-github-test-r-linux-as-cran)|

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521488424 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ (quoted diff, truncated in this archive: the monolithic `logical_plan/mod.rs` is split into `builder`, `display`, `expr`, `extension`, `operators`, `plan`, and `registry` submodules; `LogicalPlanBuilder`, `display_schema`, the `Expr` helper functions, `UserDefinedLogicalNode`, and `Operator` are re-exported via `pub use`; the inline `create_function_name`/`create_name` helpers and the `Expr` documentation move out of this file. The review comment text is not present in this extract.)
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521499870 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -664,14 +690,15 @@ Result<std::shared_ptr<RecordBatch>> ReadRecordBatch( std::shared_ptr<Schema> out_schema; // Empty means do not use std::vector<bool> inclusion_mask; - RETURN_NOT_OK(GetInclusionMaskAndOutSchema(schema, options.included_fields, + DictionaryKind kind; Review comment: Yes, it is not used. This is a dummy variable to allocate an IpcReadContext().
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521553659 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ (quoted diff: the same `logical_plan/mod.rs` module-split hunk quoted in the previous comment, truncated; the review comment text is not present in this extract.)
[GitHub] [arrow] github-actions[bot] commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
github-actions[bot] commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725616996 Revision: 2c3dd4279cd5fd9d403bc8aa1f4e0ec3323d081c Submitted crossbow builds: [ursa-labs/crossbow @ actions-701](https://github.com/ursa-labs/crossbow/branches/all?query=actions-701)

|Task|Status|
|----|------|
|homebrew-r-autobrew|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-701-travis-homebrew-r-autobrew.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|test-r-linux-as-cran|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-701-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-701-github-test-r-linux-as-cran)|
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rhub-ubuntu-gcc-release)|
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rocker-r-base-latest)|
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-bionic)|
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-centos6)|
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-centos8)|
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-opensuse15)|
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-opensuse42)|
|test-ubuntu-18.04-r-sanitizer|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-ubuntu-18.04-r-sanitizer)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-ubuntu-18.04-r-sanitizer)|
[GitHub] [arrow] bkietz commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
bkietz commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725571750 @nealrichardson @romainfrancois I've rebased and
- moved forward declarations to c++ headers, so it's no longer necessary to forward declare things in arrow_exports.h
- #include all type_fwd.h headers in arrow_types.h so all forward declarations are visible to `r/src/*`
- moved specializations of `r6_class_name` to arrow_types.h so they'll also be visible to `r/src/*`
- deleted arrow_exports.h; arrowExports.cpp now includes arrow_types.h
[GitHub] [arrow] github-actions[bot] commented on pull request #8641: ARROW-8853: [Rust] [Integration Testing] Enable Flight tests
github-actions[bot] commented on pull request #8641: URL: https://github.com/apache/arrow/pull/8641#issuecomment-725588664 https://issues.apache.org/jira/browse/ARROW-8853
[GitHub] [arrow] pitrou commented on a change in pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
pitrou commented on a change in pull request #8642: URL: https://github.com/apache/arrow/pull/8642#discussion_r521577628 ## File path: cpp/src/arrow/compute/kernels/scalar_cast_string.cc ## @@ -92,12 +93,74 @@
 struct Utf8Validator { };
 template
-struct BinaryToStringSameWidthCastFunctor {
+constexpr bool has_smaller_width() {
+  return sizeof(typename I::offset_type) < sizeof(typename O::offset_type);
+}
+
+template <typename I, typename O>
+constexpr bool has_same_width() {
+  return sizeof(typename I::offset_type) == sizeof(typename O::offset_type);
+}
+
+// Cast same-width offsets (no-op)
+template <typename I, typename O>
+void CastBinaryToBinaryOffsets(enable_if_t<has_same_width<I, O>(), KernelContext*> ctx,
+                               const ArrayData& input, ArrayData* output) {}
+
+// Upcast offsets
+template <typename I, typename O>
+void CastBinaryToBinaryOffsets(enable_if_t<has_smaller_width<I, O>(), KernelContext*> ctx,
+                               const ArrayData& input, ArrayData* output) {
Review comment: Lack of `constexpr if` risks producing compile errors.
[GitHub] [arrow] bkietz commented on a change in pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
bkietz commented on a change in pull request #8642: URL: https://github.com/apache/arrow/pull/8642#discussion_r521577222 ## File path: cpp/src/arrow/compute/kernels/scalar_cast_string.cc ## @@ -92,12 +93,74 @@ (quoted diff: the same hunk shown in the previous comment) Review comment:
```suggestion
template <typename I, typename O>
void CastBinaryToBinaryOffsets(
    enable_if_t<has_same_width<I, O>() || has_smaller_width<I, O>(), KernelContext*> ctx,
    const ArrayData& input, ArrayData* output) {
  if (has_smaller_width<I, O>()) {
    // ...
  } else {
    // ...
  }
```
[GitHub] [arrow] pitrou commented on a change in pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
pitrou commented on a change in pull request #8642: URL: https://github.com/apache/arrow/pull/8642#discussion_r521584030 ## File path: cpp/src/arrow/compute/kernels/scalar_cast_string.cc ## @@ -92,12 +93,74 @@ (quoted diff: the same hunk shown in the comments above) Review comment: Horror.
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
jorgecarleitao commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521475124 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -1264,31 +1050,6 @@ mod tests { Ok(()) }
-#[test]
-fn test_visitor() {
-let schema = Schema::new(vec![]);
-assert_eq!("[]", format!("{}", display_schema(&schema)));
-}
-
-#[test]
-fn test_display_empty_schema() {
-let schema = Schema::new(vec![]);
-assert_eq!("[]", format!("{}", display_schema(&schema)));
-}
-
-#[test]
-fn test_display_schema() {
-let schema = Schema::new(vec![
-Field::new("id", DataType::Int32, false),
-Field::new("first_name", DataType::Utf8, true),
-]);
-
-assert_eq!(
-"[id:Int32, first_name:Utf8;N]",
-format!("{}", display_schema(&schema))
-);
-}
Review comment: Maybe one of the tests disappeared in this commit? (I count -3 +2) ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ (quoted diff: the same module-split hunk shown in the earlier comments on this file, truncated)
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521574396 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ (quoted diff, truncated: the deterministic `create_primitive_array`, `create_u8_array_with_nulls`, and `create_bool_array` helpers are replaced with versions that fill arrays from a thread-local RNG at a given null/trues density, "to avoid spurious compiler optimizations wrt to branching") Review comment: @jorgecarleitao the filter benchmarks may not simulate real-world use cases, but they are designed to test the code under specific conditions, such as the worst-case scenario of alternating 1s and 0s, where no batch can be skipped and all selected values have to be copied individually. How can this scenario be achieved with a randomly generated filter array? The other scenarios, mostly 0s (best performance, because most filter batches can be skipped and only a small number of selected values have to be copied) and mostly 1s (not as fast, but still faster than the worst case, because filter batches can be checked quickly and most values are copied in slices), should be easier to achieve with random filter arrays, but are they going to be repeatable?
[GitHub] [arrow] bkietz commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
bkietz commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725615670 [ASAN UBSAN CI failure](https://github.com/apache/arrow/pull/8256/checks?check_run_id=1386783871#step:8:2353) is ARROW-10525
[GitHub] [arrow] nealrichardson commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
nealrichardson commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725616136 @github-actions crossbow submit -g r
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521588710 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ (quoted diff: the same filter_kernels.rs benchmark hunk quoted in the previous comment, truncated) Review comment: @jorgecarleitao once the approach to the filter benchmarks has been finalized, would it be possible to rearrange the commits so that the same benchmark code can be used to benchmark both the old and new implementations, so that we can do a direct comparison?
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521467373 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ (quoted diff: the same module-split hunk shown in the earlier comments on this file) +mod builder; Review comment: This module now `pub use`s all `struct`s, `enum`s, and `trait`s that were public in this module, so no client code should need to change.
[GitHub] [arrow] alamb commented on pull request #8619: ARROW-10531: [Rust][DataFusion]: Add schema and graphviz formatting for LogicalPlans and a PlanVisitor
alamb commented on pull request #8619: URL: https://github.com/apache/arrow/pull/8619#issuecomment-725513681 https://github.com/apache/arrow/pull/8639 is the PR for breaking up the logical_plan module
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521501051 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -699,42 +726,45 @@ Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo, // Look up the dictionary value type, which must have been added to the // DictionaryMemo already prior to invoking this function - ARROW_ASSIGN_OR_RAISE(auto value_type, dictionary_memo->GetDictionaryType(id)); + ARROW_ASSIGN_OR_RAISE(auto value_type, context.dictionary_memo->GetDictionaryType(id)); // Load the dictionary data from the dictionary batch ArrayLoader loader(batch_meta, internal::GetMetadataVersion(message->version()), - options, file); - const auto dict_data = std::make_shared<ArrayData>(); + context.options, file); + auto dict_data = std::make_shared<ArrayData>(); const Field dummy_field("", value_type); RETURN_NOT_OK(loader.Load(&dummy_field, dict_data.get())); if (compression != Compression::UNCOMPRESSED) { ArrayDataVector dict_fields{dict_data}; -RETURN_NOT_OK(DecompressBuffers(compression, options, &dict_fields)); +RETURN_NOT_OK(DecompressBuffers(compression, context.options, &dict_fields)); + } + + // swap endian in dict_data if necessary (swap_endian == true) + if (context.swap_endian) { +SwapEndianArrayData(dict_data); } if (dictionary_batch->isDelta()) { -if (kind != nullptr) { - *kind = DictionaryKind::Delta; +if (context.kind != nullptr) { + *context.kind = DictionaryKind::Delta; } -return dictionary_memo->AddDictionaryDelta(id, dict_data); +return context.dictionary_memo->AddDictionaryDelta(id, dict_data); } ARROW_ASSIGN_OR_RAISE(bool inserted, -dictionary_memo->AddOrReplaceDictionary(id, dict_data)); - if (kind != nullptr) { -*kind = inserted ? DictionaryKind::New : DictionaryKind::Replacement; +context.dictionary_memo->AddOrReplaceDictionary(id, dict_data)); + if (context.kind != nullptr) { +*context.kind = inserted ? DictionaryKind::New : DictionaryKind::Replacement; } Review comment: I see. 
@kou Is it ok to exclude `param` from `IpcReadContext`?
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521541312 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ //! Logical query plans can then be optimized and executed directly, or translated into //! physical query plans and executed. -use std::fmt::{self, Debug, Display}; -use std::{any::Any, collections::HashMap, collections::HashSet, sync::Arc}; - -use aggregates::{AccumulatorFunctionImplementation, StateTypeFunction}; -use arrow::{ -compute::can_cast_types, -datatypes::{DataType, Field, Schema, SchemaRef}, -}; - -use crate::datasource::parquet::ParquetTable; -use crate::datasource::TableProvider; -use crate::error::{DataFusionError, Result}; -use crate::{ -datasource::csv::{CsvFile, CsvReadOptions}, -physical_plan::udaf::AggregateUDF, -scalar::ScalarValue, -}; -use crate::{ -physical_plan::{ -aggregates, expressions::binary_operator_data_type, functions, udf::ScalarUDF, -}, -sql::parser::FileType, -}; -use arrow::record_batch::RecordBatch; -use functions::{ReturnTypeFunction, ScalarFunctionImplementation, Signature}; - +mod builder; +mod display; +mod expr; +mod extension; mod operators; +mod plan; +mod registry; + +pub use builder::LogicalPlanBuilder; +pub use display::display_schema; +pub use expr::{ +and, array, avg, binary_expr, col, concat, count, create_udaf, create_udf, +exprlist_to_fields, length, lit, max, min, sum, Expr, Literal, +}; +pub use extension::UserDefinedLogicalNode; pub use operators::Operator; - -fn create_function_name( -fun: , -distinct: bool, -args: &[Expr], -input_schema: , -) -> Result { -let names: Vec = args -.iter() -.map(|e| create_name(e, input_schema)) -.collect::>()?; -let distinct_str = match distinct { -true => "DISTINCT ", -false => "", -}; -Ok(format!("{}({}{})", fun, distinct_str, names.join(","))) -} - -/// Returns a readable name of an expression based on the input schema. 
-/// This function recursively transverses the expression for names such as "CAST(a > 2)". -fn create_name(e: , input_schema: ) -> Result { -match e { -Expr::Alias(_, name) => Ok(name.clone()), -Expr::Column(name) => Ok(name.clone()), -Expr::ScalarVariable(variable_names) => Ok(variable_names.join(".")), -Expr::Literal(value) => Ok(format!("{:?}", value)), -Expr::BinaryExpr { left, op, right } => { -let left = create_name(left, input_schema)?; -let right = create_name(right, input_schema)?; -Ok(format!("{} {:?} {}", left, op, right)) -} -Expr::Cast { expr, data_type } => { -let expr = create_name(expr, input_schema)?; -Ok(format!("CAST({} AS {:?})", expr, data_type)) -} -Expr::Not(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("NOT {}", expr)) -} -Expr::IsNull(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("{} IS NULL", expr)) -} -Expr::IsNotNull(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("{} IS NOT NULL", expr)) -} -Expr::ScalarFunction { fun, args, .. } => { -create_function_name(_string(), false, args, input_schema) -} -Expr::ScalarUDF { fun, args, .. } => { -create_function_name(, false, args, input_schema) -} -Expr::AggregateFunction { -fun, -distinct, -args, -.. -} => create_function_name(_string(), *distinct, args, input_schema), -Expr::AggregateUDF { fun, args } => { -let mut names = Vec::with_capacity(args.len()); -for e in args { -names.push(create_name(e, input_schema)?); -} -Ok(format!("{}({})", fun.name, names.join(","))) -} -other => Err(DataFusionError::NotImplemented(format!( -"Physical plan does not support logical expression {:?}", -other -))), -} -} - -/// Create field meta-data from an expression, for use in a result set schema -pub fn exprlist_to_fields(expr: &[Expr], input_schema: ) -> Result> { -expr.iter().map(|e| e.to_field(input_schema)).collect() -} - -/// `Expr` is a logical expression. A logical expression is something like `1 + 1`, or `CAST(c1 AS int)`. 
-/// Logical expressions know how to compute its [arrow::datatypes::DataType] and nullability. -/// `Expr` is a central struct of DataFusion's query API. -/// -/// # Examples -/// -/// ``` -/// # use datafusion::logical_plan::Expr; -/// # use datafusion::error::Result; -/// # fn main() -> Result<()> { -/// let expr = Expr::Column("c1".to_string()) + Expr::Column("c2".to_string()); -/// println!("{:?}", expr); -/// #
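The removed `create_name` function above builds a readable display name by recursing over the expression tree. A minimal, self-contained sketch of that pattern (the `Expr` variants here are a hypothetical subset for illustration, not DataFusion's full enum):

```rust
// Hypothetical, trimmed-down expression tree; DataFusion's real `Expr` has many more variants.
#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(i64),
    BinaryExpr { left: Box<Expr>, op: String, right: Box<Expr> },
    Cast { expr: Box<Expr>, data_type: String },
}

// Recursively render a readable name, mirroring the shape of the removed `create_name`.
fn create_name(e: &Expr) -> String {
    match e {
        Expr::Column(name) => name.clone(),
        Expr::Literal(v) => format!("{}", v),
        Expr::BinaryExpr { left, op, right } => {
            format!("{} {} {}", create_name(left), op, create_name(right))
        }
        Expr::Cast { expr, data_type } => {
            format!("CAST({} AS {})", create_name(expr), data_type)
        }
    }
}

fn main() {
    let e = Expr::Cast {
        expr: Box::new(Expr::BinaryExpr {
            left: Box::new(Expr::Column("a".to_string())),
            op: ">".to_string(),
            right: Box::new(Expr::Literal(2)),
        }),
        data_type: "Int32".to_string(),
    };
    println!("{}", create_name(&e)); // CAST(a > 2 AS Int32)
}
```

Note how each branch only formats its own node and delegates to recursion for the children, which is what keeps the real function short despite the many variants.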
[GitHub] [arrow] nevi-me commented on a change in pull request #8641: ARROW-8853: [Rust] [Integration Testing] Enable Flight tests
nevi-me commented on a change in pull request #8641: URL: https://github.com/apache/arrow/pull/8641#discussion_r521574538 ## File path: rust/integration-testing/src/bin/flight-test-integration-client.rs ## @@ -0,0 +1,377 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +use arrow_integration_testing::{ +read_json_file, ArrowFile, AUTH_PASSWORD, AUTH_USERNAME, +}; + +use arrow::datatypes::SchemaRef; +use arrow::record_batch::RecordBatch; + +use arrow_flight::flight_service_client::FlightServiceClient; +use arrow_flight::{ +flight_descriptor::DescriptorType, BasicAuth, FlightData, HandshakeRequest, Location, +Ticket, +}; +use arrow_flight::{utils::flight_data_to_arrow_batch, FlightDescriptor}; + +use clap::{App, Arg}; +use futures::{channel::mpsc, sink::SinkExt, StreamExt}; +use prost::Message; +use tonic::{metadata::MetadataValue, Request, Status}; + +use std::sync::Arc; + +type Error = Box; +type Result = std::result::Result; + +type Client = FlightServiceClient; + +#[tokio::main] +async fn main() -> Result { +let matches = App::new("rust flight-test-integration-client") +.arg(Arg::with_name("host").long("host").takes_value(true)) +.arg(Arg::with_name("port").long("port").takes_value(true)) +.arg(Arg::with_name("path").long("path").takes_value(true)) +.arg( +Arg::with_name("scenario") +.long("scenario") +.takes_value(true), +) +.get_matches(); + +let host = matches.value_of("host").expect("Host is required"); +let port = matches.value_of("port").expect("Port is required"); + +match matches.value_of("scenario") { +Some("middleware") => middleware_scenario(host, port).await?, +Some("auth:basic_proto") => auth_basic_proto_scenario(host, port).await?, +Some(scenario_name) => unimplemented!("Scenario not found: {}", scenario_name), +None => { +let path = matches +.value_of("path") +.expect("Path is required if scenario is not specified"); +integration_test_scenario(host, port, path).await?; +} +} + +Ok(()) +} + +async fn middleware_scenario(host: , port: ) -> Result { +let url = format!("http://{}:{};, host, port); +let conn = tonic::transport::Endpoint::new(url)?.connect().await?; +let mut client = FlightServiceClient::with_interceptor(conn, middleware_interceptor); + +let mut descriptor = FlightDescriptor::default(); 
+descriptor.set_type(DescriptorType::Cmd); +descriptor.cmd = b"".to_vec(); + +// This call is expected to fail. +let resp = client +.get_flight_info(Request::new(descriptor.clone())) +.await; +match resp { +Ok(_) => return Err(Box::new(Status::internal("Expected call to fail"))), +Err(e) => { +let headers = e.metadata(); +let middleware_header = headers.get("x-middleware"); +let value = middleware_header.map(|v| v.to_str().unwrap()).unwrap_or(""); + +if value != "expected value" { +let msg = format!( +"Expected to receive header 'x-middleware: expected value', \ +but instead got: '{}'", +value +); +return Err(Box::new(Status::internal(msg))); +} + +eprintln!("Headers received successfully on failing call."); +} +} + +// This call should succeed +descriptor.cmd = b"success".to_vec(); +let resp = client.get_flight_info(Request::new(descriptor)).await?; + +let headers = resp.metadata(); +let middleware_header = headers.get("x-middleware"); +let value = middleware_header.map(|v| v.to_str().unwrap()).unwrap_or(""); + +if value != "expected value" { +let msg = format!( +"Expected to receive header 'x-middleware: expected value', \ +but instead got: '{}'", +value +); +return Err(Box::new(Status::internal(msg))); +} + +eprintln!("Headers received successfully on passing call."); + +Ok(()) +} + +fn middleware_interceptor(mut req: Request<()>) ->
[GitHub] [arrow] vivkong commented on pull request #8011: ARROW-9803: [Go] Add initial support for s390x
vivkong commented on pull request #8011: URL: https://github.com/apache/arrow/pull/8011#issuecomment-725525017 Hello @emkornfield, wondering if this can be considered for merging? I've updated the PR to use a constant to check for endianness. Thanks.
[GitHub] [arrow] github-actions[bot] commented on pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
github-actions[bot] commented on pull request #8639: URL: https://github.com/apache/arrow/pull/8639#issuecomment-725524192 https://issues.apache.org/jira/browse/ARROW-10559
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521499370 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -107,6 +108,23 @@ Status InvalidMessageType(MessageType expected, MessageType actual) { // -- // Record batch read path +/// \brief Structure to keep common arguments to be passed +struct IpcReadContext { + IpcReadContext(DictionaryMemo* memo, const IpcReadOptions& option, DictionaryKind* kind, + bool swap) + : dictionary_memo(memo), options(option), kind(kind), swap_endian(swap) {} + + DictionaryMemo* dictionary_memo; + + const IpcReadOptions& options; + + DictionaryKind* kind; Review comment: This style, which uses a mutable pointer, follows the original code. Here is an example. My guess is that this code needs to have a special state `uninitialized`. https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L716-L726
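The pattern under discussion, bundling the shared read arguments into a context struct while keeping a nullable pointer as an optional out-parameter for the dictionary kind, can be sketched in Rust terms as follows (names and shapes are illustrative analogues, not Arrow's actual API):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DictionaryKind {
    New,
    Delta,
    Replacement,
}

// Illustrative analogue of IpcReadContext: shared inputs grouped into one struct,
// with an Option<&mut ...> standing in for the nullable DictionaryKind* pointer.
struct IpcReadContext<'a> {
    swap_endian: bool,
    kind: Option<&'a mut Option<DictionaryKind>>,
}

fn read_dictionary(ctx: &mut IpcReadContext, is_delta: bool) {
    if ctx.swap_endian {
        // a file written on a machine of the other endianness would be byte-swapped here
    }
    let kind = if is_delta {
        DictionaryKind::Delta
    } else {
        DictionaryKind::New
    };
    // Only written when the caller asked for it, like the `kind != nullptr` checks.
    if let Some(out) = ctx.kind.as_mut() {
        **out = Some(kind);
    }
}

fn main() {
    let mut kind = None; // "uninitialized" until a dictionary batch is read
    {
        let mut ctx = IpcReadContext { swap_endian: false, kind: Some(&mut kind) };
        read_dictionary(&mut ctx, true);
    }
    assert_eq!(kind, Some(DictionaryKind::Delta));
}
```

The `Option<DictionaryKind>` makes the "uninitialized" state mentioned in the comment explicit, where the C++ code relies on the caller checking whether the pointer was written.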
[GitHub] [arrow] github-actions[bot] commented on pull request #8640: WIP: ARROW-4193: [Rust] Add support for decimal data type
github-actions[bot] commented on pull request #8640: URL: https://github.com/apache/arrow/pull/8640#issuecomment-725542282 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename the pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] carols10cents opened a new pull request #8641: ARROW-8853: [Rust] [Integration Testing] Enable Flight tests
carols10cents opened a new pull request #8641: URL: https://github.com/apache/arrow/pull/8641 There are some parts of the tests that should probably be part of the `arrow-flight` crate instead, but that can be done in a future PR. Everything here is in the integration tests.
[GitHub] [arrow] bkietz commented on a change in pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
bkietz commented on a change in pull request #8642: URL: https://github.com/apache/arrow/pull/8642#discussion_r521583351 ## File path: cpp/src/arrow/compute/kernels/scalar_cast_string.cc ## @@ -92,12 +93,74 @@ struct Utf8Validator { }; template <typename O, typename I> -struct BinaryToStringSameWidthCastFunctor { +constexpr bool has_smaller_width() { + return sizeof(typename I::offset_type) < sizeof(typename O::offset_type); +} + +template <typename I, typename O> +constexpr bool has_same_width() { + return sizeof(typename I::offset_type) == sizeof(typename O::offset_type); +} + +// Cast same-width offsets (no-op) +template <typename I, typename O> +void CastBinaryToBinaryOffsets(enable_if_t<has_same_width<I, O>(), KernelContext*> ctx, + const ArrayData& input, ArrayData* output) {} + +// Upcast offsets +template <typename I, typename O> +void CastBinaryToBinaryOffsets(enable_if_t<has_smaller_width<I, O>(), KernelContext*> ctx, + const ArrayData& input, ArrayData* output) { Review comment: `if constexpr` can be emulated using `std::conditional_t<has_smaller_width<I, O>(), NarrowerImpl, WiderImpl>::template Exec(ctx, input, output)`
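The kernel above dispatches on offset width, and the interesting case is the offset upcast. A language-neutral sketch of the idea in plain Rust (no Arrow types; `widen_offsets` and `narrow_offsets` are hypothetical helper names): widening 32-bit value offsets (Binary/String) to 64-bit (LargeBinary/LargeString) rebuilds only the offsets buffer, the values buffer is untouched, while the same-width case needs no work at all and the downcast must check for overflow.

```rust
use std::convert::TryFrom;

// Widen i32 value offsets (as used by Binary/String) to i64 (LargeBinary/LargeString).
// The values buffer itself is unchanged; only the offsets buffer is rebuilt.
fn widen_offsets(offsets: &[i32]) -> Vec<i64> {
    offsets.iter().map(|&o| o as i64).collect()
}

// Downcasting must verify that every offset still fits in the narrower type.
fn narrow_offsets(offsets: &[i64]) -> Result<Vec<i32>, String> {
    offsets
        .iter()
        .map(|&o| i32::try_from(o).map_err(|_| format!("offset {} overflows i32", o)))
        .collect()
}

fn main() {
    assert_eq!(widen_offsets(&[0, 3, 7]), vec![0i64, 3, 7]);
    assert_eq!(narrow_offsets(&[0, 3, 7]).unwrap(), vec![0i32, 3, 7]);
    assert!(narrow_offsets(&[0, i64::from(i32::MAX) + 1]).is_err());
}
```

This is why the C++ kernel only needs distinct code paths for "smaller", "same", and (elsewhere) "larger" offset widths.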
[GitHub] [arrow] yordan-pavlov commented on pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725613377 @jorgecarleitao thank you for this PR. Overall I think it's a great idea to reuse code between the take and filter kernels if possible, and you have demonstrated how that can be done; we just have to find a way to keep performance at a good level. I have been wondering whether the BitChunkIterator from https://github.com/apache/arrow/blob/master/rust/arrow/src/util/bit_chunk_iterator.rs could be used to improve the filter kernel, so I am happy to see you have already done so.
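The idea behind using `BitChunkIterator` in the filter kernel is to scan the filter bitmap a machine word at a time, skipping all-zero words entirely and fast-pathing all-one words. A stdlib-only sketch of that strategy over `u64` chunks (this is an illustration of the technique, not the actual arrow crate code):

```rust
// Collect the indices of set bits, processing the packed bitmap 64 bits at a time.
fn selected_indices(bitmap: &[u64], len: usize) -> Vec<usize> {
    let mut out = Vec::new();
    for (chunk_idx, &chunk) in bitmap.iter().enumerate() {
        if chunk == 0 {
            continue; // fast path: nothing selected in these 64 slots
        }
        let base = chunk_idx * 64;
        if chunk == u64::MAX {
            // fast path: everything selected; a real kernel could copy one slice here
            out.extend(base..(base + 64).min(len));
            continue;
        }
        // mixed word: walk only the set bits
        let mut bits = chunk;
        while bits != 0 {
            let bit = bits.trailing_zeros() as usize;
            let idx = base + bit;
            if idx < len {
                out.push(idx);
            }
            bits &= bits - 1; // clear the lowest set bit
        }
    }
    out
}

fn main() {
    // bits 0 and 2 set in the first word, second word all zeros, bit 1 set in the third
    let bitmap = [0b101u64, 0, 0b10];
    assert_eq!(selected_indices(&bitmap, 192), vec![0, 2, 129]);
}
```

A mostly-zero filter touches almost no per-element code at all, which is exactly the "best case" discussed in the benchmark thread below.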
[GitHub] [arrow] nealrichardson commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
nealrichardson commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725639997 @github-actions crossbow submit test-r-linux-as-cran
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521553432 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -699,42 +726,45 @@ Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo, // Look up the dictionary value type, which must have been added to the // DictionaryMemo already prior to invoking this function - ARROW_ASSIGN_OR_RAISE(auto value_type, dictionary_memo->GetDictionaryType(id)); + ARROW_ASSIGN_OR_RAISE(auto value_type, context.dictionary_memo->GetDictionaryType(id)); // Load the dictionary data from the dictionary batch ArrayLoader loader(batch_meta, internal::GetMetadataVersion(message->version()), - options, file); - const auto dict_data = std::make_shared<ArrayData>(); + context.options, file); + auto dict_data = std::make_shared<ArrayData>(); const Field dummy_field("", value_type); RETURN_NOT_OK(loader.Load(&dummy_field, dict_data.get())); if (compression != Compression::UNCOMPRESSED) { ArrayDataVector dict_fields{dict_data}; -RETURN_NOT_OK(DecompressBuffers(compression, options, &dict_fields)); +RETURN_NOT_OK(DecompressBuffers(compression, context.options, &dict_fields)); + } + + // swap endian in dict_data if necessary (swap_endian == true) + if (context.swap_endian) { +SwapEndianArrayData(dict_data); } if (dictionary_batch->isDelta()) { -if (kind != nullptr) { - *kind = DictionaryKind::Delta; +if (context.kind != nullptr) { + *context.kind = DictionaryKind::Delta; } -return dictionary_memo->AddDictionaryDelta(id, dict_data); +return context.dictionary_memo->AddDictionaryDelta(id, dict_data); } ARROW_ASSIGN_OR_RAISE(bool inserted, -dictionary_memo->AddOrReplaceDictionary(id, dict_data)); - if (kind != nullptr) { -*kind = inserted ? DictionaryKind::New : DictionaryKind::Replacement; +context.dictionary_memo->AddOrReplaceDictionary(id, dict_data)); + if (context.kind != nullptr) { +*context.kind = inserted ? 
DictionaryKind::New : DictionaryKind::Replacement; } Review comment: nvm, after your suggestion, I added `param` into `IpcReadContext`. This is because `param` was added only recently. I will keep `DictionaryKind` as an out param.
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521574396 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array<T, F>(size: usize, value_fn: F) -> PrimitiveArray<T> +fn create_primitive_array<T>(size: usize, null_density: f32) -> PrimitiveArray<T> where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution<T::Native>, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); let mut builder = PrimitiveArray::<T>::builder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::<f32>() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::<f32>() < null_density { builder.append_null().unwrap(); +} else { +let value = rng.sample_iter(&Alphanumeric).take(10).collect::<String>(); +builder.append_value(&value).unwrap(); } } builder.finish() } -fn create_bool_array<F>(size: 
usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = rand::thread_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::<f32>() < trues_density; Review comment: the filter benchmarks may not simulate real-world use cases, but they are designed to test the code under specific conditions. One is the worst-case scenario of alternating 1s and 0s, where no batch can be skipped and every selected value has to be copied individually; how can this scenario be achieved with a randomly generated filter array? The other scenarios, mostly 0s (best performance, because most filter batches can be skipped and only a small number of selected values have to be copied) and mostly 1s (not as fast, but still faster than the worst case, because filter batches can be checked quickly and most values are copied in slices), should be easier to achieve with random filter arrays, but are they going to be repeatable?
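On the repeatability question raised above: a randomly generated filter array is reproducible across benchmark runs if the generator is seeded. A stdlib-only sketch using a tiny inline xorshift PRNG (the real benchmarks use the `rand` crate; `create_bool_array` here is a deterministic stand-in for the benchmark helper, returning `Vec<bool>` instead of a `BooleanArray`):

```rust
// Minimal deterministic PRNG (xorshift64), so the same seed yields the same array.
struct XorShift64(u64);

impl XorShift64 {
    fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // map the top 24 bits into [0, 1)
        (self.0 >> 40) as f32 / (1u64 << 24) as f32
    }
}

// Deterministic analogue of create_bool_array(size, trues_density).
fn create_bool_array(size: usize, trues_density: f32, seed: u64) -> Vec<bool> {
    let mut rng = XorShift64(seed); // seed must be nonzero for xorshift
    (0..size).map(|_| rng.next_f32() < trues_density).collect()
}

fn main() {
    let a = create_bool_array(1024, 0.5, 42);
    let b = create_bool_array(1024, 0.5, 42);
    assert_eq!(a, b); // same seed -> identical filter array across benchmark runs
    // densities 0.0 and 1.0 are exact, since next_f32 is in [0, 1)
    assert!(create_bool_array(64, 0.0, 1).iter().all(|&v| !v));
    assert!(create_bool_array(64, 1.0, 1).iter().all(|&v| v));
}
```

Seeding gives the "mostly 0s" and "mostly 1s" scenarios statistically, while the alternating worst case would still need a deterministic `i % 2 == 0` style generator.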
[GitHub] [arrow] carols10cents commented on a change in pull request #8641: ARROW-8853: [Rust] [Integration Testing] Enable Flight tests
carols10cents commented on a change in pull request #8641: URL: https://github.com/apache/arrow/pull/8641#discussion_r521618644 ## File path: rust/integration-testing/src/bin/flight-test-integration-server.rs ## @@ -0,0 +1,634 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +use std::collections::HashMap; +use std::convert::TryFrom; +use std::net::SocketAddr; +use std::pin::Pin; +use std::sync::Arc; + +use clap::{App, Arg}; +use futures::{channel::mpsc, sink::SinkExt, Stream, StreamExt}; +use prost::Message; +use tokio::net::TcpListener; +use tokio::sync::Mutex; +use tonic::transport::Server; +use tonic::{metadata::MetadataMap, Request, Response, Status, Streaming}; + +use arrow::{datatypes::Schema, record_batch::RecordBatch}; +use arrow_flight::{ +flight_descriptor::DescriptorType, flight_service_server::FlightService, +flight_service_server::FlightServiceServer, utils::flight_data_to_arrow_batch, +Action, ActionType, BasicAuth, Criteria, Empty, FlightData, FlightDescriptor, +FlightEndpoint, FlightInfo, HandshakeRequest, HandshakeResponse, Location, PutResult, +SchemaResult, Ticket, +}; + +use arrow_integration_testing::{AUTH_PASSWORD, AUTH_USERNAME}; + +type TonicStream = Pin + Send + Sync + 'static>>; + +#[derive(Debug, Clone)] +struct IntegrationDataset { +schema: Schema, +chunks: Vec, +} + +#[derive(Clone, Default)] +pub struct FlightServiceImpl { +server_location: String, +uploaded_chunks: Arc>>, +} + +#[tonic::async_trait] +impl FlightService for FlightServiceImpl { +type HandshakeStream = TonicStream>; +type ListFlightsStream = TonicStream>; +type DoGetStream = TonicStream>; +type DoPutStream = TonicStream>; +type DoActionStream = TonicStream>; +type ListActionsStream = TonicStream>; +type DoExchangeStream = TonicStream>; + +async fn get_schema( +, +_request: Request, +) -> Result, Status> { +Err(Status::unimplemented("Not yet implemented")) Review comment: Yes, there's nothing I can see that calls these methods in the integration tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
nealrichardson closed pull request #8256: URL: https://github.com/apache/arrow/pull/8256
[GitHub] [arrow] a-campbell opened a new issue #8646: Predicate pushdown question
a-campbell opened a new issue #8646: URL: https://github.com/apache/arrow/issues/8646 Hi Arrow community, I'm new to the project and am trying to understand exactly what happens under the hood when I run a filter-collect query on an Arrow Dataset (backed by Parquet). Let's say I created a Parquet dataset with no file-level partitions; I just wrote a bunch of separate files to a dataset. Now I want to run a query that returns the rows corresponding to a specific range of datetimes in the dataset's dt column. My understanding is that the Dataset API will push this query down to the file level, checking the footer of each file for the min/max value of dt and determining whether that block of rows should be read. Assuming this is correct, a few questions: Will every query result in reading all of the file footers? Is there any caching of these min/max values? Is there a way to profile query performance, or a way to view a query plan before it is executed? I appreciate your time in helping me better understand. Andrew
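Conceptually, the footer-statistics pruning the question describes works like this: each file carries a (min, max) for the column, and a range predicate only needs to read the files whose interval overlaps the query range. A hypothetical stand-alone sketch of that decision (not Arrow's Dataset implementation; names and the epoch-seconds encoding are illustrative):

```rust
// Per-file column statistics, as would be read from a Parquet footer.
struct FileStats {
    path: &'static str,
    min_dt: i64, // e.g. epoch seconds for the `dt` column
    max_dt: i64,
}

// Keep only files whose [min_dt, max_dt] interval overlaps the query range [lo, hi].
fn prune<'a>(files: &'a [FileStats], lo: i64, hi: i64) -> Vec<&'a str> {
    files
        .iter()
        .filter(|f| f.max_dt >= lo && f.min_dt <= hi)
        .map(|f| f.path)
        .collect()
}

fn main() {
    let files = [
        FileStats { path: "part-0.parquet", min_dt: 0, max_dt: 99 },
        FileStats { path: "part-1.parquet", min_dt: 100, max_dt: 199 },
        FileStats { path: "part-2.parquet", min_dt: 200, max_dt: 299 },
    ];
    // Query dt in [150, 250]: only part-1 and part-2 can contain matching rows.
    assert_eq!(prune(&files, 150, 250), vec!["part-1.parquet", "part-2.parquet"]);
}
```

Whether the footers are read once per query or cached is exactly the question being asked, so this sketch deliberately stops at the overlap test itself.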
[GitHub] [arrow] wesm commented on pull request #8461: ARROW-10197: [python][Gandiva] Execute expression on filtered data
wesm commented on pull request #8461: URL: https://github.com/apache/arrow/pull/8461#issuecomment-725667526 @pitrou or @bkietz could one of you have a look?
[GitHub] [arrow] nealrichardson commented on a change in pull request #8579: ARROW-10481: [R] Bindings to add, remove, replace Table columns
nealrichardson commented on a change in pull request #8579: URL: https://github.com/apache/arrow/pull/8579#discussion_r521631559 ## File path: r/R/table.R ## @@ -254,6 +257,68 @@ names.Table <- function(x) x$ColumnNames() #' @export `[[.Table` <- `[[.RecordBatch` +#' @export +`[[<-.Table` <- function(x, i, value) { + if (!is.character(i) & !is.numeric(i)) { +stop("'i' must be character or numeric, not ", class(i), call. = FALSE) + } else if (is.na(i)) { +# Catch if a NA_character or NA_integer is passed. These are caught elsewhere +# in cpp (i.e. _arrow_RecordBatch__column_name) +# TODO: figure out if catching in cpp like ^^^ is preferred Review comment: I suspect that error comes from https://github.com/apache/arrow/pull/8579/files#diff-242f7f20fb4999f3bbe719d6977d2580231b89334326b588d46e9236c2e28fcdR285, i.e. before we hit cpp.
[GitHub] [arrow] bkietz closed pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
bkietz closed pull request #8642: URL: https://github.com/apache/arrow/pull/8642
[GitHub] [arrow] nealrichardson closed pull request #8616: ARROW-10522: [R] Allow rename Table and RecordBatch columns with names()
nealrichardson closed pull request #8616: URL: https://github.com/apache/arrow/pull/8616
[GitHub] [arrow] kou closed pull request #8011: ARROW-9803: [Go] Add initial support for s390x
kou closed pull request #8011: URL: https://github.com/apache/arrow/pull/8011
[GitHub] [arrow] nealrichardson closed pull request #8643: ARROW-10522: [R] Allow rename Table and RecordBatch columns with names()
nealrichardson closed pull request #8643: URL: https://github.com/apache/arrow/pull/8643
[GitHub] [arrow] jorgecarleitao opened a new pull request #8645: ARROW-10561: [Rust] Small simplification of `write` and `write_bytes`
jorgecarleitao opened a new pull request #8645: URL: https://github.com/apache/arrow/pull/8645

This PR addresses 3 small issues on `MutableBuffer`:

1. `write_bytes` is incorrect, as it double-increments `len`: the length is incremented both by `self.write` and by `write_bytes` itself. This leads to more allocations than necessary.

2. `write` is implemented from the trait `io::Write`. However, that trait is meant for fallible IO operations. A write to memory is not really fallible, because we can always call `reserve` to allocate more space.

3. `write` and `write_bytes` are very similar. This PR replaces both `write_bytes` and `write` with `extend_from_slice` (inspired by [`Vec::extend_from_slice`](https://doc.rust-lang.org/std/vec/struct.Vec.html#method.extend_from_slice)). This has the same performance (a single capacity comparison) or better. Compared to the current code, the main difference is that the new method is infallible: it `reserve`s more space whenever the available capacity is insufficient.
[GitHub] [arrow] Bei-z commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
Bei-z commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521719735

## File path: cpp/src/arrow/util/basic_decimal.cc ##

@@ -395,33 +395,49 @@ BasicDecimal128& BasicDecimal128::operator*=(const BasicDecimal128& right) { return *this; }

-/// Expands the given value into an array of ints so that we can work on
-/// it. The array will be converted to an absolute value and the wasNegative
+/// Expands the given little endian array of uint64_t into a big endian array of
+/// uint32_t. The value of input array is expected to be non-negative. The result_array
+/// will remove leading zeros from the input array.
+/// \param value_array a little endian array to represent the value
+/// \param result_array a big endian array of length N*2 to set with the value
+/// \result the output length of the array
+template <size_t N>
+static int64_t FillInArray(const std::array<uint64_t, N>& value_array,
+                           uint32_t* result_array) {
+  int64_t next_index = 0;
+  for (int64_t i = N - 1; i >= 0; i--) {
+    if (value_array[i] != 0) {
+      if (value_array[i] <= std::numeric_limits<uint32_t>::max()) {
+        result_array[next_index++] = static_cast<uint32_t>(value_array[i]);
+        i--;
+      }
+      for (int64_t j = i; j >= 0; j--) {
+        result_array[next_index++] = static_cast<uint32_t>(value_array[j] >> 32);
+        result_array[next_index++] = static_cast<uint32_t>(value_array[j]);
+      }
+      break;
+    }
+  }
+  return next_index;
+}
+
+/// Expands the given value into a big endian array of ints so that we can work on
 /// it. The array will be converted to an absolute value and the was_negative
 /// flag will be set appropriately. The array will remove leading zeros from
 /// the value.
-/// \param array an array of length 4 to set with the value
+/// \param array a big endian array of length 4 to set with the value
 /// \param was_negative a flag for whether the value was original negative
 /// \result the output length of the array
 static int64_t FillInArray(const BasicDecimal128& value, uint32_t* array,
                            bool& was_negative) {
-  uint64_t high;
-  uint64_t low;
-  const int64_t highbits = value.high_bits();
-  const uint64_t lowbits = value.low_bits();
-
-  if (highbits < 0) {
-    low = ~lowbits + 1;
-    high = static_cast<uint64_t>(~highbits);
-    if (low == 0) {
-      ++high;
-    }
-    was_negative = true;
-  } else {
-    low = lowbits;
-    high = static_cast<uint64_t>(highbits);
-    was_negative = false;
-  }
-
+  BasicDecimal128 abs_value = BasicDecimal128::Abs(value);
+  was_negative = value.high_bits() < 0;
+  uint64_t high = static_cast<uint64_t>(abs_value.high_bits());
+  uint64_t low = abs_value.low_bits();
+
+  // FillInArray(std::array<uint64_t, N>& value_array, uint32_t* result_array) is not
+  // called here as the following code has better performance, to avoid regression on
+  // BasicDecimal128 Division.
   if (high != 0) {
     if (high > std::numeric_limits<uint32_t>::max()) {

Review comment: Ah, sorry, I just got what you mean here! And yes, it should be strict inequality in both places. Changed accordingly.
[GitHub] [arrow] Bei-z commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
Bei-z commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521719841

## File path: cpp/src/arrow/util/basic_decimal.cc ##

@@ -490,49 +527,60 @@ static void FixDivisionSigns(BasicDecimal128* result, BasicDecimal128* remainder

-/// \brief Build a BasicDecimal128 from a list of ints.
+/// \brief Build a little endian array of uint64_t from a big endian array of uint32_t.
+template <size_t N>
+static DecimalStatus BuildFromArray(std::array<uint64_t, N>* result_array,
+                                    uint32_t* array, int64_t length) {
+  for (int64_t i = length - 2 * N - 1; i >= 0; i--) {
+    if (array[i] != 0) {
+      return DecimalStatus::kOverflow;
+    }
+  }
+  int64_t next_index = length - 1;
+  for (size_t i = 0; i < N; i++) {
+    uint64_t lower_bits = (next_index < 0) ? 0 : array[next_index--];
+    (*result_array)[i] =
+        (next_index < 0)

Review comment: In this case, would you recommend keeping it this way, or changing back to one loop?