[GitHub] [arrow] liyafan82 commented on pull request #8210: ARROW-10031: [CI][Java] Support Java benchmark in Ursabot
liyafan82 commented on pull request #8210: URL: https://github.com/apache/arrow/pull/8210#issuecomment-725281667

> At my end, I can generate the following JSON file by `archery benchmark diff --language=java ...`
>
> @liyafan82 any comments regarding format and parameters are appreciated.
>
> ```
> benchmark baseline contender change % counters
> 0 org.apache.arrow.vector.IntBenchmarks.setIntDirectly 22.500 us/op 11.209 us/op -50.182 {'mode': 'avgt', 'threads': 1, 'warmups': 5, 'warmupTime': '10 s', 'measurements': 4, 'measurementTime': '10 s', 'jvmArgs': ['-Darrow.enable_null_check_for_get=false -Darrow.enable_unsafe_memory_access=true']}
> 2 org.apache.arrow.vector.IntBenchmarks.setWithValueHolder 19.031 us/op 6.627 us/op -65.179 {'mode': 'avgt', 'threads': 1, 'warmups': 5, 'warmupTime': '10 s', 'measurements': 4, 'measurementTime': '10 s', 'jvmArgs': ['-Darrow.enable_null_check_for_get=false -Darrow.enable_unsafe_memory_access=true']}
> 1 org.apache.arrow.vector.IntBenchmarks.setWithWriter 32.626 us/op 10.246 us/op -68.594 {'mode': 'avgt', 'threads': 1, 'warmups': 5, 'warmupTime': '10 s', 'measurements': 4, 'measurementTime': '10 s', 'jvmArgs': ['-Darrow.enable_null_check_for_get=false -Darrow.enable_unsafe_memory_access=true']}
> ```

@kiszk Thanks for your effort. Generally, it looks great! Some minor comments:
1. Would it be clearer to rename the title 'counters' to 'configuration'?
2. I am curious how the benchmarks are sorted. By 'change %'?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
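For reference, the `change %` column in the table above is the relative difference between contender and baseline, which is consistent with the numbers shown (22.500 → 11.209 us/op gives about -50.18%). A minimal sketch; the function name is ours, not archery's:

```rust
// Relative change of contender vs. baseline, in percent.
// Negative means the contender is faster for time-based units such as us/op.
fn change_pct(baseline: f64, contender: f64) -> f64 {
    (contender - baseline) / baseline * 100.0
}

fn main() {
    // Matches the first row above: 22.500 us/op -> 11.209 us/op.
    println!("{:.3}", change_pct(22.500, 11.209));
}
```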
[GitHub] [arrow] github-actions[bot] commented on pull request #8637: ARROW-10021: [C++][Compute] Return top-n modes in mode kernel
github-actions[bot] commented on pull request #8637: URL: https://github.com/apache/arrow/pull/8637#issuecomment-725320345 https://issues.apache.org/jira/browse/ARROW-10021
[GitHub] [arrow] vertexclique commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725349668 Answered
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
jorisvandenbossche commented on a change in pull request #8621: URL: https://github.com/apache/arrow/pull/8621#discussion_r521311397

## File path: cpp/src/arrow/compute/kernels/scalar_string.cc

## @@ -1231,6 +1251,305 @@ Result StrptimeResolve(KernelContext* ctx, const std::vector

+template <typename Type, bool left, bool right, typename Derived>
+struct UTF8TrimWhitespaceBase : StringTransform<Type, Derived> {
+  using Base = StringTransform<Type, Derived>;
+  using offset_type = typename Base::offset_type;
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    const uint8_t* begin = input;
+    const uint8_t* end = input + input_string_ncodeunits;
+    const uint8_t* end_trimmed = end;
+    const uint8_t* begin_trimmed = begin;
+
+    auto predicate = [](uint32_t c) { return !IsSpaceCharacterUnicode(c); };
+    if (left && !ARROW_PREDICT_TRUE(
+                    arrow::util::UTF8FindIf(begin, end, predicate, &begin_trimmed))) {
+      return false;
+    }
+    if (right && (begin_trimmed < end)) {
+      if (!ARROW_PREDICT_TRUE(arrow::util::UTF8FindIfReverse(begin_trimmed, end,
+                                                             predicate, &end_trimmed))) {
+        return false;
+      }
+    }
+    std::copy(begin_trimmed, end_trimmed, output);
+    *output_written = static_cast<offset_type>(end_trimmed - begin_trimmed);
+    return true;
+  }
+  void Execute(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    EnsureLookupTablesFilled();
+    Base::Execute(ctx, batch, out);
+  }
+};
+
+template <typename Type>
+struct UTF8TrimWhitespace
+    : UTF8TrimWhitespaceBase<Type, true, true, UTF8TrimWhitespace<Type>> {};
+
+template <typename Type>
+struct UTF8LTrimWhitespace
+    : UTF8TrimWhitespaceBase<Type, true, false, UTF8LTrimWhitespace<Type>> {};
+
+template <typename Type>
+struct UTF8RTrimWhitespace
+    : UTF8TrimWhitespaceBase<Type, false, true, UTF8RTrimWhitespace<Type>> {};
+
+template <typename Type, bool left, bool right, typename Derived>
+struct UTF8TrimBase : StringTransform<Type, Derived> {
+  using Base = StringTransform<Type, Derived>;
+  using offset_type = typename Base::offset_type;
+  using State = OptionsWrapper<TrimOptions>;
+  TrimOptions options;
+  std::vector<bool> codepoints;
+
+  explicit UTF8TrimBase(TrimOptions options) : options(options) {
+    // TODO: check return / can we raise an exception here?
+    arrow::util::UTF8ForEach(options.characters, [&](uint32_t c) {
+      codepoints.resize(std::max(c + 1, static_cast<uint32_t>(codepoints.size())));
+      codepoints.at(c) = true;
+    });
+  }
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    TrimOptions options = State::Get(ctx);
+    Derived(options).Execute(ctx, batch, out);
+  }
+
+  void Execute(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    EnsureLookupTablesFilled();
+    Base::Execute(ctx, batch, out);
+  }
+
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    const uint8_t* begin = input;
+    const uint8_t* end = input + input_string_ncodeunits;
+    const uint8_t* end_trimmed = end;
+    const uint8_t* begin_trimmed = begin;
+
+    auto predicate = [&](uint32_t c) {
+      bool contains = codepoints[c];
+      return !contains;
+    };
+    if (left && !ARROW_PREDICT_TRUE(
+                    arrow::util::UTF8FindIf(begin, end, predicate, &begin_trimmed))) {
+      return false;
+    }
+    if (right && (begin_trimmed < end)) {
+      if (!ARROW_PREDICT_TRUE(arrow::util::UTF8FindIfReverse(begin_trimmed, end,
+                                                             predicate, &end_trimmed))) {
+        return false;
+      }
+    }
+    std::copy(begin_trimmed, end_trimmed, output);
+    *output_written = static_cast<offset_type>(end_trimmed - begin_trimmed);
+    return true;
+  }
+};
+template <typename Type>
+struct UTF8Trim : UTF8TrimBase<Type, true, true, UTF8Trim<Type>> {
+  using Base = UTF8TrimBase<Type, true, true, UTF8Trim<Type>>;
+  using Base::Base;
+};
+
+template <typename Type>
+struct UTF8LTrim : UTF8TrimBase<Type, true, false, UTF8LTrim<Type>> {
+  using Base = UTF8TrimBase<Type, true, false, UTF8LTrim<Type>>;
+  using Base::Base;
+};
+
+template <typename Type>
+struct UTF8RTrim : UTF8TrimBase<Type, false, true, UTF8RTrim<Type>> {
+  using Base = UTF8TrimBase<Type, false, true, UTF8RTrim<Type>>;
+  using Base::Base;
+};
+
+#endif
+
+template <typename Type, bool left, bool right, typename Derived>
+struct AsciiTrimWhitespaceBase : StringTransform<Type, Derived> {
+  using offset_type = typename Type::offset_type;
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    const uint8_t* begin = input;
+    const uint8_t* end = input + input_string_ncodeunits;
+    const uint8_t* end_trimmed = end;
+
+    auto predicate = [](unsigned char c) { return !IsSpaceCharacterAscii(c); };
+    const uint8_t* begin_trimmed = left ? std::find_if(begin, end, predicate) : begin;
+    if (right & (begin_trimmed < end)) {
+      std::reverse_iterator<const uint8_t*> rbegin(end);
+      std::reverse_iterator<const uint8_t*> rend(begin_trimmed);
+      end_trimmed = std::find_if(rbegin, rend, predicate).base();
+    }
+    std::copy(begin_trimmed, end_trimmed, output);
+    *output_written = static_cast<offset_type>(end_trimmed - begin_trimmed);
+    return true;
+  }
+};
+
+template <typename Type>
+struct AsciiTrimWhitespace
+    : AsciiTrimWhitespaceBase<Type, true, true, AsciiTrimWhitespace<Type>> {};
+
+template <typename Type>
+struct AsciiLTrimWhitespace
+    : AsciiTrimWhitespaceBase<
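The ASCII path in the diff above is essentially a `find_if` from the front plus a `find_if` over reverse iterators from the back. A rough Rust translation of that shape (the function name is ours, not Arrow's):

```rust
// Trim ASCII whitespace from both ends of a byte slice, mirroring the
// find_if / reverse find_if structure of the kernel above.
fn ascii_trim(input: &[u8]) -> &[u8] {
    let not_space = |b: &u8| !b.is_ascii_whitespace();
    // First byte that is not whitespace (or len() if all bytes are whitespace).
    let begin = input.iter().position(not_space).unwrap_or(input.len());
    // One past the last non-whitespace byte; empty slice if all whitespace.
    let end = input.iter().rposition(not_space).map_or(begin, |i| i + 1);
    &input[begin..end]
}

fn main() {
    assert_eq!(ascii_trim(b"  hello \t"), &b"hello"[..]);
    assert_eq!(ascii_trim(b"   "), &b""[..]);
}
```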
[GitHub] [arrow] alamb commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
alamb commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725397069 I am checking / testing this PR out locally.
[GitHub] [arrow] kiszk commented on pull request #8210: ARROW-10031: [CI][Java] Support Java benchmark in Ursabot
kiszk commented on pull request #8210: URL: https://github.com/apache/arrow/pull/8210#issuecomment-725290440 For 1, I will rename the title for cpp and Java. For 2, you are right; they are sorted [here](https://github.com/apache/arrow/blob/master/dev/archery/archery/cli.py#L595).
[GitHub] [arrow] cyb70289 opened a new pull request #8637: ARROW-10021: [C++][Compute] Return top-n modes in mode kernel
cyb70289 opened a new pull request #8637: URL: https://github.com/apache/arrow/pull/8637 This patch generalizes the mode kernel to return the top-n modes. There is no performance difference for normal benchmarks, and about a 20% performance drop for 100%-null benchmarks.
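For intuition, "top-n modes" means the n most frequent values together with their counts. A hash-count-and-sort sketch in Rust (ours, not the C++ kernel's implementation; ties are broken here by the smaller value):

```rust
use std::collections::HashMap;

// Return up to n (value, count) pairs, most frequent first;
// ties between equally frequent values go to the smaller value.
fn top_n_modes(values: &[i64], n: usize) -> Vec<(i64, i64)> {
    let mut counts: HashMap<i64, i64> = HashMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    let mut pairs: Vec<(i64, i64)> = counts.into_iter().collect();
    // Descending by count, then ascending by value for a deterministic order.
    pairs.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    pairs.truncate(n);
    pairs
}

fn main() {
    assert_eq!(top_n_modes(&[1, 1, 2, 2, 2, 3], 2), vec![(2, 3), (1, 2)]);
}
```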
[GitHub] [arrow] pitrou commented on pull request #8626: ARROW-10545: [C++] Fix crash on invalid Parquet file (OSS-Fuzz)
pitrou commented on pull request #8626: URL: https://github.com/apache/arrow/pull/8626#issuecomment-725298582 As I said above, this will break CI until #8617 is merged.
[GitHub] [arrow] maartenbreddels commented on pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
maartenbreddels commented on pull request #8621: URL: https://github.com/apache/arrow/pull/8621#issuecomment-725345905

The `std::vector<bool>` was a good idea, and indeed because of its bit packing, the memory usage for Unicode isn't that heavy (most extreme: `0x10FFFF bits = 140kb` in case of a contiguous array implementation).

Benchmarks:

```
set:
TrimManyAscii_median     28346892 ns   28345125 ns   25    558.956MB/s   35.2794M items/s
TrimManyUtf8_median      28302644 ns   28294883 ns   25    559.949MB/s   35.3421M items/s

unordered_set:
TrimManyAscii_median     32017530 ns   32014024 ns   22    494.898MB/s   31.2363M items/s
TrimManyUtf8_median      (not run)

vector<bool>:
TrimManyAscii_median     14911543 ns   14910620 ns   47    1062.58MB/s   67.0663M items/s
TrimManyUtf8_median      16148001 ns   16146053 ns   44    981.273MB/s   61.9346M items/s

bitset<256>:
TrimManyAscii_median     14304925 ns   14304010 ns   49    1107.64MB/s   69.9105M items/s
```

`vector<bool>` is good enough I think; the bitset is consistently faster (5%), but I'd rather have similar code for both solutions.
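The lookup-table idea being benchmarked can be sketched in Rust as follows. All names here are ours; note that Rust's `Vec<bool>` stores one byte per entry, unlike C++ `std::vector<bool>`, which is bit-packed, so a bit-vector would be needed to match the 140kb figure:

```rust
// Build a per-code-point membership table for the trim characters,
// growing lazily like the C++ codepoints vector.
fn make_table(chars: &str) -> Vec<bool> {
    let mut table = Vec::new();
    for c in chars.chars() {
        let idx = c as usize;
        if table.len() <= idx {
            table.resize(idx + 1, false);
        }
        table[idx] = true;
    }
    table
}

// Trim code points listed in the table from both ends of a string.
fn trim_with_table<'a>(s: &'a str, table: &[bool]) -> &'a str {
    let in_table = |c: char| table.get(c as usize).copied().unwrap_or(false);
    // First char not in the table (or len() if everything is trimmed away).
    let start = s
        .char_indices()
        .find(|&(_, c)| !in_table(c))
        .map(|(i, _)| i)
        .unwrap_or(s.len());
    // One past the last char not in the table.
    let end = s
        .char_indices()
        .rev()
        .find(|&(_, c)| !in_table(c))
        .map(|(i, c)| i + c.len_utf8())
        .unwrap_or(start);
    &s[start..end]
}

fn main() {
    let table = make_table("xy");
    assert_eq!(trim_with_table("xxyhiyx", &table), "hi");
}
```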
[GitHub] [arrow] alamb closed pull request #8636: ARROW-10552: [Rust] Removed un-used Result
alamb closed pull request #8636: URL: https://github.com/apache/arrow/pull/8636
[GitHub] [arrow] pitrou commented on a change in pull request #8637: ARROW-10021: [C++][Compute] Return top-n modes in mode kernel
pitrou commented on a change in pull request #8637: URL: https://github.com/apache/arrow/pull/8637#discussion_r521255222

## File path: docs/source/cpp/compute.rst

## @@ -140,7 +140,7 @@ Aggregations

 +----------+--------+----------+-------------------+-------------------------+
 | min_max  | Unary  | Numeric  | Scalar Struct (1) | :struct:`MinMaxOptions` |
 +----------+--------+----------+-------------------+-------------------------+
-| mode     | Unary  | Numeric  | Scalar Struct (2) |                         |
+| mode     | Unary  | Numeric  | Scalar Struct (2) | :struct:`ModeOptions`   |

Review comment: Replace "Scalar Struct" with "Struct".

## File path: cpp/src/arrow/compute/kernels/aggregate_test.cc

## @@ -735,71 +735,63 @@ class TestPrimitiveModeKernel : public ::testing::Test {
 public:
  using ArrowType = T;
  using Traits = TypeTraits<ArrowType>;
- using c_type = typename ArrowType::c_type;
- using ModeType = typename Traits::ScalarType;
- using CountType = typename TypeTraits<Int64Type>::ScalarType;
-
- void AssertModeIs(const Datum& array, c_type expected_mode, int64_t expected_count) {
-   ASSERT_OK_AND_ASSIGN(Datum out, Mode(array));
-   const StructScalar& value = out.scalar_as<StructScalar>();
+ using CType = typename ArrowType::c_type;
+
+ void AssertModesAre(const Datum& array, const int n,
+                     const std::vector<CType>& expected_modes,
+                     const std::vector<int64_t>& expected_counts) {
+   ASSERT_OK_AND_ASSIGN(Datum out, Mode(array, ModeOptions{n}));
+   const StructArray out_array(out.array());
+   ASSERT_EQ(out_array.length(), expected_modes.size());
+   ASSERT_EQ(out_array.num_fields(), 2);
+
+   const CType* out_modes = out_array.field(0)->data()->GetValues<CType>(1);
+   const int64_t* out_counts = out_array.field(1)->data()->GetValues<int64_t>(1);
+   for (int i = 0; i < out_array.length(); ++i) {
+     // equal or nan equal
+     ASSERT_TRUE(
+         (expected_modes[i] == out_modes[i]) ||
+         (expected_modes[i] != expected_modes[i] && out_modes[i] != out_modes[i]));
+     ASSERT_EQ(expected_counts[i], out_counts[i]);
+   }
+ }

-   const auto& out_mode = checked_cast<const ModeType&>(*value.value[0]);
-   ASSERT_EQ(expected_mode, out_mode.value);
+ void AssertModesAre(const std::string& json, const int n,
+                     const std::vector<CType>& expected_modes,
+                     const std::vector<int64_t>& expected_counts) {
+   auto array = ArrayFromJSON(type_singleton(), json);
+   AssertModesAre(array, n, expected_modes, expected_counts);
+ }

-   const auto& out_count = checked_cast<const CountType&>(*value.value[1]);
-   ASSERT_EQ(expected_count, out_count.value);
+ void AssertModeIs(const Datum& array, CType expected_mode, int64_t expected_count) {
+   AssertModesAre(array, 1, {expected_mode}, {expected_count});
  }

- void AssertModeIs(const std::string& json, c_type expected_mode,
+ void AssertModeIs(const std::string& json, CType expected_mode,
                    int64_t expected_count) {
    auto array = ArrayFromJSON(type_singleton(), json);
    AssertModeIs(array, expected_mode, expected_count);
  }

- void AssertModeIs(const std::vector<std::string>& json, c_type expected_mode,
+ void AssertModeIs(const std::vector<std::string>& json, CType expected_mode,
                    int64_t expected_count) {
    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
    AssertModeIs(chunked, expected_mode, expected_count);
  }

- void AssertModeIsNull(const Datum& array) {
-   ASSERT_OK_AND_ASSIGN(Datum out, Mode(array));
-   const StructScalar& value = out.scalar_as<StructScalar>();
-
-   for (const auto& val : value.value) {
-     ASSERT_FALSE(val->is_valid);
-   }
+ void AssertModeIsNull(const Datum& array, int n) {

Review comment: This should be renamed, since it's not null anymore. `AssertModesEmpty`?

## File path: cpp/src/arrow/compute/api_aggregate.h

## @@ -76,6 +76,18 @@ struct ARROW_EXPORT MinMaxOptions : public FunctionOptions {
   enum Mode null_handling;
 };

+/// \brief Control Mode kernel behavior
+///
+/// Returns top-n common values and counts.
+/// By default, returns the most common value and count.
+struct ARROW_EXPORT ModeOptions : public FunctionOptions {
+  explicit ModeOptions(int n = 1) : n(n) {}

Review comment: Can we use `int64_t` everywhere for this?
[GitHub] [arrow] alamb commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
alamb commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725416573 So my personal suggestion is to change all the benches to use `seed_from_u64` or equivalent. I don't think there is any need for a new additional dependency -- https://crates.io/crates/rand seems much more popular than https://crates.io/crates/bastion-utils.
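The reproducibility fix suggested here is simply fixed seeding: with the rand crate, `StdRng::seed_from_u64(42)` yields the same sequence on every run. To keep this sketch dependency-free, a tiny hand-rolled SplitMix64 generator stands in for a seeded `StdRng`; the generator itself is ours and purely illustrative:

```rust
// A minimal deterministic PRNG (SplitMix64). In actual benches one would use
// rand's StdRng::seed_from_u64 instead; this stand-in just demonstrates why a
// fixed seed makes benchmark inputs reproducible across runs.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    // Two generators with the same seed produce identical benchmark data...
    let (mut a, mut b) = (SplitMix64::new(42), SplitMix64::new(42));
    let xs: Vec<u64> = (0..4).map(|_| a.next_u64()).collect();
    let ys: Vec<u64> = (0..4).map(|_| b.next_u64()).collect();
    assert_eq!(xs, ys);
    // ...while a different seed gives a different, but equally reproducible, run.
    let mut c = SplitMix64::new(7);
    assert_ne!(xs[0], c.next_u64());
}
```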
[GitHub] [arrow] pitrou commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
pitrou commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521349565

## File path: cpp/src/arrow/util/basic_decimal.cc

## @@ -395,33 +395,49 @@ BasicDecimal128& BasicDecimal128::operator*=(const BasicDecimal128& right) {
   return *this;
 }

-/// Expands the given value into an array of ints so that we can work on
-/// it. The array will be converted to an absolute value and the wasNegative
+/// Expands the given little endian array of uint64_t into a big endian array of
+/// uint32_t. The value of input array is expected to be non-negative. The result_array
+/// will remove leading zeros from the input array.
+/// \param value_array a little endian array to represent the value
+/// \param result_array a big endian array of length N*2 to set with the value
+/// \result the output length of the array
+template <size_t N>
+static int64_t FillInArray(const std::array<uint64_t, N>& value_array,
+                           uint32_t* result_array) {
+  int64_t next_index = 0;
+  for (int64_t i = N - 1; i >= 0; i--) {
+    if (value_array[i] != 0) {
+      if (value_array[i] <= std::numeric_limits<uint32_t>::max()) {
+        result_array[next_index++] = static_cast<uint32_t>(value_array[i]);
+        i--;
+      }
+      for (int64_t j = i; j >= 0; j--) {
+        result_array[next_index++] = static_cast<uint32_t>(value_array[j] >> 32);
+        result_array[next_index++] = static_cast<uint32_t>(value_array[j]);
+      }
+      break;
+    }
+  }
+  return next_index;
+}
+
+/// Expands the given value into a big endian array of ints so that we can work on
+/// it. The array will be converted to an absolute value and the was_negative
 /// flag will be set appropriately. The array will remove leading zeros from
 /// the value.
-/// \param array an array of length 4 to set with the value
+/// \param array a big endian array of length 4 to set with the value
 /// \param was_negative a flag for whether the value was original negative
 /// \result the output length of the array
 static int64_t FillInArray(const BasicDecimal128& value, uint32_t* array,
                            bool& was_negative) {
-  uint64_t high;
-  uint64_t low;
-  const int64_t highbits = value.high_bits();
-  const uint64_t lowbits = value.low_bits();
-
-  if (highbits < 0) {
-    low = ~lowbits + 1;
-    high = static_cast<uint64_t>(~highbits);
-    if (low == 0) {
-      ++high;
-    }
-    was_negative = true;
-  } else {
-    low = lowbits;
-    high = static_cast<uint64_t>(highbits);
-    was_negative = false;
-  }
-
+  BasicDecimal128 abs_value = BasicDecimal128::Abs(value);
+  was_negative = value.high_bits() < 0;
+  uint64_t high = static_cast<uint64_t>(abs_value.high_bits());
+  uint64_t low = abs_value.low_bits();
+
+  // FillInArray(std::array<uint64_t, N>& value_array, uint32_t* result_array) is not
+  // called here as the following code has better performance, to avoid regression on
+  // BasicDecimal128 Division.
   if (high != 0) {
     if (high > std::numeric_limits<uint32_t>::max()) {

Review comment: I'm afraid that doesn't answer my question about non-strict inequality, does it?
[GitHub] [arrow] pitrou commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
pitrou commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521349186

## File path: cpp/src/arrow/util/basic_decimal.cc

## @@ -490,49 +527,60 @@ static void FixDivisionSigns(BasicDecimal128* result, BasicDecimal128* remainder
   }
 }

-/// \brief Build a BasicDecimal128 from a list of ints.
+/// \brief Build a little endian array of uint64_t from a big endian array of uint32_t.
+template <size_t N>
+static DecimalStatus BuildFromArray(std::array<uint64_t, N>* result_array,
+                                    uint32_t* array, int64_t length) {
+  for (int64_t i = length - 2 * N - 1; i >= 0; i--) {
+    if (array[i] != 0) {
+      return DecimalStatus::kOverflow;
+    }
+  }
+  int64_t next_index = length - 1;
+  for (size_t i = 0; i < N; i++) {
+    uint64_t lower_bits = (next_index < 0) ? 0 : array[next_index--];
+    (*result_array)[i] =
+        (next_index < 0)

Review comment: Why is there still a test for `next_index < 0` here? Also, why are some indices `int64_t` and others `size_t`?
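`BuildFromArray` goes the other way: big-endian u32 digits back into N little-endian u64 limbs, reporting overflow if significant digits remain beyond 2*N slots. The `next_index < 0` check pitrou asks about covers digit arrays shorter than 2*N. A Rust sketch with our own names:

```rust
// Rebuild n little-endian u64 limbs from big-endian u32 digits.
// Returns None on overflow (significant digits beyond 2*n slots).
fn build_from_array(digits: &[u32], n: usize) -> Option<Vec<u64>> {
    if digits.len() > 2 * n && digits[..digits.len() - 2 * n].iter().any(|&d| d != 0) {
        return None;
    }
    let mut next: i64 = digits.len() as i64 - 1;
    // Pops the least significant remaining digit, or 0 when the digit
    // array is shorter than 2*n -- the `next_index < 0` case.
    let mut pop = || -> u64 {
        if next < 0 {
            0
        } else {
            let d = digits[next as usize] as u64;
            next -= 1;
            d
        }
    };
    let mut limbs = vec![0u64; n];
    for limb in limbs.iter_mut() {
        let lo = pop();
        let hi = pop();
        *limb = (hi << 32) | lo;
    }
    Some(limbs)
}

fn main() {
    // Digits [1, 2] (big-endian) -> value 0x1_0000_0002 in limb 0, limb 1 zero.
    assert_eq!(build_from_array(&[1, 2], 2), Some(vec![0x1_0000_0002, 0]));
}
```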
[GitHub] [arrow] vertexclique edited a comment on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique edited a comment on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725422469

> If the argument is that two runs of the same benchmark do not agree due to randomness, then I do not understand how having the same seed per thread or different seeds per thread changes this behavior: the seed will still be set when the process starts and will be different per run.

I don't think that it should be different per run per benchmark; that's why it varies. https://github.com/bheisler/criterion.rs/blob/4499ee5e32f59c0c15eb046dd7a2dfd0c15a96f7/src/stats/univariate/sample.rs#L210

> Seeding the random number generators can be done using the existing rand crate, in the following way (this took me longer to figure out from the rand crate's documentation than I would like to admit):

Yeah, I didn't want to write a const compiled seed, and people forget to use the same seed every time. My implementation also does some things faster than rand. Here: https://docs.rs/bastion-utils/0.3.2/src/bastion_utils/math.rs.html#4-28 But I am OK with that too.
[GitHub] [arrow] alamb closed pull request #8619: ARROW-10531: [Rust][DataFusion]: Add schema and graphviz formatting for LogicalPlans and a PlanVisitor
alamb closed pull request #8619: URL: https://github.com/apache/arrow/pull/8619
[GitHub] [arrow] maartenbreddels commented on a change in pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
maartenbreddels commented on a change in pull request #8621: URL: https://github.com/apache/arrow/pull/8621#discussion_r521422359

## File path: cpp/src/arrow/compute/kernels/scalar_string.cc
[GitHub] [arrow] pitrou commented on pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
pitrou commented on pull request #8621: URL: https://github.com/apache/arrow/pull/8621#issuecomment-725493506 > I guess we still need to manually add content to compute.rst? Yes, you do :-) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao edited a comment on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725500367 Thanks a lot, @alamb and @vertexclique . I agree with the naming issues here, and great insight into those crates. I do not have strong feelings about naming; I tried to be close to what I am familiar with from Rust (e.g. `Vec::extend`). A bit of background: @yordan-pavlov did an impressive job on #7798 to make filtering performant; It is _really_ hard to get that performance with a generalization. I had to re-write this code at least 3 times to get it to a stage with comparable performance and even then, you can see that it depends on the filter conditions. But there is a (IMO good) reason for generalizing the code in `filter`. Specifically, the use-case I am looking at is to remove most code from the `take` kernel, since it is doing what this struct is doing (and what `filter` was doing). There are implementation details that are different (`take`'s indices can be null and `filter` does not support `ArrayStruct` atm), but they are fundamentally doing the same thing: constructing an `ArrayData` by memcopying chunks of another `ArrayData`. Also, note that this goes beyond masking: it supports taking values repeatedly. I see it as a "builder" bound to a specific `ArrayData` (`ArrayDataBuilder` is already taken, though). It is not a builder like the builders in `builder.rs` because those memcopy data from rust native types, not Arrow types. The reason this performance remains comparable(-ish) to master is that when it binds to an `ArrayData`, it inspects its `DataType` to initialize functions (the `type Extend`) bound to that `ArrayData` that are performant. 
For example, ```rust let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; Box::new(move |mutable: &mut _MutableArrayData, start: usize, len: usize| { ``` instead of ```rust Box::new(move |mutable: &mut _MutableArrayData, array: &ArrayData, start: usize, len: usize| { let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; ``` has a meaningful performance impact because it avoids checks on both `buffers()[0]` and `[something..]` on every call of `extend`. This check may seem small, but it significantly impacts performance because `extend` can be as small as: `copy 2 bits` (in the case of a boolean array). When there are many calls of this operation (e.g. filter every other slot), these checks are as or more expensive than the `memcopy` themselves. There is a further boost possible that I explored but did not finish, but it requires a bit of `unsafe` code: we know that these functions never outlive `MutableDataArray`. Therefore, we could actually unsafely pass `&mut _MutableArrayData` to their initialization and avoid all costs associated with accessing `mutable.buffers()` and the like inside `extend`. In this case, we would use ```rust type Extend<'a> = Box<Fn(usize, usize) -> () + 'a>; ``` and bind the `_MutableArrayData` on `build_primitive_extend<'a, T>(array, data)`. This would be the closest to the current filter implementation that remains generic for other use-cases. My point is that there are non-trivial optimizations done in this proposal that were borrowed from the excellent work from @yordan-pavlov and that are required to keep up with the very high bar set by #7798. This draft is trying to leverage that work on two of our kernels, `take` and `filter`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
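The bind-once pattern described above — hoisting buffer lookups out of the per-call closure so each `extend` call is a bare memcopy — can be sketched in isolation. The names below (`build_extend`, the `Vec<u8>` target) are illustrative stand-ins, not the actual arrow-rs API:

```rust
// Sketch of the "bind once, call many times" specialization discussed above.
// The specialized closure captures the values slice at creation time, so each
// call avoids re-indexing `buffers()[0]` and re-slicing by the array offset.
fn build_extend<'a>(
    values: &'a [u8],
    offset: usize,
) -> Box<dyn Fn(&mut Vec<u8>, usize, usize) + 'a> {
    // work done once, at bind time
    let values = &values[offset..];
    Box::new(move |out: &mut Vec<u8>, start: usize, len: usize| {
        // per-call body is a bare memcopy
        out.extend_from_slice(&values[start..start + len]);
    })
}

fn main() {
    let data = vec![1u8, 2, 3, 4, 5, 6];
    let extend = build_extend(&data, 2); // binds to &data[2..] = [3, 4, 5, 6]
    let mut out = Vec::new();
    extend(&mut out, 0, 2); // copies [3, 4]
    extend(&mut out, 3, 1); // copies [6]
    assert_eq!(out, vec![3, 4, 6]);
    println!("{:?}", out);
}
```

When each call copies only a couple of bytes (or bits), moving the slicing and bounds set-up out of the hot path is exactly where the comment argues the savings come from.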
[GitHub] [arrow] alamb commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
alamb commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725414703 In my measurements, the `sum` compute kernels do appear to have significant variability from run to run on my machine (details below). The variability still exists with this PR though it appears to be less. I think one issue, as @vertexclique has explained, is that different random numbers are being used between runs (and between threads) and thus the actual computations performed from run to run are changing. Seeding the random number generator so it always produces the same sequence of random numbers is definitely a classic way to reduce such variability and seems like a good idea to me. Seeding the random number generators *can* be done using the existing `rand` crate, in the following way (this took me longer to figure out from the `rand` crate's documentation than I would like to admit): Instead of ``` let mut rng = rand::thread_rng(); ``` Use ``` use rand::{ Rng, SeedableRng, rngs::StdRng }; let mut rng = StdRng::seed_from_u64(42); ``` Here are some measurements I ran on my machine: Master @ f7027b43d10bab4d8ca9397a753dc3553d88f146 ``` sum 512 time: [452.64 ns 456.16 ns 459.94 ns] sum 512 time: [459.08 ns 462.78 ns 466.55 ns] sum 512 time: [457.80 ns 461.87 ns 466.06 ns] sum nulls 512 time: [246.69 ns 248.65 ns 250.87 ns] sum nulls 512 time: [269.41 ns 271.79 ns 274.21 ns] sum nulls 512 time: [247.98 ns 250.42 ns 252.92 ns] ``` ARROW-10551-fix-unreproducible-benches ``` sum 512 time: [476.98 ns 482.19 ns 488.17 ns] sum 512 time: [470.75 ns 474.96 ns 479.42 ns] sum 512 time: [506.47 ns 508.00 ns 509.59 ns] sum nulls 512 time: [268.39 ns 270.32 ns 272.56 ns] sum nulls 512 time: [272.20 ns 274.46 ns 276.81 ns] sum nulls 512 time: [266.60 ns 269.12 ns 272.28 ns] ``` Master @ f7027b43d10bab4d8ca9397a753dc3553d88f146 w/ `StdRng::seed_from_u64`: ``` sum 512 time: 
[466.96 ns 471.60 ns 476.52 ns] sum nulls 512 time: [236.96 ns 238.34 ns 239.76 ns] sum nulls 512 time: [241.24 ns 243.24 ns 245.50 ns] sum nulls 512 time: [246.68 ns 248.60 ns 250.61 ns] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
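Seeding, as suggested in the discussion above, makes every benchmark run consume an identical input sequence. The PR itself would use the `rand` crate's `StdRng::seed_from_u64(42)` as shown; as a dependency-free illustration of *why* seeding makes runs reproducible, here is a minimal splitmix64 generator (an assumption of this sketch — not the generator `rand` or the PR uses):

```rust
// Minimal seeded PRNG (splitmix64) illustrating reproducible benchmark data:
// two generators built from the same seed emit identical sequences, so the
// arrays fed to the kernels are identical across `cargo bench` runs.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next(&mut self) -> u64 {
        // standard splitmix64 constants
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    let mut a = SplitMix64::new(42);
    let mut b = SplitMix64::new(42);
    let xs: Vec<u64> = (0..512).map(|_| a.next()).collect();
    let ys: Vec<u64> = (0..512).map(|_| b.next()).collect();
    // same seed -> same benchmark input, run after run
    assert_eq!(xs, ys);
    println!("first value: {}", xs[0]);
}
```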
[GitHub] [arrow] jorgecarleitao commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
jorgecarleitao commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725414955 > Current, code is reseeding after 32 MB of data, and every thread has different randomness. So no data is same as another data and totally different in different use cases. What I am trying to understand is why this is a concern: in the benchmarks, the data is generated only once, during the setup and not on every iteration of the benchmark. Thus, any randomness will be frozen for every iteration of the benchmark. If the argument is that two runs of the same benchmark do not agree due to randomness, then I do not understand how having the same seed per thread or different seeds per thread changes this behavior: the seed will still be set when the process starts and will be different per run. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
alamb commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725438655 > Can you use https://github.com/vertexclique/zor/blob/master/zor if you are on Linux? I am running on mac osx -- I looked at zor and it looked like a good tool to try and make reproducible results by clearing kernel caches, etc. I wonder if you can try using the seeded random approach with zor to see if that is as reproducible. > Yeah, I didn't want to write a const compiled seed and people forget to use the same seed every time. That is a good point -- maybe we can make a utility function or something for the tests to use. I don't think the actual value of the seed (e.g. `42`) is important to keep consistent in all the .rs sources - just that it is a constant number between invocations of `cargo bench ...` across runs > My implementation does some extra faster stuff compared to rand too. Here: https://docs.rs/bastion-utils/0.3.2/src/bastion_utils/math.rs.html#4-28 That looks cool, but I am not enough of a numerical algorithms expert to evaluate the randomness properties of that algorithm This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521407494 ## File path: rust/arrow/src/array/transform/primitive.rs ## @@ -0,0 +1,37 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of}; + +use crate::{array::ArrayData, datatypes::ArrowNativeType}; + +use super::{Extend, _MutableArrayData}; + +pub(super) fn build_extend<T: ArrowNativeType>(array: &ArrayData) -> Extend { +let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; +Box::new( +move |mutable: &mut _MutableArrayData, start: usize, len: usize| { +let start = start * size_of::<T>(); Review comment: I try to use generics for things that are known at compile time and arguments for things that are only known at runtime. Maybe that is not the correct way of thinking? In this case in particular, the difference would be between `build_extend::<T>(array)` and `build_extend(array, size_of::<T>())`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorisvandenbossche opened a new pull request #8638: ARROW-10558: [Python] Fix python S3 filesystem tests interdependence
jorisvandenbossche opened a new pull request #8638: URL: https://github.com/apache/arrow/pull/8638 Follow-up on https://github.com/apache/arrow/pull/8573, where I introduced a test that was only passing because of state from other S3 tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] vertexclique commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725419093 Can you use https://github.com/vertexclique/zor/blob/master/zor if you are on Linux? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao commented on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725500367 Thanks a lot, @alamb and @vertexclique . I agree with the naming issues here, and great insight into those crates. I do not have strong feelings about naming; I tried to be close to what I am familiar with from Rust (e.g. `Vec::extend`). A bit of background: @yordan-pavlov did an impressive job on #7798 to make filtering performant; It is _really_ hard to get that performance with a generalization. I had to re-write this code at least 3 times to get it to a stage with comparable performance and even then, you can see that it depends on the filter conditions. But there is a (IMO good) reason for generalizing the code in `filter`. Specifically, the use-case I am looking at is to remove most code from the `take` kernel, since it is doing what this struct is doing (and what `filter` was doing). There are implementation details that are different (`take`'s indices can be null and `filter` does not support `ArrayStruct` atm), but they are fundamentally doing the same thing: constructing an `ArrayData` by memcopying chunks of another `ArrayData`. Also, note that this goes beyond masking: it supports taking values repeatedly. I see it as a "builder" bound to a specific `ArrayData` (`ArrayDataBuilder` is already taken, though). It is not a builder like the builders in `builder.rs` because those memcopy data from rust native types, not Arrow types. The reason this performance remains comparable(-ish) to master is that when it binds to an `ArrayData`, it inspects its `DataType` to initialize functions (the `type Extend`) bound to that `ArrayData` that are performant. 
For example, ```rust let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; Box::new(move |mutable: &mut _MutableArrayData, start: usize, len: usize| { ``` instead of ```rust Box::new(move |mutable: &mut _MutableArrayData, array: &ArrayData, start: usize, len: usize| { let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; ``` has a meaningful performance impact because it avoids checks on both `buffers()[0]` and `[something..]` on every call of `extend`. This check may seem small, but it significantly impacts performance because `extend` can be as small as: `copy 2 bits` (in the case of a boolean array). When there are many calls of this operation (e.g. filter every other slot), these checks are as or more expensive than the `memcopy` themselves. There is a further boost possible that I explored but did not finish, but it requires a bit of `unsafe` code: we know that these functions never outlive `MutableDataArray`. Therefore, we could actually unsafely pass `&mut _MutableArrayData` to their initialization and avoid all costs associated with accessing `mutable.buffers()` and the like inside `extend`. In this case, we would use ```rust type Extend<'a> = Box<Fn(usize, usize) -> () + 'a>; ``` and bind the `_MutableArrayData` on `build_primitive_extend<'a, T>(array, data)`. This would be the closest to the current filter implementation that remains generic for other use-cases. My point is that there are really non-trivial optimizations done in this proposal that were borrowed from the excellent work from @yordan-pavlov and that are required to keep up with the very high bar set by #7798. This draft is trying to leverage that work on two of our kernels, `take` and `filter`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] vertexclique commented on a change in pull request #8636: ARROW-10552: [Rust] Removed un-used Result
vertexclique commented on a change in pull request #8636: URL: https://github.com/apache/arrow/pull/8636#discussion_r521339244 ## File path: rust/arrow/src/compute/kernels/filter.rs ## @@ -112,7 +112,7 @@ impl<'a> CopyNullBit for NullBitSetter<'a> { } fn null_buffer(&mut self) -> Buffer { -self.target_buffer.resize(self.target_index).unwrap(); +self.target_buffer.resize(self.target_index); Review comment: These are most probably generating clippy warnings. ## File path: rust/arrow/src/buffer.rs ## @@ -676,9 +676,9 @@ impl MutableBuffer { /// `new_len` will be zeroed out. /// /// If `new_len` is less than `len`, the buffer will be truncated. -pub fn resize(&mut self, new_len: usize) -> Result<()> { +pub fn resize(&mut self, new_len: usize) -> () { Review comment: ```suggestion pub fn resize(&mut self, new_len: usize) { ``` It is the same in other positions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
jorgecarleitao edited a comment on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725414955 > Current, code is reseeding after 32 MB of data, and every thread has different randomness. So no data is same as another data and totally different in different use cases. What I am trying to understand is why this is a concern: in the benchmarks, the data is generated only once, during the setup and not on every iteration of the benchmark. Thus, any randomness will be frozen for every iteration of the benchmark. If the argument is that two runs of the same benchmark do not agree due to randomness, then I do not understand how having the same seed per thread or different seeds per thread changes this behavior: the seed will still be set when the process starts and will be different per run. EDIT: @alamb beat me to it by seconds (and evidence) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
alamb commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521354110 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array<T, F>(size: usize, value_fn: F) -> PrimitiveArray<T> +fn create_primitive_array<T>(size: usize, null_density: f32) -> PrimitiveArray<T> where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution<T::Native>, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); let mut builder = PrimitiveArray::<T>::builder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::<f32>() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::<f32>() < null_density { builder.append_null().unwrap(); +} else { +let value = rng.sample_iter(&Alphanumeric).take(10).collect::<String>(); +builder.append_value(&value).unwrap(); } } builder.finish() } -fn create_bool_array<F>(size: usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = rand::thread_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::<f32>() < trues_density; Review comment: As @vertexclique mentions, he has provided a good PR related to this conversation https://github.com/apache/arrow/pull/8635 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #8619: ARROW-10531: [Rust][DataFusion]: Add schema and graphviz formatting for LogicalPlans and a PlanVisitor
alamb commented on pull request #8619: URL: https://github.com/apache/arrow/pull/8619#issuecomment-725433951 Rebased and will merge when it passes CI This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] maartenbreddels commented on pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
maartenbreddels commented on pull request #8621: URL: https://github.com/apache/arrow/pull/8621#issuecomment-725476725 I've opened an issue at https://issues.apache.org/jira/browse/ARROW-10556 I guess we still need to manually add content to compute.rst? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8636: ARROW-10552: [Rust] Removed un-used Result
alamb commented on a change in pull request #8636: URL: https://github.com/apache/arrow/pull/8636#discussion_r521368094 ## File path: rust/arrow/src/compute/kernels/filter.rs ## @@ -112,7 +112,7 @@ impl<'a> CopyNullBit for NullBitSetter<'a> { } fn null_buffer(&mut self) -> Buffer { -self.target_buffer.resize(self.target_index).unwrap(); +self.target_buffer.resize(self.target_index); Review comment: You are likely correct -- we should eventually turn on clippy for the CI checks in this crate This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jhorstmann commented on pull request #8598: ARROW-10500: [Rust] Refactor bit slice, bit view iterator for array buffers
jhorstmann commented on pull request #8598: URL: https://github.com/apache/arrow/pull/8598#issuecomment-725477741 > > I think we should address jhorstmann's measurements of performance regressions before this PR is merged. > > I measured the performance. 🙃 It is in the PR description. That's exactly where I took the benchmark results from. But yes, the regression in `buffer_bit_ops and` does not seem to have any big effect. I had one other comment about the separate testcases for big-endian architectures, or restricting tests to little-endian, that was not yet addressed: > I'm wondering whether this is really correct, the way I understood it is that little/big endian only affect the layout of bytes in memory, not how individual bits are accessed in a number. In this testcase the least significant bit of the first byte is zero and would be considered the first value if this was a boolean array or null bitmap. Same for the 4th least significant bit, which is where the slice here should start. This means the least significant bit of the chunk should be zero. Consider the following buffer of u8, used as bit-packed data, with the indices of bytes and bits written below ``` 0001 01 76543210 76543210 ``` To get the value of the 12th bit we would check bit (12%8) of byte (12/8). Viewing this as a larger type (u16 for simplification): ``` 0001 0 11 5432109876543210 ``` To check the same bit we would need to check bit (12%16) of word (12/16). So the value as u16 would be 4096 and this should be independent of the machine-endianness. Endianness only influences how the u16 would be stored in memory, but our underlying data consists of u8 in memory. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
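The index arithmetic in the comment above — bit `i` is bit `(i % 8)` of byte `(i / 8)`, and the same bit is bit `(i % 16)` of word `(i / 16)` after a little-endian reinterpretation — can be checked directly. This is a small stand-alone sketch, not the Arrow code under review:

```rust
// Bit i of a bit-packed buffer is bit (i % 8) of byte (i / 8).
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] >> (i % 8)) & 1 == 1
}

fn main() {
    // byte 0 = 0b0000_0000, byte 1 = 0b0001_0000: only bit 12 of the buffer is set
    let bytes = [0b0000_0000u8, 0b0001_0000];
    assert!(get_bit(&bytes, 12));
    assert!(!get_bit(&bytes, 11));

    // the same two bytes viewed as a little-endian u16 are 4096 = 1 << 12,
    // so checking bit (12 % 16) of word (12 / 16) agrees with the byte view
    let word = u16::from_le_bytes(bytes);
    assert_eq!(word, 4096);
    assert_eq!((word >> 12) & 1, 1);
    println!("word = {}", word);
}
```

This matches the comment's claim that the u16 value is 4096: the widening only changes how many bits each "digit" holds, not which bit is addressed, as long as the bytes are assembled in little-endian order.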
[GitHub] [arrow] vertexclique commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725422469 > If the argument is that two runs of the same benchmark do not agree due to randomness, then I do not understand how having the same seed per thread or different seeds per thread changes this behavior: the seed will still be set when the process starts and will be different per run. I don't think that it should be different per run per benchmark, that's why it is varying. https://github.com/bheisler/criterion.rs/blob/4499ee5e32f59c0c15eb046dd7a2dfd0c15a96f7/src/stats/univariate/sample.rs#L210 > Seeding the random number generators can be done using the existing rand crate, in the following way (this took me longer to figure out from the rand crate's documentation than I would like to admit): Yeah, I didn't want to write a const compiled seed and people forget to use the same seed every time. My implementation does some extra faster stuff compared to rand too. Here: https://docs.rs/bastion-utils/0.3.2/src/bastion_utils/math.rs.html#4-28 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] vertexclique commented on pull request #8630: ARROW-10540 [Rust] Improve filtering
vertexclique commented on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725436304 ``` Naming: I have seen similar concepts called "Masks" (as they are similar to bit masks) -- so perhaps ArrayDataMask or MaskedArrayData. Or perhaps ArrayRowSet When I actually read the code about MutableArrayData I realize that it isn't quite the mask concept, but it is similar (intermediate results want to represent "what indexes pass a certain test" and then eventually copying only those indexes to a new array" This type of structure might also be useful for performing multiple boolean operations (eg. when doing (A > 5) AND A < 10 you can compute the row ids/indexes that pass A > 5 and then rather than actually copying those rows to then compare them less than B you can operate directly on the original copy of A (check only the rows where the mask is true) ``` I found the whole block of this comment to be true. There are other approaches to do this but the main approach is that. I believe a scratchpad implementation can also solve this problem from a different looking glass. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
pitrou commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521362458 ## File path: cpp/src/arrow/util/basic_decimal.cc ## @@ -549,24 +603,27 @@ static DecimalStatus SingleDivide(const uint32_t* dividend, int64_t dividend_len return DecimalStatus::kSuccess; } -DecimalStatus BasicDecimal128::Divide(const BasicDecimal128& divisor, - BasicDecimal128* result, - BasicDecimal128* remainder) const { +/// \brief Do a division where the divisor fits into a single 32 bit value. +template <typename DecimalClass> +static inline DecimalStatus DecimalDivide(const DecimalClass& dividend, + const DecimalClass& divisor, + DecimalClass* result, DecimalClass* remainder) { + constexpr int64_t kDecimalArrayLength = sizeof(DecimalClass) / sizeof(uint32_t); Review comment: Use of `sizeof(DecimalClass)` here seems a bit fragile. Perhaps add/use `DecimalClass::bit_width`? ## File path: cpp/src/arrow/util/decimal_test.cc ## @@ -1296,6 +1296,56 @@ TEST(Decimal256Test, Multiply) { } } +TEST(Decimal256Test, Divide) { + ASSERT_EQ(Decimal256(33), Decimal256(100) / Decimal256(3)); + ASSERT_EQ(Decimal256(66), Decimal256(200) / Decimal256(3)); + ASSERT_EQ(Decimal256(66), Decimal256(20100) / Decimal256(301)); + ASSERT_EQ(Decimal256(-66), Decimal256(-20100) / Decimal256(301)); + ASSERT_EQ(Decimal256(-66), Decimal256(20100) / Decimal256(-301)); + ASSERT_EQ(Decimal256(66), Decimal256(-20100) / Decimal256(-301)); + ASSERT_EQ(Decimal256("-5192296858534827628530496329343552"), + Decimal256("-269599466671506397946670150910580797473777870509761363" + "24636208709184") / +Decimal256("5192296858534827628530496329874417")); + ASSERT_EQ(Decimal256("5192296858534827628530496329343552"), + Decimal256("-269599466671506397946670150910580797473777870509761363" + "24636208709184") / +Decimal256("-5192296858534827628530496329874417")); + ASSERT_EQ(Decimal256("5192296858534827628530496329343552"), + Decimal256("2695994666715063979466701509105807974737778705097613632" + "4636208709184") / 
+Decimal256("5192296858534827628530496329874417")); + ASSERT_EQ(Decimal256("-5192296858534827628530496329343552"), + Decimal256("2695994666715063979466701509105807974737778705097613632" + "4636208709184") / +Decimal256("-5192296858534827628530496329874417")); + + // Test some random numbers. + for (auto x : GetRandomNumbers(16)) { +for (auto y : GetRandomNumbers(16)) { + if (y == 0) { +continue; + } + + Decimal256 result = Decimal256(x) / Decimal256(y); + ASSERT_EQ(Decimal256(static_cast<int64_t>(x) / y), result) + << " x: " << x << " y: " << y; +} + } + + // Test some edge cases + for (auto x : std::vector<int64_t>{-INT64_MAX, -INT32_MAX, 0, INT32_MAX, INT64_MAX}) { +for (auto y : std::vector<int32_t>{-INT32_MAX, -32, -2, -1, 1, 2, 32, INT32_MAX}) { Review comment: I would expect those additional values to also be part of `y`. ## File path: cpp/src/arrow/util/basic_decimal.cc ## @@ -549,24 +603,27 @@ static DecimalStatus SingleDivide(const uint32_t* dividend, int64_t dividend_len return DecimalStatus::kSuccess; } -DecimalStatus BasicDecimal128::Divide(const BasicDecimal128& divisor, - BasicDecimal128* result, - BasicDecimal128* remainder) const { +/// \brief Do a division where the divisor fits into a single 32 bit value. Review comment: The comment here seems wrong (the divisor doesn't fit into a single 32bit value). ## File path: cpp/src/arrow/util/decimal_test.cc ## @@ -40,6 +40,9 @@ namespace arrow { using internal::int128_t; using internal::uint128_t; +constexpr int128_t kInt128Max = (static_cast<int128_t>(INT64_MAX) << 64) - 1 + +2 * (static_cast<int128_t>(INT64_MAX) + 1); Review comment: Why not `(static_cast<int128_t>(INT64_MAX) << 64) + static_cast<int128_t>(UINT64_MAX)` ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
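The `SingleDivide` helper that this review discusses implements schoolbook long division of a multi-limb integer by a single 32-bit divisor. A sketch of the technique, in Rust rather than C++ and with an illustrative most-significant-first limb order that is not necessarily Arrow's actual layout:

```rust
// Schoolbook division of a multi-limb integer by a single 32-bit divisor.
// Limbs are most-significant first; the quotient replaces `limbs` in place
// and the remainder is returned. A 64-bit intermediate holds the partial
// remainder shifted up by one 32-bit "digit".
fn single_divide(limbs: &mut [u32], divisor: u32) -> u32 {
    let mut rem: u64 = 0;
    for limb in limbs.iter_mut() {
        // bring down the next 32-bit digit
        let cur = (rem << 32) | (*limb as u64);
        *limb = (cur / divisor as u64) as u32; // quotient digit (always fits u32)
        rem = cur % divisor as u64;
    }
    rem as u32
}

fn main() {
    // 20100 / 301 = 66 remainder 234, matching the Decimal256 test values above
    let mut limbs = [0u32, 20100];
    let rem = single_divide(&mut limbs, 301);
    assert_eq!(limbs, [0, 66]);
    assert_eq!(rem, 234);
    println!("quotient limbs = {:?}, remainder = {}", limbs, rem);
}
```

Because the running remainder is always less than the divisor, each intermediate `(rem << 32) | limb` stays below `divisor << 32`, so every quotient digit fits back into a `u32`.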
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521392544 ## File path: rust/arrow/src/array/transform/mod.rs ## @@ -0,0 +1,532 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of, sync::Arc}; + +use crate::{buffer::MutableBuffer, datatypes::DataType, util::bit_util}; + +use super::{ArrayData, ArrayDataRef}; + +mod boolean; +mod list; +mod primitive; +mod utils; +mod variable_size; + +type ExtendNullBits<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>; +// function that extends `[start..start+len]` to the mutable array. +// this is dynamic because different data_types influence how buffers and childs are extended. +type Extend<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>; Review comment: That unfortunately would only work for primitive buffers. For string arrays, extending an array data requires a complex operation that is fundamentally different from extending a single buffer. For nested types, the operation is recursive on the child data. This is fundamentally a dynamic operation: we only know what to do when we see which `DataType` the user wants to build an `ArrayData` from. 
We can see that the `Builders` use a similar approach: they use `dyn Builder` for the same reason. The builders have an extra complexity associated with the fact that their input type is not uniform: i.e. their API supports extending from a `&[T]` (e.g. `i32` or `i16`), which is the reason why they need to be implemented via a dynamic type, whose each implementation has methods for each type. In the `MutableArrayData`, the only "thing" that we extend from is an `ArrayData`, which has a uniform (rust) type, but requires a different behavior based on its `data_type` => function pointer per data-type. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
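The "function pointer per data-type" idea can be sketched in miniature. The following is an illustrative toy, not the actual arrow crate API (plain `Vec<u8>` stands in for the real `_MutableArrayData`, and the `DataType` enum is invented): inspect the `DataType` once, return a closure specialized for it, and pay no per-call type dispatch on the hot path.

```rust
// Illustrative sketch only: a miniature "mutable array data" that picks its
// `extend` strategy once, based on the data type, then calls it many times.
type Extend<'a> = Box<dyn Fn(&mut Vec<u8>, usize, usize) + 'a>;

enum DataType {
    Int32,
    Utf8,
}

// Inspect the DataType once; return a closure bound to the source bytes.
fn build_extend<'a>(data_type: &DataType, src: &'a [u8]) -> Extend<'a> {
    match data_type {
        // Fixed-width: copy `len` 4-byte values starting at element `start`.
        DataType::Int32 => Box::new(move |dst: &mut Vec<u8>, start: usize, len: usize| {
            dst.extend_from_slice(&src[start * 4..(start + len) * 4]);
        }),
        // Variable-width types would need an offsets buffer; simplified to raw bytes here.
        DataType::Utf8 => Box::new(move |dst: &mut Vec<u8>, start: usize, len: usize| {
            dst.extend_from_slice(&src[start..start + len]);
        }),
    }
}

fn main() {
    let src: Vec<u8> = (0u8..16).collect();

    let extend = build_extend(&DataType::Int32, &src);
    let mut dst = Vec::new();
    extend(&mut dst, 1, 2); // copy 4-byte values 1 and 2, i.e. bytes 4..12
    assert_eq!(dst, (4u8..12).collect::<Vec<u8>>());

    let extend_utf8 = build_extend(&DataType::Utf8, &src);
    let mut dst2 = Vec::new();
    extend_utf8(&mut dst2, 0, 3);
    assert_eq!(dst2, vec![0, 1, 2]);
}
```

The dispatch cost is paid once in `build_extend`; every subsequent call goes straight to a monomorphic copy loop.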
[GitHub] [arrow] pitrou commented on a change in pull request #8612: ARROW-8199: [C++] Add support for multi-column sort indices on Table
pitrou commented on a change in pull request #8612: URL: https://github.com/apache/arrow/pull/8612#discussion_r521405090 ## File path: cpp/src/arrow/compute/api_vector.h ## @@ -58,6 +58,34 @@ struct ARROW_EXPORT TakeOptions : public FunctionOptions { static TakeOptions Defaults() { return BoundsCheck(); } }; +enum SortOrder { Review comment: Make this a `enum class`? ## File path: cpp/src/arrow/compute/api_vector.h ## @@ -58,6 +58,34 @@ struct ARROW_EXPORT TakeOptions : public FunctionOptions { static TakeOptions Defaults() { return BoundsCheck(); } }; +enum SortOrder { + ASCENDING, + DESCENDING, Review comment: Nit, but sometimes all-uppercase values conflict with some C macros (especially on Windows). `Ascending` perhaps? ## File path: cpp/src/arrow/compute/api_vector.h ## @@ -165,8 +194,50 @@ Result> NthToIndices(const Array& values, int64_t n, /// \param[in] ctx the function execution context, optional /// \return offsets indices that would sort an array ARROW_EXPORT -Result> SortToIndices(const Array& values, - ExecContext* ctx = NULLPTR); +Result> SortIndices(const Array& values, + ExecContext* ctx = NULLPTR); + +/// \brief Returns the indices that would sort an array in the +/// specified order. +/// +/// Perform an indirect sort of array. The output array will contain +/// indices that would sort an array, which would be the same length +/// as input. Nulls will be stably partitioned to the end of the output. Review comment: Add "regardless of order"? 
## File path: cpp/src/arrow/compute/kernels/vector_sort.cc ## @@ -346,14 +374,396 @@ void AddSortingKernels(VectorKernel base, VectorFunction* func) { } } +class TableSorter : public TypeVisitor { + private: + struct ResolvedSortKey { +ResolvedSortKey(const ChunkedArray& chunked_array, const SortOrder order) +: order(order) { + type = chunked_array.type().get(); + null_count = chunked_array.null_count(); + num_chunks = chunked_array.num_chunks(); + for (const auto& chunk : chunked_array.chunks()) { +chunks.push_back(chunk.get()); + } +} + +template <typename ArrayType> +ArrayType* ResolveChunk(int64_t index, int64_t& chunk_index) const { + if (num_chunks == 1) { +chunk_index = index; +return static_cast<ArrayType*>(chunks[0]); + } else { +int64_t offset = 0; +for (size_t i = 0; i < num_chunks; ++i) { + if (index < offset + chunks[i]->length()) { +chunk_index = index - offset; +return static_cast<ArrayType*>(chunks[i]); + } + offset += chunks[i]->length(); +} +return nullptr; + } +} + +SortOrder order; +DataType* type; +int64_t null_count; +size_t num_chunks; Review comment: `int64_t`? 
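Separately from the integer-type question, the `ResolveChunk` logic reduces to "walk the chunk lengths until the global index falls inside one, then return the chunk number and the index within it". A toy model of just that index arithmetic (a sketch, not the PR's code):

```rust
// Sketch of ResolveChunk: map a logical row index over a chunked array
// to (chunk number, index within that chunk). Returns None when out of range.
fn resolve_chunk(chunk_lengths: &[i64], index: i64) -> Option<(usize, i64)> {
    let mut offset = 0;
    for (i, &len) in chunk_lengths.iter().enumerate() {
        if index < offset + len {
            return Some((i, index - offset));
        }
        offset += len;
    }
    None
}

fn main() {
    let chunks = [3, 5, 2]; // three chunks, 10 rows total
    assert_eq!(resolve_chunk(&chunks, 0), Some((0, 0)));
    assert_eq!(resolve_chunk(&chunks, 4), Some((1, 1)));
    assert_eq!(resolve_chunk(&chunks, 9), Some((2, 1)));
    assert_eq!(resolve_chunk(&chunks, 10), None);
}
```

The single-chunk fast path in the C++ version simply skips this scan because the logical index and the in-chunk index coincide.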
## File path: cpp/src/arrow/compute/kernels/vector_sort.cc ## @@ -346,14 +374,396 @@ void AddSortingKernels(VectorKernel base, VectorFunction* func) { } } +class TableSorter : public TypeVisitor { + private: + struct ResolvedSortKey { +ResolvedSortKey(const ChunkedArray& chunked_array, const SortOrder order) +: order(order) { + type = chunked_array.type().get(); + null_count = chunked_array.null_count(); + num_chunks = chunked_array.num_chunks(); + for (const auto& chunk : chunked_array.chunks()) { +chunks.push_back(chunk.get()); + } +} + +template +ArrayType* ResolveChunk(int64_t index, int64_t& chunk_index) const { + if (num_chunks == 1) { +chunk_index = index; +return static_cast(chunks[0]); + } else { +int64_t offset = 0; +for (size_t i = 0; i < num_chunks; ++i) { + if (index < offset + chunks[i]->length()) { +chunk_index = index - offset; +return static_cast(chunks[i]); + } + offset += chunks[i]->length(); +} +return nullptr; + } +} + +SortOrder order; +DataType* type; +int64_t null_count; +size_t num_chunks; +std::vector chunks; + }; + + class Comparer : public TypeVisitor { + public: +Comparer(const Table& table, const std::vector& sort_keys) +: TypeVisitor(), status_(Status::OK()) { + for (const auto& sort_key : sort_keys) { +const auto& chunked_array = table.GetColumnByName(sort_key.name); +if (!chunked_array) { + status_ = Status::Invalid("Nonexistent sort key column: ", sort_key.name); + return; +} +sort_keys_.emplace_back(*chunked_array, sort_key.order); + } +} + +Status status() { return status_; } + +const std::vector& sort_keys() { return sort_keys_; } + +bool Compare(uint64_t left, uint64_t right, size_t start_sort_key_index) { + current_left_ = left; + current_right_ = right; +
[GitHub] [arrow] alamb opened a new pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb opened a new pull request #8639: URL: https://github.com/apache/arrow/pull/8639 The module has gotten fairly large and so refactoring it into smaller chunks will improve readability – as suggested by Jorge https://github.com/apache/arrow/pull/8619#pullrequestreview-527391221 This PR just moves code around -- it is not intended to change any semantics Reviewing it commit-by-commit might be helpful to see how each piece went I can also break it up into a sequence of smaller PRs if that would help review This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
pitrou commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521350486 ## File path: cpp/src/arrow/util/basic_decimal.cc ## @@ -490,49 +527,60 @@ static void FixDivisionSigns(BasicDecimal128* result, BasicDecimal128* remainder } } -/// \brief Build a BasicDecimal128 from a list of ints. +/// \brief Build a little endian array of uint64_t from a big endian array of uint32_t. +template +static DecimalStatus BuildFromArray(std::array* result_array, +uint32_t* array, int64_t length) { + for (int64_t i = length - 2 * N - 1; i >= 0; i--) { +if (array[i] != 0) { + return DecimalStatus::kOverflow; +} + } + int64_t next_index = length - 1; + for (size_t i = 0; i < N; i++) { +uint64_t lower_bits = (next_index < 0) ? 0 : array[next_index--]; +(*result_array)[i] = +(next_index < 0) Review comment: Hmm, I hadn't noticed the `next_index--` above... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
alamb commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521362589 ## File path: rust/arrow/src/array/transform/mod.rs ## @@ -0,0 +1,532 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of, sync::Arc}; + +use crate::{buffer::MutableBuffer, datatypes::DataType, util::bit_util}; + +use super::{ArrayData, ArrayDataRef}; + +mod boolean; +mod list; +mod primitive; +mod utils; +mod variable_size; + +type ExtendNullBits<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>; +// function that extends `[start..start+len]` to the mutable array. +// this is dynamic because different data_types influence how buffers and childs are extended. +type Extend<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>; Review comment: Rather than a dynamic function pointer to extend such a structure, I wonder if you could get by with `element_length` and `number_of_elements` ## File path: rust/arrow/src/array/transform/primitive.rs ## @@ -0,0 +1,37 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of}; + +use crate::{array::ArrayData, datatypes::ArrowNativeType}; + +use super::{Extend, _MutableArrayData}; + +pub(super) fn build_extend<T: ArrowNativeType>(array: &ArrayData) -> Extend { +let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; +Box::new( +move |mutable: &mut _MutableArrayData, start: usize, len: usize| { +let start = start * size_of::<T>(); Review comment: it seems to me that `size_of::<T>()` is the part here that is dependent on type -- maybe you could just use that as a number This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao edited a comment on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725500367 Thanks a lot, @alamb and @vertexclique . I agree with the naming issues here, and great insight into those crates. I do not have strong feelings about naming; I tried to be close to what I am familiar with from Rust (e.g. `Vec::extend`). A bit of background: @yordan-pavlov did an impressive job on #7798 to make filtering performant; It is _really_ hard to get that performance with a generalization. I had to re-write this code at least 3 times to get it to a stage with comparable performance and even then, you can see that it depends on the filter conditions. But there is a (IMO good) reason for generalizing the code in `filter`. Specifically, the use-case I am looking at is to remove most code from the `take` kernel, since it is doing what this struct is doing (and what `filter` was doing). There are implementation details that are different (`take`'s indices can be null and `filter` does not support `ArrayStruct` atm), but they are fundamentally doing the same thing: constructing an `ArrayData` by memcopying chunks of another `ArrayData`. Also, note that this goes beyond masking: it supports taking values repeatedly. I see it as a "builder" bound to a specific `ArrayData` (`ArrayDataBuilder` is already taken, though). It is not a builder like the builders in `builder.rs` because those memcopy data from rust native types, not Arrow types. The reason this performance remains comparable(-ish) to master is that when it binds to an `ArrayData`, it inspects its `DataType` to initialize functions (the `type Extend`) bound to that `ArrayData` that are performant. 
For example,

```rust
let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..];
Box::new(move |mutable: &mut _MutableArrayData, start: usize, len: usize| {
```

instead of

```rust
Box::new(move |mutable: &mut _MutableArrayData, array: &ArrayData, start: usize, len: usize| {
    let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..];
```

has a meaningful performance impact because it avoids checks on both `buffers()[0]` and `[something..]` on every call of `extend`. This check may seem small, but it significantly impacts performance because `extend` can be as small as: `copy 2 bits` (in the case of a boolean array). When there are many calls of this operation (e.g. filter every other slot), these checks are as or more expensive than the `memcopy` themselves. There is a further boost possible that I explored but did not finish, but it requires a bit of `unsafe` code: we know that these functions never outlive `MutableArrayData`. Therefore, we could actually unsafely pass `` to their initialization and avoid all costs associated with accessing `mutable.buffers()` and the like inside `extend`. In this case, we would use

```rust
type Extend<'a> = Box<dyn FnMut(usize, usize) + 'a>;
```

and bind the `_MutableArrayData` on `build_primitive_extend<'a, T>(array, data)`. This would be the closest to the current filter implementation that remains generic for other use-cases. My point is that there are really non-trivial optimizations done in this proposal that were borrowed from the excellent work from @yordan-pavlov and that are required to keep up with the very high bar set by #7798 This draft is trying to leverage that work on two of our kernels, `take` and `filter`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
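The hoisting described above can be demonstrated with plain byte slices. This is only an illustration with hypothetical helper names, not the PR's code: the hoisted variant slices (and bounds-checks) once when the closure is built, while the naive variant repeats that work on every call, even though both produce the same bytes.

```rust
// Hoisted variant: the slice is computed and bounds-checked once, at build time.
fn build_extend_hoisted<'a>(
    values: &'a [u8],
    offset: usize,
) -> impl Fn(&mut Vec<u8>, usize, usize) + 'a {
    let values = &values[offset..]; // checked once, here
    move |dst: &mut Vec<u8>, start: usize, len: usize| {
        dst.extend_from_slice(&values[start..start + len])
    }
}

// Naive variant: re-slices (and re-checks) on every single call.
fn build_extend_naive<'a>(
    values: &'a [u8],
    offset: usize,
) -> impl Fn(&mut Vec<u8>, usize, usize) + 'a {
    move |dst: &mut Vec<u8>, start: usize, len: usize| {
        let values = &values[offset..]; // repeated per call
        dst.extend_from_slice(&values[start..start + len])
    }
}

fn main() {
    let data = [10u8, 11, 12, 13, 14, 15];
    let mut a = Vec::new();
    let mut b = Vec::new();
    build_extend_hoisted(&data, 2)(&mut a, 1, 3);
    build_extend_naive(&data, 2)(&mut b, 1, 3);
    assert_eq!(a, b);
    assert_eq!(a, vec![13, 14, 15]);
}
```

When each call copies only a handful of bytes, the hoisted checks are where the savings come from.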
[GitHub] [arrow] alamb commented on pull request #8619: ARROW-10531: [Rust][DataFusion]: Add schema and graphviz formatting for LogicalPlans and a PlanVisitor
alamb commented on pull request #8619: URL: https://github.com/apache/arrow/pull/8619#issuecomment-725433229 Thanks @andygrove and @jorgecarleitao -- I plan to merge this PR and then make a new one to break the code into smaller modules This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] vertexclique commented on pull request #8635: ARROW-10551: [Rust] Fix unreproducible benches
vertexclique commented on pull request #8635: URL: https://github.com/apache/arrow/pull/8635#issuecomment-725439351 > I wonder if you can try using the seeded random approach with zor to see if that is as reproducible. will do. > That looks cool, but I am not enough of a numerical algorithms expert to evaluate the randomness properties of that algorithm dw, I will check if that's like how you reported, I will change to a constant seed with seededrng. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
jorgecarleitao commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521407494 ## File path: rust/arrow/src/array/transform/primitive.rs ## @@ -0,0 +1,37 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of}; + +use crate::{array::ArrayData, datatypes::ArrowNativeType}; + +use super::{Extend, _MutableArrayData}; + +pub(super) fn build_extend<T: ArrowNativeType>(array: &ArrayData) -> Extend { +let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..]; +Box::new( +move |mutable: &mut _MutableArrayData, start: usize, len: usize| { +let start = start * size_of::<T>(); Review comment: I try to use generics for things that are known at compile time and arguments for things that are only known at runtime. Maybe that is not the correct way of thinking? In this case in particular, the difference would be between `build_extend::<T>(array)` and `build_extend(array, size_of::<T>())`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
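Both shapes work and produce identical bytes; a minimal sketch of the two signatures under discussion, using plain byte slices and illustrative names (not the crate's actual API):

```rust
use std::mem::size_of;

// Compile-time variant: the element width comes from the type parameter
// and is baked in during monomorphization.
fn build_extend_generic<'a, T>(values: &'a [u8]) -> Box<dyn Fn(usize, usize) -> Vec<u8> + 'a> {
    Box::new(move |start, len| {
        values[start * size_of::<T>()..(start + len) * size_of::<T>()].to_vec()
    })
}

// Runtime variant: the element width is passed as a plain number.
fn build_extend_runtime<'a>(
    values: &'a [u8],
    elem_size: usize,
) -> Box<dyn Fn(usize, usize) -> Vec<u8> + 'a> {
    Box::new(move |start, len| values[start * elem_size..(start + len) * elem_size].to_vec())
}

fn main() {
    let bytes = [0u8, 1, 2, 3, 4, 5, 6, 7];
    let a = build_extend_generic::<i32>(&bytes); // size_of::<i32>() == 4
    let b = build_extend_runtime(&bytes, 4);
    assert_eq!(a(1, 1), b(1, 1)); // both select bytes 4..8
    assert_eq!(a(1, 1), vec![4, 5, 6, 7]);
}
```

The generic form lets the compiler fold the multiplication by a constant; the runtime form keeps one non-generic function at the cost of carrying the width as data.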
[GitHub] [arrow] github-actions[bot] commented on pull request #8638: ARROW-10558: [Python] Fix python S3 filesystem tests interdependence
github-actions[bot] commented on pull request #8638: URL: https://github.com/apache/arrow/pull/8638#issuecomment-725498272 https://issues.apache.org/jira/browse/ARROW-10558 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521484903 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -1264,31 +1050,6 @@ mod tests { Ok(()) } -#[test] -fn test_visitor() { -let schema = Schema::new(vec![]); -assert_eq!("[]", format!("{}", display_schema(&schema))); -} - -#[test] -fn test_display_empty_schema() { -let schema = Schema::new(vec![]); -assert_eq!("[]", format!("{}", display_schema(&schema))); -} - -#[test] -fn test_display_schema() { -let schema = Schema::new(vec![ -Field::new("id", DataType::Int32, false), -Field::new("first_name", DataType::Utf8, true), -]); - -assert_eq!( -"[id:Int32, first_name:Utf8;N]", -format!("{}", display_schema(&schema)) -); -} Review comment: Good. I removed ``` fn test_visitor() { let schema = Schema::new(vec![]); assert_eq!("[]", format!("{}", display_schema(&schema))); } ``` Which is redundant, except for the name, with the test immediately below: ``` #[test] fn test_display_empty_schema() { let schema = Schema::new(vec![]); assert_eq!("[]", format!("{}", display_schema(&schema))); } ``` Aka it was removing a copy/paste bug (that I added) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sweb opened a new pull request #8640: WIP: ARROW-4193: [Rust] Add support for decimal data type
sweb opened a new pull request #8640: URL: https://github.com/apache/arrow/pull/8640 This is very much work in progress. The idea is to implement `DecimalArray` based on the existing `FixedSizeBinaryArray`. At the current state this is mostly C The values are returned as `i128`. The underlying fixed size is calculated from the precision. ``` let values: [u8; 20] = [0, 0, 0, 0, 0, 2, 17, 180, 219, 192, 255, 255, 255, 255, 255, 253, 238, 75, 36, 64]; let array_data = ArrayData::builder(DataType::Decimal(23, 6)) .len(2) .add_buffer(Buffer::from([..])) .build(); let fixed_size_binary_array = DecimalArray::from(array_data); assert_eq!( 8_887_000_000, fixed_size_binary_array.value(0) ); assert_eq!( -8_887_000_000, fixed_size_binary_array.value(1) ); assert_eq!(10, fixed_size_binary_array.value_length()); ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
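The decoding sketched above (a fixed-width big-endian two's-complement value widened to `i128` by sign extension) can be illustrated standalone, reusing the example's own byte values. This is an assumption about the decoding scheme consistent with the example, not the PR's actual implementation:

```rust
// Decode a big-endian two's-complement byte slice (at most 16 bytes) into i128
// by sign-extending into a 16-byte buffer.
fn be_bytes_to_i128(bytes: &[u8]) -> i128 {
    assert!(bytes.len() <= 16);
    // Sign-extend: fill with 0xFF if the value is negative, 0x00 otherwise.
    let fill = if bytes.first().map_or(false, |b| b & 0x80 != 0) { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    i128::from_be_bytes(buf)
}

fn main() {
    // The two 10-byte values from the example above (Decimal(23, 6) => 10 bytes each).
    let values: [u8; 20] = [
        0, 0, 0, 0, 0, 2, 17, 180, 219, 192, // 8_887_000_000
        255, 255, 255, 255, 255, 253, 238, 75, 36, 64, // -8_887_000_000
    ];
    assert_eq!(be_bytes_to_i128(&values[0..10]), 8_887_000_000);
    assert_eq!(be_bytes_to_i128(&values[10..20]), -8_887_000_000);
}
```

The second value is simply the two's complement of the first within the 10-byte width, which is why the sign-extension step matters.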
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
jorgecarleitao commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521494620 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ //! Logical query plans can then be optimized and executed directly, or translated into //! physical query plans and executed. -use std::fmt::{self, Debug, Display}; -use std::{any::Any, collections::HashMap, collections::HashSet, sync::Arc}; - -use aggregates::{AccumulatorFunctionImplementation, StateTypeFunction}; -use arrow::{ -compute::can_cast_types, -datatypes::{DataType, Field, Schema, SchemaRef}, -}; - -use crate::datasource::parquet::ParquetTable; -use crate::datasource::TableProvider; -use crate::error::{DataFusionError, Result}; -use crate::{ -datasource::csv::{CsvFile, CsvReadOptions}, -physical_plan::udaf::AggregateUDF, -scalar::ScalarValue, -}; -use crate::{ -physical_plan::{ -aggregates, expressions::binary_operator_data_type, functions, udf::ScalarUDF, -}, -sql::parser::FileType, -}; -use arrow::record_batch::RecordBatch; -use functions::{ReturnTypeFunction, ScalarFunctionImplementation, Signature}; - +mod builder; +mod display; +mod expr; +mod extension; mod operators; +mod plan; +mod registry; + +pub use builder::LogicalPlanBuilder; +pub use display::display_schema; +pub use expr::{ +and, array, avg, binary_expr, col, concat, count, create_udaf, create_udf, +exprlist_to_fields, length, lit, max, min, sum, Expr, Literal, +}; +pub use extension::UserDefinedLogicalNode; pub use operators::Operator; - -fn create_function_name( -fun: , -distinct: bool, -args: &[Expr], -input_schema: , -) -> Result { -let names: Vec = args -.iter() -.map(|e| create_name(e, input_schema)) -.collect::>()?; -let distinct_str = match distinct { -true => "DISTINCT ", -false => "", -}; -Ok(format!("{}({}{})", fun, distinct_str, names.join(","))) -} - -/// Returns a readable name of an expression based on the input schema. 
-/// This function recursively transverses the expression for names such as "CAST(a > 2)". -fn create_name(e: , input_schema: ) -> Result { -match e { -Expr::Alias(_, name) => Ok(name.clone()), -Expr::Column(name) => Ok(name.clone()), -Expr::ScalarVariable(variable_names) => Ok(variable_names.join(".")), -Expr::Literal(value) => Ok(format!("{:?}", value)), -Expr::BinaryExpr { left, op, right } => { -let left = create_name(left, input_schema)?; -let right = create_name(right, input_schema)?; -Ok(format!("{} {:?} {}", left, op, right)) -} -Expr::Cast { expr, data_type } => { -let expr = create_name(expr, input_schema)?; -Ok(format!("CAST({} AS {:?})", expr, data_type)) -} -Expr::Not(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("NOT {}", expr)) -} -Expr::IsNull(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("{} IS NULL", expr)) -} -Expr::IsNotNull(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("{} IS NOT NULL", expr)) -} -Expr::ScalarFunction { fun, args, .. } => { -create_function_name(_string(), false, args, input_schema) -} -Expr::ScalarUDF { fun, args, .. } => { -create_function_name(, false, args, input_schema) -} -Expr::AggregateFunction { -fun, -distinct, -args, -.. -} => create_function_name(_string(), *distinct, args, input_schema), -Expr::AggregateUDF { fun, args } => { -let mut names = Vec::with_capacity(args.len()); -for e in args { -names.push(create_name(e, input_schema)?); -} -Ok(format!("{}({})", fun.name, names.join(","))) -} -other => Err(DataFusionError::NotImplemented(format!( -"Physical plan does not support logical expression {:?}", -other -))), -} -} - -/// Create field meta-data from an expression, for use in a result set schema -pub fn exprlist_to_fields(expr: &[Expr], input_schema: ) -> Result> { -expr.iter().map(|e| e.to_field(input_schema)).collect() -} - -/// `Expr` is a logical expression. A logical expression is something like `1 + 1`, or `CAST(c1 AS int)`. 
-/// Logical expressions know how to compute its [arrow::datatypes::DataType] and nullability. -/// `Expr` is a central struct of DataFusion's query API. -/// -/// # Examples -/// -/// ``` -/// # use datafusion::logical_plan::Expr; -/// # use datafusion::error::Result; -/// # fn main() -> Result<()> { -/// let expr = Expr::Column("c1".to_string()) + Expr::Column("c2".to_string()); -/// println!("{:?}", expr);
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521518379 ## File path: cpp/src/arrow/type.h ## @@ -1604,13 +1605,26 @@ class ARROW_EXPORT FieldRef { // -- // Schema +enum class Endianness { + LITTLE = 0, + BIG = 1, +#if ARROW_LITTLE_ENDIAN + NATIVE = LITTLE +#else + NATIVE = BIG +#endif +}; Review comment: @wesm Or, would it be possible to create a small PR to add an endian field to `DataType`? Then, I will continue to work for this PR after merging your PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
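For comparison, the `NATIVE` alias idea maps naturally onto Rust's `cfg!(target_endian = ...)`; a small illustrative sketch (not part of the PR):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Endianness {
    Little,
    Big,
}

impl Endianness {
    // Resolve the native byte order at compile time, mirroring the
    // NATIVE = LITTLE / BIG alias in the C++ enum above.
    const NATIVE: Endianness = if cfg!(target_endian = "big") {
        Endianness::Big
    } else {
        Endianness::Little
    };
}

fn main() {
    // Cross-check the compile-time alias against a runtime byte-order observation.
    let observed = if u16::to_ne_bytes(1)[0] == 1 {
        Endianness::Little
    } else {
        Endianness::Big
    };
    assert_eq!(Endianness::NATIVE, observed);
}
```

Like the C++ `#if ARROW_LITTLE_ENDIAN` branch, the choice is fixed per target at compile time rather than probed at runtime.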
[GitHub] [arrow] lidavidm edited a comment on pull request #8585: ARROW-10475: [C++][FlightRPC] handle IPv6 hosts
lidavidm edited a comment on pull request #8585: URL: https://github.com/apache/arrow/pull/8585#issuecomment-725554055 I've added a method on the Uri class now to format the host, though the implementation is essentially the naive one still (if ipv6 add brackets else pass through the host). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #8585: ARROW-10475: [C++][FlightRPC] handle IPv6 hosts
lidavidm commented on pull request #8585: URL: https://github.com/apache/arrow/pull/8585#issuecomment-725554055 I've added a method on the Uri class now to format the host, though the implementation is essentially the naive one still. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou opened a new pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
pitrou opened a new pull request #8642: URL: https://github.com/apache/arrow/pull/8642 Implement all flavours of binary-to-binary casting, except for fixed-size binary. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
github-actions[bot] commented on pull request #8642: URL: https://github.com/apache/arrow/pull/8642#issuecomment-725595691 https://issues.apache.org/jira/browse/ARROW-6071 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz closed pull request #8617: ARROW-10525: [C++] Fix crash on unsupported IPC stream
bkietz closed pull request #8617: URL: https://github.com/apache/arrow/pull/8617 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
github-actions[bot] commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725640851 Revision: 2c3dd4279cd5fd9d403bc8aa1f4e0ec3323d081c Submitted crossbow builds: [ursa-labs/crossbow @ actions-702](https://github.com/ursa-labs/crossbow/branches/all?query=actions-702)

|Task|Status|
|----|------|
|test-r-linux-as-cran|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-702-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-702-github-test-r-linux-as-cran)|

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521488424 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ (quoted diff, truncated in this archive: the monolithic `logical_plan/mod.rs` is split into `builder`, `display`, `expr`, `extension`, `operators`, `plan`, and `registry` submodules; `LogicalPlanBuilder`, `display_schema`, the `Expr` helper functions, `UserDefinedLogicalNode`, and `Operator` are re-exported via `pub use`; the inline `create_function_name`/`create_name` helpers and the `Expr` documentation move out of this file. The review comment text is not present in this extract.)
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521499870 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -664,14 +690,15 @@ Result<std::shared_ptr<RecordBatch>> ReadRecordBatch( std::shared_ptr<Schema> out_schema; // Empty means do not use std::vector<bool> inclusion_mask; - RETURN_NOT_OK(GetInclusionMaskAndOutSchema(schema, options.included_fields, + DictionaryKind kind; Review comment: Yes, it is not used. This is a dummy variable to allocate an IpcReadContext().
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521553659 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ (quoted diff: the same `logical_plan/mod.rs` module-split hunk quoted in the previous comment, truncated; the review comment text is not present in this extract.)
[GitHub] [arrow] github-actions[bot] commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
github-actions[bot] commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725616996 Revision: 2c3dd4279cd5fd9d403bc8aa1f4e0ec3323d081c Submitted crossbow builds: [ursa-labs/crossbow @ actions-701](https://github.com/ursa-labs/crossbow/branches/all?query=actions-701)

|Task|Status|
|----|------|
|homebrew-r-autobrew|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-701-travis-homebrew-r-autobrew.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|test-r-linux-as-cran|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-701-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-701-github-test-r-linux-as-cran)|
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rhub-ubuntu-gcc-release)|
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rocker-r-base-latest)|
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-bionic)|
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-centos6)|
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-centos8)|
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-opensuse15)|
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-r-rstudio-r-base-3.6-opensuse42)|
|test-ubuntu-18.04-r-sanitizer|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-701-azure-test-ubuntu-18.04-r-sanitizer)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-701-azure-test-ubuntu-18.04-r-sanitizer)|
[GitHub] [arrow] bkietz commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
bkietz commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725571750 @nealrichardson @romainfrancois I've rebased and
- moved forward declarations to c++ headers, so it's no longer necessary to forward declare things in arrow_exports.h
- #include all type_fwd.h headers in arrow_types.h so all forward declarations are visible to `r/src/*`
- moved specializations of `r6_class_name` to arrow_types.h so they'll also be visible to `r/src/*`
- deleted arrow_exports.h; arrowExports.cpp now includes arrow_types.h
[GitHub] [arrow] github-actions[bot] commented on pull request #8641: ARROW-8853: [Rust] [Integration Testing] Enable Flight tests
github-actions[bot] commented on pull request #8641: URL: https://github.com/apache/arrow/pull/8641#issuecomment-725588664 https://issues.apache.org/jira/browse/ARROW-8853
[GitHub] [arrow] pitrou commented on a change in pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
pitrou commented on a change in pull request #8642: URL: https://github.com/apache/arrow/pull/8642#discussion_r521577628 ## File path: cpp/src/arrow/compute/kernels/scalar_cast_string.cc ## @@ -92,12 +93,74 @@
 struct Utf8Validator { };
 template
-struct BinaryToStringSameWidthCastFunctor {
+constexpr bool has_smaller_width() {
+  return sizeof(typename I::offset_type) < sizeof(typename O::offset_type);
+}
+
+template <typename I, typename O>
+constexpr bool has_same_width() {
+  return sizeof(typename I::offset_type) == sizeof(typename O::offset_type);
+}
+
+// Cast same-width offsets (no-op)
+template <typename I, typename O>
+void CastBinaryToBinaryOffsets(enable_if_t<has_same_width<I, O>(), KernelContext*> ctx,
+                               const ArrayData& input, ArrayData* output) {}
+
+// Upcast offsets
+template <typename I, typename O>
+void CastBinaryToBinaryOffsets(enable_if_t<has_smaller_width<I, O>(), KernelContext*> ctx,
+                               const ArrayData& input, ArrayData* output) {
Review comment: Lack of `constexpr if` risks producing compile errors.
[GitHub] [arrow] bkietz commented on a change in pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
bkietz commented on a change in pull request #8642: URL: https://github.com/apache/arrow/pull/8642#discussion_r521577222 ## File path: cpp/src/arrow/compute/kernels/scalar_cast_string.cc ## @@ -92,12 +93,74 @@ (quoted diff: the same hunk shown in the previous comment) Review comment:
```suggestion
template <typename I, typename O>
void CastBinaryToBinaryOffsets(
    enable_if_t<has_same_width<I, O>() || has_smaller_width<I, O>(), KernelContext*> ctx,
    const ArrayData& input, ArrayData* output) {
  if (has_smaller_width<I, O>()) {
    // ...
  } else {
    // ...
  }
```
[GitHub] [arrow] pitrou commented on a change in pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
pitrou commented on a change in pull request #8642: URL: https://github.com/apache/arrow/pull/8642#discussion_r521584030 ## File path: cpp/src/arrow/compute/kernels/scalar_cast_string.cc ## @@ -92,12 +93,74 @@ (quoted diff: the same hunk shown in the comments above) Review comment: Horror.
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
jorgecarleitao commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521475124 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -1264,31 +1050,6 @@ mod tests { Ok(()) }
-#[test]
-fn test_visitor() {
-let schema = Schema::new(vec![]);
-assert_eq!("[]", format!("{}", display_schema(&schema)));
-}
-
-#[test]
-fn test_display_empty_schema() {
-let schema = Schema::new(vec![]);
-assert_eq!("[]", format!("{}", display_schema(&schema)));
-}
-
-#[test]
-fn test_display_schema() {
-let schema = Schema::new(vec![
-Field::new("id", DataType::Int32, false),
-Field::new("first_name", DataType::Utf8, true),
-]);
-
-assert_eq!(
-"[id:Int32, first_name:Utf8;N]",
-format!("{}", display_schema(&schema))
-);
-}
Review comment: Maybe one of the tests disappeared in this commit? (I count -3 +2) ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ (quoted diff: the same module-split hunk shown in the earlier comments on this file, truncated)
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521574396 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ (quoted diff, truncated: the deterministic `create_primitive_array`, `create_u8_array_with_nulls`, and `create_bool_array` helpers are replaced with versions that fill arrays from a thread-local RNG at a given null/trues density, "to avoid spurious compiler optimizations wrt to branching") Review comment: @jorgecarleitao the filter benchmarks may not simulate real-world use cases, but they are designed to test the code under specific conditions, such as the worst-case scenario of alternating 1s and 0s, where no batch can be skipped and all selected values have to be copied individually. How can this scenario be achieved with a randomly generated filter array? The other scenarios, mostly 0s (best performance, because most filter batches can be skipped and only a small number of selected values have to be copied) and mostly 1s (not as fast, but still faster than the worst case, because filter batches can be checked quickly and most values are copied in slices), should be easier to achieve with random filter arrays, but are they going to be repeatable?
[GitHub] [arrow] bkietz commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
bkietz commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725615670 [ASAN UBSAN CI failure](https://github.com/apache/arrow/pull/8256/checks?check_run_id=1386783871#step:8:2353) is ARROW-10525
[GitHub] [arrow] nealrichardson commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
nealrichardson commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725616136 @github-actions crossbow submit -g r
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521588710 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ (quoted diff: the same filter_kernels.rs benchmark hunk quoted in the previous comment, truncated) Review comment: @jorgecarleitao once the approach to the filter benchmarks has been finalized, would it be possible to rearrange the commits so that the same benchmark code can be used to benchmark both the old and new implementations, so that we can do a direct comparison?
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521467373 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ (quoted diff: the same module-split hunk shown in the earlier comments on this file) +mod builder; Review comment: This module now `pub use`s all `struct`s, `enum`s, and `trait`s that were public in this module, so no client code should need to change.
[GitHub] [arrow] alamb commented on pull request #8619: ARROW-10531: [Rust][DataFusion]: Add schema and graphviz formatting for LogicalPlans and a PlanVisitor
alamb commented on pull request #8619: URL: https://github.com/apache/arrow/pull/8619#issuecomment-725513681 https://github.com/apache/arrow/pull/8639 is the PR for breaking up the logical_plan module
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521501051 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -699,42 +726,45 @@ Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo, // Look up the dictionary value type, which must have been added to the // DictionaryMemo already prior to invoking this function - ARROW_ASSIGN_OR_RAISE(auto value_type, dictionary_memo->GetDictionaryType(id)); + ARROW_ASSIGN_OR_RAISE(auto value_type, context.dictionary_memo->GetDictionaryType(id)); // Load the dictionary data from the dictionary batch ArrayLoader loader(batch_meta, internal::GetMetadataVersion(message->version()), - options, file); - const auto dict_data = std::make_shared<ArrayData>(); + context.options, file); + auto dict_data = std::make_shared<ArrayData>(); const Field dummy_field("", value_type); RETURN_NOT_OK(loader.Load(&dummy_field, dict_data.get())); if (compression != Compression::UNCOMPRESSED) { ArrayDataVector dict_fields{dict_data}; -RETURN_NOT_OK(DecompressBuffers(compression, options, &dict_fields)); +RETURN_NOT_OK(DecompressBuffers(compression, context.options, &dict_fields)); + } + + // swap endian in dict_data if necessary (swap_endian == true) + if (context.swap_endian) { +SwapEndianArrayData(dict_data); } if (dictionary_batch->isDelta()) { -if (kind != nullptr) { - *kind = DictionaryKind::Delta; +if (context.kind != nullptr) { + *context.kind = DictionaryKind::Delta; } -return dictionary_memo->AddDictionaryDelta(id, dict_data); +return context.dictionary_memo->AddDictionaryDelta(id, dict_data); } ARROW_ASSIGN_OR_RAISE(bool inserted, -dictionary_memo->AddOrReplaceDictionary(id, dict_data)); - if (kind != nullptr) { -*kind = inserted ? DictionaryKind::New : DictionaryKind::Replacement; +context.dictionary_memo->AddOrReplaceDictionary(id, dict_data)); + if (context.kind != nullptr) { +*context.kind = inserted ? DictionaryKind::New : DictionaryKind::Replacement; } Review comment: I see. 
@kou Is it ok to exclude `param` from `IpcReadContext`?
[GitHub] [arrow] alamb commented on a change in pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
alamb commented on a change in pull request #8639: URL: https://github.com/apache/arrow/pull/8639#discussion_r521541312 ## File path: rust/datafusion/src/logical_plan/mod.rs ## @@ -21,2300 +21,21 @@ //! Logical query plans can then be optimized and executed directly, or translated into //! physical query plans and executed. -use std::fmt::{self, Debug, Display}; -use std::{any::Any, collections::HashMap, collections::HashSet, sync::Arc}; - -use aggregates::{AccumulatorFunctionImplementation, StateTypeFunction}; -use arrow::{ -compute::can_cast_types, -datatypes::{DataType, Field, Schema, SchemaRef}, -}; - -use crate::datasource::parquet::ParquetTable; -use crate::datasource::TableProvider; -use crate::error::{DataFusionError, Result}; -use crate::{ -datasource::csv::{CsvFile, CsvReadOptions}, -physical_plan::udaf::AggregateUDF, -scalar::ScalarValue, -}; -use crate::{ -physical_plan::{ -aggregates, expressions::binary_operator_data_type, functions, udf::ScalarUDF, -}, -sql::parser::FileType, -}; -use arrow::record_batch::RecordBatch; -use functions::{ReturnTypeFunction, ScalarFunctionImplementation, Signature}; - +mod builder; +mod display; +mod expr; +mod extension; mod operators; +mod plan; +mod registry; + +pub use builder::LogicalPlanBuilder; +pub use display::display_schema; +pub use expr::{ +and, array, avg, binary_expr, col, concat, count, create_udaf, create_udf, +exprlist_to_fields, length, lit, max, min, sum, Expr, Literal, +}; +pub use extension::UserDefinedLogicalNode; pub use operators::Operator; - -fn create_function_name( -fun: , -distinct: bool, -args: &[Expr], -input_schema: , -) -> Result { -let names: Vec = args -.iter() -.map(|e| create_name(e, input_schema)) -.collect::>()?; -let distinct_str = match distinct { -true => "DISTINCT ", -false => "", -}; -Ok(format!("{}({}{})", fun, distinct_str, names.join(","))) -} - -/// Returns a readable name of an expression based on the input schema. 
-/// This function recursively transverses the expression for names such as "CAST(a > 2)". -fn create_name(e: , input_schema: ) -> Result { -match e { -Expr::Alias(_, name) => Ok(name.clone()), -Expr::Column(name) => Ok(name.clone()), -Expr::ScalarVariable(variable_names) => Ok(variable_names.join(".")), -Expr::Literal(value) => Ok(format!("{:?}", value)), -Expr::BinaryExpr { left, op, right } => { -let left = create_name(left, input_schema)?; -let right = create_name(right, input_schema)?; -Ok(format!("{} {:?} {}", left, op, right)) -} -Expr::Cast { expr, data_type } => { -let expr = create_name(expr, input_schema)?; -Ok(format!("CAST({} AS {:?})", expr, data_type)) -} -Expr::Not(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("NOT {}", expr)) -} -Expr::IsNull(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("{} IS NULL", expr)) -} -Expr::IsNotNull(expr) => { -let expr = create_name(expr, input_schema)?; -Ok(format!("{} IS NOT NULL", expr)) -} -Expr::ScalarFunction { fun, args, .. } => { -create_function_name(_string(), false, args, input_schema) -} -Expr::ScalarUDF { fun, args, .. } => { -create_function_name(, false, args, input_schema) -} -Expr::AggregateFunction { -fun, -distinct, -args, -.. -} => create_function_name(_string(), *distinct, args, input_schema), -Expr::AggregateUDF { fun, args } => { -let mut names = Vec::with_capacity(args.len()); -for e in args { -names.push(create_name(e, input_schema)?); -} -Ok(format!("{}({})", fun.name, names.join(","))) -} -other => Err(DataFusionError::NotImplemented(format!( -"Physical plan does not support logical expression {:?}", -other -))), -} -} - -/// Create field meta-data from an expression, for use in a result set schema -pub fn exprlist_to_fields(expr: &[Expr], input_schema: ) -> Result> { -expr.iter().map(|e| e.to_field(input_schema)).collect() -} - -/// `Expr` is a logical expression. A logical expression is something like `1 + 1`, or `CAST(c1 AS int)`. 
-/// Logical expressions know how to compute its [arrow::datatypes::DataType] and nullability. -/// `Expr` is a central struct of DataFusion's query API. -/// -/// # Examples -/// -/// ``` -/// # use datafusion::logical_plan::Expr; -/// # use datafusion::error::Result; -/// # fn main() -> Result<()> { -/// let expr = Expr::Column("c1".to_string()) + Expr::Column("c2".to_string()); -/// println!("{:?}", expr); -/// #
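The removed `create_name` function above builds a readable display name by recursing over the expression tree. A minimal, self-contained sketch of that pattern (the `Expr` variants here are a hypothetical subset for illustration, not DataFusion's full enum):

```rust
// Hypothetical, trimmed-down expression tree; DataFusion's real `Expr` has many more variants.
#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(i64),
    BinaryExpr { left: Box<Expr>, op: String, right: Box<Expr> },
    Cast { expr: Box<Expr>, data_type: String },
}

// Recursively render a readable name, mirroring the shape of the removed `create_name`.
fn create_name(e: &Expr) -> String {
    match e {
        Expr::Column(name) => name.clone(),
        Expr::Literal(v) => format!("{}", v),
        Expr::BinaryExpr { left, op, right } => {
            format!("{} {} {}", create_name(left), op, create_name(right))
        }
        Expr::Cast { expr, data_type } => {
            format!("CAST({} AS {})", create_name(expr), data_type)
        }
    }
}

fn main() {
    let e = Expr::Cast {
        expr: Box::new(Expr::BinaryExpr {
            left: Box::new(Expr::Column("a".to_string())),
            op: ">".to_string(),
            right: Box::new(Expr::Literal(2)),
        }),
        data_type: "Int32".to_string(),
    };
    println!("{}", create_name(&e)); // CAST(a > 2 AS Int32)
}
```

Note how each branch only formats its own node and delegates to recursion for the children, which is what keeps the real function short despite the many variants.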
[GitHub] [arrow] nevi-me commented on a change in pull request #8641: ARROW-8853: [Rust] [Integration Testing] Enable Flight tests
nevi-me commented on a change in pull request #8641: URL: https://github.com/apache/arrow/pull/8641#discussion_r521574538 ## File path: rust/integration-testing/src/bin/flight-test-integration-client.rs ## @@ -0,0 +1,377 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +use arrow_integration_testing::{ +read_json_file, ArrowFile, AUTH_PASSWORD, AUTH_USERNAME, +}; + +use arrow::datatypes::SchemaRef; +use arrow::record_batch::RecordBatch; + +use arrow_flight::flight_service_client::FlightServiceClient; +use arrow_flight::{ +flight_descriptor::DescriptorType, BasicAuth, FlightData, HandshakeRequest, Location, +Ticket, +}; +use arrow_flight::{utils::flight_data_to_arrow_batch, FlightDescriptor}; + +use clap::{App, Arg}; +use futures::{channel::mpsc, sink::SinkExt, StreamExt}; +use prost::Message; +use tonic::{metadata::MetadataValue, Request, Status}; + +use std::sync::Arc; + +type Error = Box; +type Result = std::result::Result; + +type Client = FlightServiceClient; + +#[tokio::main] +async fn main() -> Result { +let matches = App::new("rust flight-test-integration-client") +.arg(Arg::with_name("host").long("host").takes_value(true)) +.arg(Arg::with_name("port").long("port").takes_value(true)) +.arg(Arg::with_name("path").long("path").takes_value(true)) +.arg( +Arg::with_name("scenario") +.long("scenario") +.takes_value(true), +) +.get_matches(); + +let host = matches.value_of("host").expect("Host is required"); +let port = matches.value_of("port").expect("Port is required"); + +match matches.value_of("scenario") { +Some("middleware") => middleware_scenario(host, port).await?, +Some("auth:basic_proto") => auth_basic_proto_scenario(host, port).await?, +Some(scenario_name) => unimplemented!("Scenario not found: {}", scenario_name), +None => { +let path = matches +.value_of("path") +.expect("Path is required if scenario is not specified"); +integration_test_scenario(host, port, path).await?; +} +} + +Ok(()) +} + +async fn middleware_scenario(host: , port: ) -> Result { +let url = format!("http://{}:{};, host, port); +let conn = tonic::transport::Endpoint::new(url)?.connect().await?; +let mut client = FlightServiceClient::with_interceptor(conn, middleware_interceptor); + +let mut descriptor = FlightDescriptor::default(); 
+descriptor.set_type(DescriptorType::Cmd); +descriptor.cmd = b"".to_vec(); + +// This call is expected to fail. +let resp = client +.get_flight_info(Request::new(descriptor.clone())) +.await; +match resp { +Ok(_) => return Err(Box::new(Status::internal("Expected call to fail"))), +Err(e) => { +let headers = e.metadata(); +let middleware_header = headers.get("x-middleware"); +let value = middleware_header.map(|v| v.to_str().unwrap()).unwrap_or(""); + +if value != "expected value" { +let msg = format!( +"Expected to receive header 'x-middleware: expected value', \ +but instead got: '{}'", +value +); +return Err(Box::new(Status::internal(msg))); +} + +eprintln!("Headers received successfully on failing call."); +} +} + +// This call should succeed +descriptor.cmd = b"success".to_vec(); +let resp = client.get_flight_info(Request::new(descriptor)).await?; + +let headers = resp.metadata(); +let middleware_header = headers.get("x-middleware"); +let value = middleware_header.map(|v| v.to_str().unwrap()).unwrap_or(""); + +if value != "expected value" { +let msg = format!( +"Expected to receive header 'x-middleware: expected value', \ +but instead got: '{}'", +value +); +return Err(Box::new(Status::internal(msg))); +} + +eprintln!("Headers received successfully on passing call."); + +Ok(()) +} + +fn middleware_interceptor(mut req: Request<()>) ->
[GitHub] [arrow] vivkong commented on pull request #8011: ARROW-9803: [Go] Add initial support for s390x
vivkong commented on pull request #8011: URL: https://github.com/apache/arrow/pull/8011#issuecomment-725525017 Hello @emkornfield, wondering if this can be considered for merging? I've updated the PR to use a constant to check for endianness. Thanks.
[GitHub] [arrow] github-actions[bot] commented on pull request #8639: ARROW-10559: [Rust][DataFusion] Split up logical_plan/mod.rs into sub modules
github-actions[bot] commented on pull request #8639: URL: https://github.com/apache/arrow/pull/8639#issuecomment-725524192 https://issues.apache.org/jira/browse/ARROW-10559
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521499370 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -107,6 +108,23 @@ Status InvalidMessageType(MessageType expected, MessageType actual) { // -- // Record batch read path +/// \brief Structure to keep common arguments to be passed +struct IpcReadContext { + IpcReadContext(DictionaryMemo* memo, const IpcReadOptions& option, DictionaryKind* kind, + bool swap) + : dictionary_memo(memo), options(option), kind(kind), swap_endian(swap) {} + + DictionaryMemo* dictionary_memo; + + const IpcReadOptions& options; + + DictionaryKind* kind; Review comment: This style, which uses a mutable pointer, follows the original code. Here is an example. My guess is that this code needs to have a special state `uninitialized`. https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L716-L726
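The pattern under discussion, bundling the shared read arguments into a context struct while keeping a nullable pointer as an optional out-parameter for the dictionary kind, can be sketched in Rust terms as follows (names and shapes are illustrative analogues, not Arrow's actual API):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DictionaryKind {
    New,
    Delta,
    Replacement,
}

// Illustrative analogue of IpcReadContext: shared inputs grouped into one struct,
// with an Option<&mut ...> standing in for the nullable DictionaryKind* pointer.
struct IpcReadContext<'a> {
    swap_endian: bool,
    kind: Option<&'a mut Option<DictionaryKind>>,
}

fn read_dictionary(ctx: &mut IpcReadContext, is_delta: bool) {
    if ctx.swap_endian {
        // a file written on a machine of the other endianness would be byte-swapped here
    }
    let kind = if is_delta {
        DictionaryKind::Delta
    } else {
        DictionaryKind::New
    };
    // Only written when the caller asked for it, like the `kind != nullptr` checks.
    if let Some(out) = ctx.kind.as_mut() {
        **out = Some(kind);
    }
}

fn main() {
    let mut kind = None; // "uninitialized" until a dictionary batch is read
    {
        let mut ctx = IpcReadContext { swap_endian: false, kind: Some(&mut kind) };
        read_dictionary(&mut ctx, true);
    }
    assert_eq!(kind, Some(DictionaryKind::Delta));
}
```

The `Option<DictionaryKind>` makes the "uninitialized" state mentioned in the comment explicit, where the C++ code relies on the caller checking whether the pointer was written.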
[GitHub] [arrow] github-actions[bot] commented on pull request #8640: WIP: ARROW-4193: [Rust] Add support for decimal data type
github-actions[bot] commented on pull request #8640: URL: https://github.com/apache/arrow/pull/8640#issuecomment-725542282 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename the pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] carols10cents opened a new pull request #8641: ARROW-8853: [Rust] [Integration Testing] Enable Flight tests
carols10cents opened a new pull request #8641: URL: https://github.com/apache/arrow/pull/8641 There are some parts of the tests that should probably be part of the `arrow-flight` crate instead, but that can be done in a future PR. Everything here is in the integration tests.
[GitHub] [arrow] bkietz commented on a change in pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
bkietz commented on a change in pull request #8642: URL: https://github.com/apache/arrow/pull/8642#discussion_r521583351 ## File path: cpp/src/arrow/compute/kernels/scalar_cast_string.cc ## @@ -92,12 +93,74 @@ struct Utf8Validator { }; template <typename O, typename I> -struct BinaryToStringSameWidthCastFunctor { +constexpr bool has_smaller_width() { + return sizeof(typename I::offset_type) < sizeof(typename O::offset_type); +} + +template <typename I, typename O> +constexpr bool has_same_width() { + return sizeof(typename I::offset_type) == sizeof(typename O::offset_type); +} + +// Cast same-width offsets (no-op) +template <typename I, typename O> +void CastBinaryToBinaryOffsets(enable_if_t<has_same_width<I, O>(), KernelContext*> ctx, + const ArrayData& input, ArrayData* output) {} + +// Upcast offsets +template <typename I, typename O> +void CastBinaryToBinaryOffsets(enable_if_t<has_smaller_width<I, O>(), KernelContext*> ctx, + const ArrayData& input, ArrayData* output) { Review comment: `if constexpr` can be emulated using `std::conditional_t<has_smaller_width<I, O>(), NarrowerImpl, WiderImpl>::template Exec(ctx, input, output)`
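The kernel above dispatches on offset width, and the interesting case is the offset upcast. A language-neutral sketch of the idea in plain Rust (no Arrow types; `widen_offsets` and `narrow_offsets` are hypothetical helper names): widening 32-bit value offsets (Binary/String) to 64-bit (LargeBinary/LargeString) rebuilds only the offsets buffer, the values buffer is untouched, while the same-width case needs no work at all and the downcast must check for overflow.

```rust
use std::convert::TryFrom;

// Widen i32 value offsets (as used by Binary/String) to i64 (LargeBinary/LargeString).
// The values buffer itself is unchanged; only the offsets buffer is rebuilt.
fn widen_offsets(offsets: &[i32]) -> Vec<i64> {
    offsets.iter().map(|&o| o as i64).collect()
}

// Downcasting must verify that every offset still fits in the narrower type.
fn narrow_offsets(offsets: &[i64]) -> Result<Vec<i32>, String> {
    offsets
        .iter()
        .map(|&o| i32::try_from(o).map_err(|_| format!("offset {} overflows i32", o)))
        .collect()
}

fn main() {
    assert_eq!(widen_offsets(&[0, 3, 7]), vec![0i64, 3, 7]);
    assert_eq!(narrow_offsets(&[0, 3, 7]).unwrap(), vec![0i32, 3, 7]);
    assert!(narrow_offsets(&[0, i64::from(i32::MAX) + 1]).is_err());
}
```

This is why the C++ kernel only needs distinct code paths for "smaller", "same", and (elsewhere) "larger" offset widths.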
[GitHub] [arrow] yordan-pavlov commented on pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on pull request #8630: URL: https://github.com/apache/arrow/pull/8630#issuecomment-725613377 @jorgecarleitao thank you for this PR. Overall I think it's a great idea to reuse code between the take and filter kernels if possible, and you have demonstrated how that can be done; we just have to find a way to keep performance at a good level. I have been wondering whether the BitChunkIterator from https://github.com/apache/arrow/blob/master/rust/arrow/src/util/bit_chunk_iterator.rs could be used to improve the filter kernel, so I am happy to see you have already done so.
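The idea behind using `BitChunkIterator` in the filter kernel is to scan the filter bitmap a machine word at a time, skipping all-zero words entirely and fast-pathing all-one words. A stdlib-only sketch of that strategy over `u64` chunks (this is an illustration of the technique, not the actual arrow crate code):

```rust
// Collect the indices of set bits, processing the packed bitmap 64 bits at a time.
fn selected_indices(bitmap: &[u64], len: usize) -> Vec<usize> {
    let mut out = Vec::new();
    for (chunk_idx, &chunk) in bitmap.iter().enumerate() {
        if chunk == 0 {
            continue; // fast path: nothing selected in these 64 slots
        }
        let base = chunk_idx * 64;
        if chunk == u64::MAX {
            // fast path: everything selected; a real kernel could copy one slice here
            out.extend(base..(base + 64).min(len));
            continue;
        }
        // mixed word: walk only the set bits
        let mut bits = chunk;
        while bits != 0 {
            let bit = bits.trailing_zeros() as usize;
            let idx = base + bit;
            if idx < len {
                out.push(idx);
            }
            bits &= bits - 1; // clear the lowest set bit
        }
    }
    out
}

fn main() {
    // bits 0 and 2 set in the first word, second word all zeros, bit 1 set in the third
    let bitmap = [0b101u64, 0, 0b10];
    assert_eq!(selected_indices(&bitmap, 192), vec![0, 2, 129]);
}
```

A mostly-zero filter touches almost no per-element code at all, which is exactly the "best case" discussed in the benchmark thread below.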
[GitHub] [arrow] nealrichardson commented on pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
nealrichardson commented on pull request #8256: URL: https://github.com/apache/arrow/pull/8256#issuecomment-725639997 @github-actions crossbow submit test-r-linux-as-cran
[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r521553432 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -699,42 +726,45 @@ Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo, // Look up the dictionary value type, which must have been added to the // DictionaryMemo already prior to invoking this function - ARROW_ASSIGN_OR_RAISE(auto value_type, dictionary_memo->GetDictionaryType(id)); + ARROW_ASSIGN_OR_RAISE(auto value_type, context.dictionary_memo->GetDictionaryType(id)); // Load the dictionary data from the dictionary batch ArrayLoader loader(batch_meta, internal::GetMetadataVersion(message->version()), - options, file); - const auto dict_data = std::make_shared<ArrayData>(); + context.options, file); + auto dict_data = std::make_shared<ArrayData>(); const Field dummy_field("", value_type); RETURN_NOT_OK(loader.Load(&dummy_field, dict_data.get())); if (compression != Compression::UNCOMPRESSED) { ArrayDataVector dict_fields{dict_data}; -RETURN_NOT_OK(DecompressBuffers(compression, options, &dict_fields)); +RETURN_NOT_OK(DecompressBuffers(compression, context.options, &dict_fields)); + } + + // swap endian in dict_data if necessary (swap_endian == true) + if (context.swap_endian) { +SwapEndianArrayData(dict_data); } if (dictionary_batch->isDelta()) { -if (kind != nullptr) { - *kind = DictionaryKind::Delta; +if (context.kind != nullptr) { + *context.kind = DictionaryKind::Delta; } -return dictionary_memo->AddDictionaryDelta(id, dict_data); +return context.dictionary_memo->AddDictionaryDelta(id, dict_data); } ARROW_ASSIGN_OR_RAISE(bool inserted, -dictionary_memo->AddOrReplaceDictionary(id, dict_data)); - if (kind != nullptr) { -*kind = inserted ? DictionaryKind::New : DictionaryKind::Replacement; +context.dictionary_memo->AddOrReplaceDictionary(id, dict_data)); + if (context.kind != nullptr) { +*context.kind = inserted ? 
DictionaryKind::New : DictionaryKind::Replacement; } Review comment: nvm, after your suggestion, I added `param` into `IpcReadContext`. This is because `param` was added only recently. I will keep `DictionaryKind` as an out param.
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521574396 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array<T, F>(size: usize, value_fn: F) -> PrimitiveArray<T> +fn create_primitive_array<T>(size: usize, null_density: f32) -> PrimitiveArray<T> where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution<T::Native>, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); let mut builder = PrimitiveArray::<T>::builder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::<f32>() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::<f32>() < null_density { builder.append_null().unwrap(); +} else { +let value = rng.sample_iter(&Alphanumeric).take(10).collect::<String>(); +builder.append_value(&value).unwrap(); } } builder.finish() } -fn create_bool_array<F>(size: 
usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = rand::thread_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::<f32>() < trues_density; Review comment: the filter benchmarks may not simulate real-world use cases, but they are designed to test the code under specific conditions. One is the worst-case scenario of alternating 1s and 0s, where no batch can be skipped and every selected value has to be copied individually; how can this scenario be achieved with a randomly generated filter array? The other scenarios, mostly 0s (best performance, because most filter batches can be skipped and only a small number of selected values have to be copied) and mostly 1s (not as fast, but still faster than the worst case, because filter batches can be checked quickly and most values are copied in slices), should be easier to achieve with random filter arrays, but are they going to be repeatable?
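On the repeatability question raised above: a randomly generated filter array is reproducible across benchmark runs if the generator is seeded. A stdlib-only sketch using a tiny inline xorshift PRNG (the real benchmarks use the `rand` crate; `create_bool_array` here is a deterministic stand-in for the benchmark helper, returning `Vec<bool>` instead of a `BooleanArray`):

```rust
// Minimal deterministic PRNG (xorshift64), so the same seed yields the same array.
struct XorShift64(u64);

impl XorShift64 {
    fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // map the top 24 bits into [0, 1)
        (self.0 >> 40) as f32 / (1u64 << 24) as f32
    }
}

// Deterministic analogue of create_bool_array(size, trues_density).
fn create_bool_array(size: usize, trues_density: f32, seed: u64) -> Vec<bool> {
    let mut rng = XorShift64(seed); // seed must be nonzero for xorshift
    (0..size).map(|_| rng.next_f32() < trues_density).collect()
}

fn main() {
    let a = create_bool_array(1024, 0.5, 42);
    let b = create_bool_array(1024, 0.5, 42);
    assert_eq!(a, b); // same seed -> identical filter array across benchmark runs
    // densities 0.0 and 1.0 are exact, since next_f32 is in [0, 1)
    assert!(create_bool_array(64, 0.0, 1).iter().all(|&v| !v));
    assert!(create_bool_array(64, 1.0, 1).iter().all(|&v| v));
}
```

Seeding gives the "mostly 0s" and "mostly 1s" scenarios statistically, while the alternating worst case would still need a deterministic `i % 2 == 0` style generator.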
[GitHub] [arrow] carols10cents commented on a change in pull request #8641: ARROW-8853: [Rust] [Integration Testing] Enable Flight tests
carols10cents commented on a change in pull request #8641: URL: https://github.com/apache/arrow/pull/8641#discussion_r521618644 ## File path: rust/integration-testing/src/bin/flight-test-integration-server.rs ## @@ -0,0 +1,634 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +use std::collections::HashMap; +use std::convert::TryFrom; +use std::net::SocketAddr; +use std::pin::Pin; +use std::sync::Arc; + +use clap::{App, Arg}; +use futures::{channel::mpsc, sink::SinkExt, Stream, StreamExt}; +use prost::Message; +use tokio::net::TcpListener; +use tokio::sync::Mutex; +use tonic::transport::Server; +use tonic::{metadata::MetadataMap, Request, Response, Status, Streaming}; + +use arrow::{datatypes::Schema, record_batch::RecordBatch}; +use arrow_flight::{ +flight_descriptor::DescriptorType, flight_service_server::FlightService, +flight_service_server::FlightServiceServer, utils::flight_data_to_arrow_batch, +Action, ActionType, BasicAuth, Criteria, Empty, FlightData, FlightDescriptor, +FlightEndpoint, FlightInfo, HandshakeRequest, HandshakeResponse, Location, PutResult, +SchemaResult, Ticket, +}; + +use arrow_integration_testing::{AUTH_PASSWORD, AUTH_USERNAME}; + +type TonicStream = Pin + Send + Sync + 'static>>; + +#[derive(Debug, Clone)] +struct IntegrationDataset { +schema: Schema, +chunks: Vec, +} + +#[derive(Clone, Default)] +pub struct FlightServiceImpl { +server_location: String, +uploaded_chunks: Arc>>, +} + +#[tonic::async_trait] +impl FlightService for FlightServiceImpl { +type HandshakeStream = TonicStream>; +type ListFlightsStream = TonicStream>; +type DoGetStream = TonicStream>; +type DoPutStream = TonicStream>; +type DoActionStream = TonicStream>; +type ListActionsStream = TonicStream>; +type DoExchangeStream = TonicStream>; + +async fn get_schema( +, +_request: Request, +) -> Result, Status> { +Err(Status::unimplemented("Not yet implemented")) Review comment: Yes, there's nothing I can see that calls these methods in the integration tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #8256: ARROW-9001: [R] Box outputs as correct type in call_function
nealrichardson closed pull request #8256: URL: https://github.com/apache/arrow/pull/8256
[GitHub] [arrow] a-campbell opened a new issue #8646: Predicate pushdown question
a-campbell opened a new issue #8646: URL: https://github.com/apache/arrow/issues/8646 Hi Arrow community, I'm new to the project and am trying to understand exactly what happens under the hood when I run a filter-collect query on an Arrow Dataset (backed by Parquet). Let's say I created a Parquet dataset with no file-level partitions; I just wrote a bunch of separate files to a dataset. Now I want to run a query that returns the rows corresponding to a specific range of datetimes in the dataset's dt column. My understanding is that the Dataset API will push this query down to the file level, checking the footer of each file for the min/max value of dt and determining whether that block of rows should be read. Assuming this is correct, a few questions: Will every query result in reading all of the file footers? Is there any caching of these min/max values? Is there a way to profile query performance, or a way to view a query plan before it is executed? I appreciate your time in helping me better understand. Andrew
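Conceptually, the footer-statistics pruning the question describes works like this: each file carries a (min, max) for the column, and a range predicate only needs to read the files whose interval overlaps the query range. A hypothetical stand-alone sketch of that decision (not Arrow's Dataset implementation; names and the epoch-seconds encoding are illustrative):

```rust
// Per-file column statistics, as would be read from a Parquet footer.
struct FileStats {
    path: &'static str,
    min_dt: i64, // e.g. epoch seconds for the `dt` column
    max_dt: i64,
}

// Keep only files whose [min_dt, max_dt] interval overlaps the query range [lo, hi].
fn prune<'a>(files: &'a [FileStats], lo: i64, hi: i64) -> Vec<&'a str> {
    files
        .iter()
        .filter(|f| f.max_dt >= lo && f.min_dt <= hi)
        .map(|f| f.path)
        .collect()
}

fn main() {
    let files = [
        FileStats { path: "part-0.parquet", min_dt: 0, max_dt: 99 },
        FileStats { path: "part-1.parquet", min_dt: 100, max_dt: 199 },
        FileStats { path: "part-2.parquet", min_dt: 200, max_dt: 299 },
    ];
    // Query dt in [150, 250]: only part-1 and part-2 can contain matching rows.
    assert_eq!(prune(&files, 150, 250), vec!["part-1.parquet", "part-2.parquet"]);
}
```

Whether the footers are read once per query or cached is exactly the question being asked, so this sketch deliberately stops at the overlap test itself.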
[GitHub] [arrow] wesm commented on pull request #8461: ARROW-10197: [python][Gandiva] Execute expression on filtered data
wesm commented on pull request #8461: URL: https://github.com/apache/arrow/pull/8461#issuecomment-725667526 @pitrou or @bkietz could one of you have a look?
[GitHub] [arrow] nealrichardson commented on a change in pull request #8579: ARROW-10481: [R] Bindings to add, remove, replace Table columns
nealrichardson commented on a change in pull request #8579: URL: https://github.com/apache/arrow/pull/8579#discussion_r521631559 ## File path: r/R/table.R ## @@ -254,6 +257,68 @@ names.Table <- function(x) x$ColumnNames() #' @export `[[.Table` <- `[[.RecordBatch` +#' @export +`[[<-.Table` <- function(x, i, value) { + if (!is.character(i) & !is.numeric(i)) { +stop("'i' must be character or numeric, not ", class(i), call. = FALSE) + } else if (is.na(i)) { +# Catch if a NA_character or NA_integer is passed. These are caught elsewhere +# in cpp (i.e. _arrow_RecordBatch__column_name) +# TODO: figure out if catching in cpp like ^^^ is preferred Review comment: I suspect that error comes from https://github.com/apache/arrow/pull/8579/files#diff-242f7f20fb4999f3bbe719d6977d2580231b89334326b588d46e9236c2e28fcdR285, i.e. before we hit cpp.
[GitHub] [arrow] bkietz closed pull request #8642: ARROW-6071: [C++] Generic binary-to-binary casts
bkietz closed pull request #8642: URL: https://github.com/apache/arrow/pull/8642
[GitHub] [arrow] nealrichardson closed pull request #8616: ARROW-10522: [R] Allow rename Table and RecordBatch columns with names()
nealrichardson closed pull request #8616: URL: https://github.com/apache/arrow/pull/8616
[GitHub] [arrow] kou closed pull request #8011: ARROW-9803: [Go] Add initial support for s390x
kou closed pull request #8011: URL: https://github.com/apache/arrow/pull/8011
[GitHub] [arrow] nealrichardson closed pull request #8643: ARROW-10522: [R] Allow rename Table and RecordBatch columns with names()
nealrichardson closed pull request #8643: URL: https://github.com/apache/arrow/pull/8643
[GitHub] [arrow] jorgecarleitao opened a new pull request #8645: ARROW-10561: [Rust] Small simplification of `write` and `write_bytes`
jorgecarleitao opened a new pull request #8645: URL: https://github.com/apache/arrow/pull/8645

This PR addresses 3 small issues on `MutableBuffer`:

1. `write_bytes` is incorrect, as it double-increments `len`: the length is incremented both by `self.write` and by `write_bytes` itself. This leads to more allocations than necessary.

2. `write` is implemented from the trait `io::Write`. However, that trait is meant for fallible IO operations. A write to memory is not really fallible, because we can always call `reserve` to allocate more space.

3. `write` and `write_bytes` are very similar. This PR replaces both `write_bytes` and `write` with `extend_from_slice` (inspired by [`Vec::extend_from_slice`](https://doc.rust-lang.org/std/vec/struct.Vec.html#method.extend_from_slice)). This has the same performance (a single capacity comparison) or better. Compared to the current code, the main difference is that the new method is infallible: it `reserve`s more space whenever the available capacity is insufficient.
[GitHub] [arrow] Bei-z commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
Bei-z commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521719735

## File path: cpp/src/arrow/util/basic_decimal.cc ##

@@ -395,33 +395,49 @@ BasicDecimal128& BasicDecimal128::operator*=(const BasicDecimal128& right) { return *this; }

-/// Expands the given value into an array of ints so that we can work on
-/// it. The array will be converted to an absolute value and the wasNegative
+/// Expands the given little endian array of uint64_t into a big endian array of
+/// uint32_t. The value of input array is expected to be non-negative. The result_array
+/// will remove leading zeros from the input array.
+/// \param value_array a little endian array to represent the value
+/// \param result_array a big endian array of length N*2 to set with the value
+/// \result the output length of the array
+template <size_t N>
+static int64_t FillInArray(const std::array<uint64_t, N>& value_array,
+                           uint32_t* result_array) {
+  int64_t next_index = 0;
+  for (int64_t i = N - 1; i >= 0; i--) {
+    if (value_array[i] != 0) {
+      if (value_array[i] <= std::numeric_limits<uint32_t>::max()) {
+        result_array[next_index++] = static_cast<uint32_t>(value_array[i]);
+        i--;
+      }
+      for (int64_t j = i; j >= 0; j--) {
+        result_array[next_index++] = static_cast<uint32_t>(value_array[j] >> 32);
+        result_array[next_index++] = static_cast<uint32_t>(value_array[j]);
+      }
+      break;
+    }
+  }
+  return next_index;
+}
+
+/// Expands the given value into a big endian array of ints so that we can work on
 /// it. The array will be converted to an absolute value and the was_negative
 /// flag will be set appropriately. The array will remove leading zeros from
 /// the value.
-/// \param array an array of length 4 to set with the value
+/// \param array a big endian array of length 4 to set with the value
 /// \param was_negative a flag for whether the value was original negative
 /// \result the output length of the array
 static int64_t FillInArray(const BasicDecimal128& value, uint32_t* array,
                            bool& was_negative) {
-  uint64_t high;
-  uint64_t low;
-  const int64_t highbits = value.high_bits();
-  const uint64_t lowbits = value.low_bits();
-
-  if (highbits < 0) {
-    low = ~lowbits + 1;
-    high = static_cast<uint64_t>(~highbits);
-    if (low == 0) {
-      ++high;
-    }
-    was_negative = true;
-  } else {
-    low = lowbits;
-    high = static_cast<uint64_t>(highbits);
-    was_negative = false;
-  }
-
+  BasicDecimal128 abs_value = BasicDecimal128::Abs(value);
+  was_negative = value.high_bits() < 0;
+  uint64_t high = static_cast<uint64_t>(abs_value.high_bits());
+  uint64_t low = abs_value.low_bits();
+
+  // FillInArray(std::array<uint64_t, N>& value_array, uint32_t* result_array) is not
+  // called here as the following code has better performance, to avoid regression on
+  // BasicDecimal128 Division.
   if (high != 0) {
     if (high > std::numeric_limits<uint32_t>::max()) {

Review comment: Ah, sorry, I just got what you mean here! And yes, it should be strict inequality in both places. Changed accordingly.
[GitHub] [arrow] Bei-z commented on a change in pull request #8542: ARROW-10407: [C++] Add BasicDecimal256 Division Support
Bei-z commented on a change in pull request #8542: URL: https://github.com/apache/arrow/pull/8542#discussion_r521719841

## File path: cpp/src/arrow/util/basic_decimal.cc ##

@@ -490,49 +527,60 @@ static void FixDivisionSigns(BasicDecimal128* result, BasicDecimal128* remainder

-/// \brief Build a BasicDecimal128 from a list of ints.
+/// \brief Build a little endian array of uint64_t from a big endian array of uint32_t.
+template <size_t N>
+static DecimalStatus BuildFromArray(std::array<uint64_t, N>* result_array,
+                                    uint32_t* array, int64_t length) {
+  for (int64_t i = length - 2 * N - 1; i >= 0; i--) {
+    if (array[i] != 0) {
+      return DecimalStatus::kOverflow;
+    }
+  }
+  int64_t next_index = length - 1;
+  for (size_t i = 0; i < N; i++) {
+    uint64_t lower_bits = (next_index < 0) ? 0 : array[next_index--];
+    (*result_array)[i] =
+        (next_index < 0)

Review comment: In this case, would you recommend keeping it this way, or changing back to one loop?