[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Bankim Bhavsar has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15913 ) Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. [perf] Check range predicate first while evaluating Bloom filter predicate Range predicates can be specified along with Bloom filter predicate for the same column. It's cheaper to check against range predicate and exit early if the column value is out of bounds compared to computing hash and then looking up the value in Bloom filter. This case is common when Impala pushes down Bloom filter predicate as it'll likely be accompained by min-max filter (i.e. range predicate) on the same column. Tests: Added a test case that scans against 1M column values with a range predicate and Bloom filter predicate. In one case, with a range predicate that yields no rows and other with a range predicate that yields all rows. Modified the test case to run against 100M rows on a 48 core Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz with 94GB of memory. Across iterations observed an improvement of 20-30% when the range predicate yields no rows preventing hash computation and Bloom filter lookup. Don't see any noticeable regression for the case where values are within range bounds. Without perf change: Counting rows with a range predicate less than the min value: real 0.953s user 0.001s sys 0.000s Counting rows with a range predicate that includes all values: real 0.767s user 0.001s sys 0.000s Counting rows with a range predicate less than the min value: real 0.899s user 0.000s sys 0.000s Counting rows with a range predicate that includes all values: real 0.775s user 0.000s sys 0.001s Counting rows with a range predicate less than the min value: real 0.983s user 0.000s sys 0.000s Counting rows with a range predicate that includes all values: real 0.832s user 0.001s sys 0.000s With perf change: Counting rows with a range predicate less than the min value: real 0.725s user 0.001s sys 0.000s Counting rows with a range predicate that includes all values: real 0.847s user 0.000s sys 0.000s Counting rows with a range predicate less than the min value: real 0.664s user 0.000s sys 0.000s Counting rows with a range predicate that includes all values: real 0.794s user 0.001s sys 0.000s Counting rows with a range predicate less than the min value: real 0.706s user 0.001s sys 0.000s Counting rows with a range predicate that includes all values: real 0.774s user 0.000s sys 0.000s Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Reviewed-on: http://gerrit.cloudera.org:8080/15913 Reviewed-by: Andrew Wong Tested-by: Kudu Jenkins --- M src/kudu/client/predicate-test.cc M src/kudu/common/column_predicate.h 2 files changed, 113 insertions(+), 88 deletions(-) Approvals: Andrew Wong: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 6 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241)
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15913 ) Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. Patch Set 5: Code-Review+2 (5 comments) http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG@19 PS3, Line 19: 1M c > nit: 1M here too Done http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG@25 PS3, Line 25: > nit: would be nice to describe what these are and why the results look the Done http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc File src/kudu/client/predicate-test.cc: http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1305 PS3, Line 1305: CreateRandomUniqueIntegers > Bunch of callers to that function unrelated to this change. Would prefer do Sounds good http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1355 PS3, Line 1355: Counting rows with a range predicat > Fixed the log description. OK, though if the goal is to make it easy to find this output when benchmarking, --gtest_filter is probably sufficient since we'd probably only be running the benchmark test. http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1471 PS3, Line 1471: KuduColumnSchema::VARCH > NewInBloomFilterPredicate() takes ownership of the predicate. Ack -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 5 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Fri, 15 May 2020 19:08:08 + Gerrit-HasComments: Yes
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15913 to look at the new patch set (#5). Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. [perf] Check range predicate first while evaluating Bloom filter predicate Range predicates can be specified along with Bloom filter predicate for the same column. It's cheaper to check against range predicate and exit early if the column value is out of bounds compared to computing hash and then looking up the value in Bloom filter. This case is common when Impala pushes down Bloom filter predicate as it'll likely be accompained by min-max filter (i.e. range predicate) on the same column. Tests: Added a test case that scans against 1M column values with a range predicate and Bloom filter predicate. In one case, with a range predicate that yields no rows and other with a range predicate that yields all rows. Modified the test case to run against 100M rows on a 48 core Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz with 94GB of memory. Across iterations observed an improvement of 20-30% when the range predicate yields no rows preventing hash computation and Bloom filter lookup. Don't see any noticeable regression for the case where values are within range bounds. Without perf change: Counting rows with a range predicate less than the min value: real 0.953s user 0.001s sys 0.000s Counting rows with a range predicate that includes all values: real 0.767s user 0.001s sys 0.000s Counting rows with a range predicate less than the min value: real 0.899s user 0.000s sys 0.000s Counting rows with a range predicate that includes all values: real 0.775s user 0.000s sys 0.001s Counting rows with a range predicate less than the min value: real 0.983s user 0.000s sys 0.000s Counting rows with a range predicate that includes all values: real 0.832s user 0.001s sys 0.000s With perf change: Counting rows with a range predicate less than the min value: real 0.725s user 0.001s sys 0.000s Counting rows with a range predicate that includes all values: real 0.847s user 0.000s sys 0.000s Counting rows with a range predicate less than the min value: real 0.664s user 0.000s sys 0.000s Counting rows with a range predicate that includes all values: real 0.794s user 0.001s sys 0.000s Counting rows with a range predicate less than the min value: real 0.706s user 0.001s sys 0.000s Counting rows with a range predicate that includes all values: real 0.774s user 0.000s sys 0.000s Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 --- M src/kudu/client/predicate-test.cc M src/kudu/common/column_predicate.h 2 files changed, 113 insertions(+), 88 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/5 -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 5 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241)
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15913 ) Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. Patch Set 5: (10 comments) http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc File src/kudu/client/predicate-test.cc: http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1287 PS3, Line 1287: included_values_; > nit: same here, with "included". Done http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1289 PS3, Line 1289: excluded_values_; > nit: I think "excluded" is a more appropriate word. "exclusive" doesn't mak Done http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1293 PS3, Line 1293: // Initialize the members with the values that will be inserted to the table and Bloom filters. : // The table will have 'num_all_values' unique values, and we'll generate two Bloom filters: : // one with 'num_included_values' values from the table, and one with 'num_excluded_values' : // values that aren't present in the table. > nit: I found myself reading through other tests to better understand what t Done http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1305 PS3, Line 1305: CreateRandomUniqueIntegers > nit: not related to this patch, but consider renaming this to GetRandomAvoi Bunch of callers to that function unrelated to this change. Would prefer doing it separately. http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1355 PS3, Line 1355: Counting rows with a range predicat Fixed the log description. > Also why is the CURRENT_TEST_NAME() useful? Can't we figure that out from the > test output? Or if you're looking to benchmark a specific test, why not use > --gtest_filter or something? This is a common function for other non-benchmark tests as well and hence adding the test name helps filtering when multiple tests are run together. http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1368 PS3, Line 1368: (tab > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1449 PS3, Line 1449: ST_F(BloomFilterPredicateTest, > remove ? Done http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1449 PS3, Line 1449: TEST_F(BloomFilterPredicateTest, > Uncomment? Done http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1451 PS3, Line 1451: : // Writing large number of rows sometimes leads to write timeouts. Hence the test below : // uses 1M rows instead. : Init(100/*num_all_values*/, 1/*num_included_values*/, 1/*num_excluded_values*/); : KuduBloomFilter* included_bf = CreateBloomFilterWithValues(included_values_); : : InsertAllValuesInTable(); : vector included_bf_vec = { included_bf }; : auto* included_predicate = table_->NewInBloomFilterPredicate("value", _bf_vec); : auto* included_predicate_clone = included_predicate->Clone(); : : TestWithRangePredicate(included_predicate, included_predicate_clone); : } : > nit: this probably isn't necessary, given it's not particularly useful as a Done http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1471 PS3, Line 1471: KuduColumnSchema::VARCH > nit: Why do we need to clone this? Aren't they used in a const manner anywa NewInBloomFilterPredicate() takes ownership of the predicate. Alternative is to create another Bloom filter and call NewInBloomFilterPredicate(). It's easier to simply Clone the predicate. -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 5 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Fri, 15 May 2020 19:02:56 + Gerrit-HasComments: Yes
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/15913 ) Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. Patch Set 4: Code-Review+1 (1 comment) It's a good optimization point! http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc File src/kudu/client/predicate-test.cc: http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1449 PS3, Line 1449: // SKIP_IF_SLOW_NOT_ALLOWED(); remove ? -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 4 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Thu, 14 May 2020 22:36:38 + Gerrit-HasComments: Yes
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15913 ) Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG@19 PS3, Line 19: 100M nit: 1M here too http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG@25 PS3, Line 25: Without perf change: nit: would be nice to describe what these are and why the results look the way they do, e.g. "The added test runs two scans against a single dataset, both of which have a range predicate and a bloom filter predicate. The first scan's range predicate yields no rows, while the second scan's range predicate yields all rows. With this patch, we see a notable decrease in the time taken to evaluate the first scan, since the range predicate is evaluated first." -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 4 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Thu, 14 May 2020 22:28:32 + Gerrit-HasComments: Yes
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15913 ) Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. Patch Set 4: (10 comments) http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG@18 PS1, Line 18: Tests: > Disregarding the data type of the column, I think this is the best case sce I see, yeah I was more curious about perf in the case of string/binary. http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc File src/kudu/client/predicate-test.cc: http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1287 PS3, Line 1287: inclusive_values_ nit: same here, with "included". http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1289 PS3, Line 1289: exclusive_values_ nit: I think "excluded" is a more appropriate word. "exclusive" doesn't make sense IMO. http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1293 PS3, Line 1293: // Initialize table with "num_all_values" values. : // "num_inclusive_values" is the number of values from the table that'll be inserted in : // BloomFilter and searched against all values in the table. : // "num_exclusive_values" is similarly the number of values that are not present in the table. nit: I found myself reading through other tests to better understand what this was doing and why. Maybe reword a bit to add a more complete description: "Initialize the members with the values that will be inserted to the table and bloom filters. The table will have 'num_all_values' unique values, and we'll generate two bloom bloom filters: one with 'num_inclusive_values' values from the table, and one with 'num_exclusive_values' values that aren't present in the table." Also would it make sense to call InsertAllValuesInTable() here? http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1305 PS3, Line 1305: CreateRandomUniqueIntegers nit: not related to this patch, but consider renaming this to GetRandomAvoidingValues() or somesuch. Without going back to the implementation, this just seems like it's creating an unrestricted set of ints. http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1355 PS3, Line 1355: Counting rows when no rows expected nit: might be easier on others who may look at this test output to have a more implementation-specific description, e.g. "counting rows with a range predicate less than the min value" and "counting rows with a range predicate selecting all between the min and max values" Also why is the CURRENT_TEST_NAME() useful? Can't we figure that out from the test output? Or if you're looking to benchmark a specific test, why not use --gtest_filter or something? http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1368 PS3, Line 1368: {i nit: spacing http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1449 PS3, Line 1449: // SKIP_IF_SLOW_NOT_ALLOWED(); Uncomment? http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1451 PS3, Line 1451: // Following are results for scanning against 100M rows on 48 core : // Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz with 94GB of memory. : // : // Counting rows when no rows expected: real 0.725s user 0.001s sys 0.000s : // Counting rows when range predicate doesn't prune: real 0.847s user 0.000s sys 0.000s : // : // Counting rows when no rows expected: real 0.664s user 0.000s sys 0.000s : // Counting rows when range predicate doesn't prune: real 0.794s user 0.001s sys 0.000s : // : // Counting rows when no rows expected: real 0.706s user 0.001s sys 0.000s : // Counting rows when range predicate doesn't prune: real 0.774s user 0.000s sys 0.000s : // : // Writing large number of rows sometimes leads to write timeouts. Hence the test below : // uses 1M rows instead. nit: this probably isn't necessary, given it's not particularly useful as a reader of this code. If people were really interested, it seems more appropriate to go back to the commit message for this info. http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1471 PS3, Line 1471: auto* inclusive_predicate_clone = inclusive_predicate->Clone(); nit: Why do we need to clone this? Aren't they used in a const manner anyway? -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15913 to look at the new patch set (#4). Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. [perf] Check range predicate first while evaluating Bloom filter predicate Range predicates can be specified along with Bloom filter predicate for the same column. It's cheaper to check against range predicate and exit early if the column value is out of bounds compared to computing hash and then looking up the value in Bloom filter. This case is common when Impala pushes down Bloom filter predicate as it'll likely be accompained by min-max filter (i.e. range predicate) on the same column. Tests: Added a test case that scans against 100M column values. Across iterations observed an improvement of 20-30% when the range predicate check prevents hash computation and Bloom filter lookup. Don't see any noticeable regression for the case where values are within range bounds. Without perf change: Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.953s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.767s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.899s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.775s user 0.000s sys 0.001s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.983s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.832s user 0.001s sys 0.000s With perf change: Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.725s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.847s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.664s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.794s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.706s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.774s user 0.000s sys 0.000s Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 --- M src/kudu/client/predicate-test.cc M src/kudu/common/column_predicate.h 2 files changed, 79 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/4 -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 4 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241)
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15913 to look at the new patch set (#3). Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. [perf] Check range predicate first while evaluating Bloom filter predicate Range predicates can be specified along with Bloom filter predicate for the same column. It's cheaper to check against range predicate and exit early if the column value is out of bounds compared to computing hash and then looking up the value in Bloom filter. This case is common when Impala pushes down Bloom filter predicate as it'll likely be accompained by min-max filter (i.e. range predicate) on the same column. Tests: Added a test case that scans against 100M column values. Across iterations observed an improvement of 20-30% when the range predicate check prevents hash computation and Bloom filter lookup. Don't see any noticeable regression for the case where values are within range bounds. Without perf change: Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.953s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.767s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.899s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.775s user 0.000s sys 0.001s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.983s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.832s user 0.001s sys 0.000s With perf change: Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.725s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.847s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.664s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.794s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.706s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.774s user 0.000s sys 0.000s Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 --- M src/kudu/client/predicate-test.cc M src/kudu/common/column_predicate.h 2 files changed, 79 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/3 -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 3 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241)
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15913 to look at the new patch set (#2). Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. [perf] Check range predicate first while evaluating Bloom filter predicate Range predicates can be specified along with Bloom filter predicate for the same column. It's cheaper to check against range predicate and exit early if the column value is out of bounds compared to computing hash and then looking up the value in Bloom filter. This case is common when Impala pushes down Bloom filter predicate as it'll likely be accompained by min-max filter (i.e. range predicate) on the same column. Tests: Added a test case that scans against 100M column values. Across iterations observed an improvement of 20-30% when the range predicate check prevents hash computation and Bloom filter lookup. Don't see any noticeable regression for the case where values are within range bounds. Without perf change: Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.953s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.767s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.899s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.775s user 0.000s sys 0.001s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.983s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.832s user 0.001s sys 0.000s With perf change: Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.725s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.847s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.664s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.794s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.706s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.774s user 0.000s sys 0.000s Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 --- M src/kudu/client/predicate-test.cc M src/kudu/common/column_predicate.h 2 files changed, 79 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/2 -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 2 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241)
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15913 ) Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. Patch Set 1: (8 comments) http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG@18 PS1, Line 18: Tests: > nit: is this a best-case scenario? average scenario? worst case? If we tune Disregarding the data type of the column, I think this is the best case scenario considering range predicate checks prevents any checks to Bloom filter. Hashing for larger string/binary values could give better numbers. http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc File src/kudu/client/predicate-test.cc: http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1311 PS1, Line 1311: num_false_positive_values_ = num_all_values * kBloomFilterFalsePositiveProb; > warning: narrowing conversion from 'float' to 'int' [bugprone-narrowing-con Done http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1350 PS1, Line 1350: auto * > nit: revert the order? Done http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1353 PS1, Line 1353: const auto* test_name = ::testing::UnitTest::GetInstance()->current_test_info()->name(); > warning: Value stored to 'test_name' during its initialization is never rea Using CURRENT_TEST_NAME() macro instead. http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1354 PS1, Line 1354: strings:: > nit: add a `using` for this? Done http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h File src/kudu/common/column_predicate.h: http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@326 PS1, Line 326: return false; : } : } else if (upper_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->upper_) >= 0) { : return false; : } : } else if (lower_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->lower_) < 0) { : return false; : > nit: spacing for 'return false' Done http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@322 PS1, Line 322: // Check optional lower and upper bound. : if (lower_ != nullptr && upper_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->upper_) >= 0 || : DataTypeTraits::Compare(cell, this->lower_) < 0) { : return false; : } : } else if (upper_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->upper_) >= 0) { : return false; : } : } else if (lower_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->lower_) < 0) { : return false; : } : } > nit: would this work: Indeed! http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@349 PS1, Line 349: return false; > nit: spacing Looks like simply importing Google style didn't update the spacing in IDE on Mac. Fixed now. -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 1 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Bankim Bhavsar Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Thu, 14 May 2020 21:05:49 + Gerrit-HasComments: Yes
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15913 ) Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc File src/kudu/client/predicate-test.cc: http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1350 PS1, Line 1350: auto * nit: revert the order? http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1353 PS1, Line 1353: const auto* test_name = ::testing::UnitTest::GetInstance()->current_test_info()->name(); > warning: Value stored to 'test_name' during its initialization is never rea Should we be dereferencing the pointer in the Substitute() call? http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1354 PS1, Line 1354: strings:: nit: add a `using` for this? -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 1 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Thu, 14 May 2020 20:47:31 + Gerrit-HasComments: Yes
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15913 ) Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. Patch Set 1: (4 comments) http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG@18 PS1, Line 18: Tests: nit: is this a best-case scenario? average scenario? worst case? If we tuned the values a bit, would we be able to see a greater improvement? Is 20-30% the most we'll get from this? http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h File src/kudu/common/column_predicate.h: http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@326 PS1, Line 326: return false; : } : } else if (upper_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->upper_) >= 0) { : return false; : } : } else if (lower_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->lower_) < 0) { : return false; : nit: spacing for 'return false' http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@322 PS1, Line 322: // Check optional lower and upper bound. : if (lower_ != nullptr && upper_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->upper_) >= 0 || : DataTypeTraits::Compare(cell, this->lower_) < 0) { : return false; : } : } else if (upper_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->upper_) >= 0) { : return false; : } : } else if (lower_ != nullptr) { : if (DataTypeTraits::Compare(cell, this->lower_) < 0) { : return false; : } : } nit: would this work: if (lower_ && Compare(cell, lower_) < 0) { return false; } if (upper_ && Compare(cell, upper_) >= 0) { return false; } ? http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@349 PS1, Line 349: return false; nit: spacing -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 1 Gerrit-Owner: Bankim Bhavsar Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Thu, 14 May 2020 01:09:01 + Gerrit-HasComments: Yes
[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate
Bankim Bhavsar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15913 Change subject: [perf] Check range predicate first while evaluating Bloom filter predicate .. [perf] Check range predicate first while evaluating Bloom filter predicate Range predicates can be specified along with Bloom filter predicate for the same column. It's cheaper to check against range predicate and exit early if the column value is out of bounds compared to computing hash and then looking up the value in Bloom filter. This case is common when Impala pushes down Bloom filter predicate as it'll likely be accompained by min-max filter (i.e. range predicate) on the same column. Tests: Added a test case that scans against 100M column values. Across iterations observed an improvement of 20-30% when the range predicate check prevents hash computation and Bloom filter lookup. Don't see any noticeable regression for the case where values are within range bounds. Without perf change: Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.953s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.767s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.899s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.775s user 0.000s sys 0.001s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.983s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.832s user 0.001s sys 0.000s With perf change: Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.725s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.847s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.664s user 0.000s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.794s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows expected: real 0.706s user 0.001s sys 0.000s Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range predicate doesn't prune: real 0.774s user 0.000s sys 0.000s Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 --- M src/kudu/client/predicate-test.cc M src/kudu/common/column_predicate.h 2 files changed, 69 insertions(+), 42 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/1 -- To view, visit http://gerrit.cloudera.org:8080/15913 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3 Gerrit-Change-Number: 15913 Gerrit-PatchSet: 1 Gerrit-Owner: Bankim Bhavsar