[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-15 Thread Bankim Bhavsar (Code Review)
Bankim Bhavsar has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/15913 )

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..

[perf] Check range predicate first while evaluating Bloom filter predicate

Range predicates can be specified along with Bloom filter predicate
for the same column. It's cheaper to check against range
predicate and exit early if the column value is out of bounds
compared to computing hash and then looking up the value in Bloom filter.

This case is common when Impala pushes down Bloom filter
predicate as it'll likely be accompained by min-max filter (i.e. range
predicate) on the same column.

Tests:
Added a test case that scans against 1M column values with a range predicate
and Bloom filter predicate. In one case, with a range predicate that yields
no rows and other with a range predicate that yields all rows.

Modified the test case to run against 100M rows on a 48 core
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz with 94GB of memory.

Across iterations observed an improvement of 20-30% when the range predicate
yields no rows preventing hash computation and Bloom filter lookup.
Don't see any noticeable regression for the case where values are within
range bounds.

Without perf change:
 Counting rows with a range predicate less than the min value: real 0.953s user 
0.001s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.767s 
user 0.001s sys 0.000s

 Counting rows with a range predicate less than the min value: real 0.899s user 
0.000s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.775s 
user 0.000s sys 0.001s

 Counting rows with a range predicate less than the min value: real 0.983s user 
0.000s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.832s 
user 0.001s sys 0.000s

With perf change:
 Counting rows with a range predicate less than the min value: real 0.725s user 
0.001s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.847s 
user 0.000s sys 0.000s

 Counting rows with a range predicate less than the min value: real 0.664s user 
0.000s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.794s 
user 0.001s sys 0.000s

 Counting rows with a range predicate less than the min value: real 0.706s user 
0.001s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.774s 
user 0.000s sys 0.000s

Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Reviewed-on: http://gerrit.cloudera.org:8080/15913
Reviewed-by: Andrew Wong 
Tested-by: Kudu Jenkins
---
M src/kudu/client/predicate-test.cc
M src/kudu/common/column_predicate.h
2 files changed, 113 insertions(+), 88 deletions(-)

Approvals:
  Andrew Wong: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 6
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-15 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15913 )

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..


Patch Set 5: Code-Review+2

(5 comments)

http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG@19
PS3, Line 19: 1M c
> nit: 1M here too
Done


http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG@25
PS3, Line 25:
> nit: would be nice to describe what these are and why the results look the
Done


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc
File src/kudu/client/predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1305
PS3, Line 1305: CreateRandomUniqueIntegers
> Bunch of callers to that function unrelated to this change. Would prefer do
Sounds good


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1355
PS3, Line 1355: Counting rows with a range predicat
> Fixed the log description.
OK, though if the goal is to make it easy to find this output when 
benchmarking, --gtest_filter is probably sufficient since we'd probably only be 
running the benchmark test.


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1471
PS3, Line 1471:   
KuduColumnSchema::VARCH
> NewInBloomFilterPredicate() takes ownership of the predicate.
Ack



--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Fri, 15 May 2020 19:08:08 +
Gerrit-HasComments: Yes


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-15 Thread Bankim Bhavsar (Code Review)
Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15913

to look at the new patch set (#5).

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..

[perf] Check range predicate first while evaluating Bloom filter predicate

Range predicates can be specified along with Bloom filter predicate
for the same column. It's cheaper to check against range
predicate and exit early if the column value is out of bounds
compared to computing hash and then looking up the value in Bloom filter.

This case is common when Impala pushes down Bloom filter
predicate as it'll likely be accompained by min-max filter (i.e. range
predicate) on the same column.

Tests:
Added a test case that scans against 1M column values with a range predicate
and Bloom filter predicate. In one case, with a range predicate that yields
no rows and other with a range predicate that yields all rows.

Modified the test case to run against 100M rows on a 48 core
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz with 94GB of memory.

Across iterations observed an improvement of 20-30% when the range predicate
yields no rows preventing hash computation and Bloom filter lookup.
Don't see any noticeable regression for the case where values are within
range bounds.

Without perf change:
 Counting rows with a range predicate less than the min value: real 0.953s user 
0.001s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.767s 
user 0.001s sys 0.000s

 Counting rows with a range predicate less than the min value: real 0.899s user 
0.000s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.775s 
user 0.000s sys 0.001s

 Counting rows with a range predicate less than the min value: real 0.983s user 
0.000s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.832s 
user 0.001s sys 0.000s

With perf change:
 Counting rows with a range predicate less than the min value: real 0.725s user 
0.001s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.847s 
user 0.000s sys 0.000s

 Counting rows with a range predicate less than the min value: real 0.664s user 
0.000s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.794s 
user 0.001s sys 0.000s

 Counting rows with a range predicate less than the min value: real 0.706s user 
0.001s sys 0.000s
 Counting rows with a range predicate that includes all values: real 0.774s 
user 0.000s sys 0.000s

Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
---
M src/kudu/client/predicate-test.cc
M src/kudu/common/column_predicate.h
2 files changed, 113 insertions(+), 88 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/5
--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-15 Thread Bankim Bhavsar (Code Review)
Bankim Bhavsar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15913 )

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..


Patch Set 5:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc
File src/kudu/client/predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1287
PS3, Line 1287: included_values_;
> nit: same here, with "included".
Done


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1289
PS3, Line 1289: excluded_values_;
> nit: I think "excluded" is a more appropriate word. "exclusive" doesn't mak
Done


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1293
PS3, Line 1293:   // Initialize the members with the values that will be 
inserted to the table and Bloom filters.
  :   // The table will have 'num_all_values' unique values, and 
we'll generate two Bloom filters:
  :   // one with 'num_included_values' values from the table, and 
one with 'num_excluded_values'
  :   // values that aren't present in the table.
> nit: I found myself reading through other tests to better understand what t
Done


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1305
PS3, Line 1305: CreateRandomUniqueIntegers
> nit: not related to this patch, but consider renaming this to GetRandomAvoi
Bunch of callers to that function unrelated to this change. Would prefer doing 
it separately.


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1355
PS3, Line 1355: Counting rows with a range predicat
Fixed the log description.

> Also why is the CURRENT_TEST_NAME() useful? Can't we figure that out from the 
> test output? Or if you're looking to benchmark a specific test, why not use 
> --gtest_filter or something?

This is a common function for other non-benchmark tests as well and hence 
adding the test name helps filtering when multiple tests are run together.


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1368
PS3, Line 1368: (tab
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1449
PS3, Line 1449: ST_F(BloomFilterPredicateTest,
> remove ?
Done


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1449
PS3, Line 1449: TEST_F(BloomFilterPredicateTest,
> Uncomment?
Done


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1451
PS3, Line 1451:
  :   // Writing large number of rows sometimes leads to write 
timeouts. Hence the test below
  :   // uses 1M rows instead.
  :   Init(100/*num_all_values*/, 1/*num_included_values*/, 
1/*num_excluded_values*/);
  :   KuduBloomFilter* included_bf = 
CreateBloomFilterWithValues(included_values_);
  :
  :   InsertAllValuesInTable();
  :   vector included_bf_vec = { included_bf };
  :   auto* included_predicate = 
table_->NewInBloomFilterPredicate("value", _bf_vec);
  :   auto* included_predicate_clone = included_predicate->Clone();
  :
  :   TestWithRangePredicate(included_predicate, 
included_predicate_clone);
  : }
  :
> nit: this probably isn't necessary, given it's not particularly useful as a
Done


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1471
PS3, Line 1471:   
KuduColumnSchema::VARCH
> nit: Why do we need to clone this? Aren't they used in a const manner anywa
NewInBloomFilterPredicate() takes ownership of the predicate.
Alternative is to create another Bloom filter and call 
NewInBloomFilterPredicate(). It's easier to simply Clone the predicate.



--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Fri, 15 May 2020 19:02:56 +
Gerrit-HasComments: Yes


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-14 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15913 )

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..


Patch Set 4: Code-Review+1

(1 comment)

It's a good optimization point!

http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc
File src/kudu/client/predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1449
PS3, Line 1449: // SKIP_IF_SLOW_NOT_ALLOWED();
remove ?



--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 4
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 14 May 2020 22:36:38 +
Gerrit-HasComments: Yes


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-14 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15913 )

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG@19
PS3, Line 19: 100M
nit: 1M here too


http://gerrit.cloudera.org:8080/#/c/15913/3//COMMIT_MSG@25
PS3, Line 25: Without perf change:
nit: would be nice to describe what these are and why the results look the way 
they do, e.g.

"The added test runs two scans against a single dataset, both of which have a 
range predicate and a bloom filter predicate. The first scan's range predicate 
yields no rows, while the second scan's range predicate yields all rows. With 
this patch, we see a notable decrease in the time taken to evaluate the first 
scan, since the range predicate is evaluated first."



--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 4
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 14 May 2020 22:28:32 +
Gerrit-HasComments: Yes


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-14 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15913 )

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..


Patch Set 4:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG@18
PS1, Line 18: Tests:
> Disregarding the data type of the column, I think this is the best case sce
I see, yeah I was more curious about perf in the case of string/binary.


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc
File src/kudu/client/predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1287
PS3, Line 1287: inclusive_values_
nit: same here, with "included".


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1289
PS3, Line 1289: exclusive_values_
nit: I think "excluded" is a more appropriate word. "exclusive" doesn't make 
sense IMO.


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1293
PS3, Line 1293:   // Initialize table with "num_all_values" values.
  :   // "num_inclusive_values" is the number of values from the 
table that'll be inserted in
  :   // BloomFilter and searched against all values in the table.
  :   // "num_exclusive_values" is similarly the number of values 
that are not present in the table.
nit: I found myself reading through other tests to better understand what this 
was doing and why. Maybe reword a bit to add a more complete description:

"Initialize the members with the values that will be inserted to the table and 
bloom filters. The table will have 'num_all_values' unique values, and we'll 
generate two bloom bloom filters: one with 'num_inclusive_values' values from 
the table, and one with 'num_exclusive_values' values that aren't present in 
the table."

Also would it make sense to call InsertAllValuesInTable() here?


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1305
PS3, Line 1305: CreateRandomUniqueIntegers
nit: not related to this patch, but consider renaming this to 
GetRandomAvoidingValues() or somesuch. Without going back to the 
implementation, this just seems like it's creating an unrestricted set of ints.


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1355
PS3, Line 1355: Counting rows when no rows expected
nit: might be easier on others who may look at this test output to have a more 
implementation-specific description, e.g.

"counting rows with a range predicate less than the min value"

and

"counting rows with a range predicate selecting all between the min and max 
values"

Also why is the CURRENT_TEST_NAME() useful? Can't we figure that out from the 
test output? Or if you're looking to benchmark a specific test, why not use 
--gtest_filter or something?


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1368
PS3, Line 1368:   {i
nit: spacing


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1449
PS3, Line 1449:   // SKIP_IF_SLOW_NOT_ALLOWED();
Uncomment?


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1451
PS3, Line 1451:   // Following are results for scanning against 100M rows on 48 
core
  :   // Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz with 94GB of 
memory.
  :   //
  :   // Counting rows when no rows expected: real 0.725s user 
0.001s sys 0.000s
  :   // Counting rows when range predicate doesn't prune: real 
0.847s user 0.000s sys 0.000s
  :   //
  :   // Counting rows when no rows expected: real 0.664s user 
0.000s sys 0.000s
  :   // Counting rows when range predicate doesn't prune: real 
0.794s user 0.001s sys 0.000s
  :   //
  :   // Counting rows when no rows expected: real 0.706s user 
0.001s sys 0.000s
  :   // Counting rows when range predicate doesn't prune: real 
0.774s user 0.000s sys 0.000s
  :   //
  :   // Writing large number of rows sometimes leads to write 
timeouts. Hence the test below
  :   // uses 1M rows instead.
nit: this probably isn't necessary, given it's not particularly useful as a 
reader of this code. If people were really interested, it seems more 
appropriate to go back to the commit message for this info.


http://gerrit.cloudera.org:8080/#/c/15913/3/src/kudu/client/predicate-test.cc@1471
PS3, Line 1471:   auto* inclusive_predicate_clone = 
inclusive_predicate->Clone();
nit: Why do we need to clone this? Aren't they used in a const manner anyway?



--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master

[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-14 Thread Bankim Bhavsar (Code Review)
Hello Tidy Bot, Kudu Jenkins, Andrew Wong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15913

to look at the new patch set (#4).

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..

[perf] Check range predicate first while evaluating Bloom filter predicate

Range predicates can be specified along with Bloom filter predicate
for the same column. It's cheaper to check against range
predicate and exit early if the column value is out of bounds
compared to computing hash and then looking up the value in Bloom filter.

This case is common when Impala pushes down Bloom filter
predicate as it'll likely be accompained by min-max filter (i.e. range
predicate) on the same column.

Tests:
Added a test case that scans against 100M column values.
Across iterations observed an improvement of 20-30% when the range predicate
check prevents hash computation and Bloom filter lookup.
Don't see any noticeable regression for the case where values are within
range bounds.

Without perf change:
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.953s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.767s user 0.001s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.899s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.775s user 0.000s sys 0.001s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.983s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.832s user 0.001s sys 0.000s

With perf change:
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.725s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.847s user 0.000s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.664s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.794s user 0.001s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.706s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.774s user 0.000s sys 0.000s

Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
---
M src/kudu/client/predicate-test.cc
M src/kudu/common/column_predicate.h
2 files changed, 79 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/4
--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 4
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-14 Thread Bankim Bhavsar (Code Review)
Hello Tidy Bot, Kudu Jenkins, Andrew Wong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15913

to look at the new patch set (#3).

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..

[perf] Check range predicate first while evaluating Bloom filter predicate

Range predicates can be specified along with Bloom filter predicate
for the same column. It's cheaper to check against range
predicate and exit early if the column value is out of bounds
compared to computing hash and then looking up the value in Bloom filter.

This case is common when Impala pushes down Bloom filter
predicate as it'll likely be accompained by min-max filter (i.e. range
predicate) on the same column.

Tests:
Added a test case that scans against 100M column values.
Across iterations observed an improvement of 20-30% when the range predicate
check prevents hash computation and Bloom filter lookup.
Don't see any noticeable regression for the case where values are within
range bounds.

Without perf change:
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.953s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.767s user 0.001s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.899s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.775s user 0.000s sys 0.001s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.983s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.832s user 0.001s sys 0.000s

With perf change:
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.725s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.847s user 0.000s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.664s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.794s user 0.001s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.706s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.774s user 0.000s sys 0.000s

Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
---
M src/kudu/client/predicate-test.cc
M src/kudu/common/column_predicate.h
2 files changed, 79 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/3
--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 3
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-14 Thread Bankim Bhavsar (Code Review)
Hello Tidy Bot, Kudu Jenkins, Andrew Wong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15913

to look at the new patch set (#2).

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..

[perf] Check range predicate first while evaluating Bloom filter predicate

Range predicates can be specified along with Bloom filter predicate
for the same column. It's cheaper to check against range
predicate and exit early if the column value is out of bounds
compared to computing hash and then looking up the value in Bloom filter.

This case is common when Impala pushes down Bloom filter
predicate as it'll likely be accompained by min-max filter (i.e. range
predicate) on the same column.

Tests:
Added a test case that scans against 100M column values.
Across iterations observed an improvement of 20-30% when the range predicate
check prevents hash computation and Bloom filter lookup.
Don't see any noticeable regression for the case where values are within
range bounds.

Without perf change:
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.953s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.767s user 0.001s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.899s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.775s user 0.000s sys 0.001s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.983s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.832s user 0.001s sys 0.000s

With perf change:
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.725s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.847s user 0.000s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.664s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.794s user 0.001s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.706s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.774s user 0.000s sys 0.000s

Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
---
M src/kudu/client/predicate-test.cc
M src/kudu/common/column_predicate.h
2 files changed, 79 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/2
--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 2
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-14 Thread Bankim Bhavsar (Code Review)
Bankim Bhavsar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15913 )

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..


Patch Set 1:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG@18
PS1, Line 18: Tests:
> nit: is this a best-case scenario? average scenario? worst case? If we tune
Disregarding the data type of the column, I think this is the best case 
scenario considering range predicate checks prevents any checks to Bloom filter.

Hashing for larger string/binary values could give better numbers.


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc
File src/kudu/client/predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1311
PS1, Line 1311: num_false_positive_values_ = num_all_values * 
kBloomFilterFalsePositiveProb;
> warning: narrowing conversion from 'float' to 'int' [bugprone-narrowing-con
Done


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1350
PS1, Line 1350: auto *
> nit: revert the order?
Done


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1353
PS1, Line 1353: const auto* test_name = 
::testing::UnitTest::GetInstance()->current_test_info()->name();
> warning: Value stored to 'test_name' during its initialization is never rea
Using CURRENT_TEST_NAME() macro instead.


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1354
PS1, Line 1354: strings::
> nit: add a `using` for this?
Done


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h
File src/kudu/common/column_predicate.h:

http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@326
PS1, Line 326:   return false;
 :   }
 : } else if (upper_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->upper_) >= 0) {
 :   return false;
 :   }
 : } else if (lower_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->lower_) < 0) {
 :   return false;
 :
> nit: spacing for 'return false'
Done


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@322
PS1, Line 322: // Check optional lower and upper bound.
 : if (lower_ != nullptr && upper_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->upper_) >= 0 ||
 :   DataTypeTraits::Compare(cell, 
this->lower_) < 0) {
 :   return false;
 :   }
 : } else if (upper_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->upper_) >= 0) {
 :   return false;
 :   }
 : } else if (lower_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->lower_) < 0) {
 :   return false;
 :   }
 : }
> nit: would this work:
Indeed!


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@349
PS1, Line 349: return false;
> nit: spacing
Looks like simply importing Google style didn't update the spacing in IDE on 
Mac. Fixed now.



--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Bankim Bhavsar 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 14 May 2020 21:05:49 +
Gerrit-HasComments: Yes


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-14 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15913 )

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc
File src/kudu/client/predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1350
PS1, Line 1350: auto *
nit: revert the order?


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1353
PS1, Line 1353: const auto* test_name = 
::testing::UnitTest::GetInstance()->current_test_info()->name();
> warning: Value stored to 'test_name' during its initialization is never rea
Should we be dereferencing the pointer in the Substitute() call?


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/client/predicate-test.cc@1354
PS1, Line 1354: strings::
nit: add a `using` for this?



--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 14 May 2020 20:47:31 +
Gerrit-HasComments: Yes


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-13 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15913 )

Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15913/1//COMMIT_MSG@18
PS1, Line 18: Tests:
nit: is this a best-case scenario? average scenario? worst case? If we tuned 
the values a bit, would we be able to see a greater improvement? Is 20-30% the 
most we'll get from this?


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h
File src/kudu/common/column_predicate.h:

http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@326
PS1, Line 326:   return false;
 :   }
 : } else if (upper_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->upper_) >= 0) {
 :   return false;
 :   }
 : } else if (lower_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->lower_) < 0) {
 :   return false;
 :
nit: spacing for 'return false'


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@322
PS1, Line 322: // Check optional lower and upper bound.
 : if (lower_ != nullptr && upper_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->upper_) >= 0 ||
 :   DataTypeTraits::Compare(cell, 
this->lower_) < 0) {
 :   return false;
 :   }
 : } else if (upper_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->upper_) >= 0) {
 :   return false;
 :   }
 : } else if (lower_ != nullptr) {
 :   if (DataTypeTraits::Compare(cell, 
this->lower_) < 0) {
 :   return false;
 :   }
 : }
nit: would this work:

if (lower_ && Compare(cell, lower_) < 0) {
  return false;
}
if (upper_ && Compare(cell, upper_) >= 0) {
  return false;
}

?


http://gerrit.cloudera.org:8080/#/c/15913/1/src/kudu/common/column_predicate.h@349
PS1, Line 349: return false;
nit: spacing



--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 14 May 2020 01:09:01 +
Gerrit-HasComments: Yes


[kudu-CR] [perf] Check range predicate first while evaluating Bloom filter predicate

2020-05-13 Thread Bankim Bhavsar (Code Review)
Bankim Bhavsar has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/15913


Change subject: [perf] Check range predicate first while evaluating Bloom 
filter predicate
..

[perf] Check range predicate first while evaluating Bloom filter predicate

Range predicates can be specified along with Bloom filter predicate
for the same column. It's cheaper to check against range
predicate and exit early if the column value is out of bounds
compared to computing hash and then looking up the value in Bloom filter.

This case is common when Impala pushes down Bloom filter
predicate as it'll likely be accompained by min-max filter (i.e. range
predicate) on the same column.

Tests:
Added a test case that scans against 100M column values.
Across iterations observed an improvement of 20-30% when the range predicate
check prevents hash computation and Bloom filter lookup.
Don't see any noticeable regression for the case where values are within
range bounds.

Without perf change:
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.953s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.767s user 0.001s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.899s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.775s user 0.000s sys 0.001s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.983s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.832s user 0.001s sys 0.000s

With perf change:
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.725s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.847s user 0.000s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.664s  user 0.000s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.794s user 0.001s sys 0.000s

Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when no rows 
expected: real 0.706s  user 0.001s sys 0.000s
Time spent TestKuduBloomFilterPredicateBenchmark: Counting rows when range 
predicate doesn't prune: real 0.774s user 0.000s sys 0.000s

Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
---
M src/kudu/client/predicate-test.cc
M src/kudu/common/column_predicate.h
2 files changed, 69 insertions(+), 42 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/15913/1
--
To view, visit http://gerrit.cloudera.org:8080/15913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I8451d6ddfe1fbdf307b3e9f2cc23a8d06e655ba3
Gerrit-Change-Number: 15913
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar