[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 14:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1479/


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 14
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 15 Nov 2017 23:58:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-13 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 14: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 14
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Mon, 13 Nov 2017 23:28:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-13 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 14: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7793/14/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/14/common/thrift/ImpalaInternalService.thrift@203
PS14, Line 203:   41: optional i32 max_num_runtime_filters = 10
> so this (and two options below) only apply to bloom but not min/max filters
I see you addressed this in a doc you had already linked:
https://docs.google.com/document/d/1G-SPZelateebNxzVb67urEVtjc5Itw-B_jjfS85bSCE/edit?usp=sharing

That reasoning seems okay with me.



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 14
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Mon, 13 Nov 2017 22:09:35 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-13 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 14:

(2 comments)

looked through the headers and thrift and it looks fine to me. Just a question 
about the query options.

http://gerrit.cloudera.org:8080/#/c/7793/14/be/src/exec/kudu-util.cc
File be/src/exec/kudu-util.cc:

http://gerrit.cloudera.org:8080/#/c/7793/14/be/src/exec/kudu-util.cc@116
PS14, Line 116: Status ConvertTimestampValue(const TimestampValue* tv, int64_t* 
ts_micros) {
static


http://gerrit.cloudera.org:8080/#/c/7793/14/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/14/common/thrift/ImpalaInternalService.thrift@203
PS14, Line 203:   41: optional i32 max_num_runtime_filters = 10
so this (and two options below) only apply to bloom but not min/max filters? is 
that going to be clear to a user? hopefully we don't expect most to use these 
options, so maybe we should rename them?



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 14
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Mon, 13 Nov 2017 18:59:16 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-09 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 14: Code-Review+1

(1 comment)

I'm happy with this change. Nice work.

http://gerrit.cloudera.org:8080/#/c/7793/14/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/14/be/src/util/min-max-filter.cc@166
PS14, Line 166: // For integer filter types, add a GetCastIntMinMax() that 
return the min/max values for
Sorry I didn't realize this was already in the .h file. Better to remove this 
comment in favor of the one in .h



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 14
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 10 Nov 2017 05:54:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-09 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 14:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/exec/kudu-scanner.cc@200
PS13, Line 200: if (!filter->GetCastIntMinMax(col_type, _min, 
_max)) {
> Unfortunate that that every Kudu client has to do this.
This is something that's been discussed before (see KUDU-931), I'm not sure why 
they decided not to do it.


http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/util/min-max-filter.cc@139
PS13, Line 139: case TYPE_BIGINT:
> using std::numeric_limits; at the top if this file
Done


http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/util/min-max-filter.cc@165
PS13, Line 165:
> Brief comment especially about the return value would be good.
Done


http://gerrit.cloudera.org:8080/#/c/7793/13/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/13/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@604
PS13, Line 604: // must be a SlotRef pointing to a column. We can allow 
implicit integer casts
> typo: We can allow implicit integer casts
Done


http://gerrit.cloudera.org:8080/#/c/7793/13/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@605
PS13, Line 605: // by casting the min/max values before sending them to 
Kudu.
> min/max values
Done


http://gerrit.cloudera.org:8080/#/c/7793/13/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/13/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@98
PS13, Line 98: where a.tinyint_col = b.int_col and b.int_col in (0, 1)
> Let's make the min/max filter selective, e.g. by adding where b.int_col in
Done


http://gerrit.cloudera.org:8080/#/c/7793/13/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@103
PS13, Line 103: # The min/max values in the filter are both above the range of 
the target col so all rows
> Let's also add a non-selective case where the min/max values fall outside t
Done



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 14
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 09 Nov 2017 18:53:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-09 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Anonymous Coward #345, Tim 
Armstrong, Todd Lipcon, Mostafa Mokhtar, Alex Behm, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#14).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on if it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Min-max filters are inserted into by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that elimiates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-mt.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder-ir.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/runtime/timestamp-value.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter-test.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/Data.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/inline-view-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-08 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 13:

(7 comments)

Looks pretty good to me

http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/exec/kudu-scanner.cc@200
PS13, Line 200: if (!filter->GetCastIntMinMax(col_type, _min, 
_max)) {
Unfortunate that that every Kudu client has to do this.

We should consider filing a JIRA against Kudu to address perform this logic on 
their side.


http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/util/min-max-filter.cc@139
PS13, Line 139:   return std::numeric_limits::max();
using std::numeric_limits; at the top if this file


http://gerrit.cloudera.org:8080/#/c/7793/13/be/src/util/min-max-filter.cc@165
PS13, Line 165: #define NUMERIC_MIN_MAX_FILTER_CAST(NAME)   
\
Brief comment especially about the return value would be good.


http://gerrit.cloudera.org:8080/#/c/7793/13/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/13/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@604
PS13, Line 604: // must be a SlotRef pointing to a column. Wc can allow 
implicit integer casts
typo: We can allow implicit integer casts


http://gerrit.cloudera.org:8080/#/c/7793/13/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@605
PS13, Line 605: // by casting the min/max value before sending them to 
Kudu.
min/max values


http://gerrit.cloudera.org:8080/#/c/7793/13/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/13/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@98
PS13, Line 98: where a.tinyint_col = b.int_col
Let's make the min/max filter selective, e.g. by adding where b.int_col in 
(0,1) or something like that


http://gerrit.cloudera.org:8080/#/c/7793/13/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@103
PS13, Line 103: # The min/max values in the filter are outside the range 
representable for the target col.
Let's also add a non-selective case where the min/max values fall outside the 
range of the target column, something like:

select STRAIGHT_JOIN count(*)
from alltypes a join [BROADCAST]
  (values(min_int() x), (max_int())) v
where a.tinyint_col = v.x



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 13
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 09 Nov 2017 00:02:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-08 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Anonymous Coward #345, Tim 
Armstrong, Todd Lipcon, Mostafa Mokhtar, Alex Behm, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#13).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on if it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Min-max filters are inserted into by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that elimiates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-mt.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder-ir.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/runtime/timestamp-value.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter-test.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/Data.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/inline-view-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-07 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 11:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@27
PS9, Line 27:
> Thanks for explaining.
https://docs.google.com/document/d/1G-SPZelateebNxzVb67urEVtjc5Itw-B_jjfS85bSCE/edit?usp=sharing


http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/exec/filter-context.h@118
PS11, Line 118: Materialize filter values.
> what does that mean?
Done


http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/runtime/runtime-filter-bank.h
File be/src/runtime/runtime-filter-bank.h:

http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/runtime/runtime-filter-bank.h@78
PS11, Line 78:   /// may both be NULL, representing a filter that allows all 
rows to pass.
> is it the case that at most one of bloom_filter and min_max_filter should b
Done


http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/util/min-max-filter.h
File be/src/util/min-max-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/util/min-max-filter.h@62
PS11, Line 62: Materialize filter values
> what does that mean?
Done


http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/PlanNodes.thrift@103
PS9, Line 103:   12: optional string kudu_col_name
> case sensitive?
Done


http://gerrit.cloudera.org:8080/#/c/7793/11/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/11/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@359
PS11, Line 359: public Operator getJoinOp() { return exprCmpOp_; }
> getExprCmpOp()
Done


http://gerrit.cloudera.org:8080/#/c/7793/11/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@602
PS11, Line 602: if (!(targetExpr instanceof SlotRef)
> I think only explicit casts are problematic. Implicit casts should be ok, o
Right, we should be able to support integer implicit casts.

The complication is that if, say, the calculated max is outside of the range 
for the type of the targeted column, we can't just pass that value into Kudu as 
it will complain.

In that case, we would need to convert the max we send to be the max for the 
type of the targeted column. I've added some code to the BE to do this 
conversion.


http://gerrit.cloudera.org:8080/#/c/7793/11/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
File testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/11/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test@144
PS11, Line 144: on a.month = cast(b.month + 1 as int);
> Was this your change? Why the change?
This was necessary if we didn't support implicit casts on the target. Removed 
now.



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 11
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 07 Nov 2017 15:06:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-07 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Anonymous Coward #345, Tim 
Armstrong, Todd Lipcon, Mostafa Mokhtar, Alex Behm, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#12).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on if it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Min-max filters are inserted into by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that elimiates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-mt.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder-ir.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/runtime/timestamp-value.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter-test.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/Data.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/inline-view-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-06 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 11:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/exec/filter-context.h@118
PS11, Line 118: Materialize filter values.
what does that mean?


http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/runtime/runtime-filter-bank.h
File be/src/runtime/runtime-filter-bank.h:

http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/runtime/runtime-filter-bank.h@78
PS11, Line 78:   /// may both be NULL, representing a filter that allows all 
rows to pass.
is it the case that at most one of bloom_filter and min_max_filter should be 
non-NULL? could you clarify that (including in the class comment)?


http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/util/min-max-filter.h
File be/src/util/min-max-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/11/be/src/util/min-max-filter.h@62
PS11, Line 62: Materialize filter values
what does that mean?



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 11
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Mon, 06 Nov 2017 23:54:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-03 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 11:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@27
PS9, Line 27:
> RUNTIME_FILTER_WAIT_TIME_MS applies the same to bloom and min-max - its the
Thanks for explaining.
Can you add this text into a Google Doc for us to keep track of the 
evolution/meaning of these query options? No need to polish, just put it 
somewhere.

I think the new types of filters will affect our query options in non-trivial 
ways and we should come up with a plan that minimizes user confusion, adding 
new options, and deprecating options.


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@41
PS9, Line 41:
> Those results are posted in the review comments. I can include them here as
Thanks. No need to add here.


http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift@202
PS9, Line 202:   // Maximum number of bloom runtime filters allowed per query
> There's basically two reasons for this:
Thanks. Please add this text to a Google Doc for tracking the evolution of 
these query options.

Even though the min/max filters are smaller and bounded in size, I think 
extreme queries with a large number of joins and join conditions can still 
cause havoc. Keep in mind that we now allow an *unbounded* number of such 
filters. Crazy queries will happen.

We should not hold this patch, but the lack of safeguards is concerning to me 
and we should continue thinking.


http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/PlanNodes.thrift@103
PS9, Line 103:   12: optional string kudu_col_name
case sensitive?


http://gerrit.cloudera.org:8080/#/c/7793/11/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/11/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@359
PS11, Line 359: public Operator getJoinOp() { return exprCmpOp_; }
getExprCmpOp()


http://gerrit.cloudera.org:8080/#/c/7793/11/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@602
PS11, Line 602: if (!(targetExpr instanceof SlotRef)
I think only explicit casts are problematic. Implicit casts should be ok, or am 
I missing something?


http://gerrit.cloudera.org:8080/#/c/7793/11/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
File testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/11/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test@144
PS11, Line 144: on a.month = cast(b.month + 1 as int);
Was this your change? Why the change?



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 11
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 Nov 2017 23:22:11 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-03 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 11:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@13
PS9, Line 13: In RuntimeFilterGenerator in the planner, each hash join node
> ... each hash join node generates a bloom and min-max filter for each equi-
Done


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@27
PS9, Line 27:
> Not specific to the code changes, and I don't expect a response here (proba
RUNTIME_FILTER_WAIT_TIME_MS applies the same to bloom and min-max - its the 
time a scan node will wait for any filter of either type to arrive. Similarly, 
RUNTIME_FILTER_MODE works the same for min-max as its worked for bloom.

The size related params - RUNTIME_BLOOM_FILTER_SIZE, 
RUNTIME_FILTER_MIN/MAX_SIZE only apply to bloom filters as the mem used by 
min-max is small and bounded.

I've updated some comments to make the above clearer.

One tricky case is DISABLE_ROW_RUNTIME_FILTERING. Kudu applies the filters on 
both a per-partition and per-row basis. At the moment, there's no way to change 
this - the filters are applied through the same interface as eg. predicates 
that are pushed down, so Kudu assumes that they have to be applied for 
correctness, not just for performance.

So, I could see the argument either way - we could disable Kudu filters if 
DISABLE_ROW_RUNTIME_FILTERING is true, though it would also disable the 
partition filtering, or we could just not consider 
DISABLE_ROW_RUNTIME_FILTERING for Kudu filters.

The remaining filter related option is MAX_NUM_RUNTIME_FILTERS, which I've 
addressed in another comment.


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@41
PS9, Line 41:
> Contrived extreme queries are good data points, but how about running the T
Those results are posted in the review comments. I can include them here as 
well, I just felt it made the commit message excessively large.


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@44
PS9, Line 44:   timings are averages of 3 runs.
> does not eliminate
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/runtime/timestamp-value.h@164
PS8, Line 164: tvalue.ti
> I'm confused by this variable name.
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc
File be/src/util/min-max-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc@27
PS8, Line 27:
> Can you maybe briefly describe what is tested for each filter type? There s
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc@252
PS8, Line 252:   MinMaxFilter::Create(tFilter, string_type, _pool, 
_pool);
> Would TestEnv work? That is the semi-standard way to create an ExecEnv for
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc@353
PS8, Line 353:   t3.ToTColumnValue();
> Remove?
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter.cc@246
PS8, Line 246: null
> nullptr?
Done


http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift@202
PS9, Line 202:   // Maximum number of bloom runtime filters allowed per query
> I think I understand why you did this, but it seems confusing from a user's
There's basically two reasons for this:
- We don't want to regress queries by eliminating bloom filters that used to be 
generated.
- The purpose of this param in the first place was to prevent the coordinator 
from being overwhelmed. Min-max filters are less heavy-weight than bloom 
filters so they don't put as much stress on the coordinator anyways.

Given that, I think that it would be fine to maintain this behavior even once 
HDFS can do min-max and Kudu can do bloom.

Another concern though is what to do if we add in-list filters, which are 
probably similarly heavy-weight to bloom filters.

One option would be to deprecate this and have separate max_num_bloom and 
max_num_in_list params, but that may be confusing and requires user action to 
keep things working as expected anyways.

Another option at that point would be to keep this param and make it "Maximum 
number of expensive filters (bloom or in-list)", and then there's just once we 
have to worry about regressing queries.

Or maybe we should just apply this to min-max filters as well, since its the 
least confusing option in the long run, possibly bumping up the default a 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-03 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Anonymous Coward #345, Tim 
Armstrong, Todd Lipcon, Mostafa Mokhtar, Alex Behm,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#11).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on if it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Min-max filters are inserted into by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that elimiates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-mt.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder-ir.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/runtime/timestamp-value.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter-test.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/Data.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/inline-view-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-03 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Anonymous Coward #345, Tim 
Armstrong, Todd Lipcon, Mostafa Mokhtar, Alex Behm,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#10).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on if it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each partitioned hash join
generates a bloom and min-max filter, but only those filters that end
up being assigned to a target make it into the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Min-max filters are inserted into by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that elimiates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-mt.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder-ir.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/runtime/timestamp-value.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter-test.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/Data.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/inline-view-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-01 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 9:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@13
PS9, Line 13: In RuntimeFilterGenerator in the planner, each partitioned hash 
join
... each hash join node generates a bloom and min-max filter for each equi-join 
predicate, but only those ...


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@27
PS9, Line 27: For now, min-max filters are only applied at the KuduScanner, 
which
Not specific to the code changes, and I don't expect a response here (probably 
too long :)).

How do the existing query options around runtime filters affect the new min/max 
filters on Kudu? For example, what does DISABLE_ROW_RUNTIME_FILTERING mean for 
the Kudu min/max filters?

How should users think about setting:
RUNTIME_FILTER_WAIT_TIME_MS

In particular, are min/max filters more effective against Kudu PK or partition 
columns?


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@41
PS9, Line 41: Perf Testing:
Contrived extreme queries are good data points, but how about running the 
TPCH/DS perf suites against Kudu?


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@44
PS9, Line 44: - Ran a contrived query with a filter that does eliminate any rows
does not eliminate


http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift@202
PS9, Line 202:   // Maximum number of bloom runtime filters allowed per query
I think I understand why you did this, but it seems confusing from a user's 
perspective. Ok to leave, but do you have a story around the eventual meaning 
of existing query options when HDFS can do min/max and Kudu can do bloom?


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/PlanNode.java
File fe/src/main/java/org/apache/impala/planner/PlanNode.java:

http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/PlanNode.java@730
PS9, Line 730: output.append(Joiner.on(", ").join(filtersStr) + "\n");
just return the string? don't think we need the 'output' StringBuilder


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@111
PS9, Line 111: private final Operator joinOp_;
Let's call this cmpOp_ or exprCmpOp_ or something else because "joinOp_" 
usually indicates the join type like left outer, right outer, etc.


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@170
PS9, Line 170:   SlotRef slotRef = expr.unwrapSlotRef(false);
Add a comment stating that the validity of this is checked elsewhere (and where 
exactly)


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@368
PS9, Line 368:   if (node instanceof HdfsScanNode && type_ != 
TRuntimeFilterType.BLOOM) {
I feel like these checks belong in the caller. Having an addTarget() function 
be a co-op in some cases seems difficult to reason about. Can be cleaned with a 
isValidTarget() helper function.


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@463
PS9, Line 463: // We only enforce a limit on the number of bloom filters as 
they are much more
This seems really confusing for users. I'm ok with checking in this version, 
and let's discuss separately.


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test@1248
PS9, Line 1248: |  runtime filters: RF004 <- o_orderkey, RF005 <- o_clerk
To me the re-numbering is a little strange. We can think about how to address 
this in a follow-on change. I'm thinking that ideally users should be able to 
quickly determine the number of runtime filters based on the max RF id, so we 
could assign the real RF id lazily instead of eagerly.


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test:


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-11-01 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Anonymous Coward #345, Tim 
Armstrong, Todd Lipcon, Mostafa Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#9).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on if it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each partitioned hash join
generates a bloom and min-max filter, but only those filters that end
up being assigned to a target make it into the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Min-max filters are inserted into by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that elimiates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-mt.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder-ir.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/runtime/timestamp-value.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter-test.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/Data.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/inline-view-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-31 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 8:

My +1 was for the backend, I can upgrade to a +2 for backend if nobody else 
wants to take a pass. I don't feel confident +1ing the frontend part.


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 01 Nov 2017 00:36:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-31 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 8: Code-Review+1

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/runtime/timestamp-value.h@164
PS8, Line 164: uint8_min
I'm confused by this variable name.


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc
File be/src/util/min-max-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc@27
PS8, Line 27:
Can you maybe briefly describe what is tested for each filter type? There seems 
to be a pattern to the tests but a comment might make it easier to understand 
what coverage this provides.


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc@252
PS8, Line 252:   ExecEnv* env = new ExecEnv();
Would TestEnv work? That is the semi-standard way to create an ExecEnv for 
tests.


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc@353
PS8, Line 353: //IMPALA_TEST_MAIN();
Remove?


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter.cc@246
PS8, Line 246: NULL
nullptr?



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 01 Nov 2017 00:29:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-31 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 8:

I need to take another look at this - sorry for the delay.


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 31 Oct 2017 22:45:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-27 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 8:

TPCDS results:

 
+-++++-++++-+---+
 | Workload| Query  | File Format| 
Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | 
Iters |
 
+-++++-++++-+---+
 | TPCDS(_300) | TPCDS-Q73  | kudu / none / none | 
12.01  | 7.99| R +50.42%  |   0.31%|   1.04%| 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q68  | kudu / none / none | 
17.02  | 13.00   | R +30.97%  |   0.82%|   0.26%| 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q79  | kudu / none / none | 
16.61  | 12.95   | R +28.31%  |   0.56%|   0.54%| 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q1   | kudu / none / none | 2.36 
  | 1.99|   +18.19%  |   2.86%|   1.38%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q53  | kudu / none / none | 2.18 
  | 2.08|   +4.96%   |   0.32%|   1.98%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q8   | kudu / none / none | 4.03 
  | 3.93|   +2.38%   |   0.43%|   0.38%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q98  | kudu / none / none | 7.80 
  | 7.67|   +1.72%   |   0.66%|   0.81%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q7   | kudu / none / none | 3.75 
  | 3.71|   +1.15%   |   1.14%|   1.17%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q4   | kudu / none / none | 
29.59  | 29.27   |   +1.11%   |   4.13%|   3.67%| 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q65  | kudu / none / none | 
13.97  | 14.01   |   -0.28%   |   2.70%|   1.43%| 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q28  | kudu / none / none | 4.06 
  | 4.08|   -0.35%   |   0.84%|   0.48%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q89  | kudu / none / none | 2.46 
  | 2.48|   -0.62%   |   2.60%|   1.81%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q55  | kudu / none / none | 2.44 
  | 2.46|   -0.83%   |   1.08%|   5.41%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q42  | kudu / none / none | 2.42 
  | 2.47|   -1.89%   |   0.26%|   4.50%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q23  | kudu / none / none | 
90.96  | 95.30   |   -4.55%   |   1.06%|   2.21%| 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q43  | kudu / none / none | 5.05 
  | 5.34|   -5.49%   |   0.89%|   4.17%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q3   | kudu / none / none | 4.11 
  | 4.36|   -5.84%   |   2.48%|   1.15%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q61  | kudu / none / none | 8.56 
  | 9.93|   -13.83%  |   0.78%|   1.84%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q2   | kudu / none / none | 
12.19  | 15.68   | I -22.28%  | * 27.57% * |   1.11%| 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q47  | kudu / none / none | 
16.82  | 21.82   | I -22.91%  |   1.16%|   1.25%| 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q19  | kudu / none / none | 4.78 
  | 6.32| I -24.29%  |   1.14%|   0.97%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q46  | kudu / none / none | 6.58 
  | 8.77| I -24.88%  |   0.86%|   1.01%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q88  | kudu / none / none | 
14.89  | 19.84   | I -24.95%  |   0.17%|   3.79%| 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q59  | kudu / none / none | 
24.05  | 35.61   | I -32.45%  | * 11.97% * | * 20.61% * | 1   | 
4 |
 | TPCDS(_300) | TPCDS-Q34  | kudu / none / none | 3.94 
  | 6.82| I -42.24%  |   2.37%|   4.12%| 1   | 4
 |
 | TPCDS(_300) | TPCDS-Q27  | kudu / none / none | 3.97 
  | 7.91| I -49.85%  |   

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-26 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 8:

> > Patch Set 7:
 > >
 > > > > Patch Set 7:
 > >  > >
 > >  > > Perf results:
 > >  > > ...
 > >  >
 > >  > I'm surprised that only a few queries saw significant
 > speedups. Is
 > >  > this in line with what you saw with Parquet runtime filters on
 > >  > TPC-H? Or are we losing a lot by using min/max instead of
 > bloom or
 > >  > in-list style filters?
 > >
 > > Not sure about bloom filters perf, though I can run those numbers
 > for comparison.
 >
 > I haven't looked at this patch, but had a question about the
 > design:
 >
 > Are we still pushing blooms across a join to prevent shuffling of
 > data? Or are we now pushing _only_ min/max?
 >
 > It seems there is value in pushing both: the bloom for evaluation
 > on the other side of the join to prevent shuffling, and the min/max
 > to push all the way to the scanner to reduce I/O.
 >
 > Not sure if the patch is already doing this.

Impala only evaluates runtime filters in the scan. Even prior to this patch, 
the Kudu scanner was not evaluating bloom filters (and hash joins with Kudu 
scan targets don't build bloom filters).

It certainly could be useful to evaluate bloom filters on the Impala side of a 
Kudu scan, but I believe our thinking was that it wasn't worth it to implement 
that - better to just wait until bloom filters can be pushed all the way down 
into Kudu. If bloom filters in Kudu are a long way off, though, we should maybe 
reevaluate that.


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 26 Oct 2017 20:54:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-26 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 8:

(10 comments)

> I noticed that - BloomFilterBytes is always 0 in the query
 > profiles.
 > Can you please veirfy?

Yes. This line is always printed to the profile, even if there are no bloom 
filters being built. We don't expose the mem used by min-max filters in the 
profile as its negligible.

http://gerrit.cloudera.org:8080/#/c/7793/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/7//COMMIT_MSG@26
PS7, Line 26:
> Might be worth mentioning why the runtime filters were renumbered in all th
Done


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/exec/filter-context.h@106
PS7, Line 106:   Status CloneFrom(const FilterContext& from, ObjectPool* pool, 
RuntimeState* state,
> Long lines
Done


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.h
File be/src/util/min-max-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.h@140
PS7, Line 140:   virtual void* GetMax() override { return _; }
> Nit: not sure why this is using underscores while AlwaysTrue() is using cam
Done


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.h@176
PS7, Line 176:   StringValue min_;
> Can you briefly mention what it means when these are empty - that the value
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@291
PS6, Line 291: out->__isset.min = true;
> Oh ok, thanks for clarifying. It looked superficially like a last line copy
Done


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@137
PS7, Line 137: DCHECK(thrift.__isset.max);
> Do we have end-to-end tests for these code paths? I think we could generall
I've now added e2e and unit tests for all of the truncation scenarios.


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@143
PS7, Line 143: CopyToBuffer(_buffer_, _, max_.len);
> This has a fairly large number of edge cases - it would be good to unit tes
Done


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@143
PS7, Line 143: uff
> static_cast instead of int()?
Done


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@150
PS7, Line 150:
> I would have expected that we would modify the trailing values that overflo
Done


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@207
PS7, Line 207: out->min.__set_string_val(in.min.string_val);
> It seems tricky to test these various error paths in end-to-end tests. Coul
As you say, its difficult to test the oom scenario in an e2e test, since the 
max amount of mem being used here is so small, but its covered with a unit test 
now.



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 26 Oct 2017 20:52:06 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-26 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Anonymous Coward #345, Tim 
Armstrong, Todd Lipcon, Mostafa Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#8).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on if it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each partitioned hash join
generates a bloom and min-max filter, but only those filters that end
up being assigned to a target make it into the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Min-max filters are inserted into by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that elimiates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-mt.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder-ir.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/runtime/timestamp-value.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter-test.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/Data.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/inline-view-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-26 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 7:

I know that He Lifu at NetEase also prototyped some support for pushing blooms 
down all the way into Kudu (not just the Impala scanner). On his TPCH 
benchmarks he got a ~50% speedup on Q17 and Q18, Q9, Q8, which I don't see in 
the results here.

I wonder if this is due to a different partitioning scheme or whether the big 
gain actually comes from pushing all the way down into the Kudu scanner. Any 
idea, given the plans for those queries?


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 26 Oct 2017 20:35:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-24 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 7:

TPCDS is a better workload for min/max filtering than TPCH, because TPCDS 
relies on dynamic partition pruning to reduce scan size from the probe side of 
the joins.

On the other hand min/max filtering won't help TPCH much because each partition 
from l_shipdate and o_orderdate will have the min and max values for l_orderkey 
and o_orderkey.


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 25 Oct 2017 02:50:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-24 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 7:

> Patch Set 7:
>
> > > Patch Set 7:
>  > >
>  > > Perf results:
>  > > ...
>  >
>  > I'm surprised that only a few queries saw significant speedups. Is
>  > this in line with what you saw with Parquet runtime filters on
>  > TPC-H? Or are we losing a lot by using min/max instead of bloom or
>  > in-list style filters?
>
> Not sure about bloom filters perf, though I can run those numbers for 
> comparison.

I haven't looked at this patch, but had a question about the design:

Are we still pushing blooms across a join to prevent shuffling of data? Or are 
we now pushing _only_ min/max?

It seems there is value in pushing both: the bloom for evaluation on the other 
side of the join to prevent shuffling, and the min/max to push all the way to 
the scanner to reduce I/O.

Not sure if the patch is already doing this.


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 24 Oct 2017 18:32:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 6:

(10 comments)

Few more comments following on from the previous one. I think my primary 
concern now is whether we have enough test coverage of edge cases (low memory, 
extreme values, etc). The code now seems solid so mainly concerned about 
preventing future regressions.

http://gerrit.cloudera.org:8080/#/c/7793/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/7//COMMIT_MSG@26
PS7, Line 26: - Updated planner tests.
Might be worth mentioning why the runtime filters were renumbered in all the 
planner tests


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/exec/filter-context.h@106
PS7, Line 106:   bool bloom_filter() const { return 
filter->filter_desc().bloom_filter; }
Long lines


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.h
File be/src/util/min-max-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.h@140
PS7, Line 140:   static void Or(const TMinMaxFilter& in, TMinMaxFilter* out);
Nit: not sure why this is using underscores while AlwaysTrue() is using camel 
case.

To me it feels like virtual functions should be camel case.


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.h@176
PS7, Line 176:  private:
Can you briefly mention what it means when these are empty - that the values 
haven't been materialized.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@291
PS6, Line 291:   BigIntMinMaxFilter::Or(in, out);
> TColumnData doesn't have a timestamp field, so I convert timestamps with Ut
Oh ok, thanks for clarifying. It looked superficially like a last line 
copy/paste error (https://www.viva64.com/en/b/0260/).

The microseconds unix time does have lower precision than Impala's internal 
timestamp and I'm not sure if it has the same range. It may not affect 
correctness but it's a little tricky to reason about all the possible cases. It 
seems like it would be better to keep full precision until we have to push it 
to kudu - maybe we could extend TColumnValue to support a native timestamp type.

Rather confusingly, we have two TColumnValue classes - the HS2 one, which is 
only used in a few places, and our internal one. It seems safe to modify our 
internal one to include a native timestamp field.

In another place we convert timestamps to string to store in in a TColumnValue 
- see SetTColumnValue() in fe-support.cc - but it's not totally clear to me why 
(that code's been there since this commit in 2012: 
be98df19c8efe59577e6faaa6089f0383009b703).


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@137
PS7, Line 137:   if (!in.always_false) {
Do we have end-to-end tests for these code paths? I think we could generally do 
with some more targeted tests for edge cases of min/max filters.


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@143
PS7, Line 143: rin
static_cast instead of int()?


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@143
PS7, Line 143:   out->max.__set_string_val(in.max.string_val);
This has a fairly large number of edge cases - it would be good to unit test it.


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@150
PS7, Line 150:   out->min.__set_string_val(in.min.string_val);
I would have expected that we would modify the trailing values that overflowed 
as well.

i.e. incrementing [0, 255] should give [1, 0]


http://gerrit.cloudera.org:8080/#/c/7793/7/be/src/util/min-max-filter.cc@207
PS7, Line 207: case PrimitiveType::TYPE_DOUBLE:
It seems tricky to test these various error paths in end-to-end tests. Could we 
exercise them in unit tests?

Maybe also add a brief comment to explain how it's handling the error?



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-22 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 7:

I noticed that - BloomFilterBytes is always 0 in the query profiles.
Can you please veirfy?


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 23 Oct 2017 02:37:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-20 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 7:

Perf results:

 
++--+++-++---++-+---+
 | Workload   | Query| File Format| Avg(s) | Base Avg(s) | 
Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
 
++--+++-++---++-+---+
 | TPCH(_100) | TPCH-Q20 | kudu / none / none | 7.04   | 6.10|   
+15.45%  |   3.51%   |   3.28%| 1   | 4 |
 | TPCH(_100) | TPCH-Q8  | kudu / none / none | 8.09   | 7.38|   +9.67% 
  |   1.11%   |   2.57%| 1   | 4 |
 | TPCH(_100) | TPCH-Q17 | kudu / none / none | 19.13  | 17.50   |   +9.33% 
  |   3.91%   |   4.34%| 1   | 4 |
 | TPCH(_100) | TPCH-Q22 | kudu / none / none | 2.71   | 2.56|   +6.00% 
  |   0.90%   |   4.11%| 1   | 4 |
 | TPCH(_100) | TPCH-Q16 | kudu / none / none | 4.23   | 4.01|   +5.45% 
  |   4.13%   |   0.53%| 1   | 4 |
 | TPCH(_100) | TPCH-Q1  | kudu / none / none | 10.87  | 10.55   |   +3.05% 
  |   1.06%   |   4.61%| 1   | 4 |
 | TPCH(_100) | TPCH-Q18 | kudu / none / none | 24.93  | 24.64   |   +1.18% 
  |   3.44%   |   1.29%| 1   | 4 |
 | TPCH(_100) | TPCH-Q6  | kudu / none / none | 2.27   | 2.26|   +0.77% 
  |   0.87%   |   1.35%| 1   | 4 |
 | TPCH(_100) | TPCH-Q7  | kudu / none / none | 18.55  | 18.43   |   +0.66% 
  |   3.49%   |   1.40%| 1   | 4 |
 | TPCH(_100) | TPCH-Q3  | kudu / none / none | 10.48  | 10.43   |   +0.47% 
  |   0.98%   |   1.43%| 1   | 4 |
 | TPCH(_100) | TPCH-Q4  | kudu / none / none | 15.94  | 15.90   |   +0.27% 
  |   2.07%   |   1.26%| 1   | 4 |
 | TPCH(_100) | TPCH-Q15 | kudu / none / none | 6.01   | 6.00|   +0.13% 
  |   2.57%   |   1.76%| 1   | 4 |
 | TPCH(_100) | TPCH-Q19 | kudu / none / none | 8.02   | 8.07|   -0.63% 
  |   3.79%   |   2.33%| 1   | 4 |
 | TPCH(_100) | TPCH-Q2  | kudu / none / none | 4.07   | 4.10|   -0.66% 
  |   5.74%   |   1.90%| 1   | 4 |
 | TPCH(_100) | TPCH-Q9  | kudu / none / none | 18.00  | 18.36   |   -1.96% 
  |   5.78%   |   1.17%| 1   | 4 |
 | TPCH(_100) | TPCH-Q13 | kudu / none / none | 11.44  | 11.67   |   -1.99% 
  |   3.58%   |   1.12%| 1   | 4 |
 | TPCH(_100) | TPCH-Q12 | kudu / none / none | 6.40   | 6.56|   -2.37% 
  |   2.20%   |   2.28%| 1   | 4 |
 | TPCH(_100) | TPCH-Q10 | kudu / none / none | 5.42   | 5.60|   -3.17% 
  |   1.90%   |   2.60%| 1   | 4 |
 | TPCH(_100) | TPCH-Q5  | kudu / none / none | 10.53  | 11.07   |   -4.87% 
  |   5.59%   |   1.44%| 1   | 4 |
 | TPCH(_100) | TPCH-Q21 | kudu / none / none | 51.31  | 54.29   |   -5.48% 
  |   2.18%   |   2.39%| 1   | 4 |
 | TPCH(_100) | TPCH-Q14 | kudu / none / none | 4.76   | 7.27| I 
-34.49%  |   0.87%   |   0.54%| 1   | 4 |
 | TPCH(_100) | TPCH-Q11 | kudu / none / none | 1.27   | 2.56| I 
-50.18%  |   2.62%   |   4.17%| 1   | 4 |
 
++--+++-++---++-+---+

For the queries that regressed somewhat - Q20,8,17 - the filters simply aren't 
very selective. We might consider a path to disable a filter that we see in not 
selective the way we do with bloom filters once Kudu exposes stats are filter 
effectiveness.


--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Oct 2017 21:20:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-18 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Anonymous Coward #345, Tim 
Armstrong, Mostafa Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#7).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on if it has HDFS or Kudu targets, respectively.

Min-max filters are generated by the PartitionedHashJoinBuilder. For
now, min-max filters are only applied at the KuduScanner, which passes
them into the Kudu client. Because the Kudu client doesn't provide a
way to specify generic filter exprs, min-max filters are only
generated when the target expr is a bare Kudu column ref.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Codegen is used to eliminate branching on the type of the min-max
filter.

Functional Testing:
- Added new planner tests and updated the old ones.
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that elimiates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-mt.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder-ir.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/inline-view-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-delete.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/max-row-size.test
A 
testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/nested-collections.test
M testdata/workloads/functional-planner/queries/PlannerTest/order.test
M testdata/workloads/functional-planner/queries/PlannerTest/outer-joins.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/predicate-propagation.test
M 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-18 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 6:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG@15
PS6, Line 15: them into the Kudu client. Because the Kudu client doesn't 
provide a
> Is there any documentation for Kudu about what ordering Kudu uses for compa
I asked the Kudu team, and its not documented anywhere, but they confirmed they 
use the same ordering as Impala does.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@106
PS6, Line 106: bloom_filter
> has_bloom_filter() ? At callsites it reads like this should return a filter
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@109
PS6, Line 109: min_max_filter
> has_min_max_filter?
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc@1220
PS6, Line 1220:   MinMaxFilter::Or(src_type(), params.min_max_filter, 
min_max_filter_.get());
> I mentioned this in a later comment, but we should consider what happens if
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc
File be/src/runtime/runtime-filter-bank.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc@235
PS6, Line 235:   MinMaxFilter* min_max_filter = MinMaxFilter::Create(type, 
_pool_);
> (nit) - could combine into one line.
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc
File be/src/runtime/runtime-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc@27
PS6, Line 27:   // to a) the atomicity of / pointer assignments and b) the x86 
TSO memory model.
> Comment not relevant any more since we're using actual atomics?
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc
File be/src/util/min-max-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc@53
PS6, Line 53:   min_str = string(value->ptr, value->len);
> We should think through the behaviour with very large strings. It may make
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h
File be/src/util/min-max-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h@83
PS6, Line 83: #define NUMERIC_MIN_MAX_FILTER(NAME, TYPE)
 \
> Yeah, the macro seems like it may be the least of the evils.
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@24
PS6, Line 24: "
> #include "runtime/timestamp-value.inline.h"
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@77
PS6, Line 77:   std::string NAME##MinMaxFilter::DebugString() const {   
 \
> nit: don't need std:: prefix in .cc files.
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@139
PS6, Line 139:   out->min.__set_string_val(std::min(in.min.string_val, 
out->min.string_val));
> I feel like we should use our StringValue::Compare() function instead of re
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@291
PS6, Line 291:   BigIntMinMaxFilter::Or(in, out);
> Shouldn't this be calling the timestamp method?
TColumnData doesn't have a timestamp field, so I convert timestamps with 
UtcToUnixTimeMicros and pass them around in thrift as longs. (noted in a 
comment now)

Unless I'm missing a reason that comparing timestamps with way won't lead to 
the same results as comparing the unconverted TimestampValues.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@327
PS6, Line 327:   }
> Timestamp?
Same as above.


http://gerrit.cloudera.org:8080/#/c/7793/6/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/6/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@379
PS6, Line 379: if (slotRef == null || slotRef.getDesc().getColumn() == 
null
> Are there planner tests for all these cases?
Done


http://gerrit.cloudera.org:8080/#/c/7793/6/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
File testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test:


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-16 Thread Anonymous Coward (Code Review)
Anonymous Coward #345 has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 6:

(1 comment)

link error while not include this inline head file.

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@24
PS6, Line 24: "
#include "runtime/timestamp-value.inline.h"



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 16 Oct 2017 07:37:46 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-11 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 6:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG@15
PS6, Line 15: them into the Kudu client. Because the Kudu client doesn't 
provide a
Is there any documentation for Kudu about what ordering Kudu uses for 
comparisons for different types? I guess we already push filters so we're 
already depending on the orderings being the same as Impala, but I'd be 
interested to know.

We had a similar discussion about Parquet and the spec got clarified a lot as a 
result (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md 
now specifies it). Parquet-MR also had various odd behaviour that got shaken 
out as a result of this.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@106
PS6, Line 106: bloom_filter
has_bloom_filter() ? At callsites it reads like this should return a filter 
rather than a bool.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@109
PS6, Line 109: min_max_filter
has_min_max_filter?


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc@1220
PS6, Line 1220:   MinMaxFilter::Or(src_type(), params.min_max_filter, 
min_max_filter_.get());
I mentioned this in a later comment, but we should consider what happens if the 
min/max values are very large - maybe thre's some memory management similar to 
above.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc
File be/src/runtime/runtime-filter-bank.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc@235
PS6, Line 235:   MinMaxFilter* min_max_filter = MinMaxFilter::Create(type, 
_pool_);
(nit) - could combine into one line.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc
File be/src/runtime/runtime-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc@27
PS6, Line 27:   // to a) the atomicity of / pointer assignments and b) the x86 
TSO memory model.
Comment not relevant any more since we're using actual atomics?


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc
File be/src/util/min-max-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc@53
PS6, Line 53:   min_str = string(value->ptr, value->len);
We should think through the behaviour with very large strings. It may make 
sense to disable the filters if the strings are too large instead of allocating 
arbitrarily large amounts of memory.

Maybe we could also do some trickery with truncating the strings to avoid 
handling the filters not existing. E.g. if we have a 4 byte limit on min/max, 
replace min="aaa" with min="" and max="zz" with max="zzz(". 
Interested to hear what you think - might be too clever but seems like it could 
work.

It would be good to also allocate tracked memory for the string. The Parquet 
ColumnStats class solves a similar problem using StringBuffers that allocate 
from a MemPool.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h
File be/src/util/min-max-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h@83
PS6, Line 83: #define NUMERIC_MIN_MAX_FILTER(NAME, TYPE)
 \
Yeah, the macro seems like it may be the least of the evils.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@77
PS6, Line 77:   std::string NAME##MinMaxFilter::DebugString() const {   
 \
nit: don't need std:: prefix in .cc files.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@139
PS6, Line 139:   out->min.__set_string_val(std::min(in.min.string_val, 
out->min.string_val));
I feel like we should use our StringValue::Compare() function instead of 
relying on std::string's comparison function. I'm not sure if they are exactly 
the same but if we compare these the same way as we compare StringValue's it's 
easier to reason about.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@291
PS6, Line 291:   BigIntMinMaxFilter::Or(in, out);
Shouldn't this be calling the timestamp method?


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@327
PS6, Line 327:   }
Timestamp?



[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 6:

(1 comment)

I've been meaning to do another full pass over this. Should have time tomorrow.

http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG@9
PS5, Line 9: This patch implements min-max filters for runtime filters. Each
> My intention (which I probably wasn't super consistent about, I'll try to c
This is good to know. I suppose another way of looking at it is that bloom 
filters, min-max filters are both kinds of runtime filters, and a composite 
filter with both kinds is also a kind of runtime filter. I'll keep this in mind 
doing my next pass.



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 10 Oct 2017 23:34:11 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-06 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 6:

(32 comments)

http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG@9
PS5, Line 9: This patch implements min-max filters for runtime filters. Each
> I had a general terminology question that popped up as I read through. Does
My intention (which I probably wasn't super consistent about, I'll try to clean 
it up) was that a "runtime filter" corresponds to a single filter id and might 
contain and bloom and/or min-max (eventually and/or in-list) filters.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h@80
PS5, Line 80: /// FilterContext contains all metadata for a single runtime 
filter, and allows the filter
> This seems to be an example of a place where "runtime filter" could potenti
Done


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h@107
PS5, Line 107:
> Can you add a blank line between methods?
Done


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc@422
PS5, Line 422: col_type, filter_expr->type().ToIR(codegen), 
"expr_type_arg");
> Why is decimal special here?
Decimal isn't supported for Kudu, so I didn't implement min-max filters for it 
since they wouldn't be testable.

I special cased it here because you might have a decimal bloom filter and you 
would hit this code path, but a better solution is to just check if the filter 
desc says to build a min-max filter, which will never be the case if the type 
is decimal.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc@434
PS5, Line 434: // Call Insert() on the bloom filter.
> Can we make this a lookup table? Seems like the only difference between the
Done


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/hdfs-parquet-scanner.cc@401
PS5, Line 401: if (filter->BloomAlwaysTrue()
> Can we phrase the method in a way that doesn't depend on the assumption tha
I changed this to highlight the fact that the hdfs scanner only deals with 
bloom filters for now.

The runtime filter here might have a min-max filter, if it also has a kudu scan 
target, and since min-max filters don't really have a concept of "AlwaysTrue" 
we would have to say the filter isn't always true.

That would make sense if we were actually applying the min-max filter here, but 
its sort of weird since we're only really applying the bloom filter.

It can then be changed to not be bloom specific when we add consideration of 
min-max filters here.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@174
PS5, Line 174: for (const FilterContext& ctx : scan_node_->filter_ctxs_) {
> How important is this and is it blocked by anything else? If it's impactful
Done


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@175
PS5, Line 175: filt
> nullptr
Done


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@177
PS5, Line 177: se()) {
> dynamic_cast seems unnecessary since the type should always be correct - ca
Done


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@179
PS5, Line 179: eCurre
> const string&?
Done


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-util.cc
File be/src/exec/kudu-util.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-util.cc@231
PS5, Line 231:   int64_t ts_micros;
> This timestamp code is a bit subtle. Is there a sane way to combine the con
Done


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/partitioned-hash-join-builder.cc@548
PS5, Line 548: IMER(repartition_ti
> Is this meant to double-count the filters if there is a bloom and min-max f
You're right, this needs to be counting "runtime filters" (corresponding to a 
single filter id/FilterContext).

This gets a little confusing, since you might have a bloom filter that gets 
disabled but its not counted as disabled here because the min-max filter isn't 
disabled, but it'll make a lot more sense in the long run goal here of having 
all scanners support all filter types (and of course the fact that the bloom 
filter is disabled in that situation will still show up in the runtime profile 

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-10-06 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Tim Armstrong, Mostafa Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#6).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter and/or a min-max filter,
depending on if it has HDFS and/or Kudu targets, respectively.

Min-max filters are generated by the PartitionedHashJoinBuilder. For
now, min-max filters are only applied at the KuduScanner, which passes
them into the Kudu client. Because the Kudu client doesn't provide a
way to specify generic filter exprs, min-max filters are only
generated when the target expr is a bare Kudu column ref.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Codegen is used to eliminate branching on the type of the min-max
filter.

Testing:
- Updated planner tests.
- Ran existing runtime filter tests.
- Ran preliminary perf tests to demonstrate that it works. Will update
  with more specific results.
- Still needs more e2e tests.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-mt.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test
A testdata/workloads/functional-query/queries/QueryTest/bloom_filters.test
A testdata/workloads/functional-query/queries/QueryTest/bloom_filters_wait.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
M 
testdata/workloads/functional-query/queries/QueryTest/runtime_filters_wait.test
M tests/common/impala_test_suite.py
M tests/query_test/test_runtime_filters.py
M tests/util/test_file_parser.py
47 files changed, 1,581 insertions(+), 377 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/7793/6
--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-09-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 5:

(33 comments)

I did a pass over the backend part of it. Haven't gotten into all the 
nitty-gritty yet. The overall approach seems to make sense, had some thoughts 
on the classes and various parts of the implementation.

http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG@9
PS5, Line 9: This patch implements min-max filters for runtime filters. Each
I had a general terminology question that popped up as I read through. Does 
"runtime filter" mean a composition of a bloom filter and min-max filter, or 
does "runtime filter" mean either a bloom filter or a min-max filter.


http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG@30
PS5, Line 30: - Still needs more e2e tests.
I'm sure you have more tests in mind. I was thinking it would be important to 
test filters on every supported data type.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h@80
PS5, Line 80: /// FilterContext contains all metadata for a single runtime 
filter, and allows the filter
This seems to be an example of a place where "runtime filter" could potential 
mean the composition of a bloom and min-max filter.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h@107
PS5, Line 107:   /// Returns true if this FilterContext should build a min-max 
filter, if possible.
Can you add a blank line between methods?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc@422
PS5, Line 422:   if (filter_expr->type().type != PrimitiveType::TYPE_DECIMAL) {
Why is decimal special here?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc@434
PS5, Line 434: switch (filter_expr->type().type) {
Can we make this a lookup table? Seems like the only difference between the 
cases is an enum value and a string value.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/hdfs-parquet-scanner.cc@401
PS5, Line 401: if (filter->BloomAlwaysTrue()
Can we phrase the method in a way that doesn't depend on the assumption that 
Parquet only handles bloom filters (since this method shouldn't really care 
about the internals of the filter). Does filter->AlwaysTrue() make sense? 
Because if there's no min-max filter and the bloom filter is always true, then 
the filter as a whole is always true.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@174
PS5, Line 174:   // TODO: if the filter is always false, we can skip this 
entire scan.
How important is this and is it blocked by anything else? If it's impactful 
might be best to just get it done and other of the way.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@175
PS5, Line 175: NULL
nullptr


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@177
PS5, Line 177: dynamic_cast
dynamic_cast seems unnecessary since the type should always be correct - can we 
just use static_cast?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@179
PS5, Line 179: string
const string&?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-util.cc
File be/src/exec/kudu-util.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-util.cc@231
PS5, Line 231:   const TimestampValue* tv = reinterpret_cast(value);
This timestamp code is a bit subtle. Is there a sane way to combine the 
conversion logic with WriteKuduTimestampValue() above?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/partitioned-hash-join-builder.cc@548
PS5, Line 548: num_enabled_filters
Is this meant to double-count the filters if there is a bloom and min-max 
filter on the same column? I'm not sure what the right thing to do here is but 
it does seem based on the class structure that we consider a matching pair of 
bloom filter + min max filter to be a single "runtime filter".


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/coordinator-filter-state.h
File be/src/runtime/coordinator-filter-state.h:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/coordinator-filter-state.h@92
PS5, Line 92: aall
all



[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-09-26 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..


Patch Set 5:

(1 comment)

> The Parquet writer does something similar for the Parquet
 > statistics, can any of that code be reused with regards to codegen?

This was discussed in the original design thread, and we felt that there was 
too much parquet specific stuff current in/going to be added soon in the 
parquet statistics code, so we decided to just keep it separate.

http://gerrit.cloudera.org:8080/#/c/7793/2/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/2/common/thrift/ImpalaInternalService.thrift@740
PS2, Line 740:   4: optional i64 queue_timeout_ms;
 :
 :   // Default query options that are applied to requests mapped 
to this pool.
 :
> This could be a reasonable approach, or another idea is:
Done



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 27 Sep 2017 03:08:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

2017-09-26 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Lars Volker, Matthew Jacobs, Mostafa Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7793

to look at the new patch set (#5).

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
..

IMPALA-4252: Min-max runtime filters for Kudu

This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter and/or a min-max filter,
depending on if it has HDFS and/or Kudu targets, respectively.

Min-max filters are generated by the PartitionedHashJoinBuilder. For
now, min-max filters are only applied at the KuduScanner, which passes
them into the Kudu client. Because the Kudu client doesn't provide a
way to specify generic filter exprs, min-max filters are only
generated when the target expr is a bare Kudu column ref.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Codegen is used to eliminate branching on the type of the min-max
filter.

Testing:
- Updated planner tests.
- Ran existing runtime filter tests.
- Ran preliminary perf tests to demonstrate that it works. Will update
  with more specific results.
- Still needs more e2e tests.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/impala-internal-service.cc
M be/src/util/CMakeLists.txt
A be/src/util/min-max-filter-ir.cc
A be/src/util/min-max-filter.cc
A be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test
37 files changed, 1,464 insertions(+), 146 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/7793/5
--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar