[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 28: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7787/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 28 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 07 Dec 2020 20:35:12 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 28: (8 comments) http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc File be/src/util/min-max-filter-test.cc: http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@589 PS28, Line 589: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@592 PS28, Line 592: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@597 PS28, Line 597: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@600 PS28, Line 600: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@649 PS28, Line 649: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, d1##SIZE, d1##SIZE); \ line too long (108 > 90) http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@653 PS28, Line 653: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, d1##SIZE, d2##SIZE); \ line too long (108 > 90) http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@657 PS28, Line 657: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, d3##SIZE, d2##SIZE); \ line too long (108 > 90) http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@669 PS28, Line 669: CheckDecimalVals(filter##SIZE##2, decimal##SIZE##_type, d3##SIZE, d2##SIZE); \ line too long (110 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment 
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 28 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 07 Dec 2020 20:14:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#28). ( http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

This patch adds a new class of predicates, called overlap predicates, that determine whether a Parquet row group or page overlaps with a range computed from an equi hash join. If it does not, the entire row group or page is skipped.

The overlap predicates exist in the form of min/max filters. For the following query, the min and max in the min/max filter are computed after the hash table is populated with data from table 'b'. These two values are compared against the min/max of each row group or page at the scan node for 'a'.

  select straight_join count(*)
  from lineitem_sorted_l_shipdate a join [SHUFFLE] lineitem_sorted_l_shipdate b
  where a.l_shipdate = b.l_receiptdate and b.l_commitdate = "1992-01-31";

An overlap predicate associated with a join column of type J and a scan column of type S will be formed provided one of the following is true:

  Both J and S are Booleans
  Both J and S are Integers (tinyint, smallint, int, or bigint)
  Both J and S are approximate numeric (float or double)
  Both J and S are Decimals with the same precision and scale
  Both J and S are strings (STRING, CHAR or VARCHAR)
  Both J and S are date
  Both J and S are timestamp

Like any existing min/max filter, the min/max filters created for overlap predicates are not subject to the MAX_NUM_RUNTIME_FILTERS query option. The overlap predicates are always evaluated, after the min/max conjuncts (if any).

Two new run-time profile counters report the number of row groups and pages, respectively, filtered out via the overlap predicates:
  1. NumMinMaxFilteredRowGroups
  2. NumMinMaxFilteredPages

Testing:
  1. Added data type specific overlap method tests in min-max-filter-test.cc (boolean, int, string, date, timestamp and decimal);
  2. Unit tested on various column types (int, bigint, string and decimal) with TPCH and TPCDS tables. Benefits were significant when the join column on the outer table is sorted, or when the min/max boundary values of the pages or row groups are monotonic;
  3. Added new tests in min_max_filters.test to demonstrate the number of filtered out pages and row groups;
  4. Core testing.

TBD:
  1. Performance measurement.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/decimal-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
23 files changed, 1,090 insertions(+), 153 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/28
--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen
Gerrit-Reviewer: Csaba Ringhofer
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Qifan Chen
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Zoltan Borok-Nagy
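The overlap test described in the patch set 28 commit message can be illustrated with a minimal sketch. This is not the code from the change: the Range struct and the Overlaps and CanSkip functions are hypothetical names, and the values are plain 64-bit integers rather than Impala's per-type min/max filters. It only shows the closed-interval comparison between a row group's or page's statistics and the min/max collected on the build side of the join.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical closed interval [min, max]. This is an illustration only,
// not Impala's MinMaxFilter class.
struct Range {
  int64_t min;
  int64_t max;
};

// Returns true when the two closed ranges share at least one value.
inline bool Overlaps(const Range& data, const Range& filter) {
  assert(data.min <= data.max);
  assert(filter.min <= filter.max);
  return data.min <= filter.max && filter.min <= data.max;
}

// A row group or page may be skipped only when its statistics prove that
// none of its values can fall inside the filter range.
inline bool CanSkip(const Range& data, const Range& filter) {
  return !Overlaps(data, filter);
}
```

With a filter range of [224167, 2630562], a page whose statistics are [1, 4275687] has to be read, while a hypothetical page covering [2630563, 3000000] could be skipped.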
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 27: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7778/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 27 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 04 Dec 2020 21:24:11 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 27: (8 comments) http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc File be/src/util/min-max-filter-test.cc: http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@589 PS27, Line 589: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@592 PS27, Line 592: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@597 PS27, Line 597: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@600 PS27, Line 600: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@649 PS27, Line 649: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, d1##SIZE, d1##SIZE); \ line too long (108 > 90) http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@653 PS27, Line 653: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, d1##SIZE, d2##SIZE); \ line too long (108 > 90) http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@657 PS27, Line 657: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, d3##SIZE, d2##SIZE); \ line too long (108 > 90) http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@669 PS27, Line 669: CheckDecimalVals(filter##SIZE##2, decimal##SIZE##_type, d3##SIZE, d2##SIZE); \ line too long (110 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment 
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 27 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 04 Dec 2020 21:05:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#27). ( http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

This patch adds the logic to use the min/max stats of Parquet row groups or pages to skip these entities when they cannot satisfy an equi-join predicate.

A new class of predicates, called overlap predicates, is introduced to determine whether a Parquet row group or page overlaps with a range computed from the hash join. If it does not, the entire row group or page is skipped.

The new class of predicates coexists with the existing min/max conjuncts that are introduced based on the local or transitive scan predicates. Both classes of predicates can work individually or together. The overlap predicates are evaluated after the existing min/max conjuncts.

Two new run-time profile counters report the number of row groups and pages, respectively, filtered out via the overlap predicates:
  1. NumMinMaxFilteredRowGroups
  2. NumMinMaxFilteredPages

An overlap predicate associated with a join column of type J and a scan column of type S will be formed provided one of the following is true:

  Both S and J are Booleans
  Both S and J are Integers (tinyint, smallint, int, or bigint)
  Both S and J are approximate numeric (float or double)
  Both S and J are Decimals with the same precision and scale
  Both S and J are strings (STRING, CHAR or VARCHAR)
  Both S and J are date
  Both S and J are timestamp

Testing:
  1. Added data type specific overlap method tests in min-max-filter-test.cc (boolean, int, string, date, timestamp and decimal);
  2. Unit tested on various column types (int, bigint, string and decimal) with TPCH and TPCDS tables. Benefits were significant when the join column on the outer table is sorted, or when the min/max boundary values of the pages or row groups are monotonic;
  3. Added new tests in min_max_filters.test to demonstrate filtered pages and row groups.

TBD:
  1. Compute a usefulness score for the overlap predicate and integrate it into the MAX_NUM_RUNTIME_FILTERS limit;
  2. Performance measurement;
  3. Core testing.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/decimal-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
23 files changed, 1,090 insertions(+), 153 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/27
--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen
Gerrit-Reviewer: Csaba Ringhofer
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Qifan Chen
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Zoltan Borok-Nagy
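The type rule in the patch set 27 commit message (join column type J and scan column type S must belong to the same family, and decimals must also agree on precision and scale) can be sketched as follows. All names here (TypeKind, ColType, Family, FamilyOf, CanFormOverlapPredicate) are hypothetical and chosen for illustration; the actual check in the change lives in the planner and may be structured differently.

```cpp
#include <cassert>

// Hypothetical type tags; not Impala's ColumnType or PrimitiveType.
enum class TypeKind {
  BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE,
  DECIMAL, STRING, CHAR, VARCHAR, DATE, TIMESTAMP
};

// Precision and scale are only meaningful for DECIMAL.
struct ColType {
  TypeKind kind;
  int precision;
  int scale;
  explicit ColType(TypeKind k, int p = 0, int s = 0)
      : kind(k), precision(p), scale(s) {}
};

enum class Family {
  BOOLEAN, INTEGER, APPROX_NUMERIC, DECIMAL, STRING, DATE, TIMESTAMP
};

// Maps each type to the family named in the commit message.
inline Family FamilyOf(TypeKind k) {
  switch (k) {
    case TypeKind::BOOLEAN: return Family::BOOLEAN;
    case TypeKind::TINYINT:
    case TypeKind::SMALLINT:
    case TypeKind::INT:
    case TypeKind::BIGINT: return Family::INTEGER;
    case TypeKind::FLOAT:
    case TypeKind::DOUBLE: return Family::APPROX_NUMERIC;
    case TypeKind::DECIMAL: return Family::DECIMAL;
    case TypeKind::STRING:
    case TypeKind::CHAR:
    case TypeKind::VARCHAR: return Family::STRING;
    case TypeKind::DATE: return Family::DATE;
    case TypeKind::TIMESTAMP: return Family::TIMESTAMP;
  }
  assert(false && "unknown TypeKind");
  return Family::BOOLEAN;
}

// True when an overlap predicate may be formed for join type j and scan type s.
inline bool CanFormOverlapPredicate(const ColType& j, const ColType& s) {
  if (FamilyOf(j.kind) != FamilyOf(s.kind)) return false;
  if (j.kind == TypeKind::DECIMAL) {
    return j.precision == s.precision && j.scale == s.scale;
  }
  return true;
}
```

Under this sketch an INT join column pairs with a BIGINT scan column, but not with a DOUBLE one, and DECIMAL(10,2) pairs only with DECIMAL(10,2).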
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 25: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7772/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 25 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 04 Dec 2020 02:40:47 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 25: (8 comments) http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc File be/src/util/min-max-filter-test.cc: http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@589 PS25, Line 589: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@592 PS25, Line 592: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@597 PS25, Line 597: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@600 PS25, Line 600: EXPECT_EQ(overflow, false); \ line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@649 PS25, Line 649: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, d1##SIZE, d1##SIZE); \ line too long (108 > 90) http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@653 PS25, Line 653: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, d1##SIZE, d2##SIZE); \ line too long (108 > 90) http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@657 PS25, Line 657: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, d3##SIZE, d2##SIZE); \ line too long (108 > 90) http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@669 PS25, Line 669: CheckDecimalVals(filter##SIZE##2, decimal##SIZE##_type, d3##SIZE, d2##SIZE); \ line too long (110 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment 
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 25 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 04 Dec 2020 02:19:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#25). ( http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

This patch adds the logic to use the min/max stats of Parquet row groups or pages to skip these entities when they cannot satisfy an equi-join predicate.

A new class of predicates, called overlap predicates, is introduced to determine whether a Parquet row group or page overlaps with a range computed from the hash join. If it does not, the entire row group or page is skipped.

The new class of predicates coexists with the existing min/max conjuncts that are introduced based on the local or transitive scan predicates. Both classes of predicates can work individually or together. The overlap predicates are evaluated after the existing min/max conjuncts.

Two new run-time profile counters report the number of row groups and pages, respectively, filtered via the overlap predicates:
  1. NumMinMaxFilteredRowGroups
  2. NumMinMaxFilteredPages

Testing:
  1. Added data type specific overlap method tests in min-max-filter-test.cc (boolean, int, string, date, timestamp and decimal);
  2. Unit tested on various column types (int, bigint, string and decimal) with TPCH tables. Benefits were significant when the join column on the outer table is sorted, and somewhat observable when the min/max boundary values of the pages or row groups are monotonic;
  3. Added new tests in min_max_filters.test (invoked from test_runtime_filters.py) to demonstrate filtered pages in the run-time counter NumMinMaxFilteredPages.

TBD:
  1. Convert remaining unit tests into query tests;
  2. Performance measurement;
  3. Check the effect of implicit casting in the join predicate on overlap evaluation;
  4. Compute a usefulness score for the overlap predicate and integrate it into the MAX_NUM_RUNTIME_FILTERS limit;
  5. Core testing.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/decimal-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/datasets/tpch/tpch_schema_template.sql
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
24 files changed, 1,046 insertions(+), 153 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/25
--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen
Gerrit-Reviewer: Csaba Ringhofer
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Qifan Chen
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
..

Patch Set 24:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/date-value.cc
File be/src/runtime/date-value.cc:

http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/date-value.cc@369
PS24, Line 369: DateValue DateValue::SubtractDays(int64_t days) const {
I am not sure this is really useful, as we already have AddDays, which can be called with negative values.

http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/timestamp-value.cc@217
PS24, Line 217: add
There is already an implementation to add intervals to timestamps at https://github.com/apache/impala/blob/master/be/src/exprs/timestamp-functions-ir.cc#L685
It would be good to use the same implementation, because add/sub can be tricky for a big time_duration.

If I saw correctly, you only add nanoseconds, so it would be enough to add an "addNanoSecond" function that could handle negative values too. timestamp-functions-ir.h could expose a function that does the add, and we could call it from here.

http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/timestamp-value.cc@226
PS24, Line 226: TimestampValue(date_ + boost::gregorian::date_duration(1),
This doesn't work correctly if 't' is more than one day.

--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen
Gerrit-Reviewer: Csaba Ringhofer
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Qifan Chen
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Zoltan Borok-Nagy
Gerrit-Comment-Date: Thu, 03 Dec 2020 22:29:10 +
Gerrit-HasComments: Yes
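The concern in the last two review comments (a nanosecond add that also accepts negative values, and a carry that may span more than one day) can be illustrated with a small sketch. It deliberately avoids Boost and Impala's TimestampValue: the DayTime struct, the kNanosPerDay constant and the AddNanoseconds function are hypothetical, and a timestamp is modelled as a day number plus nanoseconds within that day. Overflow for extreme deltas is not handled.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kNanosPerDay = 86400LL * 1000000000LL;

// Hypothetical timestamp model: a day number plus nanoseconds since
// midnight, with 0 <= nanos_of_day < kNanosPerDay.
struct DayTime {
  int64_t day;
  int64_t nanos_of_day;
};

// Adds a signed nanosecond delta. The day carry is computed with floor
// semantics, so deltas larger than one day and negative deltas both work.
inline DayTime AddNanoseconds(DayTime t, int64_t delta) {
  assert(t.nanos_of_day >= 0 && t.nanos_of_day < kNanosPerDay);
  int64_t total = t.nanos_of_day + delta;
  int64_t day_carry = total / kNanosPerDay;  // truncates toward zero
  int64_t rem = total % kNanosPerDay;
  if (rem < 0) {  // convert truncation into floor for negative totals
    rem += kNanosPerDay;
    --day_carry;
  }
  return DayTime{t.day + day_carry, rem};
}
```

A subtraction is then just a call with a negative delta, which mirrors the remark that AddDays already covers what SubtractDays would do.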
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 24: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7765/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 24 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 03 Dec 2020 00:29:13 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#24). ( http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

This patch adds the logic to use the min/max stats of Parquet row groups or pages to skip these entities when they cannot satisfy an equi-join predicate.

A new class of predicates, called overlap predicates, is introduced to determine whether a Parquet row group or page overlaps with a range computed from the hash join. If it does not, the entire row group or page is skipped.

The new class of predicates coexists with the existing min/max conjuncts that are introduced based on the local or transitive scan predicates. Both classes of predicates can work individually or together. The overlap predicates are evaluated after the existing min/max conjuncts.

Two new run-time profile counters report the number of row groups and pages, respectively, filtered via the overlap predicates:
  1. NumMinMaxFilteredRowGroups
  2. NumMinMaxFilteredPages

Testing:
  1. Added data type specific overlap method tests in min-max-filter-test.cc (boolean, int, string, date and timestamp);
  2. Unit tested on various column types (int, bigint, string and decimal) with TPCH tables. Benefits were significant when the join column on the outer table is sorted, and somewhat observable when the min/max boundary values of the pages or row groups are monotonic.

TBD:
  1. Convert the unit tests performed into query tests;
  2. Performance testing;
  3. Core testing.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
21 files changed, 895 insertions(+), 142 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/24
--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen
Gerrit-Reviewer: Csaba Ringhofer
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Qifan Chen
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
..

Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG@9
PS12, Line 9: This patch adds the logic to utilize min/max stats
> Okay.

Here is another good use case

  select a.l_orderkey from lineitem a join [SHUFFLE] orders b
  where a.l_orderkey = b.o_orderkey and b.o_custkey = 5
  order by l_orderkey;

where 92 pages were filtered out in one impalad. The reason for the good filtering efficiency is that the low values in pages on l_orderkey are somewhat monotonic.

In ::EvalOverlap(), data: min=1, max=4275687, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=1, max=3228897, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=17959, max=3246245, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=35235, max=3263680, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=52737, max=3280196, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=68708, max=3290466, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=79142, max=3294566, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=83169, max=3298692, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=87237, max=3302721, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=91328, max=3306725, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=95329, max=3310720, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=99334, max=3314819, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=103428, max=3318818, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=107458, max=3322980, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=111462, max=3327015, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=115591, max=3331171, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=119712, max=3335331, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=123810, max=3339429, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=127971, max=3343553, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=131973, max=3347585, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=136102, max=3351588, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=140354, max=3355745, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=144422, max=3359684, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=148544, max=3363814, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=152546, max=3367907, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=156640, max=3372006, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=160582, max=3376130, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=164677, max=3380229, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=168770, max=3384258, filter: min=224167, max=2630562

It seems we can help create such an arrangement for a Parquet data file with a "partial sort" hint on a column, such as the following. The partial sort order can be done at the row group level.

CREATE EXTERNAL TABLE functional.alltypesagg (
  id INT,
  bool_col BOOLEAN,
  tinyint_col TINYINT,
  smallint_col SMALLINT,
  int_col INT,
  bigint_col BIGINT,
  float_col FLOAT,
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 22: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7754/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 22 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 01 Dec 2020 01:10:21 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#22). ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate This patch adds the logic to utilize min/max stats for Parquet row groups or pages to skip these entities when they don't satisfy an equi-join predicate. A new class of predicates, called overlap predicates, is introduced to determine whether a Parquet row group or page overlaps with a range computed from the hash join. If it does not, the entire row group or page is skipped. The new class of predicates co-exists with the existing min/max conjuncts, which are derived from the local or transitive scan predicates. The two classes of predicates can work individually or together. The overlap predicates are evaluated after the existing min/max conjuncts. TBD: 1. Unit/performance testing; 2. Core testing. 
Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 --- M be/src/exec/exec-node.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-column-stats.cc M be/src/exec/parquet/parquet-column-stats.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/scan-node.cc M be/src/runtime/coordinator.cc M be/src/util/min-max-filter.cc M be/src/util/min-max-filter.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java 16 files changed, 688 insertions(+), 118 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/22 -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 22 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
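The overlap test described in the commit message above can be illustrated with a minimal sketch. This is not the code in the patch: `Range`, `Overlaps`, and `CanSkip` are hypothetical names introduced here for illustration, and the sketch assumes closed integer ranges only (the patch itself is meant to cover more data types). A page or row group with statistics [min, max] may be skipped only when that range has no value in common with the [min, max] range produced by the join build side.

```cpp
#include <cstdint>

// Hypothetical closed range [min, max]; not an Impala type.
struct Range {
  int64_t min;
  int64_t max;
};

// Two closed ranges overlap iff neither one lies entirely before the other.
constexpr bool Overlaps(const Range& data, const Range& filter) {
  return data.min <= filter.max && filter.min <= data.max;
}

// A page or row group can be skipped only when its statistics range
// shares no value with the filter range.
constexpr bool CanSkip(const Range& data, const Range& filter) {
  return !Overlaps(data, filter);
}

// Filter range taken from the EvalOverlap() log lines quoted in this thread.
constexpr Range kFilter{224167, 2630562};

// A page with min=1, max=4275687 (from the same log) overlaps the filter,
// so it has to be read.
static_assert(!CanSkip(Range{1, 4275687}, kFilter), "overlapping page is kept");

// A made-up page that lies entirely above the filter range can be skipped.
static_assert(CanSkip(Range{3000000, 3500000}, kFilter), "disjoint page is skipped");
```

Because the check compares only four values per page, it is cheap relative to decoding the page, which is consistent with the expectation in this thread that min/max filters are inexpensive to evaluate.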
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG@9 PS12, Line 9: This patch adds the logic to utilize min/max stats > On the scope of the work. I think getting the row group/page filtering working (with all data types, etc) is a good end-point for this patch.

My understanding is that the row/partition filtering will get enabled automatically once the filters are generated, and I just want to understand the implications of that:

* It looks like the min-max filters are ordered after the bloom filters for evaluation purposes.
* It looks like the min-max filters don't count towards the MAX_NUM_RUNTIME_FILTERS limit - https://impala.apache.org/docs/build/html/topics/impala_max_num_runtime_filters.html#max_num_runtime_filters. So this means we will maybe get some new source/destination pairs, which might change the runtime behaviour of some plans.

I suspect this is all a net win, since the min-max filters should be relatively cheap to construct and will get automatically disabled if they're ineffective in the scan, but there is a bit of overhead added. So I think we want to do some benchmarks to make sure there are no regressions before changing the default. Probably TPC-DS, since it is heavy on the filters.
-- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 12 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 30 Nov 2020 18:31:48 + Gerrit-HasComments: Yes
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 21: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7751/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 21 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 30 Nov 2020 18:20:44 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#21). ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate This patch adds the logic to utilize min/max stats for Parquet row groups or pages to skip these entities when they don't satisfy an equi-join predicate. A new class of predicates, called overlap predicates, is introduced to determine whether a Parquet row group or page overlaps with a range computed from the hash join. If it does not, the entire row group or page is skipped. The new class of predicates co-exists with the existing min/max conjuncts, which are derived from the local scan predicates. The two classes of predicates can work individually or together. The overlap predicates are evaluated after the existing min/max conjuncts. TBD: 1. Unit/performance testing; 2. Core testing. 
Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 --- M be/src/exec/exec-node.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-column-stats.cc M be/src/exec/parquet/parquet-column-stats.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/scan-node.cc M be/src/runtime/coordinator.cc M be/src/util/min-max-filter.cc M be/src/util/min-max-filter.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java 16 files changed, 661 insertions(+), 119 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/21 -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 21 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 9: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/7690/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 9 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Thu, 19 Nov 2020 19:52:57 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/16720/9/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/16720/9/common/thrift/PlanNodes.thrift@299 PS9, Line 299: 12: optional list slot_usage_map line has trailing whitespace -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 9 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Thu, 19 Nov 2020 19:36:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate This patch adds the logic to utilize min/max stats for Parquet row groups or pages to skip these entities when they don't satisfy an equi-join predicate. A new class of predicates, called overlap predicates, is introduced to determine whether a Parquet row group or page overlaps with a range computed from the hash join. If it does not, the entire row group or page is skipped. The new class of predicates co-exists with the existing min/max conjuncts, which are derived from the local scan predicates. The two classes of predicates can work individually or together. Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 --- M be/src/exec/exec-node.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-column-stats.cc M be/src/exec/parquet/parquet-column-stats.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/scan-node.cc M be/src/runtime/coordinator.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java 14 files changed, 442 insertions(+), 19 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/9 -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: 
I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 9 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 3: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/7667/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 3 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 18 Nov 2020 00:06:34 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 3: (8 comments) http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/hdfs-scan-node.cc File be/src/exec/hdfs-scan-node.cc: http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/hdfs-scan-node.cc@377 PS3, Line 377: VLOG(1) << "HdfsScanNode::ScannerThread(): filter_ctx_" line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/hdfs-scan-node.cc@378 PS3, Line 378: << ", size=" << filter_ctxs_.size(); line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@755 PS3, Line 755: line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@766 PS3, Line 766: line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@808 PS3, Line 808: VLOG(1) << "For predicate " << root.DebugString() line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@863 PS3, Line 863: } line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@866 PS3, Line 866: // Pull in the min and max of the page at slot line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16720/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/16720/3/common/thrift/PlanNodes.thrift@299 PS3, Line 299: 12: optional list slot_usage_map line has trailing whitespace -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 3 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Tue, 17 Nov 2020 23:46:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate This patch adds the logic to utilize min/max stats for Parquet pages to skip certain pages when they don't satisfy an equi-join predicate. In addition to the existing min/max conjuncts, which are derived from the local scan predicates, a new class of predicates called page overlap predicates is introduced. The latter is used to decide whether a Parquet page with min and max statistics overlaps with the range defined by the min and max values from the hash join. If it does not, the Parquet page can be skipped. Either class of predicates can work individually or together with the other. Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 --- M be/src/exec/exec-node.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scan-node.cc M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/scan-node.cc M be/src/runtime/coordinator.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java 13 files changed, 338 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/3 -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 3 Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7647/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 2 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 13 Nov 2020 15:12:28 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate This patch adds the logic to utilize min/max stats for Parquet pages to skip certain pages when they don't satisfy an equi-join predicate. Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 --- M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/scan-node.cc M be/src/runtime/coordinator.cc M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java 5 files changed, 89 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/2 -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 2 Gerrit-Owner: Qifan Chen