[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 28:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7787/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 07 Dec 2020 20:35:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 28:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc
File be/src/util/min-max-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@589
PS28, Line 589: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@592
PS28, Line 592: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@597
PS28, Line 597: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@600
PS28, Line 600: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@649
PS28, Line 649: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, 
d1##SIZE, d1##SIZE);  \
line too long (108 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@653
PS28, Line 653: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, 
d1##SIZE, d2##SIZE);  \
line too long (108 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@657
PS28, Line 657: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, 
d3##SIZE, d2##SIZE);  \
line too long (108 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/28/be/src/util/min-max-filter-test.cc@669
PS28, Line 669: CheckDecimalVals(filter##SIZE##2, decimal##SIZE##_type, 
d3##SIZE, d2##SIZE); \
line too long (110 > 90)




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 07 Dec 2020 20:14:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-07 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#28). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds a new class of predicates called overlap predicates
to aid in the determination of whether a Parquet row group or a page
overlaps with a range computed from an equi hash join. If not, then
the entire Parquet row group or the page is skipped. The new class
of overlap predicates exists in the form of min/max filters.

For the following query, the min and max in the min/max filter are
computed after the hash table is populated with data from table 'b'.
These two values are compared against the min/max of each row group
or page at the scan node for 'a'.

  select straight_join count(*)
  from lineitem_sorted_l_shipdate a join [SHUFFLE]
   lineitem_sorted_l_shipdate b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";
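
The comparison described above can be illustrated with a minimal sketch. It assumes closed (inclusive) ranges over integer values; the names Range, Overlaps and CanSkip are hypothetical and are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical closed interval [min, max]; not Impala's MinMaxFilter class.
struct Range {
  int64_t min;
  int64_t max;
};

// Returns true if the two closed ranges share at least one value.
inline bool Overlaps(const Range& data, const Range& filter) {
  assert(data.min <= data.max);
  assert(filter.min <= filter.max);
  return data.min <= filter.max && filter.min <= data.max;
}

// A row group or page may be skipped only when its min/max range
// misses the range computed on the build side of the join.
inline bool CanSkip(const Range& data, const Range& filter) {
  return !Overlaps(data, filter);
}
```

With a filter range of [224167, 2630562], a page covering [1, 224166] could be skipped under this sketch, while a page covering [1, 4275687] could not.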

An overlap predicate associated with a join column of type J and a scan
column of type S will be formed provided one of the following is true:
   Both J and S are Booleans
   Both J and S are Integers (tinyint, smallint, int, or bigint)
   Both J and S are approximate numeric (float or double)
   Both J and S are Decimals with the same precision and scale
   Both J and S are strings (STRING, CHAR or VARCHAR)
   Both J and S are date
   Both J and S are timestamp
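
One way to read the rules above is as a check on type families. The sketch below is an illustration only: Kind, Type, Family and CanFormOverlapPredicate are hypothetical names, and it assumes that any two integer types (or any two string types) are acceptable together, which the list suggests but does not spell out.

```cpp
#include <cassert>

// Hypothetical, simplified type descriptor; Impala's real ColumnType differs.
enum class Kind {
  BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE,
  DECIMAL, STRING, CHAR, VARCHAR, DATE, TIMESTAMP
};

struct Type {
  Kind kind;
  int precision;  // only meaningful for DECIMAL
  int scale;      // only meaningful for DECIMAL
};

enum class Family {
  BOOLEAN, INTEGER, APPROX_NUMERIC, DECIMAL, STRING, DATE, TIMESTAMP
};

// Maps each type to the family named in the commit message.
inline Family FamilyOf(Kind k) {
  switch (k) {
    case Kind::BOOLEAN: return Family::BOOLEAN;
    case Kind::TINYINT:
    case Kind::SMALLINT:
    case Kind::INT:
    case Kind::BIGINT: return Family::INTEGER;
    case Kind::FLOAT:
    case Kind::DOUBLE: return Family::APPROX_NUMERIC;
    case Kind::DECIMAL: return Family::DECIMAL;
    case Kind::STRING:
    case Kind::CHAR:
    case Kind::VARCHAR: return Family::STRING;
    case Kind::DATE: return Family::DATE;
    case Kind::TIMESTAMP: return Family::TIMESTAMP;
  }
  assert(false);  // unreachable: every Kind is handled above
  return Family::BOOLEAN;
}

// J is the join column type, S is the scan column type.
inline bool CanFormOverlapPredicate(const Type& j, const Type& s) {
  if (FamilyOf(j.kind) != FamilyOf(s.kind)) return false;
  if (j.kind == Kind::DECIMAL) {
    return j.precision == s.precision && j.scale == s.scale;
  }
  return true;
}
```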

As with existing min/max filters, the MAX_NUM_RUNTIME_FILTERS query option
does not apply to the min/max filters created for overlap predicates.
The overlap predicates are always evaluated, after the min/max
conjuncts (if any).

Two new run-time profile counters are added to report the number of row
groups or pages filtered out via the overlap predicates respectively:
  1. NumMinMaxFilteredRowGroups
  2. NumMinMaxFilteredPages
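
How the two counters might be maintained during a scan can be sketched as follows. This is an assumption-laden illustration: RowGroup, Counters and Scan are hypothetical names, and the sketch assumes that pages inside an already skipped row group are not added to the page counter, which the commit message does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Range {
  int64_t min;
  int64_t max;
};

// Hypothetical view of the statistics available to the scanner.
struct RowGroup {
  Range stats;               // min/max of the join column over the row group
  std::vector<Range> pages;  // min/max of the join column for each page
};

// Stand-ins for NumMinMaxFilteredRowGroups and NumMinMaxFilteredPages.
struct Counters {
  int64_t num_min_max_filtered_row_groups = 0;
  int64_t num_min_max_filtered_pages = 0;
};

inline bool Overlaps(const Range& a, const Range& b) {
  return a.min <= b.max && b.min <= a.max;
}

// Applies the overlap test at row-group level first, then at page level.
// Returns the number of pages that still have to be read.
inline int64_t Scan(const std::vector<RowGroup>& row_groups,
                    const Range& filter, Counters* counters) {
  assert(counters != nullptr);
  int64_t pages_to_read = 0;
  for (const RowGroup& rg : row_groups) {
    if (!Overlaps(rg.stats, filter)) {
      ++counters->num_min_max_filtered_row_groups;
      continue;  // the whole row group is skipped
    }
    for (const Range& page : rg.pages) {
      if (Overlaps(page, filter)) {
        ++pages_to_read;
      } else {
        ++counters->num_min_max_filtered_pages;
      }
    }
  }
  return pages_to_read;
}
```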

Testing:
1. Added data type specific overlap method tests in
   min-max-filter-test.cc (boolean, int, string, date, timestamp and
   decimal);
2. Unit tested on various column types (int, bigint, string
   and decimal) with TPCH and TPCDS tables. Benefits were significant
   when the join column on the outer table is sorted, or when the
   min/max boundary values of the pages or row groups are monotonic;
3. Added new tests in min_max_filters.test to demonstrate the number of
   filtered out pages and row groups.
4. Core testing.

TBD:
1. Performance measurement;

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/decimal-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
23 files changed, 1,090 insertions(+), 153 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/28

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 27:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7778/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 04 Dec 2020 21:24:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 27:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc
File be/src/util/min-max-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@589
PS27, Line 589: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@592
PS27, Line 592: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@597
PS27, Line 597: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@600
PS27, Line 600: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@649
PS27, Line 649: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, 
d1##SIZE, d1##SIZE);  \
line too long (108 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@653
PS27, Line 653: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, 
d1##SIZE, d2##SIZE);  \
line too long (108 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@657
PS27, Line 657: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, 
d3##SIZE, d2##SIZE);  \
line too long (108 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/27/be/src/util/min-max-filter-test.cc@669
PS27, Line 669: CheckDecimalVals(filter##SIZE##2, decimal##SIZE##_type, 
d3##SIZE, d2##SIZE); \
line too long (110 > 90)




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 04 Dec 2020 21:05:31 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-04 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#27). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds the logic to utilize min/max stats for Parquet row
groups or pages to skip these entities when they don't qualify an
equi-join predicate.

A new class of predicates called overlap predicates is introduced to aid
in the determination of whether a Parquet row group or a page overlaps
with a range computed from the hash join. If not, then the entire
Parquet row group or the page is skipped.

The new class of predicates co-exists with the existing min/max conjuncts
that are introduced based on the local or transitive scan predicates.
Both classes of predicates can work individually or together with each
other. The overlap predicates are evaluated after the existing min/max
conjuncts.

Two new run-time profile counters are added for the number of row groups
or pages filtered out via the overlap predicates respectively:
  1. NumMinMaxFilteredRowGroups
  2. NumMinMaxFilteredPages

An overlap predicate associated with a join column of type J and a scan
column of type S will be formed provided one of the following is true:
   Both S and J are Booleans
   Both S and J are Integers (tinyint, smallint, int, or bigint)
   Both S and J are approximate numeric (float or double)
   Both S and J are Decimals with the same precision and scale
   Both S and J are strings (STRING, CHAR or VARCHAR)
   Both S and J are date
   Both S and J are timestamp

Testing:
1. Added data type specific overlap method tests in
   min-max-filter-test.cc (boolean, int, string, date, timestamp and
   decimal);
2. Unit tested on various column types (int, bigint, string
   and decimal) with TPCH and TPCDS tables. Benefits were significant
   when the join column on the outer table is sorted, or when the
   min/max boundary values of the pages or row groups are monotonic;
3. Added new tests in min_max_filters.test to demonstrate filtered
   pages and row groups.

TBD:
1. Compute a usefulness score for the overlap predicate and integrate
   it into MAX_NUM_RUNTIME_FILTERS limit;
2. Performance measurement;
3. Core testing.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/decimal-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
23 files changed, 1,090 insertions(+), 153 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/27

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 25:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7772/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 04 Dec 2020 02:40:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 25:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc
File be/src/util/min-max-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@589
PS25, Line 589: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@592
PS25, Line 592: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@597
PS25, Line 597: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@600
PS25, Line 600: EXPECT_EQ(overflow, false); 
  \
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@649
PS25, Line 649: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, 
d1##SIZE, d1##SIZE);  \
line too long (108 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@653
PS25, Line 653: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, 
d1##SIZE, d2##SIZE);  \
line too long (108 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@657
PS25, Line 657: CheckDecimalVals(filter##SIZE, decimal##SIZE##_type, 
d3##SIZE, d2##SIZE);  \
line too long (108 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/25/be/src/util/min-max-filter-test.cc@669
PS25, Line 669: CheckDecimalVals(filter##SIZE##2, decimal##SIZE##_type, 
d3##SIZE, d2##SIZE); \
line too long (110 > 90)




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 04 Dec 2020 02:19:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-03 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#25). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds the logic to utilize min/max stats for Parquet row
groups or pages to skip these entities when they don't qualify an
equi-join predicate.

A new class of predicates called overlap predicates is introduced to aid
in the determination of whether a Parquet row group or a page overlaps
with a range computed from the hash join. If not, then the entire
Parquet row group or the page is skipped.

The new class of predicates co-exists with the existing min/max conjuncts
that are introduced based on the local or transitive scan predicates.
Both classes of predicates can work individually or together with each
other. The overlap predicates are evaluated after the existing min/max
conjuncts.

Two new run-time profile counters are added for the number of row groups
or pages filtered via the overlap predicates respectively:
  1. NumMinMaxFilteredRowGroups
  2. NumMinMaxFilteredPages

Testing:
1. Added data type specific overlap method tests in
   min-max-filter-test.cc (boolean, int, string, date, timestamp and
   decimal);
2. Unit tested on various column types (int, bigint, string
   and decimal) with TPCH tables. Benefits were significant when the
   join column on the outer table is sorted, and somewhat observable
   when the min/max boundary values of the pages or row groups are
   monotonic;
3. Added new tests in min_max_filters.test (invoked from
   test_runtime_filters.py) to demonstrate filtered pages in
   run-time counter NumMinMaxFilteredPages.

TBD:
1. Convert remaining unit tests into query tests;
2. Performance measurement;
3. Check out the effect of implicit casting in join predicate on
   overlap evaluation;
4. Compute a usefulness score for the overlap predicate and integrate
   it into MAX_NUM_RUNTIME_FILTERS limit;
5. Core testing.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/decimal-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/datasets/tpch/tpch_schema_template.sql
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
24 files changed, 1,046 insertions(+), 153 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/25

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-03 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 24:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/date-value.cc
File be/src/runtime/date-value.cc:

http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/date-value.cc@369
PS24, Line 369: DateValue DateValue::SubtractDays(int64_t days) const {
I am not sure if this is really useful, as we already have AddDays that can be 
called with negative values.
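
The point can be illustrated with a small sketch in which subtraction is just addition of the negated count. Date here is a hypothetical stand-in that stores days since an epoch; it is not Impala's DateValue and it ignores the validity and range checks a real date type would need.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical date stored as a day count; not Impala's DateValue.
class Date {
 public:
  explicit Date(int64_t days_since_epoch) : days_(days_since_epoch) {}

  // Accepts negative values, so it also covers subtraction.
  Date AddDays(int64_t days) const { return Date(days_ + days); }

  int64_t days_since_epoch() const { return days_; }

 private:
  int64_t days_;
};

// A separate subtract helper adds nothing beyond negating the argument.
inline Date SubtractDays(const Date& d, int64_t days) {
  // Negating the minimum int64_t value would overflow.
  assert(days != std::numeric_limits<int64_t>::min());
  return d.AddDays(-days);
}
```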


http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/timestamp-value.cc@217
PS24, Line 217: add
There is already an implementation to add intervals to timestamps at 
https://github.com/apache/impala/blob/master/be/src/exprs/timestamp-functions-ir.cc#L685
It would be good to use the same implementation, because for big time_duration 
values the add/sub can be tricky.

If I saw correctly, you only add nanoseconds, so it would be enough to add 
an "addNanoSecond" function that could handle negative values too. 
timestamp-functions-ir.h could expose a function that does the add and we could 
call it from here.


http://gerrit.cloudera.org:8080/#/c/16720/24/be/src/runtime/timestamp-value.cc@226
PS24, Line 226: TimestampValue(date_ + boost::gregorian::date_duration(1),
this doesn't work correctly if 't' is more than one day
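
A sketch of a nanosecond add that handles negative values and deltas longer than one day is shown below. It is not the implementation in timestamp-functions-ir.cc; Timestamp and AddNanoseconds are hypothetical names, and the timestamp is modeled as a day count plus nanoseconds within the day.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kNanosPerDay = 24LL * 60 * 60 * 1000 * 1000 * 1000;

// Hypothetical timestamp: whole days since an epoch plus time of day.
struct Timestamp {
  int64_t day;
  int64_t nanos_of_day;  // invariant: 0 <= nanos_of_day < kNanosPerDay
};

// Adds a signed nanosecond delta of any magnitude.
inline Timestamp AddNanoseconds(const Timestamp& ts, int64_t delta) {
  assert(ts.nanos_of_day >= 0 && ts.nanos_of_day < kNanosPerDay);
  // Split the delta into whole days and a remainder first, so a delta
  // spanning several days moves the date by more than one day.
  int64_t day_delta = delta / kNanosPerDay;
  int64_t nanos = ts.nanos_of_day + delta % kNanosPerDay;
  // The sum lies in (-kNanosPerDay, 2 * kNanosPerDay); carry at most one day.
  if (nanos < 0) {
    nanos += kNanosPerDay;
    --day_delta;
  } else if (nanos >= kNanosPerDay) {
    nanos -= kNanosPerDay;
    ++day_delta;
  }
  return Timestamp{ts.day + day_delta, nanos};
}
```

Splitting the delta before adding keeps the intermediate sum small, so the only carry needed afterwards is a single day in either direction.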




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 03 Dec 2020 22:29:10 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 24:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7765/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 03 Dec 2020 00:29:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-02 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#24). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds the logic to utilize min/max stats for Parquet row
groups or pages to skip these entities when they don't qualify an
equi-join predicate.

A new class of predicates called overlap predicates is introduced to aid
in the determination of whether a Parquet row group or a page overlaps
with a range computed from the hash join. If not, then the entire
Parquet row group or the page is skipped.

The new class of predicates co-exists with the existing min/max conjuncts
that are introduced based on the local or transitive scan predicates.
Both classes of predicates can work individually or together with each
other. The overlap predicates are evaluated after the existing min/max
conjuncts.

Two new run-time profile counters are added for the number of row groups
or pages filtered via the overlap predicates respectively:
  1. NumMinMaxFilteredRowGroups
  2. NumMinMaxFilteredPages

Testing:
1. Added data type specific overlap method tests in
   min-max-filter-test.cc (boolean, int, string, date and timestamp);
2. Unit tested on various column types (int, bigint, string
   and decimal) with TPCH tables. Benefits were significant when the
   join column on the outer table is sorted, and somewhat observable
   when the min/max boundary values of the pages or row groups are
   monotonic;

TBD:
1. Convert unit tests performed into query tests;
2. Performance testing;
3. Core testing.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
21 files changed, 895 insertions(+), 142 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/24

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-01 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG@9
PS12, Line 9: This patch adds the logic to utilize min/max stats
> Okay.
Here is another good use case

select a.l_orderkey from
lineitem a join [SHUFFLE] orders b
where a.l_orderkey = b.o_orderkey
and b.o_custkey = 5 order by l_orderkey;

where 92 pages were filtered out in one impalad.

The reason for the good filtering efficiency is that the low values in pages on 
l_orderkey are somewhat monotonic.

In ::EvalOverlap(), data: min=1, max=4275687, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=1, max=3228897, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=17959, max=3246245, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=35235, max=3263680, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=52737, max=3280196, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=68708, max=3290466, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=79142, max=3294566, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=83169, max=3298692, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=87237, max=3302721, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=91328, max=3306725, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=95329, max=3310720, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=99334, max=3314819, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=103428, max=3318818, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=107458, max=3322980, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=111462, max=3327015, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=115591, max=3331171, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=119712, max=3335331, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=123810, max=3339429, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=127971, max=3343553, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=131973, max=3347585, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=136102, max=3351588, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=140354, max=3355745, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=144422, max=3359684, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=148544, max=3363814, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=152546, max=3367907, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=156640, max=3372006, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=160582, max=3376130, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=164677, max=3380229, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=168770, max=3384258, filter: min=224167, max=2630562
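
For reference, the comparison that log lines like these suggest can be 
sketched as a closed-interval overlap test. This is only an illustration: 
Range, Overlaps and CanSkipPage are hypothetical names, not the signature 
of EvalOverlap() in the patch, and integer keys are assumed.

```cpp
#include <cstdint>

// Hypothetical closed interval [min, max]; not a type from the patch.
struct Range {
  int64_t min;
  int64_t max;
};

// Two closed intervals overlap unless one ends before the other begins.
bool Overlaps(const Range& data, const Range& filter) {
  return data.min <= filter.max && filter.min <= data.max;
}

// A page (or row group) can be skipped only when its [min, max] range
// cannot contain any value inside the filter range.
bool CanSkipPage(const Range& page, const Range& filter) {
  return !Overlaps(page, filter);
}
```

Under this test, every data range listed above straddles the filter range 
(min=224167, max=2630562), so none of those particular entries would be 
skipped.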


It seems we can help create such an arrangement in a Parquet data file with a 
"partial sort" hint on a column, as follows. The partial sort can be applied 
at the row group level.

CREATE EXTERNAL TABLE functional.alltypesagg (
  id INT,
  bool_col BOOLEAN,
  tinyint_col TINYINT,
  smallint_col SMALLINT,
  int_col INT,
  bigint_col BIGINT,
  float_col FLOAT,
[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 22:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7754/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 01 Dec 2020 01:10:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-30 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#22). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds logic that uses the min/max stats of Parquet row
groups and pages to skip them when they cannot satisfy an equi-join
predicate.

A new class of predicates, called overlap predicates, determines
whether a Parquet row group or page overlaps with a range computed
from the hash join. If it does not, the entire row group or page is
skipped.

The new class of predicates co-exists with the existing min/max
conjuncts that are derived from the local or transitive scan
predicates. The two classes can work individually or together. The
overlap predicates are evaluated after the existing min/max conjuncts.
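
The evaluation order described above can be sketched roughly as follows.
This is an illustration under assumptions, not the patch's code: Stats,
StatsPredicate and ShouldSkip are hypothetical names, and integer column
values are assumed.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical min/max statistics of one row group or page.
struct Stats {
  int64_t min;
  int64_t max;
};

// Hypothetical min/max conjunct: returns true if a unit with these stats
// may still contain rows that satisfy the scan predicate.
using StatsPredicate = std::function<bool(const Stats&)>;

// Returns true if the row group or page can be skipped.
bool ShouldSkip(const Stats& stats,
                const std::vector<StatsPredicate>& min_max_conjuncts,
                const std::vector<Stats>& join_ranges) {
  // Stage 1: the existing min/max conjuncts from scan predicates.
  for (const StatsPredicate& conjunct : min_max_conjuncts) {
    if (!conjunct(stats)) return true;
  }
  // Stage 2: overlap predicates against ranges computed from the hash join.
  for (const Stats& range : join_ranges) {
    if (stats.max < range.min || range.max < stats.min) return true;
  }
  return false;
}
```

Either list may be empty, which corresponds to one class of predicates
working on its own.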

TBD:
1. Unit/performance testing;
2. Core testing.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
16 files changed, 688 insertions(+), 118 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/22
--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-30 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG@9
PS12, Line 9: This patch adds the logic to utilize min/max stats
> On the scope of the work.
I think getting the row group/page filtering working (with all data types, etc) 
is a good end-point for this patch.

My understanding is that the row/partition filtering will get enabled 
automatically once the filters are generated and I just want to understand the 
implications of that:

* It looks like the min-max filters are ordered after the bloom filters for 
evaluation purposes.
* It looks like the min-max filters don't count towards the 
MAX_NUM_RUNTIME_FILTERS limit - 
https://impala.apache.org/docs/build/html/topics/impala_max_num_runtime_filters.html#max_num_runtime_filters.
 So this means we may get some new source/destination pairs, which might 
change the runtime behaviour of some plans.

I suspect this is all a net win, since the min-max filters should be relatively 
cheap to construct and will get automatically disabled if they're ineffective 
in the scan, but there is a bit of overhead added.

So I think we want to do some benchmarks to make sure there are no regressions 
before changing the default. Probably TPC-DS, since it is heavy on the filters.
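
As a rough illustration of the "automatically disabled if ineffective" idea, 
a filter could track how often it rejects what it is asked about and switch 
itself off when the ratio is too low. The class name, the sample count and 
the ratio threshold below are all assumptions made for this sketch; they are 
not taken from Impala's implementation.

```cpp
#include <cstdint>

// Hypothetical bookkeeping for one runtime filter in a scan.
class FilterEffectiveness {
 public:
  FilterEffectiveness(int64_t min_samples, double min_reject_ratio)
      : min_samples_(min_samples), min_reject_ratio_(min_reject_ratio) {}

  // Records one evaluation of the filter and whether it rejected the input.
  void Record(bool rejected) {
    ++considered_;
    if (rejected) ++rejected_;
  }

  // The filter stays enabled until enough samples have been seen; after
  // that it is enabled only while the observed reject ratio meets the
  // threshold. The ratio is recomputed on every call.
  bool Enabled() const {
    if (considered_ < min_samples_) return true;
    return static_cast<double>(rejected_) / static_cast<double>(considered_) >=
           min_reject_ratio_;
  }

 private:
  int64_t min_samples_;
  double min_reject_ratio_;
  int64_t considered_ = 0;
  int64_t rejected_ = 0;
};
```

The extra cost per evaluation here is two counter updates, which is the kind 
of overhead a benchmark would need to confirm is negligible.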



--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 12
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 30 Nov 2020 18:31:48 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 21:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7751/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 30 Nov 2020 18:20:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-30 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#21). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds logic that uses the min/max stats of Parquet row
groups and pages to skip them when they cannot satisfy an equi-join
predicate.

A new class of predicates, called overlap predicates, determines
whether a Parquet row group or page overlaps with a range computed
from the hash join. If it does not, the entire row group or page is
skipped.

The new class of predicates co-exists with the existing min/max
conjuncts that are derived from the local scan predicates. The two
classes can work individually or together. The overlap predicates are
evaluated after the existing min/max conjuncts.

TBD:
1. Unit/performance testing;
2. Core testing.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
16 files changed, 661 insertions(+), 119 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/21
--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 9:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/7690/ : Initial code 
review checks failed. See linked job for details on the failure.


--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 9
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 19 Nov 2020 19:52:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16720/9/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/16720/9/common/thrift/PlanNodes.thrift@299
PS9, Line 299:   12: optional list slot_usage_map
line has trailing whitespace



--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 9
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 19 Nov 2020 19:36:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-19 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#9). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds logic that uses the min/max stats of Parquet row
groups and pages to skip them when they cannot satisfy an equi-join
predicate.

A new class of predicates, called overlap predicates, determines
whether a Parquet row group or page overlaps with a range computed
from the hash join. If it does not, the entire row group or page is
skipped. The new class of predicates co-exists with the existing
min/max conjuncts that are derived from the local scan predicates.
The two classes can work individually or together.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
14 files changed, 442 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/9
--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 9
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 3:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/7667/ : Initial code 
review checks failed. See linked job for details on the failure.


--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 3
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 18 Nov 2020 00:06:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/hdfs-scan-node.cc@377
PS3, Line 377:   VLOG(1) << "HdfsScanNode::ScannerThread(): filter_ctx_"
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/hdfs-scan-node.cc@378
PS3, Line 378:   << ", size=" << filter_ctxs_.size();
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@755
PS3, Line 755:
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@766
PS3, Line 766:
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@808
PS3, Line 808:   VLOG(1) << "For predicate " << root.DebugString()
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@863
PS3, Line 863:   }
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16720/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@866
PS3, Line 866: // Pull in the min and max of the page at slot
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16720/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/16720/3/common/thrift/PlanNodes.thrift@299
PS3, Line 299:   12: optional list slot_usage_map
line has trailing whitespace



--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 3
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 17 Nov 2020 23:46:56 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-17 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#3). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds logic that uses the min/max stats of Parquet pages to
skip pages that cannot satisfy an equi-join predicate.

In addition to the existing min/max conjuncts that are derived from
the local scan predicates, a new class of predicates called page
overlap predicates is introduced. The latter is used to decide whether
a Parquet page with min and max statistics overlaps with the range
defined by the min and max values from the hash join. If it does not,
the Parquet page can be skipped. Either class of predicates can work
individually or together with the other.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
13 files changed, 338 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/3
--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 3
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7647/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 2
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 13 Nov 2020 15:12:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-11-13 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16720 )


Change subject: [WIP] IMPALA-10325 Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
..

[WIP] IMPALA-10325 Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds logic that uses the min/max stats of Parquet pages to
skip pages that cannot satisfy an equi-join predicate.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
5 files changed, 89 insertions(+), 5 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/2
--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 2
Gerrit-Owner: Qifan Chen