[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12974 )

Change subject: IMPALA-7608: Estimate row count from file size when no stats available
..

IMPALA-7608: Estimate row count from file size when no stats available

Added a feature that computes an estimated number of rows for an HDFS table when cardinality statistics for that table are not available. Also added a query option to revert to the previous behavior in case of regression.

Testing:
(1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);".
(2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. Each tested PlanNode is checked both with the feature enabled and with it disabled.
(3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios.
(4) There are 8 JUnit tests in PlannerTest.java that produce different distributed query plans when this feature is enabled. Added an additional JUnit test, run with the feature enabled, for 6 of those 8 affected tests. Each tested query in a newly added test file involves at least one HDFS table without available statistics. No enabled-feature test cases were added for the other 2 affected tests, testResourceRequirements() and testSpillableBufferSizing(), because doing so results in flaky tests; this patch tests them only with the feature disabled.
(5) There are 5 Python end-to-end tests containing queries that produce different results. Added an additional query, run with the feature disabled, for each affected query.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Reviewed-on: http://gerrit.cloudera.org:8080/12974
Reviewed-by: Impala Public Jenkins
Tested-by: Impala Public Jenkins
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test
A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-query/queries/QueryTest/admission-max-min-mem-limits.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/inline-view.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters.test
M testdata/workloads/functional-query/queries/QueryTest/set.test
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
23 files changed, 2,145 insertions(+), 25 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 25
Gerrit-Owner: Fang-Yu Rao
Gerrit-Reviewer: Fang-Yu Rao
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Paul Rogers
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Todd Lipcon
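
The behavior described in the commit message of this change can be illustrated with a minimal sketch. It is not the code from HdfsScanNode.java: the class name, method name, parameters, and the compression factor are all hypothetical and chosen only for illustration. The sketch assumes the estimate is the total on-disk file size, scaled by an assumed compression ratio and divided by an assumed average row size, and it uses a boolean flag as a stand-in for the new query option that restores the old "unknown cardinality" result of -1.

```java
import java.util.List;

/** Illustrative sketch only; names and formula are assumptions, not Impala's implementation. */
final class RowCountEstimateSketch {
    /** Sentinel for "cardinality unknown", matching the -1 in the original test expectation. */
    static final long UNKNOWN = -1;

    /**
     * Returns the row count from stats when present; otherwise estimates it from file sizes.
     *
     * @param statsNumRows          row count from table stats, or a negative value if missing
     * @param fileSizesBytes        on-disk sizes of the table's files (hypothetical input)
     * @param estimatedRowSizeBytes assumed average uncompressed row size
     * @param compressionFactor     assumed ratio of uncompressed to on-disk bytes (1.0 = none)
     * @param disableEstimate       stand-in for the query option that turns the feature off
     */
    static long estimateNumRows(long statsNumRows, List<Long> fileSizesBytes,
            double estimatedRowSizeBytes, double compressionFactor, boolean disableEstimate) {
        // Real statistics always win over an estimate.
        if (statsNumRows >= 0) return statsNumRows;
        // With the feature disabled, or without usable inputs, report "unknown".
        if (disableEstimate || estimatedRowSizeBytes <= 0 || compressionFactor <= 0) {
            return UNKNOWN;
        }
        long totalBytes = 0;
        for (long size : fileSizesBytes) totalBytes += size;
        // Approximate uncompressed bytes, then divide by the assumed row size.
        return Math.round(totalBytes * compressionFactor / estimatedRowSizeBytes);
    }
}
```

With the flag set, the sketch returns -1 for a table without stats, which mirrors the expectation that testBasicsWithoutStats() had before this change; with the flag cleared, it returns a non-negative estimate instead.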
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 24: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 24 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 21 Jun 2019 03:28:41 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 24: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 24 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 21:58:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 23: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 23 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 21:58:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 24: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4515/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 24 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 21:58:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 23: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3697/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 23 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 16:39:30 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#23). ( http://gerrit.cloudera.org:8080/12974 )

Change subject: IMPALA-7608: Estimate row count from file size when no stats available
..

IMPALA-7608: Estimate row count from file size when no stats available

Added a feature that computes an estimated number of rows for an HDFS table when cardinality statistics for that table are not available. Also added a query option to revert to the previous behavior in case of regression.

Testing:
(1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);".
(2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. Each tested PlanNode is checked both with the feature enabled and with it disabled.
(3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios.
(4) There are 8 JUnit tests in PlannerTest.java that produce different distributed query plans when this feature is enabled. Added an additional JUnit test, run with the feature enabled, for 6 of those 8 affected tests. Each tested query in a newly added test file involves at least one HDFS table without available statistics. No enabled-feature test cases were added for the other 2 affected tests, testResourceRequirements() and testSpillableBufferSizing(), because doing so results in flaky tests; this patch tests them only with the feature disabled.
(5) There are 5 Python end-to-end tests containing queries that produce different results. Added an additional query, run with the feature disabled, for each affected query.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test
A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-query/queries/QueryTest/admission-max-min-mem-limits.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/inline-view.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters.test
M testdata/workloads/functional-query/queries/QueryTest/set.test
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
23 files changed, 2,145 insertions(+), 25 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/23

--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 23
Gerrit-Owner: Fang-Yu Rao
Gerrit-Reviewer: Fang-Yu Rao
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Paul Rogers
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 21: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3676/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 21 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 02:26:52 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#21). ( http://gerrit.cloudera.org:8080/12974 )

Change subject: IMPALA-7608: Estimate row count from file size when no stats available
..

IMPALA-7608: Estimate row count from file size when no stats available

Added a feature that computes an estimated number of rows for an HDFS table when cardinality statistics for that table are not available. Also added a query option to revert to the previous behavior in case of regression.

Testing:
(1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);".
(2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. Each tested PlanNode is checked both with the feature enabled and with it disabled.
(3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios.
(4) There are 8 JUnit tests in PlannerTest.java that produce different distributed query plans when this feature is enabled. Added an additional JUnit test, run with the feature enabled, for each of those 8 affected tests. Each tested query in a newly added test file involves at least one HDFS table without available statistics.
(5) There are 5 Python end-to-end tests containing queries that produce different results. Added an additional query, run with the feature disabled, for each affected query.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test
A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-query/queries/QueryTest/admission-max-min-mem-limits.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/inline-view.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters.test
M testdata/workloads/functional-query/queries/QueryTest/set.test
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
25 files changed, 3,042 insertions(+), 23 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/21

--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 21
Gerrit-Owner: Fang-Yu Rao
Gerrit-Reviewer: Fang-Yu Rao
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Paul Rogers
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 20: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/3658/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 20 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 18 Jun 2019 19:15:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has abandoned this change. ( http://gerrit.cloudera.org:8080/13419 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Abandoned The actual review is: https://gerrit.cloudera.org/#/c/12974/ -- To view, visit http://gerrit.cloudera.org:8080/13419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: Ie1b28e56a8a98eaf1871766ad6ca1f62c9688fa7 Gerrit-Change-Number: 13419 Gerrit-PatchSet: 6 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/13419 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 1: This can be abandoned, right? The actual review is: https://gerrit.cloudera.org/#/c/12974/ -- To view, visit http://gerrit.cloudera.org:8080/13419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie1b28e56a8a98eaf1871766ad6ca1f62c9688fa7 Gerrit-Change-Number: 13419 Gerrit-PatchSet: 1 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 18 Jun 2019 18:56:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#20). ( http://gerrit.cloudera.org:8080/12974 )

Change subject: IMPALA-7608: Estimate row count from file size when no stats available
..

IMPALA-7608: Estimate row count from file size when no stats available

Added a feature that computes an estimated number of rows for an HDFS table when cardinality statistics for that table are not available. Also added a query option to revert to the previous behavior in case of regression.

Testing:
(1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);".
(2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. Each tested PlanNode is checked both with the feature enabled and with it disabled.
(3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios.
(4) There are 8 JUnit tests in PlannerTest.java that produce different distributed query plans when this feature is enabled. Added an additional JUnit test, run with the feature enabled, for each of those 8 affected tests. Each tested query in a newly added test file involves at least one HDFS table without available statistics.
(5) There are 5 Python end-to-end tests containing queries that produce different results. Added an additional query, run with the feature disabled, for each affected query.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test
A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-query/queries/QueryTest/admission-max-min-mem-limits.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/inline-view.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters.test
M testdata/workloads/functional-query/queries/QueryTest/set.test
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
25 files changed, 3,042 insertions(+), 23 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/20

--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 20
Gerrit-Owner: Fang-Yu Rao
Gerrit-Reviewer: Fang-Yu Rao
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Paul Rogers
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 18: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3648/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 18 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 17 Jun 2019 22:03:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#18). ( http://gerrit.cloudera.org:8080/12974 )

Change subject: IMPALA-7608: Estimate row count from file size when no stats available
..

IMPALA-7608: Estimate row count from file size when no stats available

Added a feature that computes an estimated number of rows for an HDFS table when cardinality statistics for that table are not available. Also added a query option to revert to the previous behavior in case of regression.

Testing:
(1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);".
(2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. Each tested PlanNode is checked both with the feature enabled and with it disabled.
(3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios.
(4) There are 8 JUnit tests in PlannerTest.java that produce different distributed query plans when this feature is enabled. Added an additional JUnit test, run with the feature enabled, for each of those 8 affected tests. Each tested query in a newly added test file involves at least one HDFS table without available statistics.
(5) There are 5 Python end-to-end tests containing queries that produce different results. Added an additional query, run with the feature disabled, for each affected query.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test
A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-query/queries/QueryTest/admission-max-min-mem-limits.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/inline-view.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters.test
M testdata/workloads/functional-query/queries/QueryTest/set.test
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
25 files changed, 3,045 insertions(+), 23 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/18

--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 18
Gerrit-Owner: Fang-Yu Rao
Gerrit-Reviewer: Fang-Yu Rao
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Paul Rogers
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 16: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3638/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 16 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sun, 16 Jun 2019 18:40:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 17: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3639/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 17 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sun, 16 Jun 2019 18:39:43 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#17). ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. IMPALA-7608: Estimate row count from file size when no stats available Added the feature that computes an estimated number of rows in the current hdfs table if the statistics for the cardinality of the current hdfs table are not available. Also added an additional query option to revert the change in case of regression. Testing: (1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);". (2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. For each tested PlanNode, the behaviors before and after we disable the feature are both tested. (3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios. (4) There are 8 JUnit tests in PlannerTest.java that would produce different distributed query plans when this feature is enabled. Added an additional JUnit test for each of those 8 affected JUnit tests when this feature is enabled. Specifically, each tested query in a newly added test file involves at least one hdfs table without available statistics. (5) There are 2 Python end to end tests that consist of queries that would produce different results. Added an additional query for each affected query when this feature is disabled.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-planner/queries/PlannerTest/joins.test A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test M testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters.test M testdata/workloads/functional-query/queries/QueryTest/set.test M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test 23 files changed, 2,984 insertions(+), 24 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF 
refs/changes/74/12974/17 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 17 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#16). ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. IMPALA-7608: Estimate row count from file size when no stats available Added the feature that computes an estimated number of rows in the current hdfs table if the statistics for the cardinality of the current hdfs table are not available. Also added an additional query option to revert the change in case of regression. Testing: (1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);". (2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. For each tested PlanNode, the behaviors before and after we disable the feature are both tested. (3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios. (4) There are 8 JUnit tests in PlannerTest.java that would produce different distributed query plans when this feature is enabled. Added an additional JUnit test for each of those 8 affected JUnit tests when this feature is enabled. Specifically, each tested query in a newly added test file involves at least one hdfs table without available statistics. (5) There are 2 Python end to end tests that consist of queries that would produce different results. Added an additional query for each affected query when this feature is disabled.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-planner/queries/PlannerTest/joins.test A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test M testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters.test M testdata/workloads/functional-query/queries/QueryTest/set.test M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test 23 files changed, 2,984 insertions(+), 24 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF 
refs/changes/74/12974/16 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 16 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 15: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4458/ -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 15 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 13 Jun 2019 00:28:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 15: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4458/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 15 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 12 Jun 2019 18:54:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 14: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 14 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 12 Jun 2019 18:54:03 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 15: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 15 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 12 Jun 2019 18:54:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3573/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 14 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 12 Jun 2019 01:14:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#14). ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. IMPALA-7608: Estimate row count from file size when no stats available Added the feature that computes an estimated number of rows in the current hdfs table if the statistics for the cardinality of the current hdfs table are not available. Also added an additional query option to revert the change in case of regression. Testing: (1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);". (2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. For each tested PlanNode, the behaviors before and after we disable the feature are both tested. (3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios. (4) There are 8 JUnit tests in PlannerTest.java that would produce different distributed query plans when this feature is enabled. Added an additional JUnit test for each of those 8 affected JUnit tests when this feature is enabled. Specifically, each tested query in a newly added test file involves at least one hdfs table without available statistics.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-planner/queries/PlannerTest/joins.test A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test M testdata/workloads/functional-query/queries/QueryTest/set.test 21 files changed, 2,940 insertions(+), 22 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/14 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF 
Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 14 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 )

Change subject: IMPALA-7608: Estimate row count from file size when no stats available
..

Patch Set 12:

(3 comments)

I have a few more nits, then I'm happy.

http://gerrit.cloudera.org:8080/#/c/12974/11/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java:

http://gerrit.cloudera.org:8080/#/c/12974/11/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@567
PS11, Line 567: // TODO: It seems that the cardinality of the SelectNode should be 1 instead
TODO(IMPALA-8647) is a bit more standard.

http://gerrit.cloudera.org:8080/#/c/12974/11/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@768
PS11, Line 768: // The cardinality check performed by this method
Should be a javadoc comment - can you check the rest of the methods in this file to make sure they're all javadoc comments.

http://gerrit.cloudera.org:8080/#/c/12974/11/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@779
PS11, Line 779: // This method allows us to inspect the cardinality of
Should be a javadoc comment

-- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 12 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 11 Jun 2019 23:00:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3562/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 12 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 11 Jun 2019 17:58:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. IMPALA-7608: Estimate row count from file size when no stats available Added the feature that computes an estimated number of rows in the current hdfs table if the statistics for the cardinality of the current hdfs table are not available. Also added an additional query option to revert the change in case of regression. Testing: (1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);". (2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. For each tested PlanNode, the behaviors before and after we disable the feature are both tested. (3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios. (4) There are 8 JUnit tests in PlannerTest.java that would produce different distributed query plans when this feature is enabled. Added an additional JUnit test for each of those 8 affected JUnit tests when this feature is enabled. Specifically, each tested query in a newly added test file involves at least one hdfs table without available statistics.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-planner/queries/PlannerTest/joins.test A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test A testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test M testdata/workloads/functional-query/queries/QueryTest/set.test 21 files changed, 2,927 insertions(+), 22 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/12 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF 
Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 12 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 )

Change subject: IMPALA-7608: Estimate row count from file size when no stats available
..

Patch Set 10:

(6 comments)

Mainly comments about comments and some cleanup.

One administrative thing - it would have been a little easier to review PS10 if you did the rebase in a separate patchset. The diff from PS9->PS10 was noisy because of unrelated changes in query-options.cc picked up from the rebase.

http://gerrit.cloudera.org:8080/#/c/12974/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/12974/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@179
PS10, Line 179: private static double ESTIMATED_COMPRESSION_FACTOR_LEGACY = 3.58;// to change
Can you remove these "to change" comments? I don't think they help much.

http://gerrit.cloudera.org:8080/#/c/12974/10/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java:

http://gerrit.cloudera.org:8080/#/c/12974/10/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@225
PS10, Line 225: // functional.alltypesmixedformat is a table of 4 partitions,
I have some nits about these test comments. I think they should be javadoc comments, just for consistency. The text is also wrapped at < 90 lines - in some cases the comment would fit on fewer lines.

http://gerrit.cloudera.org:8080/#/c/12974/10/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@243
PS10, Line 243: // True cardinality of tpch_text_gzip.lineitem is 6,001,215.
Thanks for these comments about the cardinality, this is actually really helpful to understand the test.

http://gerrit.cloudera.org:8080/#/c/12974/10/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@516
PS10, Line 516: // Estimated cardinality of the NestedLoopJoinNode is 550,564 = 742 * 742.
I guess we're not so good at estimating cardinality of tiny parquet files because of the footer?

http://gerrit.cloudera.org:8080/#/c/12974/10/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@564
PS10, Line 564: //TODO: It seems that the cardinality of the SelectNode should be 1 instead
Nice catch. Can you file a JIRA for this and mention in TODO? This is much better for tracking purposes - bugs that are only tracked by TODOs in the code tend to be forgotten easily.

And consider fixing it in a separate commit. It seems like an oversight, I think this code should probably have a max(1, ...) to avoid setting it to 0.

    cardinality_ = Math.round(((double) getChild(0).cardinality_) * computeSelectivity());
    Preconditions.checkState(cardinality_ >= 0);

http://gerrit.cloudera.org:8080/#/c/12974/10/testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-disabled.test
File testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-disabled.test:

PS10:
I think you can revert this file to its original name (same for the other similar cases where gerrit shows a rename). Encoding all of the options in the file name isn't really scalable, so I'd prefer default-join-distr-mode-shuffle.test for conciseness.

-- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 10 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 10 Jun 2019 18:07:43 + Gerrit-HasComments: Yes
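The max(1, ...) suggestion in the review comment above can be illustrated with a small standalone sketch. This is not the SelectNode code: the class and method names are hypothetical, and the handling of an unknown (-1) or empty (0) child cardinality is an assumption added only to make the example self-contained.

```java
public class SelectCardinalitySketch {
    /**
     * Scales a child's cardinality by a selectivity in [0, 1].
     * Assumptions for this sketch: -1 means "unknown" and is passed through,
     * and an empty child (0 rows) stays at 0.
     */
    static long selectCardinality(long childCardinality, double selectivity) {
        if (childCardinality < 0) return -1;
        if (childCardinality == 0) return 0;
        long scaled = Math.round(childCardinality * selectivity);
        // Clamp so that rounding never turns a non-empty input into an estimate of 0.
        return Math.max(1, scaled);
    }

    public static void main(String[] args) {
        // Without the clamp, 2 rows at selectivity 0.1 would round down to 0.
        System.out.println(selectCardinality(2, 0.1));
        System.out.println(selectCardinality(100, 0.5));
    }
}
```

Clamping after rounding, rather than before, keeps larger estimates unchanged and only affects the case the reviewer called out, where a small input rounds to zero.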
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3552/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 10 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 10 Jun 2019 05:11:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. IMPALA-7608: Estimate row count from file size when no stats available Added the feature that computes an estimated number of rows in the current hdfs table if the statistics for the cardinality of the current hdfs table are not available. Also added an additional query option to revert the change in case of regression. Testing: (1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" in the method testBasicsWithoutStats() with "verifyCardinality("SELECT a FROM functional.tinytable", 2);". (2) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. For each tested PlanNode, the behaviors before and after we disable the feature are both tested. (3) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios. (4) There are 8 JUnit tests in PlannerTest.java that would produce different distributed query plans when this feature is enabled. Added an additional JUnit test for each of those 8 affected JUnit tests when this feature is enabled. Specifically, each tested query in a newly added test file involves at least one hdfs table without available statistics.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java R testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test A 
testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test M testdata/workloads/functional-query/queries/QueryTest/set.test 27 files changed, 2,927 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/10 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 10 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
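As a rough illustration of the behaviour this commit message describes (use table stats when present, fall back to a file-size-based estimate otherwise, and keep returning -1 when the new query option disables the feature), here is a minimal sketch. It is not the code in HdfsScanNode.java: the class name, method name, and parameters are hypothetical, and the fallback row width of 1.0 is taken from the defaultRowWidth_ value quoted elsewhere in this thread.

```java
/** Illustrative sketch only; not the implementation in HdfsScanNode.java. */
final class RowCountFallbackSketch {
  // Fallback row width, mirroring the 1.0 default quoted in the review thread.
  static final double DEFAULT_ROW_WIDTH_ESTIMATE = 1.0;

  /**
   * Returns the row count from stats when it is known (>= 0). Otherwise returns -1
   * if estimation is disabled, or a file-size-based estimate if it is enabled.
   * All parameter names are hypothetical.
   */
  static long numRows(long statsNumRows, boolean estimationDisabled,
      long totalFileBytes, double avgRowWidthBytes) {
    if (statsNumRows >= 0) return statsNumRows;
    if (estimationDisabled) return -1;
    // A row width of 0 can occur when no scalar column contributes to the sum.
    double rowWidth = avgRowWidthBytes > 0 ? avgRowWidthBytes : DEFAULT_ROW_WIDTH_ESTIMATE;
    return Math.round(totalFileBytes / rowWidth);
  }

  public static void main(String[] args) {
    System.out.println(numRows(100, false, 4096, 16.0));  // stats available
    System.out.println(numRows(-1, true, 4096, 16.0));    // no stats, feature disabled
    System.out.println(numRows(-1, false, 4096, 16.0));   // no stats, estimate from size
  }
}
```

The sketch leaves out the compression adjustment discussed in the review comments; it only shows the order of the three cases.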
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 9: (9 comments) Thanks for the patience with this. I have a few more asks but this is looking pretty good. http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@172 PS9, Line 172: // of the file before compression. Can you leave a comment explaining briefly how the estimates were produced. This is just useful in case someone needs to update the estimates. http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@310 PS9, Line 310: nit: only need one blank line http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1230 PS9, Line 1230: //Compute the estimated table size when taking compression into consideration Please use a javadoc method comment, i.e. /** */ http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1241 PS9, Line 1241: estimatedPartitionSize = estimatedPartitionSize nit: to be more concise, estimatedPartitionSize += ... http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1244 PS9, Line 1244: } else {// When the text file is not compressed. Nit: comment placement here and on l1250 is a little non-standard. It's ok, but maybe just move it to the next line for consistency? 
http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1258 PS9, Line 1258: throw new RuntimeException("Unknown Hdfs compressed format: " Could write this more densely as: if (VALID_LEGACY_FORMATS.contains(format)) { estimatedPartitionSize = estimatedPartitionSize + Math.round(p.getSize() * ESTIMATED_COMPRESSION_FACTOR_LEGACY); } else { Preconditions.checkState(VALID_COLUMNAR_FORMATS.contains(format), "Unknown HDFS compressed format: %s", this); } http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1262 PS9, Line 1262: estimatedTableSize = estimatedTableSize + nit: to be more concise, estimatedTableSize += estimatedPartitionSize http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@43 PS9, Line 43: tolerance Constant should be upper case. It's also not 100% obvious this is the cardinality tolerance. So maybe CARDINALITY_TOLERANCE http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@723 PS9, Line 723: expected, planRoot.getCardinality(), expected * tolerance); It looks like we made existing tests looser. I think we should keep those tests strict, i.e. enforce that cardinality exactly matches, and use approximate checks for the new tests where estimates are based on file sizes. It would be good if the method names reflected the behaviour, e.g. verifyApproxCardinality(). Otherwise someone reading the tests would assume that the checks are exact. 
-- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 9 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 05 Jun 2019 22:45:50 + Gerrit-HasComments: Yes
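The split between exact and approximate cardinality checks suggested in the last comment could look roughly like the sketch below. It assumes nothing about the real CardinalityTest.java beyond the names mentioned in the review (CARDINALITY_TOLERANCE, verifyApproxCardinality); the tolerance value of 0.05 and the plain-Java assertions, used here in place of JUnit so the sketch stays self-contained, are assumptions for illustration.

```java
/** Illustrative sketch only; the real test class uses JUnit and a planner. */
final class ApproxCardinalityCheckSketch {
  // Hypothetical value; the thread does not state the tolerance that was used.
  static final double CARDINALITY_TOLERANCE = 0.05;

  /** Strict check: for plans whose cardinality comes from table stats. */
  static void verifyCardinality(long expected, long actual) {
    if (expected != actual) {
      throw new AssertionError("expected " + expected + " but was " + actual);
    }
  }

  /** Loose check: for plans whose cardinality is estimated from file sizes. */
  static void verifyApproxCardinality(long expected, long actual) {
    double delta = expected * CARDINALITY_TOLERANCE;
    if (Math.abs(expected - actual) > delta) {
      throw new AssertionError(
          "expected " + expected + " +/- " + delta + " but was " + actual);
    }
  }

  public static void main(String[] args) {
    verifyCardinality(100, 100);        // passes: exact match
    verifyApproxCardinality(100, 104);  // passes: within 5% of expected
  }
}
```

Keeping the two methods separate makes the strictness visible at each call site, which is the readability concern raised in the comment.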
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 9: > Patch Set 9: > > (21 comments) > > > Patch Set 8: > > > > (1 comment) > Hi all, I have addressed your previous comments. Please review the updated patch set. Thank you very much! -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 9 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 03 Jun 2019 01:33:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 9: (21 comments) > Patch Set 8: > > (1 comment) Hi all, I have tried to address your previous comments. Please review the updated patch set. Thank you very much! http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h File be/src/service/query-options.h: http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h@48 PS8, Line 48: #define QUERY_OPTS_TABLE\ > Remove the last line, no need to describe symptoms of a DCHECK here. Done http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift@391 PS8, Line 391: // scanning. > EST->ESTIMATE. No real need to abbreviate in a safety valve argument that w Done http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@285 PS8, Line 285: private int numFilesNoDiskIds_ = 0; > Can you rewrite the comment to just describe what the value is. No need to Done http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@288 PS8, Line 288: // List of conjuncts for min/max values of parquet::Statistics, that are used to skip > Should be a constant, not a variable, i.e. private static double DEFAULT_RO Done http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1148 PS8, Line 1148: " sel=" + Double.toString(computeSelectivity())); > I would prefer if you just passed in the query options to this methods. 
Oth Done http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1190 PS8, Line 1190: } > Do we have a test case for this? It seems we currently do not have a test case for this. After some investigation, I found that the following table could exercise this code path. CREATE TABLE array_demo ( pets ARRAY ) STORED AS PARQUET; Note that sumAvgRowSizes is equal to 0 not because there is no Column defined for this table, but because the type of the current Column is not a ScalarType. In this specific case, the current Column is of ArrayType. Similarly, if the column is of MapType, sumAvgRowSizes would be equal to 0 as well. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1198 PS8, Line 1198: // the hdfs table. > Should we estimate the compression factor differently depending on the file Thanks. There are 8 supported file formats defined in https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java. After some discussions with Tim, I divided the files into 3 categories: uncompressed, legacy compressed (e.g., text, avro, rc, seq), and columnar (e.g., parquet and orc). Depending on the category of a file, we multiply the size of the file by the corresponding compression factor to derive an estimate of the file's original size before compression. From that size and the estimated row width we then compute an estimate of the number of rows in the file. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: PS8: Thanks! I will do as suggested. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@234 PS8, Line 234: > Not true if there's a group by - we should test that too. Thanks! Will add another test case for group by. 
http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@262 PS8, Line 262: verifyCardinality("SELECT COUNT(a) FROM functional.tinytable", 1, true, > Can we use something cleaner for constant lists like: Arrays.asList(0, 1, 0 Thanks! Will do as suggested. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@454 PS8, Line 454: functional.tinytable. > I think this new planner test option is awkward since the planner tests alr Thanks for the suggestion! I will revise verifyCardinality as suggested. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@482 PS8, Line 482: List path = Arrays.asList(0); Thanks for the suggestion! I have added a static variable called "tolerance" in this class that denotes the margin of error. The default value of
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3479/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 9 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 03 Jun 2019 00:51:41 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. IMPALA-7608: Estimate row count from file size when no stats available Added the feature that computes an estimated number of rows in the current hdfs table if the statistics for the cardinality of the current hdfs table are not available. Also added an additional query option to revert the change in case of regression. Testing: (1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" with "verifyCardinality("SELECT a FROM functional.tinytable", 10);". (2) In CardinalityTest.java, added a new test to ensure that the returned cardinality is still -1 when this feature is disabled. (3) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. For each tested PlanNode, the behaviors before and after we disable the feature are both tested. (4) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios. (5) There are 8 JUnit tests in PlannerTest.java that would produce different distributed query plans when this feature is enabled. Added an additional JUnit test for each of those 8 affected JUnit tests when this feature is enabled. Specifically, each tested query in the newly added test files involves at least one hdfs table without available statistics. 
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java R testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test A 
testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test M testdata/workloads/functional-query/queries/QueryTest/set.test 27 files changed, 2,877 insertions(+), 25 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/9 -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 9 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13419 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3373/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/13419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie1b28e56a8a98eaf1871766ad6ca1f62c9688fa7 Gerrit-Change-Number: 13419 Gerrit-PatchSet: 1 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 24 May 2019 02:11:41 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13419 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/13419/1/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/13419/1/common/thrift/ImpalaInternalService.thrift@351 PS1, Line 351: line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/13419/1/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/13419/1/common/thrift/ImpalaService.thrift@399 PS1, Line 399: line has trailing whitespace -- To view, visit http://gerrit.cloudera.org:8080/13419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie1b28e56a8a98eaf1871766ad6ca1f62c9688fa7 Gerrit-Change-Number: 13419 Gerrit-PatchSet: 1 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 24 May 2019 01:32:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13419 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. IMPALA-7608: Estimate row count from file size when no stats available Added the feature that computes an estimated number of rows in the current hdfs table if the statistics for the cardinality of the current hdfs table are not available. Also added an additional query option to revert the change in case of regression. Testing: (1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" with "verifyCardinality("SELECT a FROM functional.tinytable", 10);". (2) In CardinalityTest.java, added a new test to ensure that the returned cardinality is still -1 when this feature is disabled. (3) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. (4) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios. (5) There are 8 JUnit tests in PlannerTest.java that would produce different distributed query plans when this feature is enabled. Added an additional JUnit test for each of those 8 affected JUnit tests when this feature is enabled. 
Change-Id: Ie1b28e56a8a98eaf1871766ad6ca1f62c9688fa7 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java R testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test C 
testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test R testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test M testdata/workloads/functional-query/queries/QueryTest/set.test 26 files changed, 14,075 insertions(+), 76 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/13419/1 -- To view, visit http://gerrit.cloudera.org:8080/13419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ie1b28e56a8a98eaf1871766ad6ca1f62c9688fa7 Gerrit-Change-Number: 13419 Gerrit-PatchSet: 1 Gerrit-Owner: Fang-Yu Rao
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1198 PS8, Line 1198: // estimated row count by ESTIMATED_COMPRESSION_FACTOR. Should we estimate the compression factor differently depending on the file format? E.g. text, I would guess, would have a much lower compression factor than parquet. Might be interesting to look at some empirical data here, since changing it later could change plans, and it seems not too hard to do a little research. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 8 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 17 May 2019 04:44:57 + Gerrit-HasComments: Yes
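The per-format compression factor raised in this comment, which a later reply in the thread resolves into three categories (uncompressed, legacy compressed, columnar), can be sketched as follows. This is only an illustration: the enum, the set names, and especially the two factor values are placeholders invented for the example, not the constants used in the patch, and only the six formats named in the thread are listed.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch only; names and factor values are hypothetical. */
final class CompressionFactorSketch {
  enum Format { TEXT, AVRO, RC_FILE, SEQUENCE_FILE, PARQUET, ORC }

  static final Set<Format> LEGACY_FORMATS =
      EnumSet.of(Format.TEXT, Format.AVRO, Format.RC_FILE, Format.SEQUENCE_FILE);
  static final Set<Format> COLUMNAR_FORMATS = EnumSet.of(Format.PARQUET, Format.ORC);

  // Placeholder values chosen for the example; NOT the factors used in the patch.
  static final double PLACEHOLDER_LEGACY_FACTOR = 2.0;
  static final double PLACEHOLDER_COLUMNAR_FACTOR = 4.0;

  /** Scales the on-disk size by the factor of the file's category. */
  static long estimateUncompressedBytes(Format format, boolean compressed, long fileBytes) {
    if (COLUMNAR_FORMATS.contains(format)) {
      return Math.round(fileBytes * PLACEHOLDER_COLUMNAR_FACTOR);
    }
    if (!LEGACY_FORMATS.contains(format)) {
      throw new IllegalStateException("Unknown HDFS file format: " + format);
    }
    // Legacy formats are only scaled when a compression codec is in use.
    return compressed ? Math.round(fileBytes * PLACEHOLDER_LEGACY_FACTOR) : fileBytes;
  }

  /** Divides the estimated uncompressed size by the estimated row width. */
  static long estimateRows(long uncompressedBytes, double rowWidthBytes) {
    return Math.round(uncompressedBytes / rowWidthBytes);
  }

  public static void main(String[] args) {
    long bytes = estimateUncompressedBytes(Format.PARQUET, false, 1000);
    System.out.println(estimateRows(bytes, 8.0));
  }
}
```

Because changing the factors later can change query plans, as the comment notes, any real values would need to come from measurements rather than from guesses like the ones above.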
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 8: (20 comments) Another round of thoughts on the test infrastructure. I think the product code looks fine, just want to make sure the tests are more maintainable and easier to work on in the future. http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h File be/src/service/query-options.h: http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h@48 PS8, Line 48: // The Debug webpage won't be able to be connected once the DCHECK fails. Remove the last line, no need to describe symptoms of a DCHECK here. http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift@391 PS8, Line 391: DISABLE_HDFS_NUM_ROWS_EST EST->ESTIMATE. No real need to abbreviate in a safety valve argument that will be rarely used. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@285 PS8, Line 285: // Added this field to prevent the case in the method getStatsNumRows when Can you rewrite the comment to just describe what the value is. No need to include the story about why you added it here. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@288 PS8, Line 288: private double defaultRowWidth_ = 1.0; Should be a constant, not a variable, i.e. private static double DEFAULT_ROW_WIDTH; Maybe also rename to indicate that it's an estimate, not the real row width, i.e. DEFAULT_ROW_WIDTH_ESTIMATE. 
http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1148 PS8, Line 1148: private long getStatsNumRows(Analyzer analyzer) { I would prefer if you just passed in the query options to this method. Otherwise when reading code one might think that it's using the analyzer in a more complex way. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1190 PS8, Line 1190: // In the case when there is no Column defined, we use an ultimate Do we have a test case for this? http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: PS8: We should also test the nodes with the new estimate functionality disabled in these unit tests. Just to make sure that all branches in the code that handle missing stats are covered. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@234 PS8, Line 234: // The cardinality of an AggregateNode is always 1 no matter Not true if there's a group by - we should test that too. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@262 PS8, Line 262: path.add(0); Can we use something cleaner for constant lists like: Arrays.asList(0, 1, 0); http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@454 PS8, Line 454: GENERATE_DISTRIBUTED_PLAN I think this new planner test option is awkward since the planner tests already have a different mechanism to test single-node plans. Also the default for planner tests is to generate a distributed plan, which makes it more confusing. I'd propose that we instead implement it purely within this file instead - i.e. just add an argument to verifyCardinality. 
http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@482 PS8, Line 482: protected void verifyCardinality(String query, long expected) { For the tests that rely on file-size-based estimates, I think we want some margin of error for the cardinality estimates. I think Paul added something for the planner tests. The problem is that file sizes may vary a little bit, e.g. because of changes to the file writers or random variation, so cardinality estimates based on file size can vary too. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@253 PS8, Line 253: public void testFkPkJoinDetectionWithHDFSNumRowsEstEnabled() { Let's call the ones with the default options testFkPkJoinDetection() - no need to embed the fact that it's
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3247/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 8 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 16 May 2019 02:05:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. IMPALA-7608: Estimate row count from file size when no stats available Added the feature that computes an estimated number of rows in the current hdfs table if the statistics for the cardinality of the current hdfs table are not available. Also added an additional query option to revert the change in case of regression. Testing: (1) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" with "verifyCardinality("SELECT a FROM functional.tinytable", 10);". (2) In CardinalityTest.java, added a new test to ensure that the returned cardinality is still -1 when this feature is disabled. (3) In CardinalityTest.java, added more tests to check the cardinality of most PlanNode implementations. (4) In set.test, modified three related test cases to make sure that the added query option is included after executing "set all" in various scenarios. (5) There are 8 JUnit tests in PlannerTest.java that would produce different distributed query plans when this feature is enabled. Added an additional JUnit test for each of those 8 affected JUnit tests when this feature is enabled. 
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
R testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-disabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle-hdfs-num-rows-est-enabled.test
R testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test
R testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-enabled.test
R testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test
R testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test
R testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-enabled.test
R testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test
C testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-enabled.test
R testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/set.test
26 files changed, 14,074 insertions(+), 75 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/8
--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 8
Gerrit-Owner: Fang-Yu Rao
Gerrit-Reviewer: Fang-Yu Rao
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Paul Rogers
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Todd Lipcon
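For readers skimming the thread, the behaviour described in the commit message for patch set 8 can be illustrated with a minimal sketch. This is not the code in HdfsScanNode.java: the class name, method name, parameters, and the simple "total bytes divided by an assumed row size" formula are all assumptions made for illustration. The only details taken from the thread are that the estimate is derived from file size, that it applies only when no row-count statistics exist, that a query option can switch it off, and that -1 denotes an unknown cardinality.

```java
/**
 * Hypothetical sketch of file-size-based row count estimation.
 * All names and the formula are illustrative assumptions, not Impala's code.
 */
public final class RowCountEstimateSketch {
  /** Sentinel for "cardinality unknown", as used by the tests in this change. */
  public static final long UNKNOWN = -1;

  /**
   * @param statsNumRows        row count from table stats, or -1 if absent
   * @param fileSizesBytes      sizes of the table's data files, in bytes
   * @param assumedRowSizeBytes assumed average row size (hypothetical input)
   * @param estimationEnabled   stands in for the new query option
   */
  public static long estimateNumRows(long statsNumRows, long[] fileSizesBytes,
      long assumedRowSizeBytes, boolean estimationEnabled) {
    // Real statistics always win over an estimate.
    if (statsNumRows >= 0) return statsNumRows;
    // With the feature disabled, keep the old behaviour: unknown cardinality.
    if (!estimationEnabled || assumedRowSizeBytes <= 0) return UNKNOWN;
    long totalBytes = 0;
    for (long size : fileSizesBytes) totalBytes += size;
    // Round up so that a non-empty table is never estimated at zero rows.
    return (totalBytes + assumedRowSizeBytes - 1) / assumedRowSizeBytes;
  }

  public static void main(String[] args) {
    long[] files = {38L};
    System.out.println(estimateNumRows(-1, files, 20, true));   // prints 2
    System.out.println(estimateNumRows(-1, files, 20, false));  // prints -1
    System.out.println(estimateNumRows(100, files, 20, true));  // prints 100
  }
}
```

The file size and row size in `main` are arbitrary numbers chosen to exercise the three branches; they are not measurements of functional.tinytable.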
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has removed Todd Lipcon from this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Removed reviewer Todd Lipcon. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: deleteReviewer Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 8 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has removed Todd Lipcon from this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Removed reviewer Todd Lipcon. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: deleteReviewer Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 7 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 ) Change subject: IMPALA-7608: Estimate row count from file size when no stats available .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2876/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 4 Gerrit-Owner: Fang-Yu Rao Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 24 Apr 2019 05:09:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available
Fang-Yu Rao has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12974 )
Change subject: IMPALA-7608: Estimate row count from file size when no stats available
..
IMPALA-7608: Estimate row count from file size when no stats available

Added a feature that computes an estimated number of rows for an hdfs table when cardinality statistics for that table are not available. Also added a flag to revert the change in case of regression.

Testing:
(i) In CardinalityTest.java, replaced the original statement "verifyCardinality("SELECT a FROM functional.tinytable", -1);" with "verifyCardinality("SELECT a FROM functional.tinytable", 1);".
(ii) In CardinalityTest.java, added a new test "testBasicsWithoutStatsWithNumRowsEstDisabled()" to ensure that the returned cardinality is still -1 when this feature is disabled.
(iii) Corrected the corresponding expected results of PlannerTest.
(iv) Corrected the corresponding expected result of QueryTest.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
---
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
M testdata/workloads/functional-planner/queries/PlannerTest/default-join-distr-mode-shuffle.test
M testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test
M testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
13 files changed, 414 insertions(+), 352 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/74/12974/4
--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 4
Gerrit-Owner: Fang-Yu Rao
Gerrit-Reviewer: Fang-Yu Rao
Gerrit-Reviewer: Paul Rogers
Gerrit-Reviewer: Tim Armstrong