[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 17: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 17 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 28 Nov 2020 07:30:05 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of non-empty files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further, if there's a WHERE clause, it must have an 'always_true' hint in order for the optimization to be considered. For example: set optimize_simple_limit = true; SELECT * FROM T WHERE /* +always_true */ LIMIT 10; If there are too many empty files in the partitions, it is possible that the query may produce fewer rows although those are still valid rows. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added an e2e test. Since result rows are non-deterministic, only simple count(*) query on top of subquery with limit was added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Reviewed-on: http://gerrit.cloudera.org:8080/16723 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 17 files changed, 517 insertions(+), 20 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 18 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 17: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 17 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 28 Nov 2020 01:52:12 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 17: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6718/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 17 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 28 Nov 2020 01:52:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 16: Code-Review+2 Carry +2 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 16 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 28 Nov 2020 01:49:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 16: > Patch Set 15: > > > Patch Set 9: > > > > > Patch Set 8: > > > > > > > Patch Set 8: > > > > > > > > > Patch Set 8: > > > > > > > > > > > Patch Set 8: Verified-1 > > > > > > > > > > > > Build failed: > > > > > > https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/ > > > > > > > > > > There's 1 failure: > > > > > [gw1] FAILED > > > > > query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit > > > > > > > > > > However, on my desktop I ran query_test/test_exprs.py and all tests > > > > > under it passed. > > > > > > > > On Jenkins this test hit an OOM GC overhead limit exceeded: > > > > gw1] linux2 -- Python 2.7.16 > > > > /home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python > > > > query_test/test_exprs.py:176: in test_statement_expression_limit > > > > assert re.search(expected_err_re, str(err)) > > > > E assert None > > > > E+ where None = ('Exceeded the > > > > statement expression limit \\(25\\)\nStatement has .* > > > > expressions.', "ImpalaBeeswaxException:\n INNER EXCEPTION: > > > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC > > > > overhead limit exceeded\n") > > > > E+where = re.search > > > > E+and "ImpalaBeeswaxException:\n INNER EXCEPTION: > > > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC > > > > overhead limit exceeded\n" = str(ImpalaBeeswaxException()) > > > > > > Oh wait, there's actually another failure in FE: > > > > > > 23:47:52 [INFO] Running org.apache.impala.analysis.ParserTest > > > 23:47:52 [ERROR] Tests run: 98, Failures: 1, Errors: 0, Skipped: 0, Time > > > elapsed: 0.829 s <<< FAILURE! - in org.apache.impala.analysis.ParserTest > > > 23:47:52 [ERROR] TestGetErrorMsg(org.apache.impala.analysis.ParserTest) > > > Time elapsed: 0.005 s <<< FAILURE! > > > at > > > org.apache.impala.analysis.ParserTest.ParserError(ParserTest.java:77) > > > at > > > org.apache.impala.analysis.ParserTest.TestGetErrorMsg(ParserTest.java:3475) > > > > > > Will need to look into this one. > > > > PS9 fixes the ParserTest issue..just had to update the list of expected > > keywords in the test after a WHERE clause. Not yet sure about the other > > OOM failure since I cannot repro locally. > > One difference between my machine's execution vs the one on Jenkins is that > the Jenkins one runs it under a dockerised container and the JVM Max heap is > 1.78GB which is smaller than the 2GB on my machine. > > From ubuntu-16.04-dockerised-tests-3539/logs/ee_tests/impalad_coord_exec-0: > impalad.4069f4f171ef.ubuntu.log.INFO.20201126-001441.1: JVM: max heap size: > Total=1.78 GB > impalad.4069f4f171ef.ubuntu.log.INFO.20201126-001441.1: JVM: non-heap > committed: Total=114.50 MB > > I looked closer at the 'GC overhead limit exceeded' issue with the > test_statement_expression_limit() test and I suspect it happens because in my > original patch the default construction of the Expr object also allocates an > empty ArrayList for hints (this was consistent with how we handle other types > of hints). However, this is not really necessary and it can be done on > demand only when an actual WHERE clause hint is supplied. This particular > test creates a huge number of Exprs and considering that the Dockerised run > has a slightly smaller max heap, it might cause the GC to run more often > without reclaiming sufficient free memory. > > In PS 15 I have made the change to allocate the arraylist only when a > predicate hint was supplied. Hopefully it addresses the issue. It looks like the test_statement_expression_limit passed in the latest run on PS 15. There were couple of load failures which seem like existing flaky test issue especially https://issues.apache.org/jira/browse/IMPALA-10344 (although the same error occurs for a different table). I will start another run with a +2 carry-over. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 16 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 28 Nov 2020 01:48:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 16: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6717/ -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 16 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 27 Nov 2020 23:49:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 16: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6717/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 16 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 27 Nov 2020 18:12:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 15: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7743/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 15 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 27 Nov 2020 18:08:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 15: > Patch Set 9: > > > Patch Set 8: > > > > > Patch Set 8: > > > > > > > Patch Set 8: > > > > > > > > > Patch Set 8: Verified-1 > > > > > > > > > > Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/ > > > > > > > > There's 1 failure: > > > > [gw1] FAILED > > > > query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit > > > > > > > > However, on my desktop I ran query_test/test_exprs.py and all tests > > > > under it passed. > > > > > > On Jenkins this test hit an OOM GC overhead limit exceeded: > > > gw1] linux2 -- Python 2.7.16 > > > /home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python > > > query_test/test_exprs.py:176: in test_statement_expression_limit > > > assert re.search(expected_err_re, str(err)) > > > E assert None > > > E+ where None = ('Exceeded the > > > statement expression limit \\(25\\)\nStatement has .* expressions.', > > > "ImpalaBeeswaxException:\n INNER EXCEPTION: > > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC > > > overhead limit exceeded\n") > > > E+where = re.search > > > E+and "ImpalaBeeswaxException:\n INNER EXCEPTION: > > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC > > > overhead limit exceeded\n" = str(ImpalaBeeswaxException()) > > > > Oh wait, there's actually another failure in FE: > > > > 23:47:52 [INFO] Running org.apache.impala.analysis.ParserTest > > 23:47:52 [ERROR] Tests run: 98, Failures: 1, Errors: 0, Skipped: 0, Time > > elapsed: 0.829 s <<< FAILURE! - in org.apache.impala.analysis.ParserTest > > 23:47:52 [ERROR] TestGetErrorMsg(org.apache.impala.analysis.ParserTest) > > Time elapsed: 0.005 s <<< FAILURE! > > at > > org.apache.impala.analysis.ParserTest.ParserError(ParserTest.java:77) > > at > > org.apache.impala.analysis.ParserTest.TestGetErrorMsg(ParserTest.java:3475) > > > > Will need to look into this one. > > PS9 fixes the ParserTest issue..just had to update the list of expected > keywords in the test after a WHERE clause. Not yet sure about the other OOM > failure since I cannot repro locally. One difference between my machine's execution vs the one on Jenkins is that the Jenkins one runs it under a dockerised container and the JVM Max heap is 1.78GB which is smaller than the 2GB on my machine. >From ubuntu-16.04-dockerised-tests-3539/logs/ee_tests/impalad_coord_exec-0: impalad.4069f4f171ef.ubuntu.log.INFO.20201126-001441.1: JVM: max heap size: Total=1.78 GB impalad.4069f4f171ef.ubuntu.log.INFO.20201126-001441.1: JVM: non-heap committed: Total=114.50 MB I looked closer at the 'GC overhead limit exceeded' issue with the test_statement_expression_limit() test and I suspect it happens because in my original patch the default construction of the Expr object also allocates an empty ArrayList for hints (this was consistent with how we handle other types of hints). However, this is not really necessary and it can be done on demand only when an actual WHERE clause hint is supplied. This particular test creates a huge number of Exprs and considering that the Dockerised run has a slightly smaller max heap, it might cause the GC to run more often without reclaiming sufficient free memory. In PS 15 I have made the change to allocate the arraylist only when a predicate hint was supplied. Hopefully it addresses the issue. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 15 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 27 Nov 2020 18:06:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#15). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of non-empty files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further, if there's a WHERE clause, it must have an 'always_true' hint in order for the optimization to be considered. For example: set optimize_simple_limit = true; SELECT * FROM T WHERE /* +always_true */ LIMIT 10; If there are too many empty files in the partitions, it is possible that the query may produce fewer rows although those are still valid rows. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added an e2e test. Since result rows are non-deterministic, only simple count(*) query on top of subquery with limit was added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 17 files changed, 517 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/15 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 15 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 14: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6714/ -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 14 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 26 Nov 2020 04:28:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 14: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6714/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 14 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Nov 2020 22:55:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 14: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 14 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Nov 2020 22:55:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7737/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 13 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Nov 2020 22:50:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 13: Code-Review+2 (1 comment) Carry forward +2 http://gerrit.cloudera.org:8080/#/c/16723/12/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/16723/12/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@1211 PS12, Line 1211: strBuilder.append(ToSqlUtils.getPlanHintsSql(options, > line too long (96 > 90) Done -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 13 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Nov 2020 22:30:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#13). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of non-empty files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further, if there's a WHERE clause, it must have an 'always_true' hint in order for the optimization to be considered. For example: set optimize_simple_limit = true; SELECT * FROM T WHERE /* +always_true */ LIMIT 10; If there are too many empty files in the partitions, it is possible that the query may produce fewer rows although those are still valid rows. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added an e2e test. Since result rows are non-deterministic, only simple count(*) query on top of subquery with limit was added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 17 files changed, 514 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/13 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 13 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 12: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 12 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Nov 2020 01:10:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7732/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 12 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Nov 2020 00:47:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 12: > Uploaded patch set 12. Additional testing found a bug that when the WHERE clause hint was within a view, it was not being recognized by the simple limit optimization. The reason was the SQL generation was not propagating this additional hint and HMS definition for the view did not have the hint. PS 12 fixes this. I have done manual testing with such views. However, I haven't been able to add a test with a hint in a view. I tried creating a view with hints in the functional_schema_template.sql and the view was created but doing a 'DESCRIBE EXTENDED' shows hint was missing. This needs further exploration on the test infrastructure. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 12 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Nov 2020 00:33:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/16723/12/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/16723/12/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@1211 PS12, Line 1211: strBuilder.append(ToSqlUtils.getPlanHintsSql(options, whereClause_.getPredicateHints())) line too long (96 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 12 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Nov 2020 00:27:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#12). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of non-empty files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further, if there's a WHERE clause, it must have an 'always_true' hint in order for the optimization to be considered. For example: set optimize_simple_limit = true; SELECT * FROM T WHERE /* +always_true */ LIMIT 10; If there are too many empty files in the partitions, it is possible that the query may produce fewer rows although those are still valid rows. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added an e2e test. Since result rows are non-deterministic, only simple count(*) query on top of subquery with limit was added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 17 files changed, 514 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/12 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 12 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 11: > Patch Set 11: > > (4 comments) > > > Patch Set 11: > > > > (4 comments) > > > > Just a couple corner cases I have run into; given this is an opt-in > > optimization now it might not be incorrect to ignore these. > > > > I think it's good to think about the case where this optimization helps and > > not risk an incorrect limit in other cases. Where this helps most. > > a. lots of files > > b. small limits > > > > a) the scan range and scheduling overhead is only slow when there are many > > hosts + files. > > > > b) for large limits maybe the bulk of query run time goes to fetching > > results and not the planning, but that said it may not hurt too much in > > this case. > > Thanks for the comments. I will create a follow-up JIRA to address couple of > these comments considering that this CR was +2 ed. > Note that the more generalized issue of optimizing for limits is something > that Tim and I had some offline discussion about and he created > 'IMPALA-10347: Explore approaches to optimizing queries that will likely be > short-circuited by limits' Created https://issues.apache.org/jira/browse/IMPALA-10353 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 11 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Nov 2020 02:37:12 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 11: (4 comments) > Patch Set 11: > > (4 comments) > > Just a couple corner cases I have run into; given this is an opt-in > optimization now it might not be incorrect to ignore these. > > I think it's good to think about the case where this optimization helps and > not risk an incorrect limit in other cases. Where this helps most. > a. lots of files > b. small limits > > a) the scan range and scheduling overhead is only slow when there are many > hosts + files. > > b) for large limits maybe the bulk of query run time goes to fetching results > and not the planning, but that said it may not hurt too much in this case. Thanks for the comments. I will create a follow-up JIRA to address couple of these comments considering that this CR was +2 ed. Note that the more generalized issue of optimizing for limits is something that Tim and I had some offline discussion about and he created 'IMPALA-10347: Explore approaches to optimizing queries that will likely be short-circuited by limits' http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@854 PS11, Line 854: FileSystem partitionFs; > I'd consider only doing this optimization for "record oriented" or "splitta It simplifies the logic and our internal testing at least if we could apply it across the board..perhaps for text format we have an allowance that 10% more files be considered to accommodate 'invalid' files .. will that be acceptable ? http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@867 PS11, Line 867: for (FileDescriptor fd : partition.getFileDescriptors()) { > Also maybe a threshold on the number of scan ranges where we wouldn't bothe This is related to your other comment below about bailing out under certain conditions. I'll try to run the benchmark .. that's a good idea. http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@873 PS11, Line 873: simpleLimitNumRows++; // conservatively estimate 1 row per file > The flip side of this might be for simple limits that are relatively large One complication is that the total number of files is not known up front (unless we aggregate it up front). We are pruning at 2 levels: once in the HdfsPartitionPruner where we limit the number of partitions and then here where we limit the number of files per partition. In both places, as each partition is processed, we look at the # files but don't know the total. We could decide to do the aggregation of the num files by making a separate pass over all partitions during HdfsPartitionPruner but that will add some overhead. http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@1121 PS11, Line 1121: > For Hive ACID and deleted records would this logic still work? Might be a h For ACID tables with deleted rows, the planner will internally create an Hash Anti Join to handle the not-in, so yeah the limit should not be applied in such cases because it is no longer a simple scan. I will create a separate JIRA to handle that case since additional testing and code changes would be needed. Thanks for raising this. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 11 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Nov 2020 02:06:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Shant Hovsepian has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 11: (4 comments) Just a couple corner cases I have run into; given this is an opt-in optimization now it might not be incorrect to ignore these. I think it's good to think about the case where this optimization helps and not risk an incorrect limit in other cases. Where this helps most. a. lots of files b. small limits a) the scan range and scheduling overhead is only slow when there are many hosts + files. b) for large limits maybe the bulk of query run time goes to fetching results and not the planning, but that said it may not hurt too much in this case. http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@854 PS11, Line 854: FileSystem partitionFs; I'd consider only doing this optimization for "record oriented" or "splittable" file formats, like parquet/avro. For text tables it's not uncommon for parsing or record boundary issues to cause an entire file to be invalid. http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@867 PS11, Line 867: for (FileDescriptor fd : partition.getFileDescriptors()) { Also maybe a threshold on the number of scan ranges where we wouldn't bother with the optimization, be/src/benchmarks/scheduler-benchmark is a good test to run, if I recall tens of nodes and hundreds of files isn't too slow. Just thinking that the cases where this optimization assumption is potentially problematic are when there are a few malformed files better to err on the side of caution then. http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@873 PS11, Line 873: simpleLimitNumRows++; // conservatively estimate 1 row per file The flip side of this might be for simple limits that are relatively large or larger than the number of files a maximum bail out threshold might make sense. I.e. if the limit is 10,000 then no sense in doing this optimization unless the number of files is much greater than 10,000? http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@1121 PS11, Line 1121: For Hive ACID and deleted records would this logic still work? Might be a helpful test. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 11 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 23 Nov 2020 21:32:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 11: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6690/ -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 11 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 23 Nov 2020 00:02:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 11: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 11 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sun, 22 Nov 2020 18:32:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 11: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6690/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 11 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sun, 22 Nov 2020 18:32:03 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 10: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6689/ -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 10 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Nov 2020 11:16:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 10: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6689/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 10 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Nov 2020 05:46:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 10: Code-Review+2 Rebased on latest master. Carry forward +2 . -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 10 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Nov 2020 05:36:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7711/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 9 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Nov 2020 04:17:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 9: > Patch Set 8: > > > Patch Set 8: > > > > > Patch Set 8: > > > > > > > Patch Set 8: Verified-1 > > > > > > > > Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/ > > > > > > There's 1 failure: > > > [gw1] FAILED > > > query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit > > > > > > However, on my desktop I ran query_test/test_exprs.py and all tests under > > > it passed. > > > > On Jenkins this test hit an OOM GC overhead limit exceeded: > > gw1] linux2 -- Python 2.7.16 > > /home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python > > query_test/test_exprs.py:176: in test_statement_expression_limit > > assert re.search(expected_err_re, str(err)) > > E assert None > > E+ where None = ('Exceeded the > > statement expression limit \\(25\\)\nStatement has .* expressions.', > > "ImpalaBeeswaxException:\n INNER EXCEPTION: > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC > > overhead limit exceeded\n") > > E+where = re.search > > E+and "ImpalaBeeswaxException:\n INNER EXCEPTION: > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC > > overhead limit exceeded\n" = str(ImpalaBeeswaxException()) > > Oh wait, there's actually another failure in FE: > > 23:47:52 [INFO] Running org.apache.impala.analysis.ParserTest > 23:47:52 [ERROR] Tests run: 98, Failures: 1, Errors: 0, Skipped: 0, Time > elapsed: 0.829 s <<< FAILURE! - in org.apache.impala.analysis.ParserTest > 23:47:52 [ERROR] TestGetErrorMsg(org.apache.impala.analysis.ParserTest) Time > elapsed: 0.005 s <<< FAILURE! > at > org.apache.impala.analysis.ParserTest.ParserError(ParserTest.java:77) > at > org.apache.impala.analysis.ParserTest.TestGetErrorMsg(ParserTest.java:3475) > > Will need to look into this one. PS9 fixes the ParserTest issue..just had to update the list of expected keywords in the test after a WHERE clause. Not yet sure about the other OOM failure since I cannot repro locally. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 9 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Nov 2020 03:58:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#9). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of non-empty files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further, if there's a WHERE clause, it must have an 'always_true' hint in order for the optimization to be considered. For example: set optimize_simple_limit = true; SELECT * FROM T WHERE /* +always_true */ LIMIT 10; If there are too many empty files in the partitions, it is possible that the query may produce fewer rows although those are still valid rows. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added an e2e test. Since result rows are non-deterministic, only simple count(*) query on top of subquery with limit was added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 17 files changed, 507 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/9 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 9 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 8: > Patch Set 8: > > > Patch Set 8: > > > > > Patch Set 8: Verified-1 > > > > > > Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/ > > > > There's 1 failure: > > [gw1] FAILED > > query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit > > > > However, on my desktop I ran query_test/test_exprs.py and all tests under > > it passed. > > On Jenkins this test hit an OOM GC overhead limit exceeded: > gw1] linux2 -- Python 2.7.16 > /home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python > query_test/test_exprs.py:176: in test_statement_expression_limit > assert re.search(expected_err_re, str(err)) > E assert None > E+ where None = ('Exceeded the > statement expression limit \\(25\\)\nStatement has .* expressions.', > "ImpalaBeeswaxException:\n INNER EXCEPTION: 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC overhead > limit exceeded\n") > E+where = re.search > E+and "ImpalaBeeswaxException:\n INNER EXCEPTION: 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC overhead > limit exceeded\n" = str(ImpalaBeeswaxException()) Oh wait, there's actually another failure in FE: 23:47:52 [INFO] Running org.apache.impala.analysis.ParserTest 23:47:52 [ERROR] Tests run: 98, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.829 s <<< FAILURE! - in org.apache.impala.analysis.ParserTest 23:47:52 [ERROR] TestGetErrorMsg(org.apache.impala.analysis.ParserTest) Time elapsed: 0.005 s <<< FAILURE! at org.apache.impala.analysis.ParserTest.ParserError(ParserTest.java:77) at org.apache.impala.analysis.ParserTest.TestGetErrorMsg(ParserTest.java:3475) Will need to look into this one. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 8 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Nov 2020 03:24:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 8: > Patch Set 8: > > > Patch Set 8: Verified-1 > > > > Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/ > > There's 1 failure: > [gw1] FAILED > query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit > > However, on my desktop I ran query_test/test_exprs.py and all tests under it > passed. On Jenkins this test hit an OOM GC overhead limit exceeded: gw1] linux2 -- Python 2.7.16 /home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python query_test/test_exprs.py:176: in test_statement_expression_limit assert re.search(expected_err_re, str(err)) E assert None E+ where None = ('Exceeded the statement expression limit \\(25\\)\nStatement has .* expressions.', "ImpalaBeeswaxException:\n INNER EXCEPTION: \n MESSAGE: OutOfMemoryError: GC overhead limit exceeded\n") E+where = re.search E+and "ImpalaBeeswaxException:\n INNER EXCEPTION: \n MESSAGE: OutOfMemoryError: GC overhead limit exceeded\n" = str(ImpalaBeeswaxException()) -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 8 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Nov 2020 03:14:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 8: > Patch Set 8: Verified-1 > > Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/ There's 1 failure: [gw1] FAILED query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit However, on my desktop I ran query_test/test_exprs.py and all tests under it passed. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 8 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Nov 2020 03:09:59 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 8: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/ -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 8 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Nov 2020 02:48:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 8: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 8 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Nov 2020 21:20:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 7: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 7 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Nov 2020 21:20:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 8: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 8 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Nov 2020 21:20:30 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7703/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 7 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Nov 2020 19:00:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 6: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7702/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 6 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Nov 2020 18:55:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 7: (4 comments) http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/cup/sql-parser.cup@3115 PS5, Line 3115: // clause. An attempt was made to set this for individual exprs > I guess this is a bit limiting in the it applies only to the whole where cl That was my original attempt too..setting the hint at the predicate level but when I was running into shift/reduce conflicts. I gave it another try today with a few variations and still could not get it to work. I have left a comment in the code about this. For reference, here's one change I tried: expr ::= non_pred_expr:e {: RESULT = e; :} | opt_plan_hints:pred_hints predicate:p {: p.setPredicateHints(pred_hints); RESULT = p; :}; This generated quite a few conflicts..example: Warning : *** Shift/Reduce conflict found in state #253 between opt_plan_hints ::= (*) and case_expr ::= (*) KW_CASE expr case_when_clause_list case_else_clause KW_END and case_expr ::= (*) KW_CASE case_when_clause_list case_else_clause KW_END under symbol KW_CASE Resolved in favor of shifting. Warning : *** Shift/Reduce conflict found in state #253 between opt_plan_hints ::= (*) and cast_expr ::= (*) KW_CAST LPAREN expr KW_AS type_def cast_format_val RPAREN under symbol KW_CAST Resolved in favor of shifting. .. ... http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/analysis/Predicate.java File fe/src/main/java/org/apache/impala/analysis/Predicate.java: http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/analysis/Predicate.java@30 PS5, Line 30: { > maybe hasAlwaysTrueHint_ just to make it crystal-clear that it's not actual Changed this and the names of the setter/getter also. http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@869 PS5, Line 869: if ((fsHasBlocks && fd.getNumFileBlocks() == 0) > nit: use braces for multi-line if Done http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870 PS5, Line 870: d.getFileLength() < > We already had to deal with a similar issue here I will follow up with the doc writer on this. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 7 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Nov 2020 18:47:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#7). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of non-empty files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further, if there's a WHERE clause, it must have an 'always_true' hint in order for the optimization to be considered. For example: set optimize_simple_limit = true; SELECT * FROM T WHERE /* +always_true */ LIMIT 10; If there are too many empty files in the partitions, it is possible that the query may produce fewer rows although those are still valid rows. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added an e2e test. Since result rows are non-deterministic, only simple count(*) query on top of subquery with limit was added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 16 files changed, 505 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/7 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 7 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#6). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of non-empty files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further, if there's a WHERE clause, it must have an 'always_true' hint in order for the optimization to be considered. For example: set optimize_simple_limit = true; SELECT * FROM T WHERE /* +always_true */ LIMIT 10; If there are too many empty files in the partitions, it is possible that the query may produce fewer rows although those are still valid rows. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added an e2e test. Since result rows are non-deterministic, only simple count(*) query on top of subquery with limit was added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 16 files changed, 505 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/6 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 6 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 5: (4 comments) http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/cup/sql-parser.cup@3115 PS5, Line 3115: KW_WHERE opt_plan_hints:pred_hints expr:e I guess this is a bit limiting in the it applies only to the whole where clause. Should it be part of the expr production below so it can be attached to any expression? I don't think this affects the functionality of this patch, since we're only checking the top-level statement anyway, but it seems like itwould me more elegant to have the expr hint be associated with the expr in the parser? If there are complications with that, maybe a comment here explaining the limitation would be sufficient. http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/analysis/Predicate.java File fe/src/main/java/org/apache/impala/analysis/Predicate.java: http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/analysis/Predicate.java@30 PS5, Line 30: isAlwaysTrue_ maybe hasAlwaysTrueHint_ just to make it crystal-clear that it's not actually a guarantee? http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@869 PS5, Line 869: if (fsHasBlocks && fd.getNumFileBlocks() == 0) continue; nit: use braces for multi-line if http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870 PS5, Line 870: fd.getFileLength() > Yes. Totally agree. We probably can live with the 0-row data files through We already had to deal with a similar issue here https://impala.apache.org/docs/build/html/topics/impala_optimize_partition_key_scans.html ; we should document similarly Generally it doesn't make any sense to write files with 0 rows and it should be rare. Our experience is that some misbehaving tools can generate 0 row files (we've seen Spark do it with issues like https://issues.apache.org/jira/browse/SPARK-10216). You're right that the files are non-empty because they have the footer with the schema. I don't think there's an upper bound on the size either though, cause they could have an arbitrarily complex scheme. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 5 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Nov 2020 01:29:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 5: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870 PS5, Line 870: fd.getFileLength() > The getFileLength() check is used in other places too..so I borrowed that f Yes. Totally agree. We probably can live with the 0-row data files through documentation. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 5 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 19 Nov 2020 17:52:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 5: (2 comments) http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870 PS5, Line 870: fd.getFileLength() > Is it possible for a Parquet data file with empty # of rows pass this test? The getFileLength() check is used in other places too..so I borrowed that from generateScanRangeSpec(). For self-describing schema file formats like Parquet, yes it is possible for the length to be non-zero and num_rows zero. I think that handling such cases will need some major rework of this computeScanRangeLocation() method since right now it is agnostic to the file format (it does care about file system type but not so much the formats). Further, I believe other changes will be needed in the metadata catalog layer to ensure this FileMetaData is plumbed through although I haven't looked closely into that. The trade-off is the size of the metadata in catalog cache could blow up for large number of files and we are already run into significant memory issues. http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@947 PS5, Line 947: if (isSimpleLimit && simpleLimitNumRows == : analyzer.getSimpleLimitStatus().second) { : // for the simple limit case if the estimated rows has already reached the limit : // there's no need to process more partitions : break; : } > This is good. Ack -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 5 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 19 Nov 2020 16:51:29 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 5: (2 comments) Looks very good! The empty parquet data file may be a corner case to worry about. http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870 PS5, Line 870: fd.getFileLength() Is it possible for a Parquet data file with empty # of rows pass this test? Note that due to the meta-data portion, such a file will have some number of bytes. See one case here https://github.com/G-Research/ParquetSharp/issues/110. If we can look at the meta-data of such a file, the number of rows is right there. 871 struct FileMetaData { 872 /** Version of this file **/ 873 1: required i32 version 874 875 /** Parquet schema for this file. This schema contains metadata for all the columns. 876* The schema is represented as a tree with a single root. The nodes of the tree 877* are flattened to a list by doing a depth-first traversal. 878* The column metadata contains the path in the schema for that column which can be 879* used to map columns to nodes in the schema. 880* The first element is the root **/ 881 2: required list schema; 882 883 /** Number of rows in this file **/ 884 3: required i64 num_rows http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@947 PS5, Line 947: if (isSimpleLimit && simpleLimitNumRows == : analyzer.getSimpleLimitStatus().second) { : // for the simple limit case if the estimated rows has already reached the limit : // there's no need to process more partitions : break; : } This is good. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 5 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 19 Nov 2020 14:30:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7683/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 5 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 19 Nov 2020 02:23:59 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test File testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test: http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@241 PS4, Line 241: limit 1 > Yeah, that is a tricky issue. I went with a simple approach of skipping the empty files when we are iterating over the file descriptors and only increment the count for the non-empty files. The determination of empty vs non-empty also depends on the file system type..so there's a check for that. PS 5 has the latest changes. I also modified the commit message accordingly. I haven't added new tests for this since I am not quite sure if we have a data set that has such mixed empty and non-empty files per partition. If you guys think it is important to test that, I am open to adding one. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 5 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 19 Nov 2020 02:17:40 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#5). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of non-empty files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further, if there's a WHERE clause, it must have an 'always_true' hint in order for the optimization to be considered. For example: set optimize_simple_limit = true; SELECT * FROM T WHERE /* +always_true */ LIMIT 10; If there are too many empty files in the partitions, it is possible that the query may produce fewer rows although those are still valid rows. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added an e2e test. Since result rows are non-deterministic, only simple count(*) query on top of subquery with limit was added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 16 files changed, 500 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/5 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 5 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test File testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test: http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@241 PS4, Line 241: limit 1 > In HdfsScanNode.computeScanRangeLocation(), we skip computing the scan rang Yeah, that is a tricky issue. If we take the optimistic approach, we could add a new step in (2) when empty files are found. That is, in this new step we come up with extra up to non-empty files for a total of empty files found in the pruned list. Another approach would be abandon the pruned list and go with the full list. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 4 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 19 Nov 2020 01:42:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test File testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test: http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@241 PS4, Line 241: limit 1 > This makes me feel we should skip those files with 0 rows during pruning. I In HdfsScanNode.computeScanRangeLocation(), we skip computing the scan range if file is empty: Line 912 on master: // Skips files that have no associated blocks. if (fileDesc.getNumFileBlocks() == 0) continue; However, we populate the totalFilesPerFs_ treemap earlier .. on line 891 and that's the one that gets used to display the EXPLAIN string. So, yeah there's some inconsistency in the display (although it is possible it is intentional to show all files including empty ones in the explain). For my patch, there are 2 steps in which the pruning happens: (1) in HdfsPartitionPruner when I am limiting the number of partitions based only the number of file descriptors per partition - i.e not examining each file descriptor since that would have overhead, and (2) in HdfsScanNode I am limiting the number of files since that code already iterates over the file descriptors. I guess I could skip empty files in step 2 even though it would mess up the calculation that was done in step 1. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 4 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 18 Nov 2020 23:14:47 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 4: (1 comment) Looks nice! In addition to the empty file concern, I wonder if in the explain output, one can clearly see the application of this optimization, other than checking out the files scanned vs the total one by one. Such an indicator could be very useful in rule out a problem (if any) in the area quickly. Sorry I was not able to see it in the code. http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test File testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test: http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@241 PS4, Line 241: limit 1 This makes me feel we should skip those files with 0 rows during pruning. In my test for a table with textfile format, I can add empty files in the folder for the table and impala will process it. Query: explain select * from table_bar ++ | Explain String | ++ | Max Per-Host Resource Reservation: Memory=0B Threads=2 | | Per-Host Resource Estimates: Memory=10MB | | WARNING: The following tables are missing relevant table and/or column statistics. | | default.table_bar | | | | PLAN-ROOT SINK | | | | | 01:EXCHANGE [UNPARTITIONED] | | | | | 00:SCAN HDFS [default.table_bar] | |HDFS partitions=1/1 files=1 size=0B | |row-size=4B cardinality=0 | ++ [09:24:31 qchen@qifan-10229: parquet] sqlci -q "select * from table_bar" Starting Impala Shell with no authentication using Python 2.7.16 Warning: live_progress only applies to interactive shell sessions, and is being skipped for now. Opened TCP connection to localhost:21000 Connected to localhost:21000 Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build ebe72ec25f4c6daabaa27f6daddd03b887806507) Query: select * from table_bar Query submitted at: 2020-11-18 09:24:48 (Coordinator: http://qifan-10229:25000) Query progress can be monitored at: http://qifan-10229:25000/query_plan?query_id=df40c6ecaeeb3a0e:11dd5cb7 Fetched 0 row(s) in 4.64s drop table if exists table_bar purge; create table if not exists table_bar (a int) STORED AS textfile location '/tmp/table_bar_dir'; touch empty.txt hdfs dfs -copyFromLocal empty.txt /tmp/table_bar_dir -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 4 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 18 Nov 2020 14:43:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7668/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 4 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 18 Nov 2020 00:12:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 4: (4 comments) http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209 PS3, Line 209: for (FeFsPartition p : partitions) { : numRows += p.getNumFileDescriptors(); : prunedPartitions.add(p); : if (numRows >= analyzer.getSimpleLimitStatus().second) { : // here we only limit the partitions; later in HdfsScanNode we will : // limit the file descriptors within a partition : break; : } > Yes, I think I like the always true hint option. It seems the new code shou Done http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@845 PS3, Line 845: long simpleLimitNumRows = 0; // only used for the simple limit case > maybe simpleLimitNumRows to make it clearer what it is? Ok to ignore. Done http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test File testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test: http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@24 PS2, Line 24: # limit on both sides of a union all > Add a case where the where clause has a subquery, too? Added an IN subquery test that does NOT get optimized. Also, previously I had added an EXISTS subquery test that DOES get optimized (further below). http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@26 PS2, Line 26: union all > Yeah, I had concerns about this too but was debating how to allow for the W Discussed this offline and added the always_true hint for WHERE clause. This is examined in SelectStmt.checkSimpleLimitStmt(). Modified the tests to include this hint where applicable. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 4 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 18 Nov 2020 00:01:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#4). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further, if there's a WHERE clause, it must have an 'always_true' hint in order for the optimization to be considered. For example: set optimize_simple_limit = true; SELECT * FROM T WHERE /* +always_true */ LIMIT 10; Note that if there are empty files in the partitions, it is possible that the query may produce fewer rows although those are still valid rows. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added an e2e test. Since result rows are non-deterministic, only simple count(*) query on top of subquery with limit was added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 16 files changed, 487 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/4 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 4 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 3: (4 comments) Looks good! http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601 PS3, Line 601: (1 row per file) > Yeah, I had initially thought of that but the file metadata as it exists to Done http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180 PS3, Line 180: order-by > Not sure what you meant here. In the presence of order-by, we cannot push Yes, I agree order-by should not be allowed. Sorry I was not thinking when I type. http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209 PS3, Line 209: for (FeFsPartition p : partitions) { : numRows += p.getNumFileDescriptors(); : prunedPartitions.add(p); : if (numRows >= analyzer.getSimpleLimitStatus().second) { : // here we only limit the partitions; later in HdfsScanNode we will : // limit the file descriptors within a partition : break; : } > I think on balance it'd be better to make the partition selection determini Yes, I agree and have experienced similar situations before. I recall at the end we had to apply the randomness. http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209 PS3, Line 209: for (FeFsPartition p : partitions) { : numRows += p.getNumFileDescriptors(); : prunedPartitions.add(p); : if (numRows >= analyzer.getSimpleLimitStatus().second) { : // here we only limit the partitions; later in HdfsScanNode we will : // limit the file descriptors within a partition : break; : } > Yes, these are the pruned partitions that will be sent over to the runtime. Yes, I think I like the always true hint option. It seems the new code should check that first and then allow the WHERE clause. Customers can forget the hint here and there in the query and end up with incorrect results. Wait and see on randomize the partitions to scan sounds good to me. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 17 Nov 2020 20:08:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209 PS3, Line 209: for (FeFsPartition p : partitions) { : numRows += p.getNumFileDescriptors(); : prunedPartitions.add(p); : if (numRows >= analyzer.getSimpleLimitStatus().second) { : // here we only limit the partitions; later in HdfsScanNode we will : // limit the file descriptors within a partition : break; : } > Yes, these are the pruned partitions that will be sent over to the runtime. I think on balance it'd be better to make the partition selection deterministic, just cause it'll be easier to debug, support, etc. Agree that it might be a good optimization for workloads that were mostly comprised of queries like this, but we're not really anticipating such workloads. http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@845 PS3, Line 845: long numRows = 0; // only used for the simple limit case maybe simpleLimitNumRows to make it clearer what it is? Ok to ignore. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 17 Nov 2020 19:50:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 3: (3 comments) http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601 PS3, Line 601: (1 row per file) > Sounds like for parquet/orc, we can take a sample of few data files and get Yeah, I had initially thought of that but the file metadata as it exists today see [1] is not customized for Parquet or ORC so it does not keep track of the rowcount per file (we do have numRows_ per partition but that's an estimate obtained from hive metastore). We could certainly do more aggressive pruning during planning time if we leverage that. This could be a future enhancement (maybe there's an existing jira.. I will check). [1] https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L103 http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180 PS3, Line 180: order-by > Sounds like we should allow the order by clause since it does not increase/ Not sure what you meant here. In the presence of order-by, we cannot push the limit to the scan anyways. http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209 PS3, Line 209: for (FeFsPartition p : partitions) { : numRows += p.getNumFileDescriptors(); : prunedPartitions.add(p); : if (numRows >= analyzer.getSimpleLimitStatus().second) { : // here we only limit the partitions; later in HdfsScanNode we will : // limit the file descriptors within a partition : break; : } > I wonder if the prunnedPartitions returned here are the only ones to be sca Yes, these are the pruned partitions that will be sent over to the runtime. Regarding the WHERE clause, Tim had raised similar concern and we had some offline discussion about how to safely allow predicates. The reason for allowing at least some type of predicate is that for various reasons the user may be doing a limit on top of a view that has a predicate that is always true. For example, https://issues.apache.org/jira/browse/IMPALA-10064 describes one such use case we have seen. Tim had a suggestion for using a hint for the where clause (where /* +always_true */) and I agree that it is a safe option since it puts the onus on the user. I will update the patch with those changes. For randomly picking a subset of partitions..hmm..my initial thought is it may be over-doing the optimization. Let's see the adoption of this functionality in a concurrent workload scenario..note that the optimize_simple_limit is disabled by default. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 17 Nov 2020 19:41:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 3: (4 comments) Like it! http://gerrit.cloudera.org:8080/#/c/16723/3/be/src/service/query-options.h File be/src/service/query-options.h: http://gerrit.cloudera.org:8080/#/c/16723/3/be/src/service/query-options.h@226 PS3, Line 226: OPTIMIZE_SIMPLE_LIMIT OPTIMIZE_SIMPLE_LIMIT_QUERY probably is more specific. http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601 PS3, Line 601: (1 row per file) Sounds like for parquet/orc, we can take a sample of few data files and get the number of rows from the meta-data section for each. This probably can help speed up the pruning process. http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180 PS3, Line 180: order-by Sounds like we should allow the order by clause since it does not increase/decrease # of rows. http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209 PS3, Line 209: for (FeFsPartition p : partitions) { : numRows += p.getNumFileDescriptors(); : prunedPartitions.add(p); : if (numRows >= analyzer.getSimpleLimitStatus().second) { : // here we only limit the partitions; later in HdfsScanNode we will : // limit the file descriptors within a partition : break; : } I wonder if the prunnedPartitions returned here are the only ones to be scanned during run-time. If so, I think for this optimization to work, we should not allow any WHERE clause. The other point is that we may consider randomly pick a small subset of partitions to reduce the chance of contention from multiple small limit queries. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 17 Nov 2020 17:35:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7657/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 17 Nov 2020 00:22:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 3: > Patch Set 2: > > (1 comment) I addressed the refactoring comment and another minor comment. For the handling of WHERE predicates, I will do an update based on additional offline discussion. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 17 Nov 2020 00:08:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java File fe/src/main/java/org/apache/impala/analysis/Analyzer.java: http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@184 PS2, Line 184: // Tracks the simple LIMIT status of this query block. First item of the > Can you explain what the two values are? Done (added explanation in the code) http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1560 PS2, Line 1560: scanNode.init(analyzer); > It seems like this would most naturally fit as part of HdfsPartition prunin Refactored into a separate function in the HdfsPartitionPruner and invoked it from the pruner itself at the final stage of pruning. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 17 Nov 2020 00:03:53 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/PartitionSet.java File fe/src/main/java/org/apache/impala/analysis/PartitionSet.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/PartitionSet.java@89 PS3, Line 89: partitions_ = pruner.prunePartitions(analyzer, transformedConjuncts, true, null).first; line too long (93 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 17 Nov 2020 00:01:14 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#3). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. If enabled, in certain cases the query may produce fewer rows (due to filter conditions or presence of empty files) although those rows are valid rows that would have been present if the optimization was disabled. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added e2e tests. Since result rows are non-deterministic, only simple count(*) queries on top of subqueries with limit were added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 13 files changed, 365 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/3 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test File testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test: http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@26 PS2, Line 26: where bigint_col < 100 > Trying to apply this with non-partition predicates doesn't seem safe. Yeah, I had concerns about this too but was debating how to allow for the WHERE clause of the type supported in IMPALA-10064 which involve a view containing a predicate. I can make the algorithm more restrictive: 1. only allow BinaryPredicate and one side of it must be a partitioning column. 2. don't allow subqueries in the predicate as it ends up getting transformed to joins -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 2 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 16 Nov 2020 18:15:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 2: (4 comments) I have some serious concerns about the behaviour with predicates, not necessarily that it wouldn't be useful in some circumstances, but that it's changing the meaning of the LIMIT clause too much, even considering that it's under a query option. Otherwise I had more minor comments on the rest of the code. http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java File fe/src/main/java/org/apache/impala/analysis/Analyzer.java: http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@184 PS2, Line 184: // Tracks the simple LIMIT status of this query block Can you explain what the two values are? http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1560 PS2, Line 1560: // check eligibility of simple limit optimization: It seems like this would most naturally fit as part of HdfsPartition pruning, which is invoked above. Would be good to move this logic into a separate function regardless, but it seems like it could fit in that step. http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test File testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test: http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@24 PS2, Line 24: # limit with a WHERE clause Add a case where the where clause has a subquery, too? http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@26 PS2, Line 26: where bigint_col < 100 Trying to apply this with non-partition predicates doesn't seem safe. I get there's a use case for limiting the input data scanned even when it changes results but this seems very different and surprising behavior for the LIMIT clause compared to the other optimisation. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 2 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 16 Nov 2020 17:26:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7653/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 2 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 16 Nov 2020 04:53:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@847 PS1, Line 847: (analyzer.getQueryCtx().client_request.getQuery_options() > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1328 PS1, Line 1328: inlineViewRef.getAnalyzer().setSimpleLimitStatus(new Pair<>(new Boolean(true), > line too long (106 > 90) Done http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1564 PS1, Line 1564: if (hdfsTblRef.getSampleParams() == null > line too long (95 > 90) Done http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1581 PS1, Line 1581: } > line too long (96 > 90) Done http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1584 PS1, Line 1584: finalPartitions, hdfsTblRef, aggInfo, pair.second, isPartitionKeyScan); > line too long (91 > 90) Done -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 2 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 16 Nov 2020 04:33:07 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Hello Shant Hovsepian, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16723 to look at the new patch set (#2). Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations() which, according to query profiling, has been the main contributor to the planning time especially for large number of partitions. Further, within each partition, we only consider the number of files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. If enabled, in certain cases the query may produce fewer rows (due to filter conditions or presence of empty files) although those rows are valid rows that would have been present if the optimization was disabled. Testing: - Added planner tests for the optimization - Ran query_test.py tests by enabling the optimize_simple_limit - Added e2e tests. Since result rows are non-deterministic, only simple count(*) queries on top of subqueries with limit were added. Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test 11 files changed, 348 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/2 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 2 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7652/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 1 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sun, 15 Nov 2020 02:06:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 ) Change subject: IMPALA-10314: Optimize planning time for simple limits .. Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@847 PS1, Line 847: (analyzer.getQueryCtx().client_request.getQuery_options().isOptimize_simple_limit() line too long (91 > 90) http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1328 PS1, Line 1328: inlineViewRef.getAnalyzer().setSimpleLimitStatus(new Pair<>(new Boolean(true), outerStatus.second)); line too long (106 > 90) http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1564 PS1, Line 1564: && analyzer.getQueryCtx().client_request.getQuery_options().isOptimize_simple_limit() line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1581 PS1, Line 1581: scanNode = new HdfsScanNode(ctx_.getNextNodeId(), tupleDesc, conjuncts, finalPartitions, line too long (96 > 90) http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1584 PS1, Line 1584: scanNode = new HdfsScanNode(ctx_.getNextNodeId(), tupleDesc, conjuncts, partitions, line too long (91 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 1 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Sun, 15 Nov 2020 01:45:48 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits
Aman Sinha has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16723 Change subject: IMPALA-10314: Optimize planning time for simple limits .. IMPALA-10314: Optimize planning time for simple limits This patch optimizes the planning time for simple limit queries by only considering a minimal set of partitions whose file descriptors add up to N (the specified limit). Each file is conservatively estimated to contain 1 row. This reduces the number of partitions processed by HdfsScanNode.computeScanRangeLocations(). Further, within each partition, we only consider the number of files that brings the total to N. This is an opt-in optimization. A new planner option OPTIMIZE_SIMPLE_LIMIT enables this optimization. Testing: - Added new planner tests that enable this optimization - Ran PlannerTest - Ran query_test.py tests by enabling the optimize_simple_limit - TODO: add e2e tests (I haven't added them yet since the results are non-determinstic. Only simple count(*) queries on top of subqueries with limit could probably be added) Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test 10 files changed, 302 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/1 -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 1 Gerrit-Owner: Aman Sinha