[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 17: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 17
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 28 Nov 2020 07:30:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of non-empty files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further,
if there's a WHERE clause, it must have an 'always_true'
hint in order for the optimization to be considered. For
example:
  set optimize_simple_limit = true;
  SELECT * FROM T
WHERE /* +always_true */ 
  LIMIT 10;

If there are too many empty files in the partitions, it is
possible that the query may produce fewer rows although
those are still valid rows.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added an e2e test. Since result rows are non-deterministic,
   only simple count(*) query on top of subquery with limit
   was added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Reviewed-on: http://gerrit.cloudera.org:8080/16723
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
17 files changed, 517 insertions(+), 20 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 18
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 17: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 17
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 28 Nov 2020 01:52:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 17:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6718/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 17
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 28 Nov 2020 01:52:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 16: Code-Review+2

Carry +2


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 16
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 28 Nov 2020 01:49:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 16:

> Patch Set 15:
>
> > Patch Set 9:
> >
> > > Patch Set 8:
> > >
> > > > Patch Set 8:
> > > >
> > > > > Patch Set 8:
> > > > >
> > > > > > Patch Set 8: Verified-1
> > > > > >
> > > > > > Build failed: 
> > > > > > https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/
> > > > >
> > > > > There's 1 failure:
> > > > > [gw1] FAILED 
> > > > > query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit
> > > > >
> > > > > However, on my desktop I ran query_test/test_exprs.py and all tests 
> > > > > under it passed.
> > > >
> > > > On Jenkins this test hit an OOM GC overhead limit exceeded:
> > > > gw1] linux2 -- Python 2.7.16 
> > > > /home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python
> > > > query_test/test_exprs.py:176: in test_statement_expression_limit
> > > > assert re.search(expected_err_re, str(err))
> > > > E   assert None
> > > > E+  where None = ('Exceeded the 
> > > > statement expression limit \\(25\\)\nStatement has .* 
> > > > expressions.', "ImpalaBeeswaxException:\n INNER EXCEPTION:  > > > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC 
> > > > overhead limit exceeded\n")
> > > > E+where  = re.search
> > > > E+and   "ImpalaBeeswaxException:\n INNER EXCEPTION:  > > > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC 
> > > > overhead limit exceeded\n" = str(ImpalaBeeswaxException())
> > >
> > > Oh wait, there's actually another failure in FE:
> > >
> > > 23:47:52 [INFO] Running org.apache.impala.analysis.ParserTest
> > > 23:47:52 [ERROR] Tests run: 98, Failures: 1, Errors: 0, Skipped: 0, Time 
> > > elapsed: 0.829 s <<< FAILURE! - in org.apache.impala.analysis.ParserTest
> > > 23:47:52 [ERROR] TestGetErrorMsg(org.apache.impala.analysis.ParserTest)  
> > > Time elapsed: 0.005 s  <<< FAILURE!
> > > at 
> > > org.apache.impala.analysis.ParserTest.ParserError(ParserTest.java:77)
> > > at 
> > > org.apache.impala.analysis.ParserTest.TestGetErrorMsg(ParserTest.java:3475)
> > >
> > > Will need to look into this one.
> >
> > PS9 fixes the ParserTest issue..just had to update the list of expected 
> > keywords in the test after a WHERE clause.  Not yet sure about the other 
> > OOM failure since I cannot repro locally.
>
> One difference between my machine's execution vs the one on Jenkins is that 
> the Jenkins one runs it under a dockerised container and the JVM Max heap is 
> 1.78GB which is smaller than the 2GB on my machine.
>
> From ubuntu-16.04-dockerised-tests-3539/logs/ee_tests/impalad_coord_exec-0:
> impalad.4069f4f171ef.ubuntu.log.INFO.20201126-001441.1:  JVM: max heap size: 
> Total=1.78 GB
> impalad.4069f4f171ef.ubuntu.log.INFO.20201126-001441.1:  JVM: non-heap 
> committed: Total=114.50 MB
>
> I looked closer at the 'GC overhead limit exceeded' issue with the 
> test_statement_expression_limit() test and I suspect it happens because in my 
> original patch the default construction of the Expr object also allocates an 
> empty ArrayList for hints (this was consistent with how we handle other types 
> of hints).  However, this is not really necessary and it can be done on 
> demand only when an actual WHERE clause hint is supplied. This particular 
> test creates a huge number of Exprs and considering that the Dockerised run 
> has a slightly smaller max heap, it might cause the GC to run more often 
> without reclaiming sufficient free memory.
>
> In PS 15 I have made the change to allocate the arraylist only when a 
> predicate hint was supplied. Hopefully it addresses the issue.

It looks like the test_statement_expression_limit passed in the latest run on 
PS 15. There were couple of load failures which seem like existing flaky test 
issue especially https://issues.apache.org/jira/browse/IMPALA-10344 (although 
the same error occurs for a different table).

I will start another run with a +2 carry-over.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 16
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 28 Nov 2020 01:48:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 16: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6717/


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 16
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 27 Nov 2020 23:49:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 16:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6717/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 16
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 27 Nov 2020 18:12:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 15:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7743/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 15
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 27 Nov 2020 18:08:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 15:

> Patch Set 9:
>
> > Patch Set 8:
> >
> > > Patch Set 8:
> > >
> > > > Patch Set 8:
> > > >
> > > > > Patch Set 8: Verified-1
> > > > >
> > > > > Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/
> > > >
> > > > There's 1 failure:
> > > > [gw1] FAILED 
> > > > query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit
> > > >
> > > > However, on my desktop I ran query_test/test_exprs.py and all tests 
> > > > under it passed.
> > >
> > > On Jenkins this test hit an OOM GC overhead limit exceeded:
> > > gw1] linux2 -- Python 2.7.16 
> > > /home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python
> > > query_test/test_exprs.py:176: in test_statement_expression_limit
> > > assert re.search(expected_err_re, str(err))
> > > E   assert None
> > > E+  where None = ('Exceeded the 
> > > statement expression limit \\(25\\)\nStatement has .* expressions.', 
> > > "ImpalaBeeswaxException:\n INNER EXCEPTION:  > > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC 
> > > overhead limit exceeded\n")
> > > E+where  = re.search
> > > E+and   "ImpalaBeeswaxException:\n INNER EXCEPTION:  > > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC 
> > > overhead limit exceeded\n" = str(ImpalaBeeswaxException())
> >
> > Oh wait, there's actually another failure in FE:
> >
> > 23:47:52 [INFO] Running org.apache.impala.analysis.ParserTest
> > 23:47:52 [ERROR] Tests run: 98, Failures: 1, Errors: 0, Skipped: 0, Time 
> > elapsed: 0.829 s <<< FAILURE! - in org.apache.impala.analysis.ParserTest
> > 23:47:52 [ERROR] TestGetErrorMsg(org.apache.impala.analysis.ParserTest)  
> > Time elapsed: 0.005 s  <<< FAILURE!
> > at 
> > org.apache.impala.analysis.ParserTest.ParserError(ParserTest.java:77)
> > at 
> > org.apache.impala.analysis.ParserTest.TestGetErrorMsg(ParserTest.java:3475)
> >
> > Will need to look into this one.
>
> PS9 fixes the ParserTest issue..just had to update the list of expected 
> keywords in the test after a WHERE clause.  Not yet sure about the other OOM 
> failure since I cannot repro locally.

One difference between my machine's execution vs the one on Jenkins is that the 
Jenkins one runs it under a dockerised container and the JVM Max heap is 1.78GB 
which is smaller than the 2GB on my machine.

>From ubuntu-16.04-dockerised-tests-3539/logs/ee_tests/impalad_coord_exec-0:
impalad.4069f4f171ef.ubuntu.log.INFO.20201126-001441.1:  JVM: max heap size: 
Total=1.78 GB
impalad.4069f4f171ef.ubuntu.log.INFO.20201126-001441.1:  JVM: non-heap 
committed: Total=114.50 MB

I looked closer at the 'GC overhead limit exceeded' issue with the 
test_statement_expression_limit() test and I suspect it happens because in my 
original patch the default construction of the Expr object also allocates an 
empty ArrayList for hints (this was consistent with how we handle other types 
of hints).  However, this is not really necessary and it can be done on demand 
only when an actual WHERE clause hint is supplied. This particular test creates 
a huge number of Exprs and considering that the Dockerised run has a slightly 
smaller max heap, it might cause the GC to run more often without reclaiming 
sufficient free memory.

In PS 15 I have made the change to allocate the arraylist only when a predicate 
hint was supplied. Hopefully it addresses the issue.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 15
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 27 Nov 2020 18:06:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-27 Thread Aman Sinha (Code Review)
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#15).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of non-empty files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further,
if there's a WHERE clause, it must have an 'always_true'
hint in order for the optimization to be considered. For
example:
  set optimize_simple_limit = true;
  SELECT * FROM T
WHERE /* +always_true */ 
  LIMIT 10;

If there are too many empty files in the partitions, it is
possible that the query may produce fewer rows although
those are still valid rows.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added an e2e test. Since result rows are non-deterministic,
   only simple count(*) query on top of subquery with limit
   was added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
17 files changed, 517 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/15
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 15
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 14: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6714/


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 14
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 26 Nov 2020 04:28:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 14:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6714/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 14
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Nov 2020 22:55:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 14: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 14
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Nov 2020 22:55:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7737/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 13
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Nov 2020 22:50:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-25 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 13: Code-Review+2

(1 comment)

Carry forward +2

http://gerrit.cloudera.org:8080/#/c/16723/12/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16723/12/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@1211
PS12, Line 1211: strBuilder.append(ToSqlUtils.getPlanHintsSql(options,
> line too long (96 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 13
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Nov 2020 22:30:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-25 Thread Aman Sinha (Code Review)
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#13).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of non-empty files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further,
if there's a WHERE clause, it must have an 'always_true'
hint in order for the optimization to be considered. For
example:
  set optimize_simple_limit = true;
  SELECT * FROM T
WHERE /* +always_true */ 
  LIMIT 10;

If there are too many empty files in the partitions, it is
possible that the query may produce fewer rows although
those are still valid rows.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added an e2e test. Since result rows are non-deterministic,
   only simple count(*) query on top of subquery with limit
   was added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
17 files changed, 514 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/13
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 13
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-24 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 12: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 12
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Nov 2020 01:10:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 12:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7732/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 12
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Nov 2020 00:47:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-24 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 12:

> Uploaded patch set 12.

Additional testing found a bug that when the WHERE clause hint was within a 
view, it was not being recognized by the simple limit optimization.  The reason 
was the SQL generation was not propagating this additional hint and HMS 
definition for the view did not have the hint.
PS 12 fixes this.

I have done manual testing with such views. However, I haven't been able to add 
a test with a hint in a view. I tried creating a view with hints in the 
functional_schema_template.sql and the view was created but doing a 'DESCRIBE 
EXTENDED' shows hint was missing. This needs further exploration on the test 
infrastructure.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 12
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Nov 2020 00:33:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16723/12/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16723/12/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@1211
PS12, Line 1211: strBuilder.append(ToSqlUtils.getPlanHintsSql(options, 
whereClause_.getPredicateHints()))
line too long (96 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 12
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Nov 2020 00:27:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-24 Thread Aman Sinha (Code Review)
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#12).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of non-empty files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further,
if there's a WHERE clause, it must have an 'always_true'
hint in order for the optimization to be considered. For
example:
  set optimize_simple_limit = true;
  SELECT * FROM T
WHERE /* +always_true */ 
  LIMIT 10;

If there are too many empty files in the partitions, it is
possible that the query may produce fewer rows although
those are still valid rows.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added an e2e test. Since result rows are non-deterministic,
   only simple count(*) query on top of subquery with limit
   was added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
17 files changed, 514 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/12
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 12
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-23 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 11:

> Patch Set 11:
>
> (4 comments)
>
> > Patch Set 11:
> >
> > (4 comments)
> >
> > Just a couple corner cases I have run into; given this is an opt-in 
> > optimization now it might not be incorrect to ignore these.
> >
> > I think it's good to think about the case where this optimization helps and 
> > not risk an incorrect limit in other cases. Where this helps most.
> > a. lots of files
> > b. small limits
> >
> > a) the scan range and scheduling overhead is only slow when there are many 
> > hosts + files.
> >
> > b) for large limits maybe the bulk of query run time goes to fetching 
> > results and not the planning, but that said it may not hurt too much in 
> > this case.
>
> Thanks for the comments. I will create a follow-up JIRA to address couple of 
> these comments considering that this CR was +2 ed.
> Note that the more generalized issue of optimizing for limits is something 
> that Tim and I had some offline discussion about and he created 
> 'IMPALA-10347: Explore approaches to optimizing queries that will likely be 
> short-circuited by limits'

Created https://issues.apache.org/jira/browse/IMPALA-10353


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 11
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Nov 2020 02:37:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-23 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 11:

(4 comments)

> Patch Set 11:
>
> (4 comments)
>
> Just a couple corner cases I have run into; given this is an opt-in 
> optimization now it might not be incorrect to ignore these.
>
> I think it's good to think about the case where this optimization helps and 
> not risk an incorrect limit in other cases. Where this helps most.
> a. lots of files
> b. small limits
>
> a) the scan range and scheduling overhead is only slow when there are many 
> hosts + files.
>
> b) for large limits maybe the bulk of query run time goes to fetching results 
> and not the planning, but that said it may not hurt too much in this case.

Thanks for the comments. I will create a follow-up JIRA to address couple of 
these comments considering that this CR was +2 ed.
Note that the more generalized issue of optimizing for limits is something that 
Tim and I had some offline discussion about and he created 'IMPALA-10347: 
Explore approaches to optimizing queries that will likely be short-circuited by 
limits'

http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@854
PS11, Line 854:   FileSystem partitionFs;
> I'd consider only doing this optimization for "record oriented" or "splitta
It simplifies the logic and our internal testing at least if we could apply it 
across the board..perhaps for text format we have an allowance that 10% more 
files be considered to accommodate 'invalid' files .. will that be acceptable ?


http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@867
PS11, Line 867:   for (FileDescriptor fd : 
partition.getFileDescriptors()) {
> Also maybe a threshold on the number of scan ranges where we wouldn't bothe
This is related to your other comment below about bailing out under certain 
conditions. I'll try to run the benchmark .. that's a good idea.


http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@873
PS11, Line 873: simpleLimitNumRows++;  // conservatively estimate 1 
row per file
> The flip side of this might be for simple limits that are relatively large
One complication is that the total number of files is not known up front 
(unless we aggregate it up front). We are pruning at 2 levels: once in the 
HdfsPartitionPruner where we limit the number of partitions and then here where 
we limit the number of files per partition.  In both places, as each partition 
is processed, we look at the # files but don't know the total.  We could decide 
to do the aggregation of the num files by making a separate pass over all 
partitions during HdfsPartitionPruner but that will add some overhead.


http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@1121
PS11, Line 1121:
> For Hive ACID and deleted records would this logic still work? Might be a h
For ACID tables with deleted rows, the planner will internally create an Hash 
Anti Join to handle the not-in, so yeah the limit should not be applied in such 
cases because it is no longer a simple scan.  I will create a separate JIRA to 
handle that case since additional testing and code changes would be needed.
Thanks for raising this.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 11
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Nov 2020 02:06:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-23 Thread Shant Hovsepian (Code Review)
Shant Hovsepian has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 11:

(4 comments)

Just a couple corner cases I have run into; given this is an opt-in 
optimization now it might not be incorrect to ignore these.

I think it's good to think about the case where this optimization helps and not 
risk an incorrect limit in other cases. Where this helps most.
a. lots of files
b. small limits

a) the scan range and scheduling overhead is only slow when there are many 
hosts + files.

b) for large limits maybe the bulk of query run time goes to fetching results 
and not the planning, but that said it may not hurt too much in this case.

http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@854
PS11, Line 854:   FileSystem partitionFs;
I'd consider only doing this optimization for "record oriented" or "splittable" 
file formats, like parquet/avro. For text tables it's not uncommon for parsing 
or record boundary issues to cause an entire file to be invalid.


http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@867
PS11, Line 867:   for (FileDescriptor fd : 
partition.getFileDescriptors()) {
Also maybe a threshold on the number of scan ranges where we wouldn't bother 
with the optimization, be/src/benchmarks/scheduler-benchmark is a good test to 
run, if I recall tens of nodes and hundreds of files isn't too slow. Just 
thinking that the cases where this optimization assumption is potentially 
problematic are when there are a few malformed files better to err on the side 
of caution then.


http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@873
PS11, Line 873: simpleLimitNumRows++;  // conservatively estimate 1 
row per file
The flip side of this might be for simple limits that are relatively large or 
larger than the number of files a maximum bail out threshold might make sense. 
I.e. if the limit is 10,000 then no sense in doing this optimization unless the 
number of files is much greater than 10,000?


http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@1121
PS11, Line 1121:
For Hive ACID and deleted records would this logic still work? Might be a 
helpful test.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 11
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 23 Nov 2020 21:32:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 11: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6690/


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 11
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 23 Nov 2020 00:02:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 11: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 11
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sun, 22 Nov 2020 18:32:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6690/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 11
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sun, 22 Nov 2020 18:32:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 10: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6689/


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 10
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Nov 2020 11:16:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6689/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 10
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Nov 2020 05:46:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 10: Code-Review+2

Rebased on latest master.  Carry forward +2 .


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 10
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Nov 2020 05:36:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7711/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 9
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Nov 2020 04:17:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 9:

> Patch Set 8:
>
> > Patch Set 8:
> >
> > > Patch Set 8:
> > >
> > > > Patch Set 8: Verified-1
> > > >
> > > > Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/
> > >
> > > There's 1 failure:
> > > [gw1] FAILED 
> > > query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit
> > >
> > > However, on my desktop I ran query_test/test_exprs.py and all tests under 
> > > it passed.
> >
> > On Jenkins this test hit an OOM GC overhead limit exceeded:
> > gw1] linux2 -- Python 2.7.16 
> > /home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python
> > query_test/test_exprs.py:176: in test_statement_expression_limit
> > assert re.search(expected_err_re, str(err))
> > E   assert None
> > E+  where None = ('Exceeded the 
> > statement expression limit \\(25\\)\nStatement has .* expressions.', 
> > "ImpalaBeeswaxException:\n INNER EXCEPTION:  > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC 
> > overhead limit exceeded\n")
> > E+where  = re.search
> > E+and   "ImpalaBeeswaxException:\n INNER EXCEPTION:  > 'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC 
> > overhead limit exceeded\n" = str(ImpalaBeeswaxException())
>
> Oh wait, there's actually another failure in FE:
>
> 23:47:52 [INFO] Running org.apache.impala.analysis.ParserTest
> 23:47:52 [ERROR] Tests run: 98, Failures: 1, Errors: 0, Skipped: 0, Time 
> elapsed: 0.829 s <<< FAILURE! - in org.apache.impala.analysis.ParserTest
> 23:47:52 [ERROR] TestGetErrorMsg(org.apache.impala.analysis.ParserTest)  Time 
> elapsed: 0.005 s  <<< FAILURE!
> at 
> org.apache.impala.analysis.ParserTest.ParserError(ParserTest.java:77)
> at 
> org.apache.impala.analysis.ParserTest.TestGetErrorMsg(ParserTest.java:3475)
>
> Will need to look into this one.

PS9 fixes the ParserTest issue..just had to update the list of expected 
keywords in the test after a WHERE clause.  Not yet sure about the other OOM 
failure since I cannot repro locally.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 9
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Nov 2020 03:58:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Aman Sinha (Code Review)
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#9).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of non-empty files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further,
if there's a WHERE clause, it must have an 'always_true'
hint in order for the optimization to be considered. For
example:
  set optimize_simple_limit = true;
  SELECT * FROM T
WHERE /* +always_true */ 
  LIMIT 10;

If there are too many empty files in the partitions, it is
possible that the query may produce fewer rows although
those are still valid rows.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added an e2e test. Since result rows are non-deterministic,
   only simple count(*) query on top of subquery with limit
   was added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
17 files changed, 507 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/9
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 9
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 8:

> Patch Set 8:
>
> > Patch Set 8:
> >
> > > Patch Set 8: Verified-1
> > >
> > > Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/
> >
> > There's 1 failure:
> > [gw1] FAILED 
> > query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit
> >
> > However, on my desktop I ran query_test/test_exprs.py and all tests under 
> > it passed.
>
> On Jenkins this test hit an OOM GC overhead limit exceeded:
> gw1] linux2 -- Python 2.7.16 
> /home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python
> query_test/test_exprs.py:176: in test_statement_expression_limit
> assert re.search(expected_err_re, str(err))
> E   assert None
> E+  where None = ('Exceeded the 
> statement expression limit \\(25\\)\nStatement has .* expressions.', 
> "ImpalaBeeswaxException:\n INNER EXCEPTION:  'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC overhead 
> limit exceeded\n")
> E+where  = re.search
> E+and   "ImpalaBeeswaxException:\n INNER EXCEPTION:  'beeswaxd.ttypes.BeeswaxException'>\n MESSAGE: OutOfMemoryError: GC overhead 
> limit exceeded\n" = str(ImpalaBeeswaxException())

Oh wait, there's actually another failure in FE:

23:47:52 [INFO] Running org.apache.impala.analysis.ParserTest
23:47:52 [ERROR] Tests run: 98, Failures: 1, Errors: 0, Skipped: 0, Time 
elapsed: 0.829 s <<< FAILURE! - in org.apache.impala.analysis.ParserTest
23:47:52 [ERROR] TestGetErrorMsg(org.apache.impala.analysis.ParserTest)  Time 
elapsed: 0.005 s  <<< FAILURE!
at org.apache.impala.analysis.ParserTest.ParserError(ParserTest.java:77)
at 
org.apache.impala.analysis.ParserTest.TestGetErrorMsg(ParserTest.java:3475)

Will need to look into this one.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 8
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Nov 2020 03:24:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 8:

> Patch Set 8:
>
> > Patch Set 8: Verified-1
> >
> > Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/
>
> There's 1 failure:
> [gw1] FAILED 
> query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit
>
> However, on my desktop I ran query_test/test_exprs.py and all tests under it 
> passed.

On Jenkins this test hit an OOM GC overhead limit exceeded:
gw1] linux2 -- Python 2.7.16 
/home/ubuntu/Impala/bin/../infra/python/env-gcc7.5.0/bin/python
query_test/test_exprs.py:176: in test_statement_expression_limit
assert re.search(expected_err_re, str(err))
E   assert None
E+  where None = ('Exceeded the 
statement expression limit \\(25\\)\nStatement has .* expressions.', 
"ImpalaBeeswaxException:\n INNER EXCEPTION: \n MESSAGE: OutOfMemoryError: GC overhead 
limit exceeded\n")
E+where  = re.search
E+and   "ImpalaBeeswaxException:\n INNER EXCEPTION: \n MESSAGE: OutOfMemoryError: GC overhead 
limit exceeded\n" = str(ImpalaBeeswaxException())


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 8
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Nov 2020 03:14:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 8:

> Patch Set 8: Verified-1
>
> Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/

There's 1 failure:
[gw1] FAILED 
query_test/test_exprs.py::TestExprLimits::test_statement_expression_limit

However, on my desktop I ran query_test/test_exprs.py and all tests under it 
passed.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 8
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Nov 2020 03:09:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 8: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 8
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Nov 2020 02:48:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6687/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 8
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Nov 2020 21:20:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 7: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 7
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Nov 2020 21:20:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 8: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 8
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Nov 2020 21:20:30 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7703/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 7
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Nov 2020 19:00:24 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7702/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 6
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Nov 2020 18:55:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/cup/sql-parser.cup@3115
PS5, Line 3115: // clause. An attempt was made to set this for individual exprs
> I guess this is a bit limiting in the it applies only to the whole where cl
That was my original attempt too..setting the hint at the predicate level but 
when I was running into shift/reduce conflicts.  I gave it another try today 
with a few variations and still could not get it to work.  I have left a 
comment in the code about this.

For reference, here's one change I tried:
expr ::=
  non_pred_expr:e
  {: RESULT = e; :}
  | opt_plan_hints:pred_hints predicate:p
  {:
p.setPredicateHints(pred_hints);
RESULT = p;
  :};

 This generated quite a few conflicts..example:
Warning : *** Shift/Reduce conflict found in state #253
  between opt_plan_hints ::= (*)
  and case_expr ::= (*) KW_CASE expr case_when_clause_list case_else_clause 
KW_END
  and case_expr ::= (*) KW_CASE case_when_clause_list case_else_clause 
KW_END
  under symbol KW_CASE
  Resolved in favor of shifting.

Warning : *** Shift/Reduce conflict found in state #253
  between opt_plan_hints ::= (*)
  and cast_expr ::= (*) KW_CAST LPAREN expr KW_AS type_def cast_format_val 
RPAREN
  under symbol KW_CAST
  Resolved in favor of shifting.
..
...


http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/analysis/Predicate.java
File fe/src/main/java/org/apache/impala/analysis/Predicate.java:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/analysis/Predicate.java@30
PS5, Line 30:  {
> maybe hasAlwaysTrueHint_ just to make it crystal-clear that it's not actual
Changed this and the names of the setter/getter also.


http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@869
PS5, Line 869: if ((fsHasBlocks && fd.getNumFileBlocks() == 0)
> nit: use braces for multi-line if
Done


http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870
PS5, Line 870: d.getFileLength() <
> We already had to deal with a similar issue here
I will follow up with the doc writer on this.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 7
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Nov 2020 18:47:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Aman Sinha (Code Review)
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#7).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of non-empty files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further,
if there's a WHERE clause, it must have an 'always_true'
hint in order for the optimization to be considered. For
example:
  set optimize_simple_limit = true;
  SELECT * FROM T
WHERE /* +always_true */ 
  LIMIT 10;

If there are too many empty files in the partitions, it is
possible that the query may produce fewer rows although
those are still valid rows.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added an e2e test. Since result rows are non-deterministic,
   only simple count(*) query on top of subquery with limit
   was added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
16 files changed, 505 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/7
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 7
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-20 Thread Aman Sinha (Code Review)
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#6).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of non-empty files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further,
if there's a WHERE clause, it must have an 'always_true'
hint in order for the optimization to be considered. For
example:
  set optimize_simple_limit = true;
  SELECT * FROM T
WHERE /* +always_true */ 
  LIMIT 10;

If there are too many empty files in the partitions, it is
possible that the query may produce fewer rows although
those are still valid rows.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added an e2e test. Since result rows are non-deterministic,
   only simple count(*) query on top of subquery with limit
   was added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
16 files changed, 505 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/6
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 6
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/cup/sql-parser.cup@3115
PS5, Line 3115:   KW_WHERE opt_plan_hints:pred_hints expr:e
I guess this is a bit limiting in the it applies only to the whole where 
clause. Should it be part of the expr production below so it can be attached to 
any expression?

I don't think this affects the functionality of this patch, since we're only 
checking the top-level statement anyway, but it seems like itwould me more 
elegant to have the expr hint be associated with the expr in the parser?

If there are complications with that, maybe a comment here explaining the 
limitation would be sufficient.


http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/analysis/Predicate.java
File fe/src/main/java/org/apache/impala/analysis/Predicate.java:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/analysis/Predicate.java@30
PS5, Line 30: isAlwaysTrue_
maybe hasAlwaysTrueHint_ just to make it crystal-clear that it's not actually a 
guarantee?


http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@869
PS5, Line 869: if (fsHasBlocks && fd.getNumFileBlocks() == 0) 
continue;
nit: use braces for multi-line if


http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870
PS5, Line 870: fd.getFileLength()
> Yes. Totally agree. We probably can live with the 0-row data files through
We already had to deal with a similar issue here
https://impala.apache.org/docs/build/html/topics/impala_optimize_partition_key_scans.html
 ; we should document similarly

Generally it doesn't make any sense to write files with 0 rows and it should be 
rare. Our experience is that some misbehaving tools can generate 0 row files 
(we've seen Spark do it with issues like 
https://issues.apache.org/jira/browse/SPARK-10216).

You're right that the files are non-empty because they have the footer with the 
schema. I don't think there's an upper bound on the size either though, cause 
they could have an arbitrarily complex scheme.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Nov 2020 01:29:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-19 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 5: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870
PS5, Line 870: fd.getFileLength()
> The getFileLength() check is used in other places too..so I borrowed that f
Yes. Totally agree. We probably can live with the 0-row data files through 
documentation.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 19 Nov 2020 17:52:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-19 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870
PS5, Line 870: fd.getFileLength()
> Is it possible for a Parquet data file with empty # of rows pass this test?
The getFileLength() check is used in other places too..so I borrowed that from 
generateScanRangeSpec().  For self-describing schema file formats like Parquet, 
 yes it is possible for the length to be non-zero and  num_rows zero. I think 
that handling such cases will need some major rework of this 
computeScanRangeLocation() method since right now it is agnostic to the file 
format (it does care about file system type but not so much the formats).  
Further,  I believe other changes will be needed in the metadata catalog layer 
to ensure this FileMetaData is plumbed through although I haven't looked 
closely into that. The trade-off is the size of the metadata in catalog cache 
could blow up for large number of files and we are already run into significant 
memory issues.


http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@947
PS5, Line 947: if (isSimpleLimit && simpleLimitNumRows ==
 :   analyzer.getSimpleLimitStatus().second) {
 : // for the simple limit case if the estimated rows has 
already reached the limit
 : // there's no need to process more partitions
 : break;
 :   }
> This is good.
Ack



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 19 Nov 2020 16:51:29 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-19 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 5:

(2 comments)

Looks very good!

The empty parquet data file may be a corner case to worry about.

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870
PS5, Line 870: fd.getFileLength()
Is it possible for a Parquet data file with empty # of rows pass this test? 
Note that due to the meta-data portion, such a file will have some number of 
bytes.  See one case here https://github.com/G-Research/ParquetSharp/issues/110.

If we can look at the meta-data of such a file, the number of rows is right 
there.

871 struct FileMetaData { 
872   /** Version of this file **/
873   1: required i32 version  
874   
875   /** Parquet schema for this file.  This schema contains metadata for all 
the columns.
876* The schema is represented as a tree with a single root.  The nodes of 
the tree
877* are flattened to a list by doing a depth-first traversal.
878* The column metadata contains the path in the schema for that column 
which can be
879* used to map columns to nodes in the schema.
 
880* The first element is the root **/  
881   2: required list schema;  
882   
883   /** Number of rows in this file **/   
884   3: required i64 num_rows


http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@947
PS5, Line 947: if (isSimpleLimit && simpleLimitNumRows ==
 :   analyzer.getSimpleLimitStatus().second) {
 : // for the simple limit case if the estimated rows has 
already reached the limit
 : // there's no need to process more partitions
 : break;
 :   }
This is good.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 19 Nov 2020 14:30:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7683/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 19 Nov 2020 02:23:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-18 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test:

http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@241
PS4, Line 241:  limit 1
> Yeah, that is a tricky issue.
I went with a simple approach of skipping the empty files when we are iterating 
over the file descriptors and only increment the count for the non-empty files. 
The determination of empty vs non-empty also depends on the file system 
type..so there's a check for that.  PS 5 has the latest changes.   I also 
modified the commit message accordingly.

I haven't added new tests for this since I am not quite sure if we have a data 
set that has such mixed empty and non-empty files per partition.  If you guys 
think it is important to test that, I am open to adding one.



-- 
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 19 Nov 2020 02:17:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-18 Thread Aman Sinha (Code Review)
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#5).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of non-empty files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further,
if there's a WHERE clause, it must have an 'always_true'
hint in order for the optimization to be considered. For
example:
  set optimize_simple_limit = true;
  SELECT * FROM T
WHERE /* +always_true */ 
  LIMIT 10;

If there are too many empty files in the partitions, it is
possible that the query may produce fewer rows although
those are still valid rows.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added an e2e test. Since result rows are non-deterministic,
   only simple count(*) query on top of subquery with limit
   was added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
16 files changed, 500 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/5
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-18 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test:

http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@241
PS4, Line 241:  limit 1
> In HdfsScanNode.computeScanRangeLocation(), we skip computing the scan rang
Yeah, that is a tricky issue.

If we take the optimistic approach, we could add a new step in (2) when empty 
files are found. That is, in this new step we come up with extra up to  
non-empty files for a total of  empty files found in the pruned list.

Another approach would be abandon the pruned list and go with the full list.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 4
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 19 Nov 2020 01:42:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-18 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test:

http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@241
PS4, Line 241:  limit 1
> This makes me feel we should skip those files with 0 rows during pruning. I
In HdfsScanNode.computeScanRangeLocation(), we skip computing the scan range if 
file is empty:
  Line 912 on master:
  // Skips files that have no associated blocks.
  if (fileDesc.getNumFileBlocks() == 0) continue;
However, we populate the totalFilesPerFs_  treemap  earlier .. on line 891 and 
that's the one that gets used to display the EXPLAIN string. So, yeah there's 
some inconsistency in the display (although it is possible it is intentional to 
show all files including empty ones in the explain).

For my patch, there are 2 steps in which the pruning happens: (1) in  
HdfsPartitionPruner when I am limiting the number of partitions based only the 
number of file descriptors per partition - i.e not examining each file 
descriptor since that would have overhead,  and (2) in HdfsScanNode I am 
limiting the number of files since that code already iterates over the file 
descriptors.   I guess I could skip empty files in step 2 even though it would 
mess up the calculation that was done in step 1.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 4
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 18 Nov 2020 23:14:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-18 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 4:

(1 comment)

Looks nice!

In addition to the empty file concern, I wonder if in the explain output, one 
can clearly see the application of this optimization, other than checking out 
the files scanned vs the total one by one. Such an indicator could be very 
useful in rule out a problem (if any) in the area quickly. Sorry I was not able 
to see it in the code.

http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test:

http://gerrit.cloudera.org:8080/#/c/16723/4/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@241
PS4, Line 241:  limit 1
This makes me feel we should skip those files with 0 rows during pruning. In my 
test for a table with textfile format, I can add empty files in the folder for 
the table and impala will process it.

Query: explain select * from table_bar
++
| Explain String
 |
++
| Max Per-Host Resource Reservation: Memory=0B Threads=2
 |
| Per-Host Resource Estimates: Memory=10MB  
 |
| WARNING: The following tables are missing relevant table and/or column 
statistics. |
| default.table_bar 
 |
|   
 |
| PLAN-ROOT SINK
 |
| | 
 |
| 01:EXCHANGE [UNPARTITIONED]   
 |
| | 
 |
| 00:SCAN HDFS [default.table_bar]  
 |
|HDFS partitions=1/1 files=1 size=0B
 |
|row-size=4B cardinality=0  
 |
++

[09:24:31 qchen@qifan-10229: parquet] sqlci -q "select * from table_bar"
Starting Impala Shell with no authentication using Python 2.7.16
Warning: live_progress only applies to interactive shell sessions, and is being 
skipped for now.
Opened TCP connection to localhost:21000
Connected to localhost:21000
Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build 
ebe72ec25f4c6daabaa27f6daddd03b887806507)
Query: select * from table_bar
Query submitted at: 2020-11-18 09:24:48 (Coordinator: http://qifan-10229:25000)
Query progress can be monitored at: 
http://qifan-10229:25000/query_plan?query_id=df40c6ecaeeb3a0e:11dd5cb7
Fetched 0 row(s) in 4.64s


drop table if exists table_bar purge;
create table if not exists table_bar (a int)
STORED AS textfile
location '/tmp/table_bar_dir';

touch empty.txt
hdfs dfs -copyFromLocal empty.txt /tmp/table_bar_dir



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 4
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 18 Nov 2020 14:43:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7668/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 4
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 18 Nov 2020 00:12:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-17 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS3, Line 209: for (FeFsPartition p : partitions) {
 : numRows += p.getNumFileDescriptors();
 : prunedPartitions.add(p);
 : if (numRows >= analyzer.getSimpleLimitStatus().second) {
 :   // here we only limit the partitions; later in 
HdfsScanNode we will
 :   // limit the file descriptors within a partition
 :   break;
 : }
> Yes, I think I like the always true hint option. It seems the new code shou
Done


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@845
PS3, Line 845: long simpleLimitNumRows = 0; // only used for the simple 
limit case
> maybe simpleLimitNumRows to make it clearer what it is? Ok to ignore.
Done


http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test:

http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@24
PS2, Line 24: # limit on both sides of a union all
> Add a case where the where clause has a subquery, too?
Added an IN subquery test that does NOT get optimized. Also, previously I had 
added an EXISTS subquery test that DOES get optimized (further below).


http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@26
PS2, Line 26: union all
> Yeah, I had concerns about this too but was debating how to allow for the W
Discussed this offline and added the always_true hint for WHERE clause.  This 
is examined in SelectStmt.checkSimpleLimitStmt().  Modified the tests to 
include this hint where applicable.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 4
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 18 Nov 2020 00:01:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-17 Thread Aman Sinha (Code Review)
Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#4).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. Further,
if there's a WHERE clause, it must have an 'always_true'
hint in order for the optimization to be considered. For
example:
  set optimize_simple_limit = true;
  SELECT * FROM T
WHERE /* +always_true */ 
  LIMIT 10;

Note that if there are empty files in the partitions, it is
possible that the query may produce fewer rows although
those are still valid rows.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added an e2e test. Since result rows are non-deterministic,
   only simple count(*) query on top of subquery with limit
   was added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
16 files changed, 487 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/4
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 4
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-17 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 3:

(4 comments)

Looks good!

http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601
PS3, Line 601: (1 row per file)
> Yeah, I had initially thought of that but the file metadata as it exists to
Done


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180
PS3, Line 180: order-by
> Not sure what you meant here.  In the presence of order-by, we cannot push
Yes, I agree order-by should not be allowed. Sorry I was not thinking when I 
type.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS3, Line 209: for (FeFsPartition p : partitions) {
 : numRows += p.getNumFileDescriptors();
 : prunedPartitions.add(p);
 : if (numRows >= analyzer.getSimpleLimitStatus().second) {
 :   // here we only limit the partitions; later in 
HdfsScanNode we will
 :   // limit the file descriptors within a partition
 :   break;
 : }
> I think on balance it'd be better to make the partition selection determini
Yes, I agree and have experienced similar situations before. I recall at the 
end we had to apply the randomness.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS3, Line 209: for (FeFsPartition p : partitions) {
 : numRows += p.getNumFileDescriptors();
 : prunedPartitions.add(p);
 : if (numRows >= analyzer.getSimpleLimitStatus().second) {
 :   // here we only limit the partitions; later in 
HdfsScanNode we will
 :   // limit the file descriptors within a partition
 :   break;
 : }
> Yes, these are the pruned partitions that will be sent over to the runtime.
Yes, I think I like the always true hint option. It seems the new code should 
check that first and then allow the WHERE clause. Customers can forget the hint 
here and there in the query and end up with incorrect results.

Wait and see on randomize the partitions to scan sounds good to me.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 17 Nov 2020 20:08:56 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-17 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS3, Line 209: for (FeFsPartition p : partitions) {
 : numRows += p.getNumFileDescriptors();
 : prunedPartitions.add(p);
 : if (numRows >= analyzer.getSimpleLimitStatus().second) {
 :   // here we only limit the partitions; later in 
HdfsScanNode we will
 :   // limit the file descriptors within a partition
 :   break;
 : }
> Yes, these are the pruned partitions that will be sent over to the runtime.
I think on balance it'd be better to make the partition selection 
deterministic, just cause it'll be easier to debug, support, etc.

Agree that it might be a good optimization for workloads that were mostly 
comprised of queries like this, but we're not really anticipating such 
workloads.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@845
PS3, Line 845: long numRows = 0; // only used for the simple limit case
maybe simpleLimitNumRows to make it clearer what it is? Ok to ignore.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 17 Nov 2020 19:50:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-17 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601
PS3, Line 601: (1 row per file)
> Sounds like for parquet/orc, we can take a sample of few data files and get
Yeah, I had initially thought of that but the file metadata as it exists today  
see [1] is not customized for Parquet or ORC so it does not keep track of the 
rowcount per file (we do have  numRows_ per partition but that's an estimate 
obtained from hive metastore). We could certainly do more aggressive pruning 
during planning time if we leverage that.  This could be a future enhancement 
(maybe there's an existing jira.. I will check).

[1] 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L103


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180
PS3, Line 180: order-by
> Sounds like we should allow the order by clause since it does not increase/
Not sure what you meant here.  In the presence of order-by, we cannot push the 
limit to the scan anyways.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS3, Line 209: for (FeFsPartition p : partitions) {
 : numRows += p.getNumFileDescriptors();
 : prunedPartitions.add(p);
 : if (numRows >= analyzer.getSimpleLimitStatus().second) {
 :   // here we only limit the partitions; later in 
HdfsScanNode we will
 :   // limit the file descriptors within a partition
 :   break;
 : }
> I wonder if the prunnedPartitions returned here are the only ones to be sca
Yes, these are the pruned partitions that will be sent over to the runtime. 
Regarding the WHERE clause,  Tim had raised similar concern and we had some 
offline discussion about how to safely allow predicates.  The reason for 
allowing at least some type of predicate is that for various reasons the user 
may be doing a limit on top of a view that has a predicate that is always true. 
 For example,  https://issues.apache.org/jira/browse/IMPALA-10064 describes one 
such use case we have seen.

Tim had a suggestion for using a hint for the where clause (where /* 
+always_true */) and I agree that it is a safe option since it puts the onus on 
the user.  I will update the patch with those changes.

For randomly picking a subset of partitions..hmm..my initial thought is it may 
be over-doing the optimization. Let's see the adoption of this functionality in 
a concurrent workload scenario..note that the optimize_simple_limit is disabled 
by default.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 17 Nov 2020 19:41:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-17 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 3:

(4 comments)

Like it!

http://gerrit.cloudera.org:8080/#/c/16723/3/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/16723/3/be/src/service/query-options.h@226
PS3, Line 226: OPTIMIZE_SIMPLE_LIMIT
OPTIMIZE_SIMPLE_LIMIT_QUERY probably is more specific.


http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601
PS3, Line 601: (1 row per file)
Sounds like for parquet/orc, we can take a sample of few data files and get the 
number of rows from the meta-data section for each. This probably can help 
speed up the pruning process.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180
PS3, Line 180: order-by
Sounds like we should allow the order by clause since it does not 
increase/decrease # of rows.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS3, Line 209: for (FeFsPartition p : partitions) {
 : numRows += p.getNumFileDescriptors();
 : prunedPartitions.add(p);
 : if (numRows >= analyzer.getSimpleLimitStatus().second) {
 :   // here we only limit the partitions; later in 
HdfsScanNode we will
 :   // limit the file descriptors within a partition
 :   break;
 : }
I wonder if the prunnedPartitions returned here are the only ones to be scanned 
during run-time. If so, I think for this optimization to work, we should not 
allow any WHERE clause.

The other point is that we may consider randomly pick a small subset of 
partitions to reduce the chance of contention from multiple small limit queries.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 17 Nov 2020 17:35:46 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7657/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 17 Nov 2020 00:22:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-16 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 3:

> Patch Set 2:
>
> (1 comment)

I addressed the refactoring comment and another minor comment.  For the 
handling of WHERE predicates, I will do an update based on additional offline 
discussion.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 17 Nov 2020 00:08:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-16 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@184
PS2, Line 184:   // Tracks the simple LIMIT status of this query block. First 
item of the
> Can you explain what the two values are?
Done (added explanation in the code)


http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1560
PS2, Line 1560:   scanNode.init(analyzer);
> It seems like this would most naturally fit as part of HdfsPartition prunin
Refactored into a separate function in the HdfsPartitionPruner and invoked it 
from the pruner itself at the final stage of pruning.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 17 Nov 2020 00:03:53 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
File fe/src/main/java/org/apache/impala/analysis/PartitionSet.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/PartitionSet.java@89
PS3, Line 89:   partitions_ = pruner.prunePartitions(analyzer, 
transformedConjuncts, true, null).first;
line too long (93 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 17 Nov 2020 00:01:14 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-16 Thread Aman Sinha (Code Review)
Hello Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#3).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. If enabled,
in certain cases the query may produce fewer rows (due to
filter conditions or presence of empty files) although
those rows are valid rows that would have been present if
the optimization was disabled.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added e2e tests. Since result rows are non-deterministic,
   only simple count(*) queries on top of subqueries with limit
   were added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/PartitionSet.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
13 files changed, 365 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/3
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-16 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test:

http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@26
PS2, Line 26: where bigint_col < 100
> Trying to apply this with non-partition predicates doesn't seem safe.
Yeah, I had concerns about this too but was debating how to allow for the WHERE 
clause of the type supported in IMPALA-10064 which involve a view containing a 
predicate.  I can make the algorithm more restrictive:
 1.  only allow BinaryPredicate and one side of it must be a partitioning 
column.
 2. don't allow subqueries in the predicate as it ends up getting transformed 
to joins



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 16 Nov 2020 18:15:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-16 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 2:

(4 comments)

I have some serious concerns about the behaviour with predicates, not 
necessarily that it wouldn't be useful in some circumstances, but that it's 
changing the meaning of the LIMIT clause too much, even considering that it's 
under a query option.

Otherwise I had more minor comments on the rest of the code.

http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@184
PS2, Line 184:   // Tracks the simple LIMIT status of this query block
Can you explain what the two values are?


http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16723/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1560
PS2, Line 1560:   // check eligibility of simple limit optimization:
It seems like this would most naturally fit as part of HdfsPartition pruning, 
which is invoked above. Would be good to move this logic into a separate 
function regardless, but it seems like it could fit in that step.


http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test:

http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@24
PS2, Line 24: # limit with a WHERE clause
Add a case where the where clause has a subquery, too?


http://gerrit.cloudera.org:8080/#/c/16723/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@26
PS2, Line 26: where bigint_col < 100
Trying to apply this with non-partition predicates doesn't seem safe.

I get there's a use case for limiting the input data scanned even when it 
changes results but this seems very different and surprising behavior for the 
LIMIT clause compared to the other optimisation.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 16 Nov 2020 17:26:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7653/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 16 Nov 2020 04:53:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-15 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@847
PS1, Line 847: (analyzer.getQueryCtx().client_request.getQuery_options()
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1328
PS1, Line 1328:   inlineViewRef.getAnalyzer().setSimpleLimitStatus(new 
Pair<>(new Boolean(true),
> line too long (106 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1564
PS1, Line 1564:   if (hdfsTblRef.getSampleParams() == null
> line too long (95 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1581
PS1, Line 1581:   }
> line too long (96 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1584
PS1, Line 1584: finalPartitions, hdfsTblRef, aggInfo, pair.second, 
isPartitionKeyScan);
> line too long (91 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 16 Nov 2020 04:33:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-15 Thread Aman Sinha (Code Review)
Hello Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16723

to look at the new patch set (#2).

Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations() which, according
to query profiling, has been the main contributor to the
planning time especially for large number of partitions.
Further, within each partition, we only consider the number
of files that brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization. If enabled,
in certain cases the query may produce fewer rows (due to
filter conditions or presence of empty files) although
those rows are valid rows that would have been present if
the optimization was disabled.

Testing:
 - Added planner tests for the optimization
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - Added e2e tests. Since result rows are non-deterministic,
   only simple count(*) queries on top of subqueries with limit
   were added.

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
11 files changed, 348 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/2
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7652/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 1
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sun, 15 Nov 2020 02:06:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
..


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@847
PS1, Line 847: 
(analyzer.getQueryCtx().client_request.getQuery_options().isOptimize_simple_limit()
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1328
PS1, Line 1328:   inlineViewRef.getAnalyzer().setSimpleLimitStatus(new 
Pair<>(new Boolean(true), outerStatus.second));
line too long (106 > 90)


http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1564
PS1, Line 1564:   && 
analyzer.getQueryCtx().client_request.getQuery_options().isOptimize_simple_limit()
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1581
PS1, Line 1581: scanNode = new HdfsScanNode(ctx_.getNextNodeId(), 
tupleDesc, conjuncts, finalPartitions,
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/16723/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1584
PS1, Line 1584: scanNode = new HdfsScanNode(ctx_.getNextNodeId(), 
tupleDesc, conjuncts, partitions,
line too long (91 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 1
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Sun, 15 Nov 2020 01:45:48 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

2020-11-14 Thread Aman Sinha (Code Review)
Aman Sinha has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16723


Change subject: IMPALA-10314: Optimize planning time for simple limits
..

IMPALA-10314: Optimize planning time for simple limits

This patch optimizes the planning time for simple limit
queries by only considering a minimal set of partitions
whose file descriptors add up to N (the specified limit).
Each file is conservatively estimated to contain 1 row.

This reduces the number of partitions processed by
HdfsScanNode.computeScanRangeLocations(). Further, within
each partition, we only consider the number of files that
brings the total to N.

This is an opt-in optimization. A new planner option
OPTIMIZE_SIMPLE_LIMIT enables this optimization.

Testing:
 - Added new planner tests that enable this optimization
 - Ran PlannerTest
 - Ran query_test.py tests by enabling the optimize_simple_limit
 - TODO: add e2e tests (I haven't added them yet since the
   results are non-determinstic. Only simple count(*) queries
   on top of subqueries with limit could probably be added)

Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
10 files changed, 302 insertions(+), 5 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/16723/1
--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 1
Gerrit-Owner: Aman Sinha