from:"Tim Armstrong \(Code Review\)"

[Impala-ASF-CR] IMPALA-7556: Decouple BufferManagement from the ScanRange and IoMgr

2021-05-28 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17413 )

Change subject: IMPALA-7556: Decouple BufferManagement from the ScanRange and 
IoMgr
..


Patch Set 9:

Glad to see someone new getting their hands dirty in this code, it makes sense 
to make ScanRange responsible for fewer things...


--
To view, visit http://gerrit.cloudera.org:8080/17413
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibd74691b50b46114f95a8641034c05d07ddeec97
Gerrit-Change-Number: 17413
Gerrit-PatchSet: 9
Gerrit-Owner: Amogh Margoor 
Gerrit-Reviewer: Amogh Margoor 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 29 May 2021 06:18:39 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10503: testdata load hits hive memory limit errors during hive inserts

2021-02-12 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17061 )

Change subject: IMPALA-10503: testdata load hits hive memory limit errors 
during hive inserts
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/17061
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idac5f054e814070b983f7f57aef4ea9d54252bb2
Gerrit-Change-Number: 17061
Gerrit-PatchSet: 2
Gerrit-Owner: Kurt Deschler 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 12 Feb 2021 23:24:40 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9745: Propagate source type when doing constant propagation

2021-02-12 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17064 )

Change subject: IMPALA-9745: Propagate source type when doing constant 
propagation
..


Patch Set 3: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/17064
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3853478945229440f733c256ea225639f9178ff
Gerrit-Change-Number: 17064
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 12 Feb 2021 22:33:49 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1

2021-02-12 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16318 )

Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
..


Patch Set 35:

(4 comments)

I think we should merge this once a few small issues are fixed. After merging I 
would like to see more stress and perf testing before we consider it 
produciton-ready, but I think at this point the code is best in master.

http://gerrit.cloudera.org:8080/#/c/16318/35/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/35/be/src/runtime/tmp-file-mgr.cc@1795
PS35, Line 1795: cur_write_range_->io_ctx()->Cancel();
I think this Cancel() is undesirable, because it means there are two different 
code paths that the query failure propagates by, and we may have different 
errors from them. I think passing status to write_callback() below is enough to 
cancel the query, but maybe we need to set an additional flag here to prevent 
different threads from getting blocked in ReserveLocalBufferSpace and instead 
pass the same timeout error to write_callback()


http://gerrit.cloudera.org:8080/#/c/16318/35/be/src/runtime/tmp-file-mgr.cc@1797
PS35, Line 1797: status = TMP_FILE_BUFFER_POOL_CONTEXT_CANCELLED;
As discussed, this should be a human-readable error message that explains that 
the spilling was cancelled after waiting for x seconds without getting a local 
file buffer.


http://gerrit.cloudera.org:8080/#/c/16318/35/be/src/runtime/tmp-file-mgr.cc@1941
PS35, Line 1941:   timeout = true;
It'd be clearer to just return here, instead of break then return below.


http://gerrit.cloudera.org:8080/#/c/16318/35/tests/custom_cluster/test_scratch_disk.py
File tests/custom_cluster/test_scratch_disk.py:

http://gerrit.cloudera.org:8080/#/c/16318/35/tests/custom_cluster/test_scratch_disk.py@260
PS35, Line 260:   @pytest.mark.execute_serially
Do these tests actually read back the data from remote storage? Or do they read 
from in-memory and the local files.

I.e. did you confirm that they are reading spilled data from HDFS?

If they are reading it back from in memory, maybe set 
--buffer_pool_clean_pages_limit to a lower value so that more clean pages are 
evicted



--
To view, visit http://gerrit.cloudera.org:8080/16318
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
Gerrit-Change-Number: 16318
Gerrit-PatchSet: 35
Gerrit-Owner: Yida Wu 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Yida Wu 
Gerrit-Comment-Date: Fri, 12 Feb 2021 22:33:11 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char

2021-02-12 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16909 )

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char
..


Patch Set 13:

(5 comments)

I looked at the backend change and the approach seems sane.

I think an alternative for CHAR would be to switch to storing it in as a 
variable-length slot, i.e. StringValue, but to retain the padding logic. The 
approach you choose probably requires fewer code changes, which is good for a 
legacy data type that is hopefully rarely used.

http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@7
PS13, Line 7: IMPALA-5675: Support UTF-8 Varchar and Char
One minor thing : I think we should comment the len field, e.g. in StringValue 
StringVal, ColumnType, TypeDescriptor to be clear whether it means the length 
in bytes or the length in characters.


http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@45
PS13, Line 45:
In the backend, how did you identify all the places you needed to change the 
code? I can think of a number of places where the behaviour depends on the 
CHAR/VARCHAR length

* Cast functions to CHAR/VARCHAR
* Cast functions from STRING to CHAR, where we need to identify the "true" 
length of the CHAR. Although if we switched to the truncating behaviour 
suggested in IMPALA-1652, that might be the same for both code paths.
* Scanners where we read and may need to pad or truncate according to the data 
type.
* Expression evaluation where we turn a CHAR slot into a StringVal and a 
StringVal into a CHAR slot
* Both the codegen'd and interpreted codepaths of the above.

Generally I think my concern is that we've caught all the places and that we're 
testing the code.


http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc@392
PS13, Line 392: if (!state_->utf8_mode()) {
Should we add a DCHECK in UTF-8 mode to verify that the strings returned from 
Kudu do not have > the required number of characters.


http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc
File be/src/exprs/cast-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc@218
PS13, Line 218:   // TODO: Will timestamp strings have non-ASCII characters?
I don't think they could? Custom timestamp formats could have them, but that 
doesn't apply here- you can't have a custom format in a cast.


http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc
File be/src/exprs/scalar-expr-evaluator.cc:

http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc@316
PS13, Line 316:   // TODO: consider returning reference of the StringVal in 
UTF-8 mode.
You might have to be careful here with UDFs, since they could return a len > 
the slot size.



--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 12 Feb 2021 22:08:45 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9745: Propagate source type when doing constant propagation

2021-02-12 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17064 )

Change subject: IMPALA-9745: Propagate source type when doing constant 
propagation
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17064/2/fe/src/main/java/org/apache/impala/analysis/ConstantPredicateHandler.java
File fe/src/main/java/org/apache/impala/analysis/ConstantPredicateHandler.java:

http://gerrit.cloudera.org:8080/#/c/17064/2/fe/src/main/java/org/apache/impala/analysis/ConstantPredicateHandler.java@186
PS2, Line 186: Expr expr = source.getType() != target.getType() ?
Should this use !equals() instead of !=



--
To view, visit http://gerrit.cloudera.org:8080/17064
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3853478945229440f733c256ea225639f9178ff
Gerrit-Change-Number: 17064
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 12 Feb 2021 21:33:05 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9234: Support Ranger row filtering policies

2021-02-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16976 )

Change subject: IMPALA-9234: Support Ranger row filtering policies
..


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/analysis/InlineViewRef.java
File fe/src/main/java/org/apache/impala/analysis/InlineViewRef.java:

http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/analysis/InlineViewRef.java@175
PS2, Line 175: // TODO: Column-masking/Row-filtering expressions may have 
subqueries which may
Can you file a JIRA for this? I guess it is a fairly plausible use case for 
this - I found an example here - 
https://cwiki.apache.org/confluence/display/RANGER/Row-level+filtering+and+column-masking+using+Apache+Ranger+policies+in+Apache+Hive

I think we might also have to be careful about exactly what subqueries are 
allowed. E.g. if would be weird if you had a correlated references between the 
row filter subquery and the outer query, or if you could have a relative table 
reference that referred to different tables depending on which DB it was 
executed in.

edit: found the JIRA IMPALA-10483


http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java
File 
fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java:

http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@473
PS2, Line 473: when row filter is disabled
This is inaccurate - this method doesn't check whether row filtering is enabled.


http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@476
PS2, Line 476: authorizeRowFilter
Rename this method to reflect new behaviour, e.g. throwIfRowFilteringRequired()


http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerBufferAuditHandler.java
File 
fe/src/main/java/org/apache/impala/authorization/ranger/RangerBufferAuditHandler.java:

http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerBufferAuditHandler.java@139
PS2, Line 139:   // TODO: Whether we should use lowercase or uppercase 
accessType?
How would we decide this? What does Hive do?


http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/test/java/org/apache/impala/authorization/ranger/RangerAuditLogTest.java
File 
fe/src/test/java/org/apache/impala/authorization/ranger/RangerAuditLogTest.java:

http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/test/java/org/apache/impala/authorization/ranger/RangerAuditLogTest.java@454
PS2, Line 454: String databaseName = "functional";
Add a brief comment describing the row filter policies added.


http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/test/java/org/apache/impala/authorization/ranger/RangerAuditLogTest.java@502
PS2, Line 502:   authzOk(events -> {
Can you comment why we get the row filter event in one case but not the other 
(I think it's because of the user it's applied to right).


http://gerrit.cloudera.org:8080/#/c/16976/2/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test
File 
testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test:

http://gerrit.cloudera.org:8080/#/c/16976/2/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test@3
PS2, Line 3: # Row-filtering policy keeps rows with "id % 3 = 0".
Is the expected behaviour that row filters are applied before column masks?


http://gerrit.cloudera.org:8080/#/c/16976/2/testdata/workloads/functional-query/queries/QueryTest/ranger_row_filtering.test
File 
testdata/workloads/functional-query/queries/QueryTest/ranger_row_filtering.test:

http://gerrit.cloudera.org:8080/#/c/16976/2/testdata/workloads/functional-query/queries/QueryTest/ranger_row_filtering.test@1
PS2, Line 1: 
Can you add a test where the row filter references a column not in the table. 
it should fail with an analysis exception. There's an extra edge case to check 
here where it isn't in the table, but could be a correlated reference to an 
outer query.

E.g. This is a valid query, but I think test_id = id would be an invalid row 
filter on alltypes.

  select * from jointbl
  where exists(select * from alltypes where test_id = id);



--
To view, visit http://gerrit.cloudera.org:8080/16976
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I580517be241225ca15e45686381b78890178d7cc
Gerrit-Change-Number: 16976
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Review

[Impala-ASF-CR] IMPALA-4805: Avoid hash exchange before analytic function if appropriate

2021-02-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16888 )

Change subject: IMPALA-4805: Avoid hash exchange before analytic function if 
appropriate
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16888
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icb6289d1e70cfb6bbd5b38eedb00856dbc85ac77
Gerrit-Change-Number: 16888
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 23:51:29 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README

2021-02-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/17058 )

Change subject: IMPALA-10470: Add link to quickstart container from README
..

IMPALA-10470: Add link to quickstart container from README

Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
Reviewed-on: http://gerrit.cloudera.org:8080/17058
Reviewed-by: Joe McDonnell 
Tested-by: Tim Armstrong 
---
M README.md
1 file changed, 9 insertions(+), 1 deletion(-)

Approvals:
  Joe McDonnell: Looks good to me, approved
  Tim Armstrong: Verified

--
To view, visit http://gerrit.cloudera.org:8080/17058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
Gerrit-Change-Number: 17058
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README

2021-02-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17058 )

Change subject: IMPALA-10470: Add link to quickstart container from README
..


Patch Set 1:

Manually verifying since it's only a readme change.


--
To view, visit http://gerrit.cloudera.org:8080/17058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
Gerrit-Change-Number: 17058
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 17:51:15 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README

2021-02-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17058 )

Change subject: IMPALA-10470: Add link to quickstart container from README
..


Patch Set 1: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/17058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
Gerrit-Change-Number: 17058
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 17:51:07 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 4:

Hit IMPALA-10501


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 17:50:11 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 4: Code-Review+2

fix rat exclusdes


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:24:14 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)

Hello Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17050

to look at the new patch set (#4).

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..

IMPALA-9382: part 3/3 clean up runtime profile v2 text output

Eliminated some of the noisy per-instance counters
from DEFAULT verbosity.

Testing:
* Updated impala-profile-tool test with new output
* Added new impala-profile-tool test for v2 profile.

Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
---
M be/src/util/runtime-profile.cc
M bin/rat_exclude_files.txt
M testdata/impala-profiles/README
M 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt
A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt
M tests/observability/test_profile_tool.py
8 files changed, 5,760 insertions(+), 516 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/4
--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README

2021-02-10 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17058


Change subject: IMPALA-10470: Add link to quickstart container from README
..

IMPALA-10470: Add link to quickstart container from README

Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
---
M README.md
1 file changed, 9 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/17058/1
--
To view, visit http://gerrit.cloudera.org:8080/17058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
Gerrit-Change-Number: 17058
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 2: Code-Review+2

carry


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:44 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)

Hello Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17050

to look at the new patch set (#2).

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..

IMPALA-9382: part 3/3 clean up runtime profile v2 text output

Eliminated some of the noisy per-instance counters
from DEFAULT verbosity.

Testing:
* Updated impala-profile-tool test with new output
* Added new impala-profile-tool test for v2 profile.

Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
---
M be/src/util/runtime-profile.cc
M testdata/impala-profiles/README
M 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt
A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt
M tests/observability/test_profile_tool.py
7 files changed, 5,757 insertions(+), 516 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/2
--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell

[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py
File tests/observability/test_profile_tool.py:

http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py@50
PS1, Line 50: )
> flake8: E501 line too long (91 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py@53
PS1, Line 53: )
> flake8: E501 line too long (92 > 90 characters)
Done



--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:38 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-09 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David 
Rorke, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#34).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

This interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle
the partitioned case, for which we use a std::map and a
comparator based on the partition exprs. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/exprs/slot-ref.h
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/priority-queue.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/or

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-09 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16242 )

Change subject: IMPALA-9979: part 2: partitioned top-n
..


Patch Set 33:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc
File be/src/exec/topn-node.cc:

http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@559
PS30, Line 559:   // We evict heaps starting with the heaps that were least 
effective at filtering
> I think I was partly thinking of the computation within the AnalyticEval no
Yup


http://gerrit.cloudera.org:8080/#/c/16242/33/testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test:

http://gerrit.cloudera.org:8080/#/c/16242/33/testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test@22
PS33, Line 22: limit predicate
> When browsing through the plans one more time, it occurred to me that some
I think that makes sense, it is introducing a very specific new term that might 
be confusing. I replaced it.



--
To view, visit http://gerrit.cloudera.org:8080/16242
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
Gerrit-Change-Number: 16242
Gerrit-PatchSet: 33
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 10 Feb 2021 07:21:31 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo

2021-02-09 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17030 )

Change subject: IMPALA-10469: push quickstart to apache repo
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17030/2/docker/publish_images_to_apache.sh
File docker/publish_images_to_apache.sh:

http://gerrit.cloudera.org:8080/#/c/17030/2/docker/publish_images_to_apache.sh@62
PS2, Line 62: IMAGES+=" impala_quickstart_client impala_quickstart_hms"
> Ok, I think a coherent first step is that docker-images.txt is a list of al
Good point. I did this and re-pushed the images.

Also removed the -d flag cause I don't think that was particularly useful and 
it got more complicated when we had images that didn't have debug values.



--
To view, visit http://gerrit.cloudera.org:8080/17030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a
Gerrit-Change-Number: 17030
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 09 Feb 2021 20:27:33 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo

2021-02-09 Thread Tim Armstrong (Code Review)

Hello Grant Henke, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17030

to look at the new patch set (#3).

Change subject: IMPALA-10469: push quickstart to apache repo
..

IMPALA-10469: push quickstart to apache repo

This adds a script, docker/publish_images_to_apache.sh,
that allows uploading images to the apache/impala docker hub
repo, prefixed with a version string. E.g. with the following
commands:

  ninja docker_images quickstart_docker_images
  ./docker/publish_images_to_apache.sh -v 81d5377c2

The uploaded images can then be used for the quickstart cluster,
as documented in docker/README.

Updated docs for quickstart to use a prefix from apache/impala

Remove IMPALA_QUICKSTART_VERSION, which doesn't interact well with
the tagging since the image name and version are now encoded in the
tag.

Fix an incorrect image name added to docker-images.txt:
impala_profile_tool_image.

Testing:
Ran Impala quickstart with data loading using instructions in README.

  export IMPALA_QUICKSTART_IMAGE_PREFIX="apache/impala:81d5377c2-"
  docker network create -d bridge quickstart-network
  export QUICKSTART_IP=$(docker network inspect quickstart-network -f '{{(index 
.IPAM.Config 0).Gateway}}')
  export QUICKSTART_LISTEN_ADDR=$QUICKSTART_IP

  docker-compose -f docker/quickstart.yml \
  -f docker/quickstart-kudu-minimal.yml \
  -f docker/quickstart-load-data.yml up -d

  docker run --network=quickstart-network -it \
   ${IMPALA_QUICKSTART_IMAGE_PREFIX}impala_quickstart_client
   impala-shell

Change-Id: I535d77e565b73d732ae511d7525193467086c76a
---
M docker/CMakeLists.txt
M docker/README.md
A docker/publish_images_to_apache.sh
M docker/quickstart-load-data.yml
M docker/quickstart.yml
5 files changed, 115 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/17030/3
--
To view, visit http://gerrit.cloudera.org:8080/17030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a
Gerrit-Change-Number: 17030
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-9586: update query option docs for mt dop

2021-02-09 Thread Tim Armstrong (Code Review)

Tim Armstrong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/17043 )

Change subject: IMPALA-9586: update query option docs for mt_dop
..

IMPALA-9586: update query option docs for mt_dop

There are interactions between mt_dop and num_nodes and
num_scanner_threads. Mention these in the docs.

Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd
Reviewed-on: http://gerrit.cloudera.org:8080/17043
Tested-by: Impala Public Jenkins 
Reviewed-by: Joe McDonnell 
---
M docs/topics/impala_num_nodes.xml
M docs/topics/impala_num_scanner_threads.xml
2 files changed, 9 insertions(+), 0 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Joe McDonnell: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/17043
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd
Gerrit-Change-Number: 17043
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-09 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17050


Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..

IMPALA-9382: part 3/3 clean up runtime profile v2 text output

Eliminated some of the noisy per-instance counters
from DEFAULT verbosity.

Testing:
* Updated impala-profile-tool test with new output
* Added new impala-profile-tool test for v2 profile.

Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
---
M be/src/util/runtime-profile.cc
M testdata/impala-profiles/README
M 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt
A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt
M tests/observability/test_profile_tool.py
7 files changed, 5,755 insertions(+), 516 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/1
--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-9586: update query option docs for mt dop

2021-02-08 Thread Tim Armstrong (Code Review)

Hello Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17043

to look at the new patch set (#2).

Change subject: IMPALA-9586: update query option docs for mt_dop
..

IMPALA-9586: update query option docs for mt_dop

There are interactions between mt_dop and num_nodes and
num_scanner_threads. Mention these in the docs.

Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd
---
M docs/topics/impala_num_nodes.xml
M docs/topics/impala_num_scanner_threads.xml
2 files changed, 9 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/17043/2
--
To view, visit http://gerrit.cloudera.org:8080/17043
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd
Gerrit-Change-Number: 17043
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell

[Impala-ASF-CR] IMPALA-9586: update query option docs for mt dop

2021-02-08 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17043


Change subject: IMPALA-9586: update query option docs for mt_dop
..

IMPALA-9586: update query option docs for mt_dop

There are interactions between mt_dop and num_nodes and
num_scanner_threads. Mention these in the docs.

Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd
---
M docs/topics/impala_num_nodes.xml
M docs/topics/impala_num_scanner_threads.xml
2 files changed, 9 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/17043/1
--
To view, visit http://gerrit.cloudera.org:8080/17043
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd
Gerrit-Change-Number: 17043
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-8721: re-enable test hive impala interop

2021-02-08 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17042


Change subject: IMPALA-8721: re-enable test_hive_impala_interop
..

IMPALA-8721: re-enable test_hive_impala_interop

The test now passes because HIVE-21290 was fixed.

Revert "IMPALA-8689: test_hive_impala_interop failing with "Timeout >7200s""

This reverts commit 5d8c99ce74c45a7d04f11e1f252b346d654f02bf.

Change-Id: I7e2beabd7082a45a0fc3b60d318cf698079768ff
---
M tests/custom_cluster/test_hive_parquet_codec_interop.py
1 file changed, 2 insertions(+), 4 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/17042/1
--
To view, visit http://gerrit.cloudera.org:8080/17042
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I7e2beabd7082a45a0fc3b60d318cf698079768ff
Gerrit-Change-Number: 17042
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-08 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16242 )

Change subject: IMPALA-9979: part 2: partitioned top-n
..


Patch Set 30:

(21 comments)

http://gerrit.cloudera.org:8080/#/c/16242/30//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16242/30//COMMIT_MSG@49
PS30, Line 49: The
> nit: 'This'
Done


http://gerrit.cloudera.org:8080/#/c/16242/30//COMMIT_MSG@57
PS30, Line 57: both
 : the partitioning
> nit: ..and non-partitioned case ?
Done


http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.h
File be/src/exec/topn-node.h:

http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.h@133
PS30, Line 133: partition are returned
> Would be good to add how the returned partitions are ordered among themselv
It seems worth being clear about the invariant, so added it here.


http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc
File be/src/exec/topn-node.cc:

http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@53
PS30, Line 53: the the
> nit: repetition 'the'
D'oh


http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@59
PS30, Line 59: the the
> nit:  repetition 'the'
D'oh


http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@398
PS30, Line 398: RETURN_IF_CANCELLED(state);
> I was wondering if the inner while loop (line 418) should also have a cance
The inner loop is bounded by the batch size so should be fine (the general rule 
has been to check for cancellation at O(batch size) intervals).


http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@559
PS30, Line 559:   // We evict heaps starting with the heaps that were least 
effective at filtering
> The eviction policy is interesting.  Considering that a rank()/rownum() can
Maybe I didn't quite get the point, but we should be able to start filtering as 
soon as we have N elements in a heap for the partition - after that point, each 
new row is either in the top N we've seen so far (in which case we discard a 
previous row) or it's not (in which case we discard the new row).


http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@666
PS30, Line 666:   // The key references memory in 'tuple_pool_'. Replace it 
with a rematerialized tuple.
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@279
PS30, Line 279:   public long estimateTopNMaterializedSize(long totalRows) {
> Should this estimate account for any per-partition overhead (additional met
We could probably add 100 bytes or so to account for the size of the heap, but 
I feel like it probably doesn't really affect estimates all that much and adds 
a bit of complexity.

I did realise that the name of this method is a bit misleading, so renamed it 
to reflect that it's just estimating the size of that many rows.


http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java:

http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@646
PS30, Line 646: limit on
> nit: incomplete sentence
Done


http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
File fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java:

http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@453
PS30, Line 453:   if (limit.limit > 1) {
> Maybe simplify this if-else to  planNodeLimit = Math.max(limit.limit, 1)
Done


http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@461
PS30, Line 461:   if (partitionByExprs.isEmpty()) {
> There was a recent issue with constant partition-by and order-by exprs (htt
I modified this to check for whether they are all literals and added a test. 
There's a bit of a distinction between "constant", which means assumed constant 
within a query and literals that are guaranteed constant, so I used the latter.


http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@885
PS30, Line 885: /// Whether the source predicate was a simple < or <= 
inequality.
> This sounded confusing at first ..I interpreted it to mean true for < (the
Done


http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@928
PS30, Line 928: to .
> nit:  incomplete
Done


http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/DistributedPlanner

[Impala-ASF-CR] wip

2021-02-08 Thread Tim Armstrong (Code Review)

Tim Armstrong has abandoned this change. ( 
http://gerrit.cloudera.org:8080/17040 )

Change subject: wip
..


Abandoned
--
To view, visit http://gerrit.cloudera.org:8080/17040
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: If1919b896ee3e4eaef2d02cf8b1d29d7a0996632
Gerrit-Change-Number: 17040
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-08 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David 
Rorke, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#32).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

This interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle
the partitioned case, for which we use a std::map and a
comparator based on the partition exprs. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/exprs/slot-ref.h
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/priority-queue.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/or

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-08 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David 
Rorke, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#31).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

This interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle
the partitioned case, for which we use a std::map and a
comparator based on the partition exprs. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/exprs/slot-ref.h
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/priority-queue.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/or

[Impala-ASF-CR] wip

2021-02-08 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17040


Change subject: wip
..

wip

Change-Id: If1919b896ee3e4eaef2d02cf8b1d29d7a0996632
---
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M fe/src/main/java/org/apache/impala/analysis/SortInfo.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SelectNode.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test
10 files changed, 159 insertions(+), 362 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/17040/1
--
To view, visit http://gerrit.cloudera.org:8080/17040
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: If1919b896ee3e4eaef2d02cf8b1d29d7a0996632
Gerrit-Change-Number: 17040
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo

2021-02-08 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17030 )

Change subject: IMPALA-10469: push quickstart to apache repo
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17030/2/docker/publish_images_to_apache.sh
File docker/publish_images_to_apache.sh:

http://gerrit.cloudera.org:8080/#/c/17030/2/docker/publish_images_to_apache.sh@62
PS2, Line 62: IMAGES+=" impala_quickstart_client impala_quickstart_hms"
> What consumes docker-images.txt other than this? Should we just include the
I've seen some build tools (internal to Cloudera) that basically intersect the 
images in docker-images.txt with the image tags generated by the build and then 
consider those build outputs.

We don't build these quickstart images as part of the docker_images target, but 
I guess that's true of the debug images as well, so maybe it's harmless to add 
these. What do you think?



--
To view, visit http://gerrit.cloudera.org:8080/17030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a
Gerrit-Change-Number: 17030
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 09 Feb 2021 01:40:56 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo

2021-02-05 Thread Tim Armstrong (Code Review)

Hello Grant Henke, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17030

to look at the new patch set (#2).

Change subject: IMPALA-10469: push quickstart to apache repo
..

IMPALA-10469: push quickstart to apache repo

This adds a script, docker/publish_images_to_apache.sh,
that allows uploading images to the apache/impala docker hub
repo, prefixed with a version string. E.g. with the following
commands:

  ninja docker_images quickstart_docker_images
  ./docker/publish_images_to_apache.sh -v 81d5377c2

The uploaded images can then be used for the quickstart cluster,
as documented in docker/README.

Updated docs for quickstart to use a prefix from apache/impala

Remove IMPALA_QUICKSTART_VERSION, which doesn't interact well with
the tagging since the image name and version are now encoded in the
tag.

Fix an incorrect image name added to docker-images.txt:
impala_profile_tool_image.

Testing:
Ran Impala quickstart with data loading using instructions in README.

  export IMPALA_QUICKSTART_IMAGE_PREFIX="apache/impala:81d5377c2-"
  docker network create -d bridge quickstart-network
  export QUICKSTART_IP=$(docker network inspect quickstart-network -f '{{(index 
.IPAM.Config 0).Gateway}}')
  export QUICKSTART_LISTEN_ADDR=$QUICKSTART_IP

  docker-compose -f docker/quickstart.yml \
  -f docker/quickstart-kudu-minimal.yml \
  -f docker/quickstart-load-data.yml up -d

  docker run --network=quickstart-network -it \
   ${IMPALA_QUICKSTART_IMAGE_PREFIX}impala_quickstart_client
   impala-shell

Change-Id: I535d77e565b73d732ae511d7525193467086c76a
---
M docker/CMakeLists.txt
M docker/README.md
A docker/publish_images_to_apache.sh
M docker/quickstart-load-data.yml
M docker/quickstart.yml
5 files changed, 111 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/17030/2
--
To view, visit http://gerrit.cloudera.org:8080/17030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a
Gerrit-Change-Number: 17030
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell

[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo

2021-02-05 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17030


Change subject: IMPALA-10469: push quickstart to apache repo
..

IMPALA-10469: push quickstart to apache repo

This adds a script, docker/publish_images_to_apache.sh,
that allows uploading images to the apache/impala docker hub
repo, prefixed with a version string. E.g. with the following
commands:

  ninja docker_images quickstart_docker_images
  ./docker/publish_images_to_apache.sh -v 81d5377c2

The uploaded images can then be used for the quickstart cluster,
as documented in docker/README.

Updated docs for quickstart to use a prefix from apache/impala

Remove IMPALA_QUICKSTART_VERSION, which doesn't interact well with
the tagging since the image name and version are now encoded in the
tag.

Fix an incorrect image name added to docker-images.txt:
impala_profile_tool_image.

Testing:
Ran Impala quickstart with data loading using instructions in README.

  export IMPALA_QUICKSTART_IMAGE_PREFIX="apache/impala:81d5377c2-"
  docker network create -d bridge quickstart-network
  export QUICKSTART_IP=$(docker network inspect quickstart-network -f '{{(index 
.IPAM.Config 0).Gateway}}')
  export QUICKSTART_LISTEN_ADDR=$QUICKSTART_IP

  docker-compose -f docker/quickstart.yml \
  -f docker/quickstart-kudu-minimal.yml \
  -f docker/quickstart-load-data.yml up -d

  docker run --network=quickstart-network -it \
   ${IMPALA_QUICKSTART_IMAGE_PREFIX}impala_quickstart_client
   impala-shell

Change-Id: I535d77e565b73d732ae511d7525193467086c76a
---
M docker/CMakeLists.txt
M docker/README.md
A docker/publish_images_to_apache.sh
M docker/quickstart-load-data.yml
M docker/quickstart.yml
5 files changed, 111 insertions(+), 12 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/17030/1
--
To view, visit http://gerrit.cloudera.org:8080/17030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a
Gerrit-Change-Number: 17030
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-10475: [DOCS] elaborate SYNC DDL option

2021-02-05 Thread Tim Armstrong (Code Review)

Tim Armstrong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/17027 )

Change subject: IMPALA-10475: [DOCS] elaborate SYNC_DDL option
..

IMPALA-10475: [DOCS] elaborate SYNC_DDL option

call out that SYNC_DDL applies to all filesystem-based tables

Change-Id: I3f1bfce8430c681515101d00cabf9d70ae52e5ec
Reviewed-on: http://gerrit.cloudera.org:8080/17027
Tested-by: Impala Public Jenkins 
Reviewed-by: Tim Armstrong 
---
M docs/topics/impala_sync_ddl.xml
1 file changed, 7 insertions(+), 7 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Tim Armstrong: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/17027
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3f1bfce8430c681515101d00cabf9d70ae52e5ec
Gerrit-Change-Number: 17027
Gerrit-PatchSet: 2
Gerrit-Owner: Shajini Thayasingh 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-10475: [DOCS] elaborate SYNC DDL option

2021-02-05 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17027 )

Change subject: IMPALA-10475: [DOCS] elaborate SYNC_DDL option
..


Patch Set 1: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/17027
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3f1bfce8430c681515101d00cabf9d70ae52e5ec
Gerrit-Change-Number: 17027
Gerrit-PatchSet: 1
Gerrit-Owner: Shajini Thayasingh 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 05 Feb 2021 22:11:32 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10389: impala-profile-tool container

2021-02-04 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17015 )

Change subject: IMPALA-10389: impala-profile-tool container
..


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt
File docker/CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt@132
PS1, Line 132:   file(WRITE ${CMAKE_SOURCE_DIR}/docker/docker-images.txt 
"${exported_image_names}")
> I'm wondering whether it is significant that the profile tool image is adde
You're right, this should go at the end. I didn't check the contents of the 
file.


http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt@172
PS1, Line 172: setup_utility_build_context.py
> I think this should be setup_build_context.py?
Done


http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt@184
PS1, Line 184: WORKING_DIRECTORY 
${IMPALA_UTILITY_BUILD_CONTEXT_DIR}/${build_type}
> Nit: indentation should line up with COMMAND
Done


http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt@186
PS1, Line 186:   #DEPENDS 
${CMAKE_SOURCE_DIR}/bin/graceful_shutdown_backends.sh
> Nit: stray line?
Missing the entrypoint script.



--
To view, visit http://gerrit.cloudera.org:8080/17015
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714
Gerrit-Change-Number: 17015
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 05 Feb 2021 00:39:36 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10389: impala-profile-tool container

2021-02-04 Thread Tim Armstrong (Code Review)

Hello Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17015

to look at the new patch set (#2).

Change subject: IMPALA-10389: impala-profile-tool container
..

IMPALA-10389: impala-profile-tool container

Add a build step for an impala-profile-tool docker image
that makes it easy to run the binary on any system.

This container is automatically built as part of the
docker build.

This sets up a new build context that doesn't pull in all of
the same dependencies or depend on the Java build

Testing:

  cat logs/cluster/profiles/* | \
docker run -i impala_profile_tool

I uploaded a build of the container to dockerhub too:

  timgarmstrong/impala_profile_tool

Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714
---
M docker/CMakeLists.txt
A docker/impala_profile_tool/Dockerfile
M docker/setup_build_context.py
A docker/utility_entrypoint.sh
4 files changed, 203 insertions(+), 46 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/17015/2
--
To view, visit http://gerrit.cloudera.org:8080/17015
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714
Gerrit-Change-Number: 17015
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-10274: Initialize impala-python as part of the CMake build

2021-02-03 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16607 )

Change subject: IMPALA-10274: Initialize impala-python as part of the CMake 
build
..


Patch Set 5: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16607
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ieff51263c55bd234028fed7101c94b4a928590f0
Gerrit-Change-Number: 16607
Gerrit-PatchSet: 5
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 04 Feb 2021 00:46:19 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-02 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David 
Rorke, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#30).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

The interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle both
the partitioning (for which we use a std::map and a
comparator based on the partition exprs). The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/exprs/slot-ref.h
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/priority-queue.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/o

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-02 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David 
Rorke, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#29).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

The interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle both
the partitioning (for which we use a std::map and a
comparator based on the partition exprs). The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/exprs/slot-ref.h
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/priority-queue.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/o

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-02 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16242 )

Change subject: IMPALA-9979: part 2: partitioned top-n
..


Patch Set 28:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/16242/28//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16242/28//COMMIT_MSG@59
PS28, Line 59: and the tie-handling
 : semantics required by rank() predicates
> nit: I think this was really implemented in your previous patch?
Done


http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.h
File be/src/exec/topn-node.h:

http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.h@64
PS28, Line 64: int64_t limit = is_partitioned() ? per_partition_limit() :
> What's the relationship between 'include_ties' and 'is_partitioned', i.e. w
tried to make it less confusing...


http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc
File be/src/exec/topn-node.cc:

http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc@244
PS28, Line 244: U
> typo
Done


http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc@399
PS28, Line 399: RETURN_IF_ERROR(QueryMaintenance(state));
> This results in two calls to QueryMaintenance() in quick succession, here a
Good point, I just moved it to GetNextUnpartitioned() so that we don't double 
up in this method.


http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc@566
PS28, Line 566: be
> typo
Done


http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc@666
PS28, Line 666: vector> rematerialized_heaps;
  :   for (auto& entry : partition_heaps_) {
  : RETURN_IF_ERROR(entry.second->RematerializeTuples(this, 
state, temp_pool.get()));
  : DCHECK(entry.second->DCheckConsistency());
  : // The key references memory in 'tuple_pool_'. Replace it 
with a rematerialized tuple.
  : rematerialized_heaps.push_back(move(entry.second));
  :   }
  :   partition_heaps_.clear();
  :   for (auto& heap_ptr : rematerialized_heaps) {
  : const Tuple* key_tuple = heap_ptr->top();
  : partition_heaps_.emplace(key_tuple, move(heap_ptr));
  :   }
> I think this can be put in an 'else' with the above 'if (heap_ != nullptr)'
Done


http://gerrit.cloudera.org:8080/#/c/16242/28/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/16242/28/common/thrift/ImpalaService.thrift@625
PS28, Line 625:   // If > 0, the rank()/row_number() pushdown into pre-analytic 
sorts is enabled
> Maybe note the default value, and briefly the issues with setting it higher
Done



--
To view, visit http://gerrit.cloudera.org:8080/16242
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
Gerrit-Change-Number: 16242
Gerrit-PatchSet: 28
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 02 Feb 2021 18:53:14 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9588: Add extra logging to cancel tests

2021-02-02 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16985 )

Change subject: IMPALA-9588: Add extra logging to cancel tests
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16985
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ied7100a9ea2e2f0611cf8e328e589b4c8e5d5100
Gerrit-Change-Number: 16985
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 02 Feb 2021 17:33:00 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10389: impala-profile-tool container

2021-02-01 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17015


Change subject: IMPALA-10389: impala-profile-tool container
..

IMPALA-10389: impala-profile-tool container

Add a build step for an impala-profile-tool docker image
that makes it easy to run the binary on any system.

This container is automatically built as part of the
docker build.

This sets up a new build context that doesn't pull in all of
the same dependencies or depend on the Java build

Testing:

  cat logs/cluster/profiles/* | \
docker run -i impala_profile_tool

I uploaded a build of the container to dockerhub too:

  timgarmstrong/impala_profile_tool

Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714
---
M docker/CMakeLists.txt
A docker/impala_profile_tool/Dockerfile
M docker/setup_build_context.py
A docker/utility_entrypoint.sh
4 files changed, 200 insertions(+), 43 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/17015/1
--
To view, visit http://gerrit.cloudera.org:8080/17015
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714
Gerrit-Change-Number: 17015
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-10454: Bump --ssl minimum version to tls1.2

2021-01-27 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16988 )

Change subject: IMPALA-10454: Bump --ssl_minimum_version to tls1.2
..


Patch Set 2: Code-Review+2

Thanks for fixing this!


--
To view, visit http://gerrit.cloudera.org:8080/16988
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifed66646b041a061f9db92744710aef7453f39e4
Gerrit-Change-Number: 16988
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 27 Jan 2021 21:37:29 +
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-9234: Support Ranger row filtering policies

2021-01-27 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. (
http://gerrit.cloudera.org:8080/16976 )

Change subject: [WIP] IMPALA-9234: Support Ranger row filtering policies
..

Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16976/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16976/1//COMMIT_MSG@33
PS1, Line 33:place
I think we need to add some more tests for different row filter types, e.g.
more complex expressions, invalid expressions, etc.

Are there any restrictions on the types of expressions that can be used in row
filters? E.g. are UDFs allowed?

http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/TableMask.java
File fe/src/main/java/org/apache/impala/authorization/TableMask.java:

http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/TableMask.java@102
PS1, Line 102: String stmtSql = String.format("SELECT * FROM %s.%s WHERE
%s",
We might need to be a bit careful with testing invalid row filters that don't
parse, etc. It seems like this approach should work, but just trying to think
about the risks.

http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java
File
fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java:

http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@397
PS1, Line 397: List auditEvents =
auditHandler.getAuthzEvents();
Can we factor out this code and share it with the similar logic above? This
code seems subtle and it would be good to only have one implementation.

http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationContext.java
File
fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationContext.java:

http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationContext.java@56
PS1, Line 56:* We stash the List of AuthzAuditEvent's in a Map after the
analysis of the query and
This comment probably needs an update. I find this method a bit confusing
overall. Maybe it would be clearer if the comment described what it means to
call this method? Or when it would be called? The comment seems to mainly
discuss the implementation.

--
To view, visit http://gerrit.cloudera.org:8080/16976
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I580517be241225ca15e45686381b78890178d7cc
Gerrit-Change-Number: 16976
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong
Gerrit-Comment-Date: Wed, 27 Jan 2021 20:42:27 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10382: fix invalid outer join simplification

2021-01-27 Thread Tim Armstrong (Code Review)

Tim Armstrong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16845 )

Change subject: IMPALA-10382: fix invalid outer join simplification
..

IMPALA-10382: fix invalid outer join simplification

When set ENABLE_OUTER_JOIN_TO_INNER_TRANSFORMATION = true, the planner
will simplify outer joins if the predicate with case expr or conditional
function on both sides of outer join.
However, the predicate maybe not null-rejecting, if simplify the outer
join, the result is incorrect. E.g. t1.b > coalesce(t1.c, t2.c) can
return true if t2.c is null, so it is not null-rejecting predicate
for t2.

The fix is simply to support the case that the predicate with two
operands and the operator is one of (=, !=, >, <, >=, <=),
1. one of the operands or
2. if the operand is arithmetic expression and one of the children
does not contain conditional builtin function or case expr and has
tuple id in outer joined tuples.
E.g. t1.b > coalesce(t2.c, t1.c) or t1.b + coalesce(t2.c, t1.c) >
coalesce(t2.c, t1.c) is null-rejecting predicate for t1.

Testing:
* Add new plan tests in outer-to-inner-joins.test
* Add new query tests to verify the correctness on transformation

Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b
Reviewed-on: http://gerrit.cloudera.org:8080/16845
Tested-by: Impala Public Jenkins 
Reviewed-by: Tim Armstrong 
---
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/outer-to-inner-joins.test
M 
testdata/workloads/functional-query/queries/QueryTest/outer-to-inner-joins.test
3 files changed, 223 insertions(+), 11 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Tim Armstrong: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/16845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b
Gerrit-Change-Number: 16845
Gerrit-PatchSet: 5
Gerrit-Owner: Xianqing He 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Xianqing He

[Impala-ASF-CR] IMPALA-10382: fix invalid outer join simplification

2021-01-27 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16845 )

Change subject: IMPALA-10382: fix invalid outer join simplification
..


Patch Set 4: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b
Gerrit-Change-Number: 16845
Gerrit-PatchSet: 4
Gerrit-Owner: Xianqing He 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Xianqing He 
Gerrit-Comment-Date: Wed, 27 Jan 2021 17:30:34 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10274: Initialize impala-python as part of the CMake build

2021-01-26 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16607 )

Change subject: IMPALA-10274: Initialize impala-python as part of the CMake 
build
..


Patch Set 4:

(3 comments)

Seems overall fine, had some minor comments.

http://gerrit.cloudera.org:8080/#/c/16607/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16607/4//COMMIT_MSG@19
PS4, Line 19:  - Rebuilt and verified that impala-python is not reinitialized
Are there any considerations about testing against different OS versions? I 
guess mostly we're just using python2.7 in most places?


http://gerrit.cloudera.org:8080/#/c/16607/4/be/src/codegen/gen_ir_descriptions.py
File be/src/codegen/gen_ir_descriptions.py:

http://gerrit.cloudera.org:8080/#/c/16607/4/be/src/codegen/gen_ir_descriptions.py@1
PS4, Line 1: #!/usr/bin/env python
Maybe comment why this doesn't use impala-python?


http://gerrit.cloudera.org:8080/#/c/16607/4/bin/gen_build_version.py
File bin/gen_build_version.py:

http://gerrit.cloudera.org:8080/#/c/16607/4/bin/gen_build_version.py@1
PS4, Line 1: #!/usr/bin/env python
Maybe comment why this doesn't use impala-python?



--
To view, visit http://gerrit.cloudera.org:8080/16607
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ieff51263c55bd234028fed7101c94b4a928590f0
Gerrit-Change-Number: 16607
Gerrit-PatchSet: 4
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 27 Jan 2021 02:20:35 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10382: fix invalid outer join simplification

2021-01-26 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16845 )

Change subject: IMPALA-10382: fix invalid outer join simplification
..


Patch Set 2:

(1 comment)

Looks good to me except for a typo

http://gerrit.cloudera.org:8080/#/c/16845/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/16845/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@3368
PS2, Line 3368: Builin
nit: Builtin



--
To view, visit http://gerrit.cloudera.org:8080/16845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b
Gerrit-Change-Number: 16845
Gerrit-PatchSet: 2
Gerrit-Owner: Xianqing He 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 27 Jan 2021 02:14:39 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8306: clarify wording on /sessions UI

2021-01-26 Thread Tim Armstrong (Code Review)

Hello Vincent Tran, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16981

to look at the new patch set (#2).

Change subject: IMPALA-8306: clarify wording on /sessions UI
..

IMPALA-8306: clarify wording on /sessions UI

Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e
---
M www/sessions.tmpl
1 file changed, 9 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/81/16981/2
--
To view, visit http://gerrit.cloudera.org:8080/16981
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e
Gerrit-Change-Number: 16981
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vincent Tran

[Impala-ASF-CR] IMPALA-8306: clarify wording on /sessions UI

2021-01-26 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16981 )

Change subject: IMPALA-8306: clarify wording on /sessions UI
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16981/1/www/sessions.tmpl
File www/sessions.tmpl:

http://gerrit.cloudera.org:8080/#/c/16981/1/www/sessions.tmpl@29
PS1, Line 29: once an expired session has been cleaned up
> Is there a mechanism to clean up expired session? Or are you referring to a
Very good point, I was not clearly understanding how this worked - you are 
right that the expired session can stick around indefinitely unless 
-disconnected_session_timeout kicks in.

Updated this, hope it clarifies some of the isseus.



--
To view, visit http://gerrit.cloudera.org:8080/16981
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e
Gerrit-Change-Number: 16981
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vincent Tran 
Gerrit-Comment-Date: Tue, 26 Jan 2021 20:52:04 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support

2021-01-26 Thread Tim Armstrong (Code Review)

Tim Armstrong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16982 )

Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support
..

IMPALA-10404: Update docs to reflect RLE_DICTIONARY support

Fix references to PLAIN_DICTIONARY to reflect that
RLE_DICTIONARY is supported too.

Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335
Reviewed-on: http://gerrit.cloudera.org:8080/16982
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins 
---
M docs/shared/impala_common.xml
M docs/topics/impala_parquet_dictionary_filtering.xml
2 files changed, 8 insertions(+), 5 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/16982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335
Gerrit-Change-Number: 16982
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shajini Thayasingh 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy

[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support

2021-01-26 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16982 )

Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16982/1/docs/topics/impala_parquet_dictionary_filtering.xml
File docs/topics/impala_parquet_dictionary_filtering.xml:

http://gerrit.cloudera.org:8080/#/c/16982/1/docs/topics/impala_parquet_dictionary_filtering.xml@61
PS1, Line 61:
> nit: extra spaces
Done



--
To view, visit http://gerrit.cloudera.org:8080/16982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335
Gerrit-Change-Number: 16982
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shajini Thayasingh 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 26 Jan 2021 16:35:51 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support

2021-01-26 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16982 )

Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support
..


Patch Set 2: Code-Review+2

carry


--
To view, visit http://gerrit.cloudera.org:8080/16982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335
Gerrit-Change-Number: 16982
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shajini Thayasingh 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 26 Jan 2021 16:35:42 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support

2021-01-26 Thread Tim Armstrong (Code Review)

Hello Zoltan Borok-Nagy, Shajini Thayasingh, Csaba Ringhofer, Impala Public 
Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16982

to look at the new patch set (#2).

Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support
..

IMPALA-10404: Update docs to reflect RLE_DICTIONARY support

Fix references to PLAIN_DICTIONARY to reflect that
RLE_DICTIONARY is supported too.

Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335
---
M docs/shared/impala_common.xml
M docs/topics/impala_parquet_dictionary_filtering.xml
2 files changed, 8 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16982/2
--
To view, visit http://gerrit.cloudera.org:8080/16982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335
Gerrit-Change-Number: 16982
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Shajini Thayasingh 
Gerrit-Reviewer: Zoltan Borok-Nagy

[Impala-ASF-CR] IMPALA-9793: Impala quickstart cluster with docker-compose

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15966 )

Change subject: IMPALA-9793: Impala quickstart cluster with docker-compose
..


Patch Set 11:

Hit IMPALA-10316


--
To view, visit http://gerrit.cloudera.org:8080/15966
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c
Gerrit-Change-Number: 15966
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 26 Jan 2021 05:38:40 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-01-25 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public 
Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#28).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

The interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle both
the partitioning (for which we use a std::map and a
comparator based on the partition exprs) and the tie-handling
semantics required by rank() predicates. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

We currently use the partitioned top-n node to implement
rank() pushdown in all cases because of the tie-handling
support. We also cannot use the merging exchange for
rank() because the limit does not handle ties in the same way,
so we need to generate a hash exchange with a partitioned
top-n node on top of the exchange.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.

[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16982


Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support
..

IMPALA-10404: Update docs to reflect RLE_DICTIONARY support

Fix references to PLAIN_DICTIONARY to reflect that
RLE_DICTIONARY is supported too.

Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335
---
M docs/shared/impala_common.xml
M docs/topics/impala_parquet_dictionary_filtering.xml
2 files changed, 8 insertions(+), 5 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16982/1
--
To view, visit http://gerrit.cloudera.org:8080/16982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335
Gerrit-Change-Number: 16982
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-8306: clarify wording on /sessions UI

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16981


Change subject: IMPALA-8306: clarify wording on /sessions UI
..

IMPALA-8306: clarify wording on /sessions UI

Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e
---
M www/sessions.tmpl
1 file changed, 6 insertions(+), 4 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/81/16981/1
--
To view, visit http://gerrit.cloudera.org:8080/16981
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e
Gerrit-Change-Number: 16981
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-9793: Impala quickstart cluster with docker-compose

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15966 )

Change subject: IMPALA-9793: Impala quickstart cluster with docker-compose
..


Patch Set 11:

Thanks for the review!


--
To view, visit http://gerrit.cloudera.org:8080/15966
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c
Gerrit-Change-Number: 15966
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 25 Jan 2021 21:34:06 +
Gerrit-HasComments: No

[Impala-ASF-CR] WIP: IMPALA-1652: fix char semantics

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has abandoned this change. ( 
http://gerrit.cloudera.org:8080/16339 )

Change subject: WIP: IMPALA-1652: fix char semantics
..


Abandoned
--
To view, visit http://gerrit.cloudera.org:8080/16339
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I00980c162257ce85e7ec2f4f03c48b142e5ef2e2
Gerrit-Change-Number: 16339
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] WIP: IMPALA-1652: fix char semantics

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16339


Change subject: WIP: IMPALA-1652: fix char semantics
..

WIP: IMPALA-1652: fix char semantics

select 'expected 2',count(*) from ax where t = cast('a ' as
varchar(10))

Change-Id: I00980c162257ce85e7ec2f4f03c48b142e5ef2e2
---
M be/src/exprs/anyval-util.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/string-functions-ir.cc
M be/src/testutil/test-udas.cc
M be/src/udf/uda-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/analysis/StringLiteral.java
9 files changed, 56 insertions(+), 13 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/16339/3
--
To view, visit http://gerrit.cloudera.org:8080/16339
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I00980c162257ce85e7ec2f4f03c48b142e5ef2e2
Gerrit-Change-Number: 16339
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] WIP: IMPALA-2138: part 4: implicit cast handling

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/14399


Change subject: WIP: IMPALA-2138: part 4: implicit cast handling
..

WIP: IMPALA-2138: part 4: implicit cast handling

This solves some latent issues with implicit casts that
caused test failures for my projection patch and
generally make it difficult to safely use expression
substitution to replace expressions with projected
versions.

When doing expression substitution, it is critical that the
behaviour of expressions should *not* change - we want to
make the plan more efficient without any change in results.

Implicit casts proved problematic for projection because they
got removed automatically during expression substitution.
In many cases the resulting expression actually changed
semantics.

E.g. c = cast('foo' as CHAR(4)), where c is a CHAR(3) requires
a cast to CHAR(4) to be inserted. Expr.substitute() can actually
remove this cast. There is also some "interesting" behaviour
where implicit casts are not considered when comparing two
expr trees for equality, which could have all kinds of peculiar
consequences in edge cases where non-equivalent exprs get
replaced.

This change introduces a concept of an explicit,
planner-generated cast that will be preserved through
expression substitution. All implicit casts are promoted
to these explicit casts during plan generation and
any new casts inserted in the plan's exprs must be
explicit casts.

A validateExprs() method is added that verifies this.
validateExprs() also checks the invariant that exprs
must reference valid tuple IDs, which will be important
for the follow-on projection patch.
changes as much as possible.

This patch avoids changes to handling of implicit casts
during analysis to avoid unintended changes in behaviour.

TODO:
* document in CastExpr
* validate expressions in data sinks for lack of implicit casts
* decouple plan changes with casts from this change

Change-Id: Ie50b7d28d691d65ee721581216d227d52f04ecf5
---
M fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/analysis/AnalyticInfo.java
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/BetweenPredicate.java
M fe/src/main/java/org/apache/impala/analysis/BoolLiteral.java
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ColumnDef.java
M fe/src/main/java/org/apache/impala/analysis/DateLiteral.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/ExprSubstitutionMap.java
A fe/src/main/java/org/apache/impala/analysis/ExprSubstitutionMode.java
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/KuduPartitionExpr.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/ModifyStmt.java
M fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java
M fe/src/main/java/org/apache/impala/analysis/NullLiteral.java
M fe/src/main/java/org/apache/impala/analysis/NumericLiteral.java
M fe/src/main/java/org/apache/impala/analysis/OrderByElement.java
M fe/src/main/java/org/apache/impala/analysis/QueryStmt.java
M fe/src/main/java/org/apache/impala/analysis/RangePartition.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/analysis/SortInfo.java
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java
M fe/src/main/java/org/apache/impala/analysis/StringLiteral.java
M fe/src/main/java/org/apache/impala/analysis/TimestampLiteral.java
M fe/src/main/java/org/apache/impala/analysis/UnionStmt.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/DataPartition.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/ExchangeNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionFilter.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala

[Impala-ASF-CR] WIP: IMPALA-2138: part 4: implicit cast handling

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has abandoned this change. ( 
http://gerrit.cloudera.org:8080/14399 )

Change subject: WIP: IMPALA-2138: part 4: implicit cast handling
..


Abandoned
--
To view, visit http://gerrit.cloudera.org:8080/14399
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: Ie50b7d28d691d65ee721581216d227d52f04ecf5
Gerrit-Change-Number: 14399
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has abandoned this change. ( 
http://gerrit.cloudera.org:8080/14216 )

Change subject: WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges
..


Abandoned

not doing for now
--
To view, visit http://gerrit.cloudera.org:8080/14216
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I94ccf2d1acecd9a593f4c29fc15202a799d2f7f5
Gerrit-Change-Number: 14216
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/14216


Change subject: WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges
..

WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges

This patch adds an optimization pass to the planner
that adds projection operations to the distributed
plan before exchanges. The projections are added
if the row either has slots that are not used
upstream or if the row has multiple tuples, i.e.
is not flat.

The pass is performed late in the planning process,
after join ordering, runtime filters, etc. There is
a phase ordering problem because the projection
pass depends on the join order and strategy,
but those decisions depend on the amount of data
being exchanged across the network, which is reduced
by projection. Tacking the projection onto the end
solves the problem with the following pros and cons.
Pros:
* This is not invasive to the rest of the planner
* Other planner decisions like join ordering will not be
  affected by the change, making this change safer and
  makes the chance of serious perf regressions minimal.
Cons:
* Join ordering and strategy may make sub-optimal decisions
  because they include the cost of exchanging projected
  slots in decisions.

Note that the projection is implemented with a separate
UnionNode (called PROJECT in the plan). We could squeeze
out some more performance by doing the projection inline
in the exchange operator and avoiding a copy, but this
already appears to be a big win even with the extra
separate step: the cost of the PROJECT seems to be
offset by reduced serialisation and compression
time in the exchange. The UNION is codegen'd and quite
efficient, and we save work in the exchange node from
fewer slots, the flattened tuple representation, and
less data to compress and transfer.

The details of the optimisation pass are documented in
the class comment.

TODO in fe implementation
* O(n^2) algorithms in tree walk? Expr lists are revisited
  each time.
* Audit expr method implementations
* remove debug logging
* Consider removing the validation outside of testing

Testing:
TODO
* Need to update planner tests
* Exhaustive tests
* Run some tests with projection disabled - join queries, agg queries?

* Projection that should happen outside subplan
set explain_level=2; use functional_parquet; explain select
straight_join t1.id, m.key
from complextypestbl t1 join [broadcast] complextypestbl t2, t2.int_map
m
where t1.id = t2.id and t2.nested_struct.a > 10;

This was based on a prototype by Alex Behm.

Change-Id: I94ccf2d1acecd9a593f4c29fc15202a799d2f7f5
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/OrderByElement.java
M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/CardinalityCheckNode.java
M fe/src/main/java/org/apache/impala/planner/DataSink.java
M fe/src/main/java/org/apache/impala/planner/DataStreamSink.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/EmptySetNode.java
M fe/src/main/java/org/apache/impala/planner/ExchangeNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseTableSink.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/JoinBuildSink.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/KuduTableSink.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanRootSink.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
A fe/src/main/java/org/apache/impala/planner/ProjectionPass.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/planner/SelectNode.java
M fe/src/main/java/org/apache/impala/planner/SingularRowSrcNode.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M fe/src/main/java/org/apache/impala/planner/SubplanNode.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M fe/src/main/java/org/apache/impala/planner/UnionNode.java
M fe/src/main/java/org/apache/impala/planner/UnnestNode.java
35 files changed, 976 insertions(+), 48 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/16/14216/8
--
To view, visit http://gerrit.cloudera.org:8080/14216
To

[Impala-ASF-CR] WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has abandoned this change. ( 
http://gerrit.cloudera.org:8080/16214 )

Change subject: WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen
..


Abandoned

Now redundant, this was solved in a different way
--
To view, visit http://gerrit.cloudera.org:8080/16214
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I3c8f1c82cf6f932dd6aad8670c8cca2e53f80c84
Gerrit-Change-Number: 16214
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Tianyi Wang

[Impala-ASF-CR] WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16214


Change subject: WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen
..

WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen

This patch removes the indirection of codegened TupleRowComparator::
Compare() at its call sites in sorter and topn node. It's implemented by
cloning and modifying all the functions between the codegened entry
function and Compare().

The call site in SortedRunMerger is still indirect, but the indirection
could be removed in the same way as this patch.

Testing:
* TODO: unit tests for SCC algo

Perf:
Tianyi said: "TPCH queries with a sort node are 2%-5% faster."
TODO: run benchmarks

Change-Id: I3c8f1c82cf6f932dd6aad8670c8cca2e53f80c84
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/exec/topn-node.cc
M be/src/runtime/sorted-run-merger.cc
M be/src/runtime/sorter-internal.h
M be/src/runtime/sorter-ir.cc
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M be/src/util/tuple-row-compare.cc
M be/src/util/tuple-row-compare.h
A tests/custom_cluster/test_replace_tuple_row_compare.py
12 files changed, 354 insertions(+), 37 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/14/16214/1
--
To view, visit http://gerrit.cloudera.org:8080/16214
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3c8f1c82cf6f932dd6aad8670c8cca2e53f80c84
Gerrit-Change-Number: 16214
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Tianyi Wang

[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16908 )

Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring 
and reverse functions
..


Patch Set 11: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 25 Jan 2021 18:57:25 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-01-25 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16242 )

Change subject: IMPALA-9979: part 2: partitioned top-n
..


Patch Set 27:

Rebased onto master after the previous patch was merged


--
To view, visit http://gerrit.cloudera.org:8080/16242
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
Gerrit-Change-Number: 16242
Gerrit-PatchSet: 27
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 25 Jan 2021 18:46:23 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-01-25 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public 
Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#27).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

The interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle both
the partitioning (for which we use a std::map and a
comparator based on the partition exprs) and the tie-handling
semantics required by rank() predicates. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

We currently use the partitioned top-n node to implement
rank() pushdown in all cases because of the tie-handling
support. We also cannot use the merging exchange for
rank() because the limit does not handle ties in the same way,
so we need to generate a hash exchange with a partitioned
top-n node on top of the exchange.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.

[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1

2021-01-21 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16318 )

Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
..


Patch Set 32:

(4 comments)

Few comments related to the hang

http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr-internal.h
File be/src/runtime/tmp-file-mgr-internal.h:

http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr-internal.h@206
PS32, Line 206:   TmpFileRemote(TmpFileGroup* file_group, TmpFileMgr::DeviceId 
device_id,
Can we add a destructor with a DCHECK that will ensure that the resources have 
been returned to TmpFileBufferPool correctly?

This might relate to the shared_ptr comment I made elsewhere - if we use 
shared_ptr consistently then we'll know that the destructor runs exactly when 
the object becomes unreachable.


http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.h@184
PS32, Line 184: /// A map to keep the TmpFile shared pointer alive by 
keeping a reference of the
Why do we need to do this again? Can't we enqueue the shared_ptr into the tmp 
file pool and make ownership explicit, instead of doing this shared_ptr/raw 
pointer dance. I'm finding it hard to understand the ownership of these TmpFile 
objects while looking to see if there's a resource leak.


http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.cc@576
PS32, Line 576:   if 
(tmp_dirs_remote_ctrl_.local_buff_dir_bytes_high_water_mark_.Add(file_size)
I think this code might be vulnerable to a race where a thread ends up blocking 
indefinitely in DequeueTmpFilesPool, e.g.

Assume bytes_limit is 2x and each file is x bytes. Start with x bytes allocated.

Thread 1: increment, win race, 2x allocated
Thread 2: increment, lose race, 3x allocated > bytes limit
Thread 3: decrement, 2x
Thread 4: increment, fails, 3x allocated
Thread 4: decrement, 2x
Thread 2: decrement, 1x

In that case thread 4 could block in DequeueTmpFilesPool even though there's 
free capacity.

I think the basic flaw is that one of the conditions for a thread blocking on 
tmp_files_available_cv_ is that it should have checked for buffer availabity 
before doing so, but it's not part of the condition variable loop condition.

Using the atomics is ok in other places because we just return an error when 
there's a race.

I think it'd be best to protect it behind a lock and also signal the condition 
variable when it's decremented from >= bytes_limit to < bytes_limit.


http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.cc@1835
PS32, Line 1835: void TmpFileBufferPool::EnqueueTmpFilesPool(TmpFile* tmp_file, 
bool front) {
I'm suspicious about using the raw pointer here



--
To view, visit http://gerrit.cloudera.org:8080/16318
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
Gerrit-Change-Number: 16318
Gerrit-PatchSet: 32
Gerrit-Owner: Yida Wu 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Yida Wu 
Gerrit-Comment-Date: Thu, 21 Jan 2021 23:51:19 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

2021-01-21 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16908 )

Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring 
and reverse functions
..


Patch Set 9:

(3 comments)

LGTM, I wanted a few more tests and a comment but otherwise I'm ready to +2

http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc@10542
PS9, Line 10542: TEST_P(ExprTest, Utf8Test) {
Some of the characters below are > 2 bytes right? It would be helpful to add a 
comment mentioning the # of bytes in the characters you used.


http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc@10596
PS9, Line 10596:   TestStringValue("utf8_reverse('mañana')", "anañam");
Can we add tests for a couple of grapheme clusters where we're reversing the 
codepoints instead of the clusters. There are a couple of examples here - 
https://exploringjs.com/impatient-js/ch_unicode.html#grapheme-clusters-the-real-characters.

I tried them in impala-shell and it seems to have the expected behavior - 
https://drive.google.com/file/d/1iPXFAOtkRE5OPc014i6hARfig7W8xmpw/view?usp=sharing


http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/string-functions-ir.cc@496
PS9, Line 496: StringVal StringFunctions::Utf8Reverse(FunctionContext* context, 
const StringVal& str) {
Comment that this reverses codepoints only, and that's consistent with other 
systems.



--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 20:31:03 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns

2021-01-21 Thread Tim Armstrong (Code Review)

Tim Armstrong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16947 )

Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to 
support a list of columns
..

IMPALA-10435: Extend 'compute incremental stats' syntax
to support a list of columns

Modified parser to support compute incremental stats
columns.No need to modify the code of other modules
because it already supports

Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Reviewed-on: http://gerrit.cloudera.org:8080/16947
Tested-by: Impala Public Jenkins 
Reviewed-by: Tim Armstrong 
---
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M 
testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test
5 files changed, 340 insertions(+), 5 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Tim Armstrong: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/16947
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Gerrit-Change-Number: 16947
Gerrit-PatchSet: 5
Gerrit-Owner: liuyao 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Reviewer: liuyao

[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns

2021-01-21 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16947 )

Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to 
support a list of columns
..


Patch Set 4: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16947
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Gerrit-Change-Number: 16947
Gerrit-PatchSet: 4
Gerrit-Owner: liuyao 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Reviewer: liuyao 
Gerrit-Comment-Date: Thu, 21 Jan 2021 19:35:24 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns

2021-01-20 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16947 )

Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to 
support a list of columns
..


Patch Set 3:

(1 comment)

Thanks, this makes sense I think. Can you test one more edge case for me?

http://gerrit.cloudera.org:8080/#/c/16947/3/testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test
File 
testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test:

http://gerrit.cloudera.org:8080/#/c/16947/3/testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test@830
PS3, Line 830: 
Can you also run compute incremental stats without specifying a partition to 
see which partitions it recomputes stats on?



--
To view, visit http://gerrit.cloudera.org:8080/16947
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Gerrit-Change-Number: 16947
Gerrit-PatchSet: 3
Gerrit-Owner: liuyao 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Reviewer: liuyao 
Gerrit-Comment-Date: Thu, 21 Jan 2021 00:52:16 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9793: Impala quickstart cluster with docker-compose

2021-01-20 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15966 )

Change subject: IMPALA-9793: Impala quickstart cluster with docker-compose
..


Patch Set 9:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/15966/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15966/9//COMMIT_MSG@26
PS9, Line 26: Instructions for running the quickstart cluster are in
: docker/quickstart.yml.
> One thing I saw over in Kudu is that they have the docker compose instructi
I moved the stuff from the yml to the README, I think that makes sense.


http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart.yml
File docker/quickstart.yml:

http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart.yml@20
PS9, Line 20: # All filesystem data is stored in Docker volumes. The default 
storage location for tables
: # is in the impala-quickstart-warehouse volume, i.e. if you 
create a table in Impala, it
: # will be stored in that volume by default.
> This could be a follow-on change, but we may want to have some documentatio
I added a section to the README describing how to mount the volume in a 
container and copy data from the host.


http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart.yml@51
PS9, Line 51: # To load data in background into Parquet and Kudu formats:
: #
: #  docker-compose -f docker/quickstart.yml -f 
docker/quickstart-kudu-minimal.yml \
: # -f docker/quickstart-load-data.yml up -d
> Can the dataload part run as a separate docker-compose command? i.e.
They need to be run in the same command so that docker-compose can resolve the 
dependencies between the files.


http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/Dockerfile
File docker/quickstart_client/Dockerfile:

http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/Dockerfile@48
PS9, Line 48: RUN groupadd -r impala && useradd --no-log-init -r -g impala 
impala && \
Needed to update the uid/gid to match IMPALA-10373


http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/Dockerfile@52
PS9, Line 52: # Copy the Hive install.
> Nit: Not really Hive related. Maybe change this to "Copy the client entrypo
Done


http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/data-load-entrypoint.sh
File docker/quickstart_client/data-load-entrypoint.sh:

http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/data-load-entrypoint.sh@32
PS9, Line 32: LOading
> Nit: capitalization typo
Done


http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/data-load-entrypoint.sh@35
PS9, Line 35:   
TPCDS_TARBALL=tpc-ds-${TPCDS_VERSION}-gcc-4.9.2-ec2-package-ubuntu-18-04.tar.gz
> Nit: The client base image is using Ubuntu 16 by default, and we are downlo
Done - switched the base image to ubuntu 18


http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_conf/hive-site.xml
File docker/quickstart_conf/hive-site.xml:

http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_conf/hive-site.xml@28
PS9, Line 28:  
: 
: 
hive.metastore.notifications.add.thrift.objects
: true
:   
:   
: 
: hive.metastore.alter.notifications.basic
: false
:   
:   
: 
:  
hive.metastore.event.db.notification.api.auth
: false
:   
> When I'm looking at https://github.com/apache/impala/blob/master/fe/src/tes
Yeah it should be generally similar to what we use for the tests. I used that 
as a reference but removed some of the bits that were obviously irrelevant to 
prune it down. Ideally this config would only override default hive settings 
where we actually need them to get it working.

I removed a couple of these settings that were redundant now.



--
To view, visit http://gerrit.cloudera.org:8080/15966
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c
Gerrit-Change-Number: 15966
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 20 Jan 2021 21:46:50 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9793: Impala quickstart cluster with docker-compose

2021-01-20 Thread Tim Armstrong (Code Review)

Hello Quanlong Huang, Grant Henke, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15966

to look at the new patch set (#10).

Change subject: IMPALA-9793: Impala quickstart cluster with docker-compose
..

IMPALA-9793: Impala quickstart cluster with docker-compose

What works:
* A single node cluster can be started up with docker-compose
* HMS data is stored in Derby database in a docker volume
* Filesystem data is stored in a shared docker volume, using the
  localfs support in the Hadoop client.
* A Kudu cluster with a single master can be optionally added on
  to the Impala cluster.
* TPC-DS data can be loaded automatically by a data loading container.

We need to set up a docker network called quickstart-network,
purely because docker-compose insists on generating network names
with underscores, which are part of the FQDN and end up causing
problems with Java's URL parsing, which rejects these technically
invalid domain names.

How to run:

Instructions for running the quickstart cluster are in
docker/README.md.

How to build containers:

  ./buildall.sh -release -noclean -notests -ninja
  ninja quickstart_hms_image quickstart_client_image docker_images

How to upload containers to dockerhub:

  IMPALA_QUICKSTART_IMAGE_PREFIX=timgarmstrong/
  for i in impalad_coord_exec impalad_coordinator statestored \
   impalad_executor catalogd impala_quickstart_client \
   impala_quickstart_hms
  do
docker tag $i ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i
docker push ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i
  done

I pushed containers build from commit f260cce22, which
was branched from 6cb7cecacf on master.

Misc other stuff:
* Added more metadata to all images.

TODO:
* Test and instructions to run against Kudu quickstart
* Upload latest version of containers before merging.

Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c
---
M docker/CMakeLists.txt
M docker/README.md
A docker/docker-build.sh
M docker/impala_base/Dockerfile
A docker/quickstart-kudu-minimal.yml
A docker/quickstart-load-data.yml
A docker/quickstart.yml
A docker/quickstart_client/Dockerfile
A docker/quickstart_client/data-load-entrypoint.sh
A docker/quickstart_client/load_tpcds_kudu.sql
A docker/quickstart_client/load_tpcds_parquet.sql
A docker/quickstart_conf/hive-site.xml
A docker/quickstart_hms/Dockerfile
A docker/quickstart_hms/hms-entrypoint.sh
14 files changed, 2,985 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/15966/10
--
To view, visit http://gerrit.cloudera.org:8080/15966
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c
Gerrit-Change-Number: 15966
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-10441: Skip test bytes read per column if not on local minicluster

2021-01-20 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16964 )

Change subject: IMPALA-10441: Skip test_bytes_read_per_column if not on local 
minicluster
..


Patch Set 1: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16964
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If8a179937c9c7c690dd2630549464dbe6aa1b834
Gerrit-Change-Number: 16964
Gerrit-PatchSet: 1
Gerrit-Owner: Laszlo Gaal 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Laszlo Gaal 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 20 Jan 2021 16:53:27 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10448: Build impala-profile-tool early for Docker-based tests

2021-01-20 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16965 )

Change subject: IMPALA-10448: Build impala-profile-tool early for Docker-based 
tests
..


Patch Set 1: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16965
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60e78ea883f3057c59a345feca38ef08a7f6a0b8
Gerrit-Change-Number: 16965
Gerrit-PatchSet: 1
Gerrit-Owner: Laszlo Gaal 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Laszlo Gaal 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 20 Jan 2021 16:53:05 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1

2021-01-19 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16318 )

Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
..


Patch Set 30:

(10 comments)

I think I'm pretty close to approving it - the big thing is that I wanted to 
understand why the deadlock you were mentioning could occur. That one makes me 
think that we might be missing a flaw in how the local buffer directories work, 
or similar.

Otherwise looking pretty good :)

http://gerrit.cloudera.org:8080/#/c/16318/30//COMMIT_MSG
Commit Message:

PS30:
Following on from our discussion about a potential deadlock, I do think there 
is a limitation that, if there are N local buffer files, only N 
spilling-to-remote queries can make progress at the same time. I think these 
queries almost always will have a partially-written file with a local buffer 
associated, so they may block other queries from spilling until they have 
completed and released that buffer. I think a query can be in this state 
indefinitely. E.g. if it was spilling but is now off doing some other work.

This isn't ideal but is maybe OK if N is high enough, at least for milestone 1.

I *think* that each query should be able to make progress independent of other 
queries. We would have a deadlock if query A depended on B to release a local 
buffer file to make progress and B depended directly or indirectly on A 
releasing some other resource to make progress. I haven't seen anything that 
looks like that yet.


http://gerrit.cloudera.org:8080/#/c/16318/30//COMMIT_MSG@112
PS30, Line 112: * Ran pre-review-test
Have you tested in combination with disk spill compression?


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/disk-io-mgr.cc@328
PS30, Line 328: 0
I think we should probably set the replication to 1 for this use case - that 
would reduce space requirement for HDFS spilling by 3x.


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.h
File be/src/runtime/io/local-file-system.h:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.h@42
PS30, Line 42: uint8
nit: uint8_t


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.cc
File be/src/runtime/io/local-file-system.cc:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.cc@98
PS30, Line 98: uint8
nit: uint8_t


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-writer.cc
File be/src/runtime/io/local-file-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-writer.cc@45
PS30, Line 45:   int written = write(fileno(file_), range->data(), 
range->len());
I missed this, I think we should use LocalFileSystem::Fwrite which converts the 
error into a status so that the error is handled correctly.


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.h@420
PS30, Line 420:   ///
This is the main public API for TmpFileMgr writing. I notice that it doesn't 
clearly document anything about when the write finishes.

We should probably say something like:

  The write may take some time to complete. It may be queued behind other I/O 
operations. If remote scratch is enabled, it may also need to wait for other 
queries to make progress and release space in the local buffer directory.


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h@194
PS27, Line 194: std::unique_ptr tmp_file_pool_;
> The pool only works for remote spilling. It would be null when there is no
Can you also put this in the comment for tmp_file_pool_?


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.cc@517
PS30, Line 517:   if (HasRemoteDir()) {
This doesn't seem right if write_range is just writing to a local scratch 
directory (not just the local buffer directory). Shouldn't it check to see if 
write_range is jsut targeted at local scratch? I guess I would expect spilling 
to a local directory to use the same code path regardless of whether remote 
scratch is used or not.


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.cc@1820
PS30, Line 1820: DCHECK(status.ok());
nit: use DCHECK_OK - we have a custom version that automatically prints the 
error message if the check fails.



--
To view, visit http://gerrit.cloudera.org:8080/16318
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Ge

[Impala-ASF-CR] IMPALA-10434: Fix impala-shell's unicode regressions on Python2

2021-01-19 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16960 )

Change subject: IMPALA-10434: Fix impala-shell's unicode regressions on Python2
..


Patch Set 1: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16960/1/shell/impala_shell.py
File shell/impala_shell.py:

http://gerrit.cloudera.org:8080/#/c/16960/1/shell/impala_shell.py@552
PS1, Line 552: line
nit: maybe history_cmd?



--
To view, visit http://gerrit.cloudera.org:8080/16960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icc4a8d31311a5c59e5fc0e65fe09f770df41bea4
Gerrit-Change-Number: 16960
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Laszlo Gaal 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 19 Jan 2021 21:56:33 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns

2021-01-19 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16947 )

Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to 
support a list of columns
..


Patch Set 2:

How does this work if you compute incremental stats on a subset of columns, 
e.g. (c1, c2) then compute it on a different subset of columns, e.g. (c2, c3)?

Is this allowed? Does this still merge together the stats that are available?

I think we need a test.


--
To view, visit http://gerrit.cloudera.org:8080/16947
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Gerrit-Change-Number: 16947
Gerrit-PatchSet: 2
Gerrit-Owner: liuyao 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Tue, 19 Jan 2021 21:11:23 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-19 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 11:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@9
PS11, Line 9: when
> nit: remove
Done


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@22
PS11, Line 22: limit
> did you mean partition limit >= order by limit ?
Done


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@32
PS11, Line 32: was
> nit: 'is'
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@507
PS11, Line 507: upper top-n.
> Since we do a distributed top-n, there are 3 top-n's in the plan and the 'u
Good point - used final/analytic instead of upper/lower.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@511
PS11, Line 511: doesn't matter
> nit: worth clarifying that it doesn't matter for the purpose of the pushdow
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@530
PS11, Line 530: include all of the rows in the final
> This does not literally mean all rows in the final partition right ? Should
Yup that's true.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@531
PS11, Line 531: was
> nit: 'we'
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@538
PS11, Line 538:* TODO: this should also return the amount that needs to be 
added to the limit
I already did this, removed TODO


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@603
PS11, Line 603: if (analyticLimit < limit) return falseStatus;
> One special case where this could work is if each partition had a maximum o
I think the partitioned top-n node would let us correctly apply the limit at 
runtime, if we wanted to further optimize this (i.e. I think at any point 
during the partitioned top-n, you could discard any in-memory partitions that 
didn't include the first N rows)


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java@83
PS11, Line 83: row.s
> nit: 'rows'
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test:

http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test@1139
PS11, Line 1139: # rank() predicate is not pushed down because TOPN_BYTES_LIMIT 
prevents conversion
> The plan shows the lower top-n which indicates the rank was pushed down. Pe
Thanks for catching this. The problem was that the table only had 8 rows, so 
the estimate for the top n was 18B * 8 = 144B, under the threshold. I switched 
to a different table with more rows and it now shows the expected behavior



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 19 Jan 2021 19:40:05 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-19 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#14).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >=
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it is always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The elapsed time for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,275 insertions(+), 355 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/14
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 13:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc
File be/src/exec/topn-node-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@44
PS11, Line 44: priority_queue_.Push(insert_tuple);
> Might be useful to explicitly mention here that we don't have to worry abou
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@91
PS11, Line 91: else {
 : // 'materialized_tuple' needs to be
> I might move this above the DCHECK, to make it clearer its referring to the
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@93
PS11, Line 93:  from the heap.
 : DCHECK_LT(cmp_result
> Not sure what this is supposed to mean, since my understanding would be tha
I'm not sure exactly what I was getting at, I think I was just restating the 
invariant that everything in 'overflowed_ties_' was equal with the head of the 
queue, so it didn't add much.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@97
PS11, Line 97:
> Might be worth explicitly saying that 'top_tuple' is what we just popped of
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@100
PS11, Line 100: node->tuple_row_less_than_->Compare(top_tuple, 
priority_queue_.Top()) == 0) {
> I find this comment a little confusing, since we've already done the one Po
Thanks, that's clearer.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h
File be/src/exec/topn-node.h:

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@86
PS11, Line 86: /// TODO: currently we only support a single top-n heap per 
operator. IMPALA-9979 will
> nit: its a little confusing to call out 'unpartitioned mode' specifically w
Yeah I had to pull this comment out of the partitioned mode patch and didn't 
want to create too many merge conflicts by editing comments. I added a TODO as 
suggested.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@180
PS11, Line 180:   RuntimeProfile::Counter* tuple_pool_reclaim_counter_= nullptr;
> The counters here and below aren't used in this patch
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@200
PS11, Line 200: n data structure use
> nit: not in this patch
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@256
PS11, Line 256:   const int64_t capacity_;
> If I understand correctly, this function only gets called once the heap has
Done. Updated comment and added DCHECK.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc
File be/src/exec/topn-node.cc:

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc@306
PS11, Line 306: heap is at
> I think this is a typo? i.e. these two words should be removed
Done



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 16 Jan 2021 01:48:13 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#13).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,276 insertions(+), 354 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/13
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#12).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,275 insertions(+), 354 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/12
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 12
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-01-15 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public 
Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#26).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

The interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle both
the partitioning (for which we use a std::map and a
comparator based on the partition exprs) and the tie-handling
semantics required by rank() predicates. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

We currently use the partitioned top-n node to implement
rank() pushdown in all cases because of the tie-handling
support. We also cannot use the merging exchange for
rank() because the limit does not handle ties in the same way,
so we need to generate a hash exchange with a partitioned
top-n node on top of the exchange.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.

[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1

2021-01-15 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16318 )

Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
..


Patch Set 27:

(19 comments)

Another round of comments. I understand the local buffer file logic now and I 
think it makes sense, just need to do another pass to look for bugs.

http://gerrit.cloudera.org:8080/#/c/16318/27//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16318/27//COMMIT_MSG@77
PS27, Line 77:The first available local directory is used for the local 
buffer
Can you give an example of how you might set this up in practice to allow both 
local and remote spilling? Would you configure two local directories with 
different limits so that one was the local buffer directory and the other was 
the main scratch directory?


http://gerrit.cloudera.org:8080/#/c/16318/27//COMMIT_MSG@91
PS27, Line 91: * Unit Tests added to tmp-file-mgr-test/disk-io-mgr-test.
I think it'd be good to add a buffer pool test that exercises the remote 
spilling code paths. We have some randomized tests like  
BufferPoolTest::TestRandomInternalMulti() - I think you could add another one 
with a FileGroup configured to use remote scratch.


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-file.h
File be/src/runtime/io/disk-file.h:

http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-file.h@137
PS27, Line 137:   void SetActualFileSize(int64_t size) {
This is only called once, right? Can we check that invariant:

  DCHECK_EQ(0, actual_file_size_.Load());


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-file.h@138
PS27, Line 138: DCHECK
nit: use DCHECK_LE - it automatically prints the two values if it fails so it's 
a little nicer for debugging.


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-io-mgr.cc@252
PS27, Line 252: is_full_ = written_bytes != 0 && written_bytes == 
disk_file_->actual_file_size();
Add

  DCHECK_LE(written_bytes, disk_file_->actual_file_size())

Just so we catch it in testing if there's a bug.


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-io-mgr.cc@329
PS27, Line 329: int bytes = (file_size - offset < buffer_size) ? (file_size 
- offset) : buffer_size;
Maybe use min() instead?


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/request-context.cc
File be/src/runtime/io/request-context.cc:

http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/request-context.cc@243
PS27, Line 243: (static_cast(range))->callback()(status);
nit: I think the parentheses around static_cast here and on l246 are not needed.


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr-internal.h
File be/src/runtime/tmp-file-mgr-internal.h:

http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr-internal.h@326
PS27, Line 326:   /// The number of the tmp files in the pool. For debug and 
test.
We don't seem to use this variable any more


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h@194
PS27, Line 194: std::unique_ptr tmp_file_pool_;
When is this null or non-null?


http://gerrit.cloudera.org:8080/#/c/16318/24/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/24/be/src/runtime/tmp-file-mgr.cc@427
PS24, Line 427:   *need_local_buffer_dir = false;
> Currently local buffer directory won't be inserted into tmp_dirs_, the dire
I left a comment on the commit message. It seems like it would be helpful to 
have an example of how to configure this in a reasonable way.


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.cc@512
PS27, Line 512:   if (tmp_dirs_remote_ctrl_.tmp_file_pool_ == nullptr) {
What [


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.cc@917
PS27, Line 917: if (tmp_file_mgr_->tmp_dirs_remote_ctrl_.tmp_file_pool_ != 
nullptr) {
Which case is this? Can we add a boolean helper method to TmpDirsRemoteCtrl() 
that makes it self-describing? It's sometimes a little hard to follow code when 
null/non-null implies certain modes.


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.cc@967
PS27, Line 967:   if (!tmp_files_remote_.empty()) {
It would be helpful to add a couple of comments to explain the code flow. E.g.

  // First, check to see if we can allocate more space in the previous file.

then below

 // No space in previous file, we need to set up a new remote file.


http://gerrit.cloudera.org:808

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-01-14 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16242 )

Change subject: IMPALA-9979: part 2: partitioned top-n
..


Patch Set 25:

Latest patchset fixes the test failure and is ready for review. This patchset 
depends on the limit pushdown patchset though so that needs review first anyway.


--
To view, visit http://gerrit.cloudera.org:8080/16242
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
Gerrit-Change-Number: 16242
Gerrit-PatchSet: 25
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 15 Jan 2021 00:27:40 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9865: part 2/2: add verbosity to profile tool

2021-01-14 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16881 )

Change subject: IMPALA-9865: part 2/2: add verbosity to profile tool
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16881/8/be/src/util/runtime-profile.h
File be/src/util/runtime-profile.h:

http://gerrit.cloudera.org:8080/#/c/16881/8/be/src/util/runtime-profile.h@177
PS8, Line 177:   void PrettyPrint(std::ostream* s, const std::string& prefix = 
"") const;
> For my understanding, what type of code uses this overload that doesn't tak
The profiles from the Impala debug server is one example.



--
To view, visit http://gerrit.cloudera.org:8080/16881
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I82618a813e29af7996dfaed78873b2a73bc0231d
Gerrit-Change-Number: 16881
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Jan 2021 19:13:45 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-01-14 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public 
Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#25).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

The interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle both
the partitioning (for which we use a std::map and a
comparator based on the partition exprs) and the tie-handling
semantics required by rank() predicates. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

We currently use the partitioned top-n node to implement
rank() pushdown in all cases because of the tie-handling
support. We also cannot use the merging exchange for
rank() because the limit does not handle ties in the same way,
so we need to generate a hash exchange with a partitioned
top-n node on top of the exchange.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-01-13 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public 
Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#24).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

The interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle both
the partitioning (for which we use a std::map and a
comparator based on the partition exprs) and the tie-handling
semantics required by rank() predicates. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

We currently use the partitioned top-n node to implement
rank() pushdown in all cases because of the tie-handling
support. We also cannot use the merging exchange for
rank() because the limit does not handle ties in the same way,
so we need to generate a hash exchange with a partitioned
top-n node on top of the exchange.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.

[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#11).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,285 insertions(+), 353 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/11
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] WIP

2021-01-13 Thread Tim Armstrong (Code Review)

Tim Armstrong has abandoned this change. ( 
http://gerrit.cloudera.org:8080/16950 )

Change subject: WIP
..


Abandoned
--
To view, visit http://gerrit.cloudera.org:8080/16950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I55f6376ab609ec6fb56d59ce14a995174e1644c6
Gerrit-Change-Number: 16950
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins

[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#10).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,278 insertions(+), 348 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/10
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-01-13 Thread Tim Armstrong (Code Review)

Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public 
Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16242

to look at the new patch set (#23).

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

The interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle both
the partitioning (for which we use a std::map and a
comparator based on the partition exprs) and the tie-handling
semantics required by rank() predicates. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

We currently use the partitioned top-n node to implement
rank() pushdown in all cases because of the tie-handling
support. We also cannot use the merging exchange for
rank() because the limit does not handle ties in the same way,
so we need to generate a hash exchange with a partitioned
top-n node on top of the exchange.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 6366 matches

Mail list logo