[Impala-ASF-CR] IMPALA-7556: Decouple BufferManagement from the ScanRange and IoMgr
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17413 ) Change subject: IMPALA-7556: Decouple BufferManagement from the ScanRange and IoMgr .. Patch Set 9: Glad to see someone new getting their hands dirty in this code, it makes sense to make ScanRange responsible for fewer things... -- To view, visit http://gerrit.cloudera.org:8080/17413 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibd74691b50b46114f95a8641034c05d07ddeec97 Gerrit-Change-Number: 17413 Gerrit-PatchSet: 9 Gerrit-Owner: Amogh Margoor Gerrit-Reviewer: Amogh Margoor Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 29 May 2021 06:18:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10503: testdata load hits hive memory limit errors during hive inserts
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17061 ) Change subject: IMPALA-10503: testdata load hits hive memory limit errors during hive inserts .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/17061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idac5f054e814070b983f7f57aef4ea9d54252bb2 Gerrit-Change-Number: 17061 Gerrit-PatchSet: 2 Gerrit-Owner: Kurt Deschler Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 12 Feb 2021 23:24:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9745: Propagate source type when doing constant propagation
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17064 ) Change subject: IMPALA-9745: Propagate source type when doing constant propagation .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/17064 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic3853478945229440f733c256ea225639f9178ff Gerrit-Change-Number: 17064 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 12 Feb 2021 22:33:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16318 ) Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1 .. Patch Set 35: (4 comments) I think we should merge this once a few small issues are fixed. After merging I would like to see more stress and perf testing before we consider it produciton-ready, but I think at this point the code is best in master. http://gerrit.cloudera.org:8080/#/c/16318/35/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/35/be/src/runtime/tmp-file-mgr.cc@1795 PS35, Line 1795: cur_write_range_->io_ctx()->Cancel(); I think this Cancel() is undesirable, because it means there are two different code paths that the query failure propagates by, and we may have different errors from them. I think passing status to write_callback() below is enough to cancel the query, but maybe we need to set an additional flag here to prevent different threads from getting blocked in ReserveLocalBufferSpace and instead pass the same timeout error to write_callback() http://gerrit.cloudera.org:8080/#/c/16318/35/be/src/runtime/tmp-file-mgr.cc@1797 PS35, Line 1797: status = TMP_FILE_BUFFER_POOL_CONTEXT_CANCELLED; As discussed, this should be a human-readable error message that explains that the spilling was cancelled after waiting for x seconds without getting a local file buffer. http://gerrit.cloudera.org:8080/#/c/16318/35/be/src/runtime/tmp-file-mgr.cc@1941 PS35, Line 1941: timeout = true; It'd be clearer to just return here, instead of break then return below. http://gerrit.cloudera.org:8080/#/c/16318/35/tests/custom_cluster/test_scratch_disk.py File tests/custom_cluster/test_scratch_disk.py: http://gerrit.cloudera.org:8080/#/c/16318/35/tests/custom_cluster/test_scratch_disk.py@260 PS35, Line 260: @pytest.mark.execute_serially Do these tests actually read back the data from remote storage? Or do they read from in-memory and the local files. I.e. did you confirm that they are reading spilled data from HDFS? If they are reading it back from in memory, maybe set --buffer_pool_clean_pages_limit to a lower value so that more clean pages are evicted -- To view, visit http://gerrit.cloudera.org:8080/16318 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89 Gerrit-Change-Number: 16318 Gerrit-PatchSet: 35 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Fri, 12 Feb 2021 22:33:11 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16909 ) Change subject: IMPALA-5675: Support UTF-8 Varchar and Char .. Patch Set 13: (5 comments) I looked at the backend change and the approach seems sane. I think an alternative for CHAR would be to switch to storing it in as a variable-length slot, i.e. StringValue, but to retain the padding logic. The approach you choose probably requires fewer code changes, which is good for a legacy data type that is hopefully rarely used. http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@7 PS13, Line 7: IMPALA-5675: Support UTF-8 Varchar and Char One minor thing : I think we should comment the len field, e.g. in StringValue StringVal, ColumnType, TypeDescriptor to be clear whether it means the length in bytes or the length in characters. http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@45 PS13, Line 45: In the backend, how did you identify all the places you needed to change the code? I can think of a number of places where the behaviour depends on the CHAR/VARCHAR length * Cast functions to CHAR/VARCHAR * Cast functions from STRING to CHAR, where we need to identify the "true" length of the CHAR. Although if we switched to the truncating behaviour suggested in IMPALA-1652, that might be the same for both code paths. * Scanners where we read and may need to pad or truncate according to the data type. * Expression evaluation where we turn a CHAR slot into a StringVal and a StringVal into a CHAR slot * Both the codegen'd and interpreted codepaths of the above. Generally I think my concern is that we've caught all the places and that we're testing the code. http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc File be/src/exec/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc@392 PS13, Line 392: if (!state_->utf8_mode()) { Should we add a DCHECK in UTF-8 mode to verify that the strings returned from Kudu do not have > the required number of characters. http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc File be/src/exprs/cast-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc@218 PS13, Line 218: // TODO: Will timestamp strings have non-ASCII characters? I don't think they could? Custom timestamp formats could have them, but that doesn't apply here- you can't have a custom format in a cast. http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc File be/src/exprs/scalar-expr-evaluator.cc: http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc@316 PS13, Line 316: // TODO: consider returning reference of the StringVal in UTF-8 mode. You might have to be careful here with UDFs, since they could return a len > the slot size. -- To view, visit http://gerrit.cloudera.org:8080/16909 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f Gerrit-Change-Number: 16909 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 12 Feb 2021 22:08:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9745: Propagate source type when doing constant propagation
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17064 ) Change subject: IMPALA-9745: Propagate source type when doing constant propagation .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/17064/2/fe/src/main/java/org/apache/impala/analysis/ConstantPredicateHandler.java File fe/src/main/java/org/apache/impala/analysis/ConstantPredicateHandler.java: http://gerrit.cloudera.org:8080/#/c/17064/2/fe/src/main/java/org/apache/impala/analysis/ConstantPredicateHandler.java@186 PS2, Line 186: Expr expr = source.getType() != target.getType() ? Should this use !equals() instead of != -- To view, visit http://gerrit.cloudera.org:8080/17064 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic3853478945229440f733c256ea225639f9178ff Gerrit-Change-Number: 17064 Gerrit-PatchSet: 2 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 12 Feb 2021 21:33:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9234: Support Ranger row filtering policies
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16976 ) Change subject: IMPALA-9234: Support Ranger row filtering policies .. Patch Set 2: (8 comments) http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/analysis/InlineViewRef.java File fe/src/main/java/org/apache/impala/analysis/InlineViewRef.java: http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/analysis/InlineViewRef.java@175 PS2, Line 175: // TODO: Column-masking/Row-filtering expressions may have subqueries which may Can you file a JIRA for this? I guess it is a fairly plausible use case for this - I found an example here - https://cwiki.apache.org/confluence/display/RANGER/Row-level+filtering+and+column-masking+using+Apache+Ranger+policies+in+Apache+Hive I think we might also have to be careful about exactly what subqueries are allowed. E.g. if would be weird if you had a correlated references between the row filter subquery and the outer query, or if you could have a relative table reference that referred to different tables depending on which DB it was executed in. edit: found the JIRA IMPALA-10483 http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java File fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java: http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@473 PS2, Line 473: when row filter is disabled This is inaccurate - this method doesn't check whether row filtering is enabled. http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@476 PS2, Line 476: authorizeRowFilter Rename this method to reflect new behaviour, e.g. throwIfRowFilteringRequired() http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerBufferAuditHandler.java File fe/src/main/java/org/apache/impala/authorization/ranger/RangerBufferAuditHandler.java: http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/main/java/org/apache/impala/authorization/ranger/RangerBufferAuditHandler.java@139 PS2, Line 139: // TODO: Whether we should use lowercase or uppercase accessType? How would we decide this? What does Hive do? http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/test/java/org/apache/impala/authorization/ranger/RangerAuditLogTest.java File fe/src/test/java/org/apache/impala/authorization/ranger/RangerAuditLogTest.java: http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/test/java/org/apache/impala/authorization/ranger/RangerAuditLogTest.java@454 PS2, Line 454: String databaseName = "functional"; Add a brief comment describing the row filter policies added. http://gerrit.cloudera.org:8080/#/c/16976/2/fe/src/test/java/org/apache/impala/authorization/ranger/RangerAuditLogTest.java@502 PS2, Line 502: authzOk(events -> { Can you comment why we get the row filter event in one case but not the other (I think it's because of the user it's applied to right). http://gerrit.cloudera.org:8080/#/c/16976/2/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test File testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test: http://gerrit.cloudera.org:8080/#/c/16976/2/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test@3 PS2, Line 3: # Row-filtering policy keeps rows with "id % 3 = 0". Is the expected behaviour that row filters are applied before column masks? http://gerrit.cloudera.org:8080/#/c/16976/2/testdata/workloads/functional-query/queries/QueryTest/ranger_row_filtering.test File testdata/workloads/functional-query/queries/QueryTest/ranger_row_filtering.test: http://gerrit.cloudera.org:8080/#/c/16976/2/testdata/workloads/functional-query/queries/QueryTest/ranger_row_filtering.test@1 PS2, Line 1: Can you add a test where the row filter references a column not in the table. it should fail with an analysis exception. There's an extra edge case to check here where it isn't in the table, but could be a correlated reference to an outer query. E.g. This is a valid query, but I think test_id = id would be an invalid row filter on alltypes. select * from jointbl where exists(select * from alltypes where test_id = id); -- To view, visit http://gerrit.cloudera.org:8080/16976 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I580517be241225ca15e45686381b78890178d7cc Gerrit-Change-Number: 16976 Gerrit-PatchSet: 2 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Fang-Yu Rao Gerrit-Review
[Impala-ASF-CR] IMPALA-4805: Avoid hash exchange before analytic function if appropriate
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16888 ) Change subject: IMPALA-4805: Avoid hash exchange before analytic function if appropriate .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16888 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icb6289d1e70cfb6bbd5b38eedb00856dbc85ac77 Gerrit-Change-Number: 16888 Gerrit-PatchSet: 2 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 23:51:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README
Tim Armstrong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17058 ) Change subject: IMPALA-10470: Add link to quickstart container from README .. IMPALA-10470: Add link to quickstart container from README Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e Reviewed-on: http://gerrit.cloudera.org:8080/17058 Reviewed-by: Joe McDonnell Tested-by: Tim Armstrong --- M README.md 1 file changed, 9 insertions(+), 1 deletion(-) Approvals: Joe McDonnell: Looks good to me, approved Tim Armstrong: Verified -- To view, visit http://gerrit.cloudera.org:8080/17058 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e Gerrit-Change-Number: 17058 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17058 ) Change subject: IMPALA-10470: Add link to quickstart container from README .. Patch Set 1: Manually verifying since it's only a readme change. -- To view, visit http://gerrit.cloudera.org:8080/17058 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e Gerrit-Change-Number: 17058 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 17:51:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17058 ) Change subject: IMPALA-10470: Add link to quickstart container from README .. Patch Set 1: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/17058 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e Gerrit-Change-Number: 17058 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 17:51:07 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 4: Hit IMPALA-10501 -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 17:50:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 4: Code-Review+2 fix rat exclusdes -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:24:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Hello Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17050 to look at the new patch set (#4). Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. IMPALA-9382: part 3/3 clean up runtime profile v2 text output Eliminated some of the noisy per-instance counters from DEFAULT verbosity. Testing: * Updated impala-profile-tool test with new output * Added new impala-profile-tool test for v2 profile. Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 --- M be/src/util/runtime-profile.cc M bin/rat_exclude_files.txt M testdata/impala-profiles/README M testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2 A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt M tests/observability/test_profile_tool.py 8 files changed, 5,760 insertions(+), 516 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/4 -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17058 Change subject: IMPALA-10470: Add link to quickstart container from README .. IMPALA-10470: Add link to quickstart container from README Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e --- M README.md 1 file changed, 9 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/17058/1 -- To view, visit http://gerrit.cloudera.org:8080/17058 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e Gerrit-Change-Number: 17058 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 2: Code-Review+2 carry -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Hello Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17050 to look at the new patch set (#2). Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. IMPALA-9382: part 3/3 clean up runtime profile v2 text output Eliminated some of the noisy per-instance counters from DEFAULT verbosity. Testing: * Updated impala-profile-tool test with new output * Added new impala-profile-tool test for v2 profile. Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 --- M be/src/util/runtime-profile.cc M testdata/impala-profiles/README M testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2 A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt M tests/observability/test_profile_tool.py 7 files changed, 5,757 insertions(+), 516 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/2 -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py File tests/observability/test_profile_tool.py: http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py@50 PS1, Line 50: ) > flake8: E501 line too long (91 > 90 characters) Done http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py@53 PS1, Line 53: ) > flake8: E501 line too long (92 > 90 characters) Done -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#34). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. This interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle the partitioned case, for which we use a std::map and a comparator based on the partition exprs. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/exprs/slot-ref.h M be/src/service/query-options-test.cc M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/priority-queue.h M be/src/util/tuple-row-compare.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/or
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16242 ) Change subject: IMPALA-9979: part 2: partitioned top-n .. Patch Set 33: (2 comments) http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc File be/src/exec/topn-node.cc: http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@559 PS30, Line 559: // We evict heaps starting with the heaps that were least effective at filtering > I think I was partly thinking of the computation within the AnalyticEval no Yup http://gerrit.cloudera.org:8080/#/c/16242/33/testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test File testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test: http://gerrit.cloudera.org:8080/#/c/16242/33/testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test@22 PS33, Line 22: limit predicate > When browsing through the plans one more time, it occurred to me that some I think that makes sense, it is introducing a very specific new term that might be confusing. I replaced it. -- To view, visit http://gerrit.cloudera.org:8080/16242 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 Gerrit-Change-Number: 16242 Gerrit-PatchSet: 33 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 10 Feb 2021 07:21:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17030 ) Change subject: IMPALA-10469: push quickstart to apache repo .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/17030/2/docker/publish_images_to_apache.sh File docker/publish_images_to_apache.sh: http://gerrit.cloudera.org:8080/#/c/17030/2/docker/publish_images_to_apache.sh@62 PS2, Line 62: IMAGES+=" impala_quickstart_client impala_quickstart_hms" > Ok, I think a coherent first step is that docker-images.txt is a list of al Good point. I did this and re-pushed the images. Also removed the -d flag cause I don't think that was particularly useful and it got more complicated when we had images that didn't have debug values. -- To view, visit http://gerrit.cloudera.org:8080/17030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a Gerrit-Change-Number: 17030 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 09 Feb 2021 20:27:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo
Hello Grant Henke, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17030 to look at the new patch set (#3). Change subject: IMPALA-10469: push quickstart to apache repo .. IMPALA-10469: push quickstart to apache repo This adds a script, docker/publish_images_to_apache.sh, that allows uploading images to the apache/impala docker hub repo, prefixed with a version string. E.g. with the following commands: ninja docker_images quickstart_docker_images ./docker/publish_images_to_apache.sh -v 81d5377c2 The uploaded images can then be used for the quickstart cluster, as documented in docker/README. Updated docs for quickstart to use a prefix from apache/impala Remove IMPALA_QUICKSTART_VERSION, which doesn't interact well with the tagging since the image name and version are now encoded in the tag. Fix an incorrect image name added to docker-images.txt: impala_profile_tool_image. Testing: Ran Impala quickstart with data loading using instructions in README. export IMPALA_QUICKSTART_IMAGE_PREFIX="apache/impala:81d5377c2-" docker network create -d bridge quickstart-network export QUICKSTART_IP=$(docker network inspect quickstart-network -f '{{(index .IPAM.Config 0).Gateway}}') export QUICKSTART_LISTEN_ADDR=$QUICKSTART_IP docker-compose -f docker/quickstart.yml \ -f docker/quickstart-kudu-minimal.yml \ -f docker/quickstart-load-data.yml up -d docker run --network=quickstart-network -it \ ${IMPALA_QUICKSTART_IMAGE_PREFIX}impala_quickstart_client impala-shell Change-Id: I535d77e565b73d732ae511d7525193467086c76a --- M docker/CMakeLists.txt M docker/README.md A docker/publish_images_to_apache.sh M docker/quickstart-load-data.yml M docker/quickstart.yml 5 files changed, 115 insertions(+), 14 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/17030/3 -- To view, visit http://gerrit.cloudera.org:8080/17030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a Gerrit-Change-Number: 17030 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9586: update query option docs for mt dop
Tim Armstrong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17043 ) Change subject: IMPALA-9586: update query option docs for mt_dop .. IMPALA-9586: update query option docs for mt_dop There are interactions between mt_dop and num_nodes and num_scanner_threads. Mention these in the docs. Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd Reviewed-on: http://gerrit.cloudera.org:8080/17043 Tested-by: Impala Public Jenkins Reviewed-by: Joe McDonnell --- M docs/topics/impala_num_nodes.xml M docs/topics/impala_num_scanner_threads.xml 2 files changed, 9 insertions(+), 0 deletions(-) Approvals: Impala Public Jenkins: Verified Joe McDonnell: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/17043 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd Gerrit-Change-Number: 17043 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17050 Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. IMPALA-9382: part 3/3 clean up runtime profile v2 text output Eliminated some of the noisy per-instance counters from DEFAULT verbosity. Testing: * Updated impala-profile-tool test with new output * Added new impala-profile-tool test for v2 profile. Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 --- M be/src/util/runtime-profile.cc M testdata/impala-profiles/README M testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2 A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt M tests/observability/test_profile_tool.py 7 files changed, 5,755 insertions(+), 516 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/1 -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-9586: update query option docs for mt dop
Hello Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17043 to look at the new patch set (#2). Change subject: IMPALA-9586: update query option docs for mt_dop .. IMPALA-9586: update query option docs for mt_dop There are interactions between mt_dop and num_nodes and num_scanner_threads. Mention these in the docs. Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd --- M docs/topics/impala_num_nodes.xml M docs/topics/impala_num_scanner_threads.xml 2 files changed, 9 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/17043/2 -- To view, visit http://gerrit.cloudera.org:8080/17043 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd Gerrit-Change-Number: 17043 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell
[Impala-ASF-CR] IMPALA-9586: update query option docs for mt dop
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17043 Change subject: IMPALA-9586: update query option docs for mt_dop .. IMPALA-9586: update query option docs for mt_dop There are interactions between mt_dop and num_nodes and num_scanner_threads. Mention these in the docs. Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd --- M docs/topics/impala_num_nodes.xml M docs/topics/impala_num_scanner_threads.xml 2 files changed, 9 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/17043/1 -- To view, visit http://gerrit.cloudera.org:8080/17043 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I3d9a6f56ffaf211d7d3ca1fad506ee83d516ccbd Gerrit-Change-Number: 17043 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-8721: re-enable test hive impala interop
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17042 Change subject: IMPALA-8721: re-enable test_hive_impala_interop .. IMPALA-8721: re-enable test_hive_impala_interop The test now passes because HIVE-21290 was fixed. Revert "IMPALA-8689: test_hive_impala_interop failing with "Timeout >7200s"" This reverts commit 5d8c99ce74c45a7d04f11e1f252b346d654f02bf. Change-Id: I7e2beabd7082a45a0fc3b60d318cf698079768ff --- M tests/custom_cluster/test_hive_parquet_codec_interop.py 1 file changed, 2 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/17042/1 -- To view, visit http://gerrit.cloudera.org:8080/17042 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I7e2beabd7082a45a0fc3b60d318cf698079768ff Gerrit-Change-Number: 17042 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16242 ) Change subject: IMPALA-9979: part 2: partitioned top-n .. Patch Set 30: (21 comments) http://gerrit.cloudera.org:8080/#/c/16242/30//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16242/30//COMMIT_MSG@49 PS30, Line 49: The > nit: 'This' Done http://gerrit.cloudera.org:8080/#/c/16242/30//COMMIT_MSG@57 PS30, Line 57: both : the partitioning > nit: ..and non-partitioned case ? Done http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.h File be/src/exec/topn-node.h: http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.h@133 PS30, Line 133: partition are returned > Would be good to add how the returned partitions are ordered among themselv It seems worth being clear about the invariant, so added it here. http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc File be/src/exec/topn-node.cc: http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@53 PS30, Line 53: the the > nit: repetition 'the' D'oh http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@59 PS30, Line 59: the the > nit: repetition 'the' D'oh http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@398 PS30, Line 398: RETURN_IF_CANCELLED(state); > I was wondering if the inner while loop (line 418) should also have a cance The inner loop is bounded by the batch size so should be fine (the general rule has been to check for cancellation at O(batch size) intervals). http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@559 PS30, Line 559: // We evict heaps starting with the heaps that were least effective at filtering > The eviction policy is interesting. Considering that a rank()/rownum() can Maybe I didn't quite get the point, but we should be able to start filtering as soon as we have N elements in a heap for the partition - after that point, each new row is either in the top N we've seen so far (in which case we discard a previous row) or it's not (in which case we discard the new row). http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@666 PS30, Line 666: // The key references memory in 'tuple_pool_'. Replace it with a rematerialized tuple. > line too long (92 > 90) Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/analysis/SortInfo.java File fe/src/main/java/org/apache/impala/analysis/SortInfo.java: http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@279 PS30, Line 279: public long estimateTopNMaterializedSize(long totalRows) { > Should this estimate account for any per-partition overhead (additional met We could probably add 100 bytes or so to account for the size of the heap, but I feel like it probably doesn't really affect estimates all that much and adds a bit of complexity. I did realise that the name of this method is a bit misleading, so renamed it to reflect that it's just estimating the size of that many rows. http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java: http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@646 PS30, Line 646: limit on > nit: incomplete sentence Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java File fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java: http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@453 PS30, Line 453: if (limit.limit > 1) { > Maybe simplify this if-else to planNodeLimit = Math.max(limit.limit, 1) Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@461 PS30, Line 461: if (partitionByExprs.isEmpty()) { > There was a recent issue with constant partition-by and order-by exprs (htt I modified this to check for whether they are all literals and added a test. There's a bit of a distinction between "constant", which means assumed constant within a query and literals that are guaranteed constant, so I used the latter. http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@885 PS30, Line 885: /// Whether the source predicate was a simple < or <= inequality. > This sounded confusing at first ..I interpreted it to mean true for < (the Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@928 PS30, Line 928: to . > nit: incomplete Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/DistributedPlanner
[Impala-ASF-CR] wip
Tim Armstrong has abandoned this change. ( http://gerrit.cloudera.org:8080/17040 ) Change subject: wip .. Abandoned -- To view, visit http://gerrit.cloudera.org:8080/17040 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: If1919b896ee3e4eaef2d02cf8b1d29d7a0996632 Gerrit-Change-Number: 17040 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#32). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. This interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle the partitioned case, for which we use a std::map and a comparator based on the partition exprs. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/exprs/slot-ref.h M be/src/service/query-options-test.cc M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/priority-queue.h M be/src/util/tuple-row-compare.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/or
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#31). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. This interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle the partitioned case, for which we use a std::map and a comparator based on the partition exprs. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/exprs/slot-ref.h M be/src/service/query-options-test.cc M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/priority-queue.h M be/src/util/tuple-row-compare.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/or
[Impala-ASF-CR] wip
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17040 Change subject: wip .. wip Change-Id: If1919b896ee3e4eaef2d02cf8b1d29d7a0996632 --- M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M fe/src/main/java/org/apache/impala/analysis/SortInfo.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SelectNode.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test 10 files changed, 159 insertions(+), 362 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/17040/1 -- To view, visit http://gerrit.cloudera.org:8080/17040 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: If1919b896ee3e4eaef2d02cf8b1d29d7a0996632 Gerrit-Change-Number: 17040 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17030 ) Change subject: IMPALA-10469: push quickstart to apache repo .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/17030/2/docker/publish_images_to_apache.sh File docker/publish_images_to_apache.sh: http://gerrit.cloudera.org:8080/#/c/17030/2/docker/publish_images_to_apache.sh@62 PS2, Line 62: IMAGES+=" impala_quickstart_client impala_quickstart_hms" > What consumes docker-images.txt other than this? Should we just include the I've seen some build tools (internal to Cloudera) that basically intersect the images in docker-images.txt with the image tags generated by the build and then consider those build outputs. We don't build these quickstart images as part of the docker_images target, but I guess that's true of the debug images as well, so maybe it's harmless to add these. What do you think? -- To view, visit http://gerrit.cloudera.org:8080/17030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a Gerrit-Change-Number: 17030 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 09 Feb 2021 01:40:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo
Hello Grant Henke, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17030 to look at the new patch set (#2). Change subject: IMPALA-10469: push quickstart to apache repo .. IMPALA-10469: push quickstart to apache repo This adds a script, docker/publish_images_to_apache.sh, that allows uploading images to the apache/impala docker hub repo, prefixed with a version string. E.g. with the following commands: ninja docker_images quickstart_docker_images ./docker/publish_images_to_apache.sh -v 81d5377c2 The uploaded images can then be used for the quickstart cluster, as documented in docker/README. Updated docs for quickstart to use a prefix from apache/impala Remove IMPALA_QUICKSTART_VERSION, which doesn't interact well with the tagging since the image name and version are now encoded in the tag. Fix an incorrect image name added to docker-images.txt: impala_profile_tool_image. Testing: Ran Impala quickstart with data loading using instructions in README. export IMPALA_QUICKSTART_IMAGE_PREFIX="apache/impala:81d5377c2-" docker network create -d bridge quickstart-network export QUICKSTART_IP=$(docker network inspect quickstart-network -f '{{(index .IPAM.Config 0).Gateway}}') export QUICKSTART_LISTEN_ADDR=$QUICKSTART_IP docker-compose -f docker/quickstart.yml \ -f docker/quickstart-kudu-minimal.yml \ -f docker/quickstart-load-data.yml up -d docker run --network=quickstart-network -it \ ${IMPALA_QUICKSTART_IMAGE_PREFIX}impala_quickstart_client impala-shell Change-Id: I535d77e565b73d732ae511d7525193467086c76a --- M docker/CMakeLists.txt M docker/README.md A docker/publish_images_to_apache.sh M docker/quickstart-load-data.yml M docker/quickstart.yml 5 files changed, 111 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/17030/2 -- To view, visit http://gerrit.cloudera.org:8080/17030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a Gerrit-Change-Number: 17030 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell
[Impala-ASF-CR] IMPALA-10469: push quickstart to apache repo
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17030 Change subject: IMPALA-10469: push quickstart to apache repo .. IMPALA-10469: push quickstart to apache repo This adds a script, docker/publish_images_to_apache.sh, that allows uploading images to the apache/impala docker hub repo, prefixed with a version string. E.g. with the following commands: ninja docker_images quickstart_docker_images ./docker/publish_images_to_apache.sh -v 81d5377c2 The uploaded images can then be used for the quickstart cluster, as documented in docker/README. Updated docs for quickstart to use a prefix from apache/impala Remove IMPALA_QUICKSTART_VERSION, which doesn't interact well with the tagging since the image name and version are now encoded in the tag. Fix an incorrect image name added to docker-images.txt: impala_profile_tool_image. Testing: Ran Impala quickstart with data loading using instructions in README. export IMPALA_QUICKSTART_IMAGE_PREFIX="apache/impala:81d5377c2-" docker network create -d bridge quickstart-network export QUICKSTART_IP=$(docker network inspect quickstart-network -f '{{(index .IPAM.Config 0).Gateway}}') export QUICKSTART_LISTEN_ADDR=$QUICKSTART_IP docker-compose -f docker/quickstart.yml \ -f docker/quickstart-kudu-minimal.yml \ -f docker/quickstart-load-data.yml up -d docker run --network=quickstart-network -it \ ${IMPALA_QUICKSTART_IMAGE_PREFIX}impala_quickstart_client impala-shell Change-Id: I535d77e565b73d732ae511d7525193467086c76a --- M docker/CMakeLists.txt M docker/README.md A docker/publish_images_to_apache.sh M docker/quickstart-load-data.yml M docker/quickstart.yml 5 files changed, 111 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/17030/1 -- To view, visit http://gerrit.cloudera.org:8080/17030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I535d77e565b73d732ae511d7525193467086c76a Gerrit-Change-Number: 17030 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-10475: [DOCS] elaborate SYNC DDL option
Tim Armstrong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17027 ) Change subject: IMPALA-10475: [DOCS] elaborate SYNC_DDL option .. IMPALA-10475: [DOCS] elaborate SYNC_DDL option call out that SYNC_DDL applies to all filesystem-based tables Change-Id: I3f1bfce8430c681515101d00cabf9d70ae52e5ec Reviewed-on: http://gerrit.cloudera.org:8080/17027 Tested-by: Impala Public Jenkins Reviewed-by: Tim Armstrong --- M docs/topics/impala_sync_ddl.xml 1 file changed, 7 insertions(+), 7 deletions(-) Approvals: Impala Public Jenkins: Verified Tim Armstrong: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/17027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I3f1bfce8430c681515101d00cabf9d70ae52e5ec Gerrit-Change-Number: 17027 Gerrit-PatchSet: 2 Gerrit-Owner: Shajini Thayasingh Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10475: [DOCS] elaborate SYNC DDL option
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17027 ) Change subject: IMPALA-10475: [DOCS] elaborate SYNC_DDL option .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/17027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3f1bfce8430c681515101d00cabf9d70ae52e5ec Gerrit-Change-Number: 17027 Gerrit-PatchSet: 1 Gerrit-Owner: Shajini Thayasingh Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 05 Feb 2021 22:11:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10389: impala-profile-tool container
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17015 ) Change subject: IMPALA-10389: impala-profile-tool container .. Patch Set 1: (4 comments) http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt File docker/CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt@132 PS1, Line 132: file(WRITE ${CMAKE_SOURCE_DIR}/docker/docker-images.txt "${exported_image_names}") > I'm wondering whether it is significant that the profile tool image is adde You're right, this should go at the end. I didn't check the contents of the file. http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt@172 PS1, Line 172: setup_utility_build_context.py > I think this should be setup_build_context.py? Done http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt@184 PS1, Line 184: WORKING_DIRECTORY ${IMPALA_UTILITY_BUILD_CONTEXT_DIR}/${build_type} > Nit: indentation should line up with COMMAND Done http://gerrit.cloudera.org:8080/#/c/17015/1/docker/CMakeLists.txt@186 PS1, Line 186: #DEPENDS ${CMAKE_SOURCE_DIR}/bin/graceful_shutdown_backends.sh > Nit: stray line? Missing the entrypoint script. -- To view, visit http://gerrit.cloudera.org:8080/17015 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714 Gerrit-Change-Number: 17015 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 05 Feb 2021 00:39:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10389: impala-profile-tool container
Hello Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17015 to look at the new patch set (#2). Change subject: IMPALA-10389: impala-profile-tool container .. IMPALA-10389: impala-profile-tool container Add a build step for an impala-profile-tool docker image that makes it easy to run the binary on any system. This container is automatically built as part of the docker build. This sets up a new build context that doesn't pull in all of the same dependencies or depend on the Java build Testing: cat logs/cluster/profiles/* | \ docker run -i impala_profile_tool I uploaded a build of the container to dockerhub too: timgarmstrong/impala_profile_tool Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714 --- M docker/CMakeLists.txt A docker/impala_profile_tool/Dockerfile M docker/setup_build_context.py A docker/utility_entrypoint.sh 4 files changed, 203 insertions(+), 46 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/17015/2 -- To view, visit http://gerrit.cloudera.org:8080/17015 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714 Gerrit-Change-Number: 17015 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10274: Initialize impala-python as part of the CMake build
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16607 ) Change subject: IMPALA-10274: Initialize impala-python as part of the CMake build .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16607 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ieff51263c55bd234028fed7101c94b4a928590f0 Gerrit-Change-Number: 16607 Gerrit-PatchSet: 5 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 04 Feb 2021 00:46:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#30). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. The interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle both the partitioning (for which we use a std::map and a comparator based on the partition exprs). The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/exprs/slot-ref.h M be/src/service/query-options-test.cc M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/priority-queue.h M be/src/util/tuple-row-compare.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/o
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Thomas Tauber-Marshall, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#29). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. The interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle both the partitioning (for which we use a std::map and a comparator based on the partition exprs). The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/exprs/slot-ref.h M be/src/service/query-options-test.cc M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/priority-queue.h M be/src/util/tuple-row-compare.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/o
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16242 ) Change subject: IMPALA-9979: part 2: partitioned top-n .. Patch Set 28: (7 comments) http://gerrit.cloudera.org:8080/#/c/16242/28//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16242/28//COMMIT_MSG@59 PS28, Line 59: and the tie-handling : semantics required by rank() predicates > nit: I think this was really implemented in your previous patch? Done http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.h File be/src/exec/topn-node.h: http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.h@64 PS28, Line 64: int64_t limit = is_partitioned() ? per_partition_limit() : > What's the relationship between 'include_ties' and 'is_partitioned', i.e. w tried to make it less confusing... http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc File be/src/exec/topn-node.cc: http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc@244 PS28, Line 244: U > typo Done http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc@399 PS28, Line 399: RETURN_IF_ERROR(QueryMaintenance(state)); > This results in two calls to QueryMaintenance() in quick succession, here a Good point, I just moved it to GetNextUnpartitioned() so that we don't double up in this method. http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc@566 PS28, Line 566: be > typo Done http://gerrit.cloudera.org:8080/#/c/16242/28/be/src/exec/topn-node.cc@666 PS28, Line 666: vector> rematerialized_heaps; : for (auto& entry : partition_heaps_) { : RETURN_IF_ERROR(entry.second->RematerializeTuples(this, state, temp_pool.get())); : DCHECK(entry.second->DCheckConsistency()); : // The key references memory in 'tuple_pool_'. Replace it with a rematerialized tuple. : rematerialized_heaps.push_back(move(entry.second)); : } : partition_heaps_.clear(); : for (auto& heap_ptr : rematerialized_heaps) { : const Tuple* key_tuple = heap_ptr->top(); : partition_heaps_.emplace(key_tuple, move(heap_ptr)); : } > I think this can be put in an 'else' with the above 'if (heap_ != nullptr)' Done http://gerrit.cloudera.org:8080/#/c/16242/28/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/16242/28/common/thrift/ImpalaService.thrift@625 PS28, Line 625: // If > 0, the rank()/row_number() pushdown into pre-analytic sorts is enabled > Maybe note the default value, and briefly the issues with setting it higher Done -- To view, visit http://gerrit.cloudera.org:8080/16242 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 Gerrit-Change-Number: 16242 Gerrit-PatchSet: 28 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 02 Feb 2021 18:53:14 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9588: Add extra logging to cancel tests
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16985 ) Change subject: IMPALA-9588: Add extra logging to cancel tests .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16985 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ied7100a9ea2e2f0611cf8e328e589b4c8e5d5100 Gerrit-Change-Number: 16985 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tamas Mate Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 02 Feb 2021 17:33:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10389: impala-profile-tool container
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17015 Change subject: IMPALA-10389: impala-profile-tool container .. IMPALA-10389: impala-profile-tool container Add a build step for an impala-profile-tool docker image that makes it easy to run the binary on any system. This container is automatically built as part of the docker build. This sets up a new build context that doesn't pull in all of the same dependencies or depend on the Java build Testing: cat logs/cluster/profiles/* | \ docker run -i impala_profile_tool I uploaded a build of the container to dockerhub too: timgarmstrong/impala_profile_tool Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714 --- M docker/CMakeLists.txt A docker/impala_profile_tool/Dockerfile M docker/setup_build_context.py A docker/utility_entrypoint.sh 4 files changed, 200 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/17015/1 -- To view, visit http://gerrit.cloudera.org:8080/17015 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I36915cd686ab930dcc934bc0c81bff8c16d46714 Gerrit-Change-Number: 17015 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-10454: Bump --ssl minimum version to tls1.2
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16988 ) Change subject: IMPALA-10454: Bump --ssl_minimum_version to tls1.2 .. Patch Set 2: Code-Review+2 Thanks for fixing this! -- To view, visit http://gerrit.cloudera.org:8080/16988 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifed66646b041a061f9db92744710aef7453f39e4 Gerrit-Change-Number: 16988 Gerrit-PatchSet: 2 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 27 Jan 2021 21:37:29 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-9234: Support Ranger row filtering policies
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16976 ) Change subject: [WIP] IMPALA-9234: Support Ranger row filtering policies .. Patch Set 1: (4 comments) http://gerrit.cloudera.org:8080/#/c/16976/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16976/1//COMMIT_MSG@33 PS1, Line 33:place I think we need to add some more tests for different row filter types, e.g. more complex expressions, invalid expressions, etc. Are there any restrictions on the types of expressions that can be used in row filters? E.g. are UDFs allowed? http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/TableMask.java File fe/src/main/java/org/apache/impala/authorization/TableMask.java: http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/TableMask.java@102 PS1, Line 102: String stmtSql = String.format("SELECT * FROM %s.%s WHERE %s", We might need to be a bit careful with testing invalid row filters that don't parse, etc. It seems like this approach should work, but just trying to think about the risks. http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java File fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java: http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@397 PS1, Line 397: List auditEvents = auditHandler.getAuthzEvents(); Can we factor out this code and share it with the similar logic above? This code seems subtle and it would be good to only have one implementation. http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationContext.java File fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationContext.java: http://gerrit.cloudera.org:8080/#/c/16976/1/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationContext.java@56 PS1, Line 56:* We stash the List of AuthzAuditEvent's in a Map after the analysis of the query and This comment probably needs an update. I find this method a bit confusing overall. Maybe it would be clearer if the comment described what it means to call this method? Or when it would be called? The comment seems to mainly discuss the implementation. -- To view, visit http://gerrit.cloudera.org:8080/16976 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I580517be241225ca15e45686381b78890178d7cc Gerrit-Change-Number: 16976 Gerrit-PatchSet: 1 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 27 Jan 2021 20:42:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10382: fix invalid outer join simplification
Tim Armstrong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16845 ) Change subject: IMPALA-10382: fix invalid outer join simplification .. IMPALA-10382: fix invalid outer join simplification When set ENABLE_OUTER_JOIN_TO_INNER_TRANSFORMATION = true, the planner will simplify outer joins if the predicate with case expr or conditional function on both sides of outer join. However, the predicate maybe not null-rejecting, if simplify the outer join, the result is incorrect. E.g. t1.b > coalesce(t1.c, t2.c) can return true if t2.c is null, so it is not null-rejecting predicate for t2. The fix is simply to support the case that the predicate with two operands and the operator is one of (=, !=, >, <, >=, <=), 1. one of the operands or 2. if the operand is arithmetic expression and one of the children does not contain conditional builtin function or case expr and has tuple id in outer joined tuples. E.g. t1.b > coalesce(t2.c, t1.c) or t1.b + coalesce(t2.c, t1.c) > coalesce(t2.c, t1.c) is null-rejecting predicate for t1. Testing: * Add new plan tests in outer-to-inner-joins.test * Add new query tests to verify the correctness on transformation Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b Reviewed-on: http://gerrit.cloudera.org:8080/16845 Tested-by: Impala Public Jenkins Reviewed-by: Tim Armstrong --- M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M testdata/workloads/functional-planner/queries/PlannerTest/outer-to-inner-joins.test M testdata/workloads/functional-query/queries/QueryTest/outer-to-inner-joins.test 3 files changed, 223 insertions(+), 11 deletions(-) Approvals: Impala Public Jenkins: Verified Tim Armstrong: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/16845 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b Gerrit-Change-Number: 16845 Gerrit-PatchSet: 5 Gerrit-Owner: Xianqing He Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Xianqing He
[Impala-ASF-CR] IMPALA-10382: fix invalid outer join simplification
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16845 ) Change subject: IMPALA-10382: fix invalid outer join simplification .. Patch Set 4: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16845 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b Gerrit-Change-Number: 16845 Gerrit-PatchSet: 4 Gerrit-Owner: Xianqing He Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Xianqing He Gerrit-Comment-Date: Wed, 27 Jan 2021 17:30:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10274: Initialize impala-python as part of the CMake build
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16607 ) Change subject: IMPALA-10274: Initialize impala-python as part of the CMake build .. Patch Set 4: (3 comments) Seems overall fine, had some minor comments. http://gerrit.cloudera.org:8080/#/c/16607/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16607/4//COMMIT_MSG@19 PS4, Line 19: - Rebuilt and verified that impala-python is not reinitialized Are there any considerations about testing against different OS versions? I guess mostly we're just using python2.7 in most places? http://gerrit.cloudera.org:8080/#/c/16607/4/be/src/codegen/gen_ir_descriptions.py File be/src/codegen/gen_ir_descriptions.py: http://gerrit.cloudera.org:8080/#/c/16607/4/be/src/codegen/gen_ir_descriptions.py@1 PS4, Line 1: #!/usr/bin/env python Maybe comment why this doesn't use impala-python? http://gerrit.cloudera.org:8080/#/c/16607/4/bin/gen_build_version.py File bin/gen_build_version.py: http://gerrit.cloudera.org:8080/#/c/16607/4/bin/gen_build_version.py@1 PS4, Line 1: #!/usr/bin/env python Maybe comment why this doesn't use impala-python? -- To view, visit http://gerrit.cloudera.org:8080/16607 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ieff51263c55bd234028fed7101c94b4a928590f0 Gerrit-Change-Number: 16607 Gerrit-PatchSet: 4 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 27 Jan 2021 02:20:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10382: fix invalid outer join simplification
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16845 ) Change subject: IMPALA-10382: fix invalid outer join simplification .. Patch Set 2: (1 comment) Looks good to me except for a typo http://gerrit.cloudera.org:8080/#/c/16845/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java File fe/src/main/java/org/apache/impala/analysis/Analyzer.java: http://gerrit.cloudera.org:8080/#/c/16845/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@3368 PS2, Line 3368: Builin nit: Builtin -- To view, visit http://gerrit.cloudera.org:8080/16845 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b Gerrit-Change-Number: 16845 Gerrit-PatchSet: 2 Gerrit-Owner: Xianqing He Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 27 Jan 2021 02:14:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8306: clarify wording on /sessions UI
Hello Vincent Tran, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16981 to look at the new patch set (#2). Change subject: IMPALA-8306: clarify wording on /sessions UI .. IMPALA-8306: clarify wording on /sessions UI Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e --- M www/sessions.tmpl 1 file changed, 9 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/81/16981/2 -- To view, visit http://gerrit.cloudera.org:8080/16981 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e Gerrit-Change-Number: 16981 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vincent Tran
[Impala-ASF-CR] IMPALA-8306: clarify wording on /sessions UI
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16981 ) Change subject: IMPALA-8306: clarify wording on /sessions UI .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/16981/1/www/sessions.tmpl File www/sessions.tmpl: http://gerrit.cloudera.org:8080/#/c/16981/1/www/sessions.tmpl@29 PS1, Line 29: once an expired session has been cleaned up > Is there a mechanism to clean up expired session? Or are you referring to a Very good point, I was not clearly understanding how this worked - you are right that the expired session can stick around indefinitely unless -disconnected_session_timeout kicks in. Updated this, hope it clarifies some of the isseus. -- To view, visit http://gerrit.cloudera.org:8080/16981 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e Gerrit-Change-Number: 16981 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vincent Tran Gerrit-Comment-Date: Tue, 26 Jan 2021 20:52:04 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support
Tim Armstrong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16982 ) Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support .. IMPALA-10404: Update docs to reflect RLE_DICTIONARY support Fix references to PLAIN_DICTIONARY to reflect that RLE_DICTIONARY is supported too. Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335 Reviewed-on: http://gerrit.cloudera.org:8080/16982 Reviewed-by: Tim Armstrong Tested-by: Impala Public Jenkins --- M docs/shared/impala_common.xml M docs/topics/impala_parquet_dictionary_filtering.xml 2 files changed, 8 insertions(+), 5 deletions(-) Approvals: Tim Armstrong: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/16982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335 Gerrit-Change-Number: 16982 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shajini Thayasingh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16982 ) Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/16982/1/docs/topics/impala_parquet_dictionary_filtering.xml File docs/topics/impala_parquet_dictionary_filtering.xml: http://gerrit.cloudera.org:8080/#/c/16982/1/docs/topics/impala_parquet_dictionary_filtering.xml@61 PS1, Line 61: > nit: extra spaces Done -- To view, visit http://gerrit.cloudera.org:8080/16982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335 Gerrit-Change-Number: 16982 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shajini Thayasingh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 26 Jan 2021 16:35:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16982 ) Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support .. Patch Set 2: Code-Review+2 carry -- To view, visit http://gerrit.cloudera.org:8080/16982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335 Gerrit-Change-Number: 16982 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shajini Thayasingh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 26 Jan 2021 16:35:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support
Hello Zoltan Borok-Nagy, Shajini Thayasingh, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16982 to look at the new patch set (#2). Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support .. IMPALA-10404: Update docs to reflect RLE_DICTIONARY support Fix references to PLAIN_DICTIONARY to reflect that RLE_DICTIONARY is supported too. Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335 --- M docs/shared/impala_common.xml M docs/topics/impala_parquet_dictionary_filtering.xml 2 files changed, 8 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16982/2 -- To view, visit http://gerrit.cloudera.org:8080/16982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335 Gerrit-Change-Number: 16982 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shajini Thayasingh Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9793: Impala quickstart cluster with docker-compose
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15966 ) Change subject: IMPALA-9793: Impala quickstart cluster with docker-compose .. Patch Set 11: Hit IMPALA-10316 -- To view, visit http://gerrit.cloudera.org:8080/15966 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c Gerrit-Change-Number: 15966 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 26 Jan 2021 05:38:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#28). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. The interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle both the partitioning (for which we use a std::map and a comparator based on the partition exprs) and the tie-handling semantics required by rank() predicates. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. We currently use the partitioned top-n node to implement rank() pushdown in all cases because of the tie-handling support. We also cannot use the merging exchange for rank() because the limit does not handle ties in the same way, so we need to generate a hash exchange with a partitioned top-n node on top of the exchange. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.
[Impala-ASF-CR] IMPALA-10404: Update docs to reflect RLE DICTIONARY support
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16982 Change subject: IMPALA-10404: Update docs to reflect RLE_DICTIONARY support .. IMPALA-10404: Update docs to reflect RLE_DICTIONARY support Fix references to PLAIN_DICTIONARY to reflect that RLE_DICTIONARY is supported too. Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335 --- M docs/shared/impala_common.xml M docs/topics/impala_parquet_dictionary_filtering.xml 2 files changed, 8 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16982/1 -- To view, visit http://gerrit.cloudera.org:8080/16982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335 Gerrit-Change-Number: 16982 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-8306: clarify wording on /sessions UI
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16981 Change subject: IMPALA-8306: clarify wording on /sessions UI .. IMPALA-8306: clarify wording on /sessions UI Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e --- M www/sessions.tmpl 1 file changed, 6 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/81/16981/1 -- To view, visit http://gerrit.cloudera.org:8080/16981 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e Gerrit-Change-Number: 16981 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-9793: Impala quickstart cluster with docker-compose
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15966 ) Change subject: IMPALA-9793: Impala quickstart cluster with docker-compose .. Patch Set 11: Thanks for the review! -- To view, visit http://gerrit.cloudera.org:8080/15966 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c Gerrit-Change-Number: 15966 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 25 Jan 2021 21:34:06 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP: IMPALA-1652: fix char semantics
Tim Armstrong has abandoned this change. ( http://gerrit.cloudera.org:8080/16339 ) Change subject: WIP: IMPALA-1652: fix char semantics .. Abandoned -- To view, visit http://gerrit.cloudera.org:8080/16339 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: I00980c162257ce85e7ec2f4f03c48b142e5ef2e2 Gerrit-Change-Number: 16339 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP: IMPALA-1652: fix char semantics
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16339 Change subject: WIP: IMPALA-1652: fix char semantics .. WIP: IMPALA-1652: fix char semantics select 'expected 2',count(*) from ax where t = cast('a ' as varchar(10)) Change-Id: I00980c162257ce85e7ec2f4f03c48b142e5ef2e2 --- M be/src/exprs/anyval-util.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/string-functions-ir.cc M be/src/testutil/test-udas.cc M be/src/udf/uda-test-harness.h M be/src/udf/udf.cc M be/src/udf/udf.h M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java M fe/src/main/java/org/apache/impala/analysis/StringLiteral.java 9 files changed, 56 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/16339/3 -- To view, visit http://gerrit.cloudera.org:8080/16339 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I00980c162257ce85e7ec2f4f03c48b142e5ef2e2 Gerrit-Change-Number: 16339 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP: IMPALA-2138: part 4: implicit cast handling
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/14399 Change subject: WIP: IMPALA-2138: part 4: implicit cast handling .. WIP: IMPALA-2138: part 4: implicit cast handling This solves some latent issues with implicit casts that caused test failures for my projection patch and generally make it difficult to safely use expression substitution to replace expressions with projected versions. When doing expression substitution, it is critical that the behaviour of expressions should *not* change - we want to make the plan more efficient without any change in results. Implicit casts proved problematic for projection because they got removed automatically during expression substitution. In many cases the resulting expression actually changed semantics. E.g. c = cast('foo' as CHAR(4)), where c is a CHAR(3) requires a cast to CHAR(4) to be inserted. Expr.substitute() can actually remove this cast. There is also some "interesting" behaviour where implicit casts are not considered when comparing two expr trees for equality, which could have all kinds of peculiar consequences in edge cases where non-equivalent exprs get replaced. This change introduces a concept of an explicit, planner-generated cast that will be preserved through expression substitution. All implicit casts are promoted to these explicit casts during plan generation and any new casts inserted in the plan's exprs must be explicit casts. A validateExprs() method is added that verifies this. validateExprs() also checks the invariant that exprs must reference valid tuple IDs, which will be important for the follow-on projection patch. changes as much as possible. This patch avoids changes to handling of implicit casts during analysis to avoid unintended changes in behaviour. TODO: * document in CastExpr * validate expressions in data sinks for lack of implicit casts * decouple plan changes with casts from this change Change-Id: Ie50b7d28d691d65ee721581216d227d52f04ecf5 --- M fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/analysis/AnalyticInfo.java M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/BetweenPredicate.java M fe/src/main/java/org/apache/impala/analysis/BoolLiteral.java M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ColumnDef.java M fe/src/main/java/org/apache/impala/analysis/DateLiteral.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/ExprSubstitutionMap.java A fe/src/main/java/org/apache/impala/analysis/ExprSubstitutionMode.java M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/KuduPartitionExpr.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/ModifyStmt.java M fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java M fe/src/main/java/org/apache/impala/analysis/NullLiteral.java M fe/src/main/java/org/apache/impala/analysis/NumericLiteral.java M fe/src/main/java/org/apache/impala/analysis/OrderByElement.java M fe/src/main/java/org/apache/impala/analysis/QueryStmt.java M fe/src/main/java/org/apache/impala/analysis/RangePartition.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/analysis/SortInfo.java M fe/src/main/java/org/apache/impala/analysis/StatementBase.java M fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java M fe/src/main/java/org/apache/impala/analysis/StringLiteral.java M fe/src/main/java/org/apache/impala/analysis/TimestampLiteral.java M fe/src/main/java/org/apache/impala/analysis/UnionStmt.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/planner/AggregationNode.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java M fe/src/main/java/org/apache/impala/planner/DataPartition.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/ExchangeNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionFilter.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala
[Impala-ASF-CR] WIP: IMPALA-2138: part 4: implicit cast handling
Tim Armstrong has abandoned this change. ( http://gerrit.cloudera.org:8080/14399 ) Change subject: WIP: IMPALA-2138: part 4: implicit cast handling .. Abandoned -- To view, visit http://gerrit.cloudera.org:8080/14399 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: Ie50b7d28d691d65ee721581216d227d52f04ecf5 Gerrit-Change-Number: 14399 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges
Tim Armstrong has abandoned this change. ( http://gerrit.cloudera.org:8080/14216 ) Change subject: WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges .. Abandoned not doing for now -- To view, visit http://gerrit.cloudera.org:8080/14216 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: I94ccf2d1acecd9a593f4c29fc15202a799d2f7f5 Gerrit-Change-Number: 14216 Gerrit-PatchSet: 8 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/14216 Change subject: WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges .. WIP: IMPALA-2138,IMPALA-1306: project tuples before exchanges This patch adds an optimization pass to the planner that adds projection operations to the distributed plan before exchanges. The projections are added if the row either has slots that are not used upstream or if the row has multiple tuples, i.e. is not flat. The pass is performed late in the planning process, after join ordering, runtime filters, etc. There is a phase ordering problem because the projection pass depends on the join order and strategy, but those decisions depend on the amount of data being exchanged across the network, which is reduced by projection. Tacking the projection onto the end solves the problem with the following pros and cons. Pros: * This is not invasive to the rest of the planner * Other planner decisions like join ordering will not be affected by the change, making this change safer and makes the chance of serious perf regressions minimal. Cons: * Join ordering and strategy may make sub-optimal decisions because they include the cost of exchanging projected slots in decisions. Note that the projection is implemented with a separate UnionNode (called PROJECT in the plan). We could squeeze out some more performance by doing the projection inline in the exchange operator and avoiding a copy, but this already appears to be a big win even with the extra separate step: the cost of the PROJECT seems to be offset by reduced serialisation and compression time in the exchange. The UNION is codegen'd and quite efficient, and we save work in the exchange node from fewer slots, the flattened tuple representation, and less data to compress and transfer. The details of the optimisation pass are documented in the class comment. TODO in fe implementation * O(n^2) algorithms in tree walk? Expr lists are revisited each time. * Audit expr method implementations * remove debug logging * Consider removing the validation outside of testing Testing: TODO * Need to update planner tests * Exhaustive tests * Run some tests with projection disabled - join queries, agg queries? * Projection that should happen outside subplan set explain_level=2; use functional_parquet; explain select straight_join t1.id, m.key from complextypestbl t1 join [broadcast] complextypestbl t2, t2.int_map m where t1.id = t2.id and t2.nested_struct.a > 10; This was based on a prototype by Alex Behm. Change-Id: I94ccf2d1acecd9a593f4c29fc15202a799d2f7f5 --- M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/OrderByElement.java M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java M fe/src/main/java/org/apache/impala/planner/AggregationNode.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/CardinalityCheckNode.java M fe/src/main/java/org/apache/impala/planner/DataSink.java M fe/src/main/java/org/apache/impala/planner/DataStreamSink.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/EmptySetNode.java M fe/src/main/java/org/apache/impala/planner/ExchangeNode.java M fe/src/main/java/org/apache/impala/planner/HBaseTableSink.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/JoinBuildSink.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/KuduTableSink.java M fe/src/main/java/org/apache/impala/planner/PlanFragment.java M fe/src/main/java/org/apache/impala/planner/PlanNode.java M fe/src/main/java/org/apache/impala/planner/PlanRootSink.java M fe/src/main/java/org/apache/impala/planner/Planner.java A fe/src/main/java/org/apache/impala/planner/ProjectionPass.java M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java M fe/src/main/java/org/apache/impala/planner/ScanNode.java M fe/src/main/java/org/apache/impala/planner/SelectNode.java M fe/src/main/java/org/apache/impala/planner/SingularRowSrcNode.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M fe/src/main/java/org/apache/impala/planner/SubplanNode.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M fe/src/main/java/org/apache/impala/planner/UnionNode.java M fe/src/main/java/org/apache/impala/planner/UnnestNode.java 35 files changed, 976 insertions(+), 48 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/16/14216/8 -- To view, visit http://gerrit.cloudera.org:8080/14216 To
[Impala-ASF-CR] WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen
Tim Armstrong has abandoned this change. ( http://gerrit.cloudera.org:8080/16214 ) Change subject: WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen .. Abandoned Now redundant, this was solved in a different way -- To view, visit http://gerrit.cloudera.org:8080/16214 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: I3c8f1c82cf6f932dd6aad8670c8cca2e53f80c84 Gerrit-Change-Number: 16214 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Tianyi Wang
[Impala-ASF-CR] WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16214 Change subject: WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen .. WIP: IMPALA-3816,IMPALA-4065: full TupleRowComparator codegen This patch removes the indirection of codegened TupleRowComparator:: Compare() at its call sites in sorter and topn node. It's implemented by cloning and modifying all the functions between the codegened entry function and Compare(). The call site in SortedRunMerger is still indirect, but the indirection could be removed in the same way as this patch. Testing: * TODO: unit tests for SCC algo Perf: Tianyi said: "TPCH queries with a sort node are 2%-5% faster." TODO: run benchmarks Change-Id: I3c8f1c82cf6f932dd6aad8670c8cca2e53f80c84 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/exec/topn-node.cc M be/src/runtime/sorted-run-merger.cc M be/src/runtime/sorter-internal.h M be/src/runtime/sorter-ir.cc M be/src/runtime/sorter.cc M be/src/runtime/sorter.h M be/src/util/tuple-row-compare.cc M be/src/util/tuple-row-compare.h A tests/custom_cluster/test_replace_tuple_row_compare.py 12 files changed, 354 insertions(+), 37 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/14/16214/1 -- To view, visit http://gerrit.cloudera.org:8080/16214 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I3c8f1c82cf6f932dd6aad8670c8cca2e53f80c84 Gerrit-Change-Number: 16214 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Tianyi Wang
[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16908 ) Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions .. Patch Set 11: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16908 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c Gerrit-Change-Number: 16908 Gerrit-PatchSet: 11 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 25 Jan 2021 18:57:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16242 ) Change subject: IMPALA-9979: part 2: partitioned top-n .. Patch Set 27: Rebased onto master after the previous patch was merged -- To view, visit http://gerrit.cloudera.org:8080/16242 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 Gerrit-Change-Number: 16242 Gerrit-PatchSet: 27 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 25 Jan 2021 18:46:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#27). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. The interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle both the partitioning (for which we use a std::map and a comparator based on the partition exprs) and the tie-handling semantics required by rank() predicates. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. We currently use the partitioned top-n node to implement rank() pushdown in all cases because of the tie-handling support. We also cannot use the merging exchange for rank() because the limit does not handle ties in the same way, so we need to generate a hash exchange with a partitioned top-n node on top of the exchange. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.
[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16318 ) Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1 .. Patch Set 32: (4 comments) Few comments related to the hang http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr-internal.h File be/src/runtime/tmp-file-mgr-internal.h: http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr-internal.h@206 PS32, Line 206: TmpFileRemote(TmpFileGroup* file_group, TmpFileMgr::DeviceId device_id, Can we add a destructor with a DCHECK that will ensure that the resources have been returned to TmpFileBufferPool correctly? This might relate to the shared_ptr comment I made elsewhere - if we use shared_ptr consistently then we'll know that the destructor runs exactly when the object becomes unreachable. http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.h File be/src/runtime/tmp-file-mgr.h: http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.h@184 PS32, Line 184: /// A map to keep the TmpFile shared pointer alive by keeping a reference of the Why do we need to do this again? Can't we enqueue the shared_ptr into the tmp file pool and make ownership explicit, instead of doing this shared_ptr/raw pointer dance. I'm finding it hard to understand the ownership of these TmpFile objects while looking to see if there's a resource leak. http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.cc@576 PS32, Line 576: if (tmp_dirs_remote_ctrl_.local_buff_dir_bytes_high_water_mark_.Add(file_size) I think this code might be vulnerable to a race where a thread ends up blocking indefinitely in DequeueTmpFilesPool, e.g. Assume bytes_limit is 2x and each file is x bytes. Start with x bytes allocated. Thread 1: increment, win race, 2x allocated Thread 2: increment, lose race, 3x allocated > bytes limit Thread 3: decrement, 2x Thread 4: increment, fails, 3x allocated Thread 4: decrement, 2x Thread 2: decrement, 1x In that case thread 4 could block in DequeueTmpFilesPool even though there's free capacity. I think the basic flaw is that one of the conditions for a thread blocking on tmp_files_available_cv_ is that it should have checked for buffer availabity before doing so, but it's not part of the condition variable loop condition. Using the atomics is ok in other places because we just return an error when there's a race. I think it'd be best to protect it behind a lock and also signal the condition variable when it's decremented from >= bytes_limit to < bytes_limit. http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.cc@1835 PS32, Line 1835: void TmpFileBufferPool::EnqueueTmpFilesPool(TmpFile* tmp_file, bool front) { I'm suspicious about using the raw pointer here -- To view, visit http://gerrit.cloudera.org:8080/16318 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89 Gerrit-Change-Number: 16318 Gerrit-PatchSet: 32 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Thu, 21 Jan 2021 23:51:19 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16908 ) Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions .. Patch Set 9: (3 comments) LGTM, I wanted a few more tests and a comment but otherwise I'm ready to +2 http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc@10542 PS9, Line 10542: TEST_P(ExprTest, Utf8Test) { Some of the characters below are > 2 bytes right? It would be helpful to add a comment mentioning the # of bytes in the characters you used. http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc@10596 PS9, Line 10596: TestStringValue("utf8_reverse('mañana')", "anañam"); Can we add tests for a couple of grapheme clusters where we're reversing the codepoints instead of the clusters. There are a couple of examples here - https://exploringjs.com/impatient-js/ch_unicode.html#grapheme-clusters-the-real-characters. I tried them in impala-shell and it seems to have the expected behavior - https://drive.google.com/file/d/1iPXFAOtkRE5OPc014i6hARfig7W8xmpw/view?usp=sharing http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/string-functions-ir.cc@496 PS9, Line 496: StringVal StringFunctions::Utf8Reverse(FunctionContext* context, const StringVal& str) { Comment that this reverses codepoints only, and that's consistent with other systems. -- To view, visit http://gerrit.cloudera.org:8080/16908 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c Gerrit-Change-Number: 16908 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 21 Jan 2021 20:31:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns
Tim Armstrong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16947 ) Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns .. IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns Modified parser to support compute incremental stats columns.No need to modify the code of other modules because it already supports Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b Reviewed-on: http://gerrit.cloudera.org:8080/16947 Tested-by: Impala Public Jenkins Reviewed-by: Tim Armstrong --- M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test 5 files changed, 340 insertions(+), 5 deletions(-) Approvals: Impala Public Jenkins: Verified Tim Armstrong: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/16947 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b Gerrit-Change-Number: 16947 Gerrit-PatchSet: 5 Gerrit-Owner: liuyao Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Reviewer: liuyao
[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16947 ) Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns .. Patch Set 4: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16947 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b Gerrit-Change-Number: 16947 Gerrit-PatchSet: 4 Gerrit-Owner: liuyao Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Reviewer: liuyao Gerrit-Comment-Date: Thu, 21 Jan 2021 19:35:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16947 ) Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns .. Patch Set 3: (1 comment) Thanks, this makes sense I think. Can you test one more edge case for me? http://gerrit.cloudera.org:8080/#/c/16947/3/testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test File testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test: http://gerrit.cloudera.org:8080/#/c/16947/3/testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test@830 PS3, Line 830: Can you also run compute incremental stats without specifying a partition to see which partitions it recomputes stats on? -- To view, visit http://gerrit.cloudera.org:8080/16947 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b Gerrit-Change-Number: 16947 Gerrit-PatchSet: 3 Gerrit-Owner: liuyao Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Reviewer: liuyao Gerrit-Comment-Date: Thu, 21 Jan 2021 00:52:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9793: Impala quickstart cluster with docker-compose
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15966 ) Change subject: IMPALA-9793: Impala quickstart cluster with docker-compose .. Patch Set 9: (8 comments) http://gerrit.cloudera.org:8080/#/c/15966/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15966/9//COMMIT_MSG@26 PS9, Line 26: Instructions for running the quickstart cluster are in : docker/quickstart.yml. > One thing I saw over in Kudu is that they have the docker compose instructi I moved the stuff from the yml to the README, I think that makes sense. http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart.yml File docker/quickstart.yml: http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart.yml@20 PS9, Line 20: # All filesystem data is stored in Docker volumes. The default storage location for tables : # is in the impala-quickstart-warehouse volume, i.e. if you create a table in Impala, it : # will be stored in that volume by default. > This could be a follow-on change, but we may want to have some documentatio I added a section to the README describing how to mount the volume in a container and copy data from the host. http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart.yml@51 PS9, Line 51: # To load data in background into Parquet and Kudu formats: : # : # docker-compose -f docker/quickstart.yml -f docker/quickstart-kudu-minimal.yml \ : # -f docker/quickstart-load-data.yml up -d > Can the dataload part run as a separate docker-compose command? i.e. They need to be run in the same command so that docker-compose can resolve the dependencies between the files. http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/Dockerfile File docker/quickstart_client/Dockerfile: http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/Dockerfile@48 PS9, Line 48: RUN groupadd -r impala && useradd --no-log-init -r -g impala impala && \ Needed to update the uid/gid to match IMPALA-10373 http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/Dockerfile@52 PS9, Line 52: # Copy the Hive install. > Nit: Not really Hive related. Maybe change this to "Copy the client entrypo Done http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/data-load-entrypoint.sh File docker/quickstart_client/data-load-entrypoint.sh: http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/data-load-entrypoint.sh@32 PS9, Line 32: LOading > Nit: capitalization typo Done http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_client/data-load-entrypoint.sh@35 PS9, Line 35: TPCDS_TARBALL=tpc-ds-${TPCDS_VERSION}-gcc-4.9.2-ec2-package-ubuntu-18-04.tar.gz > Nit: The client base image is using Ubuntu 16 by default, and we are downlo Done - switched the base image to ubuntu 18 http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_conf/hive-site.xml File docker/quickstart_conf/hive-site.xml: http://gerrit.cloudera.org:8080/#/c/15966/9/docker/quickstart_conf/hive-site.xml@28 PS9, Line 28: : : hive.metastore.notifications.add.thrift.objects : true : : : : hive.metastore.alter.notifications.basic : false : : : : hive.metastore.event.db.notification.api.auth : false : > When I'm looking at https://github.com/apache/impala/blob/master/fe/src/tes Yeah it should be generally similar to what we use for the tests. I used that as a reference but removed some of the bits that were obviously irrelevant to prune it down. Ideally this config would only override default hive settings where we actually need them to get it working. I removed a couple of these settings that were redundant now. -- To view, visit http://gerrit.cloudera.org:8080/15966 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c Gerrit-Change-Number: 15966 Gerrit-PatchSet: 9 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 20 Jan 2021 21:46:50 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9793: Impala quickstart cluster with docker-compose
Hello Quanlong Huang, Grant Henke, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15966 to look at the new patch set (#10). Change subject: IMPALA-9793: Impala quickstart cluster with docker-compose .. IMPALA-9793: Impala quickstart cluster with docker-compose What works: * A single node cluster can be started up with docker-compose * HMS data is stored in Derby database in a docker volume * Filesystem data is stored in a shared docker volume, using the localfs support in the Hadoop client. * A Kudu cluster with a single master can be optionally added on to the Impala cluster. * TPC-DS data can be loaded automatically by a data loading container. We need to set up a docker network called quickstart-network, purely because docker-compose insists on generating network names with underscores, which are part of the FQDN and end up causing problems with Java's URL parsing, which rejects these technically invalid domain names. How to run: Instructions for running the quickstart cluster are in docker/README.md. How to build containers: ./buildall.sh -release -noclean -notests -ninja ninja quickstart_hms_image quickstart_client_image docker_images How to upload containers to dockerhub: IMPALA_QUICKSTART_IMAGE_PREFIX=timgarmstrong/ for i in impalad_coord_exec impalad_coordinator statestored \ impalad_executor catalogd impala_quickstart_client \ impala_quickstart_hms do docker tag $i ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i docker push ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i done I pushed containers build from commit f260cce22, which was branched from 6cb7cecacf on master. Misc other stuff: * Added more metadata to all images. TODO: * Test and instructions to run against Kudu quickstart * Upload latest version of containers before merging. Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c --- M docker/CMakeLists.txt M docker/README.md A docker/docker-build.sh M docker/impala_base/Dockerfile A docker/quickstart-kudu-minimal.yml A docker/quickstart-load-data.yml A docker/quickstart.yml A docker/quickstart_client/Dockerfile A docker/quickstart_client/data-load-entrypoint.sh A docker/quickstart_client/load_tpcds_kudu.sql A docker/quickstart_client/load_tpcds_parquet.sql A docker/quickstart_conf/hive-site.xml A docker/quickstart_hms/Dockerfile A docker/quickstart_hms/hms-entrypoint.sh 14 files changed, 2,985 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/15966/10 -- To view, visit http://gerrit.cloudera.org:8080/15966 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c Gerrit-Change-Number: 15966 Gerrit-PatchSet: 10 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10441: Skip test bytes read per column if not on local minicluster
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16964 ) Change subject: IMPALA-10441: Skip test_bytes_read_per_column if not on local minicluster .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16964 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If8a179937c9c7c690dd2630549464dbe6aa1b834 Gerrit-Change-Number: 16964 Gerrit-PatchSet: 1 Gerrit-Owner: Laszlo Gaal Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 20 Jan 2021 16:53:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10448: Build impala-profile-tool early for Docker-based tests
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16965 ) Change subject: IMPALA-10448: Build impala-profile-tool early for Docker-based tests .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16965 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I60e78ea883f3057c59a345feca38ef08a7f6a0b8 Gerrit-Change-Number: 16965 Gerrit-PatchSet: 1 Gerrit-Owner: Laszlo Gaal Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 20 Jan 2021 16:53:05 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16318 ) Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1 .. Patch Set 30: (10 comments) I think I'm pretty close to approving it - the big thing is that I wanted to understand why the deadlock you were mentioning could occur. That one makes me think that we might be missing a flaw in how the local buffer directories work, or similar. Otherwise looking pretty good :) http://gerrit.cloudera.org:8080/#/c/16318/30//COMMIT_MSG Commit Message: PS30: Following on from our discussion about a potential deadlock, I do think there is a limitation that, if there are N local buffer files, only N spilling-to-remote queries can make progress at the same time. I think these queries almost always will have a partially-written file with a local buffer associated, so they may block other queries from spilling until they have completed and released that buffer. I think a query can be in this state indefinitely. E.g. if it was spilling but is now off doing some other work. This isn't ideal but is maybe OK if N is high enough, at least for milestone 1. I *think* that each query should be able to make progress independent of other queries. We would have a deadlock if query A depended on B to release a local buffer file to make progress and B depended directly or indirectly on A releasing some other resource to make progress. I haven't seen anything that looks like that yet. http://gerrit.cloudera.org:8080/#/c/16318/30//COMMIT_MSG@112 PS30, Line 112: * Ran pre-review-test Have you tested in combination with disk spill compression? http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/disk-io-mgr.cc@328 PS30, Line 328: 0 I think we should probably set the replication to 1 for this use case - that would reduce space requirement for HDFS spilling by 3x. http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.h File be/src/runtime/io/local-file-system.h: http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.h@42 PS30, Line 42: uint8 nit: uint8_t http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.cc File be/src/runtime/io/local-file-system.cc: http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.cc@98 PS30, Line 98: uint8 nit: uint8_t http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-writer.cc File be/src/runtime/io/local-file-writer.cc: http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-writer.cc@45 PS30, Line 45: int written = write(fileno(file_), range->data(), range->len()); I missed this, I think we should use LocalFileSystem::Fwrite which converts the error into a status so that the error is handled correctly. http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.h File be/src/runtime/tmp-file-mgr.h: http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.h@420 PS30, Line 420: /// This is the main public API for TmpFileMgr writing. I notice that it doesn't clearly document anything about when the write finishes. We should probably say something like: The write may take some time to complete. It may be queued behind other I/O operations. If remote scratch is enabled, it may also need to wait for other queries to make progress and release space in the local buffer directory. http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h File be/src/runtime/tmp-file-mgr.h: http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h@194 PS27, Line 194: std::unique_ptr tmp_file_pool_; > The pool only works for remote spilling. It would be null when there is no Can you also put this in the comment for tmp_file_pool_? http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.cc@517 PS30, Line 517: if (HasRemoteDir()) { This doesn't seem right if write_range is just writing to a local scratch directory (not just the local buffer directory). Shouldn't it check to see if write_range is jsut targeted at local scratch? I guess I would expect spilling to a local directory to use the same code path regardless of whether remote scratch is used or not. http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.cc@1820 PS30, Line 1820: DCHECK(status.ok()); nit: use DCHECK_OK - we have a custom version that automatically prints the error message if the check fails. -- To view, visit http://gerrit.cloudera.org:8080/16318 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Ge
[Impala-ASF-CR] IMPALA-10434: Fix impala-shell's unicode regressions on Python2
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16960 ) Change subject: IMPALA-10434: Fix impala-shell's unicode regressions on Python2 .. Patch Set 1: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/16960/1/shell/impala_shell.py File shell/impala_shell.py: http://gerrit.cloudera.org:8080/#/c/16960/1/shell/impala_shell.py@552 PS1, Line 552: line nit: maybe history_cmd? -- To view, visit http://gerrit.cloudera.org:8080/16960 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icc4a8d31311a5c59e5fc0e65fe09f770df41bea4 Gerrit-Change-Number: 16960 Gerrit-PatchSet: 1 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 19 Jan 2021 21:56:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16947 ) Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns .. Patch Set 2: How does this work if you compute incremental stats on a subset of columns, e.g. (c1, c2) then compute it on a different subset of columns, e.g. (c2, c3)? Is this allowed? Does this still merge together the stats that are available? I think we need a test. -- To view, visit http://gerrit.cloudera.org:8080/16947 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b Gerrit-Change-Number: 16947 Gerrit-PatchSet: 2 Gerrit-Owner: liuyao Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 19 Jan 2021 21:11:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 11: (11 comments) http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@9 PS11, Line 9: when > nit: remove Done http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@22 PS11, Line 22: limit > did you mean partition limit >= order by limit ? Done http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@32 PS11, Line 32: was > nit: 'is' Done http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java: http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@507 PS11, Line 507: upper top-n. > Since we do a distributed top-n, there are 3 top-n's in the plan and the 'u Good point - used final/analytic instead of upper/lower. http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@511 PS11, Line 511: doesn't matter > nit: worth clarifying that it doesn't matter for the purpose of the pushdow Done http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@530 PS11, Line 530: include all of the rows in the final > This does not literally mean all rows in the final partition right ? Should Yup that's true. http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@531 PS11, Line 531: was > nit: 'we' Done http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@538 PS11, Line 538:* TODO: this should also return the amount that needs to be added to the limit I already did this, removed TODO http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@603 PS11, Line 603: if (analyticLimit < limit) return falseStatus; > One special case where this could work is if each partition had a maximum o I think the partitioned top-n node would let us correctly apply the limit at runtime, if we wanted to further optimize this (i.e. I think at any point during the partitioned top-n, you could discard any in-memory partitions that didn't include the first N rows) http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java@83 PS11, Line 83: row.s > nit: 'rows' Done http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test File testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test: http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test@1139 PS11, Line 1139: # rank() predicate is not pushed down because TOPN_BYTES_LIMIT prevents conversion > The plan shows the lower top-n which indicates the rank was pushed down. Pe Thanks for catching this. The problem was that the table only had 8 rows, so the estimate for the top n was 18B * 8 = 144B, under the threshold. I switched to a different table with more rows and it now shows the expected behavior -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 19 Jan 2021 19:40:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#14). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it is always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The elapsed time for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,275 insertions(+), 355 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/14 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 14 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 13: (10 comments) http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc File be/src/exec/topn-node-ir.cc: http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@44 PS11, Line 44: priority_queue_.Push(insert_tuple); > Might be useful to explicitly mention here that we don't have to worry abou Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@91 PS11, Line 91: else { : // 'materialized_tuple' needs to be > I might move this above the DCHECK, to make it clearer its referring to the Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@93 PS11, Line 93: from the heap. : DCHECK_LT(cmp_result > Not sure what this is supposed to mean, since my understanding would be tha I'm not sure exactly what I was getting at, I think I was just restating the invariant that everything in 'overflowed_ties_' was equal with the head of the queue, so it didn't add much. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@97 PS11, Line 97: > Might be worth explicitly saying that 'top_tuple' is what we just popped of Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@100 PS11, Line 100: node->tuple_row_less_than_->Compare(top_tuple, priority_queue_.Top()) == 0) { > I find this comment a little confusing, since we've already done the one Po Thanks, that's clearer. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h File be/src/exec/topn-node.h: http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@86 PS11, Line 86: /// TODO: currently we only support a single top-n heap per operator. IMPALA-9979 will > nit: its a little confusing to call out 'unpartitioned mode' specifically w Yeah I had to pull this comment out of the partitioned mode patch and didn't want to create too many merge conflicts by editing comments. I added a TODO as suggested. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@180 PS11, Line 180: RuntimeProfile::Counter* tuple_pool_reclaim_counter_= nullptr; > The counters here and below aren't used in this patch Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@200 PS11, Line 200: n data structure use > nit: not in this patch Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@256 PS11, Line 256: const int64_t capacity_; > If I understand correctly, this function only gets called once the heap has Done. Updated comment and added DCHECK. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc File be/src/exec/topn-node.cc: http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc@306 PS11, Line 306: heap is at > I think this is a typo? i.e. these two words should be removed Done -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 13 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 16 Jan 2021 01:48:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#13). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,276 insertions(+), 354 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/13 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 13 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#12). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,275 insertions(+), 354 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/12 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 12 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#26). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. The interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle both the partitioning (for which we use a std::map and a comparator based on the partition exprs) and the tie-handling semantics required by rank() predicates. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. We currently use the partitioned top-n node to implement rank() pushdown in all cases because of the tie-handling support. We also cannot use the merging exchange for rank() because the limit does not handle ties in the same way, so we need to generate a hash exchange with a partitioned top-n node on top of the exchange. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.
[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16318 ) Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1 .. Patch Set 27: (19 comments) Another round of comments. I understand the local buffer file logic now and I think it makes sense, just need to do another pass to look for bugs. http://gerrit.cloudera.org:8080/#/c/16318/27//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16318/27//COMMIT_MSG@77 PS27, Line 77:The first available local directory is used for the local buffer Can you give an example of how you might set this up in practice to allow both local and remote spilling? Would you configure two local directories with different limits so that one was the local buffer directory and the other was the main scratch directory? http://gerrit.cloudera.org:8080/#/c/16318/27//COMMIT_MSG@91 PS27, Line 91: * Unit Tests added to tmp-file-mgr-test/disk-io-mgr-test. I think it'd be good to add a buffer pool test that exercises the remote spilling code paths. We have some randomized tests like BufferPoolTest::TestRandomInternalMulti() - I think you could add another one with a FileGroup configured to use remote scratch. http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-file.h File be/src/runtime/io/disk-file.h: http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-file.h@137 PS27, Line 137: void SetActualFileSize(int64_t size) { This is only called once, right? Can we check that invariant: DCHECK_EQ(0, actual_file_size_.Load()); http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-file.h@138 PS27, Line 138: DCHECK nit: use DCHECK_LE - it automatically prints the two values if it fails so it's a little nicer for debugging. http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-io-mgr.cc@252 PS27, Line 252: is_full_ = written_bytes != 0 && written_bytes == disk_file_->actual_file_size(); Add DCHECK_LE(written_bytes, disk_file_->actual_file_size()) Just so we catch it in testing if there's a bug. http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/disk-io-mgr.cc@329 PS27, Line 329: int bytes = (file_size - offset < buffer_size) ? (file_size - offset) : buffer_size; Maybe use min() instead? http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/request-context.cc File be/src/runtime/io/request-context.cc: http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/io/request-context.cc@243 PS27, Line 243: (static_cast(range))->callback()(status); nit: I think the parentheses around static_cast here and on l246 are not needed. http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr-internal.h File be/src/runtime/tmp-file-mgr-internal.h: http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr-internal.h@326 PS27, Line 326: /// The number of the tmp files in the pool. For debug and test. We don't seem to use this variable any more http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h File be/src/runtime/tmp-file-mgr.h: http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h@194 PS27, Line 194: std::unique_ptr tmp_file_pool_; When is this null or non-null? http://gerrit.cloudera.org:8080/#/c/16318/24/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/24/be/src/runtime/tmp-file-mgr.cc@427 PS24, Line 427: *need_local_buffer_dir = false; > Currently local buffer directory won't be inserted into tmp_dirs_, the dire I left a comment on the commit message. It seems like it would be helpful to have an example of how to configure this in a reasonable way. http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.cc@512 PS27, Line 512: if (tmp_dirs_remote_ctrl_.tmp_file_pool_ == nullptr) { What [ http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.cc@917 PS27, Line 917: if (tmp_file_mgr_->tmp_dirs_remote_ctrl_.tmp_file_pool_ != nullptr) { Which case is this? Can we add a boolean helper method to TmpDirsRemoteCtrl() that makes it self-describing? It's sometimes a little hard to follow code when null/non-null implies certain modes. http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.cc@967 PS27, Line 967: if (!tmp_files_remote_.empty()) { It would be helpful to add a couple of comments to explain the code flow. E.g. // First, check to see if we can allocate more space in the previous file. then below // No space in previous file, we need to set up a new remote file. http://gerrit.cloudera.org:808
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16242 ) Change subject: IMPALA-9979: part 2: partitioned top-n .. Patch Set 25: Latest patchset fixes the test failure and is ready for review. This patchset depends on the limit pushdown patchset though so that needs review first anyway. -- To view, visit http://gerrit.cloudera.org:8080/16242 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 Gerrit-Change-Number: 16242 Gerrit-PatchSet: 25 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 15 Jan 2021 00:27:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9865: part 2/2: add verbosity to profile tool
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16881 ) Change subject: IMPALA-9865: part 2/2: add verbosity to profile tool .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/16881/8/be/src/util/runtime-profile.h File be/src/util/runtime-profile.h: http://gerrit.cloudera.org:8080/#/c/16881/8/be/src/util/runtime-profile.h@177 PS8, Line 177: void PrettyPrint(std::ostream* s, const std::string& prefix = "") const; > For my understanding, what type of code uses this overload that doesn't tak The profiles from the Impala debug server is one example. -- To view, visit http://gerrit.cloudera.org:8080/16881 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I82618a813e29af7996dfaed78873b2a73bc0231d Gerrit-Change-Number: 16881 Gerrit-PatchSet: 8 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 14 Jan 2021 19:13:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#25). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. The interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle both the partitioning (for which we use a std::map and a comparator based on the partition exprs) and the tie-handling semantics required by rank() predicates. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. We currently use the partitioned top-n node to implement rank() pushdown in all cases because of the tie-handling support. We also cannot use the merging exchange for rank() because the limit does not handle ties in the same way, so we need to generate a hash exchange with a partitioned top-n node on top of the exchange. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#24). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. The interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle both the partitioning (for which we use a std::map and a comparator based on the partition exprs) and the tie-handling semantics required by rank() predicates. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. We currently use the partitioned top-n node to implement rank() pushdown in all cases because of the tie-handling support. We also cannot use the merging exchange for rank() because the limit does not handle ties in the same way, so we need to generate a hash exchange with a partitioned top-n node on top of the exchange. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#11). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,285 insertions(+), 353 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/11 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP
Tim Armstrong has abandoned this change. ( http://gerrit.cloudera.org:8080/16950 ) Change subject: WIP .. Abandoned -- To view, visit http://gerrit.cloudera.org:8080/16950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: I55f6376ab609ec6fb56d59ce14a995174e1644c6 Gerrit-Change-Number: 16950 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#10). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,278 insertions(+), 348 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/10 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 10 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Hello Aman Sinha, Qifan Chen, Shant Hovsepian, David Rorke, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16242 to look at the new patch set (#23). Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. The interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle both the partitioning (for which we use a std::map and a comparator based on the partition exprs) and the tie-handling semantics required by rank() predicates. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. We currently use the partitioned top-n node to implement rank() pushdown in all cases because of the tie-handling support. We also cannot use the merging exchange for rank() because the limit does not handle ties in the same way, so we need to generate a hash exchange with a partitioned top-n node on top of the exchange. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.