[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions
Fucun Chu has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 ) Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions .. Patch Set 4: Performance comparison between ds_hll_* and ds_cpc_* functions see: https://issues.apache.org/jira/browse/IMPALA-10500 -- To view, visit http://gerrit.cloudera.org:8080/16656 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a Gerrit-Change-Number: 16656 Gerrit-PatchSet: 4 Gerrit-Owner: Fucun Chu Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Thu, 11 Feb 2021 06:58:12 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 4: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6884/ -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 06:50:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17054 ) Change subject: IMPALA-10497: Fix flakiness in test_no_fd_caching_on_cached_data. .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8125/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17054 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952 Gerrit-Change-Number: 17054 Gerrit-PatchSet: 2 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Thu, 11 Feb 2021 03:19:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/17054 ) Change subject: IMPALA-10497: Fix flakiness in test_no_fd_caching_on_cached_data. .. Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py File tests/custom_cluster/test_hdfs_fd_caching.py: http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@179 PS1, Line 179: le handles cached > One thing to watch out for here is if you inherit from TestHdfsFdCaching, y Done http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@184 PS1, Line 184: # handle ca > I believe that the number of rows should not impact the behavior. The diffe Done http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@190 PS1, Line 190: > I will remove this so the cluster size back to default. I change my mind, I think we should keep cluster_size=1 here. On start up, impalad delete the old cache file by listing file matching CACHE_FILE_PREFIX and delete them one by one. https://github.com/apache/impala/blob/b42c649/be/src/runtime/io/data-cache.cc#L415 So if cluster_size > 1, multiple impalad might race to delete the same old cache file and crash the losing ones. Also, setting cluster_size=1 does not change the behavior being tested by this test, so I think it should be okay. http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@207 PS1, Line 207: : def cached_handles(self): : return self.get_ag > This makes a lot of sense to me as the explanation. There are times when da Done http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@224 PS1, Line 224: > I'm fine with either value here, but I also don't see this changing the beh Done -- To view, visit http://gerrit.cloudera.org:8080/17054 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952 Gerrit-Change-Number: 17054 Gerrit-PatchSet: 2 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Thu, 11 Feb 2021 03:09:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.
Hello Joe McDonnell, Bikramjeet Vig, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17054 to look at the new patch set (#2). Change subject: IMPALA-10497: Fix flakiness in test_no_fd_caching_on_cached_data. .. IMPALA-10497: Fix flakiness in test_no_fd_caching_on_cached_data. test_no_fd_caching_on_cached_data has been flaky for not having all of the data fully cached in the warm-up phase. There is a limit on concurrency in writing to the cache such that we may fail to cache data the first time read it. This patch fixes the test by repeating the warm-up query 5 times. This patch also reduce the test cluster size to just 1 to avoid potential flakiness from multiple impalad racing to delete the same old fata cache files. Testing: - Loop the test manually 100 times and see no more failures. Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952 --- M tests/custom_cluster/test_hdfs_fd_caching.py 1 file changed, 7 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/17054/2 -- To view, visit http://gerrit.cloudera.org:8080/17054 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952 Gerrit-Change-Number: 17054 Gerrit-PatchSet: 2 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Riza Suminto
[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/17054 ) Change subject: IMPALA-10497: Fix flakiness in test_no_fd_caching_on_cached_data. .. Patch Set 1: (1 comment) Thanks Joe for the feedback. I will revert most of these and leave only the extra warm up part. http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py File tests/custom_cluster/test_hdfs_fd_caching.py: http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@190 PS1, Line 190: cluster_size=1 > I think this should not impact the caching. Our scheduling algorithm maps f I will remove this so the cluster size back to default. Previously, when I kept cluster size as default (3), after several iteration of looping this test, one impalad failed with the following error. I0210 17:50:00.781529 10340 mem_tracker.cc:83] Creating tracker /tmp-sharded_lru_cache->root I0210 17:50:00.782321 10340 status.cc:129] Failed to delete old cache file /tmp/impala-cache-file-fc4060e3ab7e45b0:a680f0d3b05005a1: Not found: /tmp/impala-cache-file-fc4060e3ab7e45b0:a680f0d3b05005a1: No such file or directory (error 2) @ 0x2c008d7 _ZN6impala13GetStackTraceB5cxx11Ev @ 0x1c9003e impala::Status::Status() @ 0x36016f9 impala::io::DataCache::Partition::DeleteExistingFiles() @ 0x3601e22 impala::io::DataCache::Partition::Init() @ 0x360975e impala::io::DataCache::Init() @ 0x35b4db0 impala::io::DiskIoMgr::Init() @ 0x251b229 impala::ExecEnv::Init() @ 0x29a0412 ImpaladMain() @ 0x1bdf4d3 main @ 0x7faa8e3bebf7 __libc_start_main @ 0x1ae81da _start I0210 17:50:00.897989 10340 mem_tracker.cc:87] Destroying tracker /tmp-sharded_lru_cache->root I was guessing that since all three write their data cache to common /tmp/ dir, one might accidentally delete cache file that belong to other impalad. Just running the test one time doesn't reveal this error. So in this patchset 1, I follow examples from custom_cluster/test_data_cache.py to set custom_cluster=1. -- To view, visit http://gerrit.cloudera.org:8080/17054 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952 Gerrit-Change-Number: 17054 Gerrit-PatchSet: 1 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Thu, 11 Feb 2021 02:18:19 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/17058 ) Change subject: IMPALA-10470: Add link to quickstart container from README .. Patch Set 1: Code-Review+2 Thanks for adding this. (The build failures look unrelated, probably some Maven server went bad.) -- To view, visit http://gerrit.cloudera.org:8080/17058 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e Gerrit-Change-Number: 17058 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Thu, 11 Feb 2021 01:41:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17058 ) Change subject: IMPALA-10470: Add link to quickstart container from README .. Patch Set 1: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/8123/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/17058 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e Gerrit-Change-Number: 17058 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Thu, 11 Feb 2021 01:38:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8124/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:37:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 3: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6883/ -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:24:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 4: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6884/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:24:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 4: Code-Review+2 fix rat exclusdes -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:24:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Hello Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17050 to look at the new patch set (#4). Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. IMPALA-9382: part 3/3 clean up runtime profile v2 text output Eliminated some of the noisy per-instance counters from DEFAULT verbosity. Testing: * Updated impala-profile-tool test with new output * Added new impala-profile-tool test for v2 profile. Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 --- M be/src/util/runtime-profile.cc M bin/rat_exclude_files.txt M testdata/impala-profiles/README M testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2 A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt M tests/observability/test_profile_tool.py 8 files changed, 5,760 insertions(+), 516 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/4 -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 2: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/8122/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:22:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17058 Change subject: IMPALA-10470: Add link to quickstart container from README .. IMPALA-10470: Add link to quickstart container from README Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e --- M README.md 1 file changed, 9 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/17058/1 -- To view, visit http://gerrit.cloudera.org:8080/17058 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e Gerrit-Change-Number: 17058 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 3: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6883/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:58 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 2: Code-Review+2 carry -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Hello Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17050 to look at the new patch set (#2). Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. IMPALA-9382: part 3/3 clean up runtime profile v2 text output Eliminated some of the noisy per-instance counters from DEFAULT verbosity. Testing: * Updated impala-profile-tool test with new output * Added new impala-profile-tool test for v2 profile. Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 --- M be/src/util/runtime-profile.cc M testdata/impala-profiles/README M testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2 A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt M tests/observability/test_profile_tool.py 7 files changed, 5,757 insertions(+), 516 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/2 -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py File tests/observability/test_profile_tool.py: http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py@50 PS1, Line 50: ) > flake8: E501 line too long (91 > 90 characters) Done http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py@53 PS1, Line 53: ) > flake8: E501 line too long (92 > 90 characters) Done -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16242 ) Change subject: IMPALA-9979: part 2: partitioned top-n .. IMPALA-9979: part 2: partitioned top-n Planner changes: --- The planner now identifies predicates that can be converted into limits in a partitioned or unpartitioned top-n with the following method: * Push down predicates that reference analytic tuple into inline view. These will be evaluated after the analytic plan for the inline SelectStmt is generated. * Identify predicates that reference the analytic tuple and could be converted to limits. * If they can be applied to the last sort group of the analytic plan, and the windows are all compatible, then the lowest limit gets converted into a limit in the top N. * Otherwise generate a select node with the conjuncts. We add logic to merge SELECT nodes to avoid generating duplicates from inside and outside the inline view. * The pushed predicate is still added to the SELECT node because it is necessary for correctness for predicates like '=' to filter additional rows and also the limit pushdown optimization looks for analytic predicates there, so retaining all predicates simplifies that. The selectivity of the predicate is adjusted so that cardinality estimates remain accurate. The optimization can be disabled by setting ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is only enabled for limits of 1000 or less, because the in-memory Top-N may perform significantly worse than a full sort for large heaps (since updating the heap for every input row ends up being more expensive than doing a traditional sort). We could probably optimize this more with better tuning so that it can gracefully fall back to doing the full sort at runtime. rank() and row_number() are handled. rank() needs support in the TopN node to include ties for the last place, which is also added in this patch. If predicates are trivially false, we generate empty nodes. This interacts with the limit pushdwon optimization. The limit pushdown optimization is applied after the partitioned top-n is generated, and can sometimes result in more optimal plans, so it is generalized to handle pushing into partitioned top-n nodes. Backend changes: --- The top-n node in the backend is augmented to handle the partitioned case, for which we use a std::map and a comparator based on the partition exprs. The partitioned top-n node has a soft limit of 64MB on the size of the in-memory heaps and can spill with use of an embedded Sorter. The current implementation tries to evict heaps that are less effective at filtering rows. Limitations: --- There are several possible extensions to this that we did not do: * dense_rank() is not supported because it would require additional backend support - IMPALA-10014. * ntile() is not supported because it would require additional backend support - IMPALA-10174. * Only one predicate per analytic is pushed. * Redundant rank()/row_number() predicates are not merged, only the lowest is chosen. * Lower bounds are not converted into OFFSET. * The analytic operator cannot be eliminated even if the analytic expression was only used in the predicate. * This doesn't push predicates into UNION - IMPALA-10013 * Always false predicates don't result in empty plan - IMPALA-10015 Tests: - * Planner tests - added tests that exercise the interesting code paths added in planning. - Predicate ordering in SELECT nodes changed in a couple of cases because some predicates were pushed into the inline views. * Modified SORT targeted perf tests to avoid conversion to Top-N * Added targeted perf test for partitioned top-n. * End-to-end tests - Unpartitioned Top-N end-to-end tests - Basic partitioning and duplicate handling tests on functional - Similar basic tests on larger inputs from TPC-DS and with larger partition counts. - I inspected the results and also ran the same tests with analytic_rank_pushdown_threshold=0 to confirm that the results were the same as with the full sort. - Fallback to spilling sort. Perf: - Added a targeted benchmark that goes from ~2s to ~1s with mt_dop=8 on TPC-H 30 on my desktop. Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 Reviewed-on: http://gerrit.cloudera.org:8080/16242 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/exprs/slot-ref.h M be/src/service/query-options-test.cc M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/priority-queue.h M be/src/util/tuple-row-compare.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16242 ) Change subject: IMPALA-9979: part 2: partitioned top-n .. Patch Set 35: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16242 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 Gerrit-Change-Number: 16242 Gerrit-PatchSet: 35 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 10 Feb 2021 23:52:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/17050 ) Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output .. Patch Set 1: Code-Review+2 I think this makes sense as a way to cut down the verbosity for the DEFAULT level. The flake8 line length issues don't bother me that much given that it is driven by long filenames, but if there is an easy fix, go for it. -- To view, visit http://gerrit.cloudera.org:8080/17050 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491 Gerrit-Change-Number: 17050 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Comment-Date: Wed, 10 Feb 2021 23:29:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/17054 ) Change subject: IMPALA-10497: Fix flakiness in test_no_fd_caching_on_cached_data. .. Patch Set 1: (5 comments) Thanks for looking into this. My main thought here is that I think the change may be bigger than it needs to be. My theory is that the "Repeat the warm up query 5 times" is the part of the change that makes it work and it may not need the other parts. I would be interested in whether a stripped down version of this with only that part of the change would fix the issue. http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py File tests/custom_cluster/test_hdfs_fd_caching.py: http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@179 PS1, Line 179: TestHdfsFdCaching One thing to watch out for here is if you inherit from TestHdfsFdCaching, you may inherit all of its test cases (e.g. test_caching_with_eviction()). I'm not sure how pytest deals with that. http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@184 PS1, Line 184: NUM_ROWS = 10 I believe that the number of rows should not impact the behavior. The differences between 10 and 100 is small enough that it shouldn't impact the IO pattern. They are all in a single parquet files, and we would do an IO for the footer and one IO for each column. http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@190 PS1, Line 190: cluster_size=1 I think this should not impact the caching. Our scheduling algorithm maps files to nodes consistently, so each node should be reading the same file each time. http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@207 PS1, Line 207: Read 5 times to make : # sure the data cache is fully warmed up. : for x in range(5): This makes a lot of sense to me as the explanation. There are times when data won't be written to the cache. In particular, there is a limit on concurrency in writing to the cache. I'm guessing we could hit that and fail to cache something the first time around. http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@224 PS1, Line 224: for x in range(5): I'm fine with either value here, but I also don't see this changing the behavior of the test. -- To view, visit http://gerrit.cloudera.org:8080/17054 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952 Gerrit-Change-Number: 17054 Gerrit-PatchSet: 1 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Wed, 10 Feb 2021 22:59:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16833 ) Change subject: IMPALA-10496: SAML implementation in Impala .. Patch Set 23: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16833 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa Gerrit-Change-Number: 16833 Gerrit-PatchSet: 23 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Comment-Date: Wed, 10 Feb 2021 21:06:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16242 ) Change subject: IMPALA-9979: part 2: partitioned top-n .. Patch Set 35: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16242 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 Gerrit-Change-Number: 16242 Gerrit-PatchSet: 35 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 10 Feb 2021 18:13:52 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16242 ) Change subject: IMPALA-9979: part 2: partitioned top-n .. Patch Set 35: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6882/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16242 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 Gerrit-Change-Number: 16242 Gerrit-PatchSet: 35 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 10 Feb 2021 18:13:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 65: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8121/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 65 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Feb 2021 17:24:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 64: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8120/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 64 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Feb 2021 17:22:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#65). ( http://gerrit.cloudera.org:8080/16720 ) Change subject: IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate This patch adds a new class of predicates called overlap predicates to aid in the acceptance or rejection of a row group, a page, or a row in a Parquet table, utilizing the minimal and the maximal values gathered from an equi hash join and the Parquet column index stats. When a row group or page is rejected, all contained rows within are rejected all together. For example in the following query, the min and max in the overlap predicate are computed from the join column from table 'b', and are compared against the min/max of each row group or page at the scan node for 'a'. select straight_join count(*) from lineitem a join [SHUFFLE] lineitem b where a.l_shipdate = b.l_receiptdate and b.l_commitdate = "1992-01-31"; An overlap predicate associated with the column type B in hash table and scan column type A will be formed when both A and B are of or can be converted to as: 1. booleans; 2. integers (tinyint, smallint, int, or bigint); 3. approximate numeric (float or double); 4. decimals with the same precision and scale; 5. strings; 6. date; or 7. timestamps. The overlap predicate is implemented as a min/max filter and can be observed in the explain output of a query. A new query option 'minmax_filter_threshold' is provided to control the new feature. Setting it to 0.0 disables the feature. Setting it to a value > 0.0 but less than 1.0 provides a threshold. An overlap predicate will be evaluated against a row group and possibly the containing pages/rows, as long as its overlap ratio is less than the threshold. The overlap ratio is the common area of the row group and the filter, divided by the area of the row group. A second query option, minmax_filtering_level, is provided to specify the filtering scope: 1. ROW_GROUP: the overlap is only tested for row groups; 2. PAGE: the overlap is tested for both row groups and pages; 3. ROW: the overlap is for row groups, pages and rows. Two new run-time profile counters are added to report the number of row groups or pages filtered out via the overlap predicates respectively: 1. NumRuntimeFilteredRowGroups 2. NumRuntimeFilteredPages Testing: 1. Unit tested on various column types with TPCH and TPCDS tables. Benefits were significant when the join column on the outer table is sorted and there exist many row groups or pages no overlapping with the min/max filters; 2. Added following new tests: a) In overlap_min_max_filters.test to demonstrate the number of filtered out pages and row groups with the two new profile counters; b) In runtime-filter-propagation.test to demonstrate that the overlap predicates work with different column types; 3. Core testing; 4. Performance measurement: the overal improvement with 3TB scale TPCDS is at 1.45% with the threshold at 0.5 and filtering level at ROW_GROUP. Good improvement (over 10%) are seen wtih query 16, 25, 62, 83, 94 and 99, due to the join column ship_date_sk being strongly correlated to the partition column sold_date_sk. To do in follow-up JIRAs: 1. Improve filtering efficiency; 2. Apply the overlap predicate on partition columns; 3. IR code-gen for various MinMaxFilter::EvalOverlap methods. Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 --- M be/src/exec/exec-node.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-column-stats.cc M be/src/exec/parquet/parquet-column-stats.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/scan-node.cc M be/src/runtime/coordinator.cc M be/src/runtime/date-value.cc M be/src/runtime/date-value.h M be/src/runtime/raw-value.h M be/src/runtime/runtime-filter-ir.cc M be/src/runtime/string-value-test.cc M be/src/runtime/string-value.cc M be/src/runtime/string-value.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M be/src/util/min-max-filter-ir.cc M be/src/util/min-max-filter-test.cc M be/src/util/min-max-filter.cc M be/src/util/min-max-filter.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M
[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 64: (1 comment) http://gerrit.cloudera.org:8080/#/c/16720/64/tests/run-tests.py File tests/run-tests.py: http://gerrit.cloudera.org:8080/#/c/16720/64/tests/run-tests.py@216 PS64, Line 216: # flake8: E265 block comment should start with '# ' -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 64 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Feb 2021 17:03:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Qifan Chen has uploaded a new patch set (#64). ( http://gerrit.cloudera.org:8080/16720 ) Change subject: IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate This patch adds a new class of predicates called overlap predicates to aid in the acceptance or rejection of a row group, a page, or a row in a Parquet table, utilizing the minimal and the maximal values gathered from an equi hash join and the Parquet column index stats. When a row group or page is rejected, all contained rows within are rejected all together. For example in the following query, the min and max in the overlap predicate are computed from the join column from table 'b', and are compared against the min/max of each row group or page at the scan node for 'a'. select straight_join count(*) from lineitem a join [SHUFFLE] lineitem b where a.l_shipdate = b.l_receiptdate and b.l_commitdate = "1992-01-31"; An overlap predicate associated with the column type B in hash table and scan column type A will be formed when both A and B are of or can be converted to as: 1. booleans; 2. integers (tinyint, smallint, int, or bigint); 3. approximate numeric (float or double); 4. decimals with the same precision and scale; 5. strings; 6. date; or 7. timestamps. The overlap predicate is implemented as a min/max filter and can be observed in the explain output of a query. A new query option 'minmax_filter_threshold' is provided to control the new feature. Setting it to 0.0 disables the feature. Setting it to a value > 0.0 but less than 1.0 provides a threshold. An overlap predicate will be evaluated against a row group and possibly the containing pages/rows, as long as its overlap ratio is less than the threshold. The overlap ratio is the common area of the row group and the filter, divided by the area of the row group. A second query option, minmax_filtering_level, is provided to specify the filtering scope: 1. ROW_GROUP: the overlap is only tested for row groups; 2. PAGE: the overlap is tested for both row groups and pages; 3. ROW: the overlap is for row groups, pages and rows. Two new run-time profile counters are added to report the number of row groups or pages filtered out via the overlap predicates respectively: 1. NumRuntimeFilteredRowGroups 2. NumRuntimeFilteredPages Testing: 1. Unit tested on various column types with TPCH and TPCDS tables. Benefits were significant when the join column on the outer table is sorted and there exist many row groups or pages no overlapping with the min/max filters; 2. Added following new tests: a) In overlap_min_max_filters.test to demonstrate the number of filtered out pages and row groups with the two new profile counters; b) In runtime-filter-propagation.test to demonstrate that the overlap predicates work with different column types; 3. Core testing; 4. Performance measurement: the overal improvement with 3TB scale TPCDS is at 1.45% with the threshold at 0.5 and filtering level at ROW_GROUP. Good improvement (over 10%) are seen wtih query 16, 25, 62, 83, 94 and 99, due to the join column ship_date_sk being strongly correlated to the partition column sold_date_sk. To do in follow-up JIRAs: 1. Improve filtering efficiency; 2. Apply the overlap predicate on partition columns; 3. IR code-gen for various MinMaxFilter::EvalOverlap methods. Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 --- M be/src/exec/exec-node.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-column-stats.cc M be/src/exec/parquet/parquet-column-stats.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/scan-node.cc M be/src/runtime/coordinator.cc M be/src/runtime/date-value.cc M be/src/runtime/date-value.h M be/src/runtime/raw-value.h M be/src/runtime/runtime-filter-ir.cc M be/src/runtime/string-value-test.cc M be/src/runtime/string-value.cc M be/src/runtime/string-value.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M be/src/util/min-max-filter-ir.cc M be/src/util/min-max-filter-test.cc M be/src/util/min-max-filter.cc M be/src/util/min-max-filter.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java M fe/src/main/java/org/apache/impala/analysis/Predicate.java M
[Impala-ASF-CR] IMPALA-10161: User LDAP Search bind support
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17047 ) Change subject: IMPALA-10161: User LDAP Search bind support .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8119/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17047 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I978744ad05d9ef408328d1e4dd2d18c329f4d3b7 Gerrit-Change-Number: 17047 Gerrit-PatchSet: 2 Gerrit-Owner: Tamas Mate Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tamas Mate Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 10 Feb 2021 16:37:59 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10161: User LDAP Search bind support
Tamas Mate has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/17047 ) Change subject: IMPALA-10161: User LDAP Search bind support .. IMPALA-10161: User LDAP Search bind support This change adds user search bind support next to simple bind that can be configured with LDAP filters. The group check was done with LDAP search earlier, this change adds the possibility to configure it with Hadoop library like options, which is the LDAP filter with optional patterns. The '{0}' will be replaced with the user name while the '{1}' pattern will be replaced with the user dn. The following new flags have been added: --ldap_search_bind_authentication: a flag to change between simple and search bind --ldap_user_search_basedn: the base dn for the LDAP subtree to search --ldap_group_search_basedn: the base dn for the LDAP subtree to search Tested: - Custom cluster tests have been added Change-Id: I978744ad05d9ef408328d1e4dd2d18c329f4d3b7 --- M be/src/rpc/authentication.cc M be/src/util/CMakeLists.txt A be/src/util/ldap-search-bind.cc A be/src/util/ldap-search-bind.h A be/src/util/ldap-simple-bind.cc A be/src/util/ldap-simple-bind.h M be/src/util/ldap-util.cc M be/src/util/ldap-util.h M be/src/util/webserver.cc M fe/src/test/java/org/apache/impala/customcluster/LdapImpalaShellTest.java A fe/src/test/java/org/apache/impala/customcluster/LdapSearchBindImpalaShellTest.java A fe/src/test/java/org/apache/impala/customcluster/LdapSimpleBindImpalaShellTest.java M fe/src/test/java/org/apache/impala/testutil/LdapUtil.java M fe/src/test/resources/users.ldif 14 files changed, 901 insertions(+), 334 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/17047/2 -- To view, visit http://gerrit.cloudera.org:8080/17047 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I978744ad05d9ef408328d1e4dd2d18c329f4d3b7 Gerrit-Change-Number: 17047 Gerrit-PatchSet: 2 Gerrit-Owner: Tamas Mate Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tamas Mate Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17039 ) Change subject: IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8118/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17039 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e Gerrit-Change-Number: 17039 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Feb 2021 15:43:09 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16833 ) Change subject: IMPALA-10496: SAML implementation in Impala .. Patch Set 22: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8117/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16833 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa Gerrit-Change-Number: 16833 Gerrit-PatchSet: 22 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Comment-Date: Wed, 10 Feb 2021 15:36:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17039 ) Change subject: IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8116/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17039 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e Gerrit-Change-Number: 17039 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Feb 2021 15:35:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16833 ) Change subject: IMPALA-10496: SAML implementation in Impala .. Patch Set 23: PS 22 is a one line fix for the ldap test failures introduced + unintended removal of the rebase that was done by jenkins. -- To view, visit http://gerrit.cloudera.org:8080/16833 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa Gerrit-Change-Number: 16833 Gerrit-PatchSet: 23 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Comment-Date: Wed, 10 Feb 2021 15:23:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17039 ) Change subject: IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution .. Patch Set 3: (4 comments) http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@291 PS1, Line 291: inline uint64_t OrcComplexColumnReader::GetTargetColId( : const SlotDescriptor* slot_desc) const { : return slot_desc->type().IsCollectionType() ? : > This is no longer used. Done http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@296 PS1, Line 296: } > This is no longer used too. Done http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@311 PS1, Line 311: * Return the result in '*child' and its index inside the children. Returns false for > nit: inline? Done http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@316 PS1, Line 316: DCHECK(parent.getKind() == orc::TypeKind::STRUCT || > Could you move the original comments of FindChild here? Added a bit more verbose comment. -- To view, visit http://gerrit.cloudera.org:8080/17039 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e Gerrit-Change-Number: 17039 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Feb 2021 15:23:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16833 ) Change subject: IMPALA-10496: SAML implementation in Impala .. Patch Set 23: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6881/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16833 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa Gerrit-Change-Number: 16833 Gerrit-PatchSet: 23 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Comment-Date: Wed, 10 Feb 2021 15:21:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution
Hello Quanlong Huang, Gabor Kaszab, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17039 to look at the new patch set (#3). Change subject: IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution .. IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution Currently ORC schema resolution is position-based only. Then the positional information is reused during ORC column reader creation. This prevents adding other column resolution strategies, e.g. column resolution by name or Iceberg field id. It also prevents schema evolution, as table metadata and file metadata should be in sync. This patch makes column reader creation independent of schema resolution. It does this by creating a mapping between slot/tuple descriptors and ORC type ids during schema resolution and use this mapping during column reader creation. Now further patches just need to add support for other column resolution strategies, the column readers will be created accordingly. Testing: * no additional tests as it is neither a bug fix nor a new feature Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e --- M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/orc-metadata-utils.cc M be/src/exec/orc-metadata-utils.h M tests/query_test/test_scanners.py 7 files changed, 71 insertions(+), 137 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/17039/3 -- To view, visit http://gerrit.cloudera.org:8080/17039 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e Gerrit-Change-Number: 17039 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16833 ) Change subject: IMPALA-10496: SAML implementation in Impala .. Patch Set 22: (37 comments) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/AuthTokenGenerator.java File fe/src/main/java/org/apache/impala/authentication/saml/AuthTokenGenerator.java: http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/AuthTokenGenerator.java@20 PS22, Line 20: // copy of https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/AuthTokenGenerator.java line too long (135 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java File fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java: http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java@32 PS22, Line 32: // https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSamlAuthTokenGenerator.java line too long (135 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java@52 PS22, Line 52: private static final Logger LOG = LoggerFactory.getLogger(HiveSamlAuthTokenGenerator.class); line too long (94 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlGroupNameFilter.java File fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlGroupNameFilter.java: http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlGroupNameFilter.java@31 PS22, Line 31: // https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSamlGroupNameFilter.java line too long (132 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateInfo.java File fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateInfo.java: http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateInfo.java@20 PS22, Line 20: // copy of https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSamlRelayStateInfo.java line too long (139 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateStore.java File fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateStore.java: http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateStore.java@32 PS22, Line 32: // slightly modified copy of https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSamlRelayStateInfo.java line too long (157 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java File fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java: http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@43 PS22, Line 43: // modified version of https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSaml2Client.java line too long (144 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@69 PS22, Line 69: //TODO handle the replayCache as described in http://www.pac4j.org/docs/clients/saml.html line too long (93 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@148 PS22, Line 148: // This is done to keep original structure by Vihang + keep ImpalaSamlClient as the only line too long (92 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@188 PS22, Line 188: // https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java line too long (129 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@189 PS22, Line 189: private String doSamlAuth(WrappedWebContext webContext) throws HttpSamlAuthenticationException { line too long (98 > 90) http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/service/BackendConfig.java File fe/src/main/java/org/apache/impala/service/BackendConfig.java:
[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala
Hello Thomas Tauber-Marshall, Vihang Karajgaonkar, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16833 to look at the new patch set (#22). Change subject: IMPALA-10496: SAML implementation in Impala .. IMPALA-10496: SAML implementation in Impala The bulk of the SAML2 related code is done on Java side because: - There is already a POC in Hive that could be reused. - The only SAML lib for c++ seems to be OpenSaml, which is seemed quite hard to use and a heavy dependency. Doing authentication in Java needed some plumbing, as the hs2-http port is listened to in c++ and http related processing happens in THttpServer/THttpTransport, which is not a "real" web server, just a simple http implementation that processes the headers and passes content to the thrift service. - Http headers (and in one case body) are inspected and if it is SAML related, the http request is wrapped in TWrappedHttpRequest and sent to the Frontend. The Frontend processes it and returns a TWrappedHttpResponse with the info to return to the client. - After the last SAML message (with the bearer token) we generate an auth cookie in c++ (which can be validated in c++), so later requests in the session don't need to call to Java. State of implementation: - The java side is more or less ok, will be updated when the Hive implementation changes. I would do a proper cleanup / documentation once the Hive code is more final. - Compatibility with other auth mechanisms should be decided: - Whether other clients should be able to auth with ldap/kerberos is not clear yet. Testing: - Added EE tests that use Python's urllib2 to sent SAML requests to Impala. Impala works slightly differently during tests (saml2_ee_test_mode=true). Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa --- M be/src/common/global-flags.cc M be/src/rpc/auth-provider.h M be/src/rpc/authentication-test.cc M be/src/rpc/authentication.cc M be/src/rpc/authentication.h M be/src/rpc/hs2-http-test.cc M be/src/rpc/thrift-server.h M be/src/service/frontend.cc M be/src/service/frontend.h M be/src/service/impala-server.cc M be/src/transport/THttpServer.cpp M be/src/transport/THttpServer.h M be/src/transport/THttpTransport.cpp M be/src/transport/THttpTransport.h M be/src/util/backend-gflag-util.cc M bin/rat_exclude_files.txt M common/thrift/BackendGflags.thrift M common/thrift/Frontend.thrift M common/thrift/metrics.json M fe/pom.xml A fe/src/main/java/org/apache/impala/authentication/saml/AuthTokenGenerator.java A fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java A fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlGroupNameFilter.java A fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlHttpServlet.java A fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateInfo.java A fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateStore.java A fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlUtils.java A fe/src/main/java/org/apache/impala/authentication/saml/HttpSamlAuthenticationException.java A fe/src/main/java/org/apache/impala/authentication/saml/HttpSamlNoGroupsMatchedException.java A fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java A fe/src/main/java/org/apache/impala/authentication/saml/NullSessionStore.java A fe/src/main/java/org/apache/impala/authentication/saml/WrappedWebContext.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/main/java/org/apache/impala/service/JniFrontend.java M java/pom.xml A testdata/authentication/saml2_sso.jks A testdata/authentication/saml2_sso_metadata.xml A tests/custom_cluster/test_saml2_sso.py 39 files changed, 2,142 insertions(+), 49 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/33/16833/22 -- To view, visit http://gerrit.cloudera.org:8080/16833 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa Gerrit-Change-Number: 16833 Gerrit-PatchSet: 22 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Vihang Karajgaonkar
[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution
Hello Quanlong Huang, Gabor Kaszab, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17039 to look at the new patch set (#2). Change subject: IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution .. IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution Currently ORC schema resolution is position-based only. Then the positional information is reused during ORC column reader creation. This prevents adding other column resolution strategies, e.g. column resolution by name or Iceberg field id. It also prevents schema evolution, as table metadata and file metadata should be in sync. This patch makes column reader creation independent of schema resolution. It does this by creating a mapping between slot/tuple descriptors and ORC type ids during schema resolution and use this mapping during column reader creation. Now further patches just need to add support for other column resolution strategies, the column readers will be created accordingly. Testing: * no additional tests as it is neither a bug fix nor a new feature Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e --- M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/orc-metadata-utils.cc M be/src/exec/orc-metadata-utils.h M tests/query_test/test_scanners.py 7 files changed, 69 insertions(+), 137 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/17039/2 -- To view, visit http://gerrit.cloudera.org:8080/17039 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e Gerrit-Change-Number: 17039 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 ) Change subject: IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate .. Patch Set 63: (4 comments) Left a few comments, mostly focusing on the BE. But looks good overall. http://gerrit.cloudera.org:8080/#/c/16720/63//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16720/63//COMMIT_MSG@16 PS63, Line 16: followign nit: following http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/exec/parquet/hdfs-parquet-scanner.cc@722 PS63, Line 722: if (FindMinMaxFilter(desc.filter_id)) { : return true; : } shouldn't we check FilterStats.enabled_for_page here? http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/exec/parquet/hdfs-parquet-scanner.cc@797 PS63, Line 797: continue nit: Since we have a 'continue' here there's no need to put the EvaluateOverlapForRowGroup() into an else branch. http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/util/min-max-filter.cc File be/src/util/min-max-filter.cc: http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/util/min-max-filter.cc@541 PS63, Line 541: // TODO: implement the overlap computation for TimestampValue. : return 0.0; This will cause the row groups to be rejected by the HdfsParquetScanner if we have a min/max filter on timestamp. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 63 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Feb 2021 14:18:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions
Fucun Chu has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 ) Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions .. Patch Set 4: The test is being processed, update the document after completion -- To view, visit http://gerrit.cloudera.org:8080/16656 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a Gerrit-Change-Number: 16656 Gerrit-PatchSet: 4 Gerrit-Owner: Fucun Chu Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 10 Feb 2021 14:07:30 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theta estimate() functions
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17008 ) Change subject: IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8115/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17008 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc Gerrit-Change-Number: 17008 Gerrit-PatchSet: 3 Gerrit-Owner: Fucun Chu Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 10 Feb 2021 13:55:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theta estimate() functions
Fucun Chu has posted comments on this change. ( http://gerrit.cloudera.org:8080/17008 ) Change subject: IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions .. Patch Set 3: (7 comments) http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@7 PS2, Line 7: ds_theta_estimate > nit: typo Done http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@13 PS2, Line 13: ds_theta_estimate > nit: same typo Done http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@28 PS2, Line 28:see IMPALA-10464. > I'd also include some highlights from that perf measurement doc into the co Done http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc@1646 PS2, Line 1646: SerializeDsThetaSketch( > In contrast with HLL as I see Theta doesn't compact the sketch just seriali Done http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc@1899 PS2, Line 1899: or dst->len == sizeof(datasketches::theta_union)); > I;m a bit lost here. Could you help me understand why is it needed to conve Previously, it was processed along the idea that the size of dst is unchanged, and it is better to return union_sketch. http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/datasketches-functions-ir.cc File be/src/exprs/datasketches-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/datasketches-functions-ir.cc@110 PS2, Line 110: return 0; > HLL returns a null here. Have you checked the behaviour in Hive to be in sy Comparing the test cases of HLL and Theta, the results are different. Theta: https://github.com/apache/datasketches-hive/blob/master/src/test/java/org/apache/datasketches/hive/theta/EstimateSketchUDFTest.java#L34 HLL: https://github.com/apache/datasketches-hive/blob/master/src/test/java/org/apache/datasketches/hive/hll/SketchToEstimateUDFTest.java#L31 http://gerrit.cloudera.org:8080/#/c/17008/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test: http://gerrit.cloudera.org:8080/#/c/17008/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test@138 PS2, Line 138: # Check that ds_theta_estimate returns error for strings that are not serialized sketches. > Please add a test when ds_theta_estimate() is used on an HLL sketch. I gues Done -- To view, visit http://gerrit.cloudera.org:8080/17008 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc Gerrit-Change-Number: 17008 Gerrit-PatchSet: 3 Gerrit-Owner: Fucun Chu Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 10 Feb 2021 13:38:14 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theta estimate() functions
Fucun Chu has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/17008 ) Change subject: IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions .. IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions These functions can be used to get cardinality estimates of data using Theta algorithm from Apache DataSketches. ds_theta_sketch() receives a dataset, e.g. a column from a table, and returns a serialized Theta sketch in string format. This can be written to a table or be fed directly to ds_theta_estimate() that returns the cardinality estimate for that sketch. Similar to the HLL sketch, the primary use-case for the Theta sketch is for counting distinct values as a stream, and then merging multiple sketches together for a total distinct count. For more details about Apache DataSketches' Theta see: https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on tpch25_parquet.lineitem to compare perfomance with ds_hll_*. HLL and Theta gives closer estimate except for string, see IMPALA-10464. Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc --- M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions-test.cc M be/src/exprs/aggregate-functions.h M be/src/exprs/datasketches-functions-ir.cc M be/src/exprs/datasketches-functions.h M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M testdata/data/README A testdata/data/theta_sketches_from_hive.parquet A testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test M tests/query_test/test_datasketches.py 11 files changed, 445 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/17008/3 -- To view, visit http://gerrit.cloudera.org:8080/17008 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc Gerrit-Change-Number: 17008 Gerrit-PatchSet: 3 Gerrit-Owner: Fucun Chu Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17039 ) Change subject: IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution .. Patch Set 1: (5 comments) Good to see a negative size patch! Thanks for the simplification! I'm going to have a detailed look. Left some minor comments first. http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@291 PS1, Line 291: inline bool PathContains(const SchemaPath& path, const SchemaPath& sub_path) { : return path.size() >= sub_path.size() && : std::equal(sub_path.begin(), sub_path.end(), path.begin()); : } This is no longer used. http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@296 PS1, Line 296: inline const SchemaPath& GetTargetColPath(const SlotDescriptor* slot_desc) { This is no longer used too. http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@311 PS1, Line 311: bool IsDescendant(const orc::Type& node, uint64_t candidate_col_id) { nit: inline? http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@316 PS1, Line 316: Could you move the original comments of FindChild here? http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@606 PS1, Line 606: // We have a position slot descriptor if it refers to this LIST ORC type, but it isn't : // a collection slot. It may be useful to mention "See how position slots are resolved in HdfsOrcScanner::ResolveColumns()." -- To view, visit http://gerrit.cloudera.org:8080/17039 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e Gerrit-Change-Number: 17039 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Wed, 10 Feb 2021 09:57:06 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10482, IMPALA-10493: Fix bugs in full ACID collection query rewrites
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17038 ) Change subject: IMPALA-10482, IMPALA-10493: Fix bugs in full ACID collection query rewrites .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/17038/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17038/3//COMMIT_MSG@34 PS3, Line 34: When AcidRewriter.splitCollectionRef() creates a new collection ref : it doesn't copy every information needed to correctly execute the : query. E.g. it dropped the ON clause, turning INNER joins to CROSS : joins. I think copying the join stuffs is not enough. We are rewritting "A join B" to "A join B1 join B2". This is ok for inner join but may be incorrect for outer joins. Here is an example, the two queries produce different results: impala> with v as ( select ('k4') as key union all values ('k1'), ('k2'), ('k3') ) select * from v left join functional_parquet.complextypestbl.int_map using (key); +-+--+---+ | key | key | value | +-+--+---+ | k1 | k1 | -1| | k1 | k1 | 1 | | k2 | k2 | 100 | | k1 | k1 | 2 | | k2 | k2 | NULL | | k1 | k1 | NULL | | k3 | k3 | NULL | | k4 | NULL | NULL | +-+--+---+ Fetched 8 row(s) in 3.35s impala> with v as ( select ('k4') as key union all values ('k1'), ('k2'), ('k3') ) select * from v left join functional_orc_def.complextypestbl.int_map using (key); +-+-+---+ | key | key | value | +-+-+---+ | k1 | k1 | -1| | k1 | k1 | 1 | | k2 | k2 | 100 | | k1 | k1 | 2 | | k2 | k2 | NULL | | k1 | k1 | NULL | | k3 | k3 | NULL | +-+-+---+ Fetched 7 row(s) in 0.35s Maybe we need to replace the original table ref with an inlineview (similar to what we've done for column masking). -- To view, visit http://gerrit.cloudera.org:8080/17038 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8fc758d3c1e75c7066936d590aec8bff8d2b00b0 Gerrit-Change-Number: 17038 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Feb 2021 08:51:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16909 ) Change subject: IMPALA-5675: Support UTF-8 Varchar and Char .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8114/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16909 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f Gerrit-Change-Number: 16909 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Wed, 10 Feb 2021 08:30:43 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16909 ) Change subject: IMPALA-5675: Support UTF-8 Varchar and Char .. Patch Set 13: Fixed some issues caused by unexpected data loading. -- To view, visit http://gerrit.cloudera.org:8080/16909 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f Gerrit-Change-Number: 16909 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Wed, 10 Feb 2021 08:11:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16909 to look at the new patch set (#13). Change subject: IMPALA-5675: Support UTF-8 Varchar and Char .. IMPALA-5675: Support UTF-8 Varchar and Char This patch adds support for UTF-8 aware varchar and char types. In UTF-8 mode, when truncating UTF-8 varchar(N) and char(N) strings, lengths will be counted by UTF-8 characters instead of bytes. So the result string will have up to N UTF-8 characters. The UTF8_MODE query option is first detected in FE when analyzing the query. A 'is_utf8' label is added in Exprs and SlotDescriptors. They are used in generating thrift objects and computing the tuple layouts. A char(N) slot will occupy 4 * N bytes if it's in UTF-8 type, because a UTF-8 character can be encoded into 1~4 bytes. The slot will store up to N UTF-8 characters. There is a gotcha that we should not add the label in Type.java, because Type instances are shared across the FE. Query compilation reuses the Type instances from the metadata. If we modify Type instances during compilation, other queries in non-UTF8 mode will be affected. However, in BE, we need the type related classes (e.g. ColumnType, TypeDesc) to carry in the utf8 markers. It's impractical to check the UTF8_MODE query option everywhere it needs to be. E.g. in AnyValUtil::SetAnyVal we can't access the query options. So we add the 'is_utf8' marker in TScalarType, ColumnType, TypeDesc to conveniently distinguish char(N) and varchar(N) types in UTF-8 mode. When generating thrift objects in FE, Exprs and SlotDescriptors deliver 'is_utf8' markers to TScalaTypes. They finally landed in ColumnType and TypeDesc instances. Given the correct UTF-8 mode checked, we just need to truncate/pad the char/varchar strings with their length counted by UTF-8 characters. Since char(N) slots always occupy 4N bytes, when converting char(N) to other string types, we need to re-calculate the actual length corresponding to N UTF-8 characters. We can optimize this in later patches, e.g. store the UTF-8 length in the slot, or deal with UTF-8 char(N) by the same way as varchar(N), i.e. reallocate the string space and just store the pointer and length in the slot. Tests: - Add tests for reading char(N) and varchar(N) columns in UTF8_MODE. - Add truncating/padding tests - Kudu only supports Varchar currently. Add special tests for Kudu. - Add tests for writing CHAR(N)/VARCHAR(N) in UTF-8 mode. Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f --- M be/src/codegen/codegen-anyval.cc M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/llvm-codegen.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner-test.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/kudu-scanner.cc M be/src/exec/kudu-table-sink.cc M be/src/exec/kudu-util.cc M be/src/exec/kudu-util.h M be/src/exec/orc-column-readers.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-column-readers.cc M be/src/exec/parquet/parquet-column-stats.inline.h M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-plain-test.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.inline.h M be/src/exprs/agg-fn-evaluator.cc M be/src/exprs/anyval-util.cc M be/src/exprs/anyval-util.h M be/src/exprs/cast-functions-ir.cc M be/src/exprs/scalar-expr-evaluator.cc M be/src/exprs/scalar-fn-call.cc M be/src/exprs/slot-ref.cc M be/src/runtime/raw-value-ir.cc M be/src/runtime/raw-value.cc M be/src/runtime/raw-value.inline.h M be/src/runtime/tuple.cc M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/fe-support.cc M be/src/service/hs2-util.cc M be/src/udf/udf-internal.h M be/src/udf/udf.cc M be/src/udf/udf.h M be/src/util/CMakeLists.txt M be/src/util/bit-util.h M be/src/util/dict-encoding.h M be/src/util/string-util-test.cc M be/src/util/string-util.cc M be/src/util/string-util.h M be/src/util/tuple-row-compare.cc M common/thrift/Types.thrift M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/service/Frontend.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/queries/QueryTest/kudu_create.test A testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test A