[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

2021-02-10 Thread Fucun Chu (Code Review)
Fucun Chu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() 
functions
..


Patch Set 4:

Performance comparison between ds_hll_* and ds_cpc_* functions see: 
https://issues.apache.org/jira/browse/IMPALA-10500


--
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 11 Feb 2021 06:58:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 4: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6884/


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 06:50:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17054 )

Change subject: IMPALA-10497: Fix flakiness in 
test_no_fd_caching_on_cached_data.
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8125/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17054
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952
Gerrit-Change-Number: 17054
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Thu, 11 Feb 2021 03:19:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.

2021-02-10 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17054 )

Change subject: IMPALA-10497: Fix flakiness in 
test_no_fd_caching_on_cached_data.
..


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py
File tests/custom_cluster/test_hdfs_fd_caching.py:

http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@179
PS1, Line 179: le handles cached
> One thing to watch out for here is if you inherit from TestHdfsFdCaching, y
Done


http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@184
PS1, Line 184: # handle ca
> I believe that the number of rows should not impact the behavior. The diffe
Done


http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@190
PS1, Line 190:
> I will remove this so the cluster size back to default.
I change my mind, I think we should keep cluster_size=1 here.

On start up, impalad delete the old cache file by listing file matching 
CACHE_FILE_PREFIX and delete them one by one.
https://github.com/apache/impala/blob/b42c649/be/src/runtime/io/data-cache.cc#L415
So if cluster_size > 1, multiple impalad might race to delete the same old 
cache file and crash the losing ones.

Also, setting cluster_size=1 does not change the behavior being tested by this 
test, so I think it should be okay.


http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@207
PS1, Line 207:
 :   def cached_handles(self):
 : return self.get_ag
> This makes a lot of sense to me as the explanation. There are times when da
Done


http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@224
PS1, Line 224:
> I'm fine with either value here, but I also don't see this changing the beh
Done



--
To view, visit http://gerrit.cloudera.org:8080/17054
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952
Gerrit-Change-Number: 17054
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Thu, 11 Feb 2021 03:09:44 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.

2021-02-10 Thread Riza Suminto (Code Review)
Hello Joe McDonnell, Bikramjeet Vig, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17054

to look at the new patch set (#2).

Change subject: IMPALA-10497: Fix flakiness in 
test_no_fd_caching_on_cached_data.
..

IMPALA-10497: Fix flakiness in test_no_fd_caching_on_cached_data.

test_no_fd_caching_on_cached_data has been flaky for not having all of
the data fully cached in the warm-up phase. There is a limit on
concurrency in writing to the cache such that we may fail to cache data
the first time read it. This patch fixes the test by repeating the
warm-up query 5 times. This patch also reduce the test cluster size to
just 1 to avoid potential flakiness from multiple impalad racing to
delete the same old fata cache files.

Testing:
- Loop the test manually 100 times and see no more failures.

Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952
---
M tests/custom_cluster/test_hdfs_fd_caching.py
1 file changed, 7 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/17054/2
--
To view, visit http://gerrit.cloudera.org:8080/17054
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952
Gerrit-Change-Number: 17054
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 


[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.

2021-02-10 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17054 )

Change subject: IMPALA-10497: Fix flakiness in 
test_no_fd_caching_on_cached_data.
..


Patch Set 1:

(1 comment)

Thanks Joe for the feedback.
I will revert most of these and leave only the extra warm up part.

http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py
File tests/custom_cluster/test_hdfs_fd_caching.py:

http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@190
PS1, Line 190: cluster_size=1
> I think this should not impact the caching. Our scheduling algorithm maps f
I will remove this so the cluster size back to default.
Previously, when I kept cluster size as default (3), after several iteration of 
looping this test, one impalad failed with the following error.

I0210 17:50:00.781529 10340 mem_tracker.cc:83] Creating tracker 
/tmp-sharded_lru_cache->root
I0210 17:50:00.782321 10340 status.cc:129] Failed to delete old cache file 
/tmp/impala-cache-file-fc4060e3ab7e45b0:a680f0d3b05005a1: Not found: 
/tmp/impala-cache-file-fc4060e3ab7e45b0:a680f0d3b05005a1: No such file or 
directory (error 2)
@  0x2c008d7  _ZN6impala13GetStackTraceB5cxx11Ev
@  0x1c9003e  impala::Status::Status()
@  0x36016f9  
impala::io::DataCache::Partition::DeleteExistingFiles()
@  0x3601e22  impala::io::DataCache::Partition::Init()
@  0x360975e  impala::io::DataCache::Init()
@  0x35b4db0  impala::io::DiskIoMgr::Init()
@  0x251b229  impala::ExecEnv::Init()
@  0x29a0412  ImpaladMain()
@  0x1bdf4d3  main
@ 0x7faa8e3bebf7  __libc_start_main
@  0x1ae81da  _start
I0210 17:50:00.897989 10340 mem_tracker.cc:87] Destroying tracker 
/tmp-sharded_lru_cache->root

I was guessing that since all three write their data cache to common /tmp/ dir, 
one might accidentally delete cache file that belong to other impalad. Just 
running the test one time doesn't reveal this error.
So in this patchset 1, I follow examples from custom_cluster/test_data_cache.py 
to set custom_cluster=1.



--
To view, visit http://gerrit.cloudera.org:8080/17054
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952
Gerrit-Change-Number: 17054
Gerrit-PatchSet: 1
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Thu, 11 Feb 2021 02:18:19 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README

2021-02-10 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17058 )

Change subject: IMPALA-10470: Add link to quickstart container from README
..


Patch Set 1: Code-Review+2

Thanks for adding this. (The build failures look unrelated, probably some Maven 
server went bad.)


--
To view, visit http://gerrit.cloudera.org:8080/17058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
Gerrit-Change-Number: 17058
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:41:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17058 )

Change subject: IMPALA-10470: Add link to quickstart container from README
..


Patch Set 1:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/8123/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/17058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
Gerrit-Change-Number: 17058
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:38:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8124/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:37:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 3: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6883/


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:24:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6884/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:24:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 4: Code-Review+2

fix rat exclusdes


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:24:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)
Hello Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17050

to look at the new patch set (#4).

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..

IMPALA-9382: part 3/3 clean up runtime profile v2 text output

Eliminated some of the noisy per-instance counters
from DEFAULT verbosity.

Testing:
* Updated impala-profile-tool test with new output
* Added new impala-profile-tool test for v2 profile.

Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
---
M be/src/util/runtime-profile.cc
M bin/rat_exclude_files.txt
M testdata/impala-profiles/README
M 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt
A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt
M tests/observability/test_profile_tool.py
8 files changed, 5,760 insertions(+), 516 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/4
--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 2:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/8122/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:22:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10470: Add link to quickstart container from README

2021-02-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17058


Change subject: IMPALA-10470: Add link to quickstart container from README
..

IMPALA-10470: Add link to quickstart container from README

Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
---
M README.md
1 file changed, 9 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/17058/1
--
To view, visit http://gerrit.cloudera.org:8080/17058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: If76376557efffe9c7b8e02a5b840e128b343a74e
Gerrit-Change-Number: 17058
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6883/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 3: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 2: Code-Review+2

carry


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)
Hello Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17050

to look at the new patch set (#2).

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..

IMPALA-9382: part 3/3 clean up runtime profile v2 text output

Eliminated some of the noisy per-instance counters
from DEFAULT verbosity.

Testing:
* Updated impala-profile-tool test with new output
* Added new impala-profile-tool test for v2 profile.

Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
---
M be/src/util/runtime-profile.cc
M testdata/impala-profiles/README
M 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_default.expected.txt
A testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt
A 
testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_extended.expected.txt
M tests/observability/test_profile_tool.py
7 files changed, 5,757 insertions(+), 516 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/17050/2
--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py
File tests/observability/test_profile_tool.py:

http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py@50
PS1, Line 50: )
> flake8: E501 line too long (91 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/17050/1/tests/observability/test_profile_tool.py@53
PS1, Line 53: )
> flake8: E501 line too long (92 > 90 characters)
Done



--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 11 Feb 2021 01:02:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16242 )

Change subject: IMPALA-9979: part 2: partitioned top-n
..

IMPALA-9979: part 2: partitioned top-n

Planner changes:
---
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

This interacts with the limit pushdwon optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---
The top-n node in the backend is augmented to handle
the partitioned case, for which we use a std::map and a
comparator based on the partition exprs. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

Limitations:
---
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
Reviewed-on: http://gerrit.cloudera.org:8080/16242
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/exprs/slot-ref.h
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/priority-queue.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M 

[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16242 )

Change subject: IMPALA-9979: part 2: partitioned top-n
..


Patch Set 35: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16242
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
Gerrit-Change-Number: 16242
Gerrit-PatchSet: 35
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 10 Feb 2021 23:52:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9382: part 3/3 clean up runtime profile v2 text output

2021-02-10 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17050 )

Change subject: IMPALA-9382: part 3/3 clean up runtime profile v2 text output
..


Patch Set 1: Code-Review+2

I think this makes sense as a way to cut down the verbosity for the DEFAULT 
level.

The flake8 line length issues don't bother me that much given that it is driven 
by long filenames, but if there is an easy fix, go for it.


--
To view, visit http://gerrit.cloudera.org:8080/17050
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Gerrit-Change-Number: 17050
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Comment-Date: Wed, 10 Feb 2021 23:29:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10497: Fix flakiness in test no fd caching on cached data.

2021-02-10 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17054 )

Change subject: IMPALA-10497: Fix flakiness in 
test_no_fd_caching_on_cached_data.
..


Patch Set 1:

(5 comments)

Thanks for looking into this.

My main thought here is that I think the change may be bigger than it needs to 
be. My theory is that the "Repeat the warm up query 5 times" is the part of the 
change that makes it work and it may not need the other parts. I would be 
interested in whether a stripped down version of this with only that part of 
the change would fix the issue.

http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py
File tests/custom_cluster/test_hdfs_fd_caching.py:

http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@179
PS1, Line 179: TestHdfsFdCaching
One thing to watch out for here is if you inherit from TestHdfsFdCaching, you 
may inherit all of its test cases (e.g. test_caching_with_eviction()). I'm not 
sure how pytest deals with that.


http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@184
PS1, Line 184:   NUM_ROWS = 10
I believe that the number of rows should not impact the behavior. The 
differences between 10 and 100 is small enough that it shouldn't impact the IO 
pattern. They are all in a single parquet files, and we would do an IO for the 
footer and one IO for each column.


http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@190
PS1, Line 190: cluster_size=1
I think this should not impact the caching. Our scheduling algorithm maps files 
to nodes consistently, so each node should be reading the same file each time.


http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@207
PS1, Line 207: Read 5 times to make
 : # sure the data cache is fully warmed up.
 : for x in range(5):
This makes a lot of sense to me as the explanation. There are times when data 
won't be written to the cache. In particular, there is a limit on concurrency 
in writing to the cache. I'm guessing we could hit that and fail to cache 
something the first time around.


http://gerrit.cloudera.org:8080/#/c/17054/1/tests/custom_cluster/test_hdfs_fd_caching.py@224
PS1, Line 224: for x in range(5):
I'm fine with either value here, but I also don't see this changing the 
behavior of the test.



--
To view, visit http://gerrit.cloudera.org:8080/17054
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I774f9dfea7dcc107c3c7f2b76db3aaf4b2dd7952
Gerrit-Change-Number: 17054
Gerrit-PatchSet: 1
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Wed, 10 Feb 2021 22:59:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16833 )

Change subject: IMPALA-10496: SAML implementation in Impala
..


Patch Set 23: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16833
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa
Gerrit-Change-Number: 16833
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Vihang Karajgaonkar 
Gerrit-Comment-Date: Wed, 10 Feb 2021 21:06:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16242 )

Change subject: IMPALA-9979: part 2: partitioned top-n
..


Patch Set 35: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16242
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
Gerrit-Change-Number: 16242
Gerrit-PatchSet: 35
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 10 Feb 2021 18:13:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9979: part 2: partitioned top-n

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16242 )

Change subject: IMPALA-9979: part 2: partitioned top-n
..


Patch Set 35:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6882/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16242
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
Gerrit-Change-Number: 16242
Gerrit-PatchSet: 35
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Shant Hovsepian 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 10 Feb 2021 18:13:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 65:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8121/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 65
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Feb 2021 17:24:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 64:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8120/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 64
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Feb 2021 17:22:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-02-10 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#65). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..

IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on 
equi-join predicate

This patch adds a new class of predicates called overlap predicates
to aid in the acceptance or rejection of a row group, a page, or a
row in a Parquet table, utilizing the minimal and the maximal values
gathered from an equi hash join and the Parquet column index stats.
When a row group or page is rejected, all contained rows within are
rejected all together.

For example in the following query, the min and max in the overlap
predicate are computed from the join column from table 'b', and
are compared against the min/max of each row group or page at the
scan node for 'a'.

  select straight_join count(*)
  from lineitem a join [SHUFFLE] lineitem b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";

An overlap predicate associated with the column type B in hash
table and scan column type A will be formed when both A and B are
of or can be converted to as:
  1. booleans;
  2. integers (tinyint, smallint, int, or bigint);
  3. approximate numeric (float or double);
  4. decimals with the same precision and scale;
  5. strings;
  6. date; or
  7. timestamps.

The overlap predicate is implemented as a min/max filter and can be
observed in the explain output of a query.

A new query option 'minmax_filter_threshold' is provided to control
the new feature. Setting it to 0.0 disables the feature. Setting it
to a value > 0.0 but less than 1.0 provides a threshold. An overlap
predicate will be evaluated against a row group and possibly the
containing pages/rows, as long as its overlap ratio is less than the
threshold. The overlap ratio is the common area of the row group
and the filter, divided by the area of the row group.

A second query option, minmax_filtering_level, is provided to
specify the filtering scope:
  1. ROW_GROUP: the overlap is only tested for row groups;
  2. PAGE: the overlap is tested for both row groups and pages;
  3. ROW: the overlap is for row groups, pages and rows.

Two new run-time profile counters are added to report the number of
row groups or pages filtered out via the overlap predicates
respectively:
  1. NumRuntimeFilteredRowGroups
  2. NumRuntimeFilteredPages

Testing:
1. Unit tested on various column types with TPCH and TPCDS tables.
   Benefits were significant when the join column on the outer table
   is sorted and there exist many row groups or pages no overlapping
   with the min/max filters;
2. Added following new tests:
a) In overlap_min_max_filters.test to demonstrate the number of
   filtered out pages and row groups with the two new profile
   counters;
b) In runtime-filter-propagation.test to demonstrate that the
   overlap predicates work with different column types;
3. Core testing;
4. Performance measurement: the overal improvement with 3TB scale
   TPCDS is at 1.45% with the threshold at 0.5 and filtering level
   at ROW_GROUP. Good improvement (over 10%) are seen wtih query 16,
   25, 62, 83, 94 and 99, due to the join column ship_date_sk being
   strongly correlated to the partition column sold_date_sk.

To do in follow-up JIRAs:
1. Improve filtering efficiency;
2. Apply the overlap predicate on partition columns;
3. IR code-gen for various MinMaxFilter::EvalOverlap methods.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/raw-value.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/string-value-test.cc
M be/src/runtime/string-value.cc
M be/src/runtime/string-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/min-max-filter-ir.cc
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M 

[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 64:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16720/64/tests/run-tests.py
File tests/run-tests.py:

http://gerrit.cloudera.org:8080/#/c/16720/64/tests/run-tests.py@216
PS64, Line 216: #
flake8: E265 block comment should start with '# '



--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 64
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Feb 2021 17:03:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-02-10 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#64). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..

IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on 
equi-join predicate

This patch adds a new class of predicates called overlap predicates
to aid in the acceptance or rejection of a row group, a page, or a
row in a Parquet table, utilizing the minimal and the maximal values
gathered from an equi hash join and the Parquet column index stats.
When a row group or page is rejected, all contained rows within are
rejected all together.

For example in the following query, the min and max in the overlap
predicate are computed from the join column from table 'b', and
are compared against the min/max of each row group or page at the
scan node for 'a'.

  select straight_join count(*)
  from lineitem a join [SHUFFLE] lineitem b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";

An overlap predicate associated with the column type B in hash
table and scan column type A will be formed when both A and B are
of or can be converted to as:
  1. booleans;
  2. integers (tinyint, smallint, int, or bigint);
  3. approximate numeric (float or double);
  4. decimals with the same precision and scale;
  5. strings;
  6. date; or
  7. timestamps.

The overlap predicate is implemented as a min/max filter and can be
observed in the explain output of a query.

A new query option 'minmax_filter_threshold' is provided to control
the new feature. Setting it to 0.0 disables the feature. Setting it
to a value > 0.0 but less than 1.0 provides a threshold. An overlap
predicate will be evaluated against a row group and possibly the
containing pages/rows, as long as its overlap ratio is less than the
threshold. The overlap ratio is the common area of the row group
and the filter, divided by the area of the row group.

A second query option, minmax_filtering_level, is provided to
specify the filtering scope:
  1. ROW_GROUP: the overlap is only tested for row groups;
  2. PAGE: the overlap is tested for both row groups and pages;
  3. ROW: the overlap is for row groups, pages and rows.

Two new run-time profile counters are added to report the number of
row groups or pages filtered out via the overlap predicates
respectively:
  1. NumRuntimeFilteredRowGroups
  2. NumRuntimeFilteredPages

Testing:
1. Unit tested on various column types with TPCH and TPCDS tables.
   Benefits were significant when the join column on the outer table
   is sorted and there exist many row groups or pages no overlapping
   with the min/max filters;
2. Added following new tests:
a) In overlap_min_max_filters.test to demonstrate the number of
   filtered out pages and row groups with the two new profile
   counters;
b) In runtime-filter-propagation.test to demonstrate that the
   overlap predicates work with different column types;
3. Core testing;
4. Performance measurement: the overal improvement with 3TB scale
   TPCDS is at 1.45% with the threshold at 0.5 and filtering level
   at ROW_GROUP. Good improvement (over 10%) are seen wtih query 16,
   25, 62, 83, 94 and 99, due to the join column ship_date_sk being
   strongly correlated to the partition column sold_date_sk.

To do in follow-up JIRAs:
1. Improve filtering efficiency;
2. Apply the overlap predicate on partition columns;
3. IR code-gen for various MinMaxFilter::EvalOverlap methods.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/raw-value.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/string-value-test.cc
M be/src/runtime/string-value.cc
M be/src/runtime/string-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/min-max-filter-ir.cc
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M 

[Impala-ASF-CR] IMPALA-10161: User LDAP Search bind support

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17047 )

Change subject: IMPALA-10161: User LDAP Search bind support
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8119/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17047
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I978744ad05d9ef408328d1e4dd2d18c329f4d3b7
Gerrit-Change-Number: 17047
Gerrit-PatchSet: 2
Gerrit-Owner: Tamas Mate 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 10 Feb 2021 16:37:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10161: User LDAP Search bind support

2021-02-10 Thread Tamas Mate (Code Review)
Tamas Mate has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/17047 )

Change subject: IMPALA-10161: User LDAP Search bind support
..

IMPALA-10161: User LDAP Search bind support

This change adds user search bind support next to simple bind that can
be configured with LDAP filters. The group check was done with LDAP
search earlier, this change adds the possibility to configure it with
Hadoop library like options, which is the LDAP filter with optional
patterns. The '{0}' will be replaced with the user name while the
'{1}' pattern will be replaced with the user dn.

The following new flags have been added:
 --ldap_search_bind_authentication: a flag to change between simple and
search bind
 --ldap_user_search_basedn: the base dn for the LDAP subtree to search
 --ldap_group_search_basedn: the base dn for the LDAP subtree to search

Tested:
 - Custom cluster tests have been added

Change-Id: I978744ad05d9ef408328d1e4dd2d18c329f4d3b7
---
M be/src/rpc/authentication.cc
M be/src/util/CMakeLists.txt
A be/src/util/ldap-search-bind.cc
A be/src/util/ldap-search-bind.h
A be/src/util/ldap-simple-bind.cc
A be/src/util/ldap-simple-bind.h
M be/src/util/ldap-util.cc
M be/src/util/ldap-util.h
M be/src/util/webserver.cc
M fe/src/test/java/org/apache/impala/customcluster/LdapImpalaShellTest.java
A 
fe/src/test/java/org/apache/impala/customcluster/LdapSearchBindImpalaShellTest.java
A 
fe/src/test/java/org/apache/impala/customcluster/LdapSimpleBindImpalaShellTest.java
M fe/src/test/java/org/apache/impala/testutil/LdapUtil.java
M fe/src/test/resources/users.ldif
14 files changed, 901 insertions(+), 334 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/17047/2
--
To view, visit http://gerrit.cloudera.org:8080/17047
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I978744ad05d9ef408328d1e4dd2d18c329f4d3b7
Gerrit-Change-Number: 17047
Gerrit-PatchSet: 2
Gerrit-Owner: Tamas Mate 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17039 )

Change subject: IMPALA-10485: part(1): make ORC column reader creation 
independent of schema resolution
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8118/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17039
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e
Gerrit-Change-Number: 17039
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Feb 2021 15:43:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16833 )

Change subject: IMPALA-10496: SAML implementation in Impala
..


Patch Set 22:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8117/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16833
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa
Gerrit-Change-Number: 16833
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Vihang Karajgaonkar 
Gerrit-Comment-Date: Wed, 10 Feb 2021 15:36:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17039 )

Change subject: IMPALA-10485: part(1): make ORC column reader creation 
independent of schema resolution
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8116/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17039
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e
Gerrit-Change-Number: 17039
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Feb 2021 15:35:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala

2021-02-10 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16833 )

Change subject: IMPALA-10496: SAML implementation in Impala
..


Patch Set 23:

PS 22 is a one line fix for the ldap test failures introduced + unintended 
removal of the rebase that was done by jenkins.


--
To view, visit http://gerrit.cloudera.org:8080/16833
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa
Gerrit-Change-Number: 16833
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Vihang Karajgaonkar 
Gerrit-Comment-Date: Wed, 10 Feb 2021 15:23:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution

2021-02-10 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17039 )

Change subject: IMPALA-10485: part(1): make ORC column reader creation 
independent of schema resolution
..


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@291
PS1, Line 291: inline uint64_t OrcComplexColumnReader::GetTargetColId(
 : const SlotDescriptor* slot_desc) const {
 :   return slot_desc->type().IsCollectionType() ?
 :
> This is no longer used.
Done


http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@296
PS1, Line 296: }
> This is no longer used too.
Done


http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@311
PS1, Line 311:  * Return the result in '*child' and its index inside the 
children. Returns false for
> nit: inline?
Done


http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@316
PS1, Line 316:   DCHECK(parent.getKind() == orc::TypeKind::STRUCT ||
> Could you move the original comments of FindChild here?
Added a bit more verbose comment.



--
To view, visit http://gerrit.cloudera.org:8080/17039
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e
Gerrit-Change-Number: 17039
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Feb 2021 15:23:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16833 )

Change subject: IMPALA-10496: SAML implementation in Impala
..


Patch Set 23:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6881/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16833
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa
Gerrit-Change-Number: 16833
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Vihang Karajgaonkar 
Gerrit-Comment-Date: Wed, 10 Feb 2021 15:21:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution

2021-02-10 Thread Zoltan Borok-Nagy (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17039

to look at the new patch set (#3).

Change subject: IMPALA-10485: part(1): make ORC column reader creation 
independent of schema resolution
..

IMPALA-10485: part(1): make ORC column reader creation independent of schema 
resolution

Currently ORC schema resolution is position-based only. Then the
positional information is reused during ORC column reader creation.
This prevents adding other column resolution strategies, e.g.
column resolution by name or Iceberg field id. It also prevents
schema evolution, as table metadata and file metadata should be
in sync.

This patch makes column reader creation independent of schema
resolution. It does this by creating a mapping between slot/tuple
descriptors and ORC type ids during schema resolution and use
this mapping during column reader creation.

Now further patches just need to add support for other column
resolution strategies, the column readers will be created
accordingly.

Testing:
 * no additional tests as it is neither a bug fix nor a new feature

Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/orc-metadata-utils.h
M tests/query_test/test_scanners.py
7 files changed, 71 insertions(+), 137 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/17039/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17039
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e
Gerrit-Change-Number: 17039
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16833 )

Change subject: IMPALA-10496: SAML implementation in Impala
..


Patch Set 22:

(37 comments)

http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/AuthTokenGenerator.java
File 
fe/src/main/java/org/apache/impala/authentication/saml/AuthTokenGenerator.java:

http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/AuthTokenGenerator.java@20
PS22, Line 20: // copy of 
https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/AuthTokenGenerator.java
line too long (135 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java
File 
fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java:

http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java@32
PS22, Line 32: // 
https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSamlAuthTokenGenerator.java
line too long (135 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java@52
PS22, Line 52:   private static final Logger LOG = 
LoggerFactory.getLogger(HiveSamlAuthTokenGenerator.class);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlGroupNameFilter.java
File 
fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlGroupNameFilter.java:

http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlGroupNameFilter.java@31
PS22, Line 31: // 
https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSamlGroupNameFilter.java
line too long (132 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateInfo.java
File 
fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateInfo.java:

http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateInfo.java@20
PS22, Line 20: // copy of 
https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSamlRelayStateInfo.java
line too long (139 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateStore.java
File 
fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateStore.java:

http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateStore.java@32
PS22, Line 32: // slightly modified copy of 
https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSamlRelayStateInfo.java
line too long (157 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java
File 
fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java:

http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@43
PS22, Line 43: // modified version of 
https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/auth/saml/HiveSaml2Client.java
line too long (144 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@69
PS22, Line 69: //TODO handle the replayCache as described in 
http://www.pac4j.org/docs/clients/saml.html
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@148
PS22, Line 148: // This is done to keep original structure by Vihang + keep 
ImpalaSamlClient as the only
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@188
PS22, Line 188:   // 
https://github.com/vihangk1/hive/blob/master_saml/service/src/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java
line too long (129 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java@189
PS22, Line 189:   private String doSamlAuth(WrappedWebContext webContext) 
throws HttpSamlAuthenticationException {
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/16833/22/fe/src/main/java/org/apache/impala/service/BackendConfig.java
File fe/src/main/java/org/apache/impala/service/BackendConfig.java:


[Impala-ASF-CR] IMPALA-10496: SAML implementation in Impala

2021-02-10 Thread Csaba Ringhofer (Code Review)
Hello Thomas Tauber-Marshall, Vihang Karajgaonkar, Joe McDonnell, Impala Public 
Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16833

to look at the new patch set (#22).

Change subject: IMPALA-10496: SAML implementation in Impala
..

IMPALA-10496: SAML implementation in Impala

The bulk of the SAML2 related code is done on Java side because:
- There is already a POC in Hive that could be reused.
- The only SAML lib for c++ seems to be OpenSaml, which is seemed
  quite hard to use and a heavy dependency.

Doing authentication in Java needed some plumbing, as the hs2-http
port is listened to in c++ and http related processing happens in
THttpServer/THttpTransport, which is not a "real" web server, just
a simple http implementation that processes the headers and passes
content to the thrift service.
- Http headers (and in one case body) are inspected and if it is
  SAML related, the http request is wrapped in TWrappedHttpRequest
  and sent to the Frontend. The Frontend processes it and returns
  a TWrappedHttpResponse with the info to return to the client.
- After the last SAML message (with the bearer token) we generate
  an auth cookie in c++ (which can be validated in c++),  so later
  requests in the session don't need to call to Java.

State of implementation:
- The java side is more or less ok, will be updated when the Hive
  implementation changes. I would do a proper cleanup / documentation
  once the Hive code is more final.
- Compatibility with other auth mechanisms should be decided:
  - Whether other clients should be able to auth with ldap/kerberos
is not clear yet.

Testing:
- Added EE tests that use Python's urllib2 to sent SAML
  requests to Impala. Impala works slightly differently
  during tests (saml2_ee_test_mode=true).

Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa
---
M be/src/common/global-flags.cc
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication-test.cc
M be/src/rpc/authentication.cc
M be/src/rpc/authentication.h
M be/src/rpc/hs2-http-test.cc
M be/src/rpc/thrift-server.h
M be/src/service/frontend.cc
M be/src/service/frontend.h
M be/src/service/impala-server.cc
M be/src/transport/THttpServer.cpp
M be/src/transport/THttpServer.h
M be/src/transport/THttpTransport.cpp
M be/src/transport/THttpTransport.h
M be/src/util/backend-gflag-util.cc
M bin/rat_exclude_files.txt
M common/thrift/BackendGflags.thrift
M common/thrift/Frontend.thrift
M common/thrift/metrics.json
M fe/pom.xml
A fe/src/main/java/org/apache/impala/authentication/saml/AuthTokenGenerator.java
A 
fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlAuthTokenGenerator.java
A 
fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlGroupNameFilter.java
A 
fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlHttpServlet.java
A 
fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateInfo.java
A 
fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlRelayStateStore.java
A fe/src/main/java/org/apache/impala/authentication/saml/HiveSamlUtils.java
A 
fe/src/main/java/org/apache/impala/authentication/saml/HttpSamlAuthenticationException.java
A 
fe/src/main/java/org/apache/impala/authentication/saml/HttpSamlNoGroupsMatchedException.java
A fe/src/main/java/org/apache/impala/authentication/saml/ImpalaSamlClient.java
A fe/src/main/java/org/apache/impala/authentication/saml/NullSessionStore.java
A fe/src/main/java/org/apache/impala/authentication/saml/WrappedWebContext.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M java/pom.xml
A testdata/authentication/saml2_sso.jks
A testdata/authentication/saml2_sso_metadata.xml
A tests/custom_cluster/test_saml2_sso.py
39 files changed, 2,142 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/33/16833/22
--
To view, visit http://gerrit.cloudera.org:8080/16833
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia0c026cba1b90e7ff6ec5ae49be78b0d1edd8dfa
Gerrit-Change-Number: 16833
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Vihang Karajgaonkar 


[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution

2021-02-10 Thread Zoltan Borok-Nagy (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/17039

to look at the new patch set (#2).

Change subject: IMPALA-10485: part(1): make ORC column reader creation 
independent of schema resolution
..

IMPALA-10485: part(1): make ORC column reader creation independent of schema 
resolution

Currently ORC schema resolution is position-based only. Then the
positional information is reused during ORC column reader creation.
This prevents adding other column resolution strategies, e.g.
column resolution by name or Iceberg field id. It also prevents
schema evolution, as table metadata and file metadata should be
in sync.

This patch makes column reader creation independent of schema
resolution. It does this by creating a mapping between slot/tuple
descriptors and ORC type ids during schema resolution and use
this mapping during column reader creation.

Now further patches just need to add support for other column
resolution strategies, the column readers will be created
accordingly.

Testing:
 * no additional tests as it is neither a bug fix nor a new feature

Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/orc-metadata-utils.h
M tests/query_test/test_scanners.py
7 files changed, 69 insertions(+), 137 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/17039/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17039
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e
Gerrit-Change-Number: 17039
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-02-10 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 63:

(4 comments)

Left a few comments, mostly focusing on the BE. But looks good overall.

http://gerrit.cloudera.org:8080/#/c/16720/63//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16720/63//COMMIT_MSG@16
PS63, Line 16: followign
nit: following


http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/exec/parquet/hdfs-parquet-scanner.cc@722
PS63, Line 722: if (FindMinMaxFilter(desc.filter_id)) {
  :   return true;
  : }
shouldn't we check FilterStats.enabled_for_page here?


http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/exec/parquet/hdfs-parquet-scanner.cc@797
PS63, Line 797: continue
nit: Since we have a 'continue' here there's no need to put the 
EvaluateOverlapForRowGroup() into an else branch.


http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/16720/63/be/src/util/min-max-filter.cc@541
PS63, Line 541:   // TODO: implement the overlap computation for TimestampValue.
  :   return 0.0;
This will cause the row groups to be rejected by the HdfsParquetScanner if we 
have a min/max filter on timestamp.



--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 63
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Feb 2021 14:18:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

2021-02-10 Thread Fucun Chu (Code Review)
Fucun Chu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() 
functions
..


Patch Set 4:

The test is being processed, update the document after completion


--
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 10 Feb 2021 14:07:30 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theta estimate() functions

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and 
ds_theta_estimate() functions
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8115/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 3
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 10 Feb 2021 13:55:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theta estimate() functions

2021-02-10 Thread Fucun Chu (Code Review)
Fucun Chu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and 
ds_theta_estimate() functions
..


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@7
PS2, Line 7: ds_theta_estimate
> nit: typo
Done


http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@13
PS2, Line 13: ds_theta_estimate
> nit: same typo
Done


http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@28
PS2, Line 28:see IMPALA-10464.
> I'd also include some highlights from that perf measurement doc into the co
Done


http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc@1646
PS2, Line 1646: SerializeDsThetaSketch(
> In contrast with HLL as I see Theta doesn't compact the sketch just seriali
Done


http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc@1899
PS2, Line 1899:   or dst->len == sizeof(datasketches::theta_union));
> I;m a bit lost here. Could you help me understand why is it needed to conve
Previously, it was processed along the idea that the size of dst is unchanged, 
and it is better to return union_sketch.


http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/datasketches-functions-ir.cc
File be/src/exprs/datasketches-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/datasketches-functions-ir.cc@110
PS2, Line 110: return 0;
> HLL returns a null here. Have you checked the behaviour in Hive to be in sy
Comparing the test cases of HLL and Theta, the results are different.
Theta:
https://github.com/apache/datasketches-hive/blob/master/src/test/java/org/apache/datasketches/hive/theta/EstimateSketchUDFTest.java#L34
HLL:
https://github.com/apache/datasketches-hive/blob/master/src/test/java/org/apache/datasketches/hive/hll/SketchToEstimateUDFTest.java#L31


http://gerrit.cloudera.org:8080/#/c/17008/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test:

http://gerrit.cloudera.org:8080/#/c/17008/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test@138
PS2, Line 138: # Check that ds_theta_estimate returns error for strings that 
are not serialized sketches.
> Please add a test when ds_theta_estimate() is used on an HLL sketch. I gues
Done



--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 3
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 10 Feb 2021 13:38:14 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theta estimate() functions

2021-02-10 Thread Fucun Chu (Code Review)
Fucun Chu has uploaded a new patch set (#3). ( 
http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and 
ds_theta_estimate() functions
..

IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions

These functions can be used to get cardinality estimates of data
using Theta algorithm from Apache DataSketches. ds_theta_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized Theta sketch in string format. This can be written to a
table or be fed directly to ds_theta_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the Theta sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch25_parquet.lineitem to compare perfomance
   with ds_hll_*. HLL and Theta gives closer estimate except for string,
   see IMPALA-10464.

Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions-test.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/datasketches-functions-ir.cc
M be/src/exprs/datasketches-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/theta_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test
M tests/query_test/test_datasketches.py
11 files changed, 445 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/17008/3
--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 3
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution

2021-02-10 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17039 )

Change subject: IMPALA-10485: part(1): make ORC column reader creation 
independent of schema resolution
..


Patch Set 1:

(5 comments)

Good to see a negative size patch! Thanks for the simplification! I'm going to 
have a detailed look. Left some minor comments first.

http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@291
PS1, Line 291: inline bool PathContains(const SchemaPath& path, const 
SchemaPath& sub_path) {
 :   return path.size() >= sub_path.size() &&
 :   std::equal(sub_path.begin(), sub_path.end(), path.begin());
 : }
This is no longer used.


http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@296
PS1, Line 296: inline const SchemaPath& GetTargetColPath(const SlotDescriptor* 
slot_desc) {
This is no longer used too.


http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@311
PS1, Line 311: bool IsDescendant(const orc::Type& node, uint64_t 
candidate_col_id) {
nit: inline?


http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@316
PS1, Line 316:
Could you move the original comments of FindChild here?


http://gerrit.cloudera.org:8080/#/c/17039/1/be/src/exec/orc-column-readers.cc@606
PS1, Line 606:   // We have a position slot descriptor if it refers to this 
LIST ORC type, but it isn't
 :   // a collection slot.
It may be useful to mention "See how position slots are resolved in 
HdfsOrcScanner::ResolveColumns()."



--
To view, visit http://gerrit.cloudera.org:8080/17039
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0f7d521f9397c5188fadc7996cee0bd1650d363e
Gerrit-Change-Number: 17039
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Wed, 10 Feb 2021 09:57:06 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10482, IMPALA-10493: Fix bugs in full ACID collection query rewrites

2021-02-10 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17038 )

Change subject: IMPALA-10482, IMPALA-10493: Fix bugs in full ACID collection 
query rewrites
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17038/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17038/3//COMMIT_MSG@34
PS3, Line 34: When AcidRewriter.splitCollectionRef() creates a new collection 
ref
: it doesn't copy every information needed to correctly execute the
: query. E.g. it dropped the ON clause, turning INNER joins to CROSS
: joins.
I think copying the join stuffs is not enough. We are rewritting "A join B" to 
"A join B1 join B2". This is ok for inner join but may be incorrect for outer 
joins. Here is an example, the two queries produce different results:

impala> with v as (
  select ('k4') as key
  union all
  values ('k1'), ('k2'), ('k3')
) select * from v left join functional_parquet.complextypestbl.int_map using 
(key);
+-+--+---+
| key | key  | value |
+-+--+---+
| k1  | k1   | -1|
| k1  | k1   | 1 |
| k2  | k2   | 100   |
| k1  | k1   | 2 |
| k2  | k2   | NULL  |
| k1  | k1   | NULL  |
| k3  | k3   | NULL  |
| k4  | NULL | NULL  |
+-+--+---+
Fetched 8 row(s) in 3.35s

impala> with v as (
  select ('k4') as key
  union all
  values ('k1'), ('k2'), ('k3')
) select * from v left join functional_orc_def.complextypestbl.int_map using 
(key);
+-+-+---+
| key | key | value |
+-+-+---+
| k1  | k1  | -1|
| k1  | k1  | 1 |
| k2  | k2  | 100   |
| k1  | k1  | 2 |
| k2  | k2  | NULL  |
| k1  | k1  | NULL  |
| k3  | k3  | NULL  |
+-+-+---+
Fetched 7 row(s) in 0.35s

Maybe we need to replace the original table ref with an inlineview (similar to 
what we've done for column masking).



--
To view, visit http://gerrit.cloudera.org:8080/17038
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8fc758d3c1e75c7066936d590aec8bff8d2b00b0
Gerrit-Change-Number: 17038
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Feb 2021 08:51:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char

2021-02-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16909 )

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8114/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Wed, 10 Feb 2021 08:30:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char

2021-02-10 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16909 )

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char
..


Patch Set 13:

Fixed some issues caused by unexpected data loading.


-- 
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Wed, 10 Feb 2021 08:11:23 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char

2021-02-10 Thread Quanlong Huang (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16909

to look at the new patch set (#13).

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char
..

IMPALA-5675: Support UTF-8 Varchar and Char

This patch adds support for UTF-8 aware varchar and char types. In
UTF-8 mode, when truncating UTF-8 varchar(N) and char(N) strings,
lengths will be counted by UTF-8 characters instead of bytes. So the
result string will have up to N UTF-8 characters.

The UTF8_MODE query option is first detected in FE when analyzing the
query. A 'is_utf8' label is added in Exprs and SlotDescriptors. They are
used in generating thrift objects and computing the tuple layouts. A
char(N) slot will occupy 4 * N bytes if it's in UTF-8 type, because a
UTF-8 character can be encoded into 1~4 bytes. The slot will store up to
N UTF-8 characters.

There is a gotcha that we should not add the label in Type.java, because
Type instances are shared across the FE. Query compilation reuses the
Type instances from the metadata. If we modify Type instances during
compilation, other queries in non-UTF8 mode will be affected.

However, in BE, we need the type related classes (e.g. ColumnType,
TypeDesc) to carry in the utf8 markers. It's impractical to check the
UTF8_MODE query option everywhere it needs to be. E.g. in
AnyValUtil::SetAnyVal we can't access the query options. So we add the
'is_utf8' marker in TScalarType, ColumnType, TypeDesc to conveniently
distinguish char(N) and varchar(N) types in UTF-8 mode. When generating
thrift objects in FE, Exprs and SlotDescriptors deliver 'is_utf8'
markers to TScalaTypes. They finally landed in ColumnType and TypeDesc
instances.

Given the correct UTF-8 mode checked, we just need to truncate/pad the
char/varchar strings with their length counted by UTF-8 characters.

Since char(N) slots always occupy 4N bytes, when converting char(N) to
other string types, we need to re-calculate the actual length
corresponding to N UTF-8 characters. We can optimize this in later
patches, e.g. store the UTF-8 length in the slot, or deal with UTF-8
char(N) by the same way as varchar(N), i.e. reallocate the string space
and just store the pointer and length in the slot.

Tests:
 - Add tests for reading char(N) and varchar(N) columns in UTF8_MODE.
 - Add truncating/padding tests
 - Kudu only supports Varchar currently. Add special tests for Kudu.
 - Add tests for writing CHAR(N)/VARCHAR(N) in UTF-8 mode.

Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
---
M be/src/codegen/codegen-anyval.cc
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/llvm-codegen.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner-test.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-table-sink.cc
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-plain-test.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.inline.h
M be/src/exprs/agg-fn-evaluator.cc
M be/src/exprs/anyval-util.cc
M be/src/exprs/anyval-util.h
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/scalar-expr-evaluator.cc
M be/src/exprs/scalar-fn-call.cc
M be/src/exprs/slot-ref.cc
M be/src/runtime/raw-value-ir.cc
M be/src/runtime/raw-value.cc
M be/src/runtime/raw-value.inline.h
M be/src/runtime/tuple.cc
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/fe-support.cc
M be/src/service/hs2-util.cc
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/dict-encoding.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M be/src/util/tuple-row-compare.cc
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/kudu_create.test
A testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test
A