[Impala-ASF-CR] IMPALA-10147: Avoid getting a file handle for data cache hits

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16963 )

Change subject: IMPALA-10147: Avoid getting a file handle for data cache hits
..


Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6859/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/16963
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icc68f233518f862454e87bcbbef14d65fcdb7c91
Gerrit-Change-Number: 16963
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 22 Jan 2021 06:43:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >=
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it is always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The elapsed time for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Reviewed-on: http://gerrit.cloudera.org:8080/16942
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,275 insertions(+), 355 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim 

[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 15: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 22 Jan 2021 05:31:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16908 )

Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring 
and reverse functions
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8043/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 22 Jan 2021 03:32:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

2021-01-21 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16908 )

Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring 
and reverse functions
..


Patch Set 10:

(3 comments)

Thanks! Added more info in the commit message and comments.

http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc@10542
PS9, Line 10542: TEST_P(ExprTest, Utf8Test) {
> Some of the characters below are > 2 bytes right? It would be helpful to ad
Done.


http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc@10596
PS9, Line 10596:   TestStringValue("utf8_reverse('')", "");
> Can we add tests for a couple of grapheme clusters where we're reversing th
Done


http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/string-functions-ir.cc@496
PS9, Line 496: // Returns a string with the UTF-8 characters (code points) in 
revrese order. Note that
> Comment that this reverses codepoints only, and that's consistent with othe
Done



--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 22 Jan 2021 03:11:09 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

2021-01-21 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16908

to look at the new patch set (#10).

Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring 
and reverse functions
..

IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse 
functions

A unicode character can be encoded into 1-4 bytes in UTF-8. String
functions will return undesired results when the input contains unicode
characters, because we deal with a string as a byte array. For instance,
length() returns the length in bytes, not in unicode characters.

UTF-8 is the dominant unicode encoding used in the Hadoop ecosystem.
This patch adds UTF-8 support in some string functions so they can have
UTF-8 aware behavior. For compatibility with the old versions, a new
query option, UTF8_MODE, is added for turning on/off the UTF-8 aware
behavior. Currently, only length(), substring() and reverse() support
it. Other function supports will be added in later patches.

String functions will check the query option and switch to use the
desired implementation. It's similar to how we use the decimal_v2 query
option in builtin functions.

For easy testing, the UTF-8 aware version of string functions are
also exposed as builtin functions (named by utf8_*, e.g. utf8_length).

Tests:
 - Add BE tests for utf8 functions.
 - Add e2e tests for the UTF8_MODE query option.

Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
---
M be/src/codegen/llvm-codegen.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/runtime/runtime-state.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/util/bit-util.h
M common/function-registry/impala_functions.py
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/datasets/functional/functional_schema_template.sql
A 
testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
A tests/query_test/test_utf8_strings.py
16 files changed, 361 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/16908/10
--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 15: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 23:55:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 15:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6858/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 23:55:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1

2021-01-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16318 )

Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
..


Patch Set 32:

(4 comments)

Few comments related to the hang

http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr-internal.h
File be/src/runtime/tmp-file-mgr-internal.h:

http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr-internal.h@206
PS32, Line 206:   TmpFileRemote(TmpFileGroup* file_group, TmpFileMgr::DeviceId 
device_id,
Can we add a destructor with a DCHECK that will ensure that the resources have 
been returned to TmpFileBufferPool correctly?

This might relate to the shared_ptr comment I made elsewhere - if we use 
shared_ptr consistently then we'll know that the destructor runs exactly when 
the object becomes unreachable.


http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.h@184
PS32, Line 184: /// A map to keep the TmpFile shared pointer alive by 
keeping a reference of the
Why do we need to do this again? Can't we enqueue the shared_ptr into the tmp 
file pool and make ownership explicit, instead of doing this shared_ptr/raw 
pointer dance. I'm finding it hard to understand the ownership of these TmpFile 
objects while looking to see if there's a resource leak.


http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.cc@576
PS32, Line 576:   if 
(tmp_dirs_remote_ctrl_.local_buff_dir_bytes_high_water_mark_.Add(file_size)
I think this code might be vulnerable to a race where a thread ends up blocking 
indefinitely in DequeueTmpFilesPool, e.g.

Assume bytes_limit is 2x and each file is x bytes. Start with x bytes allocated.

Thread 1: increment, win race, 2x allocated
Thread 2: increment, lose race, 3x allocated > bytes limit
Thread 3: decrement, 2x
Thread 4: increment, fails, 3x allocated
Thread 4: decrement, 2x
Thread 2: decrement, 1x

In that case thread 4 could block in DequeueTmpFilesPool even though there's 
free capacity.

I think the basic flaw is that one of the conditions for a thread blocking on 
tmp_files_available_cv_ is that it should have checked for buffer availabity 
before doing so, but it's not part of the condition variable loop condition.

Using the atomics is ok in other places because we just return an error when 
there's a race.

I think it'd be best to protect it behind a lock and also signal the condition 
variable when it's decremented from >= bytes_limit to < bytes_limit.


http://gerrit.cloudera.org:8080/#/c/16318/32/be/src/runtime/tmp-file-mgr.cc@1835
PS32, Line 1835: void TmpFileBufferPool::EnqueueTmpFilesPool(TmpFile* tmp_file, 
bool front) {
I'm suspicious about using the raw pointer here



--
To view, visit http://gerrit.cloudera.org:8080/16318
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
Gerrit-Change-Number: 16318
Gerrit-PatchSet: 32
Gerrit-Owner: Yida Wu 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Yida Wu 
Gerrit-Comment-Date: Thu, 21 Jan 2021 23:51:19 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 14: Code-Review+2

BE changes look good to me


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 23:50:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10147: Avoid getting a file handle for data cache hits

2021-01-21 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16963 )

Change subject: IMPALA-10147: Avoid getting a file handle for data cache hits
..


Patch Set 2: Code-Review+2

Looks good! Thanks!


--
To view, visit http://gerrit.cloudera.org:8080/16963
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icc68f233518f862454e87bcbbef14d65fcdb7c91
Gerrit-Change-Number: 16963
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jan 2021 23:49:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 53:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/8042/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 53
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jan 2021 22:29:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-01-21 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#53). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..

IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on 
equi-join predicate

This patch adds a new class of predicates called overlap predicates
to aid in the determination of whether a Parquet row group or a page
overlap with a range computed from an equi hash join. If not, then
the entire row group or page are skipped.  When a row survives this way,
it can be subjected to the row-level overlapping test against the same
overlap predicate.

For the following query, the min and max in the overlap predicate are
computed with the values from the join column from table 'b'. To
evaluate the overlap predicate, these two values are compared against
the min/max of each row group or page at the scan node for 'a'.

  select straight_join count(*)
  from lineitem a join [SHUFFLE] lineitem b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";

An overlap predicate associated with the column type J (in hash table)
and scan column type S will be formed when one of the following is true:
   Both J and S are booleans
   Both J and S are integers (tinyint, smallint, int, or bigint)
   Both J and S are approximate numeric (float or double)
   Both J and S are decimals with the same precision and scale
   Both J and S are strings (STRING, CHAR or VARCHAR)
   Both J and S are date
   Both J and S are timestamp

The overlap predicate is implemented as a min/max filter. Unlike
existing min/max filters, MAX_NUM_RUNTIME_FILTERS query option does
not apply to min/max filters created for overlap predicates. An overlap
predicate will be evaluated as long as the overlap ratio is less than a
thresold specified in a new query option 'minmax_filter_threshold'.
Setting the threshold to its minimal value 0.0 disables the feature,
and setting it to the maximal value 1.0 applies the filtering in all
cases.

In addition, two new run-time profile counters are added to report the
number of row groups or pages filtered out via the overlap predicates
respectively:
  1. NumRuntimeFilteredRowGroups
  2. NumRuntimeFilteredPages

Testing:
1. Unit tested on various column types with TPCH and TPCDS tables.
   Benefits were significant when the join column on the outer table
   is sorted and there exist many row groups or pages no overlapping
   with the implementing min/max filters;
2. Added following new tests:
a. in min_max_filters.test to demonstrate the number of filtered
   out pages and row groups with the two new profile counters;
b. in runtime-filter-propagation.test to demonstrate that the
   overlap predicates work with different column types;
c. data type specific overlap method tests in
   min-max-filter-test.cc;
3. Core testing;
4. Performance measurement.

To do in follow-up JIRAs:
1. Improve filtering efficiency;
2. Apply the overlap predicate on partition columns;
3. IR code-gen for various MinMaxFilter::EvalOverlap methods.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/raw-value.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/string-value.cc
M be/src/runtime/string-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-ir.cc
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit-large.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit.test
A 

[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

2021-01-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16908 )

Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring 
and reverse functions
..


Patch Set 9:

(3 comments)

LGTM, I wanted a few more tests and a comment but otherwise I'm ready to +2

http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc@10542
PS9, Line 10542: TEST_P(ExprTest, Utf8Test) {
Some of the characters below are > 2 bytes right? It would be helpful to add a 
comment mentioning the # of bytes in the characters you used.


http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/expr-test.cc@10596
PS9, Line 10596:   TestStringValue("utf8_reverse('mañana')", "anañam");
Can we add tests for a couple of grapheme clusters where we're reversing the 
codepoints instead of the clusters. There are a couple of examples here - 
https://exploringjs.com/impatient-js/ch_unicode.html#grapheme-clusters-the-real-characters.

I tried them in impala-shell and it seems to have the expected behavior - 
https://drive.google.com/file/d/1iPXFAOtkRE5OPc014i6hARfig7W8xmpw/view?usp=sharing


http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16908/9/be/src/exprs/string-functions-ir.cc@496
PS9, Line 496: StringVal StringFunctions::Utf8Reverse(FunctionContext* context, 
const StringVal& str) {
Comment that this reverses codepoints only, and that's consistent with other 
systems.



--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 20:31:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns

2021-01-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16947 )

Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to 
support a list of columns
..

IMPALA-10435: Extend 'compute incremental stats' syntax
to support a list of columns

Modified parser to support compute incremental stats
columns.No need to modify the code of other modules
because it already supports

Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Reviewed-on: http://gerrit.cloudera.org:8080/16947
Tested-by: Impala Public Jenkins 
Reviewed-by: Tim Armstrong 
---
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M 
testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test
5 files changed, 340 insertions(+), 5 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Tim Armstrong: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/16947
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Gerrit-Change-Number: 16947
Gerrit-PatchSet: 5
Gerrit-Owner: liuyao 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Reviewer: liuyao 


[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns

2021-01-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16947 )

Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to 
support a list of columns
..


Patch Set 4: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16947
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Gerrit-Change-Number: 16947
Gerrit-PatchSet: 4
Gerrit-Owner: liuyao 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Reviewer: liuyao 
Gerrit-Comment-Date: Thu, 21 Jan 2021 19:35:24 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16318 )

Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
..


Patch Set 32:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8041/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16318
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
Gerrit-Change-Number: 16318
Gerrit-PatchSet: 32
Gerrit-Owner: Yida Wu 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Yida Wu 
Gerrit-Comment-Date: Thu, 21 Jan 2021 18:08:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1

2021-01-21 Thread Yida Wu (Code Review)
Yida Wu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16318 )

Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
..


Patch Set 30:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/16318/30//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16318/30//COMMIT_MSG@112
PS30, Line 112: * Ran pre-review-test
> Have you tested in combination with disk spill compression?
Added TmpFileMgrTest::TestCompressBufferManagement*Remote and 
TmpFileMgrTest::TestCompressBufferManagement*RemoteUpload test cases.


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/disk-io-mgr.cc@328
PS30, Line 328: 0
> I think we should probably set the replication to 1 for this use case - tha
Done


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.h
File be/src/runtime/io/local-file-system.h:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.h@42
PS30, Line 42: uint8
> nit: uint8_t
Done


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.cc
File be/src/runtime/io/local-file-system.cc:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-system.cc@98
PS30, Line 98: uint8
> nit: uint8_t
Done


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-writer.cc
File be/src/runtime/io/local-file-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/io/local-file-writer.cc@45
PS30, Line 45:   int written = write(fileno(file_), range->data(), 
range->len());
> I missed this, I think we should use LocalFileSystem::Fwrite which converts
Use write() instead of fwrite(), because the file handle is shared among 
multiple write ranges before the file is full and closed, during the period, if 
using fwrite(), it may not be safe to read the file because the buffer fwrite() 
is using may not flush to the disk. But write() is safe to read immediately 
because it writes without a buffer. If we don’t share the file handle for 
ranges of the same file, fopen() and fclose() for every write ranges to 
guarantee the safety of read, it would be slower than to share the file handle 
and use write() to do the sequential writes.
Added LocalFileSystem::Write() to handle errors for calling write() and some 
comments.


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.h@420
PS30, Line 420:   ///
> This is the main public API for TmpFileMgr writing. I notice that it doesn'
Added.


http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/16318/27/be/src/runtime/tmp-file-mgr.h@194
PS27, Line 194: std::unique_ptr tmp_file_pool_;
> Can you also put this in the comment for tmp_file_pool_?
Done


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.cc@517
PS30, Line 517:   if (HasRemoteDir()) {
> This doesn't seem right if write_range is just writing to a local scratch d
Done


http://gerrit.cloudera.org:8080/#/c/16318/30/be/src/runtime/tmp-file-mgr.cc@1820
PS30, Line 1820: DCHECK(status.ok());
> nit: use DCHECK_OK - we have a custom version that automatically prints the
Done



--
To view, visit http://gerrit.cloudera.org:8080/16318
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
Gerrit-Change-Number: 16318
Gerrit-PatchSet: 30
Gerrit-Owner: Yida Wu 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Yida Wu 
Gerrit-Comment-Date: Thu, 21 Jan 2021 17:52:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9867: Add Support for Spilling to S3: Milestone 1

2021-01-21 Thread Yida Wu (Code Review)
Yida Wu has uploaded a new patch set (#32). ( 
http://gerrit.cloudera.org:8080/16318 )

Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
..

IMPALA-9867: Add Support for Spilling to S3: Milestone 1

Major Features
1) Local files as buffers for spilling to S3.
2) Async Upload for remote files.
3) Sync remote files deletion after query ends.
4) Local buffer files management.
5) Compatibility of spilling to local and remote.
6) All the errors from hdfs/s3 should terminate the query.

Changes on TmpFile:
* TmpFile is separated into two types of implementation, TmpFileLocal
  and TmpFileRemote.
  TmpFileLocal is used for Spilling to local file system.
  TmpFileRemote is a new type for Spilling to the remote. It contains
  two DiskFiles, one for local buffer, the other for the remote file.
* The DiskFile is an object that contains the information of a pysical
  file for passing to the DiskIOMgr to execute the IO operations on
  that specific file. The DiskFile also contains status information of
  the file,includes DiskFileStatus::INWRITING/PERSISTED/DELETED.
  When the DiskFile is initialized, it is in INWRITING status. If the
  file is persisted into the file system, it would become PERSISTED
  status. If the file is deleted, for example, the local buffer is
  evicted, so the DiskFile status of the buffer file would become
  deleted. After that, if the file is fetching from the remote, the
  DiskFile status of the buffer file would become INWRITING, and then
  PERSISTED if the fetching finishes successfully.

Implementation Details:
1) A new enum type is added to specify the disk type of files,
   indicating where the file physically locates.
   The types include DiskFileType::LOCAL/LOCAL_BUFFER/DFS/S3.
   DiskFileType::LOCAL indicates the file is in the local file system.
   DiskFileType::LOCAL_BUFFER indicates the file is in the local file
   system, and it is the buffer of a remote scratch file.
   DiskFileType::DFS/S3 indicates the file is in the HDFS/S3.
   The local buffer allows the buffer pool to pin(read), but mainly
   for remote files, buffer pool would pin(read) the page from the
   remote file system.
2) Two disk queues have been added to do the file operation jobs.
   Queue name: RemoteS3DiskFileOper/RemoteDfsDiskFileOper
   File operations on the remote disk like upload and fetch should
   be done in these queues. The purpose of the queues is to isolate
   the file operations from normal read/write IO operations in different
   queues. It could increase the efficiency of the file operations by
   not being interrupted during a relatively long execution time, and
   also provide a more accurate control on the thread number working on
   file operation jobs.
   RemoteOperRange is the new type to carry the file operation jobs.
   Previously,we have request types of READ and WRITE.
   Now FILE_FETCH/FILE_UPLOAD are added.
3) The tmp files are physically deleted when the tmp file group is
   deconstructing. For remote files, the entire directory would be
   deleted.
4) The local buffer files management is to control the total size
   of local buffer files and evict files if needed.
   A local buffer file can be evicted if the temporary file has uploaded
   a copy to the remote disk or the query ends.
   There are two modes to decide the sequence of choosing files to be
   evicted first. Default is LIFO, the other is FIFO. It can be
   controlled by startup option remote_tmp_files_avail_pool_lifo.
   Also, a thread TmpFileSpaceReserveThreadLoop in TmpFileMgr is
   running to allow to reserve buffer file space in an async way to
   avoid deadlocks.
   Startup option allow_spill_to_hdfs is added. By default the HDFS path
   is not allowed, but for testcases, the option can be set true to
   allow the use of HDFS path as scratch space for testing only.
5) Spilling to local has higher priority than spilling to remote.
   If no local scratch space is available, temporary data will be
   spilled to remote.
   The first available local directory is used for the local buffer
   for spilling to remote if any remote directory is configured.
   If remote directory is configured without any available local
   scratch space, an error will be returned during initialization.
   The purpose of the design is to simplify the implementation in
   milestone 1 with less changes on the configuration.

Example (setting remote scratch space):
Assume that the directories we have for scratch space:
* Local dir: /tmp/local_buffer, /tmp/local, /tmp/local_sec
* Remote dir: s3a://tmp/remote
The scratch space path is configured in the startup options, and could
have three types of configurations:
1. Pure local scratch space
  --scratch_dirs="/tmp/local"
2. Pure remote scratch space
  --scratch_dirs="s3a://tmp/remote,/tmp/local_buffer:16GB"
3. Mixed local and remote scratch space
  

[Impala-ASF-CR] IMPALA-10147: Avoid getting a file handle for data cache hits

2021-01-21 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16963 )

Change subject: IMPALA-10147: Avoid getting a file handle for data cache hits
..


Patch Set 2:

(2 comments)

Thank you for the review.
Patch set 2 move the handle variable and scope exit trigger closer around the 
file handle retrieval.

http://gerrit.cloudera.org:8080/#/c/16963/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/16963/1/be/src/runtime/io/hdfs-file-reader.cc@99
PS1, Line 99: Status status = Status::OK();
:   {
: ScopedTimer req_context_read_timer(
: scan_range_->reader_->read_timer_);
: bool logged_slow_read = false; // True if we already logged 
about a slow read.
:
: // Try reading from the remote data cache if it's enabled for 
the scan range.
: DataCache* remote_data_cache = io_mgr->remote_data_cache();
: bool try_data_cache = scan_range_->UseDataCache() && 
remote_data_cache != nullptr;
: i
> Nit: Since these are now initialized later in the function, I would also mo
Done


http://gerrit.cloudera.org:8080/#/c/16963/1/be/src/runtime/io/hdfs-file-reader.cc@126
PS1, Line 126: hdfsFile hdfs_file;
> For the clang-tidy issue you are seeing here:
Thank you. I didn't read thoroughly to realize that the logic later won't 
execute if all bytes successfully read from data cache.



--
To view, visit http://gerrit.cloudera.org:8080/16963
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icc68f233518f862454e87bcbbef14d65fcdb7c91
Gerrit-Change-Number: 16963
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jan 2021 17:13:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10147: Avoid getting a file handle for data cache hits

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16963 )

Change subject: IMPALA-10147: Avoid getting a file handle for data cache hits
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8040/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16963
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icc68f233518f862454e87bcbbef14d65fcdb7c91
Gerrit-Change-Number: 16963
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jan 2021 17:06:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10147: Avoid getting a file handle for data cache hits

2021-01-21 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16963 )

Change subject: IMPALA-10147: Avoid getting a file handle for data cache hits
..


Patch Set 2: Code-Review+1

Thanks for fixing this, LGTM!


--
To view, visit http://gerrit.cloudera.org:8080/16963
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icc68f233518f862454e87bcbbef14d65fcdb7c91
Gerrit-Change-Number: 16963
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jan 2021 16:56:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10147: Avoid getting a file handle for data cache hits

2021-01-21 Thread Riza Suminto (Code Review)
Hello Zoltan Borok-Nagy, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16963

to look at the new patch set (#2).

Change subject: IMPALA-10147: Avoid getting a file handle for data cache hits
..

IMPALA-10147: Avoid getting a file handle for data cache hits

When reading from the data cache, the disk IO thread first gets a file
handle, then it checks the data cache for a hit. The file handle is only
used if there is a data cache miss. It is not used when data cache hit
and in turns becomes an overhead. This patch move the file handle
retrieval later when data cache miss hapens.

Testing:
- Add custom cluster test test_no_fd_caching_on_cached_data.
- Pass core tests.

Change-Id: Icc68f233518f862454e87bcbbef14d65fcdb7c91
---
M be/src/runtime/io/hdfs-file-reader.cc
M tests/custom_cluster/test_hdfs_fd_caching.py
2 files changed, 71 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/16963/2
--
To view, visit http://gerrit.cloudera.org:8080/16963
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icc68f233518f862454e87bcbbef14d65fcdb7c91
Gerrit-Change-Number: 16963
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 52:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8039/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 52
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jan 2021 16:41:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-01-21 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#52). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..

IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on 
equi-join predicate

This patch adds a new class of predicates called overlap predicates
to aid in the determination of whether a Parquet row group or a page
overlap with a range computed from an equi hash join. If not, then
the entire row group or page are skipped.  When a row survives this way,
it can be subjected to the row-level overlapping test against the same
overlap predicate.

For the following query, the min and max in the overlap predicate are
computed with the values from the join column from table 'b'. To
evaluate the overlap predicate, these two values are compared against
the min/max of each row group or page at the scan node for 'a'.

  select straight_join count(*)
  from lineitem a join [SHUFFLE] lineitem b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";

An overlap predicate associated with the column type J (in hash table)
and scan column type S will be formed when one of the following is true:
   Both J and S are booleans
   Both J and S are integers (tinyint, smallint, int, or bigint)
   Both J and S are approximate numeric (float or double)
   Both J and S are decimals with the same precision and scale
   Both J and S are strings (STRING, CHAR or VARCHAR)
   Both J and S are date
   Both J and S are timestamp

The overlap predicate is implemented as a min/max filter. Unlike
existing min/max filters, MAX_NUM_RUNTIME_FILTERS query option does
not apply to min/max filters created for overlap predicates. An overlap
predicate will be evaluated as long as the overlap ratio is less than a
thresold specified in a new query option 'minmax_filter_threshold'.
Setting the threshold to its minimal value 0.0 disables the feature,
and setting it to the maximal value 1.0 applies the filtering in all
cases.

In addition, two new run-time profile counters are added to report the
number of row groups or pages filtered out via the overlap predicates
respectively:
  1. NumRuntimeFilteredRowGroups
  2. NumRuntimeFilteredPages

Testing:
1. Unit tested on various column types with TPCH and TPCDS tables.
   Benefits were significant when the join column on the outer table
   is sorted and there exist many row groups or pages no overlapping
   with the implementing min/max filters;
2. Added following new tests:
a. in min_max_filters.test to demonstrate the number of filtered
   out pages and row groups with the two new profile counters;
b. in runtime-filter-propagation.test to demonstrate that the
   overlap predicates work with different column types;
c. data type specific overlap method tests in
   min-max-filter-test.cc;
3. Core testing;
4. Performance measurement.

To do in follow-up JIRAs:
1. Improve filtering efficiency;
2. Apply the overlap predicate on partition columns;
3. IR code-gen for various MinMaxFilter::EvalOverlap methods.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/raw-value.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/string-value.cc
M be/src/runtime/string-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-ir.cc
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit-large.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit.test
A 

[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 51:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8038/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 51
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jan 2021 16:13:24 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-01-21 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 50:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16720/50//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16720/50//COMMIT_MSG@68
PS50, Line 68: partition columns
It will be especially useful for partitioned Iceberg tables because runtime 
bloom filters doesn't work with them.

But they'll need some special code, because partitioned Iceberg tables are 
treated as non-partitioned tables most of the time.


http://gerrit.cloudera.org:8080/#/c/16720/50/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16720/50/be/src/exec/parquet/hdfs-parquet-scanner.cc@659
PS50, Line 659: raio
nit: ratio


http://gerrit.cloudera.org:8080/#/c/16720/50/be/src/runtime/runtime-filter-ir.cc
File be/src/runtime/runtime-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16720/50/be/src/runtime/runtime-filter-ir.cc@33
PS50, Line 33: min_max_filter_.Load() == nullptr) return true;
 :   return min_max_filter_.Load()
Probably we should only invoke Load() once and store the pointer.


http://gerrit.cloudera.org:8080/#/c/16720/50/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16720/50/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@733
PS50, Line 733: referes
nit: refers


http://gerrit.cloudera.org:8080/#/c/16720/50/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@734
PS50, Line 734: as it will not be
  : // as effective as the conjunct.
How do we know the runtime efficiency here? Maybe we should still add it and 
let minmax threshold decide?
However, if the min/max conjunct comes from an EQ predicate, then yes, we 
should probably not add an overlap predicate in this case.



--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 50
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jan 2021 15:55:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-01-21 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#51). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..

IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on 
equi-join predicate

This patch adds a new class of predicates called overlap predicates
to aid in the determination of whether a Parquet row group or a page
overlap with a range computed from an equi hash join. If not, then
the entire row group or page are skipped.  When a row survives this way,
it can be subjected to the row-level overlapping test against the same
overlap predicate.

For the following query, the min and max in the overlap predicate are
computed with the values from the join column from table 'b'. To
evaluate the overlap predicate, these two values are compared against
the min/max of each row group or page at the scan node for 'a'.

  select straight_join count(*)
  from lineitem a join [SHUFFLE] lineitem b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";

An overlap predicate associated with the column type J (in hash table)
and scan column type S will be formed when one of the following is true:
   Both J and S are booleans
   Both J and S are integers (tinyint, smallint, int, or bigint)
   Both J and S are approximate numeric (float or double)
   Both J and S are decimals with the same precision and scale
   Both J and S are strings (STRING, CHAR or VARCHAR)
   Both J and S are date
   Both J and S are timestamp

The overlap predicate is implemented as a min/max filter. Unlike
existing min/max filters, MAX_NUM_RUNTIME_FILTERS query option does
not apply to min/max filters created for overlap predicates. An overlap
predicate will be evaluated as long as the overlap ratio is less than a
thresold specified in a new query option 'minmax_filter_threshold'.
Setting the threshold to its minimal value 0.0 disables the feature,
and setting it to the maximal value 1.0 applies the filtering in all
cases.

In addition, two new run-time profile counters are added to report the
number of row groups or pages filtered out via the overlap predicates
respectively:
  1. NumRuntimeFilteredRowGroups
  2. NumRuntimeFilteredPages

Testing:
1. Unit tested on various column types with TPCH and TPCDS tables.
   Benefits were significant when the join column on the outer table
   is sorted and there exist many row groups or pages no overlapping
   with the implementing min/max filters;
2. Added following new tests:
a. in min_max_filters.test to demonstrate the number of filtered
   out pages and row groups with the two new profile counters;
b. in runtime-filter-propagation.test to demonstrate that the
   overlap predicates work with different column types;
c. data type specific overlap method tests in
   min-max-filter-test.cc;
3. Core testing;
4. Performance measurement.

To do in follow-up JIRAs:
1. Improve filtering efficiency;
2. Apply the overlap predicate on partition columns;
3. IR code-gen for various MinMaxFilter::EvalOverlap methods.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/raw-value.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/string-value.cc
M be/src/runtime/string-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-ir.cc
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit-large.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit.test
A 

[Impala-ASF-CR] IMPALA-10440: Import Theta functionality from DataSketches

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16959 )

Change subject: IMPALA-10440: Import Theta functionality from DataSketches
..


Patch Set 4: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16959
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8485d6829f50b130c84ec8bef0a4b5895255ba6c
Gerrit-Change-Number: 16959
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 21 Jan 2021 14:48:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10440: Import Theta functionality from DataSketches

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16959 )

Change subject: IMPALA-10440: Import Theta functionality from DataSketches
..

IMPALA-10440: Import Theta functionality from DataSketches

This patch imports the functionality needed for Theta approximate
algorithm from Apache DataSketches.

First, I updated our existing snapshot of DataSketches to the
following commit:b2f749ed5ce6ba650f4259602b133c310c3a5ee4"
Merge pull request #182 from chufucun/include_type"
This affects files originated from hll/, kll/ and theta/ directories
of the DataSketches repo.

Then I copied all the files needed for Theta into our snapshot
directory.

Browse the source files here:
https://github.com/apache/datasketches-cpp/tree/b2f749ed5ce6ba650f4259602b133c310c3a5ee4

Change-Id: I8485d6829f50b130c84ec8bef0a4b5895255ba6c
Reviewed-on: http://gerrit.cloudera.org:8080/16959
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/datasketches-test.cc
M be/src/thirdparty/datasketches/HllUnion-internal.hpp
M be/src/thirdparty/datasketches/README.md
M be/src/thirdparty/datasketches/hll.hpp
M be/src/thirdparty/datasketches/kll_quantile_calculator_impl.hpp
M be/src/thirdparty/datasketches/kll_sketch.hpp
M be/src/thirdparty/datasketches/kll_sketch_impl.hpp
A be/src/thirdparty/datasketches/theta_a_not_b.hpp
A be/src/thirdparty/datasketches/theta_a_not_b_impl.hpp
A be/src/thirdparty/datasketches/theta_intersection.hpp
A be/src/thirdparty/datasketches/theta_intersection_impl.hpp
A be/src/thirdparty/datasketches/theta_sketch.hpp
A be/src/thirdparty/datasketches/theta_sketch_impl.hpp
A be/src/thirdparty/datasketches/theta_union.hpp
A be/src/thirdparty/datasketches/theta_union_impl.hpp
16 files changed, 2,153 insertions(+), 60 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/16959
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I8485d6829f50b130c84ec8bef0a4b5895255ba6c
Gerrit-Change-Number: 16959
Gerrit-PatchSet: 5
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16908 )

Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring 
and reverse functions
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8037/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 10:45:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

2021-01-21 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16908

to look at the new patch set (#9).

Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring 
and reverse functions
..

IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse 
functions

Add a query option, UTF8_MODE, for turning on/off the UTF-8 aware
behavior. Add UTF-8 aware versions of length(), substring() and
reverse(). Other functions will be added in later patches.

String functions will check the query option and switch to use the
desired implementation. It's similar to how we use the decimal_v2 query
option in builtin functions.

Tests:
 - Expose the UTF-8 aware version of string functions as builtin
   functions (named by utf8_*). Add BE tests for them.
 - Add e2e tests for the UTF8_MODE query option.

Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
---
M be/src/codegen/llvm-codegen.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/runtime/runtime-state.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/util/bit-util.h
M common/function-registry/impala_functions.py
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/datasets/functional/functional_schema_template.sql
A 
testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
A tests/query_test/test_utf8_strings.py
16 files changed, 335 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/16908/9
--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns

2021-01-21 Thread liuyao (Code Review)
liuyao has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16947 )

Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to 
support a list of columns
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16947/3/testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test
File 
testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test:

http://gerrit.cloudera.org:8080/#/c/16947/3/testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test@830
PS3, Line 830: 
> I add some test cases.
Done


http://gerrit.cloudera.org:8080/#/c/16947/3/testdata/workloads/functional-query/queries/QueryTest/compute-stats-incremental.test@830
PS3, Line 830: 
> Can you also run compute incremental stats without specifying a partition t
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16947
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Gerrit-Change-Number: 16947
Gerrit-PatchSet: 4
Gerrit-Owner: liuyao 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Reviewer: liuyao 
Gerrit-Comment-Date: Thu, 21 Jan 2021 10:09:44 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10435: Extend 'compute incremental stats' syntax to support a list of columns

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16947 )

Change subject: IMPALA-10435: Extend 'compute incremental stats' syntax to 
support a list of columns
..


Patch Set 4: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16947
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Gerrit-Change-Number: 16947
Gerrit-PatchSet: 4
Gerrit-Owner: liuyao 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Reviewer: liuyao 
Gerrit-Comment-Date: Thu, 21 Jan 2021 09:28:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10440: Import Theta functionality from DataSketches

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16959 )

Change subject: IMPALA-10440: Import Theta functionality from DataSketches
..


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6857/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16959
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8485d6829f50b130c84ec8bef0a4b5895255ba6c
Gerrit-Change-Number: 16959
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 21 Jan 2021 09:16:07 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10440: Import Theta functionality from DataSketches

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16959 )

Change subject: IMPALA-10440: Import Theta functionality from DataSketches
..


Patch Set 4: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16959
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8485d6829f50b130c84ec8bef0a4b5895255ba6c
Gerrit-Change-Number: 16959
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 21 Jan 2021 09:16:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10440: Import Theta functionality from DataSketches

2021-01-21 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16959 )

Change subject: IMPALA-10440: Import Theta functionality from DataSketches
..


Patch Set 3: Code-Review+2

Thanks for the changes!


--
To view, visit http://gerrit.cloudera.org:8080/16959
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8485d6829f50b130c84ec8bef0a4b5895255ba6c
Gerrit-Change-Number: 16959
Gerrit-PatchSet: 3
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 21 Jan 2021 09:15:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8036/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 09:08:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10434: Fix impala-shell's unicode regressions on Python2

2021-01-21 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16960 )

Change subject: IMPALA-10434: Fix impala-shell's unicode regressions on Python2
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16960/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16960/4//COMMIT_MSG@7
PS4, Line 7: IMPALA-10434
> Oh, the JIRA id is IMPALA-10415 instead! So sorry for this mistake...
Thanks Fang-Yu's help. Contents of these two JIRAs are swapped.



--
To view, visit http://gerrit.cloudera.org:8080/16960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icc4a8d31311a5c59e5fc0e65fe09f770df41bea4
Gerrit-Change-Number: 16960
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Laszlo Gaal 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 08:55:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-21 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..


Patch Set 7:

Added more tests and more fixes for the revealed problems. The implementation 
become more complex now. Still need more test coverage and there may be other 
issues.

I'm concerning whether it's worth the complexity to support the query hints... 
Maybe we can provide the query option first, and implement the hints when users 
are asking for it. I'll update the first two patches (utf8-func, 
utf8-char-varchar) to not consider this patch. So their implementation can be 
simpler hopefully.


--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 08:52:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16910/7/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/16910/7/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1275
PS7, Line 1275: if (result.getType().isStringType() && !((ScalarType) 
result.getType()).isSetUtf8Mode()) {
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/16910/7/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/16910/7/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@498
PS7, Line 498: if (colType != null && colType.isFixedLengthType() && 
!((ScalarType) colType).isUtf8Mode()) {
line too long (97 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 08:47:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-21 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16910

to look at the new patch set (#7).

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..

[WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

This patch introduces a pair of query hints, UTF8_MODE_ON and
UTF8_MODE_OFF, to turn on/off the UTF-8 aware behavior. The query hint
should be put in front of the select list. It will affect the query
block and all subquery blocks until there is another hint in the
subquery. See more examples in the test file.

Implementation:
Each Analyzer instance is corresponding to a query block. We introduces
a flag, isUtf8Mode_, in Analyzer. When analyzing a SelectStmt, we change
the order to analyze the hints of the select list first. Then each
Analyzer will get the correct utf-8 hint. When detecting whether the
current query block is in UTF-8 mode, check the flag first. If it's not
set, inherits the ancessor Analyzer's state.

There are some gotchas in the current code base:
 - Type instances are shared across FE, e.g. metadata, slot descriptors,
   exprs, etc. Changing the utf8 marker of a Type instance should make
   sure it's not shared in any other places. Otherwise, we could
   accidentially change the utf8 mode of other parts.
 - In planning phase, some exprs are substituted and re-analyzed. The
   utf8 markers can lost due to using an analyzer of other scope.
 - The arg type of a ScalarFnCall expr is assumed to be the identical
   with the return type of the child expr. This is not true after this
   patch, since they could be in different utf8 mode.
The first problem can be resolved by cloning Type instances whenever we
do an assignment. To limit the scope of the code changes, we just do
this for creating slot descriptors. For exprs, we add the utf8 marker to
it as well and mark it there (instead of marking utf8 mode for each
child types and return type). This simplify the work since we Exprs
instances won't be shared with the metadata codes.

For the second problem, we change the utf8 marker from boolean to
Boolean, and initialize it as null. Precondition checks are added if the
utf8 marker is flipped. In the planning phase, analyzers will be marked
as in planning (by a new state field). Expr#analyze will ignore the utf8
marker of the analyzer in this state, which helps to keep its original
utf8 marker (if has).

For the third problem, we just need to take care of FunctionContextImpl
creation and the related codegen code paths. When generating arg types,
use the utf8 marker of the ScalarFnCall expr instead.

Tests:
 - Add tests for using query hints with string functions.
 - TODO: Add more tests for reading from tables.

Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
---
M be/src/exprs/anyval-util.h
M be/src/exprs/expr.h
M be/src/exprs/scalar-expr-evaluator.cc
M be/src/exprs/scalar-fn-call.cc
M be/src/exprs/slot-ref.cc
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/AggregateInfoBase.java
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/analysis/SelectList.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java
M fe/src/main/java/org/apache/impala/analysis/TypeDef.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/ArrayType.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Db.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/IcebergStructField.java
M fe/src/main/java/org/apache/impala/catalog/MapType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/StructField.java
M fe/src/main/java/org/apache/impala/catalog/StructType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
A testdata/workloads/functional-query/queries/QueryTest/utf8-hints.test
M tests/query_test/test_utf8_strings.py
34 files changed, 361 insertions(+), 97 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/10/16910/7
--
To view, visit