[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8036/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 09:08:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-21 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..


Patch Set 7:

Added more tests and more fixes for the revealed problems. The implementation 
become more complex now. Still need more test coverage and there may be other 
issues.

I'm concerning whether it's worth the complexity to support the query hints... 
Maybe we can provide the query option first, and implement the hints when users 
are asking for it. I'll update the first two patches (utf8-func, 
utf8-char-varchar) to not consider this patch. So their implementation can be 
simpler hopefully.


--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 08:52:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16910/7/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/16910/7/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1275
PS7, Line 1275: if (result.getType().isStringType() && !((ScalarType) 
result.getType()).isSetUtf8Mode()) {
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/16910/7/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/16910/7/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@498
PS7, Line 498: if (colType != null && colType.isFixedLengthType() && 
!((ScalarType) colType).isUtf8Mode()) {
line too long (97 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 08:47:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-21 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16910

to look at the new patch set (#7).

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..

[WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

This patch introduces a pair of query hints, UTF8_MODE_ON and
UTF8_MODE_OFF, to turn on/off the UTF-8 aware behavior. The query hint
should be put in front of the select list. It will affect the query
block and all subquery blocks until there is another hint in the
subquery. See more examples in the test file.

Implementation:
Each Analyzer instance is corresponding to a query block. We introduces
a flag, isUtf8Mode_, in Analyzer. When analyzing a SelectStmt, we change
the order to analyze the hints of the select list first. Then each
Analyzer will get the correct utf-8 hint. When detecting whether the
current query block is in UTF-8 mode, check the flag first. If it's not
set, inherits the ancessor Analyzer's state.

There are some gotchas in the current code base:
 - Type instances are shared across FE, e.g. metadata, slot descriptors,
   exprs, etc. Changing the utf8 marker of a Type instance should make
   sure it's not shared in any other places. Otherwise, we could
   accidentially change the utf8 mode of other parts.
 - In planning phase, some exprs are substituted and re-analyzed. The
   utf8 markers can lost due to using an analyzer of other scope.
 - The arg type of a ScalarFnCall expr is assumed to be the identical
   with the return type of the child expr. This is not true after this
   patch, since they could be in different utf8 mode.
The first problem can be resolved by cloning Type instances whenever we
do an assignment. To limit the scope of the code changes, we just do
this for creating slot descriptors. For exprs, we add the utf8 marker to
it as well and mark it there (instead of marking utf8 mode for each
child types and return type). This simplify the work since we Exprs
instances won't be shared with the metadata codes.

For the second problem, we change the utf8 marker from boolean to
Boolean, and initialize it as null. Precondition checks are added if the
utf8 marker is flipped. In the planning phase, analyzers will be marked
as in planning (by a new state field). Expr#analyze will ignore the utf8
marker of the analyzer in this state, which helps to keep its original
utf8 marker (if has).

For the third problem, we just need to take care of FunctionContextImpl
creation and the related codegen code paths. When generating arg types,
use the utf8 marker of the ScalarFnCall expr instead.

Tests:
 - Add tests for using query hints with string functions.
 - TODO: Add more tests for reading from tables.

Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
---
M be/src/exprs/anyval-util.h
M be/src/exprs/expr.h
M be/src/exprs/scalar-expr-evaluator.cc
M be/src/exprs/scalar-fn-call.cc
M be/src/exprs/slot-ref.cc
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/AggregateInfoBase.java
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/analysis/SelectList.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java
M fe/src/main/java/org/apache/impala/analysis/TypeDef.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/ArrayType.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Db.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/IcebergStructField.java
M fe/src/main/java/org/apache/impala/catalog/MapType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/StructField.java
M fe/src/main/java/org/apache/impala/catalog/StructType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
A testdata/workloads/functional-query/queries/QueryTest/utf8-hints.test
M tests/query_test/test_utf8_strings.py
34 files changed, 361 insertions(+), 97 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/10/16910/7
--
To view, visit http://gerrit.cl

[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-13 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16910/5/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/16910/5/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2815
PS5, Line 2815:   public void setIsUtf8Mode(boolean isUtf8Mode) throws 
AnalysisException{
> I mentioned this to you out-of-band but thought I should mention that I thi
Yeah, I found some issues when adding more complex tests. I plan to make two 
refactors:
1) Change the isUtf8 marker from boolean to Boolean object. So it can have null 
as the initial state, which helps us detect whether the marker is reset.
2) In FE, move the isUtf8 marker from ScalarType to Expr. There are lots of 
Type instances shared in the metadata. We need explicitly cloning the Type 
instances in many places, which is error prone. Actually, CastExpr and 
FunctionCallExpr are the only places where the UTF8 semantic take place. Save 
the isUtf8 marker in them makes more sense. Still need to examine the 
substition codes to see whether we'll lose these markers.



--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Jan 2021 09:17:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16910/5/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/16910/5/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2815
PS5, Line 2815:   public void setIsUtf8Mode(boolean isUtf8Mode) throws 
AnalysisException{
I mentioned this to you out-of-band but thought I should mention that I think 
this will get lost on some expressions when they are substituted and 
re-analyzed in a different query block or with the root analyzer during 
planning.



--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 08 Jan 2021 17:45:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..


Patch Set 2:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/7936/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 04 Jan 2021 09:21:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

2021-01-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/16910 )

Change subject: [WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior
..

[WIP] IMPALA-10410: Query hints for turn on/off UTF-8 behavior

This patch introduces a pair of query hints, UTF8_MODE_ON and
UTF8_MODE_OFF, to turn on/off the UTF-8 aware behavior. The query hint
should be put in front of the select list. It will affect the query
block and all subquery blocks until there is another hint in the
subquery. See more examples in the test file.

Implementation:
Each Analyzer instance is corresponding to a query block. We introduces
a flag, isUtf8Mode_, in Analyzer. When analyzing a SelectStmt, we change
the order to analyze the hints of the select list first. Then each
Analyzer will get the correct utf-8 hint. When detecting whether the
current query block is in UTF-8 mode, check the flag first. If it's not
set, inherits the ancessor Analyzer's state.

Tests:
 - Add tests for using query hints with string functions.
 - TODO: Add more tests for reading from tables.
 - TODO: Add tests for that consequent tuple descriptors both have
   char(N) slots but one is utf8 and the other is not.

Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
---
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SelectList.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
A testdata/workloads/functional-query/queries/QueryTest/utf8-hints.test
M tests/query_test/test_utf8_strings.py
5 files changed, 83 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/10/16910/2
--
To view, visit http://gerrit.cloudera.org:8080/16910
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7fa20b62b5cb06169048b0785b70e85a9f21bf07
Gerrit-Change-Number: 16910
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang