[jira] [Commented] (HIVE-24902) Incorrect result after fold CASE into COALESCE
[ https://issues.apache.org/jira/browse/HIVE-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305966#comment-17305966 ] Nemon Lou commented on HIVE-24902: -- Thank you [~kgyrtkirk] FYI, Hive 2.3.7 can remove the whole case by ProjectReduceExpressionsRule among others. An example sql: {code:sql} create table b(a int); insert into b values (1),(2); select * from ( select case when b.a=1 then cast (from_unixtime(unix_timestamp(cast(20210309 as string),'MMdd') - 86400,'MMdd') as bigint) else 20210309 end as col from b ) t where t.col is not null; {code} I can not tell why. A direct fix is by validating booleans on both branches during COALESCE rewrite.I will submit a PR following this proposal. > Incorrect result after fold CASE into COALESCE > -- > > Key: HIVE-24902 > URL: https://issues.apache.org/jira/browse/HIVE-24902 > Project: Hive > Issue Type: Bug > Components: CBO >Affects Versions: 3.1.2, 4.0.0 >Reporter: Nemon Lou >Priority: Major > > The following sql returns only one record (20210308) but expected two(20210308 > 20210309). > {code:sql} > select * from ( > select > case when b.a=1 > then > cast (from_unixtime(unix_timestamp(cast(20210309 as > string),'MMdd') - 86400,'MMdd') as bigint) > else > 20210309 > end > as col > from > (select stack(2,1,2) as (a)) > as b > ) t > where t.col is not null; > {code} > The query plan has incorrect predict: > predicate: COALESCE((col0 = 1),false) (type: boolean) > {code:sql} > STAGE DEPENDENCIES: > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > TableScan > alias: _dummy_table > Row Limit Per Split: 1 > Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column > stats: COMPLETE > Select Operator > expressions: 2 (type: int), 1 (type: int), 2 (type: int) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > UDTF Operator > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > function name: stack > Filter Operator > predicate: COALESCE((col0 = 1),false) (type: boolean) > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > Select Operator > expressions: CASE WHEN ((col0 = 1)) THEN (20210308L) ELSE > (20210309L) END (type: bigint) > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE > Column stats: COMPLETE > ListSink > Time taken: 0.155 seconds, Fetched: 28 row(s) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24902) Incorrect result after fold CASE into COALESCE
[ https://issues.apache.org/jira/browse/HIVE-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304812#comment-17304812 ] Zoltan Haindrich commented on HIVE-24902: - [~nemon] seems like we were debugging around the same time :D I'll have to jump on something else anyway...so I'll leave it for you ; let me know if you need any help! > Incorrect result after fold CASE into COALESCE > -- > > Key: HIVE-24902 > URL: https://issues.apache.org/jira/browse/HIVE-24902 > Project: Hive > Issue Type: Bug > Components: CBO >Affects Versions: 3.1.2, 4.0.0 >Reporter: Nemon Lou >Priority: Major > > The following sql returns only one record (20210308) but expected two(20210308 > 20210309). > {code:sql} > select * from ( > select > case when b.a=1 > then > cast (from_unixtime(unix_timestamp(cast(20210309 as > string),'MMdd') - 86400,'MMdd') as bigint) > else > 20210309 > end > as col > from > (select stack(2,1,2) as (a)) > as b > ) t > where t.col is not null; > {code} > The query plan has incorrect predict: > predicate: COALESCE((col0 = 1),false) (type: boolean) > {code:sql} > STAGE DEPENDENCIES: > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > TableScan > alias: _dummy_table > Row Limit Per Split: 1 > Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column > stats: COMPLETE > Select Operator > expressions: 2 (type: int), 1 (type: int), 2 (type: int) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > UDTF Operator > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > function name: stack > Filter Operator > predicate: COALESCE((col0 = 1),false) (type: boolean) > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > Select Operator > expressions: CASE WHEN ((col0 = 1)) THEN (20210308L) ELSE > (20210309L) END (type: bigint) > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE > Column stats: COMPLETE > ListSink > Time taken: 0.155 seconds, Fetched: 28 row(s) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24902) Incorrect result after fold CASE into COALESCE
[ https://issues.apache.org/jira/browse/HIVE-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304810#comment-17304810 ] Zoltan Haindrich commented on HIVE-24902: - the bug is in https://github.com/apache/hive/blob/eed78dfdcb6dfc2de400397a60de12e6f62b96e2/ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java#L1080 because convertCASEIntoCOALESCEFuncCallExpr identifies all CASEs with booleans on both branches as valid candidates https://github.com/apache/hive/blob/eed78dfdcb6dfc2de400397a60de12e6f62b96e2/ql/src/java/org/apache/hadoop/hive/ql/parse/type/ExprNodeDescExprFactory.java#L811 actually the whole case could be removed - but it was not done by calcite because it was not able to deduce that {code} is not null ( cast (from_unixtime(unix_timestamp(cast(20210309 as string),'MMdd') - 86400,'MMdd') as bigint) ) {code} is essentially true (most likely because {{from_unixtime}} and {{unix_timestamp}} functions are opaque) > Incorrect result after fold CASE into COALESCE > -- > > Key: HIVE-24902 > URL: https://issues.apache.org/jira/browse/HIVE-24902 > Project: Hive > Issue Type: Bug > Components: CBO >Affects Versions: 3.1.2, 4.0.0 >Reporter: Nemon Lou >Priority: Major > > The following sql returns only one record (20210308) but expected two(20210308 > 20210309). > {code:sql} > select * from ( > select > case when b.a=1 > then > cast (from_unixtime(unix_timestamp(cast(20210309 as > string),'MMdd') - 86400,'MMdd') as bigint) > else > 20210309 > end > as col > from > (select stack(2,1,2) as (a)) > as b > ) t > where t.col is not null; > {code} > The query plan has incorrect predict: > predicate: COALESCE((col0 = 1),false) (type: boolean) > {code:sql} > STAGE DEPENDENCIES: > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > TableScan > alias: _dummy_table > Row Limit Per Split: 1 > Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column > stats: COMPLETE > Select Operator > expressions: 2 (type: int), 1 (type: int), 2 (type: int) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > UDTF Operator > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > function name: stack > Filter Operator > predicate: COALESCE((col0 = 1),false) (type: boolean) > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > Select Operator > expressions: CASE WHEN ((col0 = 1)) THEN (20210308L) ELSE > (20210309L) END (type: bigint) > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE > Column stats: COMPLETE > ListSink > Time taken: 0.155 seconds, Fetched: 28 row(s) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24902) Incorrect result after fold CASE into COALESCE
[ https://issues.apache.org/jira/browse/HIVE-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304780#comment-17304780 ] Nemon Lou commented on HIVE-24902: -- Here is the process how filter expression goes wrong: Pre optimize(good): {code:sql} IS NOT NULL(CASE(=($0, 1), CAST(FROM_UNIXTIME(-(UNIX_TIMESTAMP(CAST(_UTF-16LE'20210309'):VARCHAR(2147483647) CHARACTER SET "UTF-16LE" COLLATE "ISO-8859-1$en_US$primary", _UTF-16LE'MMdd'), CAST(86400):BIGINT), _UTF-16LE'MMdd')):BIGINT, 20210309)) {code} After pushes predicates into CASE(good): {code:sql} CASE(=($0, 1), IS NOT NULL(CAST(FROM_UNIXTIME(-(UNIX_TIMESTAMP(CAST(_UTF-16LE'20210309'):VARCHAR(2147483647) CHARACTER SET "UTF-16LE" COLLATE "ISO-8859-1$en_US$primary", _UTF-16LE'MMdd'), CAST(86400):BIGINT), _UTF-16LE'MMdd')):BIGINT), true) {code} After constants folding(good): {code:sql} CASE(=($0, 1), true, true) {code} After Rewrite CASE into COALESCE(bad): {code:sql} COALESCE(=($0, 1),false) {code} The related code of COALESCE rewrite: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java#L1079 > Incorrect result after fold CASE into COALESCE > -- > > Key: HIVE-24902 > URL: https://issues.apache.org/jira/browse/HIVE-24902 > Project: Hive > Issue Type: Bug > Components: CBO >Affects Versions: 3.1.2, 4.0.0 >Reporter: Nemon Lou >Priority: Major > > The following sql returns only one record (20210308) but expected two(20210308 > 20210309). > {code:sql} > select * from ( > select > case when b.a=1 > then > cast (from_unixtime(unix_timestamp(cast(20210309 as > string),'MMdd') - 86400,'MMdd') as bigint) > else > 20210309 > end > as col > from > (select stack(2,1,2) as (a)) > as b > ) t > where t.col is not null; > {code} > The query plan has incorrect predict: > predicate: COALESCE((col0 = 1),false) (type: boolean) > {code:sql} > STAGE DEPENDENCIES: > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > TableScan > alias: _dummy_table > Row Limit Per Split: 1 > Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column > stats: COMPLETE > Select Operator > expressions: 2 (type: int), 1 (type: int), 2 (type: int) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > UDTF Operator > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > function name: stack > Filter Operator > predicate: COALESCE((col0 = 1),false) (type: boolean) > Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE > Column stats: COMPLETE > Select Operator > expressions: CASE WHEN ((col0 = 1)) THEN (20210308L) ELSE > (20210309L) END (type: bigint) > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE > Column stats: COMPLETE > ListSink > Time taken: 0.155 seconds, Fetched: 28 row(s) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)