[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly
[ https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366480#comment-16366480 ] Vihang Karajgaonkar edited comment on HIVE-18622 at 2/16/18 1:17 AM: - Thanks [~mmccline] for fixing this. I left minor comments which need to be addressed else this patch would regress fix for HIVE-18421. Would appreciate if you could take a look at them. We basically need to handle the overflows before returning in case the inputVector.isRepeating is true. Also, should the checked version of {{LongColModuloLongColumnChecked.java}} should also be modified similar to {{LongColModuloLongColumn.java}}? was (Author: vihangk1): Thanks [~mmccline] for fixing this. I left minor comments which need to be addressed else this patch would regress fix for HIVE-18421. Would appreciate if you could take a look at them. We basically need to handle the overflows before returning in case the inputVector.isRepeating is true. > Vectorization: IF Statements, Comparisons, and more do not handle NULLs > correctly > - > > Key: HIVE-18622 > URL: https://issues.apache.org/jira/browse/HIVE-18622 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Matt McCline >Assignee: Matt McCline >Priority: Critical > Fix For: 3.0.0 > > Attachments: HIVE-18622.03.patch, HIVE-18622.04.patch, > HIVE-18622.05.patch, HIVE-18622.06.patch, HIVE-18622.07.patch, > HIVE-18622.08.patch, HIVE-18622.09.patch, HIVE-18622.091.patch, > HIVE-18622.092.patch, HIVE-18622.093.patch, HIVE-18622.094.patch, > HIVE-18622.095.patch, HIVE-18622.096.patch, HIVE-18622.097.patch, > HIVE-18622.098.patch, HIVE-18622.099.patch, HIVE-18622.0991.patch > > > > Many vector expression classes are setting noNulls to true which does not > work if the VRB is a scratch column being reused. The previous use may have > set noNulls to false and the isNull array will have some rows marked as NULL. > The result is wrong query results and sometimes NPEs (for BytesColumnVector). > So, many vector expressions need this: > {code:java} > // Carefully handle NULLs... > /* >* For better performance on LONG/DOUBLE we don't want the conditional >* statements inside the for loop. >*/ > outputColVector.noNulls = false; > {code} > And, vector expressions need to make sure the isNull array entry is set when > outputColVector.noNulls is false. > And, all place that assign column value need to set noNulls to false when the > value is NULL. > Almost all cases where noNulls is set to true are incorrect. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly
[ https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366480#comment-16366480 ] Vihang Karajgaonkar edited comment on HIVE-18622 at 2/16/18 12:56 AM: -- Thanks [~mmccline] for fixing this. I left minor comments which need to be addressed else this patch would regress fix for HIVE-18421. Would appreciate if you could take a look at them. We basically need to handle the overflows before returning in case the inputVector.isRepeating is true. was (Author: vihangk1): Thanks [~mmccline] for fixing this. I left minor comments which would regress fix for HIVE-18421 unless fixed. Would appreciate if you could take a look at them. We basically need to handle the overflows before returning in case the inputVector.isRepeating is true. > Vectorization: IF Statements, Comparisons, and more do not handle NULLs > correctly > - > > Key: HIVE-18622 > URL: https://issues.apache.org/jira/browse/HIVE-18622 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Matt McCline >Assignee: Matt McCline >Priority: Critical > Fix For: 3.0.0 > > Attachments: HIVE-18622.03.patch, HIVE-18622.04.patch, > HIVE-18622.05.patch, HIVE-18622.06.patch, HIVE-18622.07.patch, > HIVE-18622.08.patch, HIVE-18622.09.patch, HIVE-18622.091.patch, > HIVE-18622.092.patch, HIVE-18622.093.patch, HIVE-18622.094.patch, > HIVE-18622.095.patch, HIVE-18622.096.patch, HIVE-18622.097.patch, > HIVE-18622.098.patch, HIVE-18622.099.patch, HIVE-18622.0991.patch > > > > Many vector expression classes are setting noNulls to true which does not > work if the VRB is a scratch column being reused. The previous use may have > set noNulls to false and the isNull array will have some rows marked as NULL. > The result is wrong query results and sometimes NPEs (for BytesColumnVector). > So, many vector expressions need this: > {code:java} > // Carefully handle NULLs... > /* >* For better performance on LONG/DOUBLE we don't want the conditional >* statements inside the for loop. >*/ > outputColVector.noNulls = false; > {code} > And, vector expressions need to make sure the isNull array entry is set when > outputColVector.noNulls is false. > And, all place that assign column value need to set noNulls to false when the > value is NULL. > Almost all cases where noNulls is set to true are incorrect. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly
[ https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365222#comment-16365222 ] Teddy Choi edited comment on HIVE-18622 at 2/15/18 8:11 AM: I tested HIVE-18622.098. Two files may need to be changed: vector_decimal_math_funcs.q.out and vectorization_part_project.q.out. I attached the changes I found. Please check it. Thanks. {noformat} diff --git ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out index b1c25c4180..526aa16445 100644 --- ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out +++ ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out @@ -326,7 +326,7 @@ from decimal_test) as q POSTHOOK: type: QUERY POSTHOOK: Input: default@decimal_test A masked pattern was here --198849865 +-1989051768985 PREHOOK: query: CREATE TABLE decimal_test_small STORED AS ORC AS SELECT cbigint, cdouble, CAST (((cdouble*22.1)/37) AS DECIMAL(12,4)) AS cdecimal1, CAST (((cdouble*9.3)/13) AS DECIMAL(14,8)) AS cdecimal2 FROM alltypesorc PREHOOK: type: CREATETABLE_AS_SELECT PREHOOK: Input: default@alltypesorc diff --git ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out index 130e1376ce..e46c7f4524 100644 --- ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out +++ ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out @@ -70,15 +70,15 @@ STAGE PLANS: Map Operator Tree: TableScan alias: alltypesorc_part - Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: PARTIAL + Statistics: Num rows: 200 Data size: 1592 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: (cdouble + 2.0) (type: double) outputColumnNames: _col0 -Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: PARTIAL +Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: double) sort order: + - Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: PARTIAL + Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: COMPLETE TopN Hash Memory Usage: 0.1 Execution mode: vectorized, llap LLAP IO: all inputs @@ -103,13 +103,13 @@ STAGE PLANS: Select Operator expressions: KEY.reducesinkkey0 (type: double) outputColumnNames: _col0 -Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: PARTIAL +Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: COMPLETE Limit Number of rows: 10 - Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: PARTIAL + Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false -Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: PARTIAL +Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat {noformat} was (Author: teddy.choi): I tested HIVE-18622.098. Two files may need to be changed: vector_decimal_math_funcs.q.out and vectorization_part_project.q.out. I attached the changes I found. Please check it. Thanks. {{diff --git ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}} {{index b1c25c4180..526aa16445 100644}} {{--- ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}} {{+++ ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}} {{@@ -326,7 +326,7 @@ from decimal_test) as q}} {{ POSTHOOK: type: QUERY}} {{ POSTHOOK: Input: default@decimal_test}} {{ A masked pattern was here }} {{--198849865}} {{+-1989051768985}} {{ PREHOOK: query: CREATE TABLE decimal_test_small STORED AS ORC AS SELECT cbigint, cdouble, CAST (((cdouble*22.1)/37) AS DECIMAL(12,4)) AS cdecimal1, CAST (((cdouble*9.3)/13) AS DECIM
[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly
[ https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365222#comment-16365222 ] Teddy Choi edited comment on HIVE-18622 at 2/15/18 8:11 AM: I tested HIVE-18622.098. Two files may need to be changed: vector_decimal_math_funcs.q.out and vectorization_part_project.q.out. I attached the changes I found. Please check it. Thanks. {{diff --git ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}} {{index b1c25c4180..526aa16445 100644}} {{--- ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}} {{+++ ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}} {{@@ -326,7 +326,7 @@ from decimal_test) as q}} {{ POSTHOOK: type: QUERY}} {{ POSTHOOK: Input: default@decimal_test}} {{ A masked pattern was here }} {{--198849865}} {{+-1989051768985}} {{ PREHOOK: query: CREATE TABLE decimal_test_small STORED AS ORC AS SELECT cbigint, cdouble, CAST (((cdouble*22.1)/37) AS DECIMAL(12,4)) AS cdecimal1, CAST (((cdouble*9.3)/13) AS DECIMAL(14,8)) AS cdecimal2 FROM alltypesorc}} {{ PREHOOK: type: CREATETABLE_AS_SELECT}} {{ PREHOOK: Input: default@alltypesorc}} {{diff --git ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out}} {{index 130e1376ce..e46c7f4524 100644}} {{--- ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out}} {{+++ ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out}} {{@@ -70,15 +70,15 @@ STAGE PLANS:}} {{ Map Operator Tree:}} {{ TableScan}} {{ alias: alltypesorc_part}} {{- Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: PARTIAL}} {{+ Statistics: Num rows: 200 Data size: 1592 Basic stats: COMPLETE Column stats: COMPLETE}} {{ Select Operator}} {{ expressions: (cdouble + 2.0) (type: double)}} {{ outputColumnNames: _col0}} {{- Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: PARTIAL}} {{+ Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: COMPLETE}} {{ Reduce Output Operator}} {{ key expressions: _col0 (type: double)}} {{ sort order: +}} {{- Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: PARTIAL}} {{+ Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: COMPLETE}} {{ TopN Hash Memory Usage: 0.1}} {{ Execution mode: vectorized, llap}} {{ LLAP IO: all inputs}} {{@@ -103,13 +103,13 @@ STAGE PLANS:}} {{ Select Operator}} {{ expressions: KEY.reducesinkkey0 (type: double)}} {{ outputColumnNames: _col0}} {{- Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: PARTIAL}} {{+ Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column stats: COMPLETE}} {{ Limit}} {{ Number of rows: 10}} {{- Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: PARTIAL}} {{+ Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: COMPLETE}} {{ File Output Operator}} {{ compressed: false}} {{- Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: PARTIAL}} {{+ Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: COMPLETE}} {{ table:}} {{ input format: org.apache.hadoop.mapred.SequenceFileInputFormat}} {{ output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}} was (Author: teddy.choi): I tested HIVE-18622.098. Two files may need to be changed: vector_decimal_math_funcs.q.out and vectorization_part_project.q.out. I attached the changes I found. Please check it. Thanks. > Vectorization: IF Statements, Comparisons, and more do not handle NULLs > correctly > - > > Key: HIVE-18622 > URL: https://issues.apache.org/jira/browse/HIVE-18622 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Matt McCline >Assignee: Matt McCline >Priority: Critical > Fix For: 3.0.0 > > Attachments: HIVE-18622.03.patch, HIVE-18622.04.patch, > HIVE-18622.05.patch, HIVE-18622.06.patch, HIVE-18622.07.patch, > HIVE-18622.08.patch, HIVE-18622.09.patch, HIVE-18622.091.patch, > HIVE-18622.092.patch, HIVE-18622.093.patch, HIVE-18622.094.patch, > HIVE-18622.095.patch, HIVE-18622.096.patch, HIVE-18622.097.patch, > HIVE-18622.098.patch > > > > Many vector expression classes are setting noNulls to true which does not > work if the VRB is a scratch column being reused. The previous use may have > set noNulls to false and the isNull array will have some rows marked as NULL. > The result is wrong query results and sometimes NPEs (for BytesColumnV
[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly
[ https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363446#comment-16363446 ] Sergey Shelukhin edited comment on HIVE-18622 at 2/14/18 3:45 AM: -- I looked at most of the iteration 7 on RB (pages 10-13 remaining to go thru), and 7-8 diff. There's one correctness issue that I found, with 0/i mixup. Some removals of setting isNull are not clear to me. My main concern is that either I'm missing the big picture, or there is no unified semantic approach to noNulls, and to the pattern of setting and unsetting it. The idea of noNulls is to avoid looking at isNull, so it's not clear why certain places in the code that fill the meaningful parts of the batch with non-nulls still don't set noNulls (I commented on one or two, there are many). Seems like noNulls should be set every time there are no nulls, and isNull array should be set correctly by whoever sets noNulls to false. So, an example uniform approach could be: 0) The only parts of the batch that matter are those included in batch size. 1) set isNull to false every time we set a non-null value and noNulls is not true. 2) when flipping noNulls to false, make sure that isNull is correct; when flipping it as part of larger loop that always sets isNull unconditionally (e.g. in TreeReader::nextVector in ORC) no additional action necessary; when looping thru a bunch of non-nulls and finding a null it may be necessary to backfill the false-s in the preceding values and rely on (1) to fill the following values once noNulls is flipped; when setting individual elements without context it may be necessary to fill the array entirely (done only once when actually flipping noNulls). Right now it seems like some places are too conservative and fill isNull even when noNulls is true; and some actually remove isNull-setting when setting values to non-nulls, without checking for noNulls (so, presumably if noNulls is true isNull could be incorrect and it's not clear the next set that happens to be a null doesn't flip noNulls and renders the previous value invalid. Or at least the pattern in these approaches is not clear to me - could be because the patch is so large; perhaps it should be described somewhere in one place. was (Author: sershe): I looked at most of the iteration 7 on RB (pages 10-13 remaining to go thru), and 7-8 diff. There's one correctness issue that I found, with 0/i mixup. Some removals of setting isNull are not clear to me. My main concern is that either I'm missing the big picture, or there is no unified semantic approach to noNulls, and to the pattern of setting and unsetting it. The idea of noNulls is to avoid looking at isNull, so it's not clear why certain places in the code that fill the meaningful parts of the batch with non-nulls still don't set noNulls (I commented on one or two, there are many). Seems like noNulls should be set every time there are noNulls, and isNull array should be set correctly by whoever sets noNulls to false. So, the approach could be: 1) set isNull to false every time we set a non-null value and noNulls is not true. 2) when flipping noNulls to false, make sure that isNull is correct; when flipping it as part of larger loop that always sets isNull unconditionally (e.g. in TreeReader::nextVector in ORC) it's not necessary; when looping thru a bunch of non-nulls and finding a null it may be necessary to backfill the false-s in the preceding values and rely on (1) to fill the following values once noNulls is flipped; when setting individual elements without context it may be necessary to fill the array entirely (done only once when actually flipping noNulls). Right now it seems like some places are too conservative and fill isNull even when noNulls is true; and some actually remove isNull-setting when setting values to non-nulls, without checking for noNulls (so, presumably if noNulls is true isNull could be incorrect and it's not clear the next set that happens to be a null doesn't flip noNulls and renders the previous value invalid. Or at least the pattern in these approaches is not clear to me - could be because the patch is so large; perhaps it should be described somewhere in one place. > Vectorization: IF Statements, Comparisons, and more do not handle NULLs > correctly > - > > Key: HIVE-18622 > URL: https://issues.apache.org/jira/browse/HIVE-18622 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Matt McCline >Assignee: Matt McCline >Priority: Critical > Fix For: 3.0.0 > > Attachments: HIVE-18622.03.patch, HIVE-18622.04.patch, > HIVE-18622.05.patch, HIVE-18622.06.patch, HIVE-18622.07.patch, > HIVE-18622.