[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly

2018-02-15 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366480#comment-16366480
 ] 

Vihang Karajgaonkar edited comment on HIVE-18622 at 2/16/18 1:17 AM:
-

Thanks [~mmccline] for fixing this. I left minor comments which need to be 
addressed else this patch would regress fix for HIVE-18421. Would appreciate if 
you could take a look at them. We basically need to handle the overflows before 
returning in case the inputVector.isRepeating is true. Also, should the checked 
version of {{LongColModuloLongColumnChecked.java}} should also be modified 
similar to {{LongColModuloLongColumn.java}}?


was (Author: vihangk1):
Thanks [~mmccline] for fixing this. I left minor comments which need to be 
addressed else this patch would regress fix for HIVE-18421. Would appreciate if 
you could take a look at them. We basically need to handle the overflows before 
returning in case the inputVector.isRepeating is true.

> Vectorization: IF Statements, Comparisons, and more do not handle NULLs 
> correctly
> -
>
> Key: HIVE-18622
> URL: https://issues.apache.org/jira/browse/HIVE-18622
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HIVE-18622.03.patch, HIVE-18622.04.patch, 
> HIVE-18622.05.patch, HIVE-18622.06.patch, HIVE-18622.07.patch, 
> HIVE-18622.08.patch, HIVE-18622.09.patch, HIVE-18622.091.patch, 
> HIVE-18622.092.patch, HIVE-18622.093.patch, HIVE-18622.094.patch, 
> HIVE-18622.095.patch, HIVE-18622.096.patch, HIVE-18622.097.patch, 
> HIVE-18622.098.patch, HIVE-18622.099.patch, HIVE-18622.0991.patch
>
>
>  
>  Many vector expression classes are setting noNulls to true which does not 
> work if the VRB is a scratch column being reused. The previous use may have 
> set noNulls to false and the isNull array will have some rows marked as NULL. 
> The result is wrong query results and sometimes NPEs (for BytesColumnVector).
> So, many vector expressions need this:
> {code:java}
>   // Carefully handle NULLs...
>   /*
>* For better performance on LONG/DOUBLE we don't want the conditional
>* statements inside the for loop.
>*/
>   outputColVector.noNulls = false;
>  {code}
> And, vector expressions need to make sure the isNull array entry is set when 
> outputColVector.noNulls is false.
> And, all place that assign column value need to set noNulls to false when the 
> value is NULL.
> Almost all cases where noNulls is set to true are incorrect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly

2018-02-15 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366480#comment-16366480
 ] 

Vihang Karajgaonkar edited comment on HIVE-18622 at 2/16/18 12:56 AM:
--

Thanks [~mmccline] for fixing this. I left minor comments which need to be 
addressed else this patch would regress fix for HIVE-18421. Would appreciate if 
you could take a look at them. We basically need to handle the overflows before 
returning in case the inputVector.isRepeating is true.


was (Author: vihangk1):
Thanks [~mmccline] for fixing this. I left minor comments which would regress 
fix for HIVE-18421 unless fixed. Would appreciate if you could take a look at 
them. We basically need to handle the overflows before returning in case the 
inputVector.isRepeating is true.

> Vectorization: IF Statements, Comparisons, and more do not handle NULLs 
> correctly
> -
>
> Key: HIVE-18622
> URL: https://issues.apache.org/jira/browse/HIVE-18622
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HIVE-18622.03.patch, HIVE-18622.04.patch, 
> HIVE-18622.05.patch, HIVE-18622.06.patch, HIVE-18622.07.patch, 
> HIVE-18622.08.patch, HIVE-18622.09.patch, HIVE-18622.091.patch, 
> HIVE-18622.092.patch, HIVE-18622.093.patch, HIVE-18622.094.patch, 
> HIVE-18622.095.patch, HIVE-18622.096.patch, HIVE-18622.097.patch, 
> HIVE-18622.098.patch, HIVE-18622.099.patch, HIVE-18622.0991.patch
>
>
>  
>  Many vector expression classes are setting noNulls to true which does not 
> work if the VRB is a scratch column being reused. The previous use may have 
> set noNulls to false and the isNull array will have some rows marked as NULL. 
> The result is wrong query results and sometimes NPEs (for BytesColumnVector).
> So, many vector expressions need this:
> {code:java}
>   // Carefully handle NULLs...
>   /*
>* For better performance on LONG/DOUBLE we don't want the conditional
>* statements inside the for loop.
>*/
>   outputColVector.noNulls = false;
>  {code}
> And, vector expressions need to make sure the isNull array entry is set when 
> outputColVector.noNulls is false.
> And, all place that assign column value need to set noNulls to false when the 
> value is NULL.
> Almost all cases where noNulls is set to true are incorrect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly

2018-02-15 Thread Teddy Choi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365222#comment-16365222
 ] 

Teddy Choi edited comment on HIVE-18622 at 2/15/18 8:11 AM:


I tested HIVE-18622.098. Two files may need to be changed: 
vector_decimal_math_funcs.q.out and vectorization_part_project.q.out. I 
attached the changes I found. Please check it. Thanks.
{noformat}
diff --git 
ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out 
ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out
index b1c25c4180..526aa16445 100644
--- ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out
+++ ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out
@@ -326,7 +326,7 @@ from decimal_test) as q
 POSTHOOK: type: QUERY
 POSTHOOK: Input: default@decimal_test
  A masked pattern was here 
--198849865
+-1989051768985
 PREHOOK: query: CREATE TABLE decimal_test_small STORED AS ORC AS SELECT 
cbigint, cdouble, CAST (((cdouble*22.1)/37) AS DECIMAL(12,4)) AS cdecimal1, 
CAST (((cdouble*9.3)/13) AS DECIMAL(14,8)) AS cdecimal2 FROM alltypesorc
 PREHOOK: type: CREATETABLE_AS_SELECT
 PREHOOK: Input: default@alltypesorc
diff --git 
ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out 
ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out
index 130e1376ce..e46c7f4524 100644
--- ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out
+++ ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out
@@ -70,15 +70,15 @@ STAGE PLANS:
 Map Operator Tree:
 TableScan
   alias: alltypesorc_part
-  Statistics: Num rows: 200 Data size: 1600 Basic stats: 
COMPLETE Column stats: PARTIAL
+  Statistics: Num rows: 200 Data size: 1592 Basic stats: 
COMPLETE Column stats: COMPLETE
   Select Operator
 expressions: (cdouble + 2.0) (type: double)
 outputColumnNames: _col0
-Statistics: Num rows: 200 Data size: 1600 Basic stats: 
COMPLETE Column stats: PARTIAL
+Statistics: Num rows: 200 Data size: 1600 Basic stats: 
COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col0 (type: double)
   sort order: +
-  Statistics: Num rows: 200 Data size: 1600 Basic stats: 
COMPLETE Column stats: PARTIAL
+  Statistics: Num rows: 200 Data size: 1600 Basic stats: 
COMPLETE Column stats: COMPLETE
   TopN Hash Memory Usage: 0.1
 Execution mode: vectorized, llap
 LLAP IO: all inputs
@@ -103,13 +103,13 @@ STAGE PLANS:
   Select Operator
 expressions: KEY.reducesinkkey0 (type: double)
 outputColumnNames: _col0
-Statistics: Num rows: 200 Data size: 1600 Basic stats: 
COMPLETE Column stats: PARTIAL
+Statistics: Num rows: 200 Data size: 1600 Basic stats: 
COMPLETE Column stats: COMPLETE
 Limit
   Number of rows: 10
-  Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE 
Column stats: PARTIAL
+  Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE 
Column stats: COMPLETE
   File Output Operator
 compressed: false
-Statistics: Num rows: 10 Data size: 80 Basic stats: 
COMPLETE Column stats: PARTIAL
+Statistics: Num rows: 10 Data size: 80 Basic stats: 
COMPLETE Column stats: COMPLETE
 table:
 input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
 output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
{noformat}
 


was (Author: teddy.choi):
I tested HIVE-18622.098. Two files may need to be changed: 
vector_decimal_math_funcs.q.out and vectorization_part_project.q.out. I 
attached the changes I found. Please check it. Thanks.

{{diff --git 
ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out 
ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}}
{{index b1c25c4180..526aa16445 100644}}
{{--- ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}}
{{+++ ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}}
{{@@ -326,7 +326,7 @@ from decimal_test) as q}}
{{ POSTHOOK: type: QUERY}}
{{ POSTHOOK: Input: default@decimal_test}}
{{  A masked pattern was here }}
{{--198849865}}
{{+-1989051768985}}
{{ PREHOOK: query: CREATE TABLE decimal_test_small STORED AS ORC AS SELECT 
cbigint, cdouble, CAST (((cdouble*22.1)/37) AS DECIMAL(12,4)) AS cdecimal1, 
CAST (((cdouble*9.3)/13) AS DECIM

[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly

2018-02-15 Thread Teddy Choi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365222#comment-16365222
 ] 

Teddy Choi edited comment on HIVE-18622 at 2/15/18 8:11 AM:


I tested HIVE-18622.098. Two files may need to be changed: 
vector_decimal_math_funcs.q.out and vectorization_part_project.q.out. I 
attached the changes I found. Please check it. Thanks.

{{diff --git 
ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out 
ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}}
{{index b1c25c4180..526aa16445 100644}}
{{--- ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}}
{{+++ ql/src/test/results/clientpositive/llap/vector_decimal_math_funcs.q.out}}
{{@@ -326,7 +326,7 @@ from decimal_test) as q}}
{{ POSTHOOK: type: QUERY}}
{{ POSTHOOK: Input: default@decimal_test}}
{{  A masked pattern was here }}
{{--198849865}}
{{+-1989051768985}}
{{ PREHOOK: query: CREATE TABLE decimal_test_small STORED AS ORC AS SELECT 
cbigint, cdouble, CAST (((cdouble*22.1)/37) AS DECIMAL(12,4)) AS cdecimal1, 
CAST (((cdouble*9.3)/13) AS DECIMAL(14,8)) AS cdecimal2 FROM alltypesorc}}
{{ PREHOOK: type: CREATETABLE_AS_SELECT}}
{{ PREHOOK: Input: default@alltypesorc}}
{{diff --git 
ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out 
ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out}}
{{index 130e1376ce..e46c7f4524 100644}}
{{--- ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out}}
{{+++ ql/src/test/results/clientpositive/llap/vectorization_part_project.q.out}}
{{@@ -70,15 +70,15 @@ STAGE PLANS:}}
{{ Map Operator Tree:}}
{{ TableScan}}
{{ alias: alltypesorc_part}}
{{- Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column 
stats: PARTIAL}}
{{+ Statistics: Num rows: 200 Data size: 1592 Basic stats: COMPLETE Column 
stats: COMPLETE}}
{{ Select Operator}}
{{ expressions: (cdouble + 2.0) (type: double)}}
{{ outputColumnNames: _col0}}
{{- Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column 
stats: PARTIAL}}
{{+ Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column 
stats: COMPLETE}}
{{ Reduce Output Operator}}
{{ key expressions: _col0 (type: double)}}
{{ sort order: +}}
{{- Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column 
stats: PARTIAL}}
{{+ Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column 
stats: COMPLETE}}
{{ TopN Hash Memory Usage: 0.1}}
{{ Execution mode: vectorized, llap}}
{{ LLAP IO: all inputs}}
{{@@ -103,13 +103,13 @@ STAGE PLANS:}}
{{ Select Operator}}
{{ expressions: KEY.reducesinkkey0 (type: double)}}
{{ outputColumnNames: _col0}}
{{- Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column 
stats: PARTIAL}}
{{+ Statistics: Num rows: 200 Data size: 1600 Basic stats: COMPLETE Column 
stats: COMPLETE}}
{{ Limit}}
{{ Number of rows: 10}}
{{- Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: 
PARTIAL}}
{{+ Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: 
COMPLETE}}
{{ File Output Operator}}
{{ compressed: false}}
{{- Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: 
PARTIAL}}
{{+ Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: 
COMPLETE}}
{{ table:}}
{{ input format: org.apache.hadoop.mapred.SequenceFileInputFormat}}
{{ output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}}


was (Author: teddy.choi):
I tested HIVE-18622.098. Two files may need to be changed: 
vector_decimal_math_funcs.q.out and vectorization_part_project.q.out. I 
attached the changes I found. Please check it. Thanks.

> Vectorization: IF Statements, Comparisons, and more do not handle NULLs 
> correctly
> -
>
> Key: HIVE-18622
> URL: https://issues.apache.org/jira/browse/HIVE-18622
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HIVE-18622.03.patch, HIVE-18622.04.patch, 
> HIVE-18622.05.patch, HIVE-18622.06.patch, HIVE-18622.07.patch, 
> HIVE-18622.08.patch, HIVE-18622.09.patch, HIVE-18622.091.patch, 
> HIVE-18622.092.patch, HIVE-18622.093.patch, HIVE-18622.094.patch, 
> HIVE-18622.095.patch, HIVE-18622.096.patch, HIVE-18622.097.patch, 
> HIVE-18622.098.patch
>
>
>  
>  Many vector expression classes are setting noNulls to true which does not 
> work if the VRB is a scratch column being reused. The previous use may have 
> set noNulls to false and the isNull array will have some rows marked as NULL. 
> The result is wrong query results and sometimes NPEs (for BytesColumnV

[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly

2018-02-13 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363446#comment-16363446
 ] 

Sergey Shelukhin edited comment on HIVE-18622 at 2/14/18 3:45 AM:
--

I looked at most of the iteration 7 on RB (pages 10-13 remaining to go thru), 
and 7-8 diff.
There's one correctness issue that I found, with 0/i mixup.
Some removals of setting isNull are not clear to me.
My main concern is that either I'm missing the big picture, or there is no 
unified semantic approach to noNulls, and to the pattern of setting and 
unsetting it.
The idea of noNulls is to avoid looking at isNull, so it's not clear why 
certain places in the code that fill the meaningful parts of the batch with 
non-nulls still don't set noNulls (I commented on one or two, there are many). 
Seems like noNulls should be set every time there are no nulls, and isNull 
array should be set correctly by whoever sets noNulls to false.
So, an example uniform approach could be:
0) The only parts of the batch that matter are those included in batch size.
1) set isNull to false every time we set a non-null value and noNulls is not 
true.
2) when flipping noNulls to false, make sure that isNull is correct; when 
flipping it as part of larger loop that always sets isNull unconditionally 
(e.g. in TreeReader::nextVector in ORC) no additional action necessary; when 
looping thru a bunch of non-nulls and finding a null it may be necessary to 
backfill the false-s in the preceding values and rely on (1) to fill the 
following values once noNulls is flipped; when setting individual elements 
without context it may be necessary to fill the array entirely (done only once 
when actually flipping noNulls).

Right now it seems like some places are too conservative and fill isNull even 
when noNulls is true; and some actually remove isNull-setting when setting 
values to non-nulls, without checking for noNulls (so, presumably if noNulls is 
true isNull could be incorrect and it's not clear the next set that happens to 
be a null doesn't flip noNulls and renders the previous value invalid.
Or at least the pattern in these approaches is not clear to me - could be 
because the patch is so large; perhaps it should be described somewhere in one 
place.




was (Author: sershe):
I looked at most of the iteration 7 on RB (pages 10-13 remaining to go thru), 
and 7-8 diff.
There's one correctness issue that I found, with 0/i mixup.
Some removals of setting isNull are not clear to me.
My main concern is that either I'm missing the big picture, or there is no 
unified semantic approach to noNulls, and to the pattern of setting and 
unsetting it.
The idea of noNulls is to avoid looking at isNull, so it's not clear why 
certain places in the code that fill the meaningful parts of the batch with 
non-nulls still don't set noNulls (I commented on one or two, there are many). 
Seems like noNulls should be set every time there are noNulls, and isNull array 
should be set correctly by whoever sets noNulls to false.
So, the approach could be:
1) set isNull to false every time we set a non-null value and noNulls is not 
true.
2) when flipping noNulls to false, make sure that isNull is correct; when 
flipping it as part of larger loop that always sets isNull unconditionally 
(e.g. in TreeReader::nextVector in ORC) it's not necessary; when looping thru a 
bunch of non-nulls and finding a null it may be necessary to backfill the 
false-s in the preceding values and rely on (1) to fill the following values 
once noNulls is flipped; when setting individual elements without context it 
may be necessary to fill the array entirely (done only once when actually 
flipping noNulls).

Right now it seems like some places are too conservative and fill isNull even 
when noNulls is true; and some actually remove isNull-setting when setting 
values to non-nulls, without checking for noNulls (so, presumably if noNulls is 
true isNull could be incorrect and it's not clear the next set that happens to 
be a null doesn't flip noNulls and renders the previous value invalid.
Or at least the pattern in these approaches is not clear to me - could be 
because the patch is so large; perhaps it should be described somewhere in one 
place.



> Vectorization: IF Statements, Comparisons, and more do not handle NULLs 
> correctly
> -
>
> Key: HIVE-18622
> URL: https://issues.apache.org/jira/browse/HIVE-18622
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HIVE-18622.03.patch, HIVE-18622.04.patch, 
> HIVE-18622.05.patch, HIVE-18622.06.patch, HIVE-18622.07.patch, 
> HIVE-18622.