Vineet Garg created HIVE-21481:
----------------------------------

             Summary: MERGE correctness issues with null safe equality
                 Key: HIVE-21481
                 URL: https://issues.apache.org/jira/browse/HIVE-21481
             Project: Hive
          Issue Type: Bug
          Components: Query Planning
            Reporter: Vineet Garg


The way Hive currently generates plan for MERGE statement can lead to wrong 
results with null safe equality.

To illustrate consider the following reproducer
{code:sql}
create table ttarget(s string, j int, flag string) stored as orc 
tblproperties("transactional"="true");
truncate table ttarget;
insert into ttarget values('not_null', 1, 'dont udpate'), (null,2, 'update');

create table tsource (i int);
insert into tsource values(null),(2);
{code}

Let's say you have the following MERGE statement
{code:sql}
explain merge into ttarget using tsource on i<=>j
 when matched THEN
        UPDATE set flag='updated'
 when not matched THEN
        INSERT VALUES('new', 1999, 'true');
{code}

With this MERGE {{*ONLY ONE*}} row should match in target which should be 
updated. But currently due to the plan hive generate it will end up matching 
both rows.

This is because MERGE statement is rewritten into RIGHT OUTER JOIN + FILTER 
corresponding to all branches.

The part of the plan generated by hive for this statement consist of:
{noformat}
Map 2
            Map Operator Tree:
                TableScan
                  alias: tsource
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: NONE
                  Map Join Operator
                    condition map:
                         Right Outer Join 0 to 1
                    keys:
                      0 j (type: int)
                      1 i (type: int)
                    nullSafes: [true]
                    outputColumnNames: _col0, _col1, _col5, _col6
                    input vertices:
                      0 Map 1
                    Statistics: Num rows: 1 Data size: 206 Basic stats: 
COMPLETE Column stats: NONE
                    HybridGraceHashJoin: true
                    Filter Operator
                      predicate: (_col6 IS NOT DISTINCT FROM _col1) (type: 
boolean)
                      Statistics: Num rows: 1 Data size: 206 Basic stats: 
COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col5 (type: 
struct<writeid:bigint,bucketid:int,rowid:bigint>), _col0 (type: string), _col1 
(type: int)
                        outputColumnNames: _col0, _col1, _col2
                        Statistics: Num rows: 1 Data size: 206 Basic stats: 
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col0 (type: 
struct<writeid:bigint,bucketid:int,rowid:bigint>)
                          sort order: +
                          Map-reduce partition columns: UDFToInteger(_col0) 
(type: int)
                          Statistics: Num rows: 1 Data size: 206 Basic stats: 
COMPLETE Column stats: NONE
                          value expressions: _col1 (type: string), _col2 (type: 
int)
{noformat}

Result after JOIN will be :
{code:sql}
select s,j,i from ttarget right outer join tsource on i<=>j ;
NULL    NULL    NULL
NULL    NULL    2
{code}

On this resultset predicate {{(_col6 IS NOT DISTINCT FROM _col1)}} will be true 
for both resulting into both rows matching.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to