Joining Integer To String Can Produce Duplicates

Ryan C. Kleck Fri, 02 Feb 2018 14:00:07 -0800

Hey all,  I’m on Hive 1.2.2 at work and I found some unfavorable behavior on 
one of my joins and I wanted to see what you all think.  Below is an example:
https://github.secureserver.net/gist/rkleck/258a9a7b3dd3c915f94a53234e422a1a


WITH string_key_setup AS (
                SELECT                   CAST('1234  ' AS STRING)               
    AS my_key
                UNION ALL
                SELECT                   CAST('1234' AS STRING)                 
    AS my_key
)
, group_setup AS (
                SELECT                   my_key                                 
 AS my_key
                FROM                     string_key_setup
                GROUP BY             my_key
)
, string_key AS (
                SELECT                   CAST('1234' AS STRING)     AS my_key
)
, integer_key AS (
                SELECT                   CAST('1234' AS INT)                    
        AS my_key
)
SELECT                                   'String To String Join'                
          AS join_type
                                                                , COUNT(1)      
                      AS num_rows
FROM                                     string_key t1
JOIN                                        group_setup t2
ON                                                          t1.my_key = 
t2.my_key
UNION ALL
SELECT                                   'Integer To String Join'               
         AS join_type
                                                                , COUNT(1)      
                      AS num_rows
FROM                                     integer_key t1
JOIN                                        group_setup t2
ON                                                          t1.my_key = 
t2.my_key
;


This query returns the following:
join_type

num_rows

Integer To String Join

2

String To String Join

1



I feel it’s unfavorable because the GROUP BY is not TRIMming the extra space 
around the string, but when we do a join against an integer it does trim the 
space.  This query should not produce multiple rows.

In my opinion, there should be another check when comparing string to int to 
make sure the size of the string and integer are the same (so in this example 
the row with key ‘1234  ‘ will be filtered out).  Furthermore, the original 
query that produces these dupes has MANY more joins.  I could fix by 
CASTing/TRIMing, but it would require me to know all the data types for the 
columns in the tables involved in the join (and maybe casting a string to int 
will lose some rows and you can’t TRIM an INT).

Thoughts?



Ryan Kleck
Data Engineer IV
Advanced Analytics
480-505-8800 xt. 4024

This email message and any attachments hereto are intended for use only by its 
intended recipient(s) and may contain confidential information. If you have 
received this email in error, please immediately notify the sender and 
permanently delete the original and any copy of this message and its 
attachments.

Joining Integer To String Can Produce Duplicates

Reply via email to