Aditya Shah created HIVE-22561:
----------------------------------

             Summary: Data loss on map join for bucketed, partitioned table
                 Key: HIVE-22561
                 URL: https://issues.apache.org/jira/browse/HIVE-22561
             Project: Hive
          Issue Type: Bug
    Affects Versions: 3.1.2
            Reporter: Aditya Shah
         Attachments: Screenshot 2019-11-28 at 8.45.17 PM.png, 
image-2019-11-28-20-46-25-432.png

A map join on a column (which is neither involved in bucketing and partition) 
causes data loss. 

Steps to reproduce:

Env: [hive-dev-box|[https://github.com/kgyrtkirk/hive-dev-box]] hive 3.1.2.

Create tables:

 
{code:java}
CREATE TABLE `testj2`(
  `id` int, 
  `bn` string, 
  `cn` string, 
  `ad` map<string,int>, 
  `mi` array<int>)
PARTITIONED BY ( 
  `br` string)
CLUSTERED BY ( 
  bn) 
INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES (
  'bucketing_version'='2');

CREATE TABLE `testj1`(
  `id` int, 
  `can` string, 
  `cn` string, 
  `ad` map<string,int>, 
  `av` boolean, 
  `mi` array<int>)
PARTITIONED BY ( 
  `brand` string)
CLUSTERED BY ( 
  can) 
INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES (
  'bucketing_version'='2');
{code}
insert some data in both:
{code:java}
insert into testj1 values (100, 'mes_1', 'customer_1',  map('city1', 560077), 
false, array(5, 10), 'brand_1'),
(101, 'mes_2', 'customer_2',  map('city2', 560078), true, array(10, 20), 
'brand_2'),
(102, 'mes_3', 'customer_3',  map('city3', 560079), false, array(15, 30), 
'brand_3'),
(103, 'mes_4', 'customer_4',  map('city4', 560080), true, array(20, 40), 
'brand_4'),
(104, 'mes_5', 'customer_5',  map('city5', 560081), false, array(25, 50), 
'brand_5');

insert into table testj2 values (100, 'tv_0', 'customer_0', map('city0', 
560076),array(0, 0, 0), 'tv'),
(101, 'tv_1', 'customer_1', map('city1', 560077),array(20, 25, 30), 'tv'),
(102, 'tv_2', 'customer_2', map('city2', 560078),array(40, 50, 60), 'tv'),
(103, 'tv_3', 'customer_3', map('city3', 560079),array(60, 75, 90), 'tv'),
(104, 'tv_4', 'customer_4', map('city4', 560080),array(80, 100, 120), 'tv');
{code}
Do a join between them:
{code:java}
select t1.id, t1.can, t1.cn, t2.bn,t2.ad, t2.br FROM testj1 t1 JOIN testj2 t2 
on (t1.id = t2.id) order by t1.id;
{code}
Observed results:

!image-2019-11-28-20-46-25-432.png|width=524,height=100!

In the plan, I can see a map join. Disabling it gives the correct result.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to