Thank you for your reply. The following shows how I reproduced this issue.
1. Create the Hive tables
create external table test001
(
    user_id string,
    event string,
    content string
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION 'hdfs://your/hdfs/path/to/test001';

create table test002
(
    user_id string,
    event string,
    content string
)
stored as parquet;
2. Use the Python script below to generate the data. I executed it with Python
3.9.16 and it works well. The data is saved to a file called
"user_contents.csv".
# -*- coding: utf-8 -*-
import random
import string

if __name__ == '__main__':
    ALL_USERS_COUNT = 500000
    TOTAL_EVENT_COUNT = 30000000
    PAGE_VIEWS_USER_COUNT = 5000
    PAGE_VIEW_EVENT_COUNT = 18000
    ELEMENT_VIEW_USER_COUNT = 4900
    ELEMENT_VIEW_EVENT_COUNT = 20000
    FIDS_COUNT = 500
    MAX_CONTENT_FIDS_PER_USER = 5

    # generate user_ids with length 15 where all characters are digits
    user_ids = set()
    while len(user_ids) < ALL_USERS_COUNT:
        user_id = ''.join(random.choices(string.digits, k=15))
        user_ids.add(user_id)

    # generate PAGE_VIEW user_ids and one PAGE_VIEW event per row
    user_ids_list = list(user_ids)
    page_view_user_ids = random.sample(user_ids_list, PAGE_VIEWS_USER_COUNT)
    page_view_event_user_ids = page_view_user_ids.copy() + random.choices(
        page_view_user_ids, k=PAGE_VIEW_EVENT_COUNT - PAGE_VIEWS_USER_COUNT)
    user_pages = []
    for user_id in page_view_event_user_ids:
        page = 'PAGE' + ''.join(random.choices(string.digits, k=2))
        user_pages.append((user_id, 'PAGE_VIEW', page))

    # generate fids with length 6 where all characters are digits
    fids = set()
    while len(fids) < FIDS_COUNT:
        fid = ''.join(random.choices(string.digits, k=6))
        fids.add(fid)

    # pick user_ids from page_view_user_ids; each ELEMENT_VIEW row gets a content
    # field with a random number (at most MAX_CONTENT_FIDS_PER_USER) of fids
    element_view_user_ids = random.sample(page_view_user_ids, ELEMENT_VIEW_USER_COUNT)
    element_view_event_user_ids = element_view_user_ids.copy() + random.choices(
        element_view_user_ids, k=ELEMENT_VIEW_EVENT_COUNT - ELEMENT_VIEW_USER_COUNT)
    user_contents = []
    for user_id in element_view_event_user_ids:
        content_fids_count = random.randint(1, MAX_CONTENT_FIDS_PER_USER)
        content_fids = random.sample(list(fids), content_fids_count)
        user_contents.append((user_id, 'ELEMENT_VIEW', '|||'.join(content_fids)))

    # output to file
    with open('user_contents.csv', 'w') as f:
        for user_id, event_type, page in user_pages:
            f.write(f'{user_id},{event_type},{page}\n')
        for user_id, event_type, content in user_contents:
            f.write(f'{user_id},{event_type},{content}\n')
        # generate NO_PAGE_VIEW rows for randomly chosen user_ids
        no_event_user_ids = random.choices(user_ids_list, k=TOTAL_EVENT_COUNT)
        for user_id in no_event_user_ids:
            f.write(f'{user_id},NO_PAGE_VIEW,\n')

    print("Data generation completed. Output written to user_contents.csv")
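For reference, here is a minimal sanity-check sketch (not part of the original
repro) that counts the generated rows per event type directly from
"user_contents.csv". With the constants above it should report 18000 PAGE_VIEW
rows over 5000 distinct users, 20000 ELEMENT_VIEW rows over 4900 distinct
users, and 30000000 NO_PAGE_VIEW rows. It assumes the CSV layout written by the
generator script (user_id,event,content).

# sanity_check.py -- hypothetical helper, not part of the original repro
from collections import defaultdict

row_counts = defaultdict(int)
users_per_event = defaultdict(set)

with open('user_contents.csv') as f:
    for line in f:
        user_id, event, _content = line.rstrip('\n').split(',', 2)
        row_counts[event] += 1
        users_per_event[event].add(user_id)

for event in sorted(row_counts):
    print(event, row_counts[event], 'rows,',
          len(users_per_event[event]), 'distinct users')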
3. Upload the file "user_contents.csv" to the HDFS path
hdfs://your/hdfs/path/to/test001, so that its content can be read through table
test001.
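A minimal sketch of this upload step (an illustration, not the exact command I
ran): it simply shells out to the hdfs CLI, assuming the client is installed
and configured on the machine holding the CSV, and that the placeholder path
from the table DDL above is replaced with a real location.

# upload.py -- hypothetical helper; equivalent to running
#   hdfs dfs -put -f user_contents.csv hdfs://your/hdfs/path/to/test001/
# from a shell (assumes the 'hdfs' CLI is on PATH)
import subprocess

subprocess.run(
    ['hdfs', 'dfs', '-put', '-f', 'user_contents.csv',
     'hdfs://your/hdfs/path/to/test001/'],
    check=True,
)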
4. Insert data from table test001 into test002
insert overwrite table msnsmp.test002
select
    user_id,
    event,
    content
from msnsmp.test001;
5. Execute the SQL below:
select
    a.*, b.*
from (
    select t1.user_id,
           t1.page_exposure_count,
           nvl(t2.feed_view_count, 0) feed_view_count
    from (
        select user_id,
               count(case when event = 'PAGE_VIEW' then user_id end) page_exposure_count
        from msnsmp.test002
        group by user_id
    ) t1
    left join (
        select user_id,
               count(fid) feed_view_count
        from msnsmp.test002
        lateral view explode(split(content, '\\|\\|\\|')) y as fid
        where event = 'ELEMENT_VIEW'
        group by user_id
    ) t2
    on t1.user_id = t2.user_id
) a
join (
    select t1.user_id,
           t1.page_exposure_count,
           nvl(t2.feed_view_count, 0) feed_view_count
    from (
        select user_id,
               count(case when event = 'PAGE_VIEW' then user_id end) page_exposure_count
        from msnsmp.test002
        group by user_id
    ) t1
    left join (
        select user_id,
               count(distinct fid) feed_view_count
        from msnsmp.test002
        lateral view explode(split(content, '\\|\\|\\|')) y as fid
        where event = 'ELEMENT_VIEW'
        group by user_id
    ) t2
    on t1.user_id = t2.user_id
) b
on a.user_id = b.user_id
where a.feed_view_count > 0
  and b.feed_view_count = 0
limit 10;
And the result will be **LIKE** this (since the data is randomly generated, you
might get a different result):
+------------------+------------------------+--------------------+------------------+------------------------+--------------------+
| a.user_id        | a.page_exposure_count  | a.feed_view_count  | b.user_id        | b.page_exposure_count  | b.feed_view_count  |
+------------------+------------------------+--------------------+------------------+------------------------+--------------------+
| 000157341324733  | 4                      | 7                  | 000157341324733  | 4                      | 0                  |
| 000758717439188  | 4                      | 14                 | 000758717439188  | 4                      | 0                  |
| 001202231651357  | 10                     | 8                  | 001202231651357  | 10                     | 0                  |
| 001905965263999  | 3                      | 11                 | 001905965263999  | 3                      | 0                  |
| 002994634536432  | 4                      | 13                 | 002994634536432  | 4                      | 0                  |
| 003036606454101  | 2                      | 7                  | 003036606454101  | 2                      | 0                  |
| 003446282649967  | 3                      | 12                 | 003446282649967  | 3                      | 0                  |
| 003552152468449  | 2                      | 8                  | 003552152468449  | 2                      | 0                  |
| 004048659394058  | 5                      | 14                 | 004048659394058  | 5                      | 0                  |
| 004498224166623  | 4                      | 18                 | 004498224166623  | 4                      | 0                  |
+------------------+------------------------+--------------------+------------------+------------------------+--------------------+
Subqueries a and b are identical except that a uses count(fid) while b uses
count(distinct fid). For the same user, count(fid) > 0 implies
count(distinct fid) > 0, so the condition "a.feed_view_count>0 and
b.feed_view_count=0" should never match any row; the fact that rows are
returned means the result is **NOT** right.
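To cross-check this outside of Hive, here is a small sketch (again not part of
the original repro) that recomputes both aggregates per user straight from
"user_contents.csv", assuming the CSV layout written by the generator script
above. It should report 0 violating users, which is why the Hive result above
cannot be correct.

# check_invariant.py -- hypothetical helper; recomputes count(fid) and
# count(distinct fid) per user from the generated CSV and verifies that a
# positive total count always implies a positive distinct count
from collections import defaultdict

fid_total = defaultdict(int)      # per-user count(fid)
fid_distinct = defaultdict(set)   # per-user set of distinct fids

with open('user_contents.csv') as f:
    for line in f:
        user_id, event, content = line.rstrip('\n').split(',', 2)
        if event != 'ELEMENT_VIEW' or not content:
            continue
        fids = content.split('|||')
        fid_total[user_id] += len(fids)
        fid_distinct[user_id].update(fids)

violations = [u for u in fid_total
              if fid_total[u] > 0 and len(fid_distinct[u]) == 0]
print('users with count(fid) > 0 but count(distinct fid) = 0:', len(violations))
# expected output: 0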
My Hive Environment:
Hive version: 3.1.3
Hive engine: Tez 0.10.1
Hadoop version: 3.3.4
Deployment: BigTop 3.2.0
Java version: jdk1.8.0_351
Some potentially relevant parameters:
set hive.execution.engine=tez;
set hive.convert.join.bucket.mapjoin.tez=false;
set hive.tez.auto.reducer.parallelism=true;
set hive.tez.container.size=2048;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.reduce.side=true;
set hive.auto.convert.sortmerge.join.to.mapjoin=true;
I hope the above information is helpful to you.
Best regards
Original Message
From: Sungwoo Park <[email protected]>
Date: November 12, 2025, 13:04
To: 1046170148 <[email protected]>
Subject: Re: a question about lossing data during merge join using hive 3.1.3
Hi,
I am trying to reproduce your problem in Hive 3.1.3 and Hive 4.0.0.
I cannot execute some part of your query, and some part seems strange:
1. In 'insert overwrite', dt fields are not provided. I manually added them as
follows:
select '1234' user_id, 'PAGE_VIEW' event, 'P_PAGE_ONE' action, null
content, '20250707' dt union all
2. In your script, action is either 'P_PAGE_...' or 'E_PAGE_...'. However, your
query script uses 'PAGE_...'.
Could you check your script again? If you provide me with a working script, we
can check if your problem is fixed in our version of Hive 3.1.3, Hive 4.0.0,
and Hive 4.1.0.
Best regards,
--- Sungwoo
On Wed, Nov 12, 2025 at 12:24 PM Sungwoo Park <[email protected]> wrote:
Hello,
We actively maintain Hive 3.1.3 in our product Hive on MR3. I don't know if
Hive 3 on MR3 solves your problem, but we continue to backport patches of
Apache Hive back to Hive 3.1.3 on MR3. If you explain how to reproduce the
problem, we can also try to fix your problem. (One of our team members is an
Apache Hive committer.)
Hive on MR3 uses the execution engine MR3 instead of Tez, and runs much faster
than Hive on Tez. It is also easier to use. If interested, please see the
documentation and the introduction to our product:
https://mr3docs.datamonad.com/
https://datamonad.com/
If you want to try Hive on MR3, please email back to me, or talk to us at Slack:
https://join.slack.com/t/mr3-help/shared_invite/zt-1wpqztk35-AN8JRDznTkvxFIjtvhmiNg
(In the Slack channel, we have a few Chinese members, including Lisoda.)
Best regards,
--- Sungwoo