Thank you for your reply. The following shows how I reproduced this issue.
1. Create the Hive tables
create external table test001
(
    user_id string,
    event string,
    content string
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION 'hdfs://your/hdfs/path/to/test001';

create table test002
(
    user_id string,
    event string,
    content string
)
stored as parquet;
2. Use the Python script below to generate the data. I executed it with Python
3.9.16 and it works well. The data is saved to a file called
"user_contents.csv".
# -*- coding: utf-8 -*-
import random
import string

if __name__ == '__main__':
    ALL_USERS_COUNT = 500000
    TOTAL_EVENT_COUNT = 30000000
    PAGE_VIEWS_USER_COUNT = 5000
    PAGE_VIEW_EVENT_COUNT = 18000
    ELEMENT_VIEW_USER_COUNT = 4900
    ELEMENT_VIEW_EVENT_COUNT = 20000
    FIDS_COUNT = 500
    MAX_CONTENT_FIDS_PER_USER = 5

    # generate user_ids with length 15 where all characters are digits
    user_ids = set()
    while len(user_ids) < ALL_USERS_COUNT:
        user_id = ''.join(random.choices(string.digits, k=15))
        user_ids.add(user_id)

    # generate PAGE_VIEW user_ids and one PAGE_VIEW event per row
    user_ids_list = list(user_ids)
    page_view_user_ids = random.sample(user_ids_list, PAGE_VIEWS_USER_COUNT)
    page_view_event_user_ids = page_view_user_ids.copy() + random.choices(
        page_view_user_ids, k=PAGE_VIEW_EVENT_COUNT - PAGE_VIEWS_USER_COUNT)
    user_pages = []
    for user_id in page_view_event_user_ids:
        page = 'PAGE' + ''.join(random.choices(string.digits, k=2))
        user_pages.append((user_id, 'PAGE_VIEW', page))

    # generate fids with length 6 where all characters are digits
    fids = set()
    while len(fids) < FIDS_COUNT:
        fid = ''.join(random.choices(string.digits, k=6))
        fids.add(fid)

    # pick user_ids from page_view_user_ids; each ELEMENT_VIEW row gets a content
    # field with a random number (at most MAX_CONTENT_FIDS_PER_USER) of fids
    element_view_user_ids = random.sample(page_view_user_ids, ELEMENT_VIEW_USER_COUNT)
    element_view_event_user_ids = element_view_user_ids.copy() + random.choices(
        element_view_user_ids, k=ELEMENT_VIEW_EVENT_COUNT - ELEMENT_VIEW_USER_COUNT)
    user_contents = []
    for user_id in element_view_event_user_ids:
        content_fids_count = random.randint(1, MAX_CONTENT_FIDS_PER_USER)
        content_fids = random.sample(list(fids), content_fids_count)
        user_contents.append((user_id, 'ELEMENT_VIEW', '|||'.join(content_fids)))

    # output to file
    with open('user_contents.csv', 'w') as f:
        for user_id, event_type, page in user_pages:
            f.write(f'{user_id},{event_type},{page}\n')
        for user_id, event_type, content in user_contents:
            f.write(f'{user_id},{event_type},{content}\n')
        # generate NO_PAGE_VIEW rows for randomly chosen user_ids
        no_event_user_ids = random.choices(user_ids_list, k=TOTAL_EVENT_COUNT)
        for user_id in no_event_user_ids:
            f.write(f'{user_id},NO_PAGE_VIEW,\n')

    print("Data generation completed. Output written to user_contents.csv")
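For reference, here is a minimal sanity-check sketch (not part of the original
repro) that counts the generated rows per event type directly from
"user_contents.csv". With the constants above it should report 18000 PAGE_VIEW
rows over 5000 distinct users, 20000 ELEMENT_VIEW rows over 4900 distinct
users, and 30000000 NO_PAGE_VIEW rows. It assumes the CSV layout written by the
generator script (user_id,event,content).

# sanity_check.py -- hypothetical helper, not part of the original repro
from collections import defaultdict

row_counts = defaultdict(int)
users_per_event = defaultdict(set)

with open('user_contents.csv') as f:
    for line in f:
        user_id, event, _content = line.rstrip('\n').split(',', 2)
        row_counts[event] += 1
        users_per_event[event].add(user_id)

for event in sorted(row_counts):
    print(event, row_counts[event], 'rows,',
          len(users_per_event[event]), 'distinct users')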
3. Upload the file "user_contents.csv" to the HDFS path
hdfs://your/hdfs/path/to/test001, so that its content can be read through table
test001.
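A minimal sketch of this upload step (an illustration, not the exact command I
ran): it simply shells out to the hdfs CLI, assuming the client is installed
and configured on the machine holding the CSV, and that the placeholder path
from the table DDL above is replaced with a real location.

# upload.py -- hypothetical helper; equivalent to running
#   hdfs dfs -put -f user_contents.csv hdfs://your/hdfs/path/to/test001/
# from a shell (assumes the 'hdfs' CLI is on PATH)
import subprocess

subprocess.run(
    ['hdfs', 'dfs', '-put', '-f', 'user_contents.csv',
     'hdfs://your/hdfs/path/to/test001/'],
    check=True,
)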
4. Insert data from table test001 into test002
insert overwrite table msnsmp.test002
select
    user_id,
    event,
    content
from msnsmp.test001;
5. Execute the SQL below:
select
    a.*, b.*
from (
    select t1.user_id,
           t1.page_exposure_count,
           nvl(t2.feed_view_count, 0) feed_view_count
    from (
        select user_id,
               count(case when event = 'PAGE_VIEW' then user_id end) page_exposure_count
        from msnsmp.test002
        group by user_id
    ) t1
    left join (
        select user_id,
               count(fid) feed_view_count
        from msnsmp.test002
        lateral view explode(split(content, '\\|\\|\\|')) y as fid
        where event = 'ELEMENT_VIEW'
        group by user_id
    ) t2
    on t1.user_id = t2.user_id
) a
join (
    select t1.user_id,
           t1.page_exposure_count,
           nvl(t2.feed_view_count, 0) feed_view_count
    from (
        select user_id,
               count(case when event = 'PAGE_VIEW' then user_id end) page_exposure_count
        from msnsmp.test002
        group by user_id
    ) t1
    left join (
        select user_id,
               count(distinct fid) feed_view_count
        from msnsmp.test002
        lateral view explode(split(content, '\\|\\|\\|')) y as fid
        where event = 'ELEMENT_VIEW'
        group by user_id
    ) t2
    on t1.user_id = t2.user_id
) b
on a.user_id = b.user_id
where a.feed_view_count > 0
  and b.feed_view_count = 0
limit 10;
And the result will be **LIKE** this (since the data is randomly generated, you
might get a different result):
+------------------+------------------------+--------------------+------------------+------------------------+--------------------+
| a.user_id        | a.page_exposure_count  | a.feed_view_count  | b.user_id        | b.page_exposure_count  | b.feed_view_count  |
+------------------+------------------------+--------------------+------------------+------------------------+--------------------+
| 000157341324733  | 4                      | 7                  | 000157341324733  | 4                      | 0                  |
| 000758717439188  | 4                      | 14                 | 000758717439188  | 4                      | 0                  |
| 001202231651357  | 10                     | 8                  | 001202231651357  | 10                     | 0                  |
| 001905965263999  | 3                      | 11                 | 001905965263999  | 3                      | 0                  |
| 002994634536432  | 4                      | 13                 | 002994634536432  | 4                      | 0                  |
| 003036606454101  | 2                      | 7                  | 003036606454101  | 2                      | 0                  |
| 003446282649967  | 3                      | 12                 | 003446282649967  | 3                      | 0                  |
| 003552152468449  | 2                      | 8                  | 003552152468449  | 2                      | 0                  |
| 004048659394058  | 5                      | 14                 | 004048659394058  | 5                      | 0                  |
| 004498224166623  | 4                      | 18                 | 004498224166623  | 4                      | 0                  |
+------------------+------------------------+--------------------+------------------+------------------------+--------------------+
Subqueries a and b are identical except that a uses count(fid) while b uses
count(distinct fid). For the same user, count(fid) > 0 implies
count(distinct fid) > 0, so the condition "a.feed_view_count>0 and
b.feed_view_count=0" should never match any row; the fact that rows are
returned means the result is **NOT** right.
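To cross-check this outside of Hive, here is a small sketch (again not part of
the original repro) that recomputes both aggregates per user straight from
"user_contents.csv", assuming the CSV layout written by the generator script
above. It should report 0 violating users, which is why the Hive result above
cannot be correct.

# check_invariant.py -- hypothetical helper; recomputes count(fid) and
# count(distinct fid) per user from the generated CSV and verifies that a
# positive total count always implies a positive distinct count
from collections import defaultdict

fid_total = defaultdict(int)      # per-user count(fid)
fid_distinct = defaultdict(set)   # per-user set of distinct fids

with open('user_contents.csv') as f:
    for line in f:
        user_id, event, content = line.rstrip('\n').split(',', 2)
        if event != 'ELEMENT_VIEW' or not content:
            continue
        fids = content.split('|||')
        fid_total[user_id] += len(fids)
        fid_distinct[user_id].update(fids)

violations = [u for u in fid_total
              if fid_total[u] > 0 and len(fid_distinct[u]) == 0]
print('users with count(fid) > 0 but count(distinct fid) = 0:', len(violations))
# expected output: 0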
My Hive Environment:
Hive version: 3.1.3
Hive engine: Tez 0.10.1
Hadoop version: 3.3.4
Deployment: BigTop 3.2.0
Java version: jdk1.8.0_351
Some potentially relevant parameters:
set hive.execution.engine=tez;
set hive.convert.join.bucket.mapjoin.tez=false;
set hive.tez.auto.reducer.parallelism=true;
set hive.tez.container.size=2048;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.reduce.side=true;
set hive.auto.convert.sortmerge.join.to.mapjoin=true;
I hope the above information is helpful to you.
Best regards
Original Message
From: Sungwoo Park <[email protected]>
Date: November 12, 2025, 13:04
To: 1046170148 <[email protected]>
Subject: Re: a question about lossing data during merge join using hive 3.1.3
Hi,
I am trying to reproduce your problem in Hive 3.1.3 and Hive 4.0.0.
I cannot execute some part of your query, and some part seems strange:
1. In 'insert overwrite', dt fields are not provided. I manually added them as
follows:
select '1234' user_id, 'PAGE_VIEW' event, 'P_PAGE_ONE' action, null
content, '20250707' dt union all
2. In your script, action is either 'P_PAGE_...' or 'E_PAGE_...'. However, your
query script uses 'PAGE_...'.
Could you check your script again? If you provide me with a working script, we
can check if your problem is fixed in our version of Hive 3.1.3, Hive 4.0.0,
and Hive 4.1.0.
Best regards,
--- Sungwoo
On Wed, Nov 12, 2025 at 12:24 PM Sungwoo Park <[email protected]> wrote:
Hello,
We actively maintain Hive 3.1.3 in our product Hive on MR3. I don't know if
Hive 3 on MR3 solves your problem, but we continue to backport patches of
Apache Hive back to Hive 3.1.3 on MR3. If you explain how to reproduce the
problem, we can also try to fix your problem. (One of our team members is an
Apache Hive committer.)
Hive on MR3 uses the execution engine MR3 instead of Tez, and runs much faster
than Hive on Tez. It is also easier to use. If interested, please see the
documentation and the introduction to our product:
https://mr3docs.datamonad.com/
https://datamonad.com/
If you want to try Hive on MR3, please email back to me, or talk to us at Slack:
https://join.slack.com/t/mr3-help/shared_invite/zt-1wpqztk35-AN8JRDznTkvxFIjtvhmiNg
(In the Slack channel, we have a few Chinese members, including Lisoda.)
Best regards,
--- Sungwoo