zhangshilong created YARN-7214:
----------------------------------
Summary: duplicated container completed To AM
Key: YARN-7214
URL: https://issues.apache.org/jira/browse/YARN-7214
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.0.0-alpha3, 2.7.1
Environment: hadoop 2.7.1 rm recovery and nm recovery enabled
Reporter: zhangshilong
env: Hadoop 2.7.1 with RM recovery and NM recovery enabled
case:
A Spark app (app1) is running at least one container (named c1) on NM1.
1. NM1 crashes, and the RM marks NM1 as expired after 10 minutes.
2. The RM removes all containers on NM1 (RMNodeImpl), and app1 receives a
completed message for c1. But the RM cannot send c1 (as a container to be
removed) to NM1, because NM1 is lost.
3. NM1 restarts and registers with the RM (c1 is in the register request), but
the RM still considers NM1 lost and does not handle the containers it reports.
4. NM1 does not include c1 in its heartbeat requests, so c1 is never removed
from NM1's context.
5. The RM restarts and NM1 re-registers with it. This time c1 is handled and
recovered, and the RM sends a c1 completed message to the AM of app1. So app1
receives the completion of c1 twice.
Each time the Spark AM receives a container completed message from the RM, it
allocates one new container, so the duplicate leads to an extra allocation.
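
The sequence above can be replayed with a small toy model. This is only a sketch and not YARN code: ToyRM, ToyAM, DedupAM and run_scenario are hypothetical names, the RM restart is simplified to "forget which nodes were lost", and the de-duplicating AM is just one possible guard on the AM side, not the fix chosen for this issue.

```python
# Toy model of the reported scenario. Not YARN code; all names are hypothetical.

class ToyAM:
    """AM that requests one replacement per completion message it receives."""

    def __init__(self):
        self.completed_events = []       # every completion message, in order
        self.replacements_requested = 0

    def on_container_completed(self, container_id):
        self.completed_events.append(container_id)
        self.replacements_requested += 1


class DedupAM(ToyAM):
    """Same AM, but ignores a completion for a container it has already seen."""

    def __init__(self):
        super().__init__()
        self._seen = set()

    def on_container_completed(self, container_id):
        if container_id in self._seen:
            return                       # duplicate: drop it
        self._seen.add(container_id)
        super().on_container_completed(container_id)


class ToyRM:
    def __init__(self, am):
        self.am = am
        self.lost_nodes = set()
        self.live_containers = {}        # node name -> set of container ids

    def expire_node(self, node):
        # Steps 1-2: the node expires and its containers are reported completed.
        for cid in sorted(self.live_containers.pop(node, set())):
            self.am.on_container_completed(cid)
        self.lost_nodes.add(node)

    def register_node(self, node, reported):
        # Step 3: containers reported by a node considered lost are ignored.
        if node in self.lost_nodes:
            return
        # Step 5 (simplified): reported containers are recovered and then
        # reported as completed to the AM again.
        for cid in sorted(reported):
            self.am.on_container_completed(cid)

    def restart(self):
        # Assumption of this model: a restarted RM no longer knows NM1 was lost.
        self.lost_nodes.clear()
        self.live_containers.clear()


def run_scenario(am):
    rm = ToyRM(am)
    rm.live_containers["NM1"] = {"c1"}
    nm1_context = {"c1"}                 # step 4: NM1 never drops c1
    rm.expire_node("NM1")                # steps 1-2: first completion of c1
    rm.register_node("NM1", nm1_context) # step 3: ignored, NM1 is lost
    rm.restart()                         # step 5: RM restarts
    rm.register_node("NM1", nm1_context) # step 5: second completion of c1
    return am
```

With ToyAM the scenario produces two completion events for c1 and therefore two replacement requests; with DedupAM the second event is dropped and only one replacement is requested.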
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)