zhangshilong created YARN-7214:
----------------------------------

             Summary: duplicated container completed To AM
                 Key: YARN-7214
                 URL: https://issues.apache.org/jira/browse/YARN-7214
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.0.0-alpha3, 2.7.1
         Environment: hadoop 2.7.1
                      rm recovery and nm recovery enabled
            Reporter: zhangshilong

env: Hadoop 2.7.1 with RM recovery and NM recovery enabled.

case: A Spark app (app1) is running at least one container (c1) on NM1.

1. NM1 crashes, and the RM marks NM1 as expired after 10 minutes.
2. The RM removes all containers on NM1 (RMNodeImpl), and app1 receives a completed message for c1. However, the RM cannot send c1 (as a container to be removed) to NM1, because NM1 is lost.
3. NM1 restarts and registers with the RM, with c1 in the register request. The RM still considers NM1 lost, so it does not handle the containers reported by NM1.
4. NM1 does not include c1 in its heartbeat requests, so c1 is never removed from NM1's context.
5. The RM restarts and NM1 re-registers with it. This time c1 is handled and recovered, and the RM sends a c1 completed message to the AM of app1.

As a result, app1 receives the completed message for c1 twice. Each time the Spark AM receives a container-completed message from the RM, it allocates one new container, so the duplicate message leads to an extra allocation.
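
To make the effect of the duplicate concrete, here is a minimal, self-contained Java sketch that replays the two completion reports for c1 from the steps above. It is only an illustration under stated assumptions: `DuplicateCompletionSketch`, `NaiveAppMaster`, `DedupAppMaster` and `replayScenario` are hypothetical names invented here and are not YARN or Spark classes, and the de-duplication shown is one possible guard, not the fix adopted for this issue.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these classes exist in YARN or Spark.
class DuplicateCompletionSketch {

    /** Stand-in for an AM that requests a replacement for every completion it is told about. */
    static class NaiveAppMaster {
        final List<String> replacementRequests = new ArrayList<>();

        void onContainerCompleted(String containerId) {
            replacementRequests.add("replacement-for-" + containerId);
        }
    }

    /** Same AM, but it remembers which completions it has already handled. */
    static class DedupAppMaster extends NaiveAppMaster {
        private final Set<String> seenCompleted = new HashSet<>();

        @Override
        void onContainerCompleted(String containerId) {
            // Set.add returns false when the ID was already present,
            // so a repeated report for the same container is ignored.
            if (seenCompleted.add(containerId)) {
                super.onContainerCompleted(containerId);
            }
        }
    }

    /** Replays the two completion reports for c1 described in the steps above. */
    static void replayScenario(NaiveAppMaster am) {
        am.onContainerCompleted("c1"); // step 2: RM expires NM1 and reports c1 completed
        am.onContainerCompleted("c1"); // step 5: RM restarts, recovers c1 from NM1, reports it again
    }

    public static void main(String[] args) {
        NaiveAppMaster naive = new NaiveAppMaster();
        replayScenario(naive);
        System.out.println("naive AM replacement requests: " + naive.replacementRequests.size());

        DedupAppMaster dedup = new DedupAppMaster();
        replayScenario(dedup);
        System.out.println("dedup AM replacement requests: " + dedup.replacementRequests.size());
    }
}
```

In this sketch the naive AM ends up with two replacement requests for a single lost container, while the de-duplicating AM ends up with one. Whether such a guard belongs in the AM or in the RM's recovery path is a design question this sketch does not settle.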