zhangshilong created YARN-7214:
----------------------------------

             Summary: duplicated container completed To AM
                 Key: YARN-7214
                 URL: https://issues.apache.org/jira/browse/YARN-7214
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.0.0-alpha3, 2.7.1
         Environment: Hadoop 2.7.1 with RM recovery and NM recovery enabled
            Reporter: zhangshilong


Environment: Hadoop 2.7.1 with RM recovery and NM recovery enabled.
Case:
 A Spark app (app1) is running at least one container (named c1) on NM1.
 1. NM1 crashes, and RM marks NM1 as expired after 10 minutes.
 2. RM removes all containers on NM1 (RMNodeImpl), and app1 receives a c1 
completed message. But RM cannot send c1 (as a container to be removed) to NM1 
because NM1 is lost.
 3. NM1 restarts and registers with RM (c1 is in the register request), but RM 
finds that NM1 is lost and does not handle the containers reported by NM1.
 4. NM1 does not include c1 in its heartbeat requests, so c1 is never removed 
from the context of NM1.
 5. RM restarts and NM1 re-registers with RM. This time c1 is handled and 
recovered, and RM sends a c1 completed message to the AM of app1. So app1 
receives the c1 completion twice. Each time the Spark AM receives a container 
completed message from RM, it allocates one new container.
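
The sequence above can be replayed with a small toy model. This is only a 
sketch for illustration, not YARN code: ToyAM, ToyRM, and replay are 
hypothetical names, and the model assumes that the RM keeps its lost-node 
record and its knowledge of the earlier c1 completion only in memory, so both 
are gone after the RM restart.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAM:
    """Hypothetical AM that records every completion message it receives."""
    completed: list = field(default_factory=list)

    def on_container_completed(self, container_id):
        self.completed.append(container_id)


class ToyRM:
    """Hypothetical RM with in-memory state only (an assumption, not YARN code)."""

    def __init__(self, am, running=()):
        self.am = am
        self.running = set(running)   # containers the RM believes are running
        self.lost_nodes = set()       # nodes the RM has marked as lost

    def expire_node(self, node, containers):
        # Steps 1-2: the node expires and its containers are reported completed.
        self.lost_nodes.add(node)
        for cid in containers:
            if cid in self.running:
                self.running.discard(cid)
                self.am.on_container_completed(cid)

    def register_node(self, node, reported):
        if node in self.lost_nodes:
            # Step 3: containers reported by a lost node are not handled.
            return
        # Step 5: a reported container the RM does not track as running is
        # recovered and reported to the AM as completed.
        for cid in reported:
            if cid not in self.running:
                self.am.on_container_completed(cid)

    def restart(self):
        # Assumption: the lost-node record and the earlier completion of c1
        # are not retained across the restart.
        return ToyRM(self.am)


def replay():
    am = ToyAM()
    rm = ToyRM(am, running={"c1"})
    rm.expire_node("NM1", ["c1"])     # steps 1-2: first completion sent
    rm.register_node("NM1", ["c1"])   # step 3: ignored, NM1 is lost
    # Step 4: c1 stays in NM1's context, so NM1 reports it again later.
    rm = rm.restart()                 # step 5: RM restarts
    rm.register_node("NM1", ["c1"])   # step 5: second completion sent
    return am.completed
```

Under these assumptions, replay() returns a list that contains c1 twice, 
which matches the duplicated completion that app1 receives in step 5.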




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

