[ https://issues.apache.org/jira/browse/YARN-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
rangjiaheng updated YARN-7377:
------------------------------
    Description: 
Case:

A Spark streaming application named app1 has been running on YARN for a long time; app1 has *3 containers* in total, and one of them, named c1, runs on a NM named nm1.

1. The NM nm1 is lost for some reason, but the containers on it keep running well.
2. 10 minutes later, the RM expires nm1 because no heartbeats have been received, so the RM tells app1's AM that a container of app1 failed because the NM was lost; app1's AM kills that container through RPC and then requests a new container, named c2, from the RM, which is a duplicate of c1.
3. The administrator notices nm1 is lost and restarts it; since NM recovery is enabled, the NM restores all of its containers, including c1, but c1's status is now 'DONE'. A bug here: this NM will list this container in its web UI forever.
4. The RM restarts for some reason; since RM recovery is enabled, the RM restores all apps, including app1, and all NMs must re-register with the RM. However, when nm1 re-registers, the RM finds that container c1's status is DONE, so the RM tells app1's AM that a container of app1 has completed; since a Spark streaming application has a fixed number of containers, the AM requests a new container, named c3, from the RM, which is again a duplicate of c1.

Now app1 has *4 containers* in total, while *c2 and c3 are duplicates*, both replacing c1.
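The scenario above depends on both recovery features being enabled (see the Environment field below). As a minimal sketch, these are the two switches involved, set programmatically here for illustration rather than in yarn-site.xml:

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RecoverySetup {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // RM recovery: on restart the RM restores running apps such as app1
    // and all NMs must re-register (step 4 above).
    conf.setBoolean(YarnConfiguration.RECOVERY_ENABLED, true);
    // NM recovery: a restarted NM restores its container state, which is
    // how nm1 comes back reporting c1 with status DONE (step 3 above).
    conf.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true);
  }
}
{code}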
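From the AM's point of view, the duplication happens because it receives two completion events for the same ContainerId c1 (once when the RM expires nm1, and once more when nm1 re-registers after the RM restart) and requests a replacement both times. A minimal sketch of idempotent completion handling, assuming an AMRMClientAsync-based AM; the class name, resource sizing, and priority are hypothetical, and this is not the actual Spark AM code:

{code:java}
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

public class DedupingCompletionHandler {

  // Container IDs for which a replacement has already been requested.
  // The RM may report the same container (c1) as complete twice: once
  // when nm1 expires, and again when nm1 re-registers after RM restart.
  private final Set<ContainerId> replaced = ConcurrentHashMap.newKeySet();

  private final AMRMClientAsync<ContainerRequest> amRMClient;

  public DedupingCompletionHandler(AMRMClientAsync<ContainerRequest> client) {
    this.amRMClient = client;
  }

  // Called from the AM's CallbackHandler.onContainersCompleted().
  public void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      ContainerId id = status.getContainerId();
      // Set.add() returns false for an already-seen id, so at most one
      // replacement (c2 OR c3, never both) is requested for c1.
      if (replaced.add(id)) {
        amRMClient.addContainerRequest(new ContainerRequest(
            Resource.newInstance(1024, 1),      // hypothetical sizing
            null, null, Priority.newInstance(0)));
      }
    }
  }
}
{code}

With such a guard the second DONE report for c1 would be ignored, so only one of c2/c3 is ever requested; this only illustrates the AM-visible symptom, while the underlying problem is in how the RM and NM re-report recovered containers.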
> Duplicate Containers allocated for Long-Running Application after NM lost and restart and RM restart
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7377
>                 URL: https://issues.apache.org/jira/browse/YARN-7377
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications, nodemanager, RM, yarn
>    Affects Versions: 3.0.0-alpha3
>        Environment: Hadoop 2.7.1, RM recovery and NM recovery enabled;
>                     Spark streaming application, a long-running application on yarn
>            Reporter: rangjiaheng
>              Labels: patch