[ https://issues.apache.org/jira/browse/YARN-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
rangjiaheng updated YARN-7377:
------------------------------
    Description: 
Case:

A Spark streaming application named app1 has been running on YARN for a long time; app1 has *3 containers* in total, and one of them, named c1, runs on a NM named nm1.

1. The NM nm1 is lost for some reason, but the containers on it keep running well.
2. 10 minutes later, the RM expires nm1 because no heartbeats have been received, so the RM tells app1's AM that a container of app1 failed because the NM was lost; app1's AM kills that container through RPC and then requests a new container, named c2, from the RM, which is a duplicate of c1.
3. The administrator notices nm1 is lost and restarts it; since NM recovery is enabled, the NM restores all of its containers, including c1, but c1's status is now 'DONE'. A bug here: this NM will list this container in its web UI forever.
4. The RM restarts for some reason; since RM recovery is enabled, the RM restores all apps, including app1, and all NMs must re-register with the RM. However, when nm1 re-registers, the RM finds that container c1's status is DONE, so the RM tells app1's AM that a container of app1 has completed; since a Spark streaming application has a fixed number of containers, the AM requests a new container, named c3, from the RM, which is again a duplicate of c1.

Now app1 has *4 containers* in total, while *c2 and c3 are duplicates*, both replacing c1.
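The scenario above depends on both recovery features being enabled (see the Environment field below). As a minimal sketch, these are the two switches involved, set programmatically here for illustration rather than in yarn-site.xml:

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RecoverySetup {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // RM recovery: on restart the RM restores running apps such as app1
    // and all NMs must re-register (step 4 above).
    conf.setBoolean(YarnConfiguration.RECOVERY_ENABLED, true);
    // NM recovery: a restarted NM restores its container state, which is
    // how nm1 comes back reporting c1 with status DONE (step 3 above).
    conf.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true);
  }
}
{code}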
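From the AM's point of view, the duplication happens because it receives two completion events for the same ContainerId c1 (once when the RM expires nm1, and once more when nm1 re-registers after the RM restart) and requests a replacement both times. A minimal sketch of idempotent completion handling, assuming an AMRMClientAsync-based AM; the class name, resource sizing, and priority are hypothetical, and this is not the actual Spark AM code:

{code:java}
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

public class DedupingCompletionHandler {

  // Container IDs for which a replacement has already been requested.
  // The RM may report the same container (c1) as complete twice: once
  // when nm1 expires, and again when nm1 re-registers after RM restart.
  private final Set<ContainerId> replaced = ConcurrentHashMap.newKeySet();

  private final AMRMClientAsync<ContainerRequest> amRMClient;

  public DedupingCompletionHandler(AMRMClientAsync<ContainerRequest> client) {
    this.amRMClient = client;
  }

  // Called from the AM's CallbackHandler.onContainersCompleted().
  public void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      ContainerId id = status.getContainerId();
      // Set.add() returns false for an already-seen id, so at most one
      // replacement (c2 OR c3, never both) is requested for c1.
      if (replaced.add(id)) {
        amRMClient.addContainerRequest(new ContainerRequest(
            Resource.newInstance(1024, 1),      // hypothetical sizing
            null, null, Priority.newInstance(0)));
      }
    }
  }
}
{code}

With such a guard the second DONE report for c1 would be ignored, so only one of c2/c3 is ever requested; this only illustrates the AM-visible symptom, while the underlying problem is in how the RM and NM re-report recovered containers.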
> Duplicate Containers allocated for Long-Running Application after NM lost and restart and RM restart
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7377
>                 URL: https://issues.apache.org/jira/browse/YARN-7377
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications, nodemanager, RM, yarn
>    Affects Versions: 3.0.0-alpha3
>        Environment: Hadoop 2.7.1, RM recovery and NM recovery enabled;
>                     Spark streaming application, a long-running application on yarn
>            Reporter: rangjiaheng
>              Labels: patch