[ 
https://issues.apache.org/jira/browse/YARN-73?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-73.
-----------------------------------------

    Resolution: Duplicate

With YARN-495 in, we changed NM reboot behaviour to be a simple resync - kill 
all containers and re-register with RM.

So in sum, YARN-72 cleans up containers on shutdown, YARN-495 does so on 
resync. 

There is still a case where an operator issues a shutdown but 
NM_SLEEP_DELAY_BEFORE_SIGKILL_MS + NM_PROCESS_KILL_WAIT_MS + 
SHUTDOWN_CLEANUP_SLOP_MS is not enough time to clean up all containers. We can 
make the latter configurable or can mandate that operators kill containers 
explicitly in that case.
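
To make the timing concern concrete, here is a minimal sketch of a shutdown wait bounded by the sum of the three delays named above. It is not the actual NodeManager code: the class ShutdownBudget, the LiveContainerCounter callback, and both method names are hypothetical, and any millisecond values passed in are the caller's assumptions rather than Hadoop defaults.

```java
// Hypothetical sketch only; not taken from the NodeManager source.
final class ShutdownBudget {

    /** Hypothetical callback that reports how many containers are still alive. */
    interface LiveContainerCounter {
        int count();
    }

    /** Total wait on shutdown: the sum of the three delays. */
    static long waitBudgetMs(long sleepDelayBeforeSigkillMs,
                             long processKillWaitMs,
                             long cleanupSlopMs) {
        return sleepDelayBeforeSigkillMs + processKillWaitMs + cleanupSlopMs;
    }

    /**
     * Polls until no containers remain or the budget is exhausted.
     * Returns true if cleanup finished in time, false if the budget ran out
     * (or the waiting thread was interrupted) while containers were still alive.
     */
    static boolean awaitCleanup(LiveContainerCounter counter, long budgetMs, long pollMs) {
        long deadline = System.nanoTime() + budgetMs * 1_000_000L;
        while (counter.count() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // budget exhausted, containers may be left behind
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }
}
```

A false return from awaitCleanup corresponds to the case described above: the fixed budget expired before every container exited, so either the slop has to grow (for example by making it configurable) or the operator has to kill the remaining containers by hand.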

Closing this as a duplicate.
                
> nodemanager should cleanup running containers when it starts
> ------------------------------------------------------------
>
>                 Key: YARN-73
>                 URL: https://issues.apache.org/jira/browse/YARN-73
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>            Reporter: Thomas Graves
>
> Currently the nodemanager doesn't clean up running containers when it gets 
> restarted. This can cause containers to get lost and stick around forever. 
> We've seen this happen multiple times when the RM is restarted. When the RM 
> is brought back up, it doesn't know what was running on the cluster, so it 
> tells the NMs to reboot, and when an NM reboots it loses track of what it had 
> running. If any containers are behaving badly, there is no one left that 
> knows about them to kill them.
> We should kill any running containers when the nodemanager is being started.  
> Note that when the NM is being brought up it needs to somehow figure out what 
> containers were running and be sure it doesn't kill anything it shouldn't.
> Note, we should also try to kill any running containers when the node manager 
> is shutting down (jira 4213 was filed for this).
> This might change a bit when RM restart is implemented if tasks can actually 
> survive across RM/NM being rebooted, but that can be addressed at that point.
