[ 
https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802148#comment-13802148
 ] 

Jason Lowe commented on YARN-1336:
----------------------------------

Upgrading all of the nodes in a cluster for a rolling upgrade can be a very 
disruptive or lengthy process.  If the nodemanager is taken down then all 
active containers on that node are killed.  This is disruptive to jobs with 
long-running tasks, especially if one of the tasks ends up hitting this 
situation across multiple attempts.  An alternative would be a 
drain-decommission for nodes as proposed in YARN-914.  However, with long-running 
applications/tasks it can take a very long time to decommission a node, since we 
have to wait not only for the active containers to complete but also for active 
applications in general (e.g., the node still has to serve up map task data after 
a map task completes, so auxiliary services can have responsibilities beyond the 
active containers).  Performing a rolling upgrade on a large cluster will take 
a very long time if we need to wait for a clean drain-decommission of each node.

Therefore, it would be nice if the nodemanager supported a mode where it could 
be restarted and then recover its state.  This would include recovering active container 
state, tokens, localized resource cache state, etc.  We could then bounce the 
nodemanager to an updated version without losing containers and with minimal 
impact to jobs running on the grid, and the time to perform a rolling upgrade 
of a large cluster would no longer be tied to the running time of applications 
currently active on the cluster.

> Work-preserving nodemanager restart
> -----------------------------------
>
>                 Key: YARN-1336
>                 URL: https://issues.apache.org/jira/browse/YARN-1336
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>
> This serves as an umbrella ticket for tasks related to work-preserving 
> nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
