[ https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802148#comment-13802148 ]
Jason Lowe commented on YARN-1336:
----------------------------------
Upgrading all of the nodes in a cluster for a rolling upgrade can be a very
disruptive or lengthy process. If the nodemanager is taken down, then all
active containers on that node are killed. This is disruptive to jobs with
long-running tasks, especially if the same task ends up being killed this way
across multiple attempts. An alternative would be a drain-decommission for
nodes, as proposed in YARN-914. However, with long-running applications/tasks
it can take a very long time to decommission a node, as we have to wait not
only for the active containers to complete but also for the active
applications in general (e.g., the node still has to serve up map task data
after a map task completes, so auxiliary services can have responsibilities
beyond the active containers). Performing a rolling upgrade on a large cluster
will take a very long time if we need to wait for a clean drain-decommission
of each node.
Therefore, it would be nice if the nodemanager supported a mode where it could
be restarted and recover its state. This would include recovering active container
state, tokens, localized resource cache state, etc. We could then bounce the
nodemanager to an updated version without losing containers and with minimal
impact to jobs running on the grid, and the time to perform a rolling upgrade
of a large cluster would no longer be tied to the running time of applications
currently active on the cluster.
> Work-preserving nodemanager restart
> -----------------------------------
>
> Key: YARN-1336
> URL: https://issues.apache.org/jira/browse/YARN-1336
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
>
> This serves as an umbrella ticket for tasks related to work-preserving
> nodemanager restart.
--
This message was sent by Atlassian JIRA
(v6.1#6144)