[
https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808334#comment-13808334
]
Jason Lowe commented on YARN-1336:
----------------------------------
bq. do you have some writeup about the workflow of work-preserving NM restart?
Not yet. I'll try to cover the high points below in the interim.
bq. According to the current sub tasks, I can see that we need a NMStateStore
Rather than explicitly call that out as a subtask I was expecting that would be
organically grown and extended as the persistence and restore for each piece of
context was added. Having a subtask for doing the entire state store didn't
make as much sense since the people working on persisting/restoring the
separate context pieces will know best what they require of the state store.
Otherwise it seems like the state store subtask is 80% of the work. :-)
bq. Beyond this, how does NM contact RM and AM about its reserved work?
We don't currently see any need for the NM to contact the AM about anything
related to restart. The whole point of the restart is to be as transparent as
possible to the rest of the system -- as if the restart never occurred. As for
the RM, we do need a small change where it no longer assumes all containers
have been killed when a NM registers redundantly (i.e.: RM has not yet expired
the node yet it is registering again). That should be covered in the container
state recovery subtask or we can create an explicit separate subtask for that.
bq. How do we distinguish NM restart and shutdown?
That is a good question and something we need to determine. I'll file a
separate subtask to track it. The current thinking is that if recovery is
enabled for the NM then it will persist its state as it changes and support
restarting/rejoining the cluster without containers being lost if it can do so
before the RM notices (i.e.: expires the NM). Then we could use this feature
not only for rolling upgrades but also for running the NM under a supervisor
that can restart it if it crashes without catastrophic loss to the workload
running on that node at the time. This requires decommission (i.e.: a shutdown
where the NM is not coming back anytime soon) to be a separate, explicit
command sent to the NM (either via special signal or admin RPC call) so the NM
knows to cleanup containers and directories since it will not be coming back
anytime soon.
At a high level, the NM performs these tasks during restart:
* Recover last known state (containers, distcache contents, active tokens,
pending deletions, etc.)
* Reacquire active containers and noting containers that have exited in the
interim since the last known state.
* Re-register with the RM. If RM has not expired the NM, then the containers
are still valid and we can proceed to update the RM with any containers that
have exited
* Recover log-aggregations in progress (which may involve re-uploading logs
that were in-progress)
* Resume pending deletions
* Recover container localizations in progress (either reconnect or
abort-and-retry)
We don't anticipate any changes to AMs, etc. The NM will be temporarily
unavailable while it restarts, but the unavailability should be on the order of
seconds.
> Work-preserving nodemanager restart
> -----------------------------------
>
> Key: YARN-1336
> URL: https://issues.apache.org/jira/browse/YARN-1336
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
>
> This serves as an umbrella ticket for tasks related to work-preserving
> nodemanager restart.
--
This message was sent by Atlassian JIRA
(v6.1#6144)