[ https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808334#comment-13808334 ]

Jason Lowe commented on YARN-1336:
----------------------------------

bq. do you have some writeup about the workflow of work-preserving NM restart?

Not yet.  I'll try to cover the high points below in the interim.

bq. According to the current sub tasks, I can see that we need a NMStateStore

Rather than explicitly call that out as a subtask, I expected the state store 
would be organically grown and extended as the persistence and restore logic 
for each piece of context was added.  Having a single subtask for the entire 
state store didn't make as much sense, since the people working on 
persisting/restoring the separate context pieces will know best what they 
require of the state store.  Otherwise the state store subtask seems like 80% 
of the work. :-)
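
To make the "organically grown" idea concrete, here is a minimal sketch of 
what such a state store contract might look like, with a trivial in-memory 
stand-in.  All names here (NMStateStore, storeContainer, etc.) are 
illustrative assumptions, not the actual YARN API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each recovery subtask would add the methods it needs
// to this interface as its piece of context gains persistence support.
interface NMStateStore {
    void storeContainer(String containerId, byte[] launchContext);
    void removeContainer(String containerId);
    Map<String, byte[]> loadContainerState();
}

// Trivial in-memory stand-in to illustrate the contract; a real store would
// persist to local disk so the state survives a NM process restart.
class InMemoryNMStateStore implements NMStateStore {
    private final Map<String, byte[]> containers = new HashMap<>();
    public void storeContainer(String id, byte[] ctx) { containers.put(id, ctx); }
    public void removeContainer(String id) { containers.remove(id); }
    public Map<String, byte[]> loadContainerState() { return new HashMap<>(containers); }
}

public class StateStoreSketch {
    public static void main(String[] args) {
        NMStateStore store = new InMemoryNMStateStore();
        store.storeContainer("container_1", new byte[0]);
        System.out.println(store.loadContainerState().size()); // prints 1
    }
}
```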

bq. Beyond this, how does NM contact RM and AM about its reserved work?

We don't currently see any need for the NM to contact the AM about anything 
related to restart.  The whole point of the restart is to be as transparent as 
possible to the rest of the system -- as if the restart never occurred.  As for 
the RM, we do need a small change so that it no longer assumes all containers 
have been killed when a NM registers redundantly (i.e.: the RM has not yet 
expired the node, yet the node is registering again).  That should be covered 
in the container state recovery subtask, or we can create an explicit separate 
subtask for it.
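
As a rough sketch of that RM-side change (the method and its parameters are 
hypothetical, not the actual ResourceManager code): on a redundant 
registration, the decision hinges only on whether the node had already been 
expired.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ReRegistrationSketch {
    // Hypothetical RM-side handling of a redundant NM registration: if the
    // node was never expired, its reported containers remain valid instead
    // of being assumed killed.
    public static Set<String> handleReRegistration(boolean nodeExpired,
                                                   Set<String> reportedLiveContainers) {
        if (nodeExpired) {
            // Node was expired: its containers were already considered lost.
            return Collections.emptySet();
        }
        // Node restarted before expiry: keep its containers alive.
        return reportedLiveContainers;
    }

    public static void main(String[] args) {
        Set<String> live = new HashSet<>(Arrays.asList("c1", "c2"));
        System.out.println(handleReRegistration(false, live).size()); // prints 2
        System.out.println(handleReRegistration(true, live).size());  // prints 0
    }
}
```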

bq. How do we distinguish NM restart and shutdown?

That is a good question and something we need to determine.  I'll file a 
separate subtask to track it.  The current thinking is that if recovery is 
enabled for the NM then it will persist its state as it changes and support 
restarting/rejoining the cluster without containers being lost if it can do so 
before the RM notices (i.e.: expires the NM).  Then we could use this feature 
not only for rolling upgrades but also for running the NM under a supervisor 
that can restart it if it crashes without catastrophic loss to the workload 
running on that node at the time.  This requires decommission (i.e.: a shutdown 
where the NM is not coming back anytime soon) to be a separate, explicit 
command sent to the NM (either via special signal or admin RPC call) so the NM 
knows to cleanup containers and directories since it will not be coming back 
anytime soon.
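
If the recovery switch ends up as NodeManager configuration, enabling it 
might look something like the following yarn-site.xml fragment.  These 
property names and paths are assumptions for illustration, not decided names:

```xml
<!-- yarn-site.xml: property names are illustrative assumptions -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
```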

At a high level, the NM performs these tasks during restart:

* Recover last known state (containers, distcache contents, active tokens, 
pending deletions, etc.)
* Reacquire active containers, noting any that have exited in the interim 
since the last known state.
* Re-register with the RM.  If the RM has not expired the NM, then the 
containers are still valid and we can proceed to update the RM with any 
containers that have exited.
* Recover log-aggregations in progress (which may involve re-uploading logs 
that were in-progress)
* Resume pending deletions
* Recover container localizations in progress (either reconnect or 
abort-and-retry)
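
The "reacquire" step above can be sketched as follows: for each container pid 
recovered from the state store, check whether the process is still alive and 
partition the set into still-running versus exited.  This is an illustrative 
sketch (using Java's ProcessHandle, and the current JVM's pid as a stand-in 
for a live container), not the actual NodeManager code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ReacquireSketch {
    // Partition recovered container pids into still-running (true) vs
    // exited (false), so the exited ones can be reported to the RM.
    public static Map<Boolean, List<Long>> reacquire(List<Long> recoveredPids) {
        return recoveredPids.stream().collect(Collectors.partitioningBy(
            pid -> ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false)));
    }

    public static void main(String[] args) {
        // The current JVM's own pid stands in for a live container process.
        long self = ProcessHandle.current().pid();
        Map<Boolean, List<Long>> result = reacquire(Arrays.asList(self));
        System.out.println(result.get(true).size()); // prints 1
    }
}
```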

We don't anticipate any changes to AMs, etc.  The NM will be temporarily 
unavailable while it restarts, but the unavailability should be on the order of 
seconds.

> Work-preserving nodemanager restart
> -----------------------------------
>
>                 Key: YARN-1336
>                 URL: https://issues.apache.org/jira/browse/YARN-1336
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>
> This serves as an umbrella ticket for tasks related to work-preserving 
> nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
