[ 
https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807652#comment-13807652
 ] 

Zhijie Shen commented on YARN-1336:
-----------------------------------

It sounds like a nice feature. I thought about it a bit before: since we allow
the RM to restart, why not the NM? [~jlowe], do you have a writeup of the
workflow for work-preserving NM restart? If so, would you mind sharing it? I'm
curious about the design. Judging from the current sub-tasks, we need an
NMStateStore (like RMStateStore for the RM) to store the aforementioned
information when the NM stops and to recover all of that state when the NM
starts again. Beyond this, how does the NM contact the RM and AM about its
preserved work?
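
A rough sketch of the store-and-recover idea described above, only to make the
question concrete. The class and method names here are hypothetical and are not
taken from YARN; the in-memory map merely stands in for storage that would have
to survive an NM restart (for example, the local disk).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the actual YARN NMStateStore API.
public class NMStateStoreSketch {
    // Stands in for durable storage that would outlive the NM process.
    private final Map<String, String> persisted = new HashMap<>();

    // Record the latest known state of a container while the NM is up.
    public void storeContainer(String containerId, String state) {
        persisted.put(containerId, state);
    }

    // Forget a container once it has finished and no longer needs recovery.
    public void removeContainer(String containerId) {
        persisted.remove(containerId);
    }

    // On NM start, return a copy of everything that was recorded earlier.
    public Map<String, String> recover() {
        return new HashMap<>(persisted);
    }

    public static void main(String[] args) {
        NMStateStoreSketch store = new NMStateStoreSketch();
        store.storeContainer("container_01", "RUNNING");
        store.storeContainer("container_02", "RUNNING");
        store.removeContainer("container_02");

        // A restarted NM would rebuild its view of running work from this.
        System.out.println(store.recover());
    }
}
```

How the recovered containers are then reported to the RM and AM is exactly the
open question; the sketch does not attempt to answer it.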

I have another question about this feature. How do we distinguish an NM restart
from a shutdown? If an NM shuts down and never comes back, should the work still
be preserved (or trapped) there? Currently, the NM immediately reports that the
containers on it have been killed, and the application then has the chance to
start another container to do its work.

> Work-preserving nodemanager restart
> -----------------------------------
>
>                 Key: YARN-1336
>                 URL: https://issues.apache.org/jira/browse/YARN-1336
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>
> This serves as an umbrella ticket for tasks related to work-preserving 
> nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
