[ 
https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819059#comment-13819059
 ] 

Jason Lowe commented on YARN-1336:
----------------------------------

It depends upon the nature of the config change or fix.  In essence, this is no 
different from the RM restart use-case today.  Any config changes or fixes need 
to keep recovery on startup in mind.  Most fixes won't be an issue, but 
anything that changes the syntax or semantics of the state store data or 
recovery process in general will have to deal with the state store format from 
a previous version to remain compatible.

Ideally we'd like to support work-preserving rolling downgrades as well as 
work-preserving rolling upgrades, so one can smoothly recover from a failed 
upgrade without taking down the whole cluster.  If the persisted state format 
isn't changing then this should be straightforward.  However, if the state 
format does change between versions and we only support a one-way conversion 
from the old format to the new one, then we would support a work-preserving 
rolling upgrade but not a work-preserving rolling downgrade between those 
versions.  A downgrade would still be possible, at the cost of losing 
containers, by simply removing the state store data and restarting.
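
As a rough illustration of the upgrade/downgrade asymmetry described above, 
here is a minimal sketch of a startup check against a versioned state store.  
Everything in it is hypothetical and is not the actual RM or NM code: the 
class name, the integer schema version, the Action enum, and the key-prefix 
"conversion" are assumptions made only for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names come from YARN itself.
class StateRecoverySketch {
    // Assumed schema version that this build of the daemon reads and writes.
    static final int CURRENT_VERSION = 2;

    enum Action { RECOVER_AS_IS, CONVERT_THEN_RECOVER, DISCARD_STATE }

    // Decide what to do with state persisted under storedVersion.
    static Action decide(int storedVersion, int currentVersion) {
        if (storedVersion == currentVersion) {
            return Action.RECOVER_AS_IS;          // format unchanged
        }
        if (storedVersion < currentVersion) {
            return Action.CONVERT_THEN_RECOVER;   // upgrade: one-way conversion
        }
        // Downgrade: the store was written by a newer version and there is no
        // reverse conversion, so a work-preserving restart is not possible.
        return Action.DISCARD_STATE;
    }

    // Assumed one-way conversion from v1 to v2: v2 keys gain a prefix.
    static Map<String, String> convertV1ToV2(Map<String, String> v1) {
        Map<String, String> v2 = new HashMap<>();
        for (Map.Entry<String, String> e : v1.entrySet()) {
            v2.put("containers/" + e.getKey(), e.getValue());
        }
        return v2;
    }

    // Returns the state to recover on startup; empty means containers are lost.
    static Map<String, String> recover(int storedVersion, Map<String, String> stored) {
        switch (decide(storedVersion, CURRENT_VERSION)) {
            case RECOVER_AS_IS:
                return new HashMap<>(stored);
            case CONVERT_THEN_RECOVER:
                // Only the v1 -> v2 step exists in this sketch.
                return convertV1ToV2(stored);
            default:
                // Equivalent to removing the state store data and restarting.
                return new HashMap<>();
        }
    }

    public static void main(String[] args) {
        Map<String, String> v1 = new HashMap<>();
        v1.put("container_1", "RUNNING");
        System.out.println("upgrade from v1:   " + recover(1, v1));
        System.out.println("downgrade from v3: " + recover(3, v1));
    }
}
```

In this sketch the downgrade path simply starts with empty state, which 
corresponds to the non-work-preserving downgrade mentioned above.  Supporting 
a work-preserving downgrade would require either an unchanged format or a 
reverse conversion as well.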

In summary, we would need to be cognizant of changes that affect state recovery 
upon startup so a work-preserving restart can be used to support 
work-preserving rolling upgrades.  This applies to both the RM and the NM.

> Work-preserving nodemanager restart
> -----------------------------------
>
>                 Key: YARN-1336
>                 URL: https://issues.apache.org/jira/browse/YARN-1336
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>
> This serves as an umbrella ticket for tasks related to work-preserving 
> nodemanager restart.



