[ 
https://issues.apache.org/jira/browse/YARN-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430799#comment-15430799
 ] 

Jason Lowe commented on YARN-5049:
----------------------------------

The major version should change whenever an older version of the software must 
not try to use the state store.  If we bump only the minor version, the old 
software will happily use the state store, because all schemas with the same 
major version are considered "compatible."
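
To illustrate the compatibility rule, here is a minimal sketch.  SchemaVersion 
is a hypothetical class made up for this example, not the actual Hadoop type; 
it only assumes that compatibility is decided by comparing major versions.

```java
// Hypothetical illustration; SchemaVersion is not a real Hadoop class.
final class SchemaVersion {
    final int major;
    final int minor;

    SchemaVersion(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    // Software may use a stored schema only when the major versions match.
    // The minor version is ignored, so a minor bump never locks anyone out.
    boolean isCompatibleTo(SchemaVersion stored) {
        return this.major == stored.major;
    }
}
```

With this rule, software at 1.0 accepts a store at 1.1 but refuses a store at 
2.0, which is why only a major bump keeps old software away from the store.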

So we need to think about two scenarios:
# What happens if we upgrade to a newer version of software that sees the old 
schema without these keys?
# What happens if we downgrade from a newer version of software with these keys 
to an older one that doesn't know about them?

For #1 I think it's easy.  Old software doesn't support queued containers, so 
those keys won't be there.  No queued containers means nothing to restore for 
that subsystem, so we should be fine during recovery.
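
A rough sketch of why the upgrade case is harmless follows.  The "queued/" key 
prefix and the class and method names are assumptions made up for the example; 
the real key layout in the state store may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; names and key layout are assumptions.
final class QueuedContainerRecovery {
    static final String QUEUED_PREFIX = "queued/";

    // Collect the IDs of all queued containers found in the store.
    // A store written by old software has no such keys, so the result
    // is simply an empty list and recovery has nothing to restore.
    static List<String> recoverQueuedContainerIds(Map<String, String> store) {
        List<String> ids = new ArrayList<>();
        for (String key : store.keySet()) {
            if (key.startsWith(QUEUED_PREFIX)) {
                ids.add(key.substring(QUEUED_PREFIX.length()));
            }
        }
        return ids;
    }
}
```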

For #2 it's more complicated.  If we have queued containers and then do a 
rolling downgrade, we could end up losing those containers because the old 
software doesn't support them.  Therefore I think we can't support rolling 
downgrades once queued containers have been used.

So it looks like the proper way forward is to bump the major version because of 
the lack of rolling downgrade support.  IMHO the version number should be 
updated "lazily," meaning if we're currently on schema version 1 but never use 
queued containers then it stays at version 1.  If we're on version 1 when a 
queued container needs to be saved in the state store then we update the major 
version at that time.  This has a number of important benefits to the end user:
- No need for a "migration script" that needs to be run manually
- Users don't lose the ability to do a rolling downgrade until they leverage 
the functionality that broke the ability to downgrade.
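
The lazy update could look roughly like the sketch below.  It uses an in-memory 
map in place of the real state store, and every name in it (LazyVersionedStore, 
the key prefixes, the version constants, usableBy) is hypothetical.  It also 
assumes that newer software accepts any stored major version up to the one it 
supports.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a lazily bumped schema version.
final class LazyVersionedStore {
    static final int BASE_MAJOR = 1;
    static final int QUEUED_CONTAINERS_MAJOR = 2;

    private final Map<String, String> db = new HashMap<>();
    private int storedMajor = BASE_MAJOR;

    int getStoredMajor() {
        return storedMajor;
    }

    // Ordinary writes never touch the schema version.
    void storeContainer(String id, String payload) {
        db.put("containers/" + id, payload);
    }

    // The first queued container forces the major bump, and only then.
    void storeQueuedContainer(String id, String payload) {
        if (storedMajor < QUEUED_CONTAINERS_MAJOR) {
            storedMajor = QUEUED_CONTAINERS_MAJOR;
        }
        db.put("queued/" + id, payload);
    }

    // Assumption: software can read any schema up to its own major version.
    boolean usableBy(int maxSupportedMajor) {
        return storedMajor <= maxSupportedMajor;
    }
}
```

In this sketch a store that never sees a queued container stays at major 
version 1, so software that only understands version 1 can still open it after 
a downgrade.  Once storeQueuedContainer has run, that older software is 
rejected.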

This matches the precedent set by the container ID epoch change for RM 
work-preserving restart in 2.6.  2.5 apps were supported on 2.6 until the user 
did a work-preserving RM restart, since that's what caused the epoch ID to be 
added to the container ID, breaking any 2.5 app that tried to parse a container 
ID.


> Extend NMStateStore to save queued container information
> --------------------------------------------------------
>
>                 Key: YARN-5049
>                 URL: https://issues.apache.org/jira/browse/YARN-5049
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Konstantinos Karanasos
>            Assignee: Konstantinos Karanasos
>             Fix For: 2.9.0
>
>         Attachments: YARN-5049.001.patch, YARN-5049.002.patch, 
> YARN-5049.003.patch
>
>
> This JIRA is about extending the NMStateStore to save queued container 
> information whenever a new container is added to the NM queue. 
> It also removes the information from the state store when the queued 
> container starts its execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
