[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985709#comment-13985709 ]
Bikas Saha commented on YARN-2001:
----------------------------------
Requiring all NMs to re-register might be too constraining, because after a
full code rollout it may be common for some NMs not to come back. If the RM
gets stuck waiting for a minority of NMs that never re-register, that would
effectively be a loss of HA.
I like the idea of waiting for a time period before considering the cluster
fully up. However, this timeout has to be small, or else we will have a lot of
downtime. Can this timeout be less than the AM liveness period? If not, how
should we treat AMs that are running on NMs that have not re-registered within
the NM timeout?
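A minimal sketch of the time-bounded wait described above, under stated assumptions: `RegistrationGate` and its parameters are hypothetical (not a YARN class), times are plain seconds, and the rule that the grace period must be strictly shorter than the AM liveness period (configured in YARN by `yarn.am.liveness-monitor.expiry-interval-ms`) is one assumed answer to the open question, not a decided design. The RM counts the cluster as up once every remembered NM is back *or* the grace period has elapsed, so a few NMs that never return cannot block it.

```python
from dataclasses import dataclass, field


@dataclass
class RegistrationGate:
    """Hypothetical helper: decides when a restarted RM treats the cluster as up.

    Not part of YARN; all names and parameters are illustrative assumptions.
    """
    remembered: frozenset        # NM IDs persisted before the restart
    grace_secs: float            # how long to wait for NMs to re-register
    am_liveness_secs: float      # AM expiry interval (assumed upper bound for the wait)
    start: float                 # RM restart time, in seconds
    registered: set = field(default_factory=set)

    def __post_init__(self):
        # Assumption: the wait must end before AMs would be expired.
        if self.grace_secs >= self.am_liveness_secs:
            raise ValueError("grace period must be shorter than the AM liveness period")

    def register(self, node_id):
        self.registered.add(node_id)

    def missing(self):
        # Remembered NMs that have not come back; AMs on them need a separate policy.
        return set(self.remembered) - self.registered

    def is_ready(self, now):
        # Up when all remembered NMs are back, or once the grace period has elapsed.
        return not self.missing() or now - self.start >= self.grace_secs
```

For example, with three remembered NMs and only two re-registered, `is_ready` stays false until the grace period expires, and `missing()` then names the NM whose AMs still need handling.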
> Persist NMs info for RM restart
> -------------------------------
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have
> registered with RM. For that, RM needs to remember the previous NMs and wait
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.
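A minimal sketch of persisting NM info across an RM restart, assuming a JSON file as a stand-in for the RM's real state store and a `{node_id: state}` map whose state names are loosely modeled on YARN's `NodeState` values. The helper functions are hypothetical. It shows the two uses named in the description: deciding which NMs to wait for (only those previously running) and continuing to reject NMs that were decommissioned before the restart.

```python
import json
from pathlib import Path

# Hypothetical subset of node states, loosely modeled on YARN's NodeState.
RUNNING = "RUNNING"
DECOMMISSIONED = "DECOMMISSIONED"


def save_nodes(path, nodes):
    """Persist {node_id: state}; a JSON file stands in for the RM state store."""
    Path(path).write_text(json.dumps(nodes, sort_keys=True))


def load_nodes(path):
    """Reload the persisted map after restart; a missing file means no prior NMs."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def nodes_to_wait_for(nodes):
    """Only NMs that were running before the restart are expected to re-register."""
    return frozenset(n for n, s in nodes.items() if s == RUNNING)


def accept_registration(nodes, node_id):
    """Keep rejecting NMs that were decommissioned before the restart."""
    return nodes.get(node_id) != DECOMMISSIONED
```

Unknown node IDs are accepted here; whether a restarted RM should admit brand-new NMs during the wait is left open.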
--
This message was sent by Atlassian JIRA
(v6.2#6252)