[
https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693360#comment-14693360
]
Junping Du commented on YARN-3212:
----------------------------------
Thanks [~sunilg] for the comments! I agree it is not a bad idea to give a
decommissioning node that has just turned UNHEALTHY more chances to recover.
However, it will involve more complexity, such as: how many rounds should we
wait (counted in heartbeats or in time? a separate configuration?), an
additional state for a node that is both decommissioning and unhealthy, etc.
We should evaluate whether it is worth it, given that we have no hands-on
experience with this new feature yet. In practice, I have rarely seen nodes
return to a healthy state quickly, that is, within the timeout, unless someone
logs in and fixes them immediately. Thus, I would prefer to keep the current
transition, which sounds slightly aggressive but is a good trade-off for
simplicity at this moment. I can put a TODO in a later patch (if the comments
raise other outstanding issues) so that we revisit this when we come back with
more experience. Does that make sense?
> RMNode State Transition Update with DECOMMISSIONING state
> ---------------------------------------------------------
>
> Key: YARN-3212
> URL: https://issues.apache.org/jira/browse/YARN-3212
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch,
> YARN-3212-v2.patch, YARN-3212-v3.patch, YARN-3212-v4.1.patch,
> YARN-3212-v4.patch, YARN-3212-v5.1.patch, YARN-3212-v5.patch
>
>
> As proposed in YARN-914, a new state, “DECOMMISSIONING”, will be added. A
> node can enter it from the “running” state, triggered by a new event,
> “decommissioning”.
> The new state can transition to “decommissioned” on Resource_Update if no
> apps are running on this NM, or when the NM reconnects after a restart. It
> also transitions when it receives a DECOMMISSIONED event (after timeout from
> the CLI).
> In addition, it can go back to “running” if the user decides to cancel a
> previous decommission by calling recommission on the same node. The reaction
> to other events is similar to that of the RUNNING state.
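
For illustration only, the transitions in the description above could be
sketched roughly as below. This is not the real RMNodeImpl code: the class
name DecommissionSketch, the enum names, and the runningApps parameter are all
hypothetical. It also assumes one reading of the description, namely that
Resource_Update leads to “decommissioned” only when no apps are running, while
a reconnect after restart always does. Events the description does not name
are treated as no-ops here, which is a simplification.

```java
/** Hypothetical sketch of the proposed transitions; not the actual RMNodeImpl. */
final class DecommissionSketch {

    /** Only the three states mentioned in the issue description. */
    enum State { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    /** Illustrative event names, loosely following the description. */
    enum Event { DECOMMISSIONING, RESOURCE_UPDATE, RECONNECTED, DECOMMISSIONED, RECOMMISSION }

    /**
     * Returns the next state for a node.
     *
     * @param runningApps assumed count of apps still running on the NM
     */
    static State next(State current, Event event, int runningApps) {
        switch (current) {
            case RUNNING:
                // Graceful decommission starts only from RUNNING.
                return event == Event.DECOMMISSIONING ? State.DECOMMISSIONING : State.RUNNING;
            case DECOMMISSIONING:
                switch (event) {
                    case RESOURCE_UPDATE:
                        // Finish once the node has drained (assumed reading).
                        return runningApps == 0 ? State.DECOMMISSIONED : State.DECOMMISSIONING;
                    case RECONNECTED:     // NM reconnects after a restart
                    case DECOMMISSIONED:  // forced, e.g. after the CLI timeout
                        return State.DECOMMISSIONED;
                    case RECOMMISSION:
                        // User cancelled the decommission.
                        return State.RUNNING;
                    default:
                        return State.DECOMMISSIONING;
                }
            default:
                // DECOMMISSIONED is treated as terminal in this sketch.
                return current;
        }
    }
}
```

Under these assumptions, a node in DECOMMISSIONING that still has apps running
stays there on Resource_Update, and moves on only when it drains, reconnects,
times out, or is recommissioned.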