[ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404645#comment-15404645 ]

Daniel Zhi commented on YARN-4676:
----------------------------------

If the NM crashes (for example, a JVM exit due to running out of heap), it is 
supposed to restart automatically instead of waiting for a human to start it. 
Isn't that the general practice? The NM code, upon receiving a shutdown from 
the RM, will exit itself. But nothing prevents the NM daemon from restarting, 
whether automatically or by a human. When such an NM restarts, it will try to 
register itself with the RM, and it will be told to shut down if it still 
appears in the exclude list. Such a node will remain DECOMMISSIONED inside the 
RM until, 10+ minutes later, it transitions into LOST after the EXPIRE event.
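
Roughly, the register-time check boils down to an exclude-list lookup. A 
minimal self-contained sketch (class and method names are illustrative, not 
the actual ResourceTrackerService code):

{code:java}
import java.util.Set;

// Illustrative sketch of the register-time exclude-list check described
// above; class and method names are hypothetical.
class RegistrationSketch {
    enum NodeAction { NORMAL, SHUTDOWN }

    private final Set<String> excludeList;

    RegistrationSketch(Set<String> excludeList) {
        this.excludeList = excludeList;
    }

    // A restarted NM re-registers; if its host is still excluded, the RM
    // answers SHUTDOWN and the node stays DECOMMISSIONED until the EXPIRE
    // event turns it into LOST.
    NodeAction registerNodeManager(String host) {
        return excludeList.contains(host) ? NodeAction.SHUTDOWN
                                          : NodeAction.NORMAL;
    }
}
{code}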

Such a DECOMMISSIONED node can be recommissioned (refreshNodes after it is 
removed from the exclude list), during which it transitions into the RUNNING 
state.

This behavior appears to me to be robustness rather than a hack. It appears 
that the behavior you expect relies on a separate mechanism that permanently 
shuts down the NM once it is DECOMMISSIONED. As long as such a DECOMMISSIONED 
node never tries to register or be recommissioned, then yes, I expect the 
transitions you listed could be removed.
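
For clarity, these are the two transitions in question. A sketch with 
hypothetical enums, mirroring the RMNodeImpl state-machine style rather than 
quoting the actual patch:

{code:java}
// DECOMMISSIONED -> LOST after the liveliness monitor's EXPIRE event (the
// "10+ minutes later" case above), and DECOMMISSIONED -> RUNNING on
// recommission after refreshNodes removes the node from the exclude list.
enum NodeState { RUNNING, DECOMMISSIONED, LOST }
enum RMNodeEventType { EXPIRE, RECOMMISSION }

class TransitionSketch {
    static NodeState next(NodeState state, RMNodeEventType event) {
        if (state == NodeState.DECOMMISSIONED) {
            switch (event) {
                case EXPIRE:       return NodeState.LOST;
                case RECOMMISSION: return NodeState.RUNNING;
            }
        }
        return state; // all other pairs are unchanged in this sketch
    }
}
{code}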

So I do see these transitions as really needed. That said, I could remove 
them and maintain them privately inside the EMR branch for the sake of getting 
this JIRA going.

These transitions have been there almost since the beginning of this JIRA; 
any other comments/surprises?





> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, 
> GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch, 
> YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch, 
> YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch, 
> YARN-4676.012.patch, YARN-4676.013.patch, YARN-4676.014.patch, 
> YARN-4676.015.patch, YARN-4676.016.patch, YARN-4676.017.patch, 
> YARN-4676.018.patch, YARN-4676.019.patch
>
>
> YARN-4676 implements an automatic, asynchronous and flexible mechanism to 
> gracefully decommission YARN nodes. After the user issues the refreshNodes 
> request, the ResourceManager automatically evaluates the status of all 
> affected nodes to kick off decommission or recommission actions. The RM 
> asynchronously tracks container and application status related to 
> DECOMMISSIONING nodes in order to decommission the nodes immediately after 
> they are ready to be decommissioned. Decommissioning timeout at 
> individual-node granularity is supported and can be dynamically updated. The 
> mechanism naturally supports multiple independent graceful decommissioning 
> “sessions”, where each one involves different sets of nodes with different 
> timeout settings. Such support is ideal and necessary for graceful 
> decommission requests issued by external cluster management software instead 
> of by a human.
> DecommissioningNodeWatcher inside ResourceTrackerService tracks 
> DECOMMISSIONING nodes' status automatically and asynchronously after the 
> client/admin makes the graceful decommission request. It tracks 
> DECOMMISSIONING nodes' status to decide when, after all running containers 
> on the node have completed, the node will be transitioned into the 
> DECOMMISSIONED state. NodesListManager detects and handles include and 
> exclude list changes to kick off decommission or recommission as necessary.


