[
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330625#comment-15330625
]
Robert Kanter commented on YARN-4676:
-------------------------------------
{quote}
2. I am not very sure of the context of this point. The "earlier comments" link
leads to comment No. 22, which is about a separate timer for the poll and which
was addressed by the previous patch.{quote}
Sorry, it looks like my link didn't work right. I was referring to a different,
unnumbered comment that [~vvasudev] made. I'll quote it this time:
{quote}
Robert Kanter, Karthik Kambatla, Junping Du - instead of storing the timeouts
in a state store, we could also modify the RM-NM protocol to support a delayed
shutdown. That way when the node is decommissioned gracefully, we tell the NM
to shut down after the specified timeout. There'll have to be some logic to
cancel a shutdown to handle re-commissioned nodes, but we won't need to worry
about updating the RM state store with timeouts/timestamps. It also avoids the
clock skew issue that Karthik mentioned above. Like Karthik and Robert
mentioned, I'm fine with handling this in a follow-up JIRA as long as the
command exits without doing anything if graceful decommission is specified and
the cluster is set up with work-preserving restart.{quote}
I think this should simplify things because we wouldn't have to do anything
special for HA and the RM wouldn't have to keep track of anything. I know that's
a bit different from what you've been working on so far, but what do you think
about this idea?
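To make the idea more concrete, here's a very rough NM-side sketch (the class
and method names are purely illustrative and not from any of the patches here):
the RM would piggyback a shutdown delay on the heartbeat response, and the NM
would arm a local timer that a later recommission can cancel.
{code:java}
// Illustrative sketch only: assumes a hypothetical shutdown-delay field in the
// RM's heartbeat response; none of these names come from the YARN-4676 patches.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class DelayedShutdownHandler {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private ScheduledFuture<?> pendingShutdown;

  // Called when the RM's heartbeat response asks for a graceful shutdown
  // after the given timeout.
  public synchronized void scheduleShutdown(long timeoutSecs, Runnable shutdownAction) {
    cancelShutdown(); // replace any previously scheduled shutdown
    pendingShutdown = scheduler.schedule(shutdownAction, timeoutSecs, TimeUnit.SECONDS);
  }

  // Called when the node is recommissioned before the timeout expires.
  public synchronized void cancelShutdown() {
    if (pendingShutdown != null) {
      pendingShutdown.cancel(false);
      pendingShutdown = null;
    }
  }
}
{code}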
> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
> Key: YARN-4676
> URL: https://issues.apache.org/jira/browse/YARN-4676
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Daniel Zhi
> Assignee: Daniel Zhi
> Labels: features
> Attachments: GracefulDecommissionYarnNode.pdf,
> GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch,
> YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch,
> YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch,
> YARN-4676.012.patch, YARN-4676.013.patch, YARN-4676.014.patch,
> YARN-4676.015.patch, YARN-4676.016.patch
>
>
> YARN-4676 implements an automatic, asynchronous, and flexible mechanism to
> gracefully decommission YARN nodes. After the user issues a refreshNodes
> request, the ResourceManager automatically evaluates the status of all
> affected nodes and kicks off decommission or recommission actions. The RM
> asynchronously tracks container and application status on DECOMMISSIONING
> nodes and decommissions them as soon as they are ready. Decommissioning
> timeouts are supported at individual-node granularity and can be updated
> dynamically. The mechanism naturally supports multiple independent graceful
> decommissioning “sessions”, each involving a different set of nodes with
> different timeout settings. Such support is necessary when graceful
> decommission requests are issued by external cluster management software
> rather than by a human.
> DecommissioningNodeWatcher inside ResourceTrackingService automatically and
> asynchronously tracks the status of DECOMMISSIONING nodes after the
> client/admin makes the graceful decommission request. It tracks the status of
> each DECOMMISSIONING node to decide when, after all running containers on the
> node have completed, it should be transitioned into the DECOMMISSIONED state.
> NodesListManager detects and handles include and exclude list changes to kick
> off decommission or recommission as necessary.
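As a simplified illustration of the per-node timeout and container-drain check
described in the summary above (names are hypothetical, not the actual
DecommissioningNodeWatcher code):
{code:java}
// Illustrative sketch only: a node becomes ready for DECOMMISSIONED once all of
// its containers have completed or its individual timeout has expired.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DecommissionTracker {
  // Per-node absolute deadline in millis; each node carries its own timeout.
  private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

  public void startDecommissioning(String nodeId, long timeoutMs) {
    deadlines.put(nodeId, System.currentTimeMillis() + timeoutMs);
  }

  public void recommission(String nodeId) {
    deadlines.remove(nodeId);
  }

  // Called on each NM heartbeat with the node's current running-container count.
  public boolean readyToDecommission(String nodeId, int runningContainers) {
    Long deadline = deadlines.get(nodeId);
    if (deadline == null) {
      return false; // node is not decommissioning
    }
    return runningContainers == 0 || System.currentTimeMillis() >= deadline;
  }
}
{code}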