[ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321698#comment-15321698 ]

Robert Kanter commented on YARN-4676:
-------------------------------------

Sorry [~danzhi] for disappearing for a bit there.  I got sidetracked with some 
other responsibilities.  Thanks [~vvasudev] for your detailed comments too.  
Here are some additional comments on the latest patch (14):

# The patch doesn't apply cleanly to the current trunk.
#- I rolled my repo back to an older commit where the patch does apply 
cleanly, but some tests failed:
{noformat}
testNodeRemovalNormally(org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService)  Time elapsed: 12.43 sec  <<< FAILURE!
java.lang.AssertionError: Node state is not correct (timedout) expected:<DECOMMISSIONING> but was:<SHUTDOWN>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:743)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:727)
        at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1474)
        at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1413)

testNodeRemovalGracefully(org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService)  Time elapsed: 3.184 sec  <<< FAILURE!
java.lang.AssertionError: Node should have been forgotten! expected:<host2:5678> but was:<null>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:743)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1586)
        at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalGracefully(TestResourceTrackerService.java:1421)
{noformat}
# I like [~vvasudev]'s suggestion in an [earlier comment|https://issues.apache.org/jira/browse/YARN-4676?focusedCommentId=15272554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15272554] 
about having the RM tell the NM to do a delayed shutdown (see the sketch 
below).  This keeps the RM from having to track anything, so we don't have 
to worry about RM failovers, and I think it would be a lot simpler to 
implement and maintain.  I'd suggest we do that in this JIRA rather than in 
a followup JIRA; otherwise we'll commit a bunch of code here just to throw 
it out later.
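To make that concrete, here's a rough NM-side sketch of the delayed-shutdown 
idea.  Treat it as illustration only: {{DelayedShutdownHandler}} and 
{{handleDecommissionOrder}} are made-up names, and how the timeout actually 
gets plumbed through the heartbeat response would still need to be designed.
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; these names are not in the patch or in YARN today.
public class DelayedShutdownHandler {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // Invoked when a heartbeat response carries a decommission order from the
  // RM.  The NM owns the timer, so an RM failover loses no tracking state.
  public void handleDecommissionOrder(long timeoutMs) {
    scheduler.schedule(this::shutdownGracefully, timeoutMs,
        TimeUnit.MILLISECONDS);
  }

  private void shutdownGracefully() {
    // Stop accepting new containers, wait for running ones to finish (or
    // kill them at the deadline), then exit the NM process.
  }
}
{code}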
# In {{HostsFileReader#readXmlFileToMapWithFileInputStream}}, you can replace 
the multiple {{catch}} blocks with a single {{catch}} using Java 7+ 
multi-catch syntax (a fuller sketch follows the snippet):
{code:java}
catch (IOException | SAXException | ParserConfigurationException e) {
   ...
}
{code}
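To see it end-to-end, here's a minimal self-contained example of the syntax; 
this is not the actual {{HostsFileReader}} code, just an invented XML-parsing 
method that throws the same three exception types.
{code:java}
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class XmlHostsParser {
  public static Document parse(File xmlFile) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(xmlFile);
    } catch (IOException | SAXException | ParserConfigurationException e) {
      // One multi-catch block (Java 7+) replaces three identical handlers.
      throw new RuntimeException("Error parsing " + xmlFile, e);
    }
  }
}
{code}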
# I also agree with [~vvasudev] on point 7 about the exit-wait.ms property.  
It seems like a separate feature, so if you still want it, I'd suggest 
filing a separate JIRA for just that change.

> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, 
> GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch, 
> YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch, 
> YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch, 
> YARN-4676.012.patch, YARN-4676.013.patch, YARN-4676.014.patch
>
>
> YARN-4676 implements an automatic, asynchronous, and flexible mechanism to 
> gracefully decommission YARN nodes. After the user issues a refreshNodes 
> request, the ResourceManager automatically evaluates the status of all 
> affected nodes and kicks off decommission or recommission actions. The RM 
> asynchronously tracks container and application status on DECOMMISSIONING 
> nodes so that each node can be decommissioned as soon as it is ready. 
> Decommissioning timeouts at per-node granularity are supported and can be 
> updated dynamically. The mechanism naturally supports multiple independent 
> graceful decommissioning “sessions”, each involving a different set of 
> nodes with different timeout settings. Such support is ideal, and 
> necessary, for graceful decommission requests issued by external 
> cluster-management software rather than by a human.
> DecommissioningNodeWatcher, inside ResourceTrackerService, automatically 
> and asynchronously tracks the status of DECOMMISSIONING nodes after the 
> client/admin makes a graceful decommission request. It uses that status to 
> decide when a node, once all of its running containers have completed, 
> should transition into the DECOMMISSIONED state. NodesListManager detects 
> and handles include- and exclude-list changes to kick off decommission or 
> recommission as necessary.
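
A minimal sketch, purely for illustration, of the kind of per-node timeout 
tracking described above (the class and method names below are invented and 
don't mirror the actual patch):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DecommissionTracker {
  // Per-node decommission deadline, in milliseconds since the epoch.
  private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

  public void startDecommissioning(String nodeId, long timeoutMs) {
    deadlines.put(nodeId, System.currentTimeMillis() + timeoutMs);
  }

  // Called on each node status update: DECOMMISSIONING -> DECOMMISSIONED
  // once all containers finish or the per-node timeout expires.
  public boolean readyToDecommission(String nodeId, int runningContainers) {
    Long deadline = deadlines.get(nodeId);
    if (deadline == null) {
      return false; // node is not decommissioning
    }
    return runningContainers == 0 || System.currentTimeMillis() >= deadline;
  }

  // An include/exclude list change reversed the request.
  public void recommission(String nodeId) {
    deadlines.remove(nodeId);
  }
}
{code}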


