[ 
https://issues.apache.org/jira/browse/YARN-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953267#comment-15953267
 ] 

Varun Saxena commented on YARN-6420:
------------------------------------

Thanks [~bibinchundatt] for the patch.
You have added write locks in multiple methods of CommonNodeLabelsManager. 
However, adding a write lock only in 
CommonNodeLabelsManager#addToCluserNodeLabels would be sufficient to fix this 
issue, because the other node label operations come via RMNodeLabelManager, 
which already acquires the same write lock.

That said, CommonNodeLabelsManager is not an abstract class and can be used 
in isolation, so it makes sense to make the locking consistent there as well.
So I am fine with what you have done in the patch.
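
To illustrate why consistent locking matters here, below is a minimal sketch 
and not the actual CommonNodeLabelsManager code. The class LabelLockSketch, 
its methods, and the string-based edit log are all hypothetical. It assumes 
that the edit log entry is appended while the write lock is still held, so 
the order of entries in the log always matches the order in which the 
in-memory state was changed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a label manager; not the real Hadoop class. */
class LabelLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<String> clusterLabels = new HashSet<>();
    // Stands in for the edit log / dispatcher queue (an assumption of this sketch).
    private final List<String> editLog = new ArrayList<>();

    /** Adds a cluster label and records it in the log under the write lock. */
    void addClusterLabel(String label) {
        lock.writeLock().lock();
        try {
            clusterLabels.add(label);
            editLog.add("ADD " + label);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Replaces the label on a node; validation and logging share the same lock. */
    void replaceLabelOnNode(String node, String label) {
        lock.writeLock().lock();
        try {
            if (!clusterLabels.contains(label)) {
                throw new IllegalArgumentException("unknown label: " + label);
            }
            editLog.add("REPLACE " + node + " " + label);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns a snapshot of the log, taken under the read lock. */
    List<String> editLog() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(editLog);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Replays a log as recovery would; false if a REPLACE precedes its ADD. */
    static boolean replaysCleanly(List<String> log) {
        Set<String> known = new HashSet<>();
        for (String entry : log) {
            String[] parts = entry.split(" ");
            if (parts[0].equals("ADD")) {
                known.add(parts[1]);
            } else if (!known.contains(parts[2])) {
                return false;
            }
        }
        return true;
    }
}
```

If the log append happened outside the lock, a REPLACE entry could be 
written before the matching ADD entry, and replaying the log on restart 
would hit the unknown-label case, which is the kind of failure reported in 
this issue.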

I will look at the patch in more detail later today.

> RM startup failure due to wrong order in nodelabel editlog
> ----------------------------------------------------------
>
>                 Key: YARN-6420
>                 URL: https://issues.apache.org/jira/browse/YARN-6420
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: YARN-6420.0001.patch
>
>
> The node label edit log is written in the wrong order if the addition of 
> {{StoreNewClusterNodeLabels}} to the dispatcher is delayed and 
> {{UpdateNodeToLabelsMappingsEvent}} is added first.
> Steps to reproduce:
> 1. Configure the RM admin client thread count to 2.
> 2. Add node label X to the cluster (client 1).
> 3. Delay the addition of that event to the dispatcher.
> 4. Replace the node label on node1 with X (client 2).
> 5. Make sure {{UpdateNodeToLabelsMappingsEvent}} is added to the dispatcher first.
> 6. Restart the ResourceManager.
> {noformat}
> 2017-03-31 16:20:42,236 | WARN  | main-EventThread | Exception handling the winning of election | ActiveStandbyElector.java:836
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>         at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832)
>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:728)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:600)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:331)
>         at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
>         ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: java.io.IOException: Not all labels being replaced contained by known label collections, please check, new labels=[1]
>         at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>         at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:642)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1042)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1083)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1079)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
> {noformat}



