[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2024-01-27 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated YARN-1:
--
Hadoop Flags: Reviewed

> Recovery failure when node-label configure-type transit from 
> delegated-centralized to centralized
> -
>
> Key: YARN-1
> URL: https://issues.apache.org/jira/browse/YARN-1
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When I change the configure-type from delegated-centralized to centralized in 
> yarn-site.xml and restart the RM, the restart fails.
> The error stack trace is as follows:
>  
> {code:txt}
> 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333)
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>     ... 4 more
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61)
>     at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138)
>     at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76)
>     at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41)
>     at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120)
>     at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149)
>     at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106)
>     at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252)
>     at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>     at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328)
>     ... 5 more
> 2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
>  {code}
> While digging into the codebase, I found that the node-to-labels mapping is 
> stored in the nodelabel.mirror file when the configure-type is centralized. 
> So the content of nodelabel.mirror 

[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2024-01-27 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated YARN-1:
--
Component/s: yarn

> Recovery failure when node-label configure-type transit from 
> delegated-centralized to centralized
> -
>
> Key: YARN-1
> URL: https://issues.apache.org/jira/browse/YARN-1
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>

[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2024-01-27 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated YARN-1:
--
 Target Version/s: 3.4.0
Affects Version/s: 3.4.0

> Recovery failure when node-label configure-type transit from 
> delegated-centralized to centralized
> -
>
> Key: YARN-1
> URL: https://issues.apache.org/jira/browse/YARN-1
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>

[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2022-04-22 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-1:
-
Fix Version/s: 3.4.0
   (was: 3.3.4)

Changed the fix version to 3.4.0 because it is now only in trunk.

> Recovery failure when node-label configure-type transit from 
> delegated-centralized to centralized
> -
>
> Key: YARN-1
> URL: https://issues.apache.org/jira/browse/YARN-1
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>

[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2022-04-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-1:
--
Labels: pull-request-available  (was: )

> Recovery failure when node-label configure-type transit from 
> delegated-centralized to centralized
> -
>
> Key: YARN-1
> URL: https://issues.apache.org/jira/browse/YARN-1
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>

[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2022-04-18 Thread Junfan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junfan Zhang updated YARN-1:

Description: 
When I change the configure-type from delegated-centralized to centralized in 
yarn-site.xml and restart the RM, the restart fails.

The error stack trace is as follows:
 
{code:txt}
2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    ... 4 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61)
    at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138)
    at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76)
    at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41)
    at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120)
    at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149)
    at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106)
    at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252)
    at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328)
    ... 5 more
2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session

 {code}

While digging into the codebase, I found that the node-to-labels mapping is 
stored in the nodelabel.mirror file when the configure-type is centralized. 

So the content of the nodelabel.mirror file is as follows:
1. the node-label list
2. the node-to-labels mapping (exists only when the configure-type is 
centralized)
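
The two-record layout above suggests how the NullPointerException can arise: a mirror file written under delegated-centralized presumably has no second record, while recovery under centralized still tries to read one. The following is a minimal sketch of that situation and of a null-safe recovery. It is not Hadoop code and not the actual patch; the class MirrorRecoverySketch, its methods, and the toy record format (an int count followed by UTF strings) are all hypothetical stand-ins for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the mirror-file recovery path; not Hadoop code.
final class MirrorRecoverySketch {

  // Builds a toy mirror image: record 1 is the node-label list; record 2
  // (the node-to-labels entries) is written only when nodeToLabels is non-null.
  static byte[] buildMirror(List<String> labels, List<String> nodeToLabels) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      writeRecord(out, labels);
      if (nodeToLabels != null) {
        writeRecord(out, nodeToLabels);
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private static void writeRecord(DataOutputStream out, List<String> entries)
      throws IOException {
    out.writeInt(entries.size());
    for (String entry : entries) {
      out.writeUTF(entry);
    }
  }

  // Returns null when the stream is already exhausted, i.e. the record is missing.
  private static List<String> readRecord(DataInputStream in) throws IOException {
    if (in.available() == 0) {
      return null;
    }
    int count = in.readInt();
    List<String> entries = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      entries.add(in.readUTF());
    }
    return entries;
  }

  // Recovers the node-to-labels entries, treating a missing second record
  // as "no mappings" instead of dereferencing null.
  static List<String> recoverNodeToLabels(byte[] mirror) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(mirror))) {
      readRecord(in); // record 1: the node-label list (skipped here)
      List<String> mapping = readRecord(in); // record 2: may be absent
      return mapping == null ? Collections.<String>emptyList() : mapping;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    List<String> labels = Arrays.asList("gpu", "ssd");
    byte[] centralized = buildMirror(labels, Arrays.asList("host1=gpu"));
    byte[] delegated = buildMirror(labels, null);
    System.out.println(recoverNodeToLabels(centralized));
    System.out.println(recoverNodeToLabels(delegated));
  }
}
```

In this sketch the mirror written without a mapping record recovers to an empty list rather than failing. Whether an empty mapping is the right recovery result for the RM is a design decision that the sketch does not settle.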

 


[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2022-04-18 Thread Junfan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junfan Zhang updated YARN-1:

Description: 
When I change the configure-type from delegated-centralized to centralized in 
yarn-site.xml and restart the RM, the restart fails.

The error stack trace is as follows:
 
{code:txt}
2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    ... 4 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61)
    at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138)
    at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76)
    at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41)
    at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120)
    at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149)
    at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106)
    at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252)
    at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328)
    ... 5 more
2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session

 {code}

While digging into the codebase, I found that the node-to-labels mapping is 
stored in the nodelabel.mirror file when the configure-type is centralized. 
However the conf

 

  was:
When I change the node-label configuration type from delegated-centralized to 
centralized in yarn-site.xml and restart the RM, the restart fails.

The error stack trace is as follows:
 
{code:txt}
2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)

[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2022-04-18 Thread Junfan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junfan Zhang updated YARN-1:

Description: 
When I change the node-label configuration type from delegated-centralized to 
centralized in yarn-site.xml and restart the RM, the restart fails.

The error stack trace is as follows:
 
{code:txt}
2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333)
at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
... 4 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61)
at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138)
at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76)
at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41)
at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120)
at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149)
at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106)
at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252)
at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328)
... 5 more
2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session

 {code}

When I dug into the codebase, I found that the node-to-labels mapping is 
stored in the nodelabel.mirror file when configured the 

 

  was: When I 


> Recovery failure when node-label configure-type transit from 
> delegated-centralized to centralized
> -
>
> Key: YARN-1
> URL: https://issues.apache.org/jira/browse/YARN-1
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>
> When I change the node-label configuration type from delegated-centralized to 
> centralized in yarn-site.xml and restart the RM, the restart fails.
> The error stack trace is as follows:
>  
> {code:txt}
> 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 

[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2022-04-18 Thread Junfan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junfan Zhang updated YARN-1:

Description: When I 

> Recovery failure when node-label configure-type transit from 
> delegated-centralized to centralized
> -
>
> Key: YARN-1
> URL: https://issues.apache.org/jira/browse/YARN-1
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>
> When I 


