[
https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wangda Tan updated YARN-7252:
-----------------------------
Target Version/s: YARN-5734
> Removing queue then failing over results in exception
> -----------------------------------------------------
>
> Key: YARN-7252
> URL: https://issues.apache.org/jira/browse/YARN-7252
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Jonathan Hung
> Assignee: Jonathan Hung
> Priority: Critical
> Attachments: YARN-7252-YARN-5734.001.patch
>
>
> Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1
> is active. First, put root.a into STOPPED state, then remove it. Then put rm1
> in standby and rm2 in active. Here's the exception: {noformat}Operation
> failed: Error on refreshAll during transition to Active
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation
> failed
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747)
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 10 more
> Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted
> from the new capacity scheduler configuration, but the queue is not yet in
> stopped state. Current State : RUNNING
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436)
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405)
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736)
> ... 11 more
> Caused by: java.io.IOException: root.a is deleted from the new capacity
> scheduler configuration, but the queue is not yet in stopped state. Current
> State : RUNNING
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432)
> ... 13 more{noformat}
> Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and
> sees it is deleted, it throws exception.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]