Jonathan Hung created YARN-7252:
-----------------------------------
Summary: Removing queue then failing over results in exception
Key: YARN-7252
URL: https://issues.apache.org/jira/browse/YARN-7252
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Jonathan Hung
Assignee: Jonathan Hung
Scenario: two ResourceManagers, rm1 and rm2, with a starting configuration of
root.default and root.a. rm1 is active. First, put root.a into the STOPPED
state, then remove it. Then transition rm1 to standby and rm2 to active. The
transition fails with the following exception:
{noformat}Operation failed: Error on refreshAll during transition to Active
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
... 10 more
Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736)
... 11 more
Caused by: java.io.IOException: root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432)
... 13 more
{noformat}
It seems rm2 does not know that root.a was put into the STOPPED state. When it
finds root.a missing from the new configuration, it still sees the queue as
RUNNING and throws the exception.
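
To illustrate the suspected failure mode, here is a minimal self-contained
sketch. It is not the actual CapacitySchedulerQueueManager code: the class
QueueRemovalSketch, the QueueState enum, and validateRemoval are hypothetical
names, and the sketch assumes each RM validates against its own in-memory view
of queue states. Only the error message wording is taken from the trace above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class QueueRemovalSketch {

    // Hypothetical stand-in for the queue states seen in the report.
    public enum QueueState { RUNNING, STOPPED }

    // Rejects a new configuration that drops a queue which this RM's
    // in-memory view does not consider STOPPED (assumed behavior).
    public static void validateRemoval(Map<String, QueueState> knownQueues,
                                       Set<String> newConfQueues) throws IOException {
        for (Map.Entry<String, QueueState> e : knownQueues.entrySet()) {
            boolean deleted = !newConfQueues.contains(e.getKey());
            if (deleted && e.getValue() != QueueState.STOPPED) {
                throw new IOException(e.getKey()
                    + " is deleted from the new capacity scheduler configuration,"
                    + " but the queue is not yet in stopped state. Current State : "
                    + e.getValue());
            }
        }
    }

    public static void main(String[] args) {
        // New configuration after root.a was removed.
        Set<String> newConf = new HashSet<>(Arrays.asList("root.default"));

        // rm1 was active when root.a was stopped, so its view is STOPPED.
        Map<String, QueueState> rm1View = new HashMap<>();
        rm1View.put("root.default", QueueState.RUNNING);
        rm1View.put("root.a", QueueState.STOPPED);

        // Assumption: rm2 never learned about the stop and still sees RUNNING.
        Map<String, QueueState> rm2View = new HashMap<>();
        rm2View.put("root.default", QueueState.RUNNING);
        rm2View.put("root.a", QueueState.RUNNING);

        try {
            validateRemoval(rm1View, newConf);
            System.out.println("rm1: removal of root.a accepted");
            validateRemoval(rm2View, newConf);
            System.out.println("rm2: removal of root.a accepted");
        } catch (IOException ex) {
            System.out.println("Failed to re-init queues : " + ex.getMessage());
        }
    }
}
```

Under these assumptions the same configuration passes against rm1's view and
fails against rm2's, which matches the failure seen only after failover.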