Jonathan Hung created YARN-7252:
-----------------------------------

             Summary: Removing queue then failing over results in exception
                 Key: YARN-7252
                 URL: https://issues.apache.org/jira/browse/YARN-7252
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Jonathan Hung
            Assignee: Jonathan Hung

Scenario: two ResourceManagers, rm1 and rm2, with a starting queue configuration of root.default and root.a. rm1 is active. First, put root.a into the STOPPED state, then remove it. Then transition rm1 to standby and rm2 to active. The transition fails with this exception:

{noformat}
Operation failed: Error on refreshAll during transition to Active
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
	at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
	... 10 more
Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736)
	... 11 more
Caused by: java.io.IOException: root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432)
	... 13 more
{noformat}

It seems rm2 does not know that root.a was STOPPED. When it re-initializes its queues and finds that root.a is missing from the new configuration, it treats this as the deletion of a RUNNING queue and throws the exception.
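
The suspected failure mode can be illustrated with a minimal Java sketch. This is not the Hadoop code: QueueRemovalSketch, QueueState, and validateRemovedQueues are hypothetical names that only model the check reported in the trace, and the sketch assumes rm2's in-memory view still has root.a as RUNNING because it never saw the STOPPED update.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the queue-removal check; not Hadoop's implementation.
public class QueueRemovalSketch {
    enum QueueState { RUNNING, STOPPED }

    // Rejects the new configuration if a queue known to this RM is missing
    // from it while this RM still believes the queue is not STOPPED.
    static void validateRemovedQueues(Map<String, QueueState> knownQueues,
                                      Set<String> newConfigQueues) throws IOException {
        for (Map.Entry<String, QueueState> e : knownQueues.entrySet()) {
            boolean removed = !newConfigQueues.contains(e.getKey());
            if (removed && e.getValue() != QueueState.STOPPED) {
                throw new IOException(e.getKey()
                    + " is deleted from the new capacity scheduler configuration,"
                    + " but the queue is not yet in stopped state. Current State : "
                    + e.getValue());
            }
        }
    }

    public static void main(String[] args) {
        // Configuration after root.a has been removed.
        Set<String> newConfig = new HashSet<>(Arrays.asList("root.default"));

        // rm1 processed the STOPPED transition before the removal.
        Map<String, QueueState> rm1View = new LinkedHashMap<>();
        rm1View.put("root.default", QueueState.RUNNING);
        rm1View.put("root.a", QueueState.STOPPED);

        // Assumption: rm2 never learned that root.a was stopped.
        Map<String, QueueState> rm2View = new LinkedHashMap<>();
        rm2View.put("root.default", QueueState.RUNNING);
        rm2View.put("root.a", QueueState.RUNNING);

        for (Map<String, QueueState> view : Arrays.asList(rm1View, rm2View)) {
            try {
                validateRemovedQueues(view, newConfig);
                System.out.println("re-init accepted for view " + view);
            } catch (IOException ex) {
                System.out.println("re-init rejected: " + ex.getMessage());
            }
        }
    }
}
```

Under this model the same configuration is accepted with rm1's view and rejected with rm2's stale view, which matches the reported symptom. Whether the real cause is stale in-memory state on rm2 or a queue state that is not persisted across failover is not established here.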