[jira] [Created] (YARN-7252) Removing queue then failing over results in exception

Jonathan Hung (JIRA) Mon, 25 Sep 2017 17:27:33 -0700

Jonathan Hung created YARN-7252:
-----------------------------------

             Summary: Removing queue then failing over results in exception
                 Key: YARN-7252
                 URL: https://issues.apache.org/jira/browse/YARN-7252
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Jonathan Hung
            Assignee: Jonathan Hung



Scenario: two ResourceManagers, rm1 and rm2, with a starting configuration
containing root.default and root.a. rm1 is active. First, put root.a into the
STOPPED state, then remove it. Then transition rm1 to standby and rm2 to
active. The transition fails with this exception:
{noformat}
Operation failed: Error on refreshAll during transition to Active
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
        at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
        ... 10 more
Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736)
        ... 11 more
Caused by: java.io.IOException: root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432)
        ... 13 more
{noformat}
It seems rm2 never registers that root.a was STOPPED: it still holds root.a as
RUNNING, so when it reloads the configuration and finds root.a deleted, it
throws the exception.
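
To make the suspected failure mode concrete, here is a minimal toy model of
the check named in the exception message. It is only a sketch under stated
assumptions: QueueRemovalSketch, QueueState, validateRemovedQueues and
failoverOutcome are hypothetical names, not the CapacityScheduler API, and
the premise that rm2 still holds root.a as RUNNING is the hypothesis above,
not a confirmed root cause.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model only: the class, enum, and method names here are hypothetical
// and are not the CapacityScheduler API.
public class QueueRemovalSketch {

    enum QueueState { RUNNING, STOPPED }

    // Rule taken from the exception message: a queue that is absent from the
    // new configuration must already be STOPPED in the RM's in-memory state.
    static void validateRemovedQueues(Map<String, QueueState> inMemory,
                                      Set<String> newConfigQueues) throws IOException {
        for (Map.Entry<String, QueueState> e : inMemory.entrySet()) {
            boolean deleted = !newConfigQueues.contains(e.getKey());
            if (deleted && e.getValue() != QueueState.STOPPED) {
                throw new IOException(e.getKey()
                    + " is deleted from the new capacity scheduler configuration,"
                    + " but the queue is not yet in stopped state. Current State : "
                    + e.getValue());
            }
        }
    }

    // Simulates rm2 becoming active with the given in-memory state for root.a.
    // Returns "OK" if validation passes, otherwise the error message.
    static String failoverOutcome(QueueState rm2ViewOfRootA) {
        Map<String, QueueState> inMemory = new LinkedHashMap<>();
        inMemory.put("root.default", QueueState.RUNNING);
        inMemory.put("root.a", rm2ViewOfRootA);
        // New configuration after root.a was removed while rm1 was active.
        Set<String> newConfigQueues = Collections.singleton("root.default");
        try {
            validateRemovedQueues(inMemory, newConfigQueues);
            return "OK";
        } catch (IOException ex) {
            return ex.getMessage();
        }
    }

    public static void main(String[] args) {
        // rm2 saw the STOPPED transition: removal validates.
        System.out.println(failoverOutcome(QueueState.STOPPED));
        // rm2 still believes root.a is RUNNING (assumed): removal is rejected.
        System.out.println(failoverOutcome(QueueState.RUNNING));
    }
}
```

In this model the removal only validates when the RM performing the check has
itself recorded the STOPPED state, which is consistent with the failure seen
on rm2 after failover.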



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

