[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception

Jonathan Hung (JIRA) Tue, 26 Sep 2017 13:30:36 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181512#comment-16181512
 ]


Jonathan Hung commented on YARN-7252:
-------------------------------------

Hi [~leftnoteasy], it will be added to the logs. The issue is only on RM HA - 
on active RM1, we need two operations to delete a queue, 1. STOP 2. delete 
queue. For the standby RM2, it will still have the queue in memory. Then when 
RM2 is put into active, it tries to refresh from the old conf containing the 
queue (in RUNNING state) to the new conf without the queue, which is why the 
exception is thrown.

The patch just bypasses the logic which throws this exception 
(CapacitySchedulerQueueManager#validateQueueHierarchy). On failover (when using 
the mutable conf provider) we don't want to validate, since it was already 
validated on RM1.

> Removing queue then failing over results in exception
> -----------------------------------------------------
>
>                 Key: YARN-7252
>                 URL: https://issues.apache.org/jira/browse/YARN-7252
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Jonathan Hung
>            Assignee: Jonathan Hung
>            Priority: Critical
>         Attachments: YARN-7252-YARN-5734.001.patch
>
>
> Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 
> is active. First, put root.a into STOPPED state, then remove it. Then put rm1 
> in standby and rm2 in active. Here's the exception: {noformat}Operation 
> failed: Error on refreshAll during transition to Active
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
>       at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
>       at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
>       ... 10 more
> Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted 
> from the new capacity scheduler configuration, but the queue is not yet in 
> stopped state. Current State : RUNNING
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736)
>       ... 11 more
> Caused by: java.io.IOException: root.a is deleted from the new capacity 
> scheduler configuration, but the queue is not yet in stopped state. Current 
> State : RUNNING
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432)
>       ... 13 more{noformat}
> Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and 
> sees it is deleted, it throws exception.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception

Reply via email to