[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197793#comment-16197793 ] Hudson commented on YARN-7252: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13057 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13057/]) YARN-7252. Removing queue then failing over results in exception (jhung: rev 1d36b53ab6d9bb1d9144101e424c24371343c5bf) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerQueueManager.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java > Removing queue then failing over results in exception > - > > Key: YARN-7252 > URL: https://issues.apache.org/jira/browse/YARN-7252 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Critical > Fix For: YARN-5734 > > Attachments: YARN-7252-YARN-5734.001.patch, > YARN-7252-YARN-5734.002.patch > > > Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 > is active. First, put root.a into STOPPED state, then remove it. Then put rm1 > in standby and rm2 in active. Here's the exception: {noformat}Operation > failed: Error on refreshAll during transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315) > at > org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) > at > org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation > failed > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307) > ... 10 more > Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted > from the new capacity scheduler configuration, but the queue is not yet in > stopped state. Current State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736) > ... 11 more > Caused by: java.io.IOException: root.a is deleted from the new capacity > scheduler configuration, but the queue is not yet in stopped state. Current > State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432) > ... 13 more{noformat} > Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and > sees it is deleted, it throws exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.ap
[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16183536#comment-16183536 ] Wangda Tan commented on YARN-7252: -- +1, thanks [~jhung]! > Removing queue then failing over results in exception > - > > Key: YARN-7252 > URL: https://issues.apache.org/jira/browse/YARN-7252 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Critical > Attachments: YARN-7252-YARN-5734.001.patch, > YARN-7252-YARN-5734.002.patch > > > Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 > is active. First, put root.a into STOPPED state, then remove it. Then put rm1 > in standby and rm2 in active. Here's the exception: {noformat}Operation > failed: Error on refreshAll during transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315) > at > org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) > at > org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation > failed > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307) > ... 10 more > Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted > from the new capacity scheduler configuration, but the queue is not yet in > stopped state. Current State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736) > ... 11 more > Caused by: java.io.IOException: root.a is deleted from the new capacity > scheduler configuration, but the queue is not yet in stopped state. Current > State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432) > ... 13 more{noformat} > Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and > sees it is deleted, it throws exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181813#comment-16181813 ] Hadoop QA commented on YARN-7252: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} YARN-5734 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 48s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 9m 58s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 10s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} YARN-5734 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 23s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 2 unchanged - 0 fixed = 3 total (was 2) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 14s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 25s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 50m 17s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 97m 27s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSAppStarvation | | Timed out junit tests | org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:71bbb86 | | JIRA Issue | YARN-7252 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12889169/YARN-7252-YARN-5734.002.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 1988b9fa172a 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | YARN-5734 / 9cac5d3 | | Default Java | 1.8.0_144 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/17650/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-serve
[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181728#comment-16181728 ] Jonathan Hung commented on YARN-7252: - [~leftnoteasy] thanks for taking a look. Attached 002 patch. bq. Add isConfigurationMutable to CSContext can avoid type conversion. done bq. Could you add a test to make sure that scheduler still validate queue hierarchy when transition from init->active? (since we don't valid queue hierarchy when fail over, we should check it when RM start. I don't think validateQueueHierarchy is called on RM startup, it's only called on scheduler refresh. (i.e. it is only called in CapacityScheduler#reinitializeQueues, but not in CapacityScheduler#initializeQueues). It's just to validate differences between the old and new queue hierarchy, and on startup there is no old queue hierarchy. Let me know if I'm missing something. > Removing queue then failing over results in exception > - > > Key: YARN-7252 > URL: https://issues.apache.org/jira/browse/YARN-7252 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Critical > Attachments: YARN-7252-YARN-5734.001.patch, > YARN-7252-YARN-5734.002.patch > > > Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 > is active. First, put root.a into STOPPED state, then remove it. Then put rm1 > in standby and rm2 in active. Here's the exception: {noformat}Operation > failed: Error on refreshAll during transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315) > at > org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) > at > org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation > failed > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307) > ... 10 more > Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted > from the new capacity scheduler configuration, but the queue is not yet in > stopped state. Current State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736) > ... 11 more > Caused by: java.io.IOException: root.a is deleted from the new capacity > scheduler configuration, but the queue is not yet in stopped state. Current > State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432) > ... 13 more{noformat} > Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and > sees it is deleted, it throws exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181576#comment-16181576 ] Wangda Tan commented on YARN-7252: -- [~jhung], Thanks for the explanation, overall the fix looks good, few comments: 1) Add isConfigurationMutable to CSContext can avoid type conversion. 2) Could you add a test to make sure that scheduler still validate queue hierarchy when transition from init->active? (since we don't valid queue hierarchy when fail over, we should check it when RM start. > Removing queue then failing over results in exception > - > > Key: YARN-7252 > URL: https://issues.apache.org/jira/browse/YARN-7252 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Critical > Attachments: YARN-7252-YARN-5734.001.patch > > > Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 > is active. First, put root.a into STOPPED state, then remove it. Then put rm1 > in standby and rm2 in active. Here's the exception: {noformat}Operation > failed: Error on refreshAll during transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315) > at > org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) > at > org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation > failed > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307) > ... 10 more > Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted > from the new capacity scheduler configuration, but the queue is not yet in > stopped state. Current State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736) > ... 11 more > Caused by: java.io.IOException: root.a is deleted from the new capacity > scheduler configuration, but the queue is not yet in stopped state. Current > State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432) > ... 13 more{noformat} > Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and > sees it is deleted, it throws exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181570#comment-16181570 ] Hadoop QA commented on YARN-7252: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} YARN-5734 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 11s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 9m 54s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s{color} | {color:green} YARN-5734 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} YARN-5734 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 23s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 8s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 13s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 46m 16s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 92m 19s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueState | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueParsing | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:71bbb86 | | JIRA Issue | YARN-7252 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12889134/YARN-7252-YARN-5734.001.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 43070c6ea174 3.13.0-123-generic #172-Ubuntu SMP Mon Jun 26 18:04:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | YARN-5734 / 9cac5d3 | | Default Java | 1.8.0_144 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/17643/artifact/patchprocess/diff-checks
[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181512#comment-16181512 ] Jonathan Hung commented on YARN-7252: - Hi [~leftnoteasy], it will be added to the logs. The issue is only on RM HA - on active RM1, we need two operations to delete a queue, 1. STOP 2. delete queue. For the standby RM2, it will still have the queue in memory. Then when RM2 is put into active, it tries to refresh from the old conf containing the queue (in RUNNING state) to the new conf without the queue, which is why the exception is thrown. The patch just bypasses the logic which throws this exception (CapacitySchedulerQueueManager#validateQueueHierarchy). On failover (when using the mutable conf provider) we don't want to validate, since it was already validated on RM1. > Removing queue then failing over results in exception > - > > Key: YARN-7252 > URL: https://issues.apache.org/jira/browse/YARN-7252 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Critical > Attachments: YARN-7252-YARN-5734.001.patch > > > Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 > is active. First, put root.a into STOPPED state, then remove it. Then put rm1 > in standby and rm2 in active. Here's the exception: {noformat}Operation > failed: Error on refreshAll during transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315) > at > org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) > at > org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation > failed > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307) > ... 10 more > Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted > from the new capacity scheduler configuration, but the queue is not yet in > stopped state. Current State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736) > ... 11 more > Caused by: java.io.IOException: root.a is deleted from the new capacity > scheduler configuration, but the queue is not yet in stopped state. Current > State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432) > ... 13 more{noformat} > Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and > sees it is deleted, it throws exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181441#comment-16181441 ] Wangda Tan commented on YARN-7252: -- [~jhung], I'm trying to understand the problem: when root.a deleted by admin, will be this added to edit log? If yes, we should not check STOPPED state for a delete queue. > Removing queue then failing over results in exception > - > > Key: YARN-7252 > URL: https://issues.apache.org/jira/browse/YARN-7252 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Jonathan Hung > Attachments: YARN-7252-YARN-5734.001.patch > > > Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 > is active. First, put root.a into STOPPED state, then remove it. Then put rm1 > in standby and rm2 in active. Here's the exception: {noformat}Operation > failed: Error on refreshAll during transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315) > at > org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) > at > org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation > failed > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307) > ... 10 more > Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted > from the new capacity scheduler configuration, but the queue is not yet in > stopped state. Current State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736) > ... 11 more > Caused by: java.io.IOException: root.a is deleted from the new capacity > scheduler configuration, but the queue is not yet in stopped state. Current > State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432) > ... 13 more{noformat} > Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and > sees it is deleted, it throws exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181394#comment-16181394 ] Jonathan Hung commented on YARN-7252: - 001 patch to fix this issue. Ended up going a different approach, the issue only happens when failing over. In this case validation should not be done. So it just checks if it is still in standby (i.e. we are still doing refresh* prior to transitioning to active) and also we are using mutable configuration provider. [~leftnoteasy]/[~xgong] can you take a look at this? Thanks! > Removing queue then failing over results in exception > - > > Key: YARN-7252 > URL: https://issues.apache.org/jira/browse/YARN-7252 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Jonathan Hung > Attachments: YARN-7252-YARN-5734.001.patch > > > Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 > is active. First, put root.a into STOPPED state, then remove it. Then put rm1 > in standby and rm2 in active. Here's the exception: {noformat}Operation > failed: Error on refreshAll during transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315) > at > org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) > at > org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation > failed > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307) > ... 10 more > Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted > from the new capacity scheduler configuration, but the queue is not yet in > stopped state. Current State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736) > ... 11 more > Caused by: java.io.IOException: root.a is deleted from the new capacity > scheduler configuration, but the queue is not yet in stopped state. Current > State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432) > ... 13 more{noformat} > Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and > sees it is deleted, it throws exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
[ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180072#comment-16180072 ] Jonathan Hung commented on YARN-7252: - One approach I see is adding a reinitialize wrapper in CapacityScheduler with a flag to tell it whether to check queue hierarchy or not. Assumption being, on failover, you don't need to check the queue hierarchy since the previous RM did it. Another less intrusive approach would be to replay the logs not read by the standby-now-active RM. This seems wasteful though, and there's a lot more logic required for this change. > Removing queue then failing over results in exception > - > > Key: YARN-7252 > URL: https://issues.apache.org/jira/browse/YARN-7252 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Jonathan Hung > > Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 > is active. First, put root.a into STOPPED state, then remove it. Then put rm1 > in standby and rm2 in active. Here's the exception: {noformat}Operation > failed: Error on refreshAll during transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315) > at > org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) > at > org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation > failed > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307) > ... 10 more > Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted > from the new capacity scheduler configuration, but the queue is not yet in > stopped state. Current State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736) > ... 11 more > Caused by: java.io.IOException: root.a is deleted from the new capacity > scheduler configuration, but the queue is not yet in stopped state. Current > State : RUNNING > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432) > ... 13 more{noformat} > Seems rm2 does not think root.a was STOPPED, so when it can't find root.a and > sees it is deleted, it throws exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org