[jira] [Commented] (YARN-11623) FairScheduler: Document AM preemption related changes (YARN-9537 and YARN-10625)
[ https://issues.apache.org/jira/browse/YARN-11623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792686#comment-17792686 ]

ASF GitHub Bot commented on YARN-11623:
---

singer-bin opened a new pull request, #6320:
URL: https://github.com/apache/hadoop/pull/6320

### Description of PR

### How was this patch tested?

### For code changes:

- [ ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

> FairScheduler: Document AM preemption related changes (YARN-9537 and YARN-10625)
>
> Key: YARN-11623
> URL: https://issues.apache.org/jira/browse/YARN-11623
> Project: Hadoop YARN
> Issue Type: Task
> Components: fairscheduler
> Reporter: yanbin.zhang
> Priority: Major
>
> Extend the documentation to cover the enhancements from YARN-9537 and YARN-10625.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11623) FairScheduler: Document AM preemption related changes (YARN-9537 and YARN-10625)
[ https://issues.apache.org/jira/browse/YARN-11623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YARN-11623:
--
    Labels: pull-request-available (was: )

> FairScheduler: Document AM preemption related changes (YARN-9537 and YARN-10625)
>
> Key: YARN-11623
> URL: https://issues.apache.org/jira/browse/YARN-11623
> Project: Hadoop YARN
> Issue Type: Task
> Components: fairscheduler
> Reporter: yanbin.zhang
> Priority: Major
> Labels: pull-request-available
>
> Extend the documentation to cover the enhancements from YARN-9537 and YARN-10625.
[jira] [Created] (YARN-11623) FairScheduler: Document AM preemption related changes (YARN-9537 and YARN-10625)
yanbin.zhang created YARN-11623:
---

Summary: FairScheduler: Document AM preemption related changes (YARN-9537 and YARN-10625)
Key: YARN-11623
URL: https://issues.apache.org/jira/browse/YARN-11623
Project: Hadoop YARN
Issue Type: Task
Components: fairscheduler
Reporter: yanbin.zhang

Extend the documentation to cover the enhancements from YARN-9537 and YARN-10625.
[jira] [Commented] (YARN-10631) Document AM preemption related changes (YARN-9537 and YARN-10625)
[ https://issues.apache.org/jira/browse/YARN-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792656#comment-17792656 ]

yanbin.zhang commented on YARN-10631:
-

I'll take this up.

> Document AM preemption related changes (YARN-9537 and YARN-10625)
> -
>
> Key: YARN-10631
> URL: https://issues.apache.org/jira/browse/YARN-10631
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
>
> Preemption-related changes were introduced in YARN-9537 and YARN-10625.
> These also introduce new properties which are not documented for Fair
> Scheduler. Extend the documentation with these enhancements.
[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792631#comment-17792631 ]

Xiaoqiao He commented on YARN-11622:

Hi [~hiwangzhihui], thanks for your report. It's an interesting case. Would you mind checking whether any active branches have the same issue? I just noticed that you marked 3.0.0 and 3.1.3 as affected versions, and both are EOL now. Thanks.

> ResourceManager asynchronous switch to Standy、Active exception
> --
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0, 3.1.3
> Reporter: wangzhihui
> Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, yuque_diagram.jpg
>
> h1. Two exception cases:
> h2. The first case:
> *Exception description:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
>
> * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
> * As shown in the following figure, Thread_1 resets activeServices to null during the toStandby process. At that point, Thread_2 uses activeServices while executing the handleTransitionToStandByInNewThread method, which results in a NullPointerException and causes the ResourceManager server to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *Exception description:*
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>     ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
>     ... 5 more
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
>     ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
> {code}
> * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a toStandby event at 06:17:35. The two asynchronous events are referred to as Thread_1 and Thread_2.
> * During
[jira] [Commented] (YARN-11613) [Federation] Router CLI Supports Delete SubClusterPolicyConfiguration Of Queues.
[ https://issues.apache.org/jira/browse/YARN-11613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792620#comment-17792620 ]

ASF GitHub Bot commented on YARN-11613:
---

hadoop-yetus commented on PR #6295:
URL: https://github.com/apache/hadoop/pull/6295#issuecomment-1837798797

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 50s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | buf | 0m 0s | | buf was not available. |
| +0 :ok: | buf | 0m 0s | | buf was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 6 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 34m 28s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 35m 51s | | trunk passed |
| +1 :green_heart: | compile | 7m 52s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 7m 9s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 57s | | trunk passed |
| +1 :green_heart: | mvnsite | 5m 7s | | trunk passed |
| +1 :green_heart: | javadoc | 5m 8s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 4m 48s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 9m 41s | | trunk passed |
| +1 :green_heart: | shadedclient | 37m 42s | | branch has no errors when building and testing our client artifacts. |
| -0 :warning: | patch | 38m 6s | | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 30s | | Maven dependency ordering for patch |
| -1 :x: | mvninstall | 0m 28s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common.txt) | hadoop-yarn-server-common in the patch failed. |
| -1 :x: | mvninstall | 0m 30s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch failed. |
| -1 :x: | mvninstall | 0m 19s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt) | hadoop-yarn-server-router in the patch failed. |
| -1 :x: | compile | 1m 12s | [/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) | hadoop-yarn in the patch failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04. |
| -1 :x: | cc | 1m 12s | [/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) | hadoop-yarn in the patch failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04. |
| -1 :x: | javac | 1m 12s | [/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) | hadoop-yarn in the patch failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04. |
| -1 :x: | compile | 0m 58s |
[jira] [Commented] (YARN-11621) Fix intermittently failing unit test: TestAMRMProxy.testAMRMProxyTokenRenewal
[ https://issues.apache.org/jira/browse/YARN-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792537#comment-17792537 ]

ASF GitHub Bot commented on YARN-11621:
---

susheelgupta7 commented on code in PR #6310:
URL: https://github.com/apache/hadoop/pull/6310#discussion_r1413125808

##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMProxy.java:
##
@@ -156,13 +156,13 @@ public void testAMRMProxyTokenRenewal() throws Exception {
         YarnClient rmClient = YarnClient.createYarnClient()) {
       Configuration conf = new YarnConfiguration();
       conf.setBoolean(YarnConfiguration.AMRM_PROXY_ENABLED, true);
-      conf.setInt(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, 4500);
-      conf.setInt(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS, 4500);
-      conf.setInt(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, 4500);
+      conf.setInt(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, 8000);
+      conf.setInt(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS, 8000);
+      conf.setInt(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, 12000);
       // RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS should be at least
       // RM_AM_EXPIRY_INTERVAL_MS * 1.5 *3
       conf.setInt(
-          YarnConfiguration.RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS, 20);
+          YarnConfiguration.RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS, 37);

Review Comment:
@slfan1989 According to this [comment](https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMProxy.java#L162-L163), `// RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS should be at least // RM_AM_EXPIRY_INTERVAL_MS * 1.5 *3` (i.e. it should be at least 54 sec), but the [code](https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/AMRMTokenSecretManager.java#L107-L112) states `YarnConfiguration.RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS + " should be more than 3 X " + YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS`. So I followed the code and set it to 37 sec.

> Fix intermittently failing unit test: TestAMRMProxy.testAMRMProxyTokenRenewal
> -
>
> Key: YARN-11621
> URL: https://issues.apache.org/jira/browse/YARN-11621
> Project: Hadoop YARN
> Issue Type: Test
> Components: yarn
> Affects Versions: 3.3.6
> Reporter: Susheel Gupta
> Assignee: Susheel Gupta
> Priority: Major
> Labels: pull-request-available
>
> This test seems to be flaky: it failed 3 times out of 200 runs on trunk.
> This was fixed earlier with YARN-7020, but that fix does not seem to have covered all the flakiness.
> h3.
> {code:java}
> Error Message
> Application attempt appattempt_1630750910491_0001_01 doesn't exist in ApplicationMasterService cache.
>     at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:407)
>     at org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor$3.allocate(DefaultRequestInterceptor.java:224)
>     at org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor.allocate(DefaultRequestInterceptor.java:135)
>     at org.apache.hadoop.yarn.server.nodemanager.amrmproxy.AMRMProxyService.allocate(AMRMProxyService.java:329)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>     at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
> Stacktrace
> org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1630750910491_0001_01 doesn't exist in ApplicationMasterService cache.
>     at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:407)
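The arithmetic behind the two candidate values in the review comment (54 sec versus 37 sec) can be checked with a small standalone sketch. It does not call any Hadoop API: the class and method names are hypothetical, and the "strictly greater than factor x AM expiry" rule is an assumption modeled on the error message quoted in the comment, not a copy of the `AMRMTokenSecretManager` logic.

```java
// Hypothetical helper, not part of Hadoop: works out the smallest whole-second
// rolling interval that is strictly greater than factor x the AM expiry interval.
final class RollingIntervalCheck {

    // Smallest integer number of seconds strictly greater than amExpiryMs * factor.
    static long minRollingIntervalSecs(long amExpiryMs, double factor) {
        double thresholdMs = amExpiryMs * factor;
        return (long) Math.floor(thresholdMs / 1000.0) + 1;
    }

    // True when the rolling interval (in seconds) exceeds factor x the AM expiry (in ms).
    static boolean satisfies(long rollingSecs, long amExpiryMs, double factor) {
        return rollingSecs * 1000L > amExpiryMs * factor;
    }

    public static void main(String[] args) {
        long amExpiryMs = 12_000L; // RM_AM_EXPIRY_INTERVAL_MS value from the patch

        // Rule quoted from the code's message: "more than 3 X" the AM expiry.
        System.out.println(minRollingIntervalSecs(amExpiryMs, 3.0));
        // Rule from the test comment: expiry * 1.5 * 3.
        System.out.println(amExpiryMs * 1.5 * 3 / 1000.0);
        // The chosen value passes the first rule but not the second.
        System.out.println(satisfies(37, amExpiryMs, 3.0));
        System.out.println(satisfies(37, amExpiryMs, 4.5));
    }
}
```

Under these assumptions, 37 is the smallest value that passes the "more than 3 X" rule for a 12000 ms expiry, while the stricter reading of the test comment would need 54 sec or more.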
[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangzhihui updated YARN-11622:
--
    Affects Version/s: 3.1.3

> ResourceManager asynchronous switch to Standy、Active exception
> --
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0, 3.1.3
> Reporter: wangzhihui
> Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, yuque_diagram.jpg
[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangzhihui updated YARN-11622:
--
    Description:

h1. Two exception cases:
h2. The first case:
*Exception description:*
{code:java}
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:748)
{code}
* ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
* As shown in the following figure, Thread_1 resets activeServices to null during the toStandby process. At that point, Thread_2 uses activeServices while executing the handleTransitionToStandByInNewThread method, which results in a NullPointerException and causes the ResourceManager server to exit.

!yuque_diagram.jpg|width=629,height=100!

h2. The second case:
*Exception description:*
{code:java}
06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    ... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
    ... 5 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
    ... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
{code}
* ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a toStandby event at 06:17:35. The two asynchronous events are referred to as Thread_1 and Thread_2.
* During the execution of Thread_1, CapacityScheduler.reinitialize is called to refresh the scheduler configuration. At this time, the csConfProvider property of the CapacityScheduler has not been initialized and its value is null. As a result, when the reinitialize method runs and uses csConfProvider, it triggers a NullPointerException and causes Thread_1's transition to active to fail.

!yuque_diagram (1).jpg|width=568,height=155!

h1. Solution
Because the locks in ResourceManager's transitionToActive and transitionToStandby methods cover only a limited scope, different events triggered asynchronously outside that scope can interfere with each other, leading to unpredictable issues.
The proposed solution is to encapsulate the different asynchronous tasks as TransitionToActiveStandbyRunner and enqueue them in a queue to be executed in order by a SingleThreadExecutor. This approach resolves the asynchronous problem and provides
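The idea in the Solution section, running every transition request in order on one thread, can be illustrated with a minimal sketch. This is not the patch for YARN-11622 and not the actual TransitionToActiveStandbyRunner: the class `TransitionSerializer`, the `HaState` enum, and the method names below are hypothetical, and only standard `java.util.concurrent` calls are used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: serialize HA transitions through a single-thread executor
// so that two asynchronous triggers can never interleave.
final class TransitionSerializer {
    enum HaState { STANDBY, ACTIVE }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final List<String> log = Collections.synchronizedList(new ArrayList<>());
    // Only read and written on the executor thread, so no extra locking is needed.
    private HaState state = HaState.STANDBY;

    // Enqueue a transition request; requests run one at a time in submission order.
    Future<?> submitTransition(HaState target, String source) {
        return executor.submit(() -> {
            if (state == target) {
                log.add(source + ": already " + target);
                return;
            }
            state = target;
            log.add(source + ": -> " + target);
        });
    }

    // Wait for queued transitions to finish and return what happened, in order.
    List<String> drainAndStop() {
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(log);
    }
}
```

In this sketch a caller such as the elector or the state store would call `submitTransition(...)` instead of starting its own thread. Because the executor has a single worker, a toStandby request cannot reset shared state while another request is still using it, which is the interleaving described in both exception cases.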
[jira] [Created] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
wangzhihui created YARN-11622:
-

Summary: ResourceManager asynchronous switch to Standy、Active exception
Key: YARN-11622
URL: https://issues.apache.org/jira/browse/YARN-11622
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.0.0
Reporter: wangzhihui
Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, yuque_diagram.jpg

h1. Two exception cases:
h2. The first case:
*Exception description:*
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:748)
* ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
* As shown in the following figure, Thread_1 resets activeServices to null during the toStandby process. At that point, Thread_2 uses activeServices while executing the handleTransitionToStandByInNewThread method, which results in a NullPointerException and causes the ResourceManager server to exit.

!yuque_diagram.jpg|width=629,height=100!

h2. The second case:
*Exception description:*
06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    ... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
    ... 5 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
    ... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
* ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a toStandby event at 06:17:35. The two asynchronous events are referred to as Thread_1 and Thread_2.
* During the execution of Thread_1, CapacityScheduler.reinitialize is called to refresh the scheduler configuration. At this time, the csConfProvider property of the CapacityScheduler has not been initialized and its value is null. As a result, when the reinitialize method runs and uses csConfProvider, it triggers a NullPointerException and causes Thread_1's transition to active to fail.

!yuque_diagram (1).jpg|width=568,height=155!

h1. Solution
Due to the limited scope of lock control in
[jira] [Commented] (YARN-11561) [Federation] GPG Supports Format PolicyStateStore.
[ https://issues.apache.org/jira/browse/YARN-11561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792507#comment-17792507 ]

ASF GitHub Bot commented on YARN-11561:
---

slfan1989 commented on PR #6300:
URL: https://github.com/apache/hadoop/pull/6300#issuecomment-1837438799

@goiri Thank you very much for your help in reviewing the code!

> [Federation] GPG Supports Format PolicyStateStore.
> --
>
> Key: YARN-11561
> URL: https://issues.apache.org/jira/browse/YARN-11561
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: federation
> Affects Versions: 3.4.0
> Reporter: Shilun Fan
> Assignee: Shilun Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
[jira] [Resolved] (YARN-11561) [Federation] GPG Supports Format PolicyStateStore.
[ https://issues.apache.org/jira/browse/YARN-11561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shilun Fan resolved YARN-11561.
---
    Fix Version/s: 3.4.0
    Hadoop Flags: Reviewed
    Target Version/s: 3.4.0
    Resolution: Fixed

> [Federation] GPG Supports Format PolicyStateStore.
> --
>
> Key: YARN-11561
> URL: https://issues.apache.org/jira/browse/YARN-11561
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: federation
> Affects Versions: 3.4.0
> Reporter: Shilun Fan
> Assignee: Shilun Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
[jira] [Commented] (YARN-11561) [Federation] GPG Supports Format PolicyStateStore.
[ https://issues.apache.org/jira/browse/YARN-11561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792506#comment-17792506 ]

ASF GitHub Bot commented on YARN-11561:
---

slfan1989 merged PR #6300:
URL: https://github.com/apache/hadoop/pull/6300

> [Federation] GPG Supports Format PolicyStateStore.
> --
>
> Key: YARN-11561
> URL: https://issues.apache.org/jira/browse/YARN-11561
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: federation
> Affects Versions: 3.4.0
> Reporter: Shilun Fan
> Assignee: Shilun Fan
> Priority: Major
> Labels: pull-request-available