[jira] [Commented] (YARN-11623) FairScheduler: Document AM preemption related changes (YARN-9537 and YARN-10625)

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792686#comment-17792686
 ] 

ASF GitHub Bot commented on YARN-11623:
---

singer-bin opened a new pull request, #6320:
URL: https://github.com/apache/hadoop/pull/6320

   
   
   ### Description of PR
   
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> FairScheduler: Document AM preemption related changes (YARN-9537 and 
> YARN-10625)
> 
>
> Key: YARN-11623
> URL: https://issues.apache.org/jira/browse/YARN-11623
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: fairscheduler
>Reporter: yanbin.zhang
>Priority: Major
>
> Extend the documentation with these enhancements about YARN-9537 and 
> YARN-10625.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11623) FairScheduler: Document AM preemption related changes (YARN-9537 and YARN-10625)

2023-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-11623:
--
Labels: pull-request-available  (was: )

> FairScheduler: Document AM preemption related changes (YARN-9537 and 
> YARN-10625)
> 
>
> Key: YARN-11623
> URL: https://issues.apache.org/jira/browse/YARN-11623
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: fairscheduler
>Reporter: yanbin.zhang
>Priority: Major
>  Labels: pull-request-available
>
> Extend the documentation with these enhancements about YARN-9537 and 
> YARN-10625.






[jira] [Created] (YARN-11623) FairScheduler: Document AM preemption related changes (YARN-9537 and YARN-10625)

2023-12-03 Thread yanbin.zhang (Jira)
yanbin.zhang created YARN-11623:
---

 Summary: FairScheduler: Document AM preemption related changes 
(YARN-9537 and YARN-10625)
 Key: YARN-11623
 URL: https://issues.apache.org/jira/browse/YARN-11623
 Project: Hadoop YARN
  Issue Type: Task
  Components: fairscheduler
Reporter: yanbin.zhang


Extend the documentation with these enhancements about YARN-9537 and YARN-10625.






[jira] [Commented] (YARN-10631) Document AM preemption related changes (YARN-9537 and YARN-10625)

2023-12-03 Thread yanbin.zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792656#comment-17792656
 ] 

yanbin.zhang commented on YARN-10631:
-

take it up

> Document AM preemption related changes (YARN-9537 and YARN-10625)
> -
>
> Key: YARN-10631
> URL: https://issues.apache.org/jira/browse/YARN-10631
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> Preemption-related changes were introduced in YARN-9537 and YARN-10625.
> These also introduce new properties which are not documented for Fair 
> Scheduler. Extend the documentation with these enhancements.






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792631#comment-17792631
 ] 

Xiaoqiao He commented on YARN-11622:


Hi [~hiwangzhihui], thanks for your report. It's an interesting case. Would you 
mind checking whether any active branches have the same issue? I noticed that 
you marked 3.0.0 and 3.1.3 as the affected versions, and both are EOL now. Thanks.

> ResourceManager asynchronous switch to Standy、Active exception
> --
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0, 3.1.3
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>

[jira] [Commented] (YARN-11613) [Federation] Router CLI Supports Delete SubClusterPolicyConfiguration Of Queues.

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792620#comment-17792620
 ] 

ASF GitHub Bot commented on YARN-11613:
---

hadoop-yetus commented on PR #6295:
URL: https://github.com/apache/hadoop/pull/6295#issuecomment-1837798797

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +0 :ok: |  buf  |   0m  0s |  |  buf was not available.  |
   | +0 :ok: |  buf  |   0m  0s |  |  buf was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 6 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  34m 28s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  35m 51s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   7m 52s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   7m  9s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 57s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   5m  7s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   5m  8s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   4m 48s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   9m 41s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  37m 42s |  |  branch has no errors when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  38m  6s |  |  Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 30s |  |  Maven dependency ordering for patch  |
   | -1 :x: |  mvninstall  |   0m 28s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common.txt) |  hadoop-yarn-server-common in the patch failed.  |
   | -1 :x: |  mvninstall  |   0m 30s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch failed.  |
   | -1 :x: |  mvninstall  |   0m 19s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt) |  hadoop-yarn-server-router in the patch failed.  |
   | -1 :x: |  compile  |   1m 12s | [/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) |  hadoop-yarn in the patch failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.  |
   | -1 :x: |  cc  |   1m 12s | [/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) |  hadoop-yarn in the patch failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.  |
   | -1 :x: |  javac  |   1m 12s | [/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6295/14/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) |  hadoop-yarn in the patch failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.  |
   | -1 :x: |  compile  |   0m 58s | 

[jira] [Commented] (YARN-11621) Fix intermittently failing unit test: TestAMRMProxy.testAMRMProxyTokenRenewal

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792537#comment-17792537
 ] 

ASF GitHub Bot commented on YARN-11621:
---

susheelgupta7 commented on code in PR #6310:
URL: https://github.com/apache/hadoop/pull/6310#discussion_r1413125808


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMProxy.java:
##
@@ -156,13 +156,13 @@ public void testAMRMProxyTokenRenewal() throws Exception {
YarnClient rmClient = YarnClient.createYarnClient()) {
   Configuration conf = new YarnConfiguration();
   conf.setBoolean(YarnConfiguration.AMRM_PROXY_ENABLED, true);
-  conf.setInt(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, 4500);
-  conf.setInt(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS, 4500);
-  conf.setInt(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, 4500);
+  conf.setInt(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, 8000);
+  conf.setInt(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS, 8000);
+  conf.setInt(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, 12000);
   // RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS should be at least
   // RM_AM_EXPIRY_INTERVAL_MS * 1.5 *3
   conf.setInt(
-  YarnConfiguration.RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS, 20);
+  YarnConfiguration.RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS, 37);

Review Comment:
   @slfan1989 According to this 
[comment](https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMProxy.java#L162-L163), 
`RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS should be at least RM_AM_EXPIRY_INTERVAL_MS * 1.5 * 3` 
(i.e. at least 54 sec with the 12000 ms expiry interval set here), but the 
[code](https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/AMRMTokenSecretManager.java#L107-L112) 
states `YarnConfiguration.RM_AMRM_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS + " should be more than 3 X " + YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS` 
(i.e. more than 36 sec). So I followed the code and set it to 37 sec.
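
The two thresholds discussed in this review comment can be checked with a short calculation. The sketch below is only the arithmetic, not Hadoop code: the class and method names are hypothetical, and the rules are taken from the test comment (`* 1.5 * 3`) and the quoted error message (`more than 3 X`) as cited above.

```java
/** Arithmetic sketch only; names are hypothetical and not part of Hadoop. */
public class RollingIntervalCheck {

    /**
     * Smallest whole number of seconds that is strictly greater than
     * 3 x the AM expiry interval (rule quoted from the error message).
     */
    public static long minSecsPerCodeMessage(long amExpiryMs) {
        return (3 * amExpiryMs) / 1000 + 1;
    }

    /**
     * Smallest whole number of seconds that is at least
     * 1.5 x 3 x the AM expiry interval (rule stated in the test comment).
     */
    public static long minSecsPerTestComment(long amExpiryMs) {
        long thresholdMs = amExpiryMs * 9 / 2;   // 1.5 * 3 = 4.5
        return (thresholdMs + 999) / 1000;       // round up to whole seconds
    }

    public static void main(String[] args) {
        long amExpiryMs = 12000; // RM_AM_EXPIRY_INTERVAL_MS value set in the patch
        System.out.println(minSecsPerCodeMessage(amExpiryMs));  // prints 37
        System.out.println(minSecsPerTestComment(amExpiryMs));  // prints 54
    }
}
```

With the 12000 ms expiry interval from the patch, the rule in the error message gives 37 sec as the smallest valid whole-second value, which matches the value chosen in the diff, while the rule in the test comment would require 54 sec.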





> Fix intermittently failing unit test: TestAMRMProxy.testAMRMProxyTokenRenewal
> -
>
> Key: YARN-11621
> URL: https://issues.apache.org/jira/browse/YARN-11621
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.3.6
>Reporter: Susheel Gupta
>Assignee: Susheel Gupta
>Priority: Major
>  Labels: pull-request-available
>
> This test seems to be flaky as it failed 3 times out of 200 runs based on the 
> trunk.
> This was fixed earlier with YARN-7020, but it seems it didn't cover all the 
> flakiness.
> h3.  
> {code:java}
> Error Message
> Application attempt appattempt_1630750910491_0001_01 doesn't exist in 
> ApplicationMasterService cache. at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:407)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor$3.allocate(DefaultRequestInterceptor.java:224)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor.allocate(DefaultRequestInterceptor.java:135)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.amrmproxy.AMRMProxyService.allocate(AMRMProxyService.java:329)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
> Stacktrace
> org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: 
> Application attempt appattempt_1630750910491_0001_01 doesn't exist in 
> ApplicationMasterService cache. at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:407)
>  at 
> 

[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-03 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11622:
--
Affects Version/s: 3.1.3

> ResourceManager asynchronous switch to Standy、Active exception
> --
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0, 3.1.3
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>

[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-03 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11622:
--
Description: 
h1. Two exception cases:
h2. The first case:

*The exception desc:*
{code:java}
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:748)
{code}
 
 * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 
14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
 * As shown in the following figure, Thread_1 resets activeServices to null 
during the toStandby process. At that point, Thread_2 uses activeServices while 
executing the handleTransitionToStandByInNewThread method, which results in a 
NullPointerException and causes the ResourceManager server to exit.

!yuque_diagram.jpg|width=629,height=100!
h2. The second case:

*The exception desc:* 
{code:java}
06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    ... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
    ... 5 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
    ... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
{code}
 * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
toStandby event at 06:17:35. The two asynchronous events are referred to as 
Thread_1 and Thread_2.
 * During the execution of Thread_1, CapacityScheduler.reinitialize is called 
to refresh the scheduler configuration. At this time, the csConfProvider 
property of the CapacityScheduler has not been initialized and its value is 
null. As a result, when the reinitialize method uses csConfProvider, it 
triggers a NullPointerException and causes Thread_1's transition to active to 
fail.

!yuque_diagram (1).jpg|width=568,height=155!
h1. Solution

Due to the limited scope of lock control in ResourceManager's 
transitionToActive and transitionToStandby methods, different events triggered 
asynchronously outside this lock scope can interfere with each other, leading 
to unpredictable issues. The proposed solution is to encapsulate the different 
asynchronous tasks as TransitionToActiveStandbyRunner instances and enqueue 
them to be executed in order by a SingleThreadExecutor. This approach resolves 
the asynchronous problem and provides 
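
The queueing idea in the Solution section can be illustrated with a minimal, self-contained sketch. This is not the ResourceManager patch: the class `SerializedTransitions`, its methods, and the string task names are hypothetical stand-ins for the TransitionToActiveStandbyRunner tasks named above. It only shows that tasks handed to a single-threaded executor run one at a time, in submission order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Illustrative only; not the ResourceManager implementation. */
public class SerializedTransitions {
    // One worker thread runs queued tasks sequentially, in submission order.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    // Hypothetical bookkeeping: records the order in which transitions ran.
    private final List<String> executed =
        Collections.synchronizedList(new ArrayList<>());

    /** Enqueue a transition instead of running it on the caller's thread. */
    public void submit(String transitionName) {
        executor.execute(() -> executed.add(transitionName));
    }

    /** Stop accepting work, wait for queued transitions, return the run order. */
    public List<String> shutdownAndGetOrder() throws InterruptedException {
        executor.shutdown();
        if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
            throw new IllegalStateException("transitions did not finish in time");
        }
        return new ArrayList<>(executed);
    }

    public static void main(String[] args) throws InterruptedException {
        SerializedTransitions transitions = new SerializedTransitions();
        transitions.submit("toStandby");
        transitions.submit("toActive");
        transitions.submit("toStandby");
        // prints [toStandby, toActive, toStandby]
        System.out.println(transitions.shutdownAndGetOrder());
    }
}
```

Because only one worker thread exists, a queued toActive task cannot start while an earlier toStandby task is still resetting shared state, which is the interleaving described in the two exception cases.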

[jira] [Created] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-03 Thread wangzhihui (Jira)
wangzhihui created YARN-11622:
-

 Summary: ResourceManager asynchronous switch to Standy、Active 
exception
 Key: YARN-11622
 URL: https://issues.apache.org/jira/browse/YARN-11622
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0
Reporter: wangzhihui
 Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
yuque_diagram.jpg

h1. Two exception cases:
h2. The first case:

*The exception desc:* 
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - 
Error in dispatcher thread
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:748)

 * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 
14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
 * As shown in the following figure, Thread_1 resets activeServices to null 
during the toStandby process. At that point, Thread_2 uses activeServices while 
executing the handleTransitionToStandByInNewThread method, which results in a 
NullPointerException and causes the ResourceManager server to exit.

 !yuque_diagram.jpg|width=629,height=100!

h2. The second case:

*The exception desc:* 
06:17:35,913 WARN  ha.ActiveStandbyElector 
(ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning 
of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
at 
org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
during transition to Active
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
at 
org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
failed
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
... 5 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager 
(ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
failed

 * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
toStandby event at 06:17:35. The two asynchronous events are referred to as 
Thread_1 and Thread_2.
 * During the execution of Thread_1, CapacityScheduler.reinitialize is called 
to refresh the scheduler configuration. At this point the csConfProvider 
property of the CapacityScheduler has not been initialized and its value is 
null. As a result, reinitialize dereferences csConfProvider, which triggers a 
NullPointerException and causes the transition of Thread_1 to active to fail.

 !yuque_diagram (1).jpg|width=568,height=155!
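
The second failure comes down to a method running before the field it depends on has been assigned. A minimal sketch of that ordering problem follows. It is not the CapacityScheduler implementation: the `ConfProvider` interface, the `Scheduler` class, and the `serviceInit` method body are hypothetical, and only the names csConfProvider and reinitialize come from the report.

```java
class SchedulerInitOrderSketch {
    // Hypothetical configuration source used only in this sketch.
    interface ConfProvider {
        String loadConfiguration();
    }

    static final class Scheduler {
        // Stays null until serviceInit() has run.
        private ConfProvider csConfProvider;

        void serviceInit() {
            csConfProvider = () -> "queue-config";
        }

        // Assumes serviceInit() already ran; throws NullPointerException otherwise.
        String reinitialize() {
            return csConfProvider.loadConfiguration();
        }
    }

    // Returns true if calling reinitialize() before serviceInit() fails with an NPE.
    static boolean reinitializeBeforeInitFails() {
        Scheduler scheduler = new Scheduler();
        try {
            scheduler.reinitialize();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // Returns the configuration when the calls happen in the expected order.
    static String reinitializeAfterInit() {
        Scheduler scheduler = new Scheduler();
        scheduler.serviceInit();
        return scheduler.reinitialize();
    }

    public static void main(String[] args) {
        System.out.println("NPE before init: " + reinitializeBeforeInitFails());
        System.out.println("Config after init: " + reinitializeAfterInit());
    }
}
```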

h1. Solution

Due to the limited scope of lock control in 

[jira] [Commented] (YARN-11561) [Federation] GPG Supports Format PolicyStateStore.

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792507#comment-17792507
 ] 

ASF GitHub Bot commented on YARN-11561:
---

slfan1989 commented on PR #6300:
URL: https://github.com/apache/hadoop/pull/6300#issuecomment-1837438799

   @goiri Thank you very much for your help in reviewing the code!




> [Federation] GPG Supports Format PolicyStateStore.
> --
>
> Key: YARN-11561
> URL: https://issues.apache.org/jira/browse/YARN-11561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>







[jira] [Resolved] (YARN-11561) [Federation] GPG Supports Format PolicyStateStore.

2023-12-03 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11561.
---
    Fix Version/s: 3.4.0
     Hadoop Flags: Reviewed
 Target Version/s: 3.4.0
       Resolution: Fixed

> [Federation] GPG Supports Format PolicyStateStore.
> --
>
> Key: YARN-11561
> URL: https://issues.apache.org/jira/browse/YARN-11561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>







[jira] [Commented] (YARN-11561) [Federation] GPG Supports Format PolicyStateStore.

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792506#comment-17792506
 ] 

ASF GitHub Bot commented on YARN-11561:
---

slfan1989 merged PR #6300:
URL: https://github.com/apache/hadoop/pull/6300




> [Federation] GPG Supports Format PolicyStateStore.
> --
>
> Key: YARN-11561
> URL: https://issues.apache.org/jira/browse/YARN-11561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
>



