[jira] [Resolved] (YARN-11688) FS-CS converter: call System.exit replaced with ExitUtil.halt

2024-04-20 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui resolved YARN-11688.
---
Resolution: Resolved

> FS-CS converter: call System.exit replaced with ExitUtil.halt
> -
>
> Key: YARN-11688
> URL: https://issues.apache.org/jira/browse/YARN-11688
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: image-2024-04-20-22-17-49-522.png
>
>
> YARN-10191 added a System.exit call to avoid cases where the tool never 
> terminates.
> That call makes the forked test VM exit while 
> TestFSConfigToCSConfigConverterMain is running.
> The ExitUtil class in hadoop-common supports process termination in a way 
> that suits tests and debugging.
> Replacing the System.exit call with ExitUtil.halt is a better fit for this 
> purpose.
> {code:java}
> Crashed tests:
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
> org.apache.maven.surefire.booter.SurefireBooterForkException: 
> ExecutionException The forked VM terminated without properly saying goodbye. 
> VM crash or System.exit called?
> Command was /bin/sh -c cd 
> /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
>  && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m 
> -XX:+HeapDumpOnOutOfMemoryError -jar 
> /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire/surefirebooter2247421570320659117.jar
>  
> /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire
>  2024-04-17T14-34-01_743-jvmRun1 surefire5773923906402489727tmp 
> surefire_1524181064953128391099tmp
> Process Exit Code: 0
> Crashed tests:
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
>   at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:511)
>   at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:458)
>   at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:299)
>   at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:247)
>   at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1149)
>   at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:991)
>   at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:837)
>   at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
>   at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
>   at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
>   at org.apache.maven.cli.MavenCli.execute(MavenCli.java:956)
>   at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
>   at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>   at 
> 

[jira] [Created] (YARN-11688) FS-CS converter: call System.exit replaced with ExitUtil.halt

2024-04-20 Thread wangzhihui (Jira)
wangzhihui created YARN-11688:
-

 Summary: FS-CS converter: call System.exit replaced with 
ExitUtil.halt
 Key: YARN-11688
 URL: https://issues.apache.org/jira/browse/YARN-11688
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: wangzhihui
Assignee: wangzhihui
 Fix For: 3.3.0
 Attachments: image-2024-04-20-22-17-49-522.png

YARN-10191 added a System.exit call to avoid cases where the tool never 
terminates.

That call makes the forked test VM exit while 
TestFSConfigToCSConfigConverterMain is running.

The ExitUtil class in hadoop-common supports process termination in a way 
that suits tests and debugging.

Replacing the System.exit call with ExitUtil.halt is a better fit for this 
purpose.
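
The idea can be sketched as follows. This is a minimal, self-contained stand-in for the pattern, not the Hadoop ExitUtil source: the class HaltSketch, its methods, and toolMain are hypothetical names chosen for illustration. In Hadoop itself, the analogous calls would be ExitUtil.disableSystemHalt() in the test and ExitUtil.halt(status) in the tool.

```java
/** Hypothetical stand-in for a test-friendly halt helper; not Hadoop code. */
final class HaltSketch {
  /** Thrown instead of halting the JVM when halting has been disabled. */
  static final class HaltException extends RuntimeException {
    final int status;

    HaltException(int status, String message) {
      super(message);
      this.status = status;
    }
  }

  private static volatile boolean haltDisabled = false;

  /** Tests call this first so that halt() throws instead of killing the VM. */
  static void disableHalt() {
    haltDisabled = true;
  }

  /** Halts the JVM, or throws HaltException when halting is disabled. */
  static void halt(int status) {
    if (haltDisabled) {
      throw new HaltException(status, "halt(" + status + ") intercepted");
    }
    Runtime.getRuntime().halt(status);
  }

  /** Stand-in for a tool entry point that must always terminate. */
  static void toolMain(String[] args) {
    int status = args.length == 0 ? -1 : 0;
    halt(status);
  }

  public static void main(String[] args) {
    disableHalt();
    try {
      toolMain(new String[] {"--help"});
    } catch (HaltException e) {
      // The forked test VM stays alive and the status can be asserted on.
      System.out.println("intercepted status " + e.status);
    }
  }
}
```

With a plain System.exit in toolMain, the test runner's forked VM would exit before it could report results, which matches the Surefire error below.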
{code:java}
Crashed tests:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
org.apache.maven.surefire.booter.SurefireBooterForkException: 
ExecutionException The forked VM terminated without properly saying goodbye. VM 
crash or System.exit called?
Command was /bin/sh -c cd 
/home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m 
-XX:+HeapDumpOnOutOfMemoryError -jar 
/home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire/surefirebooter2247421570320659117.jar
 
/home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire
 2024-04-17T14-34-01_743-jvmRun1 surefire5773923906402489727tmp 
surefire_1524181064953128391099tmp
Process Exit Code: 0
Crashed tests:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:511)
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:458)
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:299)
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:247)
at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1149)
at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:991)
at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:837)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:956)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: The 
forked VM terminated without properly saying goodbye. VM crash or System.exit 
called?
Command was 

[jira] [Commented] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size

2024-03-31 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832700#comment-17832700
 ] 

wangzhihui commented on YARN-11582:
---

Hi [~hexiaoqiao], we have submitted a small optimization for the 
ResourceManager. Please help review and merge it.

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications, resourcemanager
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png, 
> image-2024-03-28-22-11-37-903.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, a newly submitted job's AM may stay in 
> the state "Application is Activated, waiting for resources to be assigned 
> for AM". This happens because YARN does not have enough resources to 
> allocate another AM Container, so we want to know how large the requested AM 
> Container is. Unfortunately, the current diagnosticMessage on the 
> web page does not show this data. Therefore, the resource size of the AM 
> Container should be added to the diagnosticMessage, which would be 
> very useful for troubleshooting production faults.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size

2024-03-28 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831827#comment-17831827
 ] 

wangzhihui edited comment on YARN-11582 at 3/28/24 2:15 PM:


Hi [~slfan1989], this [PR|https://github.com/apache/hadoop/pull/6139] now 
includes valid tests and has passed the latest Jenkins check; please help merge 
it. Thanks!
 


was (Author: JIRAUSER302479):
hi, [~slfan1989] This  [PR|https://github.com/apache/hadoop/pull/6139] has 
added valid Test content and passed the latest Jenkins check; please help merge 
it. Thanks!
 

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications, resourcemanager
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png, 
> image-2024-03-28-22-11-37-903.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, a newly submitted job's AM may stay in 
> the state "Application is Activated, waiting for resources to be assigned 
> for AM". This happens because YARN does not have enough resources to 
> allocate another AM Container, so we want to know how large the requested AM 
> Container is. Unfortunately, the current diagnosticMessage on the 
> web page does not show this data. Therefore, the resource size of the AM 
> Container should be added to the diagnosticMessage, which would be 
> very useful for troubleshooting production faults.






[jira] [Commented] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size

2024-03-28 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831827#comment-17831827
 ] 

wangzhihui commented on YARN-11582:
---

Hi [~slfan1989], this [PR|https://github.com/apache/hadoop/pull/6139] now 
includes valid tests and has passed the latest Jenkins check; please help merge 
it. Thanks!
 

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications, resourcemanager
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png, 
> image-2024-03-28-22-11-37-903.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, a newly submitted job's AM may stay in 
> the state "Application is Activated, waiting for resources to be assigned 
> for AM". This happens because YARN does not have enough resources to 
> allocate another AM Container, so we want to know how large the requested AM 
> Container is. Unfortunately, the current diagnosticMessage on the 
> web page does not show this data. Therefore, the resource size of the AM 
> Container should be added to the diagnosticMessage, which would be 
> very useful for troubleshooting production faults.






[jira] [Updated] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size

2024-03-28 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11582:
--
Attachment: image-2024-03-28-22-11-37-903.png

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications, resourcemanager
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png, 
> image-2024-03-28-22-11-37-903.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, a newly submitted job's AM may stay in 
> the state "Application is Activated, waiting for resources to be assigned 
> for AM". This happens because YARN does not have enough resources to 
> allocate another AM Container, so we want to know how large the requested AM 
> Container is. Unfortunately, the current diagnosticMessage on the 
> web page does not show this data. Therefore, the resource size of the AM 
> Container should be added to the diagnosticMessage, which would be 
> very useful for troubleshooting production faults.






[jira] [Updated] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size

2024-03-27 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11582:
--
Priority: Minor  (was: Major)

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications, resourcemanager
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, a newly submitted job's AM may stay in 
> the state "Application is Activated, waiting for resources to be assigned 
> for AM". This happens because YARN does not have enough resources to 
> allocate another AM Container, so we want to know how large the requested AM 
> Container is. Unfortunately, the current diagnosticMessage on the 
> web page does not show this data. Therefore, the resource size of the AM 
> Container should be added to the diagnosticMessage, which would be 
> very useful for troubleshooting production faults.






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-19 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808866#comment-17808866
 ] 

wangzhihui commented on YARN-11622:
---

hi, [~slfan1989] 

I'm sorry, I've been busy lately and could not respond promptly to the 
questions and details on this issue.
I have added a test case, testTransitionedToStandbyShouldNotNPE, that 
reproduces the problem described in YARN-11622.
One SpotBugs warning remains, and we still need to discuss how to 
handle it.
Looking forward to your reply. Thank you.
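
For readers of this thread, the interleaving behind the first exception case quoted below can be replayed deterministically in a few lines. This is only an illustrative sketch under stated assumptions: StandbyRaceSketch, ActiveServices, and the handler methods are hypothetical names, and the null check shown is one possible guard, not necessarily the approach taken in the patch.

```java
/** Deterministic replay of the interleaving; all names are hypothetical. */
final class StandbyRaceSketch {
  /** Stand-in for the services object that a standby transition replaces. */
  static final class ActiveServices {
    private boolean stopped;

    void stop() {
      stopped = true;
    }
  }

  private ActiveServices activeServices = new ActiveServices();

  /** First half of a standby transition: the shared field is cleared. */
  synchronized void beginStandby() {
    activeServices = null;
  }

  /** Second half of the transition: a fresh instance is installed. */
  synchronized void finishStandby() {
    activeServices = new ActiveServices();
  }

  /** Unguarded handler: dereferences the field and fails if it is null. */
  void unguardedHandler() {
    activeServices.stop();
  }

  /** Guarded handler: reads the field once under the lock and checks it. */
  synchronized boolean guardedHandler() {
    ActiveServices current = activeServices;
    if (current == null) {
      return false; // another transition is in progress, nothing to stop
    }
    current.stop();
    return true;
  }

  public static void main(String[] args) {
    StandbyRaceSketch rm = new StandbyRaceSketch();
    rm.beginStandby(); // "Thread_1" is in the middle of its transition
    try {
      rm.unguardedHandler(); // "Thread_2" arrives at this moment
    } catch (NullPointerException e) {
      System.out.println("unguarded handler hit a NullPointerException");
    }
    System.out.println("guarded handler acted: " + rm.guardedHandler());
    rm.finishStandby();
  }
}
```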

> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event 
> at 14:52:57. The two asynchronous events are referred to as Thread_1 and 
> Thread_2.
>  * As shown in the following figure, Thread_1 resets activeServices to null 
> during the toStandby process. At that point, Thread_2 uses activeServices 
> while executing the handleTransitionToStandByInNewThread method, which 
> results in a NullPointerException and causes the ResourceManager server to 
> exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: 

[jira] [Created] (YARN-11626) Optimization of the safeDelete operation in ZKRMStateStore

2023-12-05 Thread wangzhihui (Jira)
wangzhihui created YARN-11626:
-

 Summary: Optimization of the safeDelete operation in ZKRMStateStore
 Key: YARN-11626
 URL: https://issues.apache.org/jira/browse/YARN-11626
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 3.3.0, 3.1.1, 3.0.0-alpha4
Reporter: wangzhihui


h1. Description 
 * The log shows that removal of the app info started at 06:17:20, but the 
NoNodeException was received at 06:17:35. 
 * During the 15s interval, Curator was retrying the metadata operation. Because 
the ZooKeeper delete operation is not idempotent, one of the retry attempts 
succeeded on the server but its response was never received. The next retry 
then failed with a NoNodeException, which triggered the STATE_STORE_FENCED 
event and ultimately caused the current ResourceManager to switch to standby.

{code:java}
2023-10-28 06:17:20,359 INFO  recovery.RMStateStore 
(RMStateStore.java:transition(333)) - Removing info for app: 
application_1697410508608_140368
2023-10-28 06:17:20,359 INFO  resourcemanager.RMAppManager 
(RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be 
expired, max number of completed apps kept in memory met: 
maxCompletedAppsInMemory = 1000, removing app application_1697410508608_140368 
from memory:
2023-10-28 06:17:35,665 ERROR recovery.RMStateStore 
(RMStateStore.java:transition(337)) - Error removing app: 
application_1697410508608_140368
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
2023-10-28 06:17:35,666 INFO  recovery.RMStateStore 
(RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from 
ACTIVE to FENCED
2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager 
(ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
STATE_STORE_FENCED, caused by 
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2023-10-28 06:17:35,666 INFO  resourcemanager.ResourceManager 
(ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby 
state
 {code}
h1. Solution

The NoNodeException clearly indicates that the Znode no longer exists, so we 
can safely ignore this exception to avoid triggering a larger impact on the 
cluster caused by ResourceManager failover.
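
The proposal can be sketched as below. This is a self-contained illustration that uses a small in-memory store instead of ZooKeeper or Curator: Store, NoNodeException, deleteIgnoringNoNode, and the example path are all hypothetical stand-ins, so it shows only the intended control flow, not the ZKRMStateStore change itself.

```java
/** Sketch of a delete that treats "already gone" as success. */
final class SafeDeleteSketch {
  /** Hypothetical stand-in for a "node does not exist" error. */
  static final class NoNodeException extends Exception {
    NoNodeException(String path) {
      super("NoNode for " + path);
    }
  }

  /** In-memory store whose delete fails when the path is absent. */
  static final class Store {
    private final java.util.Set<String> paths = new java.util.HashSet<>();

    void create(String path) {
      paths.add(path);
    }

    boolean exists(String path) {
      return paths.contains(path);
    }

    void delete(String path) throws NoNodeException {
      if (!paths.remove(path)) {
        throw new NoNodeException(path);
      }
    }
  }

  /**
   * Deletes the path and returns true if this call removed it, or false if
   * it was already absent (for example, removed by an earlier retry whose
   * response was lost).
   */
  static boolean deleteIgnoringNoNode(Store store, String path) {
    try {
      store.delete(path);
      return true;
    } catch (NoNodeException e) {
      // The desired end state (path absent) already holds, so do not escalate.
      return false;
    }
  }

  public static void main(String[] args) {
    Store store = new Store();
    store.create("/rmstore/app_1");
    System.out.println(deleteIgnoringNoNode(store, "/rmstore/app_1"));
    System.out.println(deleteIgnoringNoNode(store, "/rmstore/app_1"));
  }
}
```

The second call stands for the retry that follows a successful delete with a lost response: it reports that nothing was removed instead of raising an error that would fence the state store.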
h1. Other

We also need to discuss and optimize the same issues in safeCreate.






[jira] [Updated] (YARN-11626) Optimization of the safeDelete operation in ZKRMStateStore

2023-12-05 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11626:
--
Priority: Minor  (was: Major)

> Optimization of the safeDelete operation in ZKRMStateStore
> --
>
> Key: YARN-11626
> URL: https://issues.apache.org/jira/browse/YARN-11626
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Priority: Minor
>
> h1. Description 
>  * The log shows that removal of the app info started at 06:17:20, but the 
> NoNodeException was received at 06:17:35. 
>  * During the 15s interval, Curator was retrying the metadata operation. 
> Because the ZooKeeper delete operation is not idempotent, one of the retry 
> attempts succeeded on the server but its response was never received. The 
> next retry then failed with a NoNodeException, which triggered the 
> STATE_STORE_FENCED event and ultimately caused the current ResourceManager 
> to switch to standby.
> {code:java}
> 2023-10-28 06:17:20,359 INFO  recovery.RMStateStore 
> (RMStateStore.java:transition(333)) - Removing info for app: 
> application_1697410508608_140368
> 2023-10-28 06:17:20,359 INFO  resourcemanager.RMAppManager 
> (RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be 
> expired, max number of completed apps kept in memory met: 
> maxCompletedAppsInMemory = 1000, removing app 
> application_1697410508608_140368 from memory:
> 2023-10-28 06:17:35,665 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(337)) - Error removing app: 
> application_1697410508608_140368
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 2023-10-28 06:17:35,666 INFO  recovery.RMStateStore 
> (RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from 
> ACTIVE to FENCED
> 2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> STATE_STORE_FENCED, caused by 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> 2023-10-28 06:17:35,666 INFO  resourcemanager.ResourceManager 
> (ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby 
> state
>  {code}
> h1. Solution
> The NoNodeException clearly indicates that the Znode no longer exists, so we 
> can safely ignore this exception to avoid triggering a larger impact on the 
> cluster caused by ResourceManager failover.
> h1. Other
> We also need to discuss and optimize the same issues in safeCreate.






[jira] [Updated] (YARN-11625) All job statuses can't be updated on Active ResourceManager services

2023-12-05 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11625:
--
Affects Version/s: 3.3.0
   3.1.1
   3.0.0-alpha4

> All job statuses can't be updated on Active ResourceManager services
> 
>
> Key: YARN-11625
> URL: https://issues.apache.org/jira/browse/YARN-11625
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Priority: Major
> Attachments: yuque_diagram.jpg
>
>
> Steps ① to ⑩ ultimately leave the Active ResourceManager's RMStateStore 
> stopped in the FENCED state, so the status of all jobs can no longer be 
> updated.
> !yuque_diagram.jpg|width=600,height=273!
> h1. Solution
> First, adopting the solution described in YARN-11622 enables an ordered 
> switch between "toActive" and "toStandby". With that ordering in place, the 
> "hasAlreadyRun" guard can be removed, which avoids this issue.






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-05 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793182#comment-17793182
 ] 

wangzhihui commented on YARN-11622:
---

Hi [~slfan1989], could you please also review a related issue?

https://issues.apache.org/jira/browse/YARN-11625

> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event 
> at 14:52:57. The two asynchronous events are referred to below as Thread_1 
> and Thread_2.
>  * As shown in the following figure, Thread_1 reinitializes activeServices 
> to null during the toStandby process. Thread_2 then uses activeServices 
> while executing the handleTransitionToStandByInNewThread method, which 
> results in a NullPointerException and causes the ResourceManager server to 
> exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
> {code}
>  * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
> toStandby event at 06:17:35. The two asynchronous events are referred to 
> below as Thread_1 and Thread_2.
>  * During the execution of Thread_1, CapacityScheduler.reinitialize is 
> called to refresh the 

[jira] [Created] (YARN-11625) All job statuses can't be updated on Active ResourceManager services

2023-12-05 Thread wangzhihui (Jira)
wangzhihui created YARN-11625:
-

 Summary: All job statuses can't be updated on Active 
ResourceManager services
 Key: YARN-11625
 URL: https://issues.apache.org/jira/browse/YARN-11625
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: wangzhihui
 Attachments: yuque_diagram.jpg

Steps ① to ⑩ ultimately leave the Active ResourceManager's RMStateStore 
stopped in the FENCED state, so that no job status can be updated.

!yuque_diagram.jpg|width=600,height=273!
h1. Solution

First, adopting the solution described in YARN-11622 enables an ordered switch 
between the "toActive" and "toStandby" transitions, in which case we can remove 
the "hasAlreadyRun" control variable to avoid this issue.
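The ordered switch mentioned above could be sketched roughly as follows. This is only an illustration, not the patch for YARN-11622: the class `OrderedTransitions`, its `HaState` enum, and all method names are hypothetical. It assumes that funnelling every transition request through one single-threaded executor is an acceptable way to guarantee that "toActive" and "toStandby" run one at a time, in the order they were requested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: serializes HA transitions on a single worker thread. */
class OrderedTransitions {
  enum HaState { STANDBY, ACTIVE }

  // One worker thread, so queued transitions run sequentially in submission order.
  private final ExecutorService worker = Executors.newSingleThreadExecutor();
  // Both fields are only ever read or written on the worker thread.
  private final List<String> history = new ArrayList<>();
  private HaState state = HaState.STANDBY;

  CompletableFuture<HaState> toActive() { return submit(HaState.ACTIVE); }

  CompletableFuture<HaState> toStandby() { return submit(HaState.STANDBY); }

  private CompletableFuture<HaState> submit(HaState target) {
    return CompletableFuture.supplyAsync(() -> {
      history.add(state + "->" + target); // record the transition that is applied
      state = target;
      return target;
    }, worker);
  }

  /** Waits for all queued transitions, returns their order, and stops the worker. */
  List<String> drainAndShutdown() {
    List<String> copy = CompletableFuture
        .supplyAsync(() -> new ArrayList<String>(history), worker).join();
    worker.shutdown();
    return copy;
  }
}
```

Because no two transitions can overlap in this model, a flag that only exists to suppress a concurrent second run would no longer be needed. Whether that holds for the real "hasAlreadyRun" variable depends on the actual ResourceManager code.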






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-04 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793117#comment-17793117
 ] 

wangzhihui commented on YARN-11622:
---

[~hexiaoqiao] [~elgoiri] [~slfan1989] Thank you all. I will start working on 
the fix soon.

> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-04 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792813#comment-17792813
 ] 

wangzhihui commented on YARN-11622:
---

The root cause of this issue can be traced back to the asynchronous processing 
logic introduced by a patch on the 3.0.0-alpha4 branch.
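As a rough model of the first case in the issue description, the sketch below shows why a second, unordered toStandby request fails once the first one has set the shared reference to null, and one possible way to make the repeat harmless. Everything here is hypothetical (`StandbyGuard`, `Services`, and both methods are illustrative names, not ResourceManager code), and the lock-plus-null-check guard is an assumption, not the fix adopted for this issue.

```java
/** Hypothetical model of the first exception case; not ResourceManager code. */
class StandbyGuard {
  /** Stand-in for the RM's active services (illustrative only). */
  static final class Services {
    void stop() { /* release resources */ }
  }

  private Services activeServices = new Services();
  private int skipped; // repeated requests that were ignored

  /** Unguarded: a second caller dereferences the null left by the first caller. */
  void unsafeToStandby() {
    activeServices.stop();
    activeServices = null;
  }

  /** Guarded: the lock orders callers, and the null check turns repeats into no-ops. */
  synchronized boolean safeToStandby() {
    if (activeServices == null) {
      skipped++;
      return false;
    }
    activeServices.stop();
    activeServices = null;
    return true;
  }

  synchronized int skippedCount() { return skipped; }
}
```

Calling `unsafeToStandby()` twice on one instance throws a NullPointerException on the second call, while the second `safeToStandby()` call simply returns `false`.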

> ResourceManager asynchronous switch to Standy、Active exception
> --
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>

[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-04 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11622:
--
Affects Version/s: 3.3.0

> ResourceManager asynchronous switch to Standy、Active exception
> --
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-04 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792812#comment-17792812
 ] 

wangzhihui commented on YARN-11622:
---

Hi [~hexiaoqiao], I have checked the active branches 3.4, 3.3, and 3.2, as 
well as the latest 3.3.6 release, and they all have the same issue.

> ResourceManager asynchronous switch to Standy、Active exception
> --
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>

[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-04 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11622:
--
Affects Version/s: 3.1.1
   3.0.0-alpha4
   (was: 3.0.0)
   (was: 3.1.3)

> ResourceManager asynchronous switch to Standy、Active exception
> --
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>

[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-03 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11622:
--
Affects Version/s: 3.1.3

> ResourceManager asynchronous switch to Standy、Active exception
> --
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0, 3.1.3
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event 
> at 14:52:57. The two asynchronous events are referred to below as Thread_1 
> and Thread_2.
>  * As shown in the following figure, Thread_1 reinitializes activeServices 
> to null during the toStandby process. Thread_2 then uses activeServices 
> while executing the handleTransitionToStandByInNewThread method, which 
> results in a NullPointerException and causes the ResourceManager server to 
> exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
> {code}
>  * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
> toStandby event at 06:17:35. The two asynchronous events are referred to as 
> Thread_1 and Thread_2.
>  * While Thread_1 is executing, CapacityScheduler.reinitialize is called to 
> refresh the scheduler configuration. At this time the csConfProvider 
> property of the CapacityScheduler has not been initialized and its value is null. As 
> value is null. As a result. 

[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-03 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui updated YARN-11622:
--
Description: 
h1. Two exception cases:
h2. The first case:

*The exception desc:*
{code:java}
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - 
Error in dispatcher thread
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:748)
{code}
 
 * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 
14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
 * As shown in the following figure, Thread_1 resets activeServices to null 
during the toStandby process. At that point Thread_2 uses activeServices while 
executing the handleTransitionToStandByInNewThread method, which results in a 
NullPointerException and causes the ResourceManager server to exit.
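
The interleaving described in the first case can be reproduced in isolation. The following is only a minimal sketch, not the ResourceManager code: the class and field names (NullRaceSketch, Services, activeServices as a static field) are hypothetical stand-ins, and a CountDownLatch is used to force the unlucky ordering that happens by chance in production.

```java
import java.util.concurrent.CountDownLatch;

final class NullRaceSketch {
    // Hypothetical stand-in for the RM's active services object.
    static final class Services {
        String describe() { return "active"; }
    }

    // Shared field that one thread resets to null while another still reads it.
    static volatile Services activeServices = new Services();

    // Returns true if the reading thread hit a NullPointerException.
    static boolean runInterleaving() {
        activeServices = new Services();
        CountDownLatch cleared = new CountDownLatch(1);
        boolean[] sawNpe = {false};

        Thread t1 = new Thread(() -> {
            activeServices = null;          // Thread_1: toStandby resets the field
            cleared.countDown();
        });
        Thread t2 = new Thread(() -> {
            try {
                cleared.await();            // force the ordering seen in the logs
                activeServices.describe();  // Thread_2: dereferences the null field
            } catch (NullPointerException e) {
                sawNpe[0] = true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sawNpe[0];
    }

    public static void main(String[] args) {
        System.out.println("NPE observed: " + runInterleaving());
    }
}
```

Because the latch makes Thread_2 wait until the field has been cleared, the sketch always observes the NullPointerException; in the real system the same ordering depends on timing.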

!yuque_diagram.jpg|width=629,height=100!
h2. The second case:

*The exception desc:* 
{code:java}
06:17:35,913 WARN ha.ActiveStandbyElector 
(ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning 
of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
at 
org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
during transition to Active
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
at 
org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
failed
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
... 5 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager 
(ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
{code}
 * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
toStandby event at 06:17:35. The two asynchronous events are referred to as 
Thread_1 and Thread_2.
 * While Thread_1 is executing, CapacityScheduler.reinitialize is called to 
refresh the scheduler configuration. At this time the csConfProvider property 
of the CapacityScheduler has not been initialized and its value is null. As a 
result, reinitialize dereferences csConfProvider, triggering a 
NullPointerException and causing Thread_1's transition to active to fail.

!yuque_diagram (1).jpg|width=568,height=155!
h1. Solution

Due to the limited scope of lock control in ResourceManager’s 
transitionToActive and transitionToStandby methods, different events triggered 
asynchronously outside this lock scope can influence each other, leading to 
unpredictable issues. The proposed solution is to encapsulate different 
asynchronous tasks as TransitionToActiveStandbyRunner and enqueue them in a 
queue to be executed in order by a SingleThreadExecutor. This approach resolves 
the asynchronous problem and provides 
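
The queued approach described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the patch attached to this issue: SerializedTransitionsSketch, HaState, submitTransition and the "elector"/"stateStore" labels are hypothetical names chosen for illustration, and only Executors.newSingleThreadExecutor is a real JDK API. The point it shows is that tasks submitted to a single-thread executor run one at a time in submission order, so a toActive and a toStandby request can no longer interleave.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class SerializedTransitionsSketch {
    enum HaState { ACTIVE, STANDBY }

    // One worker thread: queued transitions execute strictly one after another.
    private final ExecutorService transitionExecutor =
        Executors.newSingleThreadExecutor();

    // Only read and written from the executor thread.
    private HaState state = HaState.STANDBY;

    private final List<String> log = new CopyOnWriteArrayList<>();

    // Enqueue a transition request instead of running it on the caller's thread.
    Future<?> submitTransition(HaState target, String source) {
        return transitionExecutor.submit(() -> {
            log.add(source + ": " + state + " -> " + target);
            state = target;
        });
    }

    // Stop accepting work, wait for queued transitions, and return the log.
    List<String> shutdownAndGetLog() {
        transitionExecutor.shutdown();
        try {
            transitionExecutor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return log;
    }

    static List<String> demo() {
        SerializedTransitionsSketch rm = new SerializedTransitionsSketch();
        rm.submitTransition(HaState.ACTIVE, "elector");      // toActive request
        rm.submitTransition(HaState.STANDBY, "stateStore");  // toStandby request
        return rm.shutdownAndGetLog();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this sketch the second request always sees the state left behind by the first, because it cannot start until the first has finished.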

[jira] [Created] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception

2023-12-03 Thread wangzhihui (Jira)
wangzhihui created YARN-11622:
-

 Summary: ResourceManager asynchronous switch to Standy、Active 
exception
 Key: YARN-11622
 URL: https://issues.apache.org/jira/browse/YARN-11622
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0
Reporter: wangzhihui
 Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
yuque_diagram.jpg

h1. Two exception cases:
h2. The first case:

*The exception desc:* 
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - 
Error in dispatcher thread
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:748)

 * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 
14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
 * As shown in the following figure, Thread_1 resets activeServices to null 
during the toStandby process. At that point Thread_2 uses activeServices while 
executing the handleTransitionToStandByInNewThread method, which results in a 
NullPointerException and causes the ResourceManager server to exit.

 !yuque_diagram.jpg|width=629,height=100!

h2. The second case:

*The exception desc:* 
06:17:35,913 WARN  ha.ActiveStandbyElector 
(ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning 
of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
at 
org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
during transition to Active
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
at 
org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
failed
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
... 5 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager 
(ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed

 * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
toStandby event at 06:17:35. The two asynchronous events are referred to as 
Thread_1 and Thread_2.
 * While Thread_1 is executing, CapacityScheduler.reinitialize is called to 
refresh the scheduler configuration. At this time the csConfProvider property 
of the CapacityScheduler has not been initialized and its value is null. As a 
result, reinitialize dereferences csConfProvider, triggering a 
NullPointerException and causing Thread_1's transition to active to fail.

 !yuque_diagram (1).jpg|width=568,height=155!

h1. Solution

Due to the limited scope of lock control in