[jira] [Resolved] (YARN-11688) FS-CS converter: call System.exit replaced with ExitUtil.halt
[ https://issues.apache.org/jira/browse/YARN-11688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangzhihui resolved YARN-11688.
---
Resolution: Resolved

> FS-CS converter: call System.exit replaced with ExitUtil.halt
> -
>
> Key: YARN-11688
> URL: https://issues.apache.org/jira/browse/YARN-11688
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Reporter: wangzhihui
> Assignee: wangzhihui
> Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: image-2024-04-20-22-17-49-522.png
>
>
> YARN-10191 added System.exit logic so that the tool always terminates.
> As a side effect, the forked test VM is terminated while
> TestFSConfigToCSConfigConverterMain is running.
> The ExitUtil class in hadoop-common facilitates process termination for
> tests and debugging.
> Replacing the System.exit call with ExitUtil.halt would be more suitable
> for this purpose.
> {code:java}
> // code placeholder
> Crashed tests:
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
> org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
> Command was /bin/sh -c cd /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -jar /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire/surefirebooter2247421570320659117.jar /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire 2024-04-17T14-34-01_743-jvmRun1 surefire5773923906402489727tmp surefire_1524181064953128391099tmp
> Process Exit Code: 0
> Crashed tests:
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
>     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:511)
>     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:458)
>     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:299)
>     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:247)
>     at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1149)
>     at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:991)
>     at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:837)
>     at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
>     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
>     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
>     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
>     at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
>     at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
>     at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
>     at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
>     at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
>     at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
>     at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
>     at org.apache.maven.cli.MavenCli.execute(MavenCli.java:956)
>     at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
>     at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>     at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>     at
[jira] [Created] (YARN-11688) FS-CS converter: call System.exit replaced with ExitUtil.halt
wangzhihui created YARN-11688:
-

Summary: FS-CS converter: call System.exit replaced with ExitUtil.halt
Key: YARN-11688
URL: https://issues.apache.org/jira/browse/YARN-11688
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Reporter: wangzhihui
Assignee: wangzhihui
Fix For: 3.3.0
Attachments: image-2024-04-20-22-17-49-522.png

YARN-10191 added System.exit logic so that the tool always terminates.
As a side effect, the forked test VM is terminated while TestFSConfigToCSConfigConverterMain is running.
The ExitUtil class in hadoop-common facilitates process termination for tests and debugging.
Replacing the System.exit call with ExitUtil.halt would be more suitable for this purpose.
{code:java}
// code placeholder
Crashed tests:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was /bin/sh -c cd /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -jar /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire/surefirebooter2247421570320659117.jar /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire 2024-04-17T14-34-01_743-jvmRun1 surefire5773923906402489727tmp surefire_1524181064953128391099tmp
Process Exit Code: 0
Crashed tests:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:511)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:458)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:299)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:247)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1149)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:991)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:837)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:956)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was
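The YARN-11688 report above is cut off, so the following is only a minimal sketch of why routing termination through a helper is friendlier to tests than a direct System.exit call. It does not use Hadoop's ExitUtil; the class ExitGuardDemo, its nested HaltRequested exception, and the runConverter method are hypothetical names invented for illustration. The assumption is that a test can switch the helper into a mode where a halt request becomes an exception, so the forked Surefire VM stays alive.

```java
/**
 * Hypothetical exit helper: halts the JVM in production, but throws an
 * exception when halting has been disabled (for example by a test).
 * This is NOT Hadoop's ExitUtil; every name here is made up for illustration.
 */
final class ExitGuardDemo {

  /** Thrown in place of a real halt when halting is disabled. */
  static final class HaltRequested extends RuntimeException {
    final int status;

    HaltRequested(int status) {
      super("halt requested with status " + status);
      this.status = status;
    }
  }

  private static volatile boolean haltDisabled = false;

  private ExitGuardDemo() {
  }

  /** Tests call this first so the forked JVM survives a halt request. */
  static void disableHalt() {
    haltDisabled = true;
  }

  /** Halts the JVM, or throws HaltRequested when halting is disabled. */
  static void halt(int status) {
    if (haltDisabled) {
      throw new HaltRequested(status);
    }
    Runtime.getRuntime().halt(status);
  }

  /** Hypothetical converter entry point: no arguments counts as an error. */
  static void runConverter(String[] args) {
    int status = (args.length == 0) ? -1 : 0;
    halt(status);
  }

  public static void main(String[] args) {
    disableHalt();
    try {
      runConverter(new String[] {"--help"});
    } catch (HaltRequested e) {
      // A test can assert on the status instead of losing its JVM.
      System.out.println("captured halt status " + e.status);
    }
  }
}
```

With a direct System.exit in runConverter, the same test would end the forked VM and Surefire would report "The forked VM terminated without properly saying goodbye".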
[jira] [Commented] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size
[ https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832700#comment-17832700 ]

wangzhihui commented on YARN-11582:
---

Hi [~hexiaoqiao], we have submitted a simple optimization for ResourceManager. Please help review and merge it.

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: applications, resourcemanager
> Affects Versions: 3.3.4
> Reporter: xiaojunxiang
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png,
> image-2024-03-28-22-11-37-903.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, the AM of a newly submitted job may
> stay in the state "Application is Activated, waiting for resources to be
> assigned for AM". This happens because YARN does not have enough resources
> to allocate another AM Container, so we want to know how large the requested
> AM Container is. Unfortunately, the current diagnosticMessage on the web
> page does not show this data. It is therefore necessary to add the resource
> size of the AM Container to the diagnosticMessage, which will be very useful
> for troubleshooting online production faults.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size
[ https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831827#comment-17831827 ]

wangzhihui edited comment on YARN-11582 at 3/28/24 2:15 PM:

Hi [~slfan1989]. This [PR|https://github.com/apache/hadoop/pull/6139] now includes valid test content and has passed the latest Jenkins check; please help merge it. Thanks!

was (Author: JIRAUSER302479):
hi, [~slfan1989] This [PR|https://github.com/apache/hadoop/pull/6139] has added valid Test content and passed the latest Jenkins check; please help merge it. Thanks!

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: applications, resourcemanager
> Affects Versions: 3.3.4
> Reporter: xiaojunxiang
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png,
> image-2024-03-28-22-11-37-903.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, the AM of a newly submitted job may
> stay in the state "Application is Activated, waiting for resources to be
> assigned for AM". This happens because YARN does not have enough resources
> to allocate another AM Container, so we want to know how large the requested
> AM Container is. Unfortunately, the current diagnosticMessage on the web
> page does not show this data. It is therefore necessary to add the resource
> size of the AM Container to the diagnosticMessage, which will be very useful
> for troubleshooting online production faults.
[jira] [Commented] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size
[ https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831827#comment-17831827 ]

wangzhihui commented on YARN-11582:
---

Hi [~slfan1989], this [PR|https://github.com/apache/hadoop/pull/6139] now includes valid test content and has passed the latest Jenkins check; please help merge it. Thanks!

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: applications, resourcemanager
> Affects Versions: 3.3.4
> Reporter: xiaojunxiang
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png,
> image-2024-03-28-22-11-37-903.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, the AM of a newly submitted job may
> stay in the state "Application is Activated, waiting for resources to be
> assigned for AM". This happens because YARN does not have enough resources
> to allocate another AM Container, so we want to know how large the requested
> AM Container is. Unfortunately, the current diagnosticMessage on the web
> page does not show this data. It is therefore necessary to add the resource
> size of the AM Container to the diagnosticMessage, which will be very useful
> for troubleshooting online production faults.
[jira] [Updated] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size
[ https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangzhihui updated YARN-11582:
--
Attachment: image-2024-03-28-22-11-37-903.png

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: applications, resourcemanager
> Affects Versions: 3.3.4
> Reporter: xiaojunxiang
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png,
> image-2024-03-28-22-11-37-903.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, the AM of a newly submitted job may
> stay in the state "Application is Activated, waiting for resources to be
> assigned for AM". This happens because YARN does not have enough resources
> to allocate another AM Container, so we want to know how large the requested
> AM Container is. Unfortunately, the current diagnosticMessage on the web
> page does not show this data. It is therefore necessary to add the resource
> size of the AM Container to the diagnosticMessage, which will be very useful
> for troubleshooting online production faults.
[jira] [Updated] (YARN-11582) Improve WebUI diagnosticMessage to show AM Container resource request size
[ https://issues.apache.org/jira/browse/YARN-11582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangzhihui updated YARN-11582:
--
Priority: Minor (was: Major)

> Improve WebUI diagnosticMessage to show AM Container resource request size
> --
>
> Key: YARN-11582
> URL: https://issues.apache.org/jira/browse/YARN-11582
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: applications, resourcemanager
> Affects Versions: 3.3.4
> Reporter: xiaojunxiang
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2023-10-02-00-05-34-337.png, success_ShowAMInfo.jpg
>
>
> When YARN resources are insufficient, the AM of a newly submitted job may
> stay in the state "Application is Activated, waiting for resources to be
> assigned for AM". This happens because YARN does not have enough resources
> to allocate another AM Container, so we want to know how large the requested
> AM Container is. Unfortunately, the current diagnosticMessage on the web
> page does not show this data. It is therefore necessary to add the resource
> size of the AM Container to the diagnosticMessage, which will be very useful
> for troubleshooting online production faults.
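To make the YARN-11582 request concrete, here is a small sketch of a diagnostic message extended with the AM Container request size. It is not taken from the linked PR: the class AmDiagnosticsSketch, the method withAmRequest, and the exact output format are assumptions made for illustration only. The base message is the one quoted in the issue description.

```java
/**
 * Hypothetical helper that appends the AM Container resource request to an
 * existing diagnostic message. Names and output format are illustrative
 * assumptions, not the format used by YARN or by the PR for YARN-11582.
 */
final class AmDiagnosticsSketch {

  private AmDiagnosticsSketch() {
  }

  /** Returns the diagnostic text followed by the requested AM resources. */
  static String withAmRequest(String diagnostics, long memoryMb, int vCores) {
    return diagnostics
        + " AM resource request: <memory:" + memoryMb + " MB, vCores:" + vCores + ">";
  }

  public static void main(String[] args) {
    String base =
        "Application is Activated, waiting for resources to be assigned for AM.";
    // Example values only; a real request size comes from the application.
    System.out.println(withAmRequest(base, 2048, 1));
  }
}
```

Seeing the request size next to the waiting state lets an operator compare it directly with the free capacity of the queue.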
[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808866#comment-17808866 ]

wangzhihui commented on YARN-11622:
---

Hi [~slfan1989], I'm sorry; I've been busy lately and could not promptly handle the questions and details on this issue. I have added a testTransitionedToStandbyShouldNotNPE test case to reproduce the problem described in YARN-11622. There is still one SpotBugs warning, and we need to discuss how to handle it. Looking forward to your reply, thank you.

> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
> Reporter: wangzhihui
> Assignee: wangzhihui
> Priority: Major
> Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg,
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
>
> * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at
> 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
> * As shown in the following figure, during the toStandby process Thread_1
> sets activeServices to null while reinitializing it. At this point Thread_2
> uses activeServices while executing the handleTransitionToStandByInNewThread
> method, which results in a NullPointerException and causes the
> ResourceManager process to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:*
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>     ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
>     ... 5 more
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
>     ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings:
[jira] [Created] (YARN-11626) Optimization of the safeDelete operation in ZKRMStateStore
wangzhihui created YARN-11626:
-

Summary: Optimization of the safeDelete operation in ZKRMStateStore
Key: YARN-11626
URL: https://issues.apache.org/jira/browse/YARN-11626
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 3.3.0, 3.1.1, 3.0.0-alpha4
Reporter: wangzhihui

h1. Description
* The log shows that removal of the app info started at 06:17:20, but the NoNodeException was received at 06:17:35.
* During the 15 s interval, Curator was retrying the metadata operation. Because the ZooKeeper delete operation is not idempotent, one of the retry attempts succeeded on the server but its response was never received. The next retry then failed with a NoNodeException, which triggered the STATE_STORE_FENCED event and ultimately caused the current ResourceManager to switch to standby.

{code:java}
2023-10-28 06:17:20,359 INFO recovery.RMStateStore (RMStateStore.java:transition(333)) - Removing info for app: application_1697410508608_140368
2023-10-28 06:17:20,359 INFO resourcemanager.RMAppManager (RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1697410508608_140368 from memory:
2023-10-28 06:17:35,665 ERROR recovery.RMStateStore (RMStateStore.java:transition(337)) - Error removing app: application_1697410508608_140368
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
2023-10-28 06:17:35,666 INFO recovery.RMStateStore (RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from ACTIVE to FENCED
2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2023-10-28 06:17:35,666 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby state
{code}

h1. Solution
The NoNodeException clearly indicates that the znode no longer exists, so we can safely ignore it and avoid the larger cluster impact of a ResourceManager failover.

h1. Other
The same issue in safeCreate also needs to be discussed and optimized.
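The idea proposed for YARN-11626 can be sketched as follows. This is not the ZKRMStateStore code and it does not use the ZooKeeper or Curator APIs: FakeStore, the local NoNodeException class, and the path "/rmstore/app_1" are hypothetical stand-ins, chosen so the example runs without any dependency. It only shows the principle that a delete whose target is already gone can be treated as having reached its goal.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of an idempotent delete: "node already gone" counts as success.
 * FakeStore and NoNodeException are hypothetical stand-ins, not ZooKeeper
 * or Curator classes.
 */
final class SafeDeleteSketch {

  /** Stand-in for the "no such node" error of a real store. */
  static final class NoNodeException extends Exception {
    NoNodeException(String path) {
      super("NoNode for " + path);
    }
  }

  /** In-memory store standing in for the remote ensemble. */
  static final class FakeStore {
    private final Set<String> nodes = new HashSet<>();

    void create(String path) {
      nodes.add(path);
    }

    boolean exists(String path) {
      return nodes.contains(path);
    }

    void delete(String path) throws NoNodeException {
      if (!nodes.remove(path)) {
        throw new NoNodeException(path);
      }
    }
  }

  private SafeDeleteSketch() {
  }

  /**
   * Deletes the node. Returns true if this call removed it, false if it was
   * already absent, for example because an earlier attempt succeeded but its
   * response was lost before the retry.
   */
  static boolean safeDelete(FakeStore store, String path) {
    try {
      store.delete(path);
      return true;
    } catch (NoNodeException e) {
      // The desired end state (node absent) already holds; do not escalate.
      return false;
    }
  }

  public static void main(String[] args) {
    FakeStore store = new FakeStore();
    String path = "/rmstore/app_1"; // hypothetical path
    store.create(path);
    boolean firstAttempt = safeDelete(store, path);
    boolean retryAttempt = safeDelete(store, path); // simulates the retry
    System.out.println(firstAttempt + " " + retryAttempt + " " + store.exists(path));
  }
}
```

The retry returns normally instead of raising an error, so nothing comparable to the STATE_STORE_FENCED event would be triggered by it. Whether ignoring the exception is safe in every ZKRMStateStore code path is a question for the issue discussion, not something this sketch settles.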
[jira] [Updated] (YARN-11626) Optimization of the safeDelete operation in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangzhihui updated YARN-11626:
--
Priority: Minor (was: Major)

> Optimization of the safeDelete operation in ZKRMStateStore
> --
>
> Key: YARN-11626
> URL: https://issues.apache.org/jira/browse/YARN-11626
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
> Reporter: wangzhihui
> Priority: Minor
>
> h1. Description
> * The log shows that removal of the app info started at 06:17:20, but the
> NoNodeException was received at 06:17:35.
> * During the 15 s interval, Curator was retrying the metadata operation.
> Because the ZooKeeper delete operation is not idempotent, one of the retry
> attempts succeeded on the server but its response was never received. The
> next retry then failed with a NoNodeException, which triggered the
> STATE_STORE_FENCED event and ultimately caused the current ResourceManager
> to switch to standby.
> {code:java}
> 2023-10-28 06:17:20,359 INFO recovery.RMStateStore (RMStateStore.java:transition(333)) - Removing info for app: application_1697410508608_140368
> 2023-10-28 06:17:20,359 INFO resourcemanager.RMAppManager (RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1697410508608_140368 from memory:
> 2023-10-28 06:17:35,665 ERROR recovery.RMStateStore (RMStateStore.java:transition(337)) - Error removing app: application_1697410508608_140368
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 2023-10-28 06:17:35,666 INFO recovery.RMStateStore (RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from ACTIVE to FENCED
> 2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> 2023-10-28 06:17:35,666 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby state
> {code}
> h1. Solution
> The NoNodeException clearly indicates that the znode no longer exists, so we
> can safely ignore it and avoid the larger cluster impact of a
> ResourceManager failover.
> h1. Other
> The same issue in safeCreate also needs to be discussed and optimized.
[jira] [Updated] (YARN-11625) All job statuses can't be updated on Active ResourceManager services
[ https://issues.apache.org/jira/browse/YARN-11625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangzhihui updated YARN-11625:
--
Affects Version/s: 3.3.0
                   3.1.1
                   3.0.0-alpha4

> All job statuses can't be updated on Active ResourceManager services
> 
>
> Key: YARN-11625
> URL: https://issues.apache.org/jira/browse/YARN-11625
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
> Reporter: wangzhihui
> Priority: Major
> Attachments: yuque_diagram.jpg
>
>
> Steps ① to ⑩ ultimately leave the Active ResourceManager’s RMStateStore
> stopped in the FENCED state, so the status of all jobs can no longer be
> updated.
> !yuque_diagram.jpg|width=600,height=273!
> h1. Solution
> First, adopting the solution described in YARN-11622 enables an ordered
> switch between "toActive" and "toStandby". With that in place, we can remove
> the "hasAlreadyRun" guard variable to avoid this issue.
[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793182#comment-17793182 ]

wangzhihui commented on YARN-11622:
---

Hi [~slfan1989], could you please review another related issue? https://issues.apache.org/jira/browse/YARN-11625

> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
> Reporter: wangzhihui
> Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg,
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
>
> * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at
> 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
> * As shown in the following figure, during the toStandby process Thread_1
> sets activeServices to null while reinitializing it. At this point Thread_2
> uses activeServices while executing the handleTransitionToStandByInNewThread
> method, which results in a NullPointerException and causes the
> ResourceManager process to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:*
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>     ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
>     ... 5 more
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
>     ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
> {code}
> * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a
> toStandby event at 06:17:35. The two asynchronous events are referred to as
> Thread_1 and Thread_2.
> * During the execution of Thread_1, CapacityScheduler.reinitialize is
> called to refresh the
[jira] [Created] (YARN-11625) All job statuses can't be updated on Active ResourceManager services
wangzhihui created YARN-11625:
---------------------------------
Summary: All job statuses can't be updated on Active ResourceManager services
Key: YARN-11625
URL: https://issues.apache.org/jira/browse/YARN-11625
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: wangzhihui
Attachments: yuque_diagram.jpg

The process through steps ① to ⑩ ultimately leads to the Active ResourceManager's RMStateStore being stopped in the FENCED state, so that no job status can be updated.

!yuque_diagram.jpg|width=600,height=273!

h1. Solution

First, adopting the solution described in YARN-11622 enables an ordered switch between "toActive" and "toStandby", in which case we can remove the control of the "hasAlreadyRun" variable to avoid this issue.
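The ordered switch that YARN-11622 proposes (each transition wrapped as a task and run in submission order by a single-thread executor) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the ResourceManager code or the actual patch: the class OrderedTransitionSketch, the HaState enum, the requestTransition and shutdownAndGetHistory methods, and the "elector" / "stateStore" source labels are all hypothetical names. The lambda stands in for the TransitionToActiveStandbyRunner task mentioned in YARN-11622, and the "already in target state" check is a simplified stand-in, not the real hasAlreadyRun logic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: HA transitions are queued and run one at a time, in order. */
class OrderedTransitionSketch {
  enum HaState { ACTIVE, STANDBY }

  // One worker thread: tasks run strictly in the order they were submitted.
  private final ExecutorService transitionExecutor = Executors.newSingleThreadExecutor();
  private final List<String> history = Collections.synchronizedList(new ArrayList<>());
  // Read and written only on the executor thread, so no extra lock is needed here.
  private HaState current = HaState.STANDBY;

  /** Enqueue a transition request; callers (any thread) never run it inline. */
  void requestTransition(HaState target, String source) {
    transitionExecutor.execute(() -> {
      if (current == target) {
        // A second request for the same state becomes a no-op instead of racing.
        history.add(source + ": already " + target + ", skipped");
        return;
      }
      history.add(source + ": " + current + " -> " + target);
      current = target;
    });
  }

  /** Stop accepting requests, wait for queued ones to finish, return what happened. */
  List<String> shutdownAndGetHistory() {
    transitionExecutor.shutdown(); // already queued tasks still run
    try {
      if (!transitionExecutor.awaitTermination(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("transitions did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new ArrayList<>(history);
  }

  public static void main(String[] args) {
    OrderedTransitionSketch rm = new OrderedTransitionSketch();
    rm.requestTransition(HaState.ACTIVE, "elector");
    rm.requestTransition(HaState.STANDBY, "stateStore");
    rm.requestTransition(HaState.STANDBY, "elector");
    rm.shutdownAndGetHistory().forEach(System.out::println);
  }
}
```

Because every transition runs on the same worker thread, two requests that arrive at nearly the same time can no longer interleave. The second toStandby request in main simply observes the state left by the first one.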
[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793117#comment-17793117 ] wangzhihui commented on YARN-11622: --- [~hexiaoqiao] [~elgoiri] [~slfan1989] Thank you all, I will start the relevant repairs soon. > ResourceManager asynchronous switch from Standy to Active exception > --- > > Key: YARN-11622 > URL: https://issues.apache.org/jira/browse/YARN-11622 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0 >Reporter: wangzhihui >Priority: Major > Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, > yuque_diagram.jpg
[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792813#comment-17792813 ] wangzhihui commented on YARN-11622: --- The root cause of this issue can be traced back to the asynchronous processing logic introduced in the PATCH of the 3.0.0-alpha4 branch. > ResourceManager asynchronous switch to Standy、Active exception > -- > > Key: YARN-11622 > URL: https://issues.apache.org/jira/browse/YARN-11622 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0 >Reporter: wangzhihui >Priority: Major > Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, > yuque_diagram.jpg
[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihui updated YARN-11622: -- Affects Version/s: 3.3.0 > ResourceManager asynchronous switch to Standy、Active exception > -- > > Key: YARN-11622 > URL: https://issues.apache.org/jira/browse/YARN-11622 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0 >Reporter: wangzhihui >Priority: Major > Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, > yuque_diagram.jpg
[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792812#comment-17792812 ] wangzhihui commented on YARN-11622: --- hi ,[~hexiaoqiao] I have checked the Active Branch 3.4, 3.3, 3.2, and the latest 3.3.6 versions, and they all have the same issue. > ResourceManager asynchronous switch to Standy、Active exception > -- > > Key: YARN-11622 > URL: https://issues.apache.org/jira/browse/YARN-11622 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha4, 3.1.1 >Reporter: wangzhihui >Priority: Major > Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, > yuque_diagram.jpg
[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihui updated YARN-11622: -- Affects Version/s: 3.1.1 3.0.0-alpha4 (was: 3.0.0) (was: 3.1.3) > ResourceManager asynchronous switch to Standy、Active exception > -- > > Key: YARN-11622 > URL: https://issues.apache.org/jira/browse/YARN-11622 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha4, 3.1.1 >Reporter: wangzhihui >Priority: Major > Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, > yuque_diagram.jpg
[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihui updated YARN-11622: -- Affects Version/s: 3.1.3 > ResourceManager asynchronous switch to Standy、Active exception > -- > > Key: YARN-11622 > URL: https://issues.apache.org/jira/browse/YARN-11622 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0, 3.1.3 >Reporter: wangzhihui >Priority: Major > Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, > yuque_diagram.jpg
[jira] [Updated] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangzhihui updated YARN-11622:
--
    Description:
h1. Two exception cases:
h2. The first case:

*The exception desc:*
{code:java}
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:748)
{code}
* ActiveStandbyElector and ZKRMStateStore triggered toStandby events at 14:52:57; the two asynchronous events are referred to as Thread_1 and Thread_2.
* As shown in the following figure, Thread_1 resets activeServices to null during the toStandby process. Thread_2 then uses activeServices while executing the handleTransitionToStandByInNewThread method, which results in a NullPointerException and causes the ResourceManager to exit.

!yuque_diagram.jpg|width=629,height=100!
h2. The second case:

*The exception desc:*
{code:java}
06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    ... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
    ... 5 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
    ... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
{code}
* ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a toStandby event at 06:17:35; the two asynchronous events are referred to as Thread_1 and Thread_2.
* During the execution of Thread_1, CapacityScheduler.reinitialize is called to refresh the scheduler configuration. At this time, the csConfProvider property of the CapacityScheduler has not been initialized and is null. As a result, when the reinitialize method uses csConfProvider, a NullPointerException is thrown and Thread_1's transition to active fails.

!yuque_diagram (1).jpg|width=568,height=155!
h1. Solution

Because the locks in ResourceManager's transitionToActive and transitionToStandby methods cover only a limited scope, different events triggered asynchronously outside that scope can interfere with each other, leading to unpredictable issues. The proposed solution is to encapsulate the different asynchronous tasks as TransitionToActiveStandbyRunner instances and enqueue them to be executed in order by a SingleThreadExecutor. This approach resolves the asynchronous problem and provides
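
The queueing idea in the Solution section can be illustrated with a minimal sketch. This is not the actual patch: only the name TransitionToActiveStandbyRunner comes from the issue text, while the class TransitionQueueSketch, the Target enum, runDemo, the "elector" and "stateStore" labels, and the event list are hypothetical stand-ins. A real runner would invoke the ResourceManager transition logic instead of recording strings. The sketch assumes that funnelling every transition through one single-threaded executor is enough to keep two transitions from interleaving.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TransitionQueueSketch {

  /** Hypothetical transition targets. */
  enum Target { ACTIVE, STANDBY }

  /** Task wrapper; the name follows the issue, the body is illustrative only. */
  static final class TransitionToActiveStandbyRunner implements Runnable {
    private final String source;
    private final Target target;
    private final List<String> events;

    TransitionToActiveStandbyRunner(String source, Target target,
        List<String> events) {
      this.source = source;
      this.target = target;
      this.events = events;
    }

    @Override
    public void run() {
      // A real runner would perform the transition here. Because the
      // executor has a single worker thread, nothing else runs between
      // the start and end markers of one transition.
      events.add(source + "->" + target + ":start");
      events.add(source + "->" + target + ":end");
    }
  }

  /** Submits two transitions and returns the order in which their steps ran. */
  public static List<String> runDemo() throws Exception {
    List<String> events =
        Collections.synchronizedList(new ArrayList<String>());
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      // Tasks are queued and executed one at a time in submission order.
      Future<?> first = executor.submit(
          new TransitionToActiveStandbyRunner("elector", Target.ACTIVE, events));
      Future<?> second = executor.submit(
          new TransitionToActiveStandbyRunner("stateStore", Target.STANDBY, events));
      first.get();
      second.get();
    } finally {
      executor.shutdown();
    }
    return events;
  }

  public static void main(String[] args) throws Exception {
    for (String event : runDemo()) {
      System.out.println(event);
    }
  }
}
```

In this sketch both tasks are submitted from one thread, so the order is fixed. With several submitting threads the order of whole transitions would depend on who enqueues first, but each transition would still run to completion before the next one starts.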
[jira] [Created] (YARN-11622) ResourceManager asynchronous switch to Standy、Active exception
wangzhihui created YARN-11622:
-

             Summary: ResourceManager asynchronous switch to Standy、Active exception
                 Key: YARN-11622
                 URL: https://issues.apache.org/jira/browse/YARN-11622
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.0.0
            Reporter: wangzhihui
         Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, yuque_diagram.jpg

h1. Two exception cases:
h2. The first case:

*The exception desc:*

14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:748)

* ActiveStandbyElector and ZKRMStateStore triggered toStandby events at 14:52:57; the two asynchronous events are referred to as Thread_1 and Thread_2.
* As shown in the following figure, Thread_1 resets activeServices to null during the toStandby process. Thread_2 then uses activeServices while executing the handleTransitionToStandByInNewThread method, which results in a NullPointerException and causes the ResourceManager to exit.

!yuque_diagram.jpg|width=629,height=100!
h2. The second case:

*The exception desc:*

06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    ... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
    ... 5 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
    ... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed

* ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a toStandby event at 06:17:35; the two asynchronous events are referred to as Thread_1 and Thread_2.
* During the execution of Thread_1, CapacityScheduler.reinitialize is called to refresh the scheduler configuration. At this time, the csConfProvider property of the CapacityScheduler has not been initialized and is null. As a result, when the reinitialize method uses csConfProvider, a NullPointerException is thrown and Thread_1's transition to active fails.

!yuque_diagram (1).jpg|width=568,height=155!
h1. Solution

Due to the limited scope of lock control in