[ https://issues.apache.org/jira/browse/YARN-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
tim yu updated YARN-10464:
--------------------------
    Description: 
I am trying to run a Flink (1.11.1) job on our Hadoop cluster (2.6.0) with HA enabled, but when I test failover by killing the active RM, it brings down the entire cluster.

I have configured Flink's HA in flink-conf.yaml.

When I kill the active RM with kill -9, YARN correctly fails over to the standby RM and I can see the applications in the ACCEPTED state for about a minute, but then the standby RM crashes with the following exception:

{code:java}
2020-10-18 15:39:36.112 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
	at java.lang.Thread.run(Thread.java:745)
{code}

I found the following code for submitting high-availability jobs in the Flink project:

{code:java}
private void activateHighAvailabilitySupport(ApplicationSubmissionContext appContext) throws
        InvocationTargetException, IllegalAccessException {

    ApplicationSubmissionContextReflector reflector =
            ApplicationSubmissionContextReflector.getInstance();

    reflector.setKeepContainersAcrossApplicationAttempts(appContext, true);

    reflector.setAttemptFailuresValidityInterval(
            appContext,
            flinkConfiguration.getLong(YarnConfigOptions.APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL));
}
{code}

Flink HA jobs set KeepContainersAcrossApplicationAttempts to true. Some properties in yarn-site.xml:

{code:xml}
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>false</value>
</property>
{code}


> Flink job on YARN with HA enabled crashes all RMs on attempt recovery
> ---------------------------------------------------------------------
>
>                 Key: YARN-10464
>                 URL: https://issues.apache.org/jira/browse/YARN-10464
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: tim yu
>            Priority: Critical
>
> I am trying to run a Flink (1.11.1) job on our Hadoop cluster (2.6.0) with HA enabled, but when I test failover by killing the active RM, it brings down the entire cluster.
> I have configured Flink's HA in flink-conf.yaml.
> When I kill the active RM with kill -9, YARN correctly fails over to the standby RM and I can see the applications in the ACCEPTED state for about a minute, but then the standby RM crashes with the following exception:
> {code:java}
> 2020-10-18 15:39:36.112 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> I found the following code for submitting high-availability jobs in the Flink project:
> {code:java}
> private void activateHighAvailabilitySupport(ApplicationSubmissionContext appContext) throws
>         InvocationTargetException, IllegalAccessException {
>
>     ApplicationSubmissionContextReflector reflector =
>             ApplicationSubmissionContextReflector.getInstance();
>
>     reflector.setKeepContainersAcrossApplicationAttempts(appContext, true);
>
>     reflector.setAttemptFailuresValidityInterval(
>             appContext,
>             flinkConfiguration.getLong(YarnConfigOptions.APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL));
> }
> {code}
> Flink HA jobs set KeepContainersAcrossApplicationAttempts to true.
> Some properties in yarn-site.xml:
> <property>
>   <name>yarn.resourcemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
>   <value>false</value>
> </property>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
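
A note on the stack trace and configuration above: the NPE is thrown from SchedulerApplicationAttempt.transferStateFromPreviousAttempt, a code path that appears to be taken only when the application requests KeepContainersAcrossApplicationAttempts, while the quoted yarn-site.xml has work-preserving recovery disabled. For comparison, a minimal yarn-site.xml sketch with work-preserving recovery enabled, the combination container preservation is normally paired with (treating this as a fix for the NPE on 2.6.0 is an assumption, not verified here):

{code:xml}
<property>
  <!-- Persist application/attempt state so a standby RM can recover apps after failover -->
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Work-preserving recovery: running containers survive an RM restart/failover,
       so a recovered attempt has live scheduler state to transfer -->
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
{code}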