tim yu created YARN-10464:
-----------------------------

             Summary: Flink job on YARN with HA enabled crashes all RMs on attempt recovery
                 Key: YARN-10464
                 URL: https://issues.apache.org/jira/browse/YARN-10464
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.6.0
         Environment: some properties in yarn-site.xml:
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>false</value>
</property>
            Reporter: tim yu


I am trying to run a Flink (1.11.1) job on our Hadoop cluster (2.6.0) with HA enabled, but when I test it by killing the active RM, the entire cluster goes down. I have configured Flink's HA in flink-conf.yaml.

When I kill the active RM with kill -9, YARN correctly fails over to the standby RM and I can see applications in the ACCEPTED state for about a minute, but the standby RM then crashes with the following exception:

2020-10-18 15:39:36.112 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
    at java.lang.Thread.run(Thread.java:745)

I found the following code for submitting high-availability jobs in the Flink project:

    private void activateHighAvailabilitySupport(ApplicationSubmissionContext appContext)
            throws InvocationTargetException, IllegalAccessException {

        ApplicationSubmissionContextReflector reflector =
                ApplicationSubmissionContextReflector.getInstance();

        reflector.setKeepContainersAcrossApplicationAttempts(appContext, true);

        reflector.setAttemptFailuresValidityInterval(
                appContext,
                flinkConfiguration.getLong(YarnConfigOptions.APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL));
    }

Flink HA jobs set KeepContainersAcrossApplicationAttempts to true.
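To make the suspected interaction easier to discuss, here is a toy model in plain Java. It is only a sketch under stated assumptions and is not the YARN code: the class and method names (KeepContainersRecoverySketch, SchedulerApp, Attempt, addAttempt) are hypothetical stand-ins. It assumes that after a non-work-preserving restart the scheduler holds no previous attempt for the recovered application, and that a keep-containers application still tries to copy state from that previous attempt. The report does not confirm that this is the actual cause.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only; every name here is hypothetical and none of this is YARN source. */
public class KeepContainersRecoverySketch {

    /** Stand-in for a scheduler-side application attempt. */
    static final class Attempt {
        final Map<String, String> liveContainers = new HashMap<>();

        // Copies live containers from the prior attempt.
        // Dereferencing a null 'previous' throws NullPointerException.
        void transferStateFromPreviousAttempt(Attempt previous) {
            liveContainers.putAll(previous.liveContainers);
        }
    }

    /** Stand-in for the scheduler's per-application record. */
    static final class SchedulerApp {
        // Assumption: stays null when the restarted RM rebuilt no attempt state.
        Attempt currentAttempt;
    }

    /** Registers a new attempt and reports what happened as a string. */
    static String addAttempt(SchedulerApp app, boolean keepContainers) {
        Attempt next = new Attempt();
        try {
            if (keepContainers) {
                next.transferStateFromPreviousAttempt(app.currentAttempt);
            }
            app.currentAttempt = next;
            return "added";
        } catch (NullPointerException e) {
            // In the report the equivalent failure is fatal to the RM;
            // here it is caught so the sketch can show the outcome.
            return "NullPointerException";
        }
    }

    public static void main(String[] args) {
        // Recovered app with no previous attempt, keep-containers enabled.
        System.out.println(addAttempt(new SchedulerApp(), true));

        // Same situation without keep-containers: nothing is dereferenced.
        System.out.println(addAttempt(new SchedulerApp(), false));

        // Once a previous attempt exists, keep-containers works.
        SchedulerApp app = new SchedulerApp();
        addAttempt(app, false);
        System.out.println(addAttempt(app, true));
    }
}
```

In this model only the combination of a missing previous attempt and keep-containers set to true fails, which matches the settings described above (recovery enabled, work-preserving recovery disabled, Flink HA setting KeepContainersAcrossApplicationAttempts to true).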