[
https://issues.apache.org/jira/browse/YARN-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
tim yu updated YARN-10464:
--------------------------
Description:
I am trying to run a Flink (1.11.1) job on our Hadoop cluster (2.6.0) with HA
enabled, but when I test it by killing the active RM, it brings down the
entire cluster.
I have configured Flink's HA in flink-conf.yaml.
When I kill the active RM with kill -9, YARN correctly fails over to the
standby RM and I can see the application as ACCEPTED for about a minute, but
soon the standby RM crashes with the following exception:
{code:java}
2020-10-18 15:39:36.112 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
    at java.lang.Thread.run(Thread.java:745)
{code}
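The top frame, SchedulerApplicationAttempt.transferStateFromPreviousAttempt, suggests the newly added attempt copies scheduler state from a previous attempt that the failed-over RM never re-registered with its scheduler, so the reference it dereferences is null. The following is only a simplified, hypothetical reconstruction of that pattern; the SchedulerAttempt class below is an invented stand-in, not the real Hadoop class:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Invented stand-in for YARN's SchedulerApplicationAttempt, used only to
// illustrate the suspected null dereference; this is NOT the Hadoop source.
class SchedulerAttempt {
    private Map<String, String> liveContainers = new HashMap<>();

    Map<String, String> getLiveContainersMap() {
        return liveContainers;
    }

    // Analogue of transferStateFromPreviousAttempt(previousAttempt):
    // the previous attempt is dereferenced without a null check.
    void transferStateFromPreviousAttempt(SchedulerAttempt previous) {
        this.liveContainers = previous.getLiveContainersMap();
    }

    public static void main(String[] args) {
        SchedulerAttempt newAttempt = new SchedulerAttempt();
        // After failover without work-preserving recovery, the scheduler has
        // no record of the old attempt, so the "previous" reference is null.
        SchedulerAttempt previous = null;
        try {
            newAttempt.transferStateFromPreviousAttempt(previous);
        } catch (NullPointerException e) {
            System.out.println("Reproduced the suspected pattern: " + e);
        }
    }
}
{code}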
I found the following code in the Flink project that is used when submitting high-availability jobs:
{code:java}
private void activateHighAvailabilitySupport(ApplicationSubmissionContext appContext)
        throws InvocationTargetException, IllegalAccessException {

    ApplicationSubmissionContextReflector reflector =
            ApplicationSubmissionContextReflector.getInstance();

    reflector.setKeepContainersAcrossApplicationAttempts(appContext, true);

    reflector.setAttemptFailuresValidityInterval(
            appContext,
            flinkConfiguration.getLong(YarnConfigOptions.APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL));
}
{code}
So Flink HA jobs set keepContainersAcrossApplicationAttempts to true on the ApplicationSubmissionContext; the sketch below shows what the reflector calls amount to.
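Stripped of Flink's reflection helper (presumably there so Flink still works against YARN versions that lack these setters), this amounts to the following direct calls on the YARN API. This is a minimal sketch; the 600000 ms validity interval is an arbitrary example value, not what Flink configures:
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

public class HaSubmissionSettings {

    // Minimal sketch of the two HA-related settings on the submission context.
    public static ApplicationSubmissionContext withHaSettings() {
        ApplicationSubmissionContext appContext =
                Records.newRecord(ApplicationSubmissionContext.class);

        // Keep already-running containers when a new application attempt starts.
        appContext.setKeepContainersAcrossApplicationAttempts(true);

        // Only count attempt failures that happened within this window (ms);
        // 600000 is an arbitrary example value.
        appContext.setAttemptFailuresValidityInterval(600000L);

        return appContext;
    }
}
{code}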
Some properties in yarn-site.xml:
{code:xml}
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>false</value>
</property>
{code}
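I suspect the crash comes from this combination: RM recovery is enabled, work-preserving recovery is disabled, and the Flink application keeps containers across attempts. Below is a small diagnostic sketch, assuming yarn-site.xml is on the classpath, that only reads these two properties and prints whether a cluster has this combination (the class name RecoveryConfigCheck is my own):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RecoveryConfigCheck {
    public static void main(String[] args) {
        // Loads yarn-default.xml and yarn-site.xml from the classpath.
        Configuration conf = new YarnConfiguration();

        boolean recoveryEnabled =
                conf.getBoolean("yarn.resourcemanager.recovery.enabled", false);
        boolean workPreserving =
                conf.getBoolean("yarn.resourcemanager.work-preserving-recovery.enabled", false);

        System.out.println("RM recovery enabled:      " + recoveryEnabled);
        System.out.println("Work-preserving recovery: " + workPreserving);

        if (recoveryEnabled && !workPreserving) {
            System.out.println("RM recovery is on but work-preserving recovery is off; "
                    + "apps that keep containers across attempts may hit the NPE above on failover.");
        }
    }
}
{code}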
> Flink job on YARN with HA enabled crashes all RMs on attempt recovery
> ---------------------------------------------------------------------
>
> Key: YARN-10464
> URL: https://issues.apache.org/jira/browse/YARN-10464
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: tim yu
> Priority: Critical
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]