[ https://issues.apache.org/jira/browse/YARN-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
tim yu updated YARN-10464:
--------------------------
    Description: 
I am trying to run a Flink (1.11.1) job on our Hadoop cluster (2.6.0) with HA enabled, but when I test failover by killing the active RM, it brings down the entire cluster.

I have configured Flink's HA in flink-conf.yaml.

When I kill the active RM with kill -9, YARN correctly fails over to the standby RM and I can see the applications in the ACCEPTED state for about a minute, but then the standby RM crashes with the following exception:

{code:java}
2020-10-18 15:39:36.112 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
	at java.lang.Thread.run(Thread.java:745)
{code}

I found the following code for submitting high-availability jobs in the Flink project:

{code:java}
private void activateHighAvailabilitySupport(ApplicationSubmissionContext appContext) throws
        InvocationTargetException, IllegalAccessException {

    ApplicationSubmissionContextReflector reflector =
            ApplicationSubmissionContextReflector.getInstance();

    reflector.setKeepContainersAcrossApplicationAttempts(appContext, true);

    reflector.setAttemptFailuresValidityInterval(
            appContext,
            flinkConfiguration.getLong(YarnConfigOptions.APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL));
}
{code}

Flink HA jobs set KeepContainersAcrossApplicationAttempts to true. Some properties in yarn-site.xml:

{code:xml}
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>false</value>
</property>
{code}


> Flink job on YARN with HA enabled crashes all RMs on attempt recovery
> ---------------------------------------------------------------------
>
>                 Key: YARN-10464
>                 URL: https://issues.apache.org/jira/browse/YARN-10464
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: tim yu
>            Priority: Critical
>
> I am trying to run a Flink (1.11.1) job on our Hadoop cluster (2.6.0) with HA enabled, but when I test failover by killing the active RM, it brings down the entire cluster.
> I have configured Flink's HA in flink-conf.yaml.
> When I kill the active RM with kill -9, YARN correctly fails over to the standby RM and I can see the applications in the ACCEPTED state for about a minute, but then the standby RM crashes with the following exception:
> {code:java}
> 2020-10-18 15:39:36.112 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> I found the following code for submitting high-availability jobs in the Flink project:
> {code:java}
> private void activateHighAvailabilitySupport(ApplicationSubmissionContext appContext) throws
>         InvocationTargetException, IllegalAccessException {
>
>     ApplicationSubmissionContextReflector reflector =
>             ApplicationSubmissionContextReflector.getInstance();
>
>     reflector.setKeepContainersAcrossApplicationAttempts(appContext, true);
>
>     reflector.setAttemptFailuresValidityInterval(
>             appContext,
>             flinkConfiguration.getLong(YarnConfigOptions.APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL));
> }
> {code}
> Flink HA jobs set KeepContainersAcrossApplicationAttempts to true.
> Some properties in yarn-site.xml:
> <property>
>   <name>yarn.resourcemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
>   <value>false</value>
> </property>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
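
A note on the stack trace and configuration above: the NPE is thrown from SchedulerApplicationAttempt.transferStateFromPreviousAttempt, a code path that appears to be taken only when the application requests KeepContainersAcrossApplicationAttempts, while the quoted yarn-site.xml has work-preserving recovery disabled. For comparison, a minimal yarn-site.xml sketch with work-preserving recovery enabled, the combination container preservation is normally paired with (treating this as a fix for the NPE on 2.6.0 is an assumption, not verified here):

{code:xml}
<property>
  <!-- Persist application/attempt state so a standby RM can recover apps after failover -->
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Work-preserving recovery: running containers survive an RM restart/failover,
       so a recovered attempt has live scheduler state to transfer -->
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
{code}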