[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743618#comment-16743618
 ] 

Dapeng Sun commented on YARN-9198:
----------------------------------

{quote}
Not restoring an application is irreversible. There is no way to get that 
application back. If that would be an application that had been running for 
some time (like days) processing petabytes of data not restoring the 
application could be far more costly than some extra down time.
{quote}

Yes, in this scenario we should not skip the failed application.

How about adding a config option, with a key like 
"xxx.resourcemanager.fair-scheduler.skip-error-apps", so that users can 
choose between two behaviors: "Stop the RM and recover the failed app" or 
"Skip the error and continue starting the RM"? The option could be false by 
default. When the exception occurs, the log would show the id(s) of the 
failed applications, and the user could decide to "fix" or "skip" based on 
the logs.
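
To make the proposal concrete, here is a rough sketch of the recovery loop I 
have in mind. It is not taken from the patch and does not use the real 
FairScheduler or Configuration classes: RecoverySketch, recoverAll and the 
recoverer callback are hypothetical names, a plain Map stands in for the 
configuration, and the "xxx" prefix of the key is still a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch only; none of these names exist in Hadoop. */
final class RecoverySketch {
    // Key as proposed above; the "xxx" prefix is deliberately left as a placeholder.
    static final String SKIP_ERROR_APPS =
        "xxx.resourcemanager.fair-scheduler.skip-error-apps";

    /**
     * Tries to recover every application id with the given recoverer.
     * Returns the ids that were skipped because their recovery failed.
     * When skipping is disabled (the default), the first failure is rethrown,
     * which corresponds to stopping RM startup.
     */
    static List<String> recoverAll(Map<String, String> conf,
                                   List<String> appIds,
                                   Consumer<String> recoverer) {
        boolean skip =
            Boolean.parseBoolean(conf.getOrDefault(SKIP_ERROR_APPS, "false"));
        List<String> skipped = new ArrayList<>();
        for (String appId : appIds) {
            try {
                recoverer.accept(appId);
            } catch (RuntimeException e) {
                // Log the id in both modes so the user can decide to fix or skip.
                System.err.println(
                    "Failed to recover application " + appId + ": " + e);
                if (!skip) {
                    throw e; // default: fail startup, nothing is dropped
                }
                skipped.add(appId); // opt-in: drop this app and continue
            }
        }
        return skipped;
    }
}
```

With the default (false), a corrupted application still stops startup, so 
nothing is lost silently. Setting the option to true is an explicit, 
irreversible choice made after reading the logged ids.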

> Corrupted state from a previous version can still cause RM to fail with NPE 
> on FairScheduler
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-9198
>                 URL: https://issues.apache.org/jira/browse/YARN-9198
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>    Affects Versions: 3.1.0, 2.8.5
>            Reporter: Dapeng Sun
>            Assignee: Dapeng Sun
>            Priority: Major
>         Attachments: YARN-9198.001.patch
>
>
> Previously, the RM could fail with an NPE, as reported in YARN-4347 and 
> YARN-4000. After these fixes, FairScheduler still has the same potential 
> issue.
>  
> 201x-xx-xx xx:xx:xx,xxx ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart) - Failed to load/recover state
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

