Jian He commented on YARN-4032:

The problem in YARN-2834 is that if the state-store contains an app with
- app state = final state
- attempt state = null
then RM fails with an NPE on recovery.

One approach is to delete the app with this inconsistent state from the 
state-store. Has that been considered?
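
To make the cleanup idea concrete, here is a minimal sketch. It does not use the real RMStateStore API: StoredApp, isInconsistent and purgeInconsistent are hypothetical names, and the state-store is modeled as a plain map. It only shows the check "app state is final and attempt state is null" and the removal of matching entries.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical, simplified model of the state-store; not Hadoop's real classes.
class StateStoreCleanupSketch {
    enum State { RUNNING, FINISHED, FAILED, KILLED }

    static final class StoredApp {
        final State appState;
        final State attemptState; // null models the missing attempt state

        StoredApp(State appState, State attemptState) {
            this.appState = appState;
            this.attemptState = attemptState;
        }
    }

    static boolean isFinal(State s) {
        return s == State.FINISHED || s == State.FAILED || s == State.KILLED;
    }

    // The inconsistent case described above: final app state, no attempt state.
    static boolean isInconsistent(StoredApp app) {
        return isFinal(app.appState) && app.attemptState == null;
    }

    // Removes inconsistent entries in place and returns how many were removed.
    static int purgeInconsistent(Map<String, StoredApp> store) {
        int removed = 0;
        Iterator<Map.Entry<String, StoredApp>> it = store.entrySet().iterator();
        while (it.hasNext()) {
            if (isInconsistent(it.next().getValue())) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

Deleting the entry means the finished app no longer shows up after recovery, which is the trade-off of this approach.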

Regarding the patch, it catches all exceptions in app.recover and returns 
FAILED. If the application previously ended as FINISHED, the app is changed to 
FAILED, which I think is inconsistent from the user's point of view. Also, this 
exception will happen again and again whenever RM gets restarted.
I think what we can do is check whether the app is in a final state in 
RMAppAttemptImpl#AttemptRecoveredTransition, and skip adding the attempt to the 
scheduler if it is.
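
The guard suggested above could look roughly like the following. This is only a sketch under stated assumptions, not the actual RMAppAttemptImpl code: AppState, Scheduler and recoverAttempt are hypothetical stand-ins, and the real transition does considerably more work.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; not the real YARN classes.
class RecoveryGuardSketch {
    enum AppState { RUNNING, FINISHED, FAILED, KILLED }

    static boolean isFinal(AppState s) {
        return s == AppState.FINISHED || s == AppState.FAILED || s == AppState.KILLED;
    }

    // Minimal scheduler model that just records the attempts it was given.
    static final class Scheduler {
        final List<String> attempts = new ArrayList<>();

        void addAttempt(String attemptId) {
            attempts.add(attemptId);
        }
    }

    // Returns true if the attempt was handed to the scheduler, false if skipped.
    static boolean recoverAttempt(String attemptId, AppState appState, Scheduler scheduler) {
        if (isFinal(appState)) {
            // App already finished: keep its recorded final state and do not
            // register the attempt with the scheduler.
            return false;
        }
        scheduler.addAttempt(attemptId);
        return true;
    }
}
```

With this shape, an app stored as FINISHED stays FINISHED after recovery instead of being turned into FAILED.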

> Corrupted state from a previous version can still cause RM to fail with NPE 
> due to same reasons as YARN-2834
> ------------------------------------------------------------------------------------------------------------
>                 Key: YARN-4032
>                 URL: https://issues.apache.org/jira/browse/YARN-4032
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>            Priority: Critical
>         Attachments: YARN-4032.prelim.patch
> YARN-2834 ensures in 2.6.0 there will not be any inconsistent state. But if 
> someone is upgrading from a previous version, the state can still be 
> inconsistent and then RM will still fail with NPE after upgrade to 2.6.0.

This message was sent by Atlassian JIRA
