[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

Jun Gong (JIRA) Wed, 13 Jan 2016 02:25:53 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095956#comment-15095956
 ]


Jun Gong commented on YARN-4497:
--------------------------------

[~rohithsharma] Thanks for the comments and suggestion.

{quote}
As a side note : since YARN-3840 removes the attempts from RMStateStore, it is 
very prone to get this issue (YARN-4584) nevertheless of without RM HA is 
configured and fail fast is false.
{quote}
As I commented in YARN-4584, "If attempt 1~28 are removed and attempt 29~31 has 
been saved to appstore successfully, there will be no NPE for RM recovery." I 
think we need to analyze the RM log further. Removing attempts causes an NPE 
only when RM keeps running after an operation (e.g. store/remove) on 
RMStateStore has failed. Is there any other case that might cause an NPE? If 
so, we may need to fix it.

{quote}
About the solution, it is bit tricky to identify during recovery that 
whether-application-is-failed-to-store VS 
failed-attempts-were-removed-after-interval.
{quote}
I think we do not need to distinguish these two cases, because it makes no 
difference for recovery.

{quote}
So I think you can club both your solution and Jian He's thought together, so 
that we can eliminate failed-attempts-were-removed-after-interval attempts. And 
assume that attempts recovered are of failed to store only. 
{quote}
In *RMAppImpl#createNewAttempt()*, the ID of the first new attempt is 
*nextAttemptId*, which *RMAppImpl#recover()* initializes to the minimum attempt 
ID found in RMStateStore. So recovery already skips those 
*failed-attempts-were-removed-after-interval* attempts. 
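
A minimal sketch of that initialization, assuming attempt IDs are plain 
integers that start at 1. The class and method names are hypothetical and are 
not the actual RMAppImpl code:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

class NextAttemptIdSketch {
    // Hypothetical helper: the first attempt ID to recover is the smallest ID
    // still present in the state store. Attempts that were removed after the
    // validity interval have smaller IDs, so they are never recreated.
    // Assumption: with nothing stored, numbering starts at 1.
    static int minStoredAttemptId(Collection<Integer> storedAttemptIds) {
        if (storedAttemptIds.isEmpty()) {
            return 1;
        }
        return Collections.min(storedAttemptIds);
    }

    public static void main(String[] args) {
        // Attempts 1~28 were removed; only 29~31 remain in the store.
        System.out.println(minStoredAttemptId(Arrays.asList(31, 29, 30)));
    }
}
```

With attempts 29~31 stored in any order, the helper returns 29, so recovery 
would begin at attempt 29 in this sketch.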

{quote}
Regarding iterating appState.attempts, it can be sorted before iterating it. If 
attempts are sorted, then there should not be problem with nextAttemptId.
{quote}
Yes, we could sort it. I will update the patch if needed. 
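
A rough sketch of sorting before iterating, assuming the stored attempts are 
keyed by an integer attempt ID. The class, the method names, and the String 
payload are hypothetical stand-ins, not the real appState.attempts types:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedRecoverySketch {
    // Returns the stored attempt IDs in ascending order, which is the order
    // in which a sorted recovery loop would visit them.
    static List<Integer> recoveryOrder(Map<Integer, String> storedAttempts) {
        TreeMap<Integer, String> sorted = new TreeMap<>(storedAttempts);
        return new ArrayList<>(sorted.keySet());
    }

    // Returns the IDs between the smallest and largest stored ID that have no
    // stored state, e.g. attempts that failed to be stored.
    static List<Integer> missingAttemptIds(Map<Integer, String> storedAttempts) {
        List<Integer> missing = new ArrayList<>();
        if (storedAttempts.isEmpty()) {
            return missing;
        }
        TreeMap<Integer, String> sorted = new TreeMap<>(storedAttempts);
        for (int id = sorted.firstKey(); id <= sorted.lastKey(); id++) {
            if (!sorted.containsKey(id)) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<Integer, String> stored = new HashMap<>();
        stored.put(3, "attempt3");
        stored.put(1, "attempt1");
        System.out.println(recoveryOrder(stored));
        System.out.println(missingAttemptIds(stored));
    }
}
```

For the case in the issue description (attempt1 and attempt3 stored, attempt2 
missing), the sorted order is 1 then 3, and attempt 2 is reported as the gap.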

{quote}
attempt.recoveredFinalStatus is being set to always to FAILED. These attempts 
might be KILLED/FINISHED also.
{quote}
These attempts might actually be KILLED, but we cannot be sure of that. 
If it is not reasonable to set the status to FAILED, how about adding another 
state (e.g. UNKNOWN)? My concern is that it would make things more complex.

{quote}
getNumFailedAppAttempts() is violated if attempt is failed to store since this 
attempt is removed from attempts. And also note that if attempts is failed to 
store, then many information such as getNumFailedAppAttempts also wont be exact 
number since attempt failure is taken from attempt.
{quote}
Yes, the number is not exact. I have not figured out a good way to solve this 
yet :(. Since RM HA failover does not happen often and removed attempts are 
kept in memory, it might be acceptable.

> RM might fail to restart when recovering apps whose attempts are missing
> ------------------------------------------------------------------------
>
>                 Key: YARN-4497
>                 URL: https://issues.apache.org/jira/browse/YARN-4497
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>            Priority: Critical
>         Attachments: YARN-4497.01.patch
>
>
> Found the following problem while discussing YARN-3480.
> If RM fails to store some attempts in RMStateStore, those attempts will be 
> missing from RMStateStore. For example, when storing attempt1, attempt2 and 
> attempt3, RM successfully stores attempt1 and attempt3 but fails to store 
> attempt2. When RM restarts, *RMAppImpl#recover* recovers the attempts one 
> by one, so in this case it recovers attempt1, then attempt2. When 
> recovering attempt2, it calls 
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, which first looks 
> up the attempt's ApplicationAttemptStateData. Since that data cannot be 
> found, recovery fails at *assert attemptState != null* 
> (*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
