Jun Gong created YARN-4497:
------------------------------
Summary: RM might fail to restart when recovering apps whose
attempts are missing
Key: YARN-4497
URL: https://issues.apache.org/jira/browse/YARN-4497
Project: Hadoop YARN
Issue Type: Bug
Reporter: Jun Gong
Assignee: Jun Gong
Found the following problem while discussing YARN-3480.
If the RM fails to store some attempts in RMStateStore, those attempts will be
missing from RMStateStore. For example, when storing attempt1, attempt2 and
attempt3, the RM successfully stores attempt1 and attempt3 but fails to store
attempt2.
When the RM restarts, *RMAppImpl#recover* recovers attempts one by one. In this
case it recovers attempt1, then attempt2. When recovering attempt2, it calls
*((RMAppAttemptImpl)this.currentAttempt).recover(state)*, which first looks up
the attempt's ApplicationAttemptStateData. Because that data cannot be found,
recovery fails at *assert attemptState != null* (*RMAppAttemptImpl#recover*,
line 880).
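A minimal, hypothetical model of the failure, assuming (as described above) that recovery walks through consecutive attempt ids, one per stored attempt. The class and method names (AttemptRecoverySketch, storedAttempts, recoverAll) are illustrative only and are not the real RMAppImpl/RMStateStore API; the explicit null check stands in for *assert attemptState != null*.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of the YARN-4497 recovery scenario.
// NOT the real RMAppImpl / RMStateStore API.
class AttemptRecoverySketch {

    // Stand-in for RMStateStore: attempt number -> stored attempt state
    // (playing the role of ApplicationAttemptStateData).
    static Map<Integer, String> storedAttempts(int... ids) {
        Map<Integer, String> store = new TreeMap<>();
        for (int id : ids) {
            store.put(id, "state-of-attempt" + id);
        }
        return store;
    }

    // Recovers one attempt per stored entry, assigning consecutive ids
    // 1, 2, ... as each new attempt is created. If an id in that sequence
    // was never stored, its state lookup returns null.
    static List<String> recoverAll(Map<Integer, String> store) {
        List<String> recovered = new ArrayList<>();
        int attemptCount = store.size();
        for (int id = 1; id <= attemptCount; id++) {
            String attemptState = store.get(id);
            if (attemptState == null) {
                // Stand-in for "assert attemptState != null".
                throw new IllegalStateException("missing state for attempt" + id);
            }
            recovered.add(attemptState);
        }
        return recovered;
    }

    public static void main(String[] args) {
        // attempt1 and attempt3 were stored; storing attempt2 failed.
        Map<Integer, String> store = storedAttempts(1, 3);
        try {
            recoverAll(store);
        } catch (IllegalStateException e) {
            System.out.println("recovery failed: " + e.getMessage());
        }
    }
}
```

Running main prints "recovery failed: missing state for attempt2": two attempts are stored, so the loop recovers attempt1 and then looks up attempt2, which was never stored.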
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)