[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077178#comment-14077178 ]
Jason Lowe commented on YARN-1354:
----------------------------------
Thanks for taking a look, Junping!
bq. what would happen if storeApplication(), finishApplication(),
removeApplication() failed with application related information get
inconsistent after restart?
If storeApplication fails then it will throw an IOException, which will bubble
up and fail the container start request on the client. As long as we're unable
to store a new application, containers for that application will not start,
which I believe is the desired behavior. That keeps the state store from
becoming inconsistent in this particular scenario.
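As a rough illustration of that flow (a minimal sketch only; StateStore,
NodeManager and startContainer here are hypothetical stand-ins, not the actual
classes or signatures in the patch), the store call happens before the
application is tracked, so a failure leaves nothing half-recorded:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class StoreApplicationSketch {
    /** Hypothetical stand-in for the NM state store; not the real API. */
    interface StateStore {
        void storeApplication(String appId, byte[] proto) throws IOException;
    }

    /** Hypothetical stand-in for the NM and its map of active applications. */
    static final class NodeManager {
        private final StateStore store;
        final Map<String, byte[]> activeApps = new HashMap<>();

        NodeManager(StateStore store) {
            this.store = store;
        }

        /** Persist first; only track the application if the store succeeded. */
        void startContainer(String appId, byte[] proto) throws IOException {
            if (!activeApps.containsKey(appId)) {
                // An IOException here propagates to the caller, so the
                // start request fails and the app is never tracked.
                store.storeApplication(appId, proto);
                activeApps.put(appId, proto);
            }
            // ... container launch would happen here ...
        }
    }

    public static void main(String[] args) {
        // A store that always fails, to simulate the error case.
        StateStore failing = (appId, proto) -> {
            throw new IOException("store unavailable");
        };
        NodeManager nm = new NodeManager(failing);
        try {
            nm.startContainer("application_1_0001", new byte[0]);
            System.out.println("container started");
        } catch (IOException e) {
            System.out.println("start request failed: " + e.getMessage());
        }
        System.out.println("tracked apps: " + nm.activeApps.size());
    }
}
```

The point of the ordering is that the in-memory view and the state store can
only disagree if the store write succeeded, never the other way around.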
If finishApplication fails then the NM will proceed as if it had succeeded, but
the state store will still have the application present. This should be
corrected when the NM restarts and registers with the RM with those
applications still running. The RM should correct the situation by telling the
NM that the application has finished (see YARN-1885), and the NM will proceed
to perform application finish processing (e.g., log aggregation). I think in
the worst case it will upload all of the app container logs again, but the
rename to the final destination name will fail because the name already
exists. Thus there could be some wasted work, but it should sort itself out
without doing anything catastrophic.
If removeApplication fails then the NM will proceed as if it had succeeded, but
the state store will still have the application present. This should be
corrected when the NM finishes application processing (per above, or if it was
already recorded as finished), at which point it will again try to remove the
application from the state store. As above, some unnecessary work could be
performed, but I think in the end the application should eventually be removed
from the NM on restart. It could still remain in the state store if the second
removal also fails, but a subsequent restart should behave the same way.
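To make the retry behavior concrete, here is a simplified sketch. It is not
the patch code: FlakyStore, finishProcessing and recover are hypothetical
names, and the RM round trip (re-registration, then the RM reporting the
application as finished) is collapsed into a single recover() call. The
removal is best-effort, and any application still in the store after a
restart goes through finish processing and removal again:

```java
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

public class RemoveApplicationSketch {
    /** Hypothetical in-memory state store whose removals can be made to fail. */
    static final class FlakyStore {
        final Set<String> apps = new TreeSet<>();
        int failuresLeft;

        void removeApplication(String appId) throws IOException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException("remove failed for " + appId);
            }
            apps.remove(appId);
        }
    }

    /** Best-effort removal: log the error and carry on as if it had succeeded. */
    static void finishProcessing(FlakyStore store, String appId) {
        // ... log aggregation and other finish work would happen here ...
        try {
            store.removeApplication(appId);
        } catch (IOException e) {
            System.err.println("Unable to remove " + appId
                + " from state store: " + e.getMessage());
        }
    }

    /** Simulated restart: every app still in the store is processed again. */
    static void recover(FlakyStore store) {
        // Iterate over a copy because successful removals mutate the set.
        for (String appId : new TreeSet<>(store.apps)) {
            finishProcessing(store, appId);
        }
    }

    public static void main(String[] args) {
        FlakyStore store = new FlakyStore();
        store.apps.add("application_1_0001");
        store.failuresLeft = 1;

        finishProcessing(store, "application_1_0001"); // removal fails, app lingers
        System.out.println("after first attempt: " + store.apps);

        recover(store); // restart retries the removal
        System.out.println("after restart: " + store.apps);
    }
}
```

If the removal fails again during recovery, the application simply stays in
the store and the next restart repeats the same steps.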
bq. Do we need special warning if get failed on deserializing credential here?
I'm not sure how credential processing is fundamentally all that different from
protocol buffer parsing, which could also fail. If the credentials can't be
read then we can't recover the application. Currently recovery errors are
fatal to NM startup. Do you have something specific in mind for handling the
credentials if the writable changes (e.g., some pseudocode to show the
approach)?
> Recover applications upon nodemanager restart
> ---------------------------------------------
>
> Key: YARN-1354
> URL: https://issues.apache.org/jira/browse/YARN-1354
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1354-v1.patch,
> YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch,
> YARN-1354-v4.patch, YARN-1354-v5.patch
>
>
> The set of active applications in the nodemanager context needs to be
> recovered for work-preserving nodemanager restart.