[
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037656#comment-14037656
]
Junping Du commented on YARN-1341:
----------------------------------
bq. I'm not sure I understand what you're requesting. Recovering the NM tokens
is one line of code (3 if we count the "if canRecover" part), and recovering
the container tokens in YARN-1342 will add one more line for that (inside the
same "if canRecover" block). I went ahead and factored this into a separate
method, however I'm not sure it matches what you were expecting as I don't see
where we're saving duplicated code. If what's in the updated patch isn't what
you expected, please provide some sample pseudo-code to demonstrate how we can
avoid duplication of code.
I think it is fine for now. However, I would like to refactor a bit on
NodeManager#serviceInit() when we finish all these recover work to avoid some
duplicate work, some code like: createNMContext(), we duplicated set some
handler. Anyway, we can do this later.
bq. The problem with throwing an exception is what to do with the exception –
do we take down the NM? That seems like a drastic answer since the NM will
likely chug along just fine without the key stored. It only becomes a problem
when the NM restarts and restores an old key. However if we rollback the old
key here then we take that only-breaks-if-we-happened-to-restart case and make
it an always-breaks scenario. Eventually the old key will no longer be valid to
the RM, and none of the AMs will be able to authenticate to the NM. Therefore I
thought it would be better to log the error, press onward, and hope we don't
restart before we store a valid key again (maybe store error was transient)
rather than either take down the NM or have things start failing even without a
restart
We already have similar tradeoff in RM side, if any exception happens in
RMStore then it will bring down RM. In NM case, if levelDB stop to work, I
think we should bring NM down to get rid of any inconsistent after NM restart.
Although I am not sure what weird things could happen in case of inconsistency
here, but considering it is cheaper to bring down NM, we should play more
safety in our case than RM. Actually, I bring up some thoughts on play more
risky in RM side at YARN-2019 which target to reduce RM service down time. But
here, I prefer to be safer. Jason, what do you think?
> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch,
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)