[ 
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037656#comment-14037656
 ] 

Junping Du commented on YARN-1341:
----------------------------------


bq. I'm not sure I understand what you're requesting. Recovering the NM tokens 
is one line of code (3 if we count the "if canRecover" part), and recovering 
the container tokens in YARN-1342 will add one more line for that (inside the 
same "if canRecover" block). I went ahead and factored this into a separate 
method, however I'm not sure it matches what you were expecting as I don't see 
where we're saving duplicated code. If what's in the updated patch isn't what 
you expected, please provide some sample pseudo-code to demonstrate how we can 
avoid duplication of code.
I think it is fine for now. However, once all of this recovery work is 
finished, I would like to refactor NodeManager#serviceInit() a bit to avoid 
some duplicated work; for example, around createNMContext() we set some 
handlers more than once. Anyway, we can do this later.
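
For reference, the "factored into a separate method" shape discussed above 
could look roughly like the sketch below. This is only an illustration, not 
the code in the patch: Recoverable, NodeManagerLike, register() and 
recoverTokens() are hypothetical names, and only the "if canRecover" guard 
comes from the discussion.

```java
import java.util.ArrayList;
import java.util.List;

public class RecoverySketch {
    /** Hypothetical stand-in for a secret manager that can restore saved state. */
    interface Recoverable {
        void recover();
    }

    /** Hypothetical stand-in for the NodeManager; not the real Hadoop class. */
    static class NodeManagerLike {
        private final boolean canRecover;
        private final List<Recoverable> secretManagers = new ArrayList<>();

        NodeManagerLike(boolean canRecover) {
            this.canRecover = canRecover;
        }

        void register(Recoverable r) {
            secretManagers.add(r);
        }

        void serviceInit() {
            // All recovery goes through one method, so there is a single guard.
            recoverTokens();
        }

        private void recoverTokens() {
            if (!canRecover) {
                return;
            }
            // Each additional token type adds one entry here, not a new guard.
            for (Recoverable r : secretManagers) {
                r.recover();
            }
        }
    }

    public static void main(String[] args) {
        NodeManagerLike nm = new NodeManagerLike(true);
        nm.register(() -> System.out.println("recovered NM tokens"));
        nm.register(() -> System.out.println("recovered container tokens"));
        nm.serviceInit();
    }
}
```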

bq. The problem with throwing an exception is what to do with the exception – 
do we take down the NM? That seems like a drastic answer since the NM will 
likely chug along just fine without the key stored. It only becomes a problem 
when the NM restarts and restores an old key. However if we rollback the old 
key here then we take that only-breaks-if-we-happened-to-restart case and make 
it an always-breaks scenario. Eventually the old key will no longer be valid to 
the RM, and none of the AMs will be able to authenticate to the NM. Therefore I 
thought it would be better to log the error, press onward, and hope we don't 
restart before we store a valid key again (maybe store error was transient) 
rather than either take down the NM or have things start failing even without a 
restart
We already have a similar tradeoff on the RM side: if any exception happens in 
the RMStore, it brings down the RM. In the NM case, if levelDB stops working, 
I think we should bring the NM down to avoid any inconsistency after an NM 
restart. I am not sure what weird things could happen in case of inconsistency 
here, but since bringing down an NM is cheaper than bringing down the RM, we 
should play it safer here than we do for the RM. Actually, I raised some 
thoughts on taking more risk on the RM side in YARN-2019, which aims to reduce 
RM service downtime. But here, I prefer to be safer. Jason, what do you think? 
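
To make the two options concrete, here is a minimal sketch of the tradeoff. 
It is not the actual NM or RM code: StateStore, SecretManager, Policy and 
setMasterKey() are hypothetical names used only for illustration. 
LOG_AND_CONTINUE is the behavior Jason describes (keep the new key in memory 
and log the store error); FAIL_FAST is what I am suggesting (treat a store 
failure as fatal).

```java
import java.io.IOException;

public class KeyStorePolicySketch {
    /** Hypothetical stand-in for the NM state store; not the real Hadoop API. */
    interface StateStore {
        void storeMasterKey(int keyId) throws IOException;
    }

    enum Policy { LOG_AND_CONTINUE, FAIL_FAST }

    /** Keeps the current key in memory and applies a policy when persisting fails. */
    static class SecretManager {
        private final StateStore store;
        private final Policy policy;
        private int currentKeyId = -1;

        SecretManager(StateStore store, Policy policy) {
            this.store = store;
            this.policy = policy;
        }

        void setMasterKey(int keyId) {
            // The new key always takes effect in memory; it is never rolled back.
            currentKeyId = keyId;
            try {
                store.storeMasterKey(keyId);
            } catch (IOException e) {
                if (policy == Policy.FAIL_FAST) {
                    // The caller is expected to shut the service down on this.
                    throw new IllegalStateException("state store failed", e);
                }
                // Otherwise log and hope a later store succeeds before a restart.
                System.err.println("Unable to store key " + keyId + ": " + e.getMessage());
            }
        }

        int getCurrentKeyId() {
            return currentKeyId;
        }
    }

    public static void main(String[] args) {
        StateStore broken = keyId -> { throw new IOException("disk error"); };

        SecretManager lenient = new SecretManager(broken, Policy.LOG_AND_CONTINUE);
        lenient.setMasterKey(7);
        System.out.println("lenient keeps running with key " + lenient.getCurrentKeyId());

        SecretManager strict = new SecretManager(broken, Policy.FAIL_FAST);
        try {
            strict.setMasterKey(8);
        } catch (IllegalStateException e) {
            System.out.println("strict stops: " + e.getMessage());
        }
    }
}
```

With LOG_AND_CONTINUE the in-memory key and the stored key can differ until 
the next successful store, which is the inconsistency that only matters if 
the NM restarts in between.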
 

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
