[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062861#comment-14062861 ]

Jason Lowe commented on YARN-1341:
----------------------------------

Thanks for commenting, Devaraj!  My apologies for the late reply, as I was on 
vacation and am still catching up.

bq. In addition to option 1), I'd think of bringing the NM down if the NM 
fails to store RM keys a certain (configurable) number of times consecutively.

As for retries, I mentioned earlier that if retries are likely to help then 
the state store implementation should perform them rather than the common 
code.  For the leveldb implementation a retry is very unlikely to do anything 
other than make the operation take longer to ultimately fail.  The drive's 
firmware already implements a large number of retries to attempt to recover 
from hardware errors, and non-hardware local filesystem errors are highly 
unlikely to be fixed by simply retrying immediately.  If retrying helped, I'd 
expect retries to be implemented in many other places where Hadoop code uses 
the local filesystem.
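
To illustrate what I mean, here's a rough sketch of a store implementation 
that retries internally; the class, method, and field names are invented for 
this example and are not the actual NMStateStoreService API:

{code:java}
import java.io.IOException;

/** Hypothetical sketch -- the names here are invented for illustration
 *  and are not the real NMStateStoreService API. */
abstract class RetryingStoreSketch {
  static final String CURRENT_MASTER_KEY = "CurrentMasterKey";
  // A store backed by a flaky remote service might retry; a leveldb-backed
  // store would want 1 here and fail fast for the reasons above.
  int maxAttempts = 3;

  abstract void putToStore(String key, byte[] value) throws IOException;

  public void storeMasterKey(byte[] keyBytes) throws IOException {
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        putToStore(CURRENT_MASTER_KEY, keyBytes);
        return;  // stored successfully, nothing to retry
      } catch (IOException e) {
        lastFailure = e;  // remember the failure and try again
      }
    }
    throw lastFailure;
  }
}
{code}

The point is that the retry policy lives entirely inside the implementation 
that knows whether retrying can help, not in the common recovery code.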

bq. And also we can make it (i.e. tear down the NM or not) configurable

I'd like to avoid adding yet more config options unless we think we really 
need them, but if people agree this needs to be configurable then we can do 
so.  Also, I assume in that scenario you would want the NM to shut down while 
also tearing down containers, cleaning up, etc. as if it didn't support 
recovery.  Tearing down the NM on a state store error just to have it start 
up again and try to recover with stale state seems pointless -- we might as 
well have just kept it running, which is a better outcome.  Or am I missing a 
use case for that?
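
If we did decide to add it, the knob could look something like the sketch 
below; the config key and the handler names are purely hypothetical, invented 
here for illustration:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Purely hypothetical sketch: neither the config key nor this handler
 *  exists in YARN; the names are invented for illustration. */
class StoreErrorPolicySketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(StoreErrorPolicySketch.class);
  // Hypothetical key -- not a real YARN configuration property.
  static final String ABORT_ON_STORE_ERROR =
      "yarn.nodemanager.recovery.abort-on-store-error";

  private final Configuration conf;

  StoreErrorPolicySketch(Configuration conf) {
    this.conf = conf;
  }

  void onStateStoreError(Exception cause) {
    if (conf.getBoolean(ABORT_ON_STORE_ERROR, false)) {
      // Behave as if recovery were unsupported: tear down containers and
      // clean up before exiting, rather than restarting into stale state.
      tearDownContainersAndExit();
    } else {
      // Keep running: a live NM with current in-memory state beats a
      // restart that recovers stale state from a broken store.
      LOG.error("State store error; continuing without recovery", cause);
    }
  }

  private void tearDownContainersAndExit() {
    // Placeholder for the teardown-and-exit path described above.
  }
}
{code}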


And thanks, Junping, for the recent comments!

bq. If you also agree on this, we can separate this documentation effort into 
another JIRA (an umbrella or a dedicated one, whatever you like) and continue 
the discussion on this particular case.

Sure, we can discuss general error handling or an overall document for it 
either on YARN-1336 or a new JIRA.

bq. a. if currentMasterKey is stale, it will be updated and overridden soon 
when the NM registers with the RM later. Nothing is affected.
Correct, the NM should receive the current master key upon re-registration with 
the RM after it restarts.

bq. b. if previousMasterKey is stale, then the real previous master key is 
lost, so the effect is: AMs with the real previous master key cannot connect 
to the NM to launch containers.
AMs that have the current master key will still be able to connect because the 
NM just got the current master key as described in a).  AMs that have the 
previous master key will not be able to connect to the NM unless that 
particular master key also happened to be successfully associated with the 
attempt in the state store (related to case c).

bq. c. if applicationMasterKeys are stale, then the old keys tracked in 
applicationMasterKeys are lost after restart. The effect is: AMs with old 
keys cannot connect to the NM to launch containers.
AMs that use an old key (i.e., not the current or previous master key) would 
be unable to connect to the NM.

bq. Anything I am missing here?
I don't believe so.  The bottom line is that an AM may not be able to 
successfully connect to an NM after a restart with stale NM token state.
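
Putting a) through c) together, the NM's token check effectively walks the 
current key, then the previous key, then the per-attempt keys.  The sketch 
below only illustrates that lookup order; the names are invented and do not 
match the real NMTokenSecretManagerInNM code:

{code:java}
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the lookup order in cases a) through c); the
 *  names are invented and not the real NMTokenSecretManagerInNM code. */
class NMTokenKeyLookupSketch {
  Integer currentKeyId;    // case a: refreshed on re-registration with the RM
  Integer previousKeyId;   // case b: recovered (possibly stale) from the store
  Map<String, Integer> attemptKeyIds = new HashMap<>();  // case c: per attempt

  /** Returns true if a token signed with tokenKeyId would be accepted. */
  boolean isKnownKey(String appAttemptId, int tokenKeyId) {
    if (currentKeyId != null && tokenKeyId == currentKeyId) {
      return true;  // case a: always fresh, so never hurt by stale state
    }
    if (previousKeyId != null && tokenKeyId == previousKeyId) {
      return true;  // case b: the wrong key sits here if the store write
                    // was lost, so AMs holding the real previous key fail
    }
    Integer attemptKey = attemptKeyIds.get(appAttemptId);
    return attemptKey != null && tokenKeyId == attemptKey;  // case c
  }
}
{code}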

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>



