[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039138#comment-14039138 ]

Jason Lowe commented on YARN-1341:
----------------------------------

bq. The worst case seems to me to be: the NM restarts with partial state 
recovered, and this inconsistent state is not visible to the running 
containers, which could bring some weird bugs.

Yes, you're correct.  The worst case is likely where we come up, fail to 
realize a container is running, and therefore the container "leaks."

I think we should handle store errors on a case-by-case basis, based on the 
ramifications of how the system will recover without that information.  For 
containers, a container should fail to launch if a state store error occurs 
while recording the container request and/or the container launch.  The 
YARN-1336 prototype patch already does this when containers are requested and 
in ContainerLaunch.  That way the worst-case scenario for a container is that 
we throw an error for the container request or the container fails before 
launch due to a state store error.  We failed to launch a container, but the 
whole NM doesn't go down.  If we fail to mark the container completed in the 
store then the worst-case scenario is that we try to recover a container that 
isn't there, which again will mark the container as failed and we'll report 
that to the RM.  If the RM doesn't know about the failed container (because 
the container/app is long gone) then it will just ignore it.
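
To make that concrete, here's a rough sketch of what I mean (hypothetical 
ContainerStateStore/launch names, not the actual YARN-1336 code), where a 
store error fails only the one container instead of the whole NM:

{code:java}
import java.io.IOException;

// Hypothetical interface for illustration only, not the real NM state store API.
interface ContainerStateStore {
  void storeContainerLaunched(String containerId) throws IOException;
}

class ContainerLaunchSketch {
  private final ContainerStateStore store;

  ContainerLaunchSketch(ContainerStateStore store) {
    this.store = store;
  }

  // Record the launch before actually launching.  If the store write fails,
  // only this container fails; the NM itself keeps running.
  boolean launch(String containerId) {
    try {
      store.storeContainerLaunched(containerId);
    } catch (IOException e) {
      System.err.println("Failed to record launch of " + containerId
          + ", failing the container: " + e);
      return false;
    }
    // ... proceed with the real launch ...
    return true;
  }
}
{code}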

For the deletion service, if we fail to update the store then we may fail to 
delete something when we recover, if we happened to restart in between.  If we 
ignore the error then it's very likely the NM will _not_ restart before the 
deletion time expires and the file is deleted.  However, if we tear down the 
NM on a store error then we will also fail to delete it when the NM restarts 
later, since we failed to record it, meaning we made things purely worse -- we 
lost work _and_ leaked the thing we were supposed to delete.  Therefore for 
deletion tasks I think the current behavior is appropriate.
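
Roughly, the deletion path would just log and keep going on a store error 
(again a hypothetical sketch, not the real DeletionService code):

{code:java}
import java.io.IOException;

// Hypothetical interface for illustration only.
interface DeletionStateStore {
  void storeDeletionTask(int taskId, String path) throws IOException;
}

class DeletionTaskSketch {
  // Ignore store failures: the NM will very likely stay up until the deletion
  // runs anyway, and aborting the NM here would lose work *and* still leak the
  // path after a restart, since nothing got recorded.
  void schedule(DeletionStateStore store, int taskId, String path) {
    try {
      store.storeDeletionTask(taskId, path);
    } catch (IOException e) {
      System.err.println("Could not persist deletion task " + taskId + ": " + e);
    }
    // ... queue the deletion to run at its scheduled time ...
  }
}
{code}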

For localized resources, failing to update the store means we could end up 
leaking a resource or thinking a resource is there when it's really not.  The 
latter isn't a huge problem because when we try to reference the resource 
again it checks if it's there, and if it isn't it re-localizes it.  Not 
knowing a resource is there is a bigger issue, and there are a couple of ways 
to tackle that one -- either fail the localization of the resource when the 
state store error occurs or have the NM scan the local resource directories 
for "unknown" resources when it recovers.
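
The scan-on-recovery option could look something like this (hypothetical 
sketch, names made up):

{code:java}
import java.io.File;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the second option: after recovery, walk a local
// resource directory and flag anything the recovered state doesn't know about
// as "unknown" so it can be cleaned up rather than leaked.
class LocalResourceScanSketch {
  Set<File> findUnknownResources(File localDir, Set<String> recoveredPaths) {
    Set<File> unknown = new HashSet<>();
    File[] entries = localDir.listFiles();
    if (entries == null) {
      return unknown;
    }
    for (File entry : entries) {
      if (!recoveredPaths.contains(entry.getAbsolutePath())) {
        unknown.add(entry);
      }
    }
    return unknown;
  }
}
{code}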

For the RM master key, I see it as very similar to the deletion task case.  If 
we fail to store it then the NM will update it in memory and can keep going.  
If we restart without recovering an older key (the current key will be 
obtained when the NM re-registers with the RM) then we may fail to let AMs 
connect that only have an older key.  Containers that were still on the NM 
will still continue.  If we take down the NM when the store hiccups then we 
lose work, which seems worse than the possibility that an AM could fail to 
connect to the NM (which can and does already happen today due to network 
cuts, etc.).
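
In other words, something along these lines (hypothetical sketch, not the 
real token secret manager code):

{code:java}
import java.io.IOException;

// Hypothetical interface for illustration only.
interface KeyStateStore {
  void storeMasterKey(byte[] key) throws IOException;
}

class MasterKeySketch {
  private byte[] currentKey;

  // Apply the new key in memory even if persisting it fails.  Worst case after
  // a restart is that an AM holding only an older key can't connect, which can
  // already happen today; killing the NM here would lose running work instead.
  void updateMasterKey(KeyStateStore store, byte[] newKey) {
    try {
      store.storeMasterKey(newKey);
    } catch (IOException e) {
      System.err.println("Could not persist NM master key: " + e);
    }
    this.currentKey = newKey;
  }

  byte[] getCurrentKey() {
    return currentKey;
  }
}
{code}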

bq.  Maybe we add some stale tag on the NMStateStore, mark it when a store 
failure happens, and never load a stale store.

If we had an error storing then we're likely to have the same error trying to 
store a stale tag, or am I misunderstanding the proposal?  Also, as I 
mentioned above, there are many cases where a partial recovery isn't a bad 
thing, as the system can recover via other means (e.g., trying to recover a 
container that already completed should be benign, trying to delete a 
container directory that's already deleted is benign, etc.).

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>



