Junping Du commented on YARN-1341:

bq.  Restarts should be rare, and I'd rather not force a loss of work by taking 
the NM down instantly when the state store hiccups.
Yes. But considering the rolling upgrade case, restarts should be much more 
frequent than state store failures (correct me if I am wrong, as I am not a 
LevelDB expert). In that case we should always expect some work loss: even if 
we don't bring the NM down now, we will suffer after the NM restarts during the 
upgrade.

bq.  If the state store is missing some things, we might not be able to recover 
a localized resource, a token, a container, or possibly anything at all.
I am not worried about losing them all, but if we can only partially recover 
them, would that become a problem and break some assumptions we have? I don't 
know. But this seems to make things more complicated.

bq.  in the worst-case, the state store is so corrupted on startup that we 
don't even survive the NM restart and the NM crashes, which would have an end 
result just like if we took it down when the state store failed.
I am not sure this is the worst case. The worst case, it seems to me, is that 
the NM restarts with only partial state recovered and the running containers 
are not aware of this inconsistent state, which could lead to some weird bugs. 
I am not sure how likely that is to happen here, please 

bq.  Therefore I'd rather not guarantee that we'll lose work by crashing the NM 
on any store error and instead try to preserve the work we have. The NM could 
theoretically recover (e.g.: if the error is transient then the next RM key 
store could succeed). If we take the NM down immediately then we're 
guaranteeing the work is lost. Is that really better?
I think it is better to guarantee that the work gets lost, because the user's 
expectation is then consistent. We don't know when a new token from the RM will 
arrive to refresh the stale one, so preserving the work would succeed only by 
luck. Users shouldn't expect work to still be preserved after an NM restart if 
the state store has failed at some point.

bq. May be a better approach is to have errors like this trigger an unhealthy 
state for the NM when we have the ability to do a graceful decommission. 
I agree. This could be a better approach.

Overall, I agree that we can just log the error here without bringing the NM 
down (otherwise we would have to change the previous code that updates 
localizedResources/deletionServices), for the reasons you specified above. 
However, to avoid loading inconsistent state and to manage users' expectations, 
I think we shouldn't allow the state to be loaded again if a store failure has 
occurred before. Maybe we could add a stale tag to NMStateStore, mark it when a 
store failure happens, and never load a stale store. [~jlowe], what do you 
think?
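
A minimal sketch of the stale-tag idea, to make the proposal concrete. Every 
name here (StaleTagSketch, FlakyStore, StaleTag, GuardedStateStore) is 
hypothetical and is not part of the existing NM state store code. It also 
assumes the tag can be kept durably somewhere outside the failing store, which 
is an open question; an in-memory boolean stands in for that here.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: none of these classes exist in Hadoop.
class StaleTagSketch {

    /** Stand-in for the persistent store; writes can be made to fail. */
    static class FlakyStore {
        private final Map<String, String> data = new HashMap<>();
        boolean failWrites = false;

        void put(String key, String value) throws IOException {
            if (failWrites) {
                throw new IOException("simulated store failure");
            }
            data.put(key, value);
        }

        Map<String, String> loadAll() {
            return new HashMap<>(data);
        }
    }

    /** Stand-in for a stale tag assumed to be kept durably outside the store. */
    static class StaleTag {
        private boolean stale = false;

        void mark() {
            stale = true;
        }

        boolean isStale() {
            return stale;
        }
    }

    /** Logs store errors instead of crashing, but refuses to recover a stale store. */
    static class GuardedStateStore {
        private final FlakyStore store;
        private final StaleTag tag;

        GuardedStateStore(FlakyStore store, StaleTag tag) {
            this.store = store;
            this.tag = tag;
        }

        /** Keeps the NM running on failure, but records that the store is incomplete. */
        void storeQuietly(String key, String value) {
            try {
                store.put(key, value);
            } catch (IOException e) {
                tag.mark();
                System.err.println("State store error, marking stale: " + e.getMessage());
            }
        }

        /** On restart: recover nothing if any earlier store attempt failed. */
        Map<String, String> recover() {
            if (tag.isStale()) {
                return new HashMap<>();
            }
            return store.loadAll();
        }
    }
}
```

With this shape, running work is kept while the NM stays up, and a restart 
after any store failure starts from a clean state instead of a partial one.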

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch

This message was sent by Atlassian JIRA
