Junping Du commented on YARN-1341:
bq. Restarts should be rare, and I'd rather not force a loss of work by taking
the NM down instantly when the state store hiccups.
Yes. But in the rolling upgrade case, restarts should be much more frequent
than state store failures (correct me if I am wrong, as I am not a LevelDB
expert). In that case we should always expect some work loss: even if we
don't bring the NM down now, we will suffer after the NM restarts during the
upgrade.
bq. If the state store is missing some things, we might not be able to recover
a localized resource, a token, a container, or possibly anything at all.
I am not worried about losing them all, but if we can only partially recover
them, would that become a problem and break some assumptions we have? I don't
know, but it seems to make things more complicated.
bq. in the worst-case, the state store is so corrupted on startup that we
don't even survive the NM restart and the NM crashes, which would have an end
result just like if we took it down when the state store failed.
I am not sure this is the worst case. The worst case, it seems to me, is that
the NM restarts with only partial state recovered, and the running containers
are not aware of this inconsistent state, which could cause some weird bugs. I
am not sure how likely that is to happen here, please
bq. Therefore I'd rather not guarantee that we'll lose work by crashing the NM
on any store error and instead try to preserve the work we have. The NM could
theoretically recover (e.g.: if the error is transient then the next RM key
store could succeed). If we take the NM down immediately then we're
guaranteeing the work is lost. Is that really better?
I think it is better to guarantee that the work gets lost, because the user's
expectation is then consistent. We don't know when a new token from the RM
will arrive to replace the stale one, so preserving the work would succeed only
by luck. Users shouldn't expect work to be preserved after an NM restart if the
state store has failed at some point.
bq. May be a better approach is to have errors like this trigger an unhealthy
state for the NM when we have the ability to do a graceful decommission.
I agree. This could be a better approach.
Overall, I agree that we can keep just logging the error here without bringing
the NM down (otherwise we would have to change the previous code that updates
localizedResources/deletionServices), for the reasons you specified above.
However, to avoid loading inconsistent state and to manage users' expectations,
I think we shouldn't allow the state to be loaded again if a store failure has
occurred earlier. Maybe we could add a stale tag to NMStateStore, set it when a
store failure happens, and never load a stale store. [~jlowe], what do you
think?
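
To make the stale-tag idea concrete, here is a minimal sketch. It is only an
illustration under assumptions: the class StaleAwareStateStore and all of its
methods are hypothetical and are not the actual NMStateStore API, and the
in-memory map stands in for the real LevelDB-backed store. In a real NM the
stale marker would also have to survive the restart, which this sketch does
not attempt.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a state store that remembers any write failure
// and refuses to hand back possibly inconsistent state afterwards.
class StaleAwareStateStore {
    // Stand-in for the persistent store (assumption: simple key/value data).
    private final Map<String, String> entries = new HashMap<>();
    // Set once any store operation fails; never cleared in this sketch.
    private boolean stale = false;
    // Hook used only to simulate a store error in this example.
    private boolean failNextWrite = false;

    void setFailNextWrite(boolean fail) {
        this.failNextWrite = fail;
    }

    // Store an entry. On failure, log and mark the store stale instead of
    // taking the NM down, so running work is not killed immediately.
    void store(String key, String value) {
        try {
            doWrite(key, value);
        } catch (IOException e) {
            stale = true;
            System.err.println("State store error, marking store stale: "
                + e.getMessage());
        }
    }

    private void doWrite(String key, String value) throws IOException {
        if (failNextWrite) {
            failNextWrite = false;
            throw new IOException("simulated store failure");
        }
        entries.put(key, value);
    }

    boolean isStale() {
        return stale;
    }

    // Recovery path: a stale store is never loaded, so the NM starts with
    // empty state rather than with partially recovered state.
    Map<String, String> load() {
        if (stale) {
            return new HashMap<>();
        }
        return new HashMap<>(entries);
    }
}
```

With this shape, a failed store call does not bring the NM down, but recovery
after a restart would be all-or-nothing: either the complete state is loaded or
none of it is.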
> Recover NMTokens upon nodemanager restart
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch,
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch