[
https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489295#comment-15489295
]
Junping Du commented on YARN-3359:
----------------------------------
Thanks [~gtCarrera9] for the reply.
bq. The only challenge is when two or more collectors for the same
application get launched (because of some cluster partition, for example).
Therefore the RM needs to keep a version number for collectors, so that when
rebuilding app-to-collector mappings, it knows which collectors are stale and
which one is active.
That's a reasonable concern. Actually, there are more cases to cover in the
life cycle management of collectors. Besides the duplicate-collector race
condition caused by a cluster partition, other cases that leave a collector
inactive include:
- When the NM stops, either expectedly (a work-preserving restart) or
unexpectedly, the collector shuts down with it because it runs as part of an
auxiliary service.
- The collector fails while the NM is still alive, due to a thread or logic
issue.
We should handle collector failure detection and relaunch somewhere for these
cases.
Of course, these failure detection and relaunch efforts will have the side
effect of creating more duplicate-collector cases. Maybe we should consider
separating these issues out into a dedicated JIRA and designing them as a
whole. [~vinodkv], what do you think?
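To make the version-number idea quoted above concrete, here is a minimal sketch of how stale and active collectors could be told apart while the app-to-collector mapping is rebuilt. It assumes every collector report carries an application id, an address, and a version number, and that the highest version wins. CollectorRegistry, CollectorInfo, report() and activeAddress() are hypothetical names for illustration only, not existing YARN classes or methods.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: CollectorRegistry and its methods are illustrative
// names, not part of the YARN code base.
class CollectorRegistry {

    // Address of a collector plus the version number attached to it.
    static final class CollectorInfo {
        final String address;
        final long version;

        CollectorInfo(String address, long version) {
            this.address = address;
            this.version = version;
        }
    }

    // appId -> collector currently believed to be active.
    private final Map<String, CollectorInfo> active = new HashMap<>();

    // Records a collector reported for an application, for example while the
    // mapping is rebuilt after an RM failover. Returns true if the reported
    // collector is the active one afterwards, false if it is stale.
    synchronized boolean report(String appId, String address, long version) {
        CollectorInfo current = active.get(appId);
        if (current != null && version <= current.version) {
            // Same or older version: only the already-known collector counts.
            return version == current.version
                && address.equals(current.address);
        }
        // First report for this app, or a strictly newer version: replace.
        active.put(appId, new CollectorInfo(address, version));
        return true;
    }

    // Returns the active collector address for the application, or null.
    synchronized String activeAddress(String appId) {
        CollectorInfo info = active.get(appId);
        return info == null ? null : info.address;
    }
}
```

With this rule the order in which reports arrive does not matter: a late report from an older collector is rejected, and a newer collector replaces the recorded one. How versions are assigned and kept monotonic across an RM failover is left open here and would be part of the design.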
> Recover collector list in RM failed over
> ----------------------------------------
>
> Key: YARN-3359
> URL: https://issues.apache.org/jira/browse/YARN-3359
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Li Lu
> Labels: YARN-5355
>
> Per discussion in YARN-3039, split the recovery work from RMStateStore into a
> separate JIRA.