[jira] [Commented] (YARN-3359) Recover collector list in RM failed over

Junping Du (JIRA) Tue, 13 Sep 2016 20:45:57 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489295#comment-15489295
 ]


Junping Du commented on YARN-3359:
----------------------------------

Thanks [~gtCarrera9] for the reply.
bq. The only challenge is when two or more collectors for the same 
application got launched (because of some cluster partition, for example). 
Therefore the RM needs to keep a version number for collectors, so that when 
rebuilding app to collector mappings, it knows which collectors are stale and 
which one is active.
That's a reasonable concern. Actually, there are more cases for life cycle 
management of collectors. Besides the duplicated-collector race condition 
caused by a cluster partition, other cases that can leave a collector 
non-active include:
- When the NM stops, whether expectedly (a work-preserving restart) or 
unexpectedly, the collector is shut down because it is part of an auxiliary 
service.
- Collector failure while the NM is still alive, due to a thread/logic issue.
We should take care of collector failure detection and relaunch somewhere in 
these cases. 
Of course, these failure detection/relaunch efforts will have the side effect 
of causing more duplicated-collector cases. Maybe we should consider separating 
these issues out into a dedicated JIRA with a design as a whole. [~vinodkv], 
what do you think?
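
For illustration only, a minimal sketch of the version-number idea quoted 
above: when collectors are re-reported after an RM failover, the RM keeps the 
highest-versioned collector per application and treats older ones as stale. 
The class and method names (CollectorRegistry, report, activeAddress) and the 
assumption that versions increase monotonically per application are 
hypothetical and are not taken from the YARN code base.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names exist in YARN.
class CollectorRegistry {

    // One known collector for an application: its address plus a version.
    // Assumption: a newer collector always carries a larger version.
    static final class CollectorInfo {
        final String address;
        final long version;

        CollectorInfo(String address, long version) {
            this.address = address;
            this.version = version;
        }
    }

    // appId -> collector currently considered active
    private final Map<String, CollectorInfo> activeByApp = new HashMap<>();

    // Called when a collector is (re-)reported, e.g. while rebuilding the
    // app-to-collector mapping. Returns true if the reported collector is
    // the active one afterwards, false if it is stale.
    synchronized boolean report(String appId, String address, long version) {
        CollectorInfo current = activeByApp.get(appId);
        if (current == null || version > current.version) {
            // First report for this app, or a newer collector: it wins.
            activeByApp.put(appId, new CollectorInfo(address, version));
            return true;
        }
        // Same version is only accepted from the collector already recorded;
        // anything else is an older or duplicate collector.
        return version == current.version && address.equals(current.address);
    }

    // Address of the active collector for the app, or null if none is known.
    synchronized String activeAddress(String appId) {
        CollectorInfo current = activeByApp.get(appId);
        return current == null ? null : current.address;
    }
}
```

With this shape, a stale collector that reports late (for example after a 
partition heals) is rejected instead of overwriting the active one. Failure 
detection and relaunch would still need to be handled separately.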

> Recover collector list in RM failed over
> ----------------------------------------
>
>                 Key: YARN-3359
>                 URL: https://issues.apache.org/jira/browse/YARN-3359
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: YARN-5355
>
> Per discussion in YARN-3039, split the recovery work from RMStateStore into a 
> separate JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
