[ https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489295#comment-15489295 ]
Junping Du commented on YARN-3359:
----------------------------------

Thanks [~gtCarrera9] for the reply.

bq. The only challenge is when two or more than two collectors for the same application got launched (because of some cluster partition, for example). Therefore the RM needs to keep a version number for collectors, so that when rebuilding app to collector mappings, it knows which collectors are stale and which one is active.

That's a reasonable concern. Actually, there are more cases to cover in the life cycle management of collectors. Besides the duplicated-collector race condition caused by a cluster partition, other cases that leave a collector non-active include:
- When the NM stops, either expectedly (a work-preserving restart) or unexpectedly, the collector is shut down because it is part of an auxiliary service.
- The collector fails while the NM is still alive, due to a thread or logic issue.

We should take care of collector failure detection and relaunch somewhere for these cases. Of course, these failure detection and relaunch efforts will have the side effect of causing more duplicated-collector cases. Maybe we should separate these issues out into a dedicated JIRA and design them as a whole. [~vinodkv], what do you think?

> Recover collector list in RM failed over
> ----------------------------------------
>
>                 Key: YARN-3359
>                 URL: https://issues.apache.org/jira/browse/YARN-3359
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: YARN-5355
>
> Per discussion in YARN-3039, split the recover work from RMStateStore in a
> separated JIRA.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
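
For illustration only, the versioning idea quoted in the comment (the RM keeps a version number per collector so it can tell stale collectors from the active one when rebuilding the app-to-collector mapping) could look roughly like the sketch below. Everything here is an assumption made for the example: the class CollectorRegistry, its methods, the address strings, and the rule that a higher version always wins are hypothetical and are not part of the YARN ResourceManager API or of any patch on this issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of an app-to-collector mapping that uses a version
 * number to discard stale collectors. Not an actual YARN class.
 */
final class CollectorRegistry {

  /** A collector address paired with the version it was launched with. */
  private static final class CollectorInfo {
    final String address;
    final long version;

    CollectorInfo(String address, long version) {
      this.address = address;
      this.version = version;
    }
  }

  // Active collector per application id, rebuilt from incoming reports.
  private final Map<String, CollectorInfo> active = new ConcurrentHashMap<>();

  /**
   * Records a collector reported for an application (assumed to arrive from
   * a node after fail-over). A report replaces the current entry only if its
   * version is strictly higher.
   *
   * @return true if the reported collector is the active one afterwards,
   *         false if it was rejected as stale
   */
  boolean report(String appId, String address, long version) {
    CollectorInfo candidate = new CollectorInfo(address, version);
    // merge() stores the candidate when the key is absent; otherwise the
    // function picks between the current and the incoming entry atomically.
    CollectorInfo winner = active.merge(appId, candidate,
        (current, incoming) ->
            incoming.version > current.version ? incoming : current);
    return winner.version == version && winner.address.equals(address);
  }

  /** Returns the active collector address, or null if none is known. */
  String activeAddress(String appId) {
    CollectorInfo info = active.get(appId);
    return info == null ? null : info.address;
  }
}
```

In this sketch a relaunched collector would need to be given a higher version than its predecessor, so a duplicate left over from a partition or an NM restart is rejected when it reports in late. How versions are issued and persisted across RM fail-over is left open here, as it is in the discussion.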