[
https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485225#comment-15485225
]
Li Lu commented on YARN-3359:
-----------------------------
I had an offline discussion with [~vinodkv] about this issue. We cannot
simply persist collector state in the RM state store, since this state is not
final and updating it frequently would block the RM. A natural alternative is
the NM state store. That is to say, we can rebuild the RM's collector table
from updates sent by the NMs. In summary, we need to do the following:
For NMs:
1. On collector launch, persist the collector address in the NM state store.
2. On collector removal, remove the corresponding entry from the state store.
3. On startup, recover collector addresses from the state store.
4. On resync, send the current collector address mapping to the RM.
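The four NM steps above can be sketched roughly as follows. This is only an
illustration under stated assumptions: CollectorStateStore,
InMemoryCollectorStateStore and NmCollectorTracker are hypothetical names made
up for this sketch, not existing Hadoop classes, and an in-memory map stands in
for the real NM state store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the NM state store; not a real Hadoop interface. */
interface CollectorStateStore {
  void storeCollector(String appId, String address);
  void removeCollector(String appId);
  Map<String, String> loadCollectors();
}

/** In-memory implementation, used only to make the sketch runnable. */
class InMemoryCollectorStateStore implements CollectorStateStore {
  private final Map<String, String> persisted = new HashMap<>();

  public synchronized void storeCollector(String appId, String address) {
    persisted.put(appId, address);
  }

  public synchronized void removeCollector(String appId) {
    persisted.remove(appId);
  }

  public synchronized Map<String, String> loadCollectors() {
    return new HashMap<>(persisted);
  }
}

/** Hypothetical NM-side tracker covering the four steps listed above. */
class NmCollectorTracker {
  private final CollectorStateStore store;
  private final Map<String, String> live = new ConcurrentHashMap<>();

  NmCollectorTracker(CollectorStateStore store) {
    this.store = store;
  }

  // Step 1: persist the address before exposing it in memory.
  void onCollectorLaunched(String appId, String address) {
    store.storeCollector(appId, address);
    live.put(appId, address);
  }

  // Step 2: drop the entry from both the store and the in-memory view.
  void onCollectorRemoved(String appId) {
    store.removeCollector(appId);
    live.remove(appId);
  }

  // Step 3: on NM startup, reload whatever was persisted.
  void recover() {
    live.clear();
    live.putAll(store.loadCollectors());
  }

  // Step 4: on resync, hand the RM a snapshot of the current mapping.
  Map<String, String> buildResyncReport() {
    return new HashMap<>(live);
  }
}
```

Writing to the store before updating the in-memory map means a crash between
the two steps leaves an entry that recovery can still pick up, rather than a
live collector that was never persisted.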
For RMs, the only change needed is to rebuild the collector/address mapping
upon restart. This involves a messy corner case: when one application has two
different attempts running (due to network problems, for example) and the RM
is rebuilding collector state, the RM needs to know which collector belongs to
the latest app attempt and which belongs to the stale one. This requires
changes to collector IDs. Right now each collector is mapped to an app ID, but
to handle the state recovery case we need to associate each collector with an
attempt ID (and ideally a timestamp to further distinguish collectors).
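A minimal sketch of how the RM could pick the right collector while rebuilding
its table, assuming each NM report carries an attempt ID and a timestamp as
proposed above. CollectorInfo and RmCollectorTable are hypothetical names for
illustration only, and the ordering rule (higher attempt ID wins, timestamp
breaks ties) is an assumption rather than a settled design.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical collector record keyed by attempt; not a real Hadoop type. */
final class CollectorInfo {
  final String appId;
  final int attemptId;
  final long timestamp;
  final String address;

  CollectorInfo(String appId, int attemptId, long timestamp, String address) {
    this.appId = appId;
    this.attemptId = attemptId;
    this.timestamp = timestamp;
    this.address = address;
  }

  // Assumed ordering: a later attempt wins; the timestamp breaks ties.
  boolean isNewerThan(CollectorInfo other) {
    if (attemptId != other.attemptId) {
      return attemptId > other.attemptId;
    }
    return timestamp > other.timestamp;
  }
}

/** Hypothetical RM-side table rebuilt from NM resync reports after restart. */
class RmCollectorTable {
  private final Map<String, CollectorInfo> byApp = new HashMap<>();

  // Merge one NM's report, keeping only the newest collector per application.
  synchronized void onNodeResync(List<CollectorInfo> reported) {
    for (CollectorInfo info : reported) {
      CollectorInfo current = byApp.get(info.appId);
      if (current == null || info.isNewerThan(current)) {
        byApp.put(info.appId, info);
      }
    }
  }

  // Returns null when no collector is known for the application.
  synchronized String getCollectorAddress(String appId) {
    CollectorInfo info = byApp.get(appId);
    return info == null ? null : info.address;
  }
}
```

Because the merge compares attempt IDs rather than arrival order, the result
does not depend on which NM resyncs first.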
Not sure if we missed any critical points in this design. Thoughts?
> Recover collector list in RM failed over
> ----------------------------------------
>
> Key: YARN-3359
> URL: https://issues.apache.org/jira/browse/YARN-3359
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Li Lu
> Labels: YARN-5355
>
> Per discussion in YARN-3039, split the recover work from RMStateStore in a
> separated JIRA.