[
https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489375#comment-15489375
]
Junping Du commented on YARN-3359:
----------------------------------
More specifically, I think several pieces of work are needed for life cycle
management of the collector:
- Collector failure detection. We could extend the current RPC between the
collector and the NM hosting it into a heartbeat with an interval, so the NM
can detect a collector failure and either try to relaunch the collector or
notify the RM (and other NMs) that the previous collector is no longer valid.
- Besides launching the collector via the auxiliary service during AM launch,
we should allow the RM to launch a collector on an NM (at an optimized
location) directly if the previous collector has failed for some reason (in
case the AM is still alive).
- The RM should serve as the arbiter in case of duplicate collectors. Collector
info should carry a launch timestamp, so that when an NM notifies the RM about
a collector, the RM can judge whether this collector is the most recently
launched one and should be active. If not, the RM should notify the NM to
retire it. In case the RM fails over before it has sent the retire command, we
could either persist the timestamp info for collectors (RM state store space is
pretty limited, though) or let the RM wait a while (a configurable time, longer
than the NM retry interval + heartbeat interval) before broadcasting collector
info during RM restart.
- The address of the most recently launched collector can be updated/discovered
by the NMs and the AM, and in the future even by containers (a combined effort
with YARN-4758).
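
To make the arbiter idea in the third point concrete, here is a minimal Java
sketch. It is only an illustration under stated assumptions, not existing YARN
code: the class CollectorArbiter, its Decision enum, the report() and
activeAddress() methods, and the use of plain strings for application IDs and
collector addresses are all hypothetical names introduced here. The sketch
keeps, per application, the collector with the latest launch timestamp and
answers RETIRE for any older (or tied but different) collector that is
reported.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of RM-side arbitration; these names are not YARN APIs. */
final class CollectorArbiter {

    /** What the RM would tell the reporting NM to do with the collector. */
    enum Decision { ACTIVE, RETIRE }

    /** Collector address together with the time the collector was launched. */
    private static final class CollectorInfo {
        final String address;
        final long launchTimestamp;

        CollectorInfo(String address, long launchTimestamp) {
            this.address = address;
            this.launchTimestamp = launchTimestamp;
        }
    }

    // Currently active collector per application ID (assumed to be a string).
    private final Map<String, CollectorInfo> activeByApp = new HashMap<>();

    /**
     * Called when an NM reports a collector for an application.
     * The collector with the latest launch timestamp wins.
     */
    synchronized Decision report(String appId, String address,
                                 long launchTimestamp) {
        CollectorInfo current = activeByApp.get(appId);
        if (current == null || launchTimestamp > current.launchTimestamp) {
            // First report, or a more recently launched collector: activate.
            activeByApp.put(appId, new CollectorInfo(address, launchTimestamp));
            return Decision.ACTIVE;
        }
        if (current.address.equals(address)
                && current.launchTimestamp == launchTimestamp) {
            // Repeated report of the collector that is already active.
            return Decision.ACTIVE;
        }
        // Older collector, or a timestamp tie with a different address.
        return Decision.RETIRE;
    }

    /** Address that NMs/AMs should use for the application, or null. */
    synchronized String activeAddress(String appId) {
        CollectorInfo current = activeByApp.get(appId);
        return current == null ? null : current.address;
    }
}
```

For example, if collectors for the same application are reported with launch
timestamps 100 and then 200, both reports are answered with ACTIVE at the time
they arrive, and a later re-report of the first collector is answered with
RETIRE. The in-memory map is lost on RM failover, which is why the comment
above discusses either persisting the timestamps or delaying the broadcast
after restart.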
> Recover collector list in RM failed over
> ----------------------------------------
>
> Key: YARN-3359
> URL: https://issues.apache.org/jira/browse/YARN-3359
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Li Lu
> Labels: YARN-5355
>
> Per discussion in YARN-3039, split the recovery work from RMStateStore into a
> separate JIRA.