[ 
https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489375#comment-15489375
 ] 

Junping Du commented on YARN-3359:
----------------------------------

More specifically, I think several pieces of work are needed for collector 
life-cycle management:
- Collector failure detection. We could extend the current RPC between the 
collector and the NM hosting it into a periodic heartbeat, so the NM can detect 
a collector failure and either try to relaunch the collector or notify the RM 
(and other NMs) that the previous collector is no longer valid.
- Besides launching the collector via the auxiliary service during AM launch, 
we should allow the RM to launch a collector directly on an NM (at an optimized 
location) if the previous collector has failed for some reason (in case the AM 
is still alive).
- The RM should serve as the arbiter in case of duplicated collectors. 
Collector info should carry a launch timestamp, so that when an NM notifies the 
RM about a collector, the RM can judge whether this collector is the most 
recently launched one and should be active. If not, the RM should notify the NM 
to retire it. In case the RM fails over before sending the retire command, we 
could either persist the timestamp info for collectors (RM state store space is 
pretty limited, though) or have the RM wait a while (a configurable time, 
larger than the NM retry interval + heartbeat interval) before broadcasting 
collector info during RM restart.
- The address of the most recently launched collector can be updated/discovered 
by NMs and the AM, and in the future even by containers (a combined effort with 
YARN-4758).
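
To make the arbiter idea above concrete, here is a rough sketch of how the RM 
could compare launch timestamps when NMs report collectors. It is only an 
illustration, not existing YARN code: the class CollectorArbiter, the 
CollectorInfo and Decision types, the method names, and the tie-breaking rule 
(a repeated report from the same address stays active, any other tie is 
retired) are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of RM-side arbitration between duplicated collectors. */
final class CollectorArbiter {

  /** Decision the RM would send back to the reporting NM. */
  enum Decision { ACTIVE, RETIRE }

  /** Collector address plus launch timestamp, as reported by an NM. */
  static final class CollectorInfo {
    final String address;
    final long launchTimestamp;

    CollectorInfo(String address, long launchTimestamp) {
      this.address = address;
      this.launchTimestamp = launchTimestamp;
    }
  }

  // Most recently launched collector known to the RM, keyed by application id.
  private final Map<String, CollectorInfo> active = new HashMap<>();

  /**
   * Handles one NM report. The reported collector is active if none is known
   * for the app, if it was launched later than the known one, or if it is the
   * known one reported again. Otherwise the NM is told to retire it.
   */
  synchronized Decision report(String appId, CollectorInfo reported) {
    CollectorInfo current = active.get(appId);
    boolean newer = current == null
        || reported.launchTimestamp > current.launchTimestamp;
    boolean sameCollector = current != null
        && reported.launchTimestamp == current.launchTimestamp
        && reported.address.equals(current.address);
    if (newer || sameCollector) {
      active.put(appId, reported);
      return Decision.ACTIVE;
    }
    return Decision.RETIRE;
  }

  /** Address that NMs/AM would discover for the app, or null if unknown. */
  synchronized String activeAddress(String appId) {
    CollectorInfo current = active.get(appId);
    return current == null ? null : current.address;
  }

  /**
   * Smallest delay (ms) before broadcasting collector info after RM restart
   * that is still larger than NM retry interval + heartbeat interval.
   */
  static long minBroadcastDelayMs(long nmRetryMs, long heartbeatMs) {
    return nmRetryMs + heartbeatMs + 1;
  }
}
```

With this sketch, if a collector launched at timestamp 100 is reported first 
and a replacement launched at 200 is reported afterwards, a later report of the 
older collector gets RETIRE. After an RM restart the map starts empty, which is 
why the RM either needs the timestamps persisted or needs to wait until all NMs 
have had a chance to report before broadcasting.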

> Recover collector list in RM failed over
> ----------------------------------------
>
>                 Key: YARN-3359
>                 URL: https://issues.apache.org/jira/browse/YARN-3359
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: YARN-5355
>
> Per the discussion in YARN-3039, the recovery work from RMStateStore is split 
> into a separate JIRA.



