[ 
https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485225#comment-15485225
 ] 

Li Lu commented on YARN-3359:
-----------------------------

I had an offline discussion with [~vinodkv] about this issue. We cannot 
simply preserve collector state in the RM state store: this state is not 
final, and updating it frequently would block the RM. A natural alternative 
is the NM state store. That is to say, we can rebuild the RM's collector 
table from updates sent by the NMs. In summary, we need to do the following 
things:

For NMs: 
1. On collector launch, persist the collector address in the NM state store. 
2. On collector removal, remove the corresponding entry from the state store. 
3. On startup, recover collector addresses from the state store. 
4. On resync, send the current collector address mapping to the RM. 
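
To make the four NM-side steps concrete, here is a minimal sketch. It is not 
actual YARN code: InMemoryStateStore and NodeCollectorRegistry are 
hypothetical names, and the in-memory map only stands in for the real NM 
state store, which would write to durable storage.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the NM state store (assumption: a simple
// appId -> collector address table; the real store would be durable).
class InMemoryStateStore {
    private final Map<String, String> entries = new HashMap<>();

    void put(String appId, String address) { entries.put(appId, address); }

    void remove(String appId) { entries.remove(appId); }

    Map<String, String> loadAll() { return new HashMap<>(entries); }
}

// Hypothetical NM-side bookkeeping for collectors running on this node.
class NodeCollectorRegistry {
    private final InMemoryStateStore store;
    private final Map<String, String> live = new HashMap<>();

    NodeCollectorRegistry(InMemoryStateStore store) {
        this.store = store;
        // Step 3: on startup, recover collector addresses from the store.
        live.putAll(store.loadAll());
    }

    // Step 1: on collector launch, persist the address before using it.
    void onCollectorLaunched(String appId, String address) {
        store.put(appId, address);
        live.put(appId, address);
    }

    // Step 2: on collector removal, drop the entry from the store.
    void onCollectorRemoved(String appId) {
        store.remove(appId);
        live.remove(appId);
    }

    // Step 4: on resync, hand the RM a copy of the current mapping.
    Map<String, String> onResync() {
        return new HashMap<>(live);
    }
}
```

A registry constructed over the same store after an NM restart reports 
exactly the collectors that were launched and not yet removed, which is the 
information the RM needs from each NM.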

For the RM, the only change needed is to rebuild the collector/address mapping 
upon restart. This involves a pretty messy corner case: when one 
application has two different attempts running (due to network problems, 
for example) and the RM is rebuilding collector state, the RM needs to 
know which collector belongs to the latest app attempt and which belongs to 
the stale attempt. This requires some changes to collector IDs. Right now 
each collector is mapped to an app ID, but to handle the state recovery case 
we need to associate each collector with an attempt ID (and ideally a 
timestamp to further distinguish collectors). 
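
A rough sketch of how the RM could resolve the stale-attempt case while 
rebuilding its table from NM reports. The CollectorReport and 
CollectorTableRebuilder names are hypothetical, and the sketch assumes that 
a higher attempt ID means a newer attempt, with the timestamp used only to 
break ties between reports for the same attempt.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical report entry sent by an NM on resync (not a YARN class).
final class CollectorReport {
    final String appId;
    final int attemptId;    // assumption: larger value means newer attempt
    final long timestamp;   // assumption: collector launch time, tie-breaker
    final String address;

    CollectorReport(String appId, int attemptId, long timestamp, String address) {
        this.appId = appId;
        this.attemptId = attemptId;
        this.timestamp = timestamp;
        this.address = address;
    }

    // Compare by attempt ID first, then by timestamp.
    boolean isNewerThan(CollectorReport other) {
        if (attemptId != other.attemptId) {
            return attemptId > other.attemptId;
        }
        return timestamp > other.timestamp;
    }
}

class CollectorTableRebuilder {
    // Keep, for each app, only the report from the latest attempt.
    static Map<String, CollectorReport> rebuild(List<CollectorReport> reports) {
        Map<String, CollectorReport> table = new HashMap<>();
        for (CollectorReport report : reports) {
            CollectorReport current = table.get(report.appId);
            if (current == null || report.isNewerThan(current)) {
                table.put(report.appId, report);
            }
        }
        return table;
    }
}
```

With only an app ID as the key, the two reports for the same application 
would be indistinguishable, and whichever NM resynced last would win.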

Not sure if we missed some critical points in this design. Thoughts? 

> Recover collector list in RM failed over
> ----------------------------------------
>
>                 Key: YARN-3359
>                 URL: https://issues.apache.org/jira/browse/YARN-3359
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: YARN-5355
>
> Per discussion in YARN-3039, split the recover work from RMStateStore in a 
> separated JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

