[ 
https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637799#comment-15637799
 ] 

Li Lu commented on YARN-3359:
-----------------------------

bq. Also, we are sending all collectors known to this NM, not only the ones 
launched on it. Can't we keep some additional info for known collectors 
indicating whether each one was launched on this NM, so that the NM reports 
only those collectors to the RM on resync? Otherwise many NMs may report 
pretty much the same set of collector infos in their first NM HB on 
reconnection. Can this potentially be optimized? It may not cause much impact, 
though, considering we only set collector info in RMAppImpl. Thoughts?

I spent some time on this today, and it looks like there is no trivial 
optimization. I'm inclined not to introduce another collector set for local 
collectors, because of consistency concerns. A node manager does not know the 
exact list of AMs running on it, so a separately maintained list of "local 
collectors" may become inconsistent. For example, when the same application 
has a second attempt on another node and the new collector registration comes 
back to the current node, the node has no way to tell whether that collector 
is local.

An ideal solution would be to check whether the collector's URL is local when 
re-registering, so that we report only local collectors. However, this 
requires some additional work on the network side that I'm not very familiar 
with. We can make that improvement in a separate JIRA and test the 
optimization more intensively before putting it in. For now, shall we move 
forward with this fix without the optimization?
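
To make the idea concrete, here is a rough sketch of what such a locality 
check could look like. It is not taken from any patch on this JIRA. The class 
and method names are hypothetical, collectors are modeled as plain 
"host:port" strings keyed by an application id string rather than the real 
YARN types, and only standard java.net calls are used.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of the YARN code base.
final class LocalCollectorCheck {

  // True if the host resolves to an address bound to an interface on this machine.
  static boolean isLocalHost(String host) {
    try {
      for (InetAddress addr : InetAddress.getAllByName(host)) {
        if (addr.isAnyLocalAddress() || addr.isLoopbackAddress()) {
          return true;
        }
        if (NetworkInterface.getByInetAddress(addr) != null) {
          return true;
        }
      }
    } catch (IOException e) {
      // Unresolvable host or interface lookup failure: treat as not local.
    }
    return false;
  }

  // Assumes the collector address has the form "host:port".
  static boolean isLocalCollector(String collectorAddr) {
    int idx = collectorAddr.lastIndexOf(':');
    String host = idx >= 0 ? collectorAddr.substring(0, idx) : collectorAddr;
    return isLocalHost(host);
  }

  // Keeps only the entries whose collector address is local to this node.
  // The map models "application id -> collector address" with plain strings.
  static Map<String, String> localOnly(Map<String, String> knownCollectors) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : knownCollectors.entrySet()) {
      if (isLocalCollector(e.getValue())) {
        result.put(e.getKey(), e.getValue());
      }
    }
    return result;
  }
}
```

On resync, the NM would then report something like 
localOnly(knownCollectors) instead of the full set. Whether hostname 
resolution behaves reliably enough on multi-homed nodes is exactly the kind 
of network-side question that needs more testing.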

I'll address the other review comments. If there are more concerns, please 
feel free to let me know. 

> Recover collector list in RM failed over
> ----------------------------------------
>
>                 Key: YARN-3359
>                 URL: https://issues.apache.org/jira/browse/YARN-3359
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: YARN-5355, oct16-medium
>         Attachments: YARN-3359-YARN-5355.001.patch, 
> YARN-3359-YARN-5355.002.patch, YARN-3359-YARN-5638.patch
>
>
> Per discussion in YARN-3039, split the recovery work from RMStateStore into a 
> separate JIRA.


