[ https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637799#comment-15637799 ]
Li Lu commented on YARN-3359:
-----------------------------

bq. Also, we are sending all collectors known to this NM and not only the NMs' launched on it. Can't we keep some additional info for known collectors indicating if each one of those collectors is a collector launched on this NM. And hence NM reports only these collectors to RM on resync. Otherwise many NMs' may pretty much report similar set of collector infos in first NM HB on reconnection. This can be potentially optimized ? May not cause much impact though considering we only set collector info in RMAppImpl. Thoughts ?

I spent some time on this today, and it does not look like a trivial optimization. I'm inclined not to introduce a separate collector set for local collectors, because of consistency concerns. A node manager does not know the exact list of AMs running on it, so a list of "local collectors" could drift out of sync with reality. For example, when the same application has a second attempt on another node and the new collector registration comes back to the current node, the node has no way to tell whether that collector is local.

An ideal solution would be to check whether the collector's URL is local when re-registering. That way we would report only local collectors. However, this requires some work on the network side that I'm not very familiar with. We can make that improvement in a separate JIRA and test the optimization more intensively before putting it in.

For now, shall we move forward with this fix without the optimization? I'll address the other review comments. If there are more concerns, please feel free to let me know.
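To make the "is the collector's URL local?" idea concrete, here is a minimal sketch. It is not part of any attached patch. The class name, the method names, and the assumption that collector addresses are plain "host:port" strings keyed by application ID are all hypothetical; only the java.net calls are real.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not an existing YARN class. */
public class LocalCollectorFilter {

  /**
   * Returns true if the host part of a "host:port" string resolves to an
   * address bound to a network interface on this machine.
   */
  public static boolean isLocalAddress(String hostPort) {
    // Assumption: the address is "host:port"; strip the port if present.
    int idx = hostPort.lastIndexOf(':');
    String host = idx < 0 ? hostPort : hostPort.substring(0, idx);
    try {
      InetAddress addr = InetAddress.getByName(host);
      if (addr.isAnyLocalAddress() || addr.isLoopbackAddress()) {
        return true;
      }
      // Non-null only when some local interface owns this address.
      return NetworkInterface.getByInetAddress(addr) != null;
    } catch (UnknownHostException | SocketException e) {
      // Unresolvable or unqueryable: treat as not local.
      return false;
    }
  }

  /** Keeps only the entries whose collector address is local to this machine. */
  public static Map<String, String> localCollectors(Map<String, String> known) {
    Map<String, String> local = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : known.entrySet()) {
      if (isLocalAddress(e.getValue())) {
        local.put(e.getKey(), e.getValue());
      }
    }
    return local;
  }
}
```

A check like this avoids keeping a second, separately maintained set of local collectors, but it depends on host name resolution behaving consistently on every node, which is the network-side work that would need testing in the follow-up JIRA.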
> Recover collector list in RM failed over
> ----------------------------------------
>
> Key: YARN-3359
> URL: https://issues.apache.org/jira/browse/YARN-3359
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Li Lu
> Labels: YARN-5355, oct16-medium
> Attachments: YARN-3359-YARN-5355.001.patch, YARN-3359-YARN-5355.002.patch, YARN-3359-YARN-5638.patch
>
>
> Per discussion in YARN-3039, split the recover work from RMStateStore in a
> separated JIRA.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)