[ https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637799#comment-15637799 ]

Li Lu commented on YARN-3359:
-----------------------------
bq. Also, we are sending all collectors known to this NM and not only the ones
launched on it. Can't we keep some additional info for known collectors
indicating whether each of those collectors was launched on this NM, so that
the NM reports only these collectors to the RM on resync? Otherwise many NMs
may report pretty much the same set of collector infos in the first NM HB on
reconnection. This can potentially be optimized? It may not cause much impact
though, considering we only set collector info in RMAppImpl. Thoughts?

I spent some time on this today, and it does not look like a trivial
optimization. I'm inclined not to introduce another collector set for local
collectors, because of consistency concerns. A node manager does not know the
exact list of AMs running on it, so maintaining a list of "local collectors"
may become inconsistent. For example, when the same application has a second
attempt on another node and the new collector registration comes back to the
current node, the current node has no way to tell whether that collector is
local or not.

An ideal solution would be to check whether the collector's URL is local when
re-registering. That way we would report only local collectors. However, this
requires some more work on the network side that I'm not very familiar with.
We can make this improvement in a separate JIRA and test the optimization more
intensively before putting it in. For now, shall we move forward with this fix
without the optimization?
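
To make the "is the collector's URL local" idea a bit more concrete, here is a
rough sketch of what such a check might look like. This is not code from the
patch. The class name LocalCollectorFilter, both method names, and the
assumption that known collectors are kept as a map from application id to a
"host:port" string are all hypothetical. It relies only on the standard
java.net API.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of YARN: filters a collector map down to the
// entries whose address is bound on the current machine.
class LocalCollectorFilter {

  // Assumes the address always has the form "host:port".
  // Returns true if the host resolves to an address owned by this machine.
  static boolean isLocalAddress(String hostPort) {
    int idx = hostPort.lastIndexOf(':');
    String host = idx >= 0 ? hostPort.substring(0, idx) : hostPort;
    try {
      InetAddress addr = InetAddress.getByName(host);
      // Wildcard and loopback addresses always refer to this machine.
      if (addr.isAnyLocalAddress() || addr.isLoopbackAddress()) {
        return true;
      }
      // Otherwise the address is local only if some interface is bound to it.
      return NetworkInterface.getByInetAddress(addr) != null;
    } catch (UnknownHostException | SocketException e) {
      // An unresolvable or unreadable address is treated as not local.
      return false;
    }
  }

  // Assumed input shape: application id -> collector "host:port".
  // Returns a new map holding only the collectors running on this machine.
  static Map<String, String> localOnly(Map<String, String> knownCollectors) {
    Map<String, String> local = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : knownCollectors.entrySet()) {
      if (isLocalAddress(e.getValue())) {
        local.put(e.getKey(), e.getValue());
      }
    }
    return local;
  }
}
```

A check like this avoids keeping a second "local collectors" set in sync, but
it depends on hostname resolution and on how multi-homed nodes are configured,
which is the part that would need the more intensive testing.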

I'll address the other review comments. If there are more concerns, please feel
free to let me know.

> Recover collector list in RM failed over
> ----------------------------------------
>
> Key: YARN-3359
> URL: https://issues.apache.org/jira/browse/YARN-3359
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Li Lu
> Labels: YARN-5355, oct16-medium
> Attachments: YARN-3359-YARN-5355.001.patch,
> YARN-3359-YARN-5355.002.patch, YARN-3359-YARN-5638.patch
>
>
> Per discussion in YARN-3039, split the recover work from RMStateStore in a
> separated JIRA.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)