[
https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634344#comment-15634344
]
Varun Saxena commented on YARN-3359:
------------------------------------
Thanks [~gtCarrera9] for the patch. Code largely seems fine and consistent with
the approach adopted.
Coming to the approach, it seems on resync we will create a list of all the
known collectors but this is not sent on register request to RM but on
subsequent HB to RM. I guess this is done because if we send it on registration
this would mean protocol changes and would mean additional changes on RM side
too.
Also, we are sending all collectors known to this NM and not only the NMs'
launched on it. Can't we keep some additional info for known collectors
indicating if each one of those collectors is a collector launched on this NM.
And hence NM reports only these collectors to RM on resync. Otherwise all NMs'
may pretty much report the same collector info in first NM HB on reconnection.
This can be potentially optimized ? May not cause much impact though
considering we only set collector info in RMAppImpl. Thoughts ?
In NodeManager#reregisterCollectors, in the piece of code below, just to be
safe should we check for null after get from map ? Looking at the code it is
highly unlikely to have a NPE here as application is only removed from map when
aggregation finishes. But better to be safe considering
ConcurrentHashMap#entrySet is weakly consistent ?
{code}
if (!ApplicationState.FINISHED.equals(context.getApplications()
.get(entry.getKey()).getApplicationState())) {
{code}
Nit. In method NodeManager#resyncWithRM, line
{{context.getKnownCollectors().clear();}} has been commented out. It can be
removed.
> Recover collector list in RM failed over
> ----------------------------------------
>
> Key: YARN-3359
> URL: https://issues.apache.org/jira/browse/YARN-3359
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Li Lu
> Labels: YARN-5355, oct16-medium
> Attachments: YARN-3359-YARN-5355.001.patch,
> YARN-3359-YARN-5355.002.patch, YARN-3359-YARN-5638.patch
>
>
> Per discussion in YARN-3039, split the recover work from RMStateStore in a
> separated JIRA.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]