[jira] [Commented] (YARN-3359) Recover collector list in RM failed over

Varun Saxena (JIRA) Thu, 03 Nov 2016 14:25:24 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634344#comment-15634344
 ]


Varun Saxena commented on YARN-3359:
------------------------------------

Thanks [~gtCarrera9] for the patch. Code largely seems fine and consistent with 
the approach adopted.

Coming to the approach, it seems on resync we will create a list of all the 
known collectors but this is not sent on register request to RM but on 
subsequent HB to RM. I guess this is done because if we send it on registration 
this would mean protocol changes and would mean additional changes on RM side 
too.
Also, we are sending all collectors known to this NM and not only the NMs' 
launched on it. Can't we keep some additional info for known collectors 
indicating if each one of those collectors is a collector launched on this NM. 
And hence NM reports only these collectors to RM on resync. Otherwise all NMs' 
may pretty much report the same collector info in first NM HB on reconnection. 
This can be potentially optimized ? May not cause much impact though 
considering we only set collector info in RMAppImpl. Thoughts  ?

In NodeManager#reregisterCollectors, in the piece of code below, just to be 
safe should we check for null after get from map ? Looking at the code it is 
highly unlikely to have a NPE here as application is only removed from map when 
aggregation finishes. But better to be safe considering 
ConcurrentHashMap#entrySet is weakly consistent ? 
{code}
      if (!ApplicationState.FINISHED.equals(context.getApplications()
          .get(entry.getKey()).getApplicationState())) {
{code}

Nit. In method NodeManager#resyncWithRM, line 
{{context.getKnownCollectors().clear();}} has been commented out. It can be 
removed.

> Recover collector list in RM failed over
> ----------------------------------------
>
>                 Key: YARN-3359
>                 URL: https://issues.apache.org/jira/browse/YARN-3359
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: YARN-5355, oct16-medium
>         Attachments: YARN-3359-YARN-5355.001.patch, 
> YARN-3359-YARN-5355.002.patch, YARN-3359-YARN-5638.patch
>
>
> Per discussion in YARN-3039, split the recover work from RMStateStore in a 
> separated JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (YARN-3359) Recover collector list in RM failed over

Reply via email to