[
https://issues.apache.org/jira/browse/YARN-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566289#comment-13566289
]
Bikas Saha commented on YARN-275:
---------------------------------
I think there is some merit in having the NM heartbeat interval come from the
RM. (This patch also makes it configurable.) That gives us a handle to adjust
the heartbeat rate under adverse situations or when hitting other scalability
issues (while we work them out). It's a generic load-balancing feedback
mechanism with strong foundations in control theory. If needed, we could
enhance the RM logic to change the heartbeat interval dynamically, or we could
change it through configuration (potentially without restarting all the NMs).
So committing the current patch, which has the RM send the heartbeat interval
value to the NM, looks like a good thing to do regardless.
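To make the feedback idea concrete, here is a minimal sketch, not the code in the attached patch. The class names, the backlog-based rule, and all parameters (HeartbeatIntervalPolicy, NodeHeartbeatTimer, backlogThreshold) are assumptions made for illustration; the real YARN protocol records are not used. The RM side picks an interval from a configured base value and stretches it when its event backlog grows, and the NM side simply adopts whatever interval the RM returned.

```java
/** Hypothetical RM-side policy: picks the interval an NM should wait before its next heartbeat. */
final class HeartbeatIntervalPolicy {
    private final long baseIntervalMs;   // configured default interval (assumed setting)
    private final long maxIntervalMs;    // upper bound so nodes never go silent for too long
    private final int backlogThreshold;  // backlog size above which the RM asks NMs to slow down

    HeartbeatIntervalPolicy(long baseIntervalMs, long maxIntervalMs, int backlogThreshold) {
        if (baseIntervalMs <= 0 || maxIntervalMs < baseIntervalMs || backlogThreshold <= 0) {
            throw new IllegalArgumentException("invalid heartbeat policy settings");
        }
        this.baseIntervalMs = baseIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
        this.backlogThreshold = backlogThreshold;
    }

    /** Returns the base interval when the RM keeps up, and a proportionally longer one when it lags. */
    long nextIntervalMs(int pendingEvents) {
        if (pendingEvents <= backlogThreshold) {
            return baseIntervalMs;
        }
        long scaled = baseIntervalMs * pendingEvents / backlogThreshold;
        return Math.min(scaled, maxIntervalMs);
    }
}

/** Hypothetical NM-side holder: the NM sleeps for whatever interval the RM last sent. */
final class NodeHeartbeatTimer {
    private volatile long intervalMs;

    NodeHeartbeatTimer(long initialIntervalMs) {
        this.intervalMs = initialIntervalMs;
    }

    /** Called with the interval carried in the heartbeat response; non-positive values are ignored. */
    void onHeartbeatResponse(long nextIntervalMsFromRm) {
        if (nextIntervalMsFromRm > 0) {
            intervalMs = nextIntervalMsFromRm;
        }
    }

    long currentIntervalMs() {
        return intervalMs;
    }
}
```

With a base of 1000 ms, a cap of 10000 ms, and a threshold of 100 pending events, a backlog of 300 events would yield a 3000 ms interval under this rule. Because the value travels in the heartbeat response, the NMs pick it up without a restart.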
For the current overload case, the alternative approach makes sense: simply
aggregate the data upon message receipt instead of scheduling on every message
receipt. While it does not change the performance of the scheduler, it does
reduce the problem complexity to O(#machines) instead of O(#messages). It also
improves the scheduling response time for requests by shifting it from the node
heartbeat to a potentially as-and-when-needed approach. I am +1 for it. One
thing to be careful of here is how the scheduler and the RMNodes communicate.
We would like to avoid creating a large number of event objects. Currently I
think it's 2X the number of messages (one event to the RMNode and one to the
scheduler).
IMO, let's use this JIRA to make the heartbeat interval configurable and sent by
the RM, and use another sub-task to address the scheduler changes.
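The aggregation idea could be sketched roughly as follows. This is only an illustration under stated assumptions: NodeUpdateAggregator, its methods, and the use of plain strings for node and container IDs are hypothetical and do not correspond to the actual RMNode or scheduler classes. Each heartbeat merges its data into a per-node pending entry, the scheduler is notified once per node that becomes dirty rather than once per message, and the scheduler drains all pending updates when it chooses to run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical aggregator: merges heartbeat data per node until the scheduler drains it. */
final class NodeUpdateAggregator {
    // One entry per node with unprocessed updates, so the size is bounded by #machines.
    private final Map<String, List<String>> pending = new LinkedHashMap<>();
    private int schedulerNotifications = 0;

    /** Called on every heartbeat; only the first update for a clean node counts as a notification. */
    synchronized void onHeartbeat(String nodeId, List<String> completedContainers) {
        List<String> updates = pending.get(nodeId);
        if (updates == null) {
            updates = new ArrayList<>();
            pending.put(nodeId, updates);
            schedulerNotifications++; // one notification per dirty node, not per message
        }
        updates.addAll(completedContainers);
    }

    /** Called by the scheduler when it is ready; returns and clears all aggregated updates. */
    synchronized Map<String, List<String>> drain() {
        Map<String, List<String>> snapshot = new LinkedHashMap<>(pending);
        pending.clear();
        return snapshot;
    }

    synchronized int notificationCount() {
        return schedulerNotifications;
    }
}
```

In this sketch, two heartbeats from the same node before a drain produce a single pending entry and a single notification, which is the O(#machines) bound mentioned above. How the notification is actually delivered to the scheduler is left open here.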
> Make NodeManagers to NOT blindly heartbeat irrespective of whether previous
> heartbeat is processed or not.
> ----------------------------------------------------------------------------------------------------------
>
> Key: YARN-275
> URL: https://issues.apache.org/jira/browse/YARN-275
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, resourcemanager
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Xuan Gong
> Attachments: Prototype.txt, YARN-270.1.patch
>
>
> We need NMs to back off. The event handler mechanism is very scalable but not
> infinitely so :)