[ 
https://issues.apache.org/jira/browse/YARN-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566289#comment-13566289
 ] 

Bikas Saha commented on YARN-275:
---------------------------------

I think there is some merit in making the change to have the NM heartbeat 
interval come from the RM. (This patch also makes it configurable.) It gives 
us a handle to adjust the NM update rate under adverse situations or when 
hitting other scalability issues (while we work them out). It's a generic 
load balancing feedback mechanism with strong foundations in control theory. If 
needed, we could enhance the RM logic to change the heartbeat interval 
dynamically, or we could change it through configuration (potentially without 
needing to restart all NMs). So committing the current patch to have the RM 
send the heartbeat interval value to the NM looks like a good thing to do 
regardless.
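
To make the idea concrete, here is a rough sketch of an NM loop that takes 
its next heartbeat interval from the RM's response. It is not the patch 
itself: every class, field, and method name below is hypothetical, and the 
"double the interval when backlogged" policy is only an assumed example of 
what the RM could do.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: the NM learns its next heartbeat interval from the RM. */
public class HeartbeatIntervalSketch {

    /** Stand-in for a heartbeat response; the field name is an assumption. */
    static final class HeartbeatResponse {
        final long nextHeartbeatIntervalMs;

        HeartbeatResponse(long nextHeartbeatIntervalMs) {
            this.nextHeartbeatIntervalMs = nextHeartbeatIntervalMs;
        }
    }

    /** Stand-in for the RM side: picks an interval from config and current backlog. */
    static final class ResourceManagerStub {
        private final long configuredIntervalMs;
        private final int overloadThreshold;
        private int pendingEvents;

        ResourceManagerStub(long configuredIntervalMs, int overloadThreshold) {
            this.configuredIntervalMs = configuredIntervalMs;
            this.overloadThreshold = overloadThreshold;
        }

        void setPendingEvents(int pendingEvents) {
            this.pendingEvents = pendingEvents;
        }

        HeartbeatResponse nodeHeartbeat() {
            // Assumed policy, for illustration only: back NMs off when the
            // RM's event backlog exceeds the threshold.
            long interval = pendingEvents > overloadThreshold
                    ? configuredIntervalMs * 2
                    : configuredIntervalMs;
            return new HeartbeatResponse(interval);
        }
    }

    /** Stand-in for the NM side: returns the intervals it would sleep for. */
    static List<Long> runHeartbeats(ResourceManagerStub rm, int[] backlogPerRound) {
        List<Long> sleeps = new ArrayList<>();
        for (int backlog : backlogPerRound) {
            rm.setPendingEvents(backlog);
            HeartbeatResponse response = rm.nodeHeartbeat();
            // A real NM would sleep for this long before the next heartbeat.
            sleeps.add(response.nextHeartbeatIntervalMs);
        }
        return sleeps;
    }

    public static void main(String[] args) {
        ResourceManagerStub rm = new ResourceManagerStub(1000L, 500);
        System.out.println(runHeartbeats(rm, new int[] {10, 900, 20}));
    }
}
```

The point of the sketch is that the NM holds no interval of its own, so the 
RM can change the rate without any NM-side configuration change.
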
For the current overload case, the alternative approach makes sense: simply 
aggregate the data upon message receipt instead of scheduling on every 
message receipt. While it does not change the perf of the scheduler, it does 
reduce the problem complexity to O(#machines) instead of O(#messages). It also 
improves the scheduling response time for requests by shifting it from node 
heartbeat to a potentially as-and-when-needed approach. I am +1 for it. One 
thing to be careful of here would be how to communicate between the scheduler 
and RMNodes. We would like to avoid creating a large number of event 
objects. Currently I think it's 2X the number of messages (1 to the RMNode and 
1 to the scheduler).
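
A minimal sketch of the aggregation idea, assuming a simple per-node buffer 
that the scheduler drains when it is ready. The class and method names are 
hypothetical and do not correspond to existing RM code; the sketch only shows 
why the scheduling work becomes proportional to the number of nodes rather 
than the number of heartbeats.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: buffer heartbeat data per node, schedule once per node. */
public class HeartbeatAggregationSketch {

    // One buffer per node, so its size is bounded by the number of machines.
    private final Map<String, List<String>> pendingByNode = new LinkedHashMap<>();
    private int schedulingPasses = 0;

    /** On heartbeat receipt: merge the update into the node's buffer, nothing more. */
    synchronized void onHeartbeat(String nodeId, List<String> containerUpdates) {
        pendingByNode.computeIfAbsent(nodeId, k -> new ArrayList<>())
                .addAll(containerUpdates);
    }

    /** When the scheduler is ready: hand over one aggregated entry per node. */
    synchronized Map<String, List<String>> drainForScheduling() {
        Map<String, List<String>> snapshot = new LinkedHashMap<>(pendingByNode);
        pendingByNode.clear();
        // One scheduling pass per node with buffered data, not one per message.
        schedulingPasses += snapshot.size();
        return snapshot;
    }

    synchronized int schedulingPasses() {
        return schedulingPasses;
    }

    public static void main(String[] args) {
        HeartbeatAggregationSketch aggregator = new HeartbeatAggregationSketch();
        aggregator.onHeartbeat("node1", List.of("c1 finished"));
        aggregator.onHeartbeat("node1", List.of("c2 finished"));
        aggregator.onHeartbeat("node2", List.of("c3 finished"));
        aggregator.onHeartbeat("node1", List.of("c4 finished"));
        System.out.println(aggregator.drainForScheduling());
        System.out.println("scheduling passes: " + aggregator.schedulingPasses());
    }
}
```

Here four heartbeats from two nodes result in two scheduling passes. Handing 
the scheduler a drained snapshot instead of one event per message is one 
possible way to keep the number of event objects down, though how the 
scheduler and RMNodes should actually communicate is left open above.
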
IMO let's use this jira to make the heartbeat interval configurable and sent 
by the RM, and use another sub-task to address the scheduler changes.
                
> Make NodeManagers to NOT blindly heartbeat irrespective of whether previous 
> heartbeat is processed or not.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-275
>                 URL: https://issues.apache.org/jira/browse/YARN-275
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Xuan Gong
>         Attachments: Prototype.txt, YARN-270.1.patch
>
>
> We need NMs to back off. The event handler mechanism is very scalable but not 
> infinitely so :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
