[ 
https://issues.apache.org/jira/browse/YARN-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540621#comment-13540621
 ] 

Bikas Saha commented on YARN-275:
---------------------------------

I briefly looked at the patch. The general approach seems promising. I have 
some comments on how we can structure these changes.
We could break this work into 2 parts:
1) Protocol changes in the heartbeat to transfer control of the heartbeat 
frequency from NM to RM. After this, in every heartbeat response the RM will 
tell the NM when to send the next heartbeat. That value can be hardcoded (like 
it is currently), but preferably we can have an RM config that defines what the 
minimum heartbeat interval should be and use that. For this part, I don't think 
we need both backoff and heartbeatinterval in the heartbeat response. We can 
have just heartbeatinterval, which is always respected by the NM.
2) Add some logic/heuristic to the RM so that it can dynamically change the 
heartbeat interval based on its current processing load/rate. This way the 
interval can be made longer when the RM is not keeping up with heartbeats.
If you think this breakdown of the work makes sense, then we can create 2 
sub-tasks under this jira for the 2 parts.
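
To make the two parts concrete, here is a rough sketch of how the RM side 
might pick the interval it returns in each heartbeat response. Everything in it 
is hypothetical and not taken from the patch or the YARN code base: the class 
name, the idea of measuring load as a count of pending heartbeats, the backlog 
threshold, and the upper bound on the interval are all assumptions made for 
illustration.

```java
// Hypothetical sketch only; none of these names exist in YARN.
final class HeartbeatIntervalPolicy {
    // Assumed RM config value: the minimum interval the RM will ever hand out.
    private final long minIntervalMs;
    // Assumed cap so that an overloaded RM never silences NMs for too long.
    private final long maxIntervalMs;

    HeartbeatIntervalPolicy(long minIntervalMs, long maxIntervalMs) {
        if (minIntervalMs <= 0 || maxIntervalMs < minIntervalMs) {
            throw new IllegalArgumentException("need 0 < min <= max");
        }
        this.minIntervalMs = minIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
    }

    // Part 1: the RM always answers with the configured minimum interval.
    long fixedIntervalMs() {
        return minIntervalMs;
    }

    // Part 2: stretch the interval in proportion to the backlog once the
    // number of unprocessed heartbeats exceeds the threshold.
    long nextIntervalMs(int pendingHeartbeats, int backlogThreshold) {
        if (backlogThreshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        if (pendingHeartbeats <= backlogThreshold) {
            return minIntervalMs; // RM is keeping up
        }
        long scaled = minIntervalMs * pendingHeartbeats / backlogThreshold;
        return Math.min(scaled, maxIntervalMs);
    }
}
```

With a single interval field in the response, the NM only has to sleep for 
whatever value it was last given, which is why a separate backoff field would 
be redundant under this scheme.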

I also have some additional ideas on part 1.
When a heartbeat arrives at the RM at time T, the RM can choose to:
A) Accept the request at time T and ask the NM to heartbeat after time T+K with 
new information. This adds to the current RM load. This is what the current 
code does, so no change is required to do this.
B) Reject the request at time T and ask the NM to heartbeat after time T+K with 
current+new information. This does not increase load on the RM but makes the NM 
more complex because it needs to hold onto the last heartbeat data and merge 
new data into it.
What do you think about these alternatives?
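
For option B, the extra NM-side bookkeeping could look roughly like the sketch 
below. It is an illustration under simplifying assumptions, not the actual NM 
code: the class and method names are made up, container statuses are modeled as 
plain strings keyed by container id, and the response is reduced to an 
accepted/rejected flag.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names exist in YARN.
final class PendingHeartbeatBuffer {
    // Latest known status per container that the RM has not yet accepted.
    private final Map<String, String> pending = new LinkedHashMap<>();

    // Record new data; a newer status for the same container replaces the old.
    void record(String containerId, String status) {
        pending.put(containerId, status);
    }

    // Snapshot of everything that still has to reach the RM
    // (the "current+new information" of option B).
    Map<String, String> buildPayload() {
        return new LinkedHashMap<>(pending);
    }

    // On rejection keep everything so it is merged into the next heartbeat.
    // On acceptance drop only the entries that were sent and have not
    // changed since the snapshot was taken.
    void onResponse(boolean accepted, Map<String, String> sentPayload) {
        if (!accepted) {
            return;
        }
        for (Map.Entry<String, String> e : sentPayload.entrySet()) {
            pending.remove(e.getKey(), e.getValue());
        }
    }
}
```

Under option A none of this state is needed, since each heartbeat carries only 
new information and is always accepted.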
                
> Make NodeManagers to NOT blindly heartbeat irrespective of whether previous 
> heartbeat is processed or not.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-275
>                 URL: https://issues.apache.org/jira/browse/YARN-275
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Xuan Gong
>         Attachments: YARN-270.1.patch
>
>
> We need NMs to back off. The event handler mechanism is very scalable but not 
> infinitely so :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
