[ 
https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717831#comment-14717831
 ] 

Bikas Saha commented on YARN-4088:
----------------------------------

Why not on a 3K cluster? We could slow heartbeats down (to, say, 10s) on a 
3K-node cluster. That should work, though I agree that NM info would be stale 
for longer, if that's your point.
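
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes the RM handles heartbeats strictly one at a time and that they arrive evenly spaced; the function name is made up for illustration and is not part of YARN.

```python
def per_heartbeat_budget_ms(num_nodes: int, heartbeat_interval_s: float) -> float:
    """Average time (ms) a strictly sequential RM has to process one NM heartbeat.

    Hypothetical helper for illustration only; assumes evenly spaced heartbeats.
    """
    heartbeats_per_second = num_nodes / heartbeat_interval_s
    return 1000.0 / heartbeats_per_second


if __name__ == "__main__":
    # Compare the 3s interval from the issue description with the 10s suggested here.
    for interval_s in (3, 10):
        budget_ms = per_heartbeat_budget_ms(3000, interval_s)
        print(f"{interval_s}s interval: {budget_ms:.2f} ms per heartbeat")
```

Under those assumptions, going from 3s to 10s raises the average budget from about 1 ms to about 3.3 ms per heartbeat, at the cost of NM state being up to 10s old.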

> RM should be able to process heartbeats from NM asynchronously
> --------------------------------------------------------------
>
>                 Key: YARN-4088
>                 URL: https://issues.apache.org/jira/browse/YARN-4088
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, scheduler
>            Reporter: Srikanth Kandula
>
> Today, the RM sequentially processes one heartbeat after another. 
> Imagine a 3000 server cluster with each server heart-beating every 3s. This 
> gives the RM 1ms on average to process each NM heartbeat. That is tough.
> It is true that there are several underlying data structures that will be 
> touched during heartbeat processing. So, it is non-trivial to parallelize the 
> NM heartbeat. Yet, it is quite doable...
> Parallelizing the NM heartbeat would substantially improve the scalability of 
> the RM, allowing it to either 
> a) run larger clusters or 
> b) support faster heartbeats or dynamic scaling of heartbeats or 
> c) take more asks from each application or 
> d) use cleverer/more expensive algorithms such as node labels or better 
> packing or ...
> Indeed the RM's scalability limit has been cited as the motivating reason for 
> a variety of efforts which will become less needed if this can be solved. 
> Ditto for slow heartbeats.  See Sparrow and Mercury papers for example.
> Can we take a shot at this?
> If not, could we discuss why?



