[ 
https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717887#comment-14717887
 ] 

Srikanth Kandula commented on YARN-4088:
----------------------------------------

The problem with slower heartbeats is that, if the tasks are short-running, 
cluster-wide throughput drops because of the feedback delay. This is 
one of the points on which Sparrow (Spark) and Mercury hammer YARN. Of course, 
reusing containers *can* help, but other ducks have to align well. In general, 
slowing the heartbeat is not a good thing.
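
To make the feedback-delay argument concrete, here is a rough sketch. It assumes a deliberately simplified model that is not taken from the YARN code: a slot that finishes a task sits idle for half a heartbeat interval on average before the RM hears about it and hands out new work. The function name and the half-interval idle time are illustrative assumptions only.

```python
def slot_utilization(task_seconds, heartbeat_seconds):
    """Fraction of time a slot spends doing useful work.

    Assumption (hypothetical model): after each task the slot idles for
    half a heartbeat interval on average while waiting for the next
    heartbeat to report completion and receive a new assignment.
    """
    idle_seconds = heartbeat_seconds / 2.0
    return task_seconds / (task_seconds + idle_seconds)


if __name__ == "__main__":
    # Compare short and long tasks under a 3 s and a 1 s heartbeat.
    for task in (1.0, 10.0, 60.0):
        for heartbeat in (3.0, 1.0):
            print(task, heartbeat, round(slot_utilization(task, heartbeat), 3))
```

Under this model, 1 s tasks with a 3 s heartbeat keep a slot busy only about 40% of the time, while 60 s tasks are barely affected. The real delay also depends on AM heartbeats and scheduler behavior, which this sketch ignores.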

> RM should be able to process heartbeats from NM asynchronously
> --------------------------------------------------------------
>
>                 Key: YARN-4088
>                 URL: https://issues.apache.org/jira/browse/YARN-4088
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, scheduler
>            Reporter: Srikanth Kandula
>
> Today, the RM sequentially processes one heartbeat after another. 
> Imagine a 3000 server cluster with each server heart-beating every 3s. This 
> gives the RM 1ms on average to process each NM heartbeat. That is tough.
> It is true that there are several underlying data structures that will be 
> touched during heartbeat processing. So, it is non-trivial to parallelize the 
> NM heartbeat. Yet, it is quite doable...
> Parallelizing the NM heartbeat would substantially improve the scalability of 
> the RM, allowing it to 
> a) run larger clusters, or 
> b) support faster heartbeats or dynamic scaling of heartbeats, or 
> c) take more asks from each application, or 
> d) use cleverer / more expensive algorithms such as node labels or better 
> packing or ...
> Indeed the RM's scalability limit has been cited as the motivating reason for 
> a variety of efforts which will become less needed if this can be solved. 
> Ditto for slow heartbeats.  See Sparrow and Mercury papers for example.
> Can we take a shot at this?
> If not, could we discuss why?
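
The 1 ms figure in the description is simple arithmetic, sketched below. The `workers` parameter is a hypothetical addition that assumes perfectly parallel heartbeat processing with no lock contention, which a real RM would not achieve; it only shows how the per-heartbeat budget would scale in the best case.

```python
def per_heartbeat_budget_ms(num_nodes, heartbeat_interval_s, workers=1):
    """Average time the RM can spend on one NM heartbeat, in milliseconds.

    With sequential processing (workers=1), every node's heartbeat must be
    handled within one interval, so the budget is interval / num_nodes.
    Assumption (hypothetical): with `workers` threads and ideal
    parallelism, the budget grows linearly with the number of workers.
    """
    return heartbeat_interval_s * 1000.0 / num_nodes * workers


if __name__ == "__main__":
    # The case from the description: 3000 servers, 3 s heartbeat.
    print(per_heartbeat_budget_ms(3000, 3.0))
    # Same cluster with a 1 s heartbeat, then with 8 ideal workers.
    print(per_heartbeat_budget_ms(3000, 1.0))
    print(per_heartbeat_budget_ms(3000, 3.0, workers=8))
```

The same formula shows why larger clusters and faster heartbeats both shrink the sequential budget.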



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
