[ https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717831#comment-14717831 ]
Bikas Saha commented on YARN-4088:
----------------------------------

Why not on a 3K cluster? We could slow down heartbeats (to, say, 10s) on a 3K-node cluster. That should work, though I agree that NM info would be stale for longer, if that's your point.

> RM should be able to process heartbeats from NM asynchronously
> --------------------------------------------------------------
>
>                 Key: YARN-4088
>                 URL: https://issues.apache.org/jira/browse/YARN-4088
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, scheduler
>            Reporter: Srikanth Kandula
>
> Today, the RM sequentially processes one heartbeat after another.
> Imagine a 3000 server cluster with each server heart-beating every 3s. This
> gives the RM 1ms on average to process each NM heartbeat. That is tough.
> It is true that there are several underlying data structures that will be
> touched during heartbeat processing. So, it is non-trivial to parallelize the
> NM heartbeat. Yet, it is quite doable...
> Parallelizing the NM heartbeat would substantially improve the scalability of
> the RM, allowing it to either
> a) run larger clusters or
> b) support faster heartbeats or dynamic scaling of heartbeats or
> c) take more asks from each application or
> d) use cleverer/more expensive algorithms such as node labels or better
> packing or ...
> Indeed the RM's scalability limit has been cited as the motivating reason for
> a variety of efforts which will become less needed if this can be solved.
> Ditto for slow heartbeats. See Sparrow and Mercury papers for example.
> Can we take a shot at this?
> If not, could we discuss why?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
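
For reference, the 1ms figure in the quoted description and the effect of the 10s interval suggested in the comment can be checked with a rough back-of-the-envelope sketch. The function name and the `workers` parameter are hypothetical and are not part of YARN. The `workers` factor assumes heartbeat processing parallelizes perfectly, which the description itself notes is non-trivial because of shared data structures.

```python
def per_heartbeat_budget_ms(num_nodes: int,
                            heartbeat_interval_s: float,
                            workers: int = 1) -> float:
    """Average time (in ms) available to process one NM heartbeat.

    Assumes heartbeats arrive evenly spread over the interval.
    `workers` is an idealized assumption: it models perfectly parallel
    processing with no contention on shared scheduler state.
    """
    heartbeats_per_second = num_nodes / heartbeat_interval_s
    return 1000.0 * workers / heartbeats_per_second


if __name__ == "__main__":
    # 3000 nodes at 3s (the case in the issue) and at 10s (the suggestion).
    for interval_s in (3, 10):
        budget = per_heartbeat_budget_ms(3000, interval_s)
        print(f"{interval_s}s heartbeats: {budget:.2f} ms per heartbeat")
```

Under these assumptions, slowing heartbeats from 3s to 10s raises the sequential budget from about 1ms to about 3.3ms per heartbeat, at the cost of NM information being stale for longer.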