[ https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Srikanth Kandula updated YARN-4088:
-----------------------------------
    Summary: RM should be able to process heartbeats from NM concurrently  (was: RM should be able to process heartbeats from NM asynchronously)

> RM should be able to process heartbeats from NM concurrently
> ------------------------------------------------------------
>
>                 Key: YARN-4088
>                 URL: https://issues.apache.org/jira/browse/YARN-4088
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, scheduler
>            Reporter: Srikanth Kandula
>
> Today, the RM sequentially processes one heartbeat after another.
> Imagine a 3000-server cluster with each server heartbeating every 3s. That
> is 1000 heartbeats per second, which gives the RM 1ms on average to process
> each NM heartbeat. That is tough.
> It is true that several underlying data structures are touched during
> heartbeat processing, so it is non-trivial to parallelize the NM heartbeat.
> Yet, it is quite doable...
> Parallelizing the NM heartbeat would substantially improve the scalability of
> the RM, allowing it to
> a) run larger clusters or
> b) support faster heartbeats or dynamic scaling of heartbeats or
> c) take more asks from each application or
> d) use cleverer/more expensive algorithms such as node labels or better
> packing or ...
> Indeed, the RM's scalability limit has been cited as the motivating reason for
> a variety of efforts which will become less needed if this can be solved.
> Ditto for slow heartbeats. See the Sparrow and Mercury papers, for example.
> Can we take a shot at this?
> If not, could we discuss why?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)