[
https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Srikanth Kandula updated YARN-4088:
-----------------------------------
Summary: RM should be able to process heartbeats from NM concurrently
(was: RM should be able to process heartbeats from NM asynchronously)
> RM should be able to process heartbeats from NM concurrently
> ------------------------------------------------------------
>
> Key: YARN-4088
> URL: https://issues.apache.org/jira/browse/YARN-4088
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager, scheduler
> Reporter: Srikanth Kandula
>
> Today, the RM sequentially processes one heartbeat after another.
> Imagine a 3000-server cluster with each server heartbeating every 3s. That
> is 1000 heartbeats per second, which gives the RM 1ms on average to process
> each NM heartbeat. That is tough.
> It is true that there are several underlying data structures that will be
> touched during heartbeat processing. So, it is non-trivial to parallelize the
> NM heartbeat. Yet, it is quite doable...
> Parallelizing the NM heartbeat would substantially improve the scalability of
> the RM, allowing it to
> a) run larger clusters or
> b) support faster heartbeats or dynamic scaling of heartbeats or
> c) take more asks from each application or
> d) use cleverer/more expensive algorithms such as node labels or better
> packing or ...
> Indeed, the RM's scalability limit has been cited as the motivating reason for
> a variety of efforts that would become less necessary if this can be solved.
> Ditto for slow heartbeats. See the Sparrow and Mercury papers for example.
> Can we take a shot at this?
> If not, could we discuss why?
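
To make the arithmetic and the idea above concrete, here is a minimal sketch, not YARN code. It assumes that most heartbeat work touches only per-node state, so each node can be guarded by its own lock while a small worker pool handles different nodes in parallel. The class `ConcurrentHeartbeatSketch`, `NodeState`, `onHeartbeat`, and `perHeartbeatBudgetMillis` are all hypothetical names invented for this illustration. The budget figure is idealized: it ignores shared scheduler structures (queues, application state), which are exactly the non-trivial part mentioned in the description.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these names exist in YARN.
final class ConcurrentHeartbeatSketch {

    // Per-node bookkeeping, guarded by the NodeState object's own monitor.
    static final class NodeState {
        long heartbeats;
    }

    private final ConcurrentHashMap<String, NodeState> nodes = new ConcurrentHashMap<>();
    private final AtomicLong totalHeartbeats = new AtomicLong();

    // Idealized time budget per heartbeat, assuming the work spreads
    // perfectly across `workers` threads with no shared-state contention.
    static double perHeartbeatBudgetMillis(int nodeCount, double intervalSeconds, int workers) {
        return intervalSeconds * 1000.0 * workers / nodeCount;
    }

    // Heartbeats from different nodes may run concurrently; heartbeats
    // from the same node are serialized by that node's lock.
    void onHeartbeat(String nodeId) {
        NodeState state = nodes.computeIfAbsent(nodeId, id -> new NodeState());
        synchronized (state) {
            state.heartbeats++;
        }
        totalHeartbeats.incrementAndGet(); // cluster-wide counter, lock-free
    }

    long heartbeatsFor(String nodeId) {
        NodeState state = nodes.get(nodeId);
        if (state == null) {
            return 0;
        }
        synchronized (state) {
            return state.heartbeats;
        }
    }

    long totalHeartbeats() {
        return totalHeartbeats.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int nodeCount = 3000;
        int workers = 8;
        System.out.printf("1 worker: %.1f ms per heartbeat%n",
                perHeartbeatBudgetMillis(nodeCount, 3.0, 1));
        System.out.printf("%d workers: %.1f ms per heartbeat (idealized)%n",
                workers, perHeartbeatBudgetMillis(nodeCount, 3.0, workers));

        ConcurrentHeartbeatSketch rm = new ConcurrentHeartbeatSketch();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        // Simulate two heartbeat rounds from every node.
        for (int round = 0; round < 2; round++) {
            for (int i = 0; i < nodeCount; i++) {
                String nodeId = "nm-" + i;
                pool.submit(() -> rm.onHeartbeat(nodeId));
            }
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("heartbeats processed: " + rm.totalHeartbeats());
    }
}
```

Per-node locking is only one possible design. Any state shared across nodes would need its own concurrency control, and that is where the real difficulty lies.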