Wangda Tan commented on YARN-2877:

Hi [~kkaranasos],
Thanks for the reply:
bq. We are planning to address this by having smaller heartbeat intervals in 
the AM-LocalRM communication when compared to the LocalRM-RM. For instance, the 
AM-LocalRM heartbeat interval can be set to 50ms, while the LocalRM-RM interval 
to 200ms (in other words, we will propagate to the RM only one in every 
four heartbeats).
Maybe you could also take a look at HADOOP-11552, which could possibly achieve 
better latency and reduce heartbeat frequency.
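
To make the quoted 50ms/200ms example concrete, here is a minimal sketch of forwarding only every fourth AM heartbeat to the RM. The class and method names (HeartbeatRelay, onAmHeartbeat) are hypothetical and are not part of YARN or of the design doc; the sketch also assumes the RM interval is an exact multiple of the AM interval.

```java
/** Hypothetical sketch (not YARN code): decides which AM heartbeats a LocalRM relays to the RM. */
final class HeartbeatRelay {
    private final int forwardEvery; // relay one in every `forwardEvery` AM heartbeats
    private long received = 0;

    HeartbeatRelay(long amIntervalMs, long rmIntervalMs) {
        // Assumption: the RM interval is a positive whole multiple of the AM interval.
        if (amIntervalMs <= 0 || rmIntervalMs < amIntervalMs
                || rmIntervalMs % amIntervalMs != 0) {
            throw new IllegalArgumentException(
                "RM interval must be a positive multiple of the AM interval");
        }
        this.forwardEvery = (int) (rmIntervalMs / amIntervalMs);
    }

    /** Called once per AM heartbeat; returns true if this heartbeat should also go to the RM. */
    boolean onAmHeartbeat() {
        received++;
        return received % forwardEvery == 0;
    }

    int forwardEvery() {
        return forwardEvery;
    }

    public static void main(String[] args) {
        HeartbeatRelay relay = new HeartbeatRelay(50, 200);
        for (int i = 1; i <= 8; i++) {
            System.out.println("heartbeat " + i + " forwarded=" + relay.onAmHeartbeat());
        }
    }
}
```

With intervals of 50 and 200, heartbeats 4 and 8 are the ones relayed, so the RM sees a quarter of the AM-LocalRM traffic.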

bq. This is a valid concern. The best way to minimize preemption is through the 
"top-k node list" technique described above. As the LocalRM will be placing the 
QUEUEABLE containers to the least loaded nodes, preemption will be minimized.
I think the top-k node list technique cannot completely solve the 
oversubscription issue. In a production cluster, applications arrive in waves, 
and it is possible that a few large applications exhaust all resources in the 
cluster within a few seconds. Another possible approach to mitigate the issue 
is to propagate queue-able containers from the NM to the RM periodically, so 
the NM can still make decisions while the RM is also aware of these queue-able 
containers.
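
For reference, the "top-k node list" idea quoted above could be sketched roughly as follows. This is only an illustration: the NodeLoad type, the use of queued-container count as the load metric, and the tie-break by host name are all assumptions, not what the design doc specifies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch (not YARN code): picks the k least-loaded nodes from a load snapshot. */
final class TopKNodes {

    /** Assumed load report: host name plus number of queued containers on that node. */
    static final class NodeLoad {
        final String host;
        final int queuedContainers;

        NodeLoad(String host, int queuedContainers) {
            this.host = host;
            this.queuedContainers = queuedContainers;
        }
    }

    /** Returns up to k host names, least loaded first; ties are broken by host name. */
    static List<String> leastLoaded(List<NodeLoad> nodes, int k) {
        List<NodeLoad> sorted = new ArrayList<>(nodes); // leave the caller's list untouched
        sorted.sort(Comparator
            .comparingInt((NodeLoad n) -> n.queuedContainers)
            .thenComparing(n -> n.host));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, sorted.size()); i++) {
            result.add(sorted.get(i).host);
        }
        return result;
    }
}
```

Note that the list is computed from a snapshot, so when a wave of applications arrives between two snapshots, several LocalRMs may pick the same "least loaded" nodes at once; that is the case I am worried about.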

bq. That said, as you also mention, QUEUEABLE containers are more suitable for 
short-running tasks, where the probability of a container being preempted is 
Ideally it's better to support all non-long-running-service tasks. The LocalRM 
could allocate short-running queue-able tasks and the RM can allocate other 
queue-able tasks.

> Extend YARN to support distributed scheduling
> ---------------------------------------------
>                 Key: YARN-2877
>                 URL: https://issues.apache.org/jira/browse/YARN-2877
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, resourcemanager
>            Reporter: Sriram Rao
>            Assignee: Konstantinos Karanasos
>         Attachments: distributed-scheduling-design-doc_v1.pdf
> This is an umbrella JIRA that proposes to extend YARN to support distributed 
> scheduling.  Briefly, some of the motivations for distributed scheduling are 
> the following:
> 1. Improve cluster utilization by opportunistically executing tasks on 
> otherwise idle resources on individual machines.
> 2. Reduce allocation latency for tasks where the scheduling time dominates 
> (i.e., task execution time is much shorter than the time required for 
> obtaining a container from the RM).

This message was sent by Atlassian JIRA
