[ 
https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241618#comment-14241618
 ] 

Konstantinos Karanasos commented on YARN-2877:
----------------------------------------------

Thanks for the input, [~adhoot]. This is an interesting discussion.

There are indeed cases where distributed scheduling can hurt job latency. This 
is more pronounced when:
# Queueable containers are used for both short- and long-running tasks.
# Jobs have many tasks (the chances that at least one of these tasks gets stuck 
in a queue are higher).
# Cluster load is high.

Based on the above situations, a first observation is that queueable containers 
should mostly be used for short-running tasks if job latency is important.
Moreover, when jobs have a large number of tasks, the AM policy should probably 
ask for optimistic containers only for a subset of them (even if they are all 
short-running).
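
To make the point about jobs with many tasks concrete, here is a 
back-of-the-envelope sketch. It assumes, as a simplification that has not been 
measured, that each task independently gets stuck in a queue with probability 
p_stuck. The function name and the 5% value are hypothetical.

```python
def prob_some_task_stuck(num_tasks: int, p_stuck: float) -> float:
    """Probability that at least one of num_tasks tasks gets stuck,
    assuming each task is stuck independently with probability p_stuck."""
    return 1.0 - (1.0 - p_stuck) ** num_tasks


# Hypothetical per-task probability of landing in a slow queue.
P_STUCK = 0.05

for n in (1, 10, 100):
    print(n, round(prob_some_task_stuck(n, P_STUCK), 3))
```

Under these assumptions, the chance that some task is delayed grows from 5% for 
a single task to roughly 40% for 10 tasks and over 99% for 100 tasks, which is 
why requesting queueable containers for only a subset of the tasks may help.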

Still, as you also mention, corrective mechanisms should be used to further 
improve latency.
- One such mechanism is *queuing in multiple locations* as is done by Sparrow 
and Apollo. In that case the LocalRM should pick two nodes instead of one to 
queue the request. This is something we have not tried yet, but it may be 
useful to do so.
- Another mechanism we are proposing is *queue rebalancing*: whenever some 
queues have a much higher load than others, we dequeue some of their requests 
and send them to less loaded queues. Of course, we need to be careful about 
when we dequeue containers, because we may end up increasing latency if we 
accidentally dequeue the same request many times.
- A last mechanism that seems interesting is *reordering of requests* within a 
queue, based on some policy (e.g., based on the submission time of the 
application the task belongs to).
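
The three mechanisms above could be sketched roughly as follows. This is only 
an illustration and not YARN code: Request, NodeQueue, and all function and 
parameter names (max_gap, max_moves) are hypothetical, and queue length stands 
in for load. The max_moves cap is one possible way to avoid dequeuing the same 
request many times.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    app_submit_time: float  # submission time of the owning application
    moves: int = 0          # how many times rebalancing has moved this request


@dataclass
class NodeQueue:
    name: str
    pending: deque = field(default_factory=deque)


def queue_in_two_locations(request, nodes, rng=random):
    """Queue the same request on two distinct, randomly chosen nodes."""
    chosen = rng.sample(nodes, 2)
    for node in chosen:
        node.pending.append(request)
    return chosen


def cancel_duplicates(request, nodes, started_on):
    """Once one node starts the request, drop its copies on other nodes."""
    for node in nodes:
        if node is not started_on and request in node.pending:
            node.pending.remove(request)


def rebalance(nodes, max_gap=2, max_moves=1):
    """Move requests from the longest queue to the shortest one until their
    lengths differ by at most max_gap. A request is moved at most max_moves
    times, which also guarantees that the loop terminates."""
    max_gap = max(max_gap, 1)
    moved = 0
    while True:
        longest = max(nodes, key=lambda n: len(n.pending))
        shortest = min(nodes, key=lambda n: len(n.pending))
        if len(longest.pending) - len(shortest.pending) <= max_gap:
            return moved
        movable = [r for r in longest.pending if r.moves < max_moves]
        if not movable:
            return moved
        victim = movable[-1]  # take from the tail: it has waited the least
        longest.pending.remove(victim)
        victim.moves += 1
        shortest.pending.append(victim)
        moved += 1


def reorder_by_app_submit_time(node):
    """Reorder a queue so requests of earlier-submitted applications run first."""
    node.pending = deque(sorted(node.pending, key=lambda r: r.app_submit_time))
```

In this sketch a caller would invoke cancel_duplicates as soon as one of the 
two chosen nodes starts the request. Note that combining duplicate queuing with 
rebalancing would need extra care (for example, not moving a copy onto a node 
that already holds one), which is left out here.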

More thoughts are definitely welcome.

> Extend YARN to support distributed scheduling
> ---------------------------------------------
>
>                 Key: YARN-2877
>                 URL: https://issues.apache.org/jira/browse/YARN-2877
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, resourcemanager
>            Reporter: Sriram Rao
>
> This is an umbrella JIRA that proposes to extend YARN to support distributed 
> scheduling.  Briefly, some of the motivations for distributed scheduling are 
> the following:
> 1. Improve cluster utilization by opportunistically executing tasks on 
> otherwise idle resources on individual machines.
> 2. Reduce allocation latency.  This matters for tasks where the scheduling 
> time dominates (i.e., task execution time is much shorter than the time 
> required to obtain a container from the RM).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
