[ 
https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241570#comment-14241570
 ] 

Anubhav Dhoot commented on YARN-2877:
-------------------------------------

+1 for the notion of distributed scheduling. I think it will go a long way 
toward addressing both the latency and scale goals for YARN.

In my experience with similar distributed scheduling systems, we can run 
into the following types of issues:
a) The node is currently full of running containers, and the estimate of when 
capacity will free up for queued requests can be hard to make or simply wrong. 
Your request might stay queued for a long time, hurting the startup latency of 
the queueable container.
b) Multiple LocalRMs could race to grab available space on an NM, and one 
might get queued behind the others' requests, with effects similar to a).

For the sake of discussing mechanisms, I would suggest weighing the pros and 
cons of the ability to 1) schedule queueable containers on multiple nodes and 
2) cancel queued requests.
Giving a request at least 2 NM choices could address a lot of the variability 
in queueable container startup latency.
One way is to keep the queue of requests in the NM but, if needed, have the NM 
ultimately confirm with the requesting LocalRM that the queued request is 
still valid.
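
To make the discussion concrete, here is a minimal sketch of one way the two 
mechanisms could fit together: the same request is queued on two NMs, and an 
NM asks the LocalRM for confirmation before starting a queued request. This 
is a toy model, not YARN code. The class and method names (Request, 
NodeManager, LocalRM, submit, cancel, confirm, on_container_finished) are 
hypothetical, and treating "multiple nodes" as "queue on every chosen node, 
first to confirm wins" is an assumed interpretation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional, Set


@dataclass
class Request:
    """A queueable container request (hypothetical model, not a YARN class)."""
    request_id: str
    cancelled: bool = False


@dataclass
class NodeManager:
    """Toy NM: a count of running containers plus a FIFO queue of requests."""
    name: str
    running: int = 0
    queue: Deque[Request] = field(default_factory=deque)

    def enqueue(self, request: Request) -> None:
        self.queue.append(request)

    def on_container_finished(
        self, still_valid: Callable[[Request], bool]
    ) -> Optional[Request]:
        """Free one slot, then start the first queued request that is still valid."""
        self.running -= 1
        while self.queue:
            request = self.queue.popleft()
            # Confirm with the requesting LocalRM before starting anything.
            if still_valid(request):
                self.running += 1
                return request
        return None


class LocalRM:
    """Toy LocalRM that queues a request on several NMs and arbitrates starts."""

    def __init__(self) -> None:
        self.started: Set[str] = set()

    def submit(self, request: Request, targets: List[NodeManager]) -> None:
        # Queue the same request on every chosen NM (assumed: at least two).
        for nm in targets:
            nm.enqueue(request)

    def cancel(self, request: Request) -> None:
        # Lazy cancellation: stale queue entries are dropped when an NM asks.
        request.cancelled = True

    def confirm(self, request: Request) -> bool:
        """Return True only for the first NM that asks about a live request."""
        if request.cancelled or request.request_id in self.started:
            return False
        self.started.add(request.request_id)
        return True
```

With two full NMs and one request queued on both, whichever NM frees a slot 
first starts the request; the other NM's confirm call is refused, so its 
queue entry is simply discarded. The cost of this design is an extra 
NM-to-LocalRM round trip per start, which is one of the cons worth weighing.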

> Extend YARN to support distributed scheduling
> ---------------------------------------------
>
>                 Key: YARN-2877
>                 URL: https://issues.apache.org/jira/browse/YARN-2877
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, resourcemanager
>            Reporter: Sriram Rao
>
> This is an umbrella JIRA that proposes to extend YARN to support distributed 
> scheduling.  Briefly, some of the motivations for distributed scheduling are 
> the following:
> 1. Improve cluster utilization by opportunistically executing tasks on 
> otherwise idle resources on individual machines.
> 2. Reduce allocation latency.  This matters for tasks where the scheduling 
> time dominates (i.e., task execution time is much less than the time 
> required for obtaining a container from the RM).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
