[ 
https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583713#comment-13583713
 ] 

Bikas Saha commented on YARN-392:
---------------------------------

From what I understand, this seems to be tangentially going down the path of 
the discussion that happened in YARN-371. The crucial point is that the YARN 
resource scheduler is *not* a task scheduler. So introducing concepts that 
directly or indirectly make it do task scheduling would be inconsistent with 
the design. It's a coarse-grained resource allocator that gives the app 
containers representing chunks of resources, using which the app can schedule 
its tasks. Different versions of the scheduler (Fair, Capacity, or otherwise) 
change the way the resource sharing is done. Ideally we should have only one 
scheduler with hooks to change the sharing policy. The code kind of reflects 
that, because there is so much common code/logic between the two 
implementations.

Unfortunately, in both the Fair and Capacity Scheduler the implementations have 
mixed up 1) the decision to allocate at and below a given topology level [say 
the * level] with 2) whether there are resource requests at that level. E.g., 
when the allocation cycle is started for an app, the logic starts at * and 
checks whether the resource request count is > 0. Only if yes does it go into 
racks and then nodes. This means that if an application wants resources only at 
a node, it has to create requests at the rack and * levels too. This is because 
locality relaxation has gotten mixed up with being "schedulable", if you catch 
my drift. My strong belief is that if we can fix this overload then we won't 
need to fix this jira. However, I can see that fixing the overload will be a 
very complicated knot to untie, and perhaps impossible to do now because it may 
be inextricably linked with the API. Which is why I created this jira.
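To make the overload concrete, here is a hypothetical sketch (not actual YARN code; all names are illustrative) of the gating logic described above: the scheduler only descends from * to rack to node when a request exists at each enclosing level, so a node-only request is never even considered.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the */rack/node gating overload, not YARN source.
public class OverloadSketch {
    static final String ANY = "*";
    // outstanding container counts keyed by resource name (*, rack, or node)
    static Map<String, Integer> requests = new HashMap<>();

    // returns true if the scheduler would reach the node-level request
    static boolean canAssignOnNode(String rack, String node) {
        if (requests.getOrDefault(ANY, 0) <= 0) return false;   // gate 1: * level
        if (requests.getOrDefault(rack, 0) <= 0) return false;  // gate 2: rack level
        return requests.getOrDefault(node, 0) > 0;              // gate 3: node level
    }

    public static void main(String[] args) {
        // App asks only for node "n1" on rack "r1", with no * or rack requests.
        requests.put("n1", 1);
        System.out.println(canAssignOnNode("r1", "n1")); // false: gated out at *

        // Only after duplicating the request at rack and * is the node reached.
        requests.put(ANY, 1);
        requests.put("r1", 1);
        System.out.println(canAssignOnNode("r1", "n1")); // true
    }
}
```

The point of the sketch is that "is there a request at this level" is doing double duty as "may I relax locality past this level", which is exactly the entanglement the comment describes.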

Now, if the problem is the * overload that I describe above, then the problem 
is the entanglement of delay scheduling (for locality). Here is an alternative 
proposal that addresses this problem. Let's make the delay of delay scheduling 
specifiable by the application. So an application can specify how long to wait 
before relaxing its node requests to rack and *. When an app wants containers 
on specific nodes, it basically means that it does not want the RM to 
automatically relax its locality - so it specifies a large value for the delay. 
The end result is allocation on specific nodes if resources become available 
on those nodes. This also serves as a useful extension of delay scheduling: 
short apps can be aggressive in relaxing locality, while long+large jobs can be 
more conservative in trading off scheduling speed against network IO.
The catch in the proposal is that such requests have to be made at a different 
priority level. Resource requests at the same priority level get aggregated, 
and we don't want to aggregate relaxable resource requests with non-relaxable 
ones. I think this is a good thing to do anyway, because it makes the 
application think about and decide which kind of tasks it needs to get running 
first.
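A minimal sketch of the app-specified delay idea, assuming hypothetical names (this is not a YARN API): the locality the RM is allowed to use is purely a function of how long the request has waited versus the per-app delays, and a node-pinned request simply passes an effectively infinite delay.

```java
// Illustrative sketch of app-specified locality-relaxation delays.
public class DelaySketch {
    enum Locality { NODE, RACK, ANY }

    // Given how long a request has waited and the app-specified delays for
    // each relaxation step, return the loosest locality the RM may use.
    static Locality allowedLocality(long waitedMs, long nodeToRackDelayMs,
                                    long rackToAnyDelayMs) {
        if (waitedMs < nodeToRackDelayMs) return Locality.NODE;
        if (waitedMs < nodeToRackDelayMs + rackToAnyDelayMs) return Locality.RACK;
        return Locality.ANY;
    }

    public static void main(String[] args) {
        // A short app relaxes quickly: after 5s with 3s delays it is at RACK.
        System.out.println(allowedLocality(5_000, 3_000, 3_000));       // RACK
        // A node-pinned request uses an effectively infinite delay, so the
        // RM never relaxes it past NODE.
        System.out.println(allowedLocality(5_000, Long.MAX_VALUE, 0));  // NODE
    }
}
```

Under this sketch, "schedule to specific nodes without dropping locality" falls out as the large-delay special case rather than a new request type.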

An extension of this approach also ties in nicely with the API enhancement 
suggested by YARN-394. The RM could actually inform the app that it has not 
been able to allocate a resource request on a node and that the time limit has 
elapsed. At that point, the app could cancel that request and ask for an 
alternative set of nodes. I agree I am hand-waving in this paragraph.

Thoughts?
                
> Make it possible to schedule to specific nodes without dropping locality
> ------------------------------------------------------------------------
>
>                 Key: YARN-392
>                 URL: https://issues.apache.org/jira/browse/YARN-392
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Sandy Ryza
>         Attachments: YARN-392.patch
>
>
> Currently it's not possible to specify scheduling requests for specific nodes 
> and nowhere else. The RM automatically relaxes locality to rack and * and 
> assigns non-specified machines to the app.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
