[
https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583713#comment-13583713
]
Bikas Saha commented on YARN-392:
---------------------------------
From what I understand this seems to be tangentially going down the path of
the discussion that happened in YARN-371. The crucial point is that the YARN
resource scheduler is *not* a task scheduler. So introducing concepts that
directly or indirectly make it do task scheduling would be inconsistent with
the design. It's a coarse-grained resource allocator that gives the app
containers representing chunks of resources, which the app can then use to
schedule its tasks. Different versions of the scheduler (Fair, Capacity, or
otherwise) change only the way the resource sharing is done. Ideally we should
have only one scheduler with hooks to change the sharing policy. The code kind
of reflects that, because there is so much common code/logic between the two
implementations.
Unfortunately, in both the Fair and Capacity Scheduler the implementations have
mixed up 1) the decision to allocate at and below a given topology level [say
the * level] with 2) whether there are resource requests at that level. E.g.
when the allocation cycle is started for an app, the logic starts at * and
checks whether the resource request count is > 0. If yes, it goes into racks
and then nodes. This means that if an application wants resources only at a
node, it has to create requests at the rack and * level too. This is because
locality relaxation has gotten mixed up with being "schedulable", if you catch
my drift. My strong belief is that if we can fix this overload then we won't
need to fix this jira. However, I can see that fixing the overload will be a
very complicated knot to untie, and perhaps impossible to do now because it may
be inextricably linked with the API. Which is why I created this jira.
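To make the overload concrete, here is a minimal sketch (not the actual Fair/Capacity Scheduler code; class and method names are illustrative) of the top-down check described above: the allocation cycle starts at the * level and only descends to rack and node requests when the enclosing level's count is non-zero, so a node-only request is invisible unless matching rack and * requests also exist.

```java
import java.util.HashMap;
import java.util.Map;

public class RequestTable {
    public static final String ANY = "*";

    // resourceName (a node, a rack, or "*") -> outstanding container count
    private final Map<String, Integer> counts = new HashMap<>();

    public void addRequest(String resourceName, int numContainers) {
        counts.merge(resourceName, numContainers, Integer::sum);
    }

    private int count(String name) {
        return counts.getOrDefault(name, 0);
    }

    // Mirrors the entangled logic: a node-local request is only reachable
    // if requests also exist at its rack and at "*". The count at each
    // level is doing double duty as a "schedulable" flag.
    public boolean canAllocateOnNode(String node, String rack) {
        return count(ANY) > 0 && count(rack) > 0 && count(node) > 0;
    }
}
```

With only a node-level request in the table, `canAllocateOnNode` returns false; the app is forced to add rack and * requests just to become schedulable, which is exactly the overload being described.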
Now, if the problem is the * overload that I describe above, then the problem
is the entanglement of delay scheduling (for locality). Here is an alternative
proposal that addresses it. Let's make the delay of delay scheduling
specifiable by the application. So an application can specify how long to wait
before relaxing its node requests to rack and *. When an app wants containers
on specific nodes, it basically means that it does not want the RM to
automatically relax its locality - so it specifies a large value for the delay.
The end result is allocation on specific nodes if resources become available on
those nodes. This also serves as a useful extension of delay scheduling: short
apps can be aggressive in relaxing locality, while long+large jobs can be more
conservative in trading off scheduling speed against network IO.
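The proposal above could look something like the following sketch, assuming the delay is a per-request value supplied by the application (the names and the two-step relaxation schedule are illustrative, not an actual YARN API): a request relaxes from node to rack to * only as the app-specified delay elapses, and an app that wants specific nodes passes an effectively infinite delay.

```java
public class LocalityDelayPolicy {
    public enum Level { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    // waitedMs: how long this request has been outstanding.
    // appDelayMs: how long the app is willing to wait before relaxing
    // one locality level; Long.MAX_VALUE means "never relax".
    public static Level allowedLevel(long waitedMs, long appDelayMs) {
        if (appDelayMs == Long.MAX_VALUE) {
            return Level.NODE_LOCAL;     // node-only request: never relaxed
        }
        if (waitedMs < appDelayMs) {
            return Level.NODE_LOCAL;     // still within the delay window
        }
        if (waitedMs < 2 * appDelayMs) {
            return Level.RACK_LOCAL;     // relax one level to the rack
        }
        return Level.OFF_SWITCH;         // fully relaxed to "*"
    }
}
```

A short app would pass a small delay and relax quickly; a long+large job would pass a large (or infinite) delay and hold out for its preferred nodes.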
The catch in the proposal is that such requests have to be made at a different
priority level. Resource requests at the same priority level get aggregated,
and we don't want to aggregate relaxable resource requests with non-relaxable
ones. I think this is a good thing to do anyway, because it makes the
application think about and decide which kind of tasks it needs to get running
first.
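The aggregation problem can be sketched as follows (illustrative names only, not the actual protocol records): because requests are keyed by (priority, resourceName) and their counts merged, a relaxable and a non-relaxable request at the same priority and node collapse into one entry with no field left to tell them apart - hence the need for separate priorities.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class RequestBook {
    public static final class Key {
        final int priority;
        final String resourceName;

        Key(int priority, String resourceName) {
            this.priority = priority;
            this.resourceName = resourceName;
        }

        @Override public boolean equals(Object o) {
            return o instanceof Key
                && ((Key) o).priority == priority
                && ((Key) o).resourceName.equals(resourceName);
        }

        @Override public int hashCode() {
            return Objects.hash(priority, resourceName);
        }
    }

    private final Map<Key, Integer> table = new HashMap<>();

    // Requests with the same (priority, resourceName) key are aggregated
    // into a single count; any per-request distinction is lost.
    public void ask(int priority, String resourceName, int numContainers) {
        table.merge(new Key(priority, resourceName), numContainers, Integer::sum);
    }

    public int outstanding(int priority, String resourceName) {
        return table.getOrDefault(new Key(priority, resourceName), 0);
    }
}
```

Two asks at priority 1 for node1 merge into one count of 5; only by using a second priority does the non-relaxable request keep its own identity.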
An extension of this approach also ties in nicely with the API enhancement
suggested by YARN-394. The RM could actually inform the app that it has not
been able to allocate a resource request on a node and that the time limit has
elapsed. At that point, the app could cancel the request and ask for an
alternative set of nodes. I agree I am hand-waving in this paragraph.
Thoughts?
> Make it possible to schedule to specific nodes without dropping locality
> ------------------------------------------------------------------------
>
> Key: YARN-392
> URL: https://issues.apache.org/jira/browse/YARN-392
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Sandy Ryza
> Attachments: YARN-392.patch
>
>
> Currently it's not possible to specify scheduling requests for specific nodes
> and nowhere else. The RM automatically relaxes locality to rack and * and
> assigns non-specified machines to the app.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira