[
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107009#comment-15107009
]
Nathan Roberts commented on YARN-1011:
--------------------------------------
bq. Welcome any thoughts/suggestions on handling promotion if we allow
applications to ask for only guaranteed containers. I'll continue
brainstorming. We want to have a simple mechanism, if possible; complex
protocols seem to find a way to hoard bugs.
I agree that we want something simple and this probably doesn’t qualify, but
below are some thoughts anyway.
This seems like a difficult problem. Maybe a webex would make sense at some
point to go over the design and work through some of these issues?
Maybe we need to run two schedulers, conceptually anyway. One of them is
exactly what we have today, call it the “GUARANTEED” scheduler. The second one
is responsible for the “OPPORTUNISTIC” space. What I like about this sort of
approach is that we aren’t changing the way the GUARANTEED scheduler would do
things. The GUARANTEED scheduler assigns containers in the same order as it
always has, regardless of whether or not opportunistic containers are being
allocated in the background. By having separate schedulers, we’re not
perturbing the way user_limits, capacity limits, reservations, preemption, and
other scheduler-specific fairness algorithms deal with opportunistic capacity
(I’m concerned we’ll have lots of bugs in this area). The only difference is
that the OPPORTUNISTIC side might already be running a container when the
GUARANTEED scheduler gets around to the same piece of work (the promotion
problem). What I don't like is that it's obviously not simple.
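Very roughly, the split I'm imagining looks something like the sketch below. The class and method names are invented purely for illustration and are not existing YARN scheduler APIs; the only point is that the GUARANTEED pass runs first and is oblivious to the OPPORTUNISTIC pass.
{code:java}
// Hypothetical sketch only; these are not real YARN classes or APIs.
interface NodeView { }                          // stand-in for a node's current state
interface ContainerScheduler {                  // stand-in for one scheduling pass
  void assignContainers(NodeView node);
}

// The GUARANTEED pass is today's scheduler, untouched; the OPPORTUNISTIC pass
// runs afterwards over whatever utilization headroom the node reports.
class TwoLevelScheduler {
  private final ContainerScheduler guaranteed;
  private final ContainerScheduler opportunistic;

  TwoLevelScheduler(ContainerScheduler guaranteed, ContainerScheduler opportunistic) {
    this.guaranteed = guaranteed;
    this.opportunistic = opportunistic;
  }

  void onNodeHeartbeat(NodeView node) {
    // User limits, capacity limits, reservations and preemption behave
    // exactly as they do today, with no knowledge of opportunistic capacity.
    guaranteed.assignContainers(node);
    // A separate policy then fills the remaining headroom.
    opportunistic.assignContainers(node);
  }
}
{code}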
- The OPPORTUNISTIC scheduler could behave very differently from the GUARANTEED
scheduler (e.g. it could only consider applications in certain queues, it could
heavily favor applications with quick-running containers, it could randomly
select applications to fairly use OPPORTUNISTIC space, it could ignore
reservations, it could ignore user limits, it could work extra hard to get good
container locality, etc.)
- When the OPPORTUNISTIC scheduler launches a container, it modifies the ask to
indicate this portion has been launched opportunistically; the size of the ask
does not change (this means the application needs to be aware that it is
launching an OPPORTUNISTIC container)
- As Bikas already mentioned, we have to promote opportunistic containers,
even if it means shooting an opportunistic one and launching a guaranteed one
somewhere else.
- If the GUARANTEED scheduler decides to assign a container y to a portion of
an ask that has already been opportunistically launched as container x, the
AM is asked to migrate container x to container y. If x and y are on the same
host, great, the AM asks the NM to convert x to y (mostly bookkeeping); if not,
the AM kills x and launches y. We probably need a new state to track the
migration (see the sketch after this list).
- Maybe locality would make the killing of opportunistic containers a rare
event? If both schedulers are working hard to get locality (e.g. YARN-80 gets
us to about 80% node local), then it seems like the GUARANTEED scheduler is
usually going to pick the same nodes as the OPPORTUNISTIC scheduler, resulting
in very simple container conversions with no lost work.
- I don't see how we can get away from occasionally shooting an opportunistic
container so that a guaranteed one can run somewhere else. Given that we want
opportunistic space to be used for both SLA and non-SLA work, we can't wait
around for a low-priority opportunistic container to finish on a busy node.
Ideally the OPPORTUNISTIC scheduler would be good at picking containers that
almost never get shot.
- When the GUARANTEED scheduler assigns a container to a node, the
over-allocate thresholds could be violated; in this case, OPPORTUNISTIC
containers on the node need to be shot. It would be good if this didn't happen
when a simple conversion was going to occur anyway.
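Here is a rough sketch of the promotion step from the migration bullet above. Every type and method name is invented for illustration and is not a real AM/NM API; the MIGRATING state only exists to cover the window between killing x and y starting.
{code:java}
// Hypothetical AM-side sketch of the promotion/migration bullet above; every
// type and method here is invented for illustration and is not a real YARN API.
enum MigrationState { RUNNING_OPPORTUNISTIC, MIGRATING, RUNNING_GUARANTEED }

class PromotionHandler {
  // Stand-in for whatever AM<->NM channel would carry these requests.
  interface NodeManagerProxy {
    void convertToGuaranteed(String containerId);   // in-place promotion, bookkeeping only
    void killContainer(String containerId, String nodeId);
    void launchGuaranteed(String containerId, String nodeId);
  }

  static class ContainerInfo {
    String id;
    String nodeId;
    MigrationState state;
  }

  private final NodeManagerProxy nm;

  PromotionHandler(NodeManagerProxy nm) {
    this.nm = nm;
  }

  /** GUARANTEED scheduler assigned y to work already running opportunistically as x. */
  void onGuaranteedAssignment(ContainerInfo x, ContainerInfo y) {
    if (x.nodeId.equals(y.nodeId)) {
      // Same host: conversion is mostly bookkeeping and no work is lost;
      // the duplicate assignment y is simply released back to the RM.
      nm.convertToGuaranteed(x.id);
      x.state = MigrationState.RUNNING_GUARANTEED;
    } else {
      // Different host: shoot the opportunistic container and launch the
      // guaranteed one elsewhere, accepting the lost work.
      x.state = MigrationState.MIGRATING;   // new state tracking the hand-off window
      nm.killContainer(x.id, x.nodeId);
      nm.launchGuaranteed(y.id, y.nodeId);
    }
  }
}
{code}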
Given the complexities of this problem, we're going to experiment with a
simpler approach of over-allocating up to 2-3X on memory, with the NM shooting
containers (preemptable containers first) when free resources are dangerously
low. The over-allocation will be dynamic, based on current node usage (when the
node is idle, no over-allocation; basically, there has to be some evidence that
over-allocating will be successful before we actually over-allocate). This type
of approach might not satisfy all use cases, but it might turn out to be very
simple and mostly effective. We'll report back on how this type of approach
works out.
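As a rough sketch of the kind of NM-side decision we have in mind (the thresholds and names below are purely illustrative assumptions, not settled configuration):
{code:java}
// Rough sketch of the dynamic over-allocation decision described above; the
// thresholds and names are illustrative assumptions, not settled configuration.
class OverAllocationPolicy {

  // Allow handing out up to ~2.5x the node's memory, but only when actual
  // utilization shows that running containers are leaving their allocation unused.
  private static final double MAX_OVERALLOCATION_RATIO = 2.5;
  // Start shooting containers (preemptable/opportunistic ones first) above this.
  private static final double DANGER_UTILIZATION = 0.95;

  /** Extra memory (MB) the NM may accept beyond what is already allocated. */
  long allowedOverAllocationMb(long capacityMb, long allocatedMb, long utilizedMb) {
    if (allocatedMb == 0) {
      // Idle node: no evidence yet that over-allocating will be successful.
      return 0;
    }
    long unusedAllocation = allocatedMb - utilizedMb;
    if (unusedAllocation <= 0) {
      // Containers are using everything they asked for; nothing to exploit.
      return 0;
    }
    // Over-allocate in proportion to the demonstrated slack, capped at 2-3x capacity.
    long maxTotal = (long) (capacityMb * MAX_OVERALLOCATION_RATIO);
    return Math.max(0, Math.min(maxTotal - allocatedMb, unusedAllocation));
  }

  /** True when the node is dangerously low on free resources. */
  boolean shouldShootContainers(long capacityMb, long utilizedMb) {
    return (double) utilizedMb / capacityMb >= DANGER_UTILIZATION;
  }
}
{code}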
> [Umbrella] Schedule containers based on utilization of currently allocated
> containers
> -------------------------------------------------------------------------------------
>
> Key: YARN-1011
> URL: https://issues.apache.org/jira/browse/YARN-1011
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Arun C Murthy
> Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf,
> yarn-1011-design-v2.pdf
>
>
> Currently RM allocates containers and assumes resources allocated are
> utilized.
> RM can, and should, get to a point where it measures utilization of allocated
> containers and, if appropriate, allocate more (speculative?) containers.