[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2020-01-07 Thread Brahma Reddy Battula (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010301#comment-17010301
 ] 

Brahma Reddy Battula commented on YARN-1011:


[~haibochen], it looks like most of the JIRAs are closed. Is there any plan to 
merge to trunk? I am planning the 3.3.0 release, so please let me know.

> [Umbrella] Schedule containers based on utilization of currently allocated 
> containers
> -
>
> Key: YARN-1011
> URL: https://issues.apache.org/jira/browse/YARN-1011
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Arun Murthy
>Assignee: Karthik Kambatla
>Priority: Major
> Attachments: patch-for-yarn-1011.patch, yarn-1011-design-v0.pdf, 
> yarn-1011-design-v1.pdf, yarn-1011-design-v2.pdf, yarn-1011-design-v3.pdf
>
>
> Currently RM allocates containers and assumes resources allocated are 
> utilized.
> RM can, and should, get to a point where it measures utilization of allocated 
> containers and, if appropriate, allocate more (speculative?) containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2018-09-24 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626288#comment-16626288
 ] 

Haibo Chen commented on YARN-1011:
--

The tests look good locally. I have pushed my local branch upstream. Let me 
know if you see issues [~asuresh].




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2018-09-24 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626084#comment-16626084
 ] 

Haibo Chen commented on YARN-1011:
--

Sure. I am testing my local rebased branch. Will push it once the tests finish 
without failures.




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2018-09-21 Thread Arun Suresh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624032#comment-16624032
 ] 

Arun Suresh commented on YARN-1011:
---

I was trying to rebase the branch onto trunk.
I got a couple of merge conflicts, mostly in some FS* classes.
[~haibochen], can you take a look?




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2018-09-20 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622767#comment-16622767
 ] 

Haibo Chen commented on YARN-1011:
--

I'm +1 on moving this.




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2018-09-20 Thread Arun Suresh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622450#comment-16622450
 ] 

Arun Suresh commented on YARN-1011:
---

Planning on spending more cycles on this now.
Looking at the sub-tasks, it looks like some of them are already committed, 
mostly the ones pertaining to ResourceUtilization plumbing and NM cgroups-based 
improvements. Wondering if it is OK to move those into another umbrella JIRA?




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2017-07-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087794#comment-16087794
 ] 

Wangda Tan commented on YARN-1011:
--

Thanks [~kasha] / [~haibo.chen] for the proposal.  

I just left some comments in YARN-6808: 
https://issues.apache.org/jira/browse/YARN-6808?focusedCommentId=16087782&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16087782
 They may be related to this feature as well, so could you please take a look 
when you have time? 

I think it might be better to move some JIRAs, such as YARN-1014 and YARN-6674, 
to a new umbrella: "improve using opportunistic containers for normal use 
cases", as I suggested in YARN-6808.

Thoughts?




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-08-09 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15413696#comment-15413696
 ] 

Karthik Kambatla commented on YARN-1011:


[~anshul.pundir] - as I responded over email, we welcome all contributions. I am 
in the process of updating the design doc to consolidate it with YARN-2883 so 
we can get the commits in. In the interim, I would encourage getting familiar 
with the contribution process by picking up some newbie or minor bugs. 




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-08-05 Thread Anshul Pundir (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409217#comment-15409217
 ] 

Anshul Pundir commented on YARN-1011:
-

Hi [~kasha],

This feature is of considerable interest to the Hadoop team at my company. 
I am wondering if I can collaborate with you on this and help get it in?

-Anshul




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-04-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255967#comment-15255967
 ] 

Karthik Kambatla commented on YARN-1011:


The prototype I have been working on is here: 
https://github.com/kambatla/hadoop/commits/dev-1011-public. It is a little 
hacky, particularly on the NM side, and needs integration with YARN-2883 and 
other cgroup changes for a clean implementation. I have fixed the issues found 
in basic testing. 

I will be running some more involved tests in the next few weeks to identify 
any major design/implementation shortcomings. I will report back and attempt to 
answer any questions asked so far that remain unanswered. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-28 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171244#comment-15171244
 ] 

Bikas Saha commented on YARN-1011:
--

bq. If it absolutely wants a guaranteed container, we should allocate a 
guaranteed container and kill the opportunistic one. If it does not, we can let 
the opportunistic container continue to run.
I get that. The question is: which of the two is the default behavior?
Next question to consider: if we let the opportunistic container run and 
consider it part of the guaranteed capacity, then what prevents the node from 
killing it when the node's resources actually get over-used and something needs 
to be killed by the node? And how does the rest of the system (YARN + app) 
react to losing a guaranteed container?

bq. The overall cluster utilization is an implementation detail; its sole 
purpose is to reduce the chances of running into cases that need cross-node 
promotion.
I am sorry, I could not understand how that is so.




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171114#comment-15171114
 ] 

Karthik Kambatla commented on YARN-1011:


bq. So we are going to add promotion notification to the AM RM protocol, right? 
By corollary, we would be adding a flag to the initial allocation that shows if 
it was guaranteed or opportunistic, right?
Yes and yes.

bq. It's very likely that an app may have a guaranteed and an opportunistic 
container. And when it gives up a guaranteed container then we will need to 
allocate another one to it. That may be the opportunistic container or a new 
one.
Yes, particularly when it has unmet demand outside of the opportunistic 
allocation. If it has no other demand, theoretically we should promote the 
opportunistic container. Promoting across nodes is not always desirable. I feel 
we should let the app tell us what to do. If it absolutely wants a guaranteed 
container, we should allocate a guaranteed container and kill the opportunistic 
one. If it does not, we can let the opportunistic container continue to run. 
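
The choice described above can be written down as a small decision function. This is only a sketch under stated assumptions: PromotionAction, PromotionPolicySketch and both boolean inputs are hypothetical names invented for illustration, not part of YARN or of the design doc. It assumes in-place promotion is preferred whenever the guaranteed allocation fits on the node already running the opportunistic container, and that the app has some way of saying whether it insists on a guaranteed container.

```java
// Hypothetical names throughout; nothing here is an existing YARN API.
enum PromotionAction {
  PROMOTE_IN_PLACE,
  ALLOCATE_GUARANTEED_AND_KILL_OPPORTUNISTIC,
  KEEP_OPPORTUNISTIC_RUNNING
}

final class PromotionPolicySketch {
  private PromotionPolicySketch() {}

  /**
   * Decides what to do with an app's running opportunistic container once the
   * scheduler is able to hand that app a guaranteed container.
   *
   * @param guaranteedFitsOnSameNode the guaranteed allocation is possible on the
   *                                 node that already runs the opportunistic container
   * @param appRequiresGuaranteed    the app has said it absolutely wants a
   *                                 guaranteed container (assumed app-provided hint)
   */
  static PromotionAction decide(boolean guaranteedFitsOnSameNode,
                                boolean appRequiresGuaranteed) {
    if (guaranteedFitsOnSameNode) {
      // Same-node case: no work is lost, so promote the running container.
      return PromotionAction.PROMOTE_IN_PLACE;
    }
    if (appRequiresGuaranteed) {
      // Cross-node case, app insists: start a guaranteed container elsewhere
      // and kill the opportunistic one.
      return PromotionAction.ALLOCATE_GUARANTEED_AND_KILL_OPPORTUNISTIC;
    }
    // Cross-node case, app does not insist: leave the container alone.
    return PromotionAction.KEEP_OPPORTUNISTIC_RUNNING;
  }
}
```

Which branch is the default when the app gives no hint is exactly the open question in this thread; the sketch treats "no hint" as "does not insist".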

bq. The relationship between overall cluster utilization and node 
over-allocation is not clear. Like you say, for an under-allocated cluster, it 
would likely be easy to find a guaranteed container. So I am not sure if we 
should go ahead and make this tenuous link formal by adding it as a config in 
the code. Regardless of cluster allocation state, a node could be 
over-allocated.
The overall cluster utilization is an implementation detail; its sole purpose 
is to reduce the chances of running into cases that need cross-node promotion. 
Node over-allocation continues to be the primary configuration admins fiddle 
with. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-22 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157482#comment-15157482
 ] 

Bikas Saha commented on YARN-1011:
--

2.3 The relationship between overall cluster utilization and node 
over-allocation is not clear. Like you say, for an under allocated cluster, it 
would likely be easy to find a guaranteed container. So I am not sure if we 
should go ahead and make this tenuous link formal by adding it as a config in 
the code. Regardless of cluster allocation state, a node could be 
over-allocated. 

3 So we are going to add promotion notification to the AM RM protocol, right? 
By corollary, we would be adding a flag to the initial allocation that shows if 
it was guaranteed or opportunistic, right?

3.2 I agree that cross-node promotion is complex. But I am afraid it does not 
look like something that can be deferred for later, because that is likely not 
possible. It's very likely that an app may have a guaranteed and an 
opportunistic container. And when it gives up a guaranteed container then we 
will need to allocate another one to it. That may be the opportunistic 
container or a new one. So it's OK to defer any advanced stuff at this time, 
but at a minimum, for the sake of a complete logic definition, we will need 
some default behavior. The obvious default behavior would be to ignore the 
opportunistic container and let it run until it finishes or is preempted 
(because the node becomes busy). This aligns with the philosophy of 
opportunistic allocation being a secondary scheduling loop. Perhaps this is 
what we had in mind, and if so, then my request is to call it out for the sake 
of completeness.



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-22 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157253#comment-15157253
 ] 

Karthik Kambatla commented on YARN-1011:


[~elgoiri] - for work that we want to get into trunk/branch-2, I think we 
should file separate JIRAs so rest of the community is aware of them. Filed 
YARN-4718 for the SchedulerNode changes (and assigned it to you). Filed 
YARN-4719 for the helper library. Let us discuss the details in the individual 
JIRAs. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-22 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157248#comment-15157248
 ] 

Karthik Kambatla commented on YARN-1011:


bq. Could you clarify 2.2? What would trigger this if not nodeUpdate()?
We could just periodically go through all the nodes and allocate containers. 
This is very similar to continuous/asynchronous scheduling in Fair/Capacity 
schedulers. 

bq. 2.3. What's the reasoning behind this? Over-allocating a node seems to be a 
local decision based on the node's expected and actual utilization. So I would 
expect the logic to be something similar to 1) Node is already 100% allocated 
2) Actual utilization is < 80% 3) Over-allocate to bring actual utilization 
~=80%.
There is a node config that determines if the node allows oversubscription and 
by how much. That said, the RM still has to decide when/where to allocate 
opportunistic containers. When the overall cluster utilization is low, it is 
highly likely the RM would find a guaranteed container soon after it allocates 
an opportunistic container for a ResourceRequest. By waiting for this 
utilization to be over a threshold, we are avoiding having to promote 
containers right after allocating them. This shouldn't be a problem in 
practice, because we expect over-allocation to help improve the utilization on 
a fully-allocated cluster. 
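
The two conditions in this reply, a per-node oversubscription setting plus a cluster-wide threshold, can be sketched as a single gate. This is an illustration only: OpportunisticGateSketch, its method and its parameters are hypothetical names, memory in MB stands in for a full resource vector, and "over a threshold" is read here as "at or above".

```java
// Hypothetical helper; not an existing YARN class.
final class OpportunisticGateSketch {
  private OpportunisticGateSketch() {}

  /**
   * Returns true if the RM should consider placing an opportunistic container
   * on a node.
   *
   * @param clusterUsedMb              cluster-wide usage the threshold is measured against
   * @param clusterCapacityMb          total cluster capacity
   * @param thresholdPercent           e.g. 80 for "80% of cluster capacity"
   * @param nodeAllowsOversubscription the node-level config permits over-allocation
   */
  static boolean mayAllocateOpportunistic(long clusterUsedMb,
                                          long clusterCapacityMb,
                                          int thresholdPercent,
                                          boolean nodeAllowsOversubscription) {
    if (!nodeAllowsOversubscription || clusterCapacityMb <= 0) {
      return false;
    }
    // Integer form of used / capacity >= threshold / 100, avoiding
    // floating-point rounding at the boundary.
    return clusterUsedMb * 100 >= clusterCapacityMb * (long) thresholdPercent;
  }
}
```

Below the threshold the gate stays closed, on the reasoning given above that a guaranteed container would probably become available shortly anyway.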

{quote}
3. What is the AM/RM interaction in this promotion?
3.2. Not clear what is actually happening here. Will a new container be 
allocated and the opportunistic container allowed to continue until it exits or 
is preempted?
{quote}
We don't know yet. :)

Promoting a container on the same node is fairly straight-forward: the node 
just promotes the container and the AM can be informed that a running container 
has been promoted should it want to differentiate between opportunistic and 
guaranteed containers. 

I am not actively thinking about promotion across nodes. Given the additional 
complexity, I feel we should see some numbers before going further. And, the 
rest of the work is required anyway. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-20 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155867#comment-15155867
 ] 

Bikas Saha commented on YARN-1011:
--

2.3. What's the reasoning behind this? Over-allocating a node seems to be a 
local decision based on the node's expected and actual utilization. So I would 
expect the logic to be something similar to 1) Node is already 100% allocated 
2) Actual utilization is < 80% 3) Over-allocate to bring actual utilization 
~=80%.

3. What is the AM/RM interaction in this promotion?
3.2. Not clear what is actually happening here. Will a new container be 
allocated and the opportunistic container allowed to continue until it exits or 
is preempted?

How does all this interact with preemption?
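
The three-step node-local rule in 2.3 can be made concrete with a little arithmetic. The following is a minimal sketch, not a proposed implementation: NodeOverAllocationSketch and its parameters are hypothetical names, memory in MB stands in for a full resource vector, and the 80% target is passed in rather than hard-coded.

```java
// Hypothetical helper illustrating the three-step rule; not a YARN class.
final class NodeOverAllocationSketch {
  private NodeOverAllocationSketch() {}

  /**
   * Returns how many MB could be handed out on top of the existing
   * allocations, or 0 if the node does not qualify.
   *
   * @param capacityMb               node capacity
   * @param allocatedMb              sum of current container allocations
   * @param utilizedMb               measured usage of those containers
   * @param targetUtilizationPercent target for actual utilization, e.g. 80
   */
  static long overAllocatableMb(long capacityMb,
                                long allocatedMb,
                                long utilizedMb,
                                int targetUtilizationPercent) {
    // Step 1: only a fully allocated node is a candidate.
    if (allocatedMb < capacityMb) {
      return 0L;
    }
    // Step 2: actual utilization must be below the target.
    long targetMb = capacityMb * targetUtilizationPercent / 100;
    if (utilizedMb >= targetMb) {
      return 0L;
    }
    // Step 3: over-allocate just enough to bring utilization up to the
    // target, assuming the new containers use everything they are given.
    return targetMb - utilizedMb;
  }
}
```

For example, a node with 10000 MB capacity that is fully allocated but only using 5000 MB would have 8000 - 5000 = 3000 MB available for over-allocation at an 80% target.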



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-20 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155829#comment-15155829
 ] 

Inigo Goiri commented on YARN-1011:
---

[~kasha], I think those can apply easily to {{CapacityScheduler}}. To get 
started, I would take a first try in YARN-4511 with:
# Get the oversubscription ratio into {{SchedulerNode}}
# Add methods {{getUsedUtilizationResource()}} and 
{{getAvailableUtilizationResource()}} for {{containersUtilization}} (which is 
already available from YARN-3980 and would also use the oversubscription)
# Rename {{getUsedResource()}} to {{getUsedAllocationResource()}}, 
{{getAvailableResource()}} to {{getAvailableAllocationResource()}}, and 
{{getTotalResource()}} to {{getTotalAllocationResource()}} (I don't think these 
are external APIs so it should be fine to change the names)

For your 2, 3, and 4, we can add them to {{FairScheduler}} in YARN-1015 as a 
first prototype and then move as much as possible to YARN-4511.
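
To make the proposed split between allocation and utilization accessors concrete, here is a rough sketch. The method names are the ones proposed in the list above; everything else is an assumption made for illustration: the class name SchedulerNodeSketch, the use of plain MB values instead of YARN's Resource type, the two setters, and in particular the meaning of the oversubscription ratio, which is treated here as a multiplier on total capacity that caps utilization (this thread does not define it).

```java
// Illustrative stand-in for SchedulerNode; not the real class.
final class SchedulerNodeSketch {
  private final long totalMb;
  private final double oversubscriptionRatio; // assumed meaning: utilization cap = total * ratio
  private long allocatedMb;
  private long utilizedMb;

  SchedulerNodeSketch(long totalMb, double oversubscriptionRatio) {
    this.totalMb = totalMb;
    this.oversubscriptionRatio = oversubscriptionRatio;
  }

  // Hypothetical setters standing in for allocation events and NM heartbeats.
  void setAllocatedMb(long allocatedMb) { this.allocatedMb = allocatedMb; }
  void setUtilizedMb(long utilizedMb) { this.utilizedMb = utilizedMb; }

  // Allocation view: the renamed versions of today's accessors.
  long getTotalAllocationResource() { return totalMb; }
  long getUsedAllocationResource() { return allocatedMb; }
  long getAvailableAllocationResource() {
    return Math.max(0L, totalMb - allocatedMb);
  }

  // Utilization view: based on what the containers actually use.
  long getUsedUtilizationResource() { return utilizedMb; }
  long getAvailableUtilizationResource() {
    long budgetMb = (long) (totalMb * oversubscriptionRatio);
    return Math.max(0L, budgetMb - utilizedMb);
  }
}
```

With this split, a fully allocated node reports zero available allocation while still reporting headroom in the utilization view, which is the signal an opportunistic scheduler would look at.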



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-20 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155822#comment-15155822
 ] 

Inigo Goiri commented on YARN-1011:
---

Could you clarify 2.2? What would trigger this if not {{nodeUpdate()}}?



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155792#comment-15155792
 ] 

Karthik Kambatla commented on YARN-1011:


Looked more closely into how we would implement this for the FairScheduler. 
[~wangda], [~elgoiri] - could you please check whether this is similar to what 
is needed in the CapacityScheduler? I would like for us to make changes in the 
common parts as much as possible, so the schedulers don't diverge further. 
While at it, there might be value in moving more of the common logic to common 
data structures. 
# SchedulerNode needs to track utilized/unutilized resources in addition to 
allocated (called used today) and unallocated (called available today). It 
might make sense to update the names of variables that exist today in trunk 
itself.
# It would be handy to have a single class (maybe NodeTracker or 
NodeListManager) that tracks properties about all nodes and exposes some 
convenient APIs: returning the nodes matching a particular condition (filter), 
or returning a sorted list of NodeIds based on a custom comparator. We need the 
latter to iterate through nodes in the right order for continuous/asynchronous 
scheduling and the newly added opportunistic scheduling. The former could be 
used for locality and label-matching in the future.
# SchedulerApplicationAttempt should expose unmetDemandGuaranteed and 
unmetDemandOpportunistic to be used by the two schedulers. 
# There will be two queue hierarchies - one each for guaranteed and 
opportunistic allocation. This translates to adding a second list of child 
queues/apps in each queue. We could build on what we have in each scheduler, or 
reconcile some of this code into common queue. 
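
Item 2 above, a single class that can filter nodes and return NodeIds in a custom order, might look roughly like the following. It is a sketch under assumptions: NodeTrackerSketch and NodeInfoSketch are hypothetical names, node IDs are plain strings, only two per-node properties are tracked, and no attempt is made at thread safety.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical per-node snapshot; a real tracker would hold SchedulerNode objects.
final class NodeInfoSketch {
  final String nodeId;
  final long unallocatedMb;
  final long unutilizedMb;

  NodeInfoSketch(String nodeId, long unallocatedMb, long unutilizedMb) {
    this.nodeId = nodeId;
    this.unallocatedMb = unallocatedMb;
    this.unutilizedMb = unutilizedMb;
  }
}

// Hypothetical tracker exposing the two APIs described in item 2.
final class NodeTrackerSketch {
  private final Map<String, NodeInfoSketch> nodes = new LinkedHashMap<>();

  void addOrUpdate(NodeInfoSketch node) {
    nodes.put(node.nodeId, node);
  }

  /** Returns the nodes matching a condition, e.g. for locality or labels. */
  List<NodeInfoSketch> getNodes(Predicate<NodeInfoSketch> filter) {
    return nodes.values().stream().filter(filter).collect(Collectors.toList());
  }

  /** Returns all node IDs ordered by a caller-supplied comparator. */
  List<String> sortedNodeIds(Comparator<NodeInfoSketch> order) {
    List<NodeInfoSketch> copy = new ArrayList<>(nodes.values());
    copy.sort(order);
    return copy.stream().map(n -> n.nodeId).collect(Collectors.toList());
  }
}
```

An opportunistic scheduling pass could then walk the nodes in order of most unused resources by calling sortedNodeIds(Comparator.comparingLong((NodeInfoSketch n) -> n.unutilizedMb).reversed()), while the guaranteed pass could sort on unallocatedMb instead.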




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-02-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155692#comment-15155692
 ] 

Karthik Kambatla commented on YARN-1011:


Had offline discussions with [~jlowe], [~nroberts], [~elgoiri], [~kkaranasos] 
and [~asuresh]. Take-aways:
# To ensure the guaranteed containers continue to be allocated exactly the same 
way as today, we leave that scheduling logic as is. 
# A "second scheduler" is responsible for allocating opportunistic containers. 
## This "second scheduler" could be another method that is called during node 
update, or just another thread that runs asynchronously.
## Using an asynchronous thread allows us to process the nodes in the order of 
unused resources instead of node heartbeat.
## Opportunistic scheduling could trigger only after the cluster allocation is 
over a threshold - initially, we could hard code it to 80% of cluster capacity. 
# When the scheduler comes around to allocate a guaranteed container for a 
previously allocated opportunistic container, that container is promoted.
## Promotion on the same node is straightforward and always desirable.
## Promotion across nodes is more complicated and leads to resource wastage. 
However, not promoting could lead to an application getting resources later 
than what it would have with oversubscription turned off. Accordingly, we could 
have a policy to enable/disable cross-node promotion. To begin with, it would 
be disabled. We could always add the option of enabling it in the future. 
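
A minimal sketch of the trigger and node ordering from item 2, assuming a single scalar resource per node. The 80% value is the hard-coded starting point mentioned above; the function name and the tuple layout are made up for illustration.

```python
from typing import Dict, List, Tuple

ALLOCATION_TRIGGER = 0.80  # hard-coded initially, per the take-aways above


def opportunistic_pass(nodes: Dict[str, Tuple[int, int, int]]) -> List[str]:
    """nodes maps node id -> (capacity, allocated, utilized).

    Returns node ids in the order the "second scheduler" would visit them,
    or an empty list if cluster allocation is not over the threshold.
    """
    total_capacity = sum(cap for cap, _, _ in nodes.values())
    total_allocated = sum(alloc for _, alloc, _ in nodes.values())
    if total_allocated <= ALLOCATION_TRIGGER * total_capacity:
        return []  # guaranteed scheduling alone still has room
    # Most unused (capacity - utilized) resources first, not heartbeat order.
    return sorted(nodes, key=lambda nid: nodes[nid][0] - nodes[nid][2], reverse=True)


busy_cluster = {"a": (100, 90, 30), "b": (100, 85, 10), "c": (100, 95, 60)}
quiet_cluster = {"a": (100, 50, 10)}
visit_order = opportunistic_pass(busy_cluster)
```

Sorting by unused resources is only possible because the pass runs on its own schedule; a method called from node update would see nodes in heartbeat order instead.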



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-29 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124489#comment-15124489
 ] 

Inigo Goiri commented on YARN-1011:
---

The second scheduling loop makes sense. I'd like the design doc to be updated 
with this new approach and a couple of examples of how containers would be 
started.

I think the next step would be to start YARN-4511 and maybe create a new JIRA 
for the overallocation scheduling in the NM.
After that, we could try to implement the scheduling approach in YARN-1013 or 
YARN-1015.

We would still be missing the interface to mark containers/applications as 
supporting opportunistic; is this being tracked in any other JIRA?



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-27 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119825#comment-15119825
 ] 

Karthik Kambatla commented on YARN-1011:


[~nroberts] - interesting ideas. The notion of two schedulers definitely frees 
us from the confines we have today. 

How about doing the following on node update (along the lines of the 
two-scheduler suggestion):
{code}
process_newly_launched_containers();
process_completed_containers();
while (guaranteed_resources_on_node_are_still_available) {
  app = pickApp();
  if (app.hasOpportunisticContainersRunningOnTheNode()) {
    promote_container_to_guaranteed();
  } else {
    allocate_guaranteed_container_as_we_do_today(); // includes reservation
  }
}
while (opportunistic_resources_on_node_are_still_available) {
  app = pickAppOkayWithOpportunisticContainers();
  allocate_opportunistic_container(); // reservations are not allowed
}
{code}

This way:
# Apps that can't tolerate using opportunistic containers can state their 
preference to have only guaranteed containers.
# For apps that are okay with opportunistic containers, a container is promoted 
only when the app deserves guaranteed containers. 
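
The loop above can be turned into a small runnable model. This is a sketch under several assumptions that are not in the proposal: containers are one "slot" each, apps are visited in a fixed priority order instead of through pickApp(), reservations are left out, and promoting a container frees the opportunistic slot it occupied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class App:
    name: str
    pending: int                      # outstanding one-slot container requests
    accepts_opportunistic: bool = True
    opportunistic_on_node: int = 0    # opportunistic containers on this node


@dataclass
class Node:
    guaranteed_free: int              # unallocated slots
    opportunistic_free: int           # slots usable for over-allocation
    log: List[str] = field(default_factory=list)


def node_update(node: Node, apps: List[App]) -> None:
    """apps is assumed to be in the scheduler's priority order."""
    # Pass 1: guaranteed resources; promote before allocating anew.
    for app in apps:
        while node.guaranteed_free > 0 and (app.opportunistic_on_node or app.pending):
            if app.opportunistic_on_node:
                app.opportunistic_on_node -= 1
                node.opportunistic_free += 1   # assumption: promotion frees the slot
                node.log.append(f"promote {app.name}")
            else:
                app.pending -= 1
                node.log.append(f"guaranteed {app.name}")
            node.guaranteed_free -= 1
    # Pass 2: opportunistic resources, only for apps that opted in.
    for app in apps:
        while node.opportunistic_free > 0 and app.accepts_opportunistic and app.pending:
            app.pending -= 1
            app.opportunistic_on_node += 1
            node.opportunistic_free -= 1
            node.log.append(f"opportunistic {app.name}")


node = Node(guaranteed_free=2, opportunistic_free=1)
apps = [
    App("a", pending=1, opportunistic_on_node=1),
    App("b", pending=2, accepts_opportunistic=False),
    App("c", pending=2),
]
node_update(node, apps)
```

In this run app "b" is never handed an opportunistic container, and app "a" has its existing container promoted before it receives a new guaranteed one, which matches the two points listed above.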



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-27 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119830#comment-15119830
 ] 

Karthik Kambatla commented on YARN-1011:


BTW, sorry for the delay in circling back here. Was out of the country last 
week. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-19 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107009#comment-15107009
 ] 

Nathan Roberts commented on YARN-1011:
--

bq. Welcome any thoughts/suggestions on handling promotion if we allow 
applications to ask for only guaranteed containers. I'll continue 
brainstorming. We want to have a simple mechanism, if possible; complex 
protocols seem to find a way to hoard bugs.

I agree that we want something simple and this probably doesn’t qualify, but 
below are some thoughts anyway. 

This seems like a difficult problem. Maybe a webex would make sense at some 
point to go over the design and work through some of these issues.

Maybe we need to run two schedulers, conceptually anyway. One of them is 
exactly what we have today, call it the “GUARANTEED” scheduler. The second one 
is responsible for the “OPPORTUNISTIC” space. What I like about this sort of 
approach is that we aren’t changing the way the GUARANTEED scheduler would do 
things. The GUARANTEED scheduler assigns containers in the same order as it 
always has, regardless of whether or not opportunistic containers are being 
allocated in the background. By having separate schedulers, we’re not 
perturbing the way user_limits, capacity limits, reservations, preemption, and 
other scheduler-specific fairness algorithms deal with opportunistic capacity 
(I’m concerned we’ll have lots of bugs in this area). The only difference is 
that the OPPORTUNISTIC side might already be running a container when the 
GUARANTEED scheduler gets around to the same piece of work (the promotion 
problem). What I don't like is that it's obviously not simple.
- The OPPORTUNISTIC scheduler could behave very differently from the GUARANTEED 
scheduler (e.g. it could only consider applications in certain queues, it could 
heavily favor applications with quick running containers, it could randomly 
select applications to fairly use OPPORTUNISTIC space, it could ignore 
reservations, it could ignore user limits, it could work extra hard to get good 
container locality, etc.)
- When the OPPORTUNISTIC scheduler launches a container, it modifies the ask to 
indicate this portion has been launched opportunistically, the size of the ask 
does not change (this means the application needs to be aware that it is 
launching an OPPORTUNISTIC container) 
- Like Bikas already mentioned, we have to promote opportunistic containers, 
even if it means shooting an opportunistic one and launching a guaranteed one 
somewhere else.
- If the GUARANTEED scheduler decides to assign a container y to a portion of 
an ask that has already been opportunistically launched with container x, the 
AM is asked to migrate container x to container y. If x and y are on the same 
host, great, the AM asks the NM to convert x to y (mostly bookkeeping); if not, 
the AM kills x and launches y. We probably need a new state to track the 
migration.
- Maybe locality would make the killing of opportunistic containers a rare 
event? If both schedulers are working hard to get locality (e.g. YARN-80 gets 
us to about 80% node local), then it seems like the GUARANTEED scheduler is 
going to usually pick the same nodes as the OPPORTUNISTIC scheduler, resulting 
in very simple container conversions with no lost work.
- I don’t see how we can get away from occasionally shooting an opportunistic 
container so that a guaranteed one can run somewhere else. Given that we want 
opportunistic space to be used for both SLA and non-SLA work, we can’t wait 
around for a low priority opportunistic container on a busy node. Ideally the 
OPPORTUNISTIC scheduler would be good at picking containers that almost never 
get shot. 
- When the GUARANTEED scheduler assigns a container to a node, the 
over-allocate thresholds could be violated; in this case OPPORTUNISTIC 
containers on the node need to be shot. It would be good if this didn’t happen 
when a simple conversion was going to occur anyway. 
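
The migration rule in the list above (convert on the same host, otherwise kill and relaunch) is small enough to sketch directly. The Container type and the returned step strings are hypothetical; the sketch does not model the extra state that would be needed to track a migration in flight.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Container:
    container_id: str
    host: str


def migrate(opportunistic: Container, guaranteed: Container) -> List[str]:
    """Return the steps the AM would take to move work from x to y."""
    if opportunistic.host == guaranteed.host:
        # Same host: bookkeeping only, the running process is untouched.
        return [f"convert {opportunistic.container_id} -> {guaranteed.container_id}"]
    # Different hosts: work done so far in the opportunistic container is lost.
    return [f"kill {opportunistic.container_id}", f"launch {guaranteed.container_id}"]


same_host = migrate(Container("x", "host1"), Container("y", "host1"))
cross_host = migrate(Container("x", "host1"), Container("y", "host2"))
```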

Given the complexities of this problem, we're going to experiment with a 
simpler approach of over-allocating up to 2-3X on memory with the NM shooting 
containers (preemptable containers first) when resources are dangerously low. 
The over-allocation will be dynamic based on current node usage (when a node is 
idle, no over-allocation; basically there has to be some evidence that 
over-allocating will be successful before we actually over-allocate). This type 
of approach might not satisfy all use cases but it might turn out to be very 
simple and mostly effective. We'll report back on how this type of approach 
works out.


[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098511#comment-15098511
 ] 

Karthik Kambatla commented on YARN-1011:


[~nroberts], [~leftnoteasy] - reasonable concerns. I am looking into allowing 
the app to ask only for guaranteed containers. Scheduling will likely remain 
simple: in our loop, we just skip an application if it is not interested in 
opportunistic containers. Promotion, though, becomes tricky: we should hold off 
on promoting a container until all higher-"priority" applications that want 
only guaranteed containers get them.

Welcome any thoughts/suggestions on handling promotion if we allow applications 
to ask for only guaranteed containers. I'll continue brainstorming. We want to 
have a simple mechanism, if possible; complex protocols seem to find a way to 
hoard bugs. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101106#comment-15101106
 ] 

Wangda Tan commented on YARN-1011:
--

bq. Welcome any thoughts/suggestions on handling promotion if we allow 
applications to ask for only guaranteed containers. I'll continue 
brainstorming. We want to have a simple mechanism, if possible; complex 
protocols seem to find a way to hoard bugs.
Agree :)



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-13 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096938#comment-15096938
 ] 

Nathan Roberts commented on YARN-1011:
--

Thanks for the update [~kasha]. I have a few questions but I'll start at a high 
level since the answers may clear up others.
bq. On SchedulerNode#allocateContainer, mark the container OPPORTUNISTIC if 
allocating this container would take the allocation over the advertised 
capacity.  And, keep track of opportunistic containers in a list separate from 
launchedContainers.

It sounds like any container from any queue (assuming CapacityScheduler here 
but similar constructs will exist in other schedulers) could be marked 
OPPORTUNISTIC. Does this mean a queue with SLA-sensitive jobs could get 
OPPORTUNISTIC containers? I just don't see how these jobs will be able to meet 
their SLAs with the uncertainty as to the resources they'll actually get, and 
the likelihood they'll get shot by the NM. This is precisely the reason there 
is the ability to designate queues as preemptable, since SLA-sensitive jobs just 
don't do well in preemptable queues. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-13 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097009#comment-15097009
 ] 

Inigo Goiri commented on YARN-1011:
---

[~kasha] can you add some examples to the doc for clarification?

For example, say a node has 10 CPUs, 
yarn.nodemanager.overallocation.allocation-threshold is OT=0.5 and 
yarn.nodemanager.overallocation.preemption-threshold is PT=1.0. In the regular 
case we have 2 containers with an allocation of 5 CPUs each and each uses only 
2.5 CPUs. With this proposal, we could add a third container with an allocation 
of 5 CPUs which also uses 2.5 CPUs. When would we preempt? How would this 
evolve when the utilization changes? How would it promote?
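
To put numbers on the scenario, here is the arithmetic spelled out. It does not answer the questions above. The rule "preempt only when utilization exceeds PT" is an assumption made for this sketch, and whether a utilization exactly equal to OT still permits over-allocation is left open.

```python
CPUS = 10.0
ALLOCATION_THRESHOLD = 0.5   # OT in the example
PREEMPTION_THRESHOLD = 1.0   # PT in the example

containers = [(5.0, 2.5), (5.0, 2.5)]  # (allocated CPUs, used CPUs)


def allocation(cs):
    """Fraction of the node's CPUs that has been allocated."""
    return sum(alloc for alloc, _ in cs) / CPUS


def utilization(cs):
    """Fraction of the node's CPUs that is actually in use."""
    return sum(used for _, used in cs) / CPUS


before = (allocation(containers), utilization(containers))   # (1.0, 0.5)
containers.append((5.0, 2.5))   # third, over-allocated container
after = (allocation(containers), utilization(containers))    # (1.5, 0.75)

# Assumed rule: preempt only when utilization exceeds PT.
must_preempt = after[1] > PREEMPTION_THRESHOLD                # False
```

Under that assumed rule, the third container would be preempted only if combined usage grew beyond 10 CPUs, i.e. if the containers together started using more than 2.5 additional CPUs.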




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096987#comment-15096987
 ] 

Karthik Kambatla commented on YARN-1011:


bq. Does this mean a queue with SLA sensitive jobs could get OPPORTUNISTIC 
containers?
You bring up a good point. 

Let us consider an example - an app requests a container at time 0 that would 
run for 30 seconds. Today, let us say, the app would have gotten the 
container 10s from submission. With opportunistic scheduling, let us say we are 
able to schedule an OPPORTUNISTIC container at 5s. Now, if we have resource 
contention within the next 5 seconds, we would just preempt this OPPORTUNISTIC 
container and it would get scheduled at 10s as it would have otherwise. If the 
contention were to surface at 15s from submission, we would have to preempt and 
re-allocate losing 5s compared to what would have happened today. I was hoping 
the ability to tune allocation and preemption thresholds would greatly reduce 
the likelihood of preemptions, but we can't rely on that for SLA jobs. 

I am wary of allowing applications to ask for only GUARANTEED containers, as 
that would complicate the scheduler significantly. Since applications are 
notified of the ExecutionType, is it okay to leave it up to them to reject 
OPPORTUNISTIC containers? We could add this to AMRMClient and MapReduce.
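
The timing example can be checked with a few lines of arithmetic. The sketch assumes that a preempted container loses all of its progress and restarts as soon as both the preemption has happened and the guaranteed slot (at 10s) is available; those restart semantics are an assumption, not something stated in the thread.

```python
RUNTIME = 30          # seconds the container runs
GUARANTEED_AT = 10    # when it would have been allocated today
OPPORTUNISTIC_AT = 5  # when over-allocation starts it


def finish_time(preempted_at=None):
    """Completion time of the container's work, measured from submission."""
    if preempted_at is None:
        return OPPORTUNISTIC_AT + RUNTIME
    # Assumption: all progress is lost, and the restart happens once the
    # preemption has occurred and the guaranteed slot is available.
    restart = max(preempted_at, GUARANTEED_AT)
    return restart + RUNTIME


baseline = GUARANTEED_AT + RUNTIME            # 40: without oversubscription
never_preempted = finish_time()               # 35: finishes 5s early
early_contention = finish_time(8)             # 40: same as today
late_contention = finish_time(15)             # 45: 5s worse than today
```

The late-contention case is the one that matters for SLA jobs: the later the preemption, the more the job loses compared with never having been scheduled opportunistically.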






[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-13 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097690#comment-15097690
 ] 

Wangda Tan commented on YARN-1011:
--

Thanks for updating the docs, [~kasha]. Most of the design doc makes sense to 
me, but I have the same concern regarding the SLA of opportunistic containers:

I'm not very familiar with CPU scheduling in CGroups. Do you know what will 
happen in the following example?
Let's say a node (strict resource usage disabled) has 60% CPU utilization, 
which is consumed by 10 processes, each with CPU share = 1024. Now a new 
process comes in with CPU share = 1, and it wants 100% of the CPU. Will the 
system give the 40% idle CPU to the new process, or can the new process only 
get a small proportion of the idle resources?

If the process can fully leverage idle resources regardless of CPU share, I'm 
fine with setting the lowest CPU share for opportunistic containers.
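
For what it is worth, cgroup CPU shares are generally described as relative weights that only matter when the CPU is contended, so idle cycles are not withheld from a low-share group. The sketch below is an idealized, single-resource model of that behaviour (weighted max-min sharing, also called water-filling). It is not how the kernel scheduler is implemented, and it ignores multi-core placement and any hard caps.

```python
from typing import List


def weighted_fair_share(capacity: float, demands: List[float],
                        weights: List[float]) -> List[float]:
    """Work-conserving weighted max-min allocation (water-filling)."""
    grant = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > 1e-12:
        total_weight = sum(weights[i] for i in active)
        share = {i: remaining * weights[i] / total_weight for i in active}
        satisfied = [i for i in active if demands[i] <= share[i]]
        if not satisfied:
            # Everyone still wants more than their share: split by weight.
            for i in active:
                grant[i] = share[i]
            break
        for i in satisfied:
            # Demand fits within the share; the leftover returns to the pool.
            grant[i] = demands[i]
            remaining -= demands[i]
            active.remove(i)
    return grant


# The example above: ten processes using 6% each (share 1024), plus one new
# process with share 1 that wants the whole CPU.
idle_case = weighted_fair_share(1.0, [0.06] * 10 + [1.0], [1024] * 10 + [1])

# Contended case: a share-1024 process and a share-1 process both want 100%.
contended_case = weighted_fair_share(1.0, [1.0, 1.0], [1024, 1])
```

In this model the low-share process receives the roughly 40% that nobody else is using, but only about 0.1% once a share-1024 process competes for the same cycles.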

bq. Won't the scheduler just try to assign it to that application again? Seems 
like they'll fight one another. "Here's a container".."No, I don't like this 
container".."No really, you should like this container, take it."
I think this is a valid concern. [~kasha], do you think the scheduler should be 
smart enough to avoid allocating opportunistic containers to an app that 
doesn't want them?



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-13 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097040#comment-15097040
 ] 

Nathan Roberts commented on YARN-1011:
--

Won't the scheduler just try to assign it to that application again? Seems like 
they'll fight one another. "Here's a container".."No, I don't like this 
container".."No really, you should like this container, take it."



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-06 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085677#comment-15085677
 ] 

Nathan Roberts commented on YARN-1011:
--

bq. This is one of the reasons I was proposing the notion of a max threshold 
which is less than 1. If the utilization goes to 100%, we clearly know there is 
contention. Since we measure resource utilization in resource-seconds (if not, 
we should update it), bursty spikes alone wouldn't take utilization over 100%. 
So, we shouldn't see a utilization greater than 100%.

Just to make sure I understand. When you say max threshold < 1 are you saying 
an NM could not advertise 48 vcores if there are only 24 vcores physically 
available? I think we have to support going above 1.0. We already go above 1.0 
on our clusters, even without this feature. What I'm thinking this feature will 
allow us to do is to go significantly above 1.0, especially for resources like 
memory where we have to be much more careful about not hitting 100%. 

One use case that I'm really hoping this feature can support is a batch cluster 
(loose SLAs) with very high utilization. For this use case, I'd like the 
following to be true:
- nodes can be at 100% CPU, 100% Network, or 100% Disk for long periods of time 
(several minutes). Memory could get to something like 80% before corrective 
action would be required. During these periods, no containers get shot to shed 
load. Nodemanagers might reduce their available resource advertised to the RM, 
but nothing would need to be killed.
- Both GUARANTEED and OPPORTUNISTIC containers get their fair share of 
resources. They're both drawing from the same capacity and user-limit from the 
RM's point of view so I feel like they should be given their fair set of 
resources on the nodes they execute on. The real point of being designated 
OPPORTUNISTIC in this use case is that the NM knows which containers to kill 
when it needs to shed load.  

Another use case is where you have a mixture of jobs, some with tight SLAs, 
some with looser SLAs. This one is mentioned in previous comments and is also 
very important. It requires a different set of thresholds and a different level 
of fairness controls. 

So, I just think things have to be configurable enough to handle both types of 
clusters. 


> [Umbrella] Schedule containers based on utilization of currently allocated 
> containers
> -
>
> Key: YARN-1011
> URL: https://issues.apache.org/jira/browse/YARN-1011
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Arun C Murthy
> Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf
>
>
> Currently RM allocates containers and assumes resources allocated are 
> utilized.
> RM can, and should, get to a point where it measures utilization of allocated 
> containers and, if appropriate, allocate more (speculative?) containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085738#comment-15085738
 ] 

Karthik Kambatla commented on YARN-1011:


bq. Just to make sure I understand. When you say max threshold < 1 are you 
saying an NM could not advertise 48 vcores if there are only 24 vcores 
physically available?
You can continue to advertise more vcores. 

Consider a cluster with nodes of 1 physical core. Let us say each node 
advertises 10 *vcores*. Today, let us say your CPU utilization under these 
settings is 50% running 10 containers. All these containers in this context 
would be GUARANTEED containers. I am proposing we set a max threshold for the 
RM over-allocating containers to 95%. This essentially means the RM allocates 
OPPORTUNISTIC containers on this node (that has been previously fully 
allocated) until we hit the utilization threshold of 95% - say, running 19 
containers. At this point if one container's usage goes higher, taking us beyond 
95%, we kill enough OPPORTUNISTIC containers to bring this under 95%. Maybe 
the max allowed threshold could be higher - 99%. I am wary of setting it to 
100% unless we have some other way of differentiating "running comfortably at 
100%" vs "contention at 100%" because both look the same. Also, I am assuming 
people would be very happy with 95% utilization if we achieve that :)

bq. nodes can be at 100% CPU, 100% Network, or 100% Disk for long periods of 
time (several minutes). Memory could get to something like 80% before 
corrective action would be required. 
I am beginning to see the need for different thresholds for different 
resources. While I wouldn't necessarily shoot for 100, I can see someone 
configuring it to 95% CPU, 85% network (as this could spike significantly with 
shuffle etc.), 90% disk, 80% memory. And, we would stop over-allocating the 
moment we hit *any one* of these thresholds. 

Should we keep it simple to begin with and have one config, and add other 
configs in the future? Or, do you think the config-per-resource should be there 
from the get go? 
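
A per-resource variant could look roughly like the sketch below. The numbers are the example values from this comment, not defaults of any real configuration; the function names and the dictionary layout are invented for illustration.

```python
from typing import Dict, List

# Example values from the comment above; not defaults of any real config.
THRESHOLDS: Dict[str, float] = {
    "cpu": 0.95,
    "network": 0.85,
    "disk": 0.90,
    "memory": 0.80,
}


def exceeded(utilization: Dict[str, float],
             thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Resources whose utilization has reached its threshold."""
    return [r for r, limit in thresholds.items()
            if utilization.get(r, 0.0) >= limit]


def can_over_allocate(utilization: Dict[str, float]) -> bool:
    """Over-allocation stops as soon as any one threshold is hit."""
    return not exceeded(utilization)


calm_node = {"cpu": 0.50, "network": 0.30, "disk": 0.20, "memory": 0.60}
shuffle_heavy_node = {"cpu": 0.50, "network": 0.90, "disk": 0.20, "memory": 0.60}
```

A single-config version is the special case where every resource maps to the same value, so starting with one config and adding per-resource overrides later would not change this shape.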


[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085840#comment-15085840
 ] 

Karthik Kambatla commented on YARN-1011:


bq. Let's say job capacity is 1 container and the job asks for 2. It gets 1 
normal container and 1 opportunistic container. Now it releases its 1 normal 
container. At this point what happens to the opportunistic container? It is 
clearly running at lower priority on the node and as such we are not giving the 
job its guaranteed capacity. 
Momentarily, yes. The RM/NM ensemble (let us discuss that separately) realizes 
this and adjusts by promoting the opportunistic container. Is this different 
from what happens today? Today, the job is allocated one container since that 
is its capacity. Once that is done, it allocates another. Between the first one 
finishing and second one launching, we are not giving the job its guaranteed 
capacity. 

bq. The question is not about finding an optimal solution for this problem (and 
there may not be one). The issue here is to crisply define the semantics around 
scheduling in the design. Whatever the semantics are, we should clearly know 
what they are. IMO, the exact semantics of scheduling should be in the docs.
Agree. I'll add something to the design doc once we capture everyone's 
concerns/suggestions here on JIRA, and maybe we could iterate. 

bq. Because of that complexity, I'm not 100% convinced that disfavoring 
OPPORTUNISTIC containers (e.g. low value for cpu_shares) is something that buys 
us very much. 
I don't necessarily see it as disfavoring OPPORTUNISTIC containers. Without 
over-allocation these containers wouldn't even have started. While we are 
optimizing for utilization and throughput, we are just making sure we don't 
adversely affect containers that have been launched prior with promises of 
isolation. 

The low value of cpu_shares only kicks in when the node is highly contended, 
and is intended to be a fail-safe. As long as there are free resources (which I 
believe is the most common case), these OPPORTUNISTIC containers should get a 
sizeable CPU share. No? 

bq. So, hopefully we can make the policy quite configurable so that the amount 
of disfavoring can be tuned for various workloads.
I agree that we might eventually need a configurable policy, but making the 
policy configurable might not be as straightforward. I am definitely open to 
inputs on simple ways of doing this. Also, it is hard to comment on the 
effectiveness of a simple-but-not-so-configurable policy without implementing 
it and running sample workloads against it.

The simple policy I had in mind was:
# Update SchedulerNode#getAvailable to include resources that could be 
opportunistically allocated, i.e., max(what_it_says_today, threshold * 
resource). It should be easy to support per-resource thresholds here. 
# At allocate time, label an allocation OPPORTUNISTIC if it takes the 
cumulative allocation over the advertised capacity.
# When space frees up on nodes, NMs send candidate containers for promotion on 
the heartbeat. The RM consults a policy to come up with a list of yes/no 
decisions for each of these candidates. Initially, I would like the default 
to be yes without any reconsideration. This favors continuing the execution of 
a container over preempting it. 

Based on what we see, we could tweak this simple policy or come up with more 
sophisticated policies. 
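
The three steps of the simple policy could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the class and function names are hypothetical, memory is the only resource modeled, and "threshold * resource" is read here as the room left between actual utilization and threshold * capacity, which is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class ExecType(Enum):
    GUARANTEED = "GUARANTEED"
    OPPORTUNISTIC = "OPPORTUNISTIC"


@dataclass
class SchedulerNode:
    capacity_mb: int        # capacity advertised by the NM
    allocated_mb: int = 0   # sum of all container allocations on the node
    utilized_mb: int = 0    # actual usage reported on the heartbeat

    def get_available(self, threshold):
        """Step 1: max(what it says today, room left under the threshold)."""
        unallocated = self.capacity_mb - self.allocated_mb
        # Assumption: opportunistic room is measured against actual utilization.
        under_threshold = int(threshold * self.capacity_mb) - self.utilized_mb
        return max(unallocated, under_threshold, 0)

    def allocate(self, request_mb, threshold):
        """Step 2: label the allocation OPPORTUNISTIC if it takes the
        cumulative allocation over the advertised capacity."""
        if request_mb > self.get_available(threshold):
            return None  # does not fit, even opportunistically
        self.allocated_mb += request_mb
        if self.allocated_mb > self.capacity_mb:
            return ExecType.OPPORTUNISTIC
        return ExecType.GUARANTEED


def decide_promotions(candidates, policy=lambda container_id: True):
    """Step 3: one yes/no decision per candidate; the default policy says yes."""
    return {container_id: policy(container_id) for container_id in candidates}
```

With an 8192 MB node that has 7168 MB allocated but only 2048 MB in use and a threshold of 0.8, the first 1024 MB request is still GUARANTEED and the next one is labeled OPPORTUNISTIC.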



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085868#comment-15085868
 ] 

Karthik Kambatla commented on YARN-1011:


We might be better off calling this overallocation instead of oversubscription, 
as the latter could be mistaken for oversubscription through the 
yarn.nodemanager.resource.* configs. I'll go ahead and use overallocation in 
patches like the one for YARN-4512, unless someone expresses reservations here 
or on YARN-4512. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085872#comment-15085872
 ] 

Karthik Kambatla commented on YARN-1011:


BTW, if we agree on the simple policy, I believe we should be able to pull off 
a scheduler-agnostic implementation. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-06 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086326#comment-15086326
 ] 

Bikas Saha commented on YARN-1011:
--

Some of what I am saying emanates from prior experience with a different 
Hadoop-like system. You can read more about it here. 
http://research.microsoft.com/pubs/232978/osdi14-paper-boutin.pdf



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-06 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086325#comment-15086325
 ] 

Bikas Saha commented on YARN-1011:
--

bq. At this point what happens to the opportunistic container. It is clearly 
running at lower priority on the node and as such we are not giving the job its 
guaranteed capacity.
Yes. The difference is that the opportunistic container may not be convertible 
into a normal container because that node is still over-allocated. So at this 
point, what should be done? Should this container be terminated and run 
somewhere else as normal (because capacity is now available)? Should some other 
container be preempted on this node to make this container normal? Should the 
RM allocate a normal container and give it to the app in addition to the 
running opportunistic container in case the app can do the transfer internally?

Also, with this feature in place, should we run all containers beyond 
guaranteed capacity as opportunistic containers? This would ensure that any 
excess containers that we give to a job will not affect performance of the 
guaranteed containers of other jobs. This would also make the scheduling and 
allocation more consistent in that the guaranteed containers always run at 
normal priority and extra containers run at lower priority. The extra container 
could be extra over capacity (but without over-subscription) or extra 
over-subscription. Because of this I feel that running tasks at lower priority 
could be an independent (but related) work item.

Staying on this topic and adding configuration to it: it may make sense to 
add some way by which an application can say, don't oversubscribe nodes when 
my containers run on them. Putting cgroups or docker in this context, would 
these mechanisms support over-allocating resources like cpu or memory?

bq. When space frees up on nodes, NMs send candidate containers for promotion 
on the heartbeat.
That shouldn't be necessary since the RM will get to know about free capacity 
and run its scheduling cycle for that node - at which point it will be able to 
take action like allocating a new container or upgrading an existing one. There 
isn't anything the NM can tell the RM (that the RM does not already know) 
except for the current utilization of the node.

Some of what I am saying emanates from prior experience with a different 
Hadoop-like system. You can read more about it here. 
http://research.microsoft.com/pubs/232978/osdi14-paper-boutin.pdf




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083223#comment-15083223
 ] 

Karthik Kambatla commented on YARN-1011:


We would run an opportunistic container on a node only if the actual 
utilization is less than the allocation by a margin bigger than the allocation 
of said opportunistic container. We reactively preempt the opportunistic 
container if the actual utilization goes over a threshold. To address spikes in 
usage where our reactive measures are too slow to kick in, we run the 
opportunistic containers at a strictly lower priority. 
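
The launch and preemption conditions described above can be written down as two small checks. This is only a sketch under stated assumptions: the function names are hypothetical, a single resource is modeled, and the preemption threshold is taken to be a fraction of the node's advertised capacity.

```python
def can_launch_opportunistic(allocated, utilized, request):
    """Launch only if the unused part of the current allocation is strictly
    bigger than the allocation of the new opportunistic container."""
    return (allocated - utilized) > request


def should_preempt_opportunistic(utilized, capacity, threshold):
    """Reactively preempt once actual utilization goes over the threshold.
    Assumption: the threshold is a fraction of advertised capacity."""
    return utilized > threshold * capacity
```

For example, a node with 8192 MB allocated and 4096 MB in use can take a 2048 MB opportunistic container, while the same node at 7000 MB in use cannot and, with a 0.8 threshold on 8192 MB, would already be a preemption candidate.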

bq. the app got opportunistic containers and their perf wasn't the same as 
normal containers - so it ran slower. 
As soon as we realize the perf is slower because the node has higher usage than 
we had anticipated, we preempt the container and retry allocation (guaranteed 
or opportunistic depending on the new cluster state). So, it shouldn't run 
slower for longer than our monitoring interval. Is this assumption okay? 

bq. However, things get complicated because a node with an opportunistic 
container may continue to run its normal containers while space frees up for 
guaranteed capacity on other nodes.
The opportunistic container will continue to run on this node so long as it is 
getting the resources it needs. If there is any sort of resource contention, it 
is preempted and is up for allocation on one of the free nodes. 

bq. This would require that the system upgrade opportunistic containers in the 
same order as it would allocate containers.
bq. IMO, the NM cannot make a local choice about upgrading its opportunistic 
containers because this is effectively a resource allocation decision and only 
the RM has the info to do that.
The RM schedules the next highest priority "task" for which it couldn't find a 
guaranteed container as an opportunistic container. This task continues to run 
as long as it is getting enough resources. If there is no resource 
contention, the task continues to run. If guaranteed resources free up on the 
node it is running on, isn't it fair to promote the container to Guaranteed? 
After all, if the unused resources were not hidden behind other containers' 
allocations and were actually available as guaranteed capacity on that node 
initially, the RM would just have scheduled a guaranteed container in the first 
place.

I should probably clarify that the proposal here targets those cases where 
users' estimates are significantly off from reality and there are enough free 
resources per node to run additional task(s) without causing any resource 
contention. Even though this is the norm, we want to guard against spikes in 
usage to avoid perf regressions. In practice, I expect admins to come up with a 
reasonable threshold for over-subscription: e.g. 0.8 - we oversubscribe only 
up to 80% of the capacity advertised through 
{{yarn.nodemanager.resource.*}}




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083226#comment-15083226
 ] 

Karthik Kambatla commented on YARN-1011:


Since the plan is to develop this on a branch, we could get started on some of 
the sub-tasks and adjust things as the design evolves. I am creating a YARN-1011 
branch to track this work. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-05 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083683#comment-15083683
 ] 

Bikas Saha commented on YARN-1011:
--

Good points, but let me play devil's advocate to get some more clarity :)
bq. As soon as we realize the perf is slower because the node has higher usage 
than we had anticipated, we preempt the container and retry allocation 
(guaranteed or opportunistic depending on the new cluster state). So, it 
shouldn't run slower for longer than our monitoring interval. Is this 
assumption okay?
How do we determine that the perf is slower? The CPU would never exceed 100% 
even under over-allocation. Is preempting always necessary? If we are sure that 
the OS is going to starve the opportunistic containers, then can we assume that 
when the node is fully utilized, only our guaranteed containers are using 
resources? So we can let the opportunistic containers be so that they can start 
soaking up excess capacity after the normal containers have stopped spiking. 
Perhaps some experiments will shed some light on this.

bq. The opportunistic container will continue to run on this node so long as it 
is getting the resources it needs. If there is any sort of resource contention, 
it is preempted and is up for allocation on one of the free nodes.
Let's say job capacity is 1 container and the job asks for 2. It gets 1 normal 
container and 1 opportunistic container. Now it releases its 1 normal 
container. At this point what happens to the opportunistic container? It is 
clearly running at lower priority on the node and as such we are not giving the 
job its guaranteed capacity. The question is not about finding an optimal 
solution for this problem (and there may not be one). The issue here is to 
crisply define the semantics around scheduling in the design. Whatever the 
semantics are, we should clearly know what they are. IMO, the exact semantics 
of scheduling should be in the docs.

bq. The RM schedules the next highest priority "task" for which it couldn't 
find a guaranteed container as an opportunistic container. This task continues 
to run as long as it is getting enough resources. If there is no resource 
contention, the task continues to run. If guaranteed resources free up on the 
node it is running on, isn't it fair to promote the container to Guaranteed?
Sure. And that's why the system should upgrade opportunistic containers in the 
order in which they were allocated. However, the decision must be made at the 
RM and not the NM since the NMs don't know about total capacity and multiple NMs 
locally upgrading their opportunistic containers might end up over-allocating 
for a job. Further, the queue sharing state may have changed since the 
opportunistic allocation, and hence assuming that the opportunistic container 
"would have" gotten that allocation anyways, at a later point in time, may not 
be valid.
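
To make the concern concrete, here is a rough sketch of promotion decided centrally, in allocation order, and bounded by each job's remaining guaranteed capacity. Everything in it is hypothetical: the Container tuple, the function name, and the idea of tracking headroom as a simple per-job number are illustrative assumptions, not existing scheduler APIs.

```python
from collections import namedtuple

# Hypothetical record of a running opportunistic container.
Container = namedtuple("Container", "cid job node alloc_seq size")


def promote_in_allocation_order(opportunistic, node_free, job_headroom):
    """Walk opportunistic containers in the order they were allocated and
    promote one only if BOTH its node has free guaranteed capacity AND its
    job is still under its guaranteed share at decision time."""
    node_free = dict(node_free)        # work on copies, leave inputs untouched
    job_headroom = dict(job_headroom)
    promoted = []
    for c in sorted(opportunistic, key=lambda c: c.alloc_seq):
        fits_node = node_free.get(c.node, 0) >= c.size
        fits_job = job_headroom.get(c.job, 0) >= c.size
        if fits_node and fits_job:
            node_free[c.node] -= c.size
            job_headroom[c.job] -= c.size
            promoted.append(c.cid)
    return promoted
```

If jobA has headroom for one more container and holds opportunistic containers on two nodes that both free up, only the earlier allocation is promoted. Two NMs deciding locally would each have promoted their own container and over-allocated the job.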

In summary, what we need in the document is a clear definition of the 
scheduling policy around this - whatever that policy may be.




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-05 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083957#comment-15083957
 ] 

Bikas Saha commented on YARN-1011:
--

I agree with relying on natural container churn rather than preemption to avoid 
lost work, though the issue of clearly defining scheduler policy still remains.

bq.  If we were oversubscribing 10X then I'd probably want it for sure, but if 
it's at most 2X capacity then worst case is a container only gets 50% of the 
resource it had requested. Obviously for something like memory this has to be 
closely controlled because going over the physical capabilities of the machine 
has very significant consequences. But for CPU, I'd definitely be inclined to 
live with the occasional 50% worst case for all containers, in order to avoid 
the 1/1024th worst case for OPPORTUNISTIC containers on a busy node.
I did not understand this. Does this mean it's OK for normal containers to run 
50% slower in the presence of opportunistic containers? If yes, then there are 
scenarios where this may not be a valid choice, e.g. when a cluster is running 
a mix of SLA and non-SLA jobs. Non-SLA jobs are OK if their containers get 
slowed down to increase cluster utilization by running opportunistic containers, 
because we are getting higher overall throughput. But SLA jobs are not OK with 
missing deadlines because their tasks ran 50% slower. 

IMO, the litmus test for a feature like this would be to take an existing 
cluster (with low utilization because tasks are asking for more resources than 
what they need 100% of the time), then turn this feature on and get better 
cluster utilization and throughput without affecting the existing workload, 
whatever the internal implementation details may be. Agree?

bq. 50% of maximum-under-utilized resource of past 30 min for each NM can be 
used to allocate opportunistic containers.
These are heuristics and may all be valid under different circumstances. We 
should step back and look at the source of this optimization.
Observation: the cluster is under-utilized despite being fully allocated.
Possible reasons:
1) Tasks are incorrectly over-allocated. Will never use the resources they ask 
for and hence we can safely run additional opportunistic containers. So this 
feature is used to compensate for poorly configured applications. Probably a 
valid scenario but is it common?
2) Tasks are correctly allocated but don't use their capacity to the limit all 
the time. E.g. Terasort will use high CPU only during the sorting but not 
during the entire length of the job. But its containers will ask for enough CPU 
to run the sort in the desired time. This is typical application behavior 
where resource usage varies over time. So this feature is used to soak up the 
fallow resources in the cluster while tasks are not using their quoted capacity.

The arguments and assumptions we make need to be considered in the light of 
which of 1 or 2 is the common case and where this feature will be useful.

While it's useful to have configuration knobs, for a complex dynamic feature 
like this that is basically reacting to runtime observations, it may be quite 
hard to configure this statically by hand. While some limits, such as the max 
over-allocation limit, are easy and probably required to configure, we should 
look at making this feature work by itself instead of relying exclusively on 
configuration (hell :P) for users to make this feature usable.



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083974#comment-15083974
 ] 

Jason Lowe commented on YARN-1011:
--

bq. Tasks are incorrectly over-allocated. Will never use the resources they ask 
for and hence we can safely run additional opportunistic containers. So this 
feature is used to compensate for poorly configured applications. Probably a 
valid scenario but is it common?

In my experience this is fairly common.  Users tend to twiddle with config 
values until something is working then they don't bother to revisit until 
there's a problem.  And it's easier to over allocate than to spend the time to 
carefully tune the task size.  Even if the user is interested in tuning they 
can't always tune optimally.  Some examples are data skew or other 
task-specific issues where a few tasks need a lot of memory but the vast 
majority of the others do not.  Many frameworks only allow the task sizes to be 
configured as a group, so the user has to run all the tasks in the group with 
the worst-case container size even though most of them don't need it.  Pig on 
MapReduce is another example, where it will spawn multiple jobs but the user 
can only configure the memory settings once in the script and they apply to all 
jobs launched by the script.  Therefore the user has to set it to the 
worst-case size across all the script's jobs, and all but one of the jobs runs 
with oversized map containers.



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-05 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083880#comment-15083880
 ] 

Nathan Roberts commented on YARN-1011:
--

Very excited about this feature and agree that we should make this as simple as 
possible in the first go around. I have a couple of initial questions. 

bq. As soon as we realize the perf is slower because the node has higher usage 
than we had anticipated, we preempt the container and retry allocation 
(guaranteed or opportunistic depending on the new cluster state). So, it 
shouldn't run slower for longer than our monitoring interval. Is this 
assumption okay?

This seems hard. (See [~bikassaha]'s comment above.) 

All of this basically boils down to the fact that preempting a container means 
lost work, so the decision to preempt something shouldn't be taken lightly. For 
resources like memory we have to react quickly, and that's fine. But for things 
like CPU, I'm personally ok with latency on the order of single digit minutes 
so that natural container churn almost always avoids preemption.

Because of that complexity, I'm not 100% convinced that disfavoring 
OPPORTUNISTIC containers (e.g. low value for cpu_shares) is something that buys 
us very much. If we were oversubscribing 10X then I'd probably want it for 
sure, but if it's at most 2X capacity then worst case is a container only gets 
50% of the resource it had requested. Obviously for something like memory this 
has to be closely controlled because going over the physical capabilities of 
the machine has very significant consequences. But for CPU, I'd definitely be 
inclined to live with the occasional 50% worst case for all containers, in 
order to avoid the 1/1024th worst case for OPPORTUNISTIC containers on a busy 
node.
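
The trade-off between the two worst cases can be checked with a little arithmetic. The sketch below assumes every container is CPU-bound at the same time and that the kernel divides CPU strictly in proportion to each group's share weight; the function name and the weights are illustrative only.

```python
def cpu_fractions(shares):
    """Fraction of a fully contended CPU each group receives when time is
    divided in proportion to its share weight."""
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}


# Equal weights at 2X oversubscription: each side gets half of what it asked for.
equal = cpu_fractions({"guaranteed": 1024, "opportunistic": 1024})

# Heavily disfavored opportunistic container: roughly the 1/1024th case.
skewed = cpu_fractions({"guaranteed": 1024, "opportunistic": 1})
```

Under these assumptions the equal-weight case gives 0.5 to each side, while the skewed case leaves the opportunistic container with 1/1025 of the CPU for as long as the guaranteed container stays busy.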

So, hopefully we can make the policy quite configurable so that the amount of 
disfavoring can be tuned for various workloads.

bq. In practice, I expect admins to come up with a reasonable threshold for 
over-subscription: e.g. 0.8 - we oversubscribe only up to 80% of the capacity 
advertised through yarn.nodemanager.resource.*. Thinking more about this, this 
threshold should have an upper limit - 0.95? 
Can we make this per-resource? (80% memory, 120% CPU)?
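
A per-resource threshold could look something like the sketch below. The resource keys, the function name, and the way headroom is computed against actual utilization are all assumptions made for illustration; no such configuration is defined in this thread.

```python
def opportunistic_headroom(capacity, utilized, thresholds):
    """Room left for opportunistic containers, computed independently for
    each resource from its own threshold (never negative)."""
    return {
        resource: max(int(thresholds[resource] * capacity[resource])
                      - utilized[resource], 0)
        for resource in capacity
    }


headroom = opportunistic_headroom(
    capacity={"memory_mb": 8192, "vcores": 8},
    utilized={"memory_mb": 4096, "vcores": 6},
    thresholds={"memory_mb": 0.8, "vcores": 1.2},  # 80% memory, 120% CPU
)
```

With these numbers memory stays conservative while CPU is allowed to go past the advertised capacity.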




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-05 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083905#comment-15083905
 ] 

Wangda Tan commented on YARN-1011:
--

bq. So, hopefully we can make the policy quite configurable so that the amount 
of disfavoring can be tuned for various workloads.
I also +1 to make this policy configurable. This is why I filed YARN-4511.

Instead of a threshold on "current" instant usage, I think we can use a 
threshold on "record" usage. For example, only 50% of the *maximum-under-utilized* 
resource of the past 30 min for each NM can be used to allocate opportunistic 
containers. This could reduce the number of opportunistic containers we can 
allocate, but it could also reduce the risk that opportunistic containers affect 
other containers.
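
One way to read this suggestion is as a sliding window over utilization samples. The sketch below assumes the "record" usage is the peak utilization seen in the window, so the budget is a fraction of the slack that was left even at that peak. The class name, the sampling interface, and that interpretation are assumptions.

```python
from collections import deque


class UtilizationWindow:
    """Keeps recent utilization samples for one NM and derives a
    conservative opportunistic budget from the peak in the window."""

    def __init__(self, capacity, window_secs=1800, fraction=0.5):
        self.capacity = capacity
        self.window_secs = window_secs   # 30 minutes by default
        self.fraction = fraction         # 50% of the slack by default
        self.samples = deque()           # (timestamp, utilized) pairs

    def record(self, timestamp, utilized):
        self.samples.append((timestamp, utilized))
        # Drop samples that have fallen out of the window.
        while self.samples[0][0] < timestamp - self.window_secs:
            self.samples.popleft()

    def opportunistic_budget(self):
        if not self.samples:
            return 0  # no history yet, so do not over-allocate
        peak = max(utilized for _, utilized in self.samples)
        return max(self.fraction * (self.capacity - peak), 0)
```

A short spike keeps the budget low until it ages out of the window, which trades some opportunistic capacity for less risk to running containers.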

bq. Can we make this per-resource? (80% memory, 120% CPU)?
+1




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084543#comment-15084543
 ] 

Karthik Kambatla commented on YARN-1011:


Dropping a quick note here. (Traveling - will answer other comments tomorrow.)

bq. How do we determine that the perf is slower? The CPU would never exceed 
100% even under over-allocation.
This is one of the reasons I was proposing the notion of a max threshold which 
is less than 1 :) If the utilization goes to 100%, we clearly know there is 
contention. Since we measure resource utilization in resource-seconds (if not, 
we should update it), bursty spikes alone wouldn't take utilization over 100%. 
So, we shouldn't see a utilization greater than 100%.

bq. Can we make this per-resource? (80% memory, 120% CPU)?
I am open to per-resource configuration. That said, I am not too keen, 
especially if my above comment on utilization never going over 100% holds. 

bq. Tasks are incorrectly over-allocated. Will never use the resources they ask 
for and hence we can safely run additional opportunistic containers. So this 
feature is used to compensate for poorly configured applications. Probably a 
valid scenario but is it common?
It is quite common for folks to borrow a Hive or Pig script from a colleague. 




[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081950#comment-15081950
 ] 

Bikas Saha commented on YARN-1011:
--

In Tez we always try to allocate the most important work to the next allocated 
container. So doing opportunistic containers without providing the AM with the 
ability to know about it and use it judiciously may not be something that can 
be delayed to a second phase.

Being able to choose only guaranteed or non-guaranteed containers only covers 
half the problem (and probably the less relevant one) in which an application 
should always finish in 1min using guaranteed capacity but may sometimes finish 
in 30s because it got opportunistic containers. The other side is probably more 
important where a regression is caused due to opportunistic containers. 
1) the app got opportunistic containers and their perf wasn't the same as normal 
containers - so it ran slower. This may be mitigated by the system guaranteeing 
that only excess containers beyond guaranteed capacity would be opportunistic. 
This would require that the system upgrade opportunistic containers in the same 
order as it would allocate containers. However, things get complicated because 
a node with an opportunistic container may continue to run its normal 
containers while space frees up for guaranteed capacity on other nodes. At this 
point, which container becomes guaranteed - the new one on a free node or the 
opportunistic one that is already doing work? Which one should be preempted?
2) the app suffered because its guaranteed containers got slowed down due to 
competition from opportunistic containers. This needs strong support for lower 
priority resource consumption for opportunistic containers.

IMO, the NM cannot make a local choice about upgrading its opportunistic 
containers because this is effectively a resource allocation decision and only 
the RM has the info to do that. The NM does not know if this would exceed 
guaranteed capacity and in total, a bunch of NMs making this choice locally can 
lead to excessive over-allocation of guaranteed resources.





[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-04 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082305#comment-15082305
 ] 

Subru Krishnan commented on YARN-1011:
--

[~kasha], I had an offline discussion with [~curino] and [~chris.douglas] 
regarding auto promotion by NM. To be aligned with YARN-2877, we feel it will 
be good if NM can express its preference to the RM and let the RM make the 
decision as only it can ensure the global invariants based on the current state 
of the cluster. The preference can be based on whether the opportunistic 
container has been started or not, its resources have been localized or not, 
how long it has been running, how much progress it has made, etc.
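
As a rough illustration of the preference idea above, the sketch below ranks opportunistic containers by the listed signals. The types, field names, and the ordering of the signals are all assumptions made for the example; nothing here is an existing NM or RM API, and the RM would remain free to ignore the ranking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-container hint an NM could report to the RM.
final class PromotionHint {
    final String containerId;
    final boolean started;     // has the container process been launched?
    final boolean localized;   // have its resources been localized?
    final long runningMillis;  // how long it has been running
    final float progress;      // 0.0 to 1.0, as reported by the task

    PromotionHint(String containerId, boolean started, boolean localized,
                  long runningMillis, float progress) {
        this.containerId = containerId;
        this.started = started;
        this.localized = localized;
        this.runningMillis = runningMillis;
        this.progress = progress;
    }
}

final class PromotionPreference {
    /** Orders hints so the most preferred promotion candidate sorts first. */
    static int compare(PromotionHint a, PromotionHint b) {
        int c = Boolean.compare(b.started, a.started);
        if (c != 0) return c;
        c = Boolean.compare(b.localized, a.localized);
        if (c != 0) return c;
        c = Long.compare(b.runningMillis, a.runningMillis);
        if (c != 0) return c;
        return Float.compare(b.progress, a.progress);
    }

    /** Returns a sorted copy; the input list is left untouched. */
    static List<PromotionHint> rank(List<PromotionHint> hints) {
        List<PromotionHint> sorted = new ArrayList<>(hints);
        sorted.sort(PromotionPreference::compare);
        return sorted;
    }
}
```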



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15076750#comment-15076750
 ] 

Karthik Kambatla commented on YARN-1011:


Forgot to respond to one comment:

bq. when terminating opportunistic containers will the RM ask the AM about 
which containers to kill?
Don't think we should. NM --> RM --> AM --> NM is a long communication thread. 
Our preemption should kick in much faster than that. What do you think of 
preempting the last opportunistic container that was started, since it is 
likely the furthest away from promotion? 
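
A minimal sketch of that selection rule, using hypothetical types (RunningContainer and PreemptionPicker are not YARN classes): among the containers on a node, pick the opportunistic one with the latest start time.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical view of a container running on a node.
final class RunningContainer {
    final String id;
    final boolean opportunistic;
    final long startTimeMillis;

    RunningContainer(String id, boolean opportunistic, long startTimeMillis) {
        this.id = id;
        this.opportunistic = opportunistic;
        this.startTimeMillis = startTimeMillis;
    }
}

final class PreemptionPicker {
    /**
     * Returns the most recently started opportunistic container, or empty
     * if the node runs none. Guaranteed containers are never selected.
     */
    static Optional<RunningContainer> pickVictim(List<RunningContainer> containers) {
        return containers.stream()
            .filter(c -> c.opportunistic)
            .max(Comparator.comparingLong((RunningContainer c) -> c.startTimeMillis));
    }
}
```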





[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2016-01-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15076749#comment-15076749
 ] 

Karthik Kambatla commented on YARN-1011:


Thanks for chiming in, [~bikassaha]. 

bq. It is essential to run opportunistic tasks at lower OS cpu priority so that 
they never obstruct progress of normal tasks.
bq. In fact, this is the litmus test for opportunistic scheduling.
Good point. Guaranteed containers should get priority for resources: 
Opportunistic containers should only use left-over resources. We should do this 
for CPU, disk and network. I am not aware of the latest on disk and network 
isolation, but we should create sub-tasks for those too. /cc [~vvasudev] 

bq. Handling opportunistic tasks raises questions on the involvement of the AMs.
bq. In that sense it would be instructive to consider opportunistic scheduling 
in a similar light as preemption.

I wasn't sure the AM needs to know a container's execution type:

As you mention, this is very similar to preemption. From an AM's standpoint, 
the container would be preempted if those resources are not available to that 
application any more. In case of preemption, this can happen if other high 
priority queues have outstanding demand or the cluster lost a couple of nodes. 
Here, it is possible Guaranteed containers actually need the resources.  In 
that sense, the AM doesn't have to do anything different for Guaranteed vs 
Opportunistic containers.

Predictability: Allowing applications to specify only Guaranteed containers vs 
Guaranteed or Opportunistic containers should take care of this. However, 
between getting no resources and getting opportunistic resources, are there 
cases where the applications prefer the latter? The applications "should" get 
guaranteed containers at the same point in time irrespective of whether they 
use opportunistic resources in the interim. Note that allowing applications to 
specify whether they are okay with getting opportunistic containers complicates 
the scheduling - the scheduler needs to look through the higher priority apps 
that don't allow opportunistic containers before getting to those that do. 
And, when resources are available on that node, the RM will need to schedule 
containers for higher priority apps prolonging the duration for which 
opportunistic containers stay opportunistic. 

Given this complication, I would prefer we do not involve AMs in the 
decision-making process. Based on the need and usecases, we could revisit this 
at a later time. Note that YARN-4335 adds this to ResourceRequest for 
distributed scheduling, and even there they are not entirely sure if it needs 
to be a part. 

bq. does the AM need to know that a newly allocated container was 
opportunistic. E.g. so that it does not schedule the highest priority work on 
that container.
Valid concern. Maybe we should inform the AM whether a container is 
opportunistic, and later when it gets promoted to guaranteed. That said, I am 
not sure if this is essential to oversubscription being useful. Thoughts on 
punting it to Phase-2? 

bq. will opportunistic containers be given only for containers that are 
beyond queue capacity such that we don't break any guarantees on their 
liveliness. ie. an AM will not expect to lose any container that is within its 
queue capacity but opportunistic containers can be killed at any time.
Yes. This probably needs to be clear in the doc. Will update it. 

bq. will conversion of opportunistic containers to regular containers be 
automatically done by the RM? 
By some combination of RM/NM, definitely yes. Initially, I thought the RM can 
be the only one doing this. The RM could keep track of opportunistic containers 
in SchedulerNode. Today, we already track launchedContainers. The scheduler 
could go through this list and promote containers before allocating new 
containers. 
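
The RM-side variant described above could look roughly like the sketch below. All names are hypothetical, and it makes one extra assumption that is not in the proposal: promotion stops at the first container that does not fit, so the original allocation order is preserved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record of an opportunistic container tracked per node.
final class OpportunisticContainer {
    final String id;
    final long memoryMb;

    OpportunisticContainer(String id, long memoryMb) {
        this.id = id;
        this.memoryMb = memoryMb;
    }
}

final class PromoteBeforeAllocate {
    /**
     * Walks the opportunistic containers in the order they were allocated
     * and promotes each one that fits in the freed guaranteed capacity.
     * Stops at the first one that does not fit (an assumption of this
     * sketch). Returns the ids of the promoted containers.
     */
    static List<String> promote(List<OpportunisticContainer> inAllocationOrder,
                                long freeGuaranteedMb) {
        List<String> promoted = new ArrayList<>();
        long remaining = freeGuaranteedMb;
        for (OpportunisticContainer c : inAllocationOrder) {
            if (c.memoryMb > remaining) {
                break;
            }
            remaining -= c.memoryMb;
            promoted.add(c.id);
        }
        return promoted;
    }
}
```

Only capacity left over after this pass would go to new guaranteed allocations.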

Does this add an unnecessary delay in the promotion though? If the scheduler 
allocated opportunistic containers based on the same prioritization it uses for 
guaranteed containers, can the NM just promote the oldest opportunistic 
container running on that node and update the RM accordingly? 

Another thing to consider here: the promotion process here should work with 
that in YARN-2877. [~subru], [~kkaranasos], [~asuresh] - is it okay for the NM 
to automatically promote some opportunistic containers? Maybe we could add a 
flag to the launch context to differentiate between those opportunistic 
containers that can be automatically promoted and those that cannot. 


[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2015-12-27 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072107#comment-15072107
 ] 

Inigo Goiri commented on YARN-1011:
---

The doc looks good. I have a couple questions:
# What would be the first policy to implement? I guess we can define it in 
YARN-1015.
# Would it make sense to make over-subscription a global property set by the RM 
instead of per-node?
I think we need a sub-task under this umbrella for the over-subscription 
property.





[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2015-12-27 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072240#comment-15072240
 ] 

Karthik Kambatla commented on YARN-1011:


bq. For resource oversubscription enable/disable for individual nodes, I think 
it's very important since some nodes could be more important than others. Do 
you think it is fine to add a configuration item to each NM's yarn-site.xml?
That is exactly the intent. Let us continue this conversation on YARN-4512. 

bq. For scheduler-side implementation, instead of modifying individual 
schedulers, I think we should try to add the over-subscription policy to the common 
scheduling layer since it doesn't sound very related to any specific scheduler 
implementation.
Makes sense. Doubt there is any scheduler-specific smarts here. If at all we 
need to do them separately, it is most likely because our scheduler 
abstractions are not clean. 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2015-12-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072227#comment-15072227
 ] 

Wangda Tan commented on YARN-1011:
--

Thanks [~kasha] and also comments from [~elgoiri]. Looked at doc, it looks good.

Some questions/comments:
- For resource oversubscription enable/disable for individual nodes, I think 
it's very important since some nodes could be more important than others. Do 
you think it is fine to add a configuration item to each NM's yarn-site.xml?
- For scheduler-side implementation, instead of modifying individual schedulers, 
I think we should try to add the over-subscription policy to the common scheduling 
layer since it doesn't sound very related to any specific scheduler implementation.

I also agree that for the first implementation, we can simply assume nodes have more 
resources to use. CS shouldn't have an issue with this assumption.



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2015-12-27 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072223#comment-15072223
 ] 

Karthik Kambatla commented on YARN-1011:


bq. Would it make sense to make over-subscription a global property set by the 
RM instead of per-node?
Good question. I thought about it quite a bit. Here is my reasoning for doing it 
on the NM side. We can always switch back to defining it on the RM if that makes 
more sense.
# Even if we have the knob on the RM, the node still has to support it: monitor 
the resource usage on the node and kill the OPPORTUNISTIC containers if need 
be. On a cluster with NMs of different versions (say, during a rolling 
upgrade), the RM will have to keep track of NMs that support over-subscription. 
So, we do need some config for the NM anyway. Further, there could be 
node-specific conditions - hardware, other services running on the node etc. - 
that could affect the over-subscription capacity of the node. For instance, it 
might be okay to sign up for 90% of the advertised capacity on node A, but only 
80% on node B. And, this ability to soak up extra work could change over 
time. 
# In terms of implementation, the node already sends its capacity and its 
aggregate-container-utilization. It might as well send an 
oversubscription-percentage over, which is interpreted as the fraction of its 
advertised capacity. e.g. A node with 64 GB memory could advertise its capacity 
as 50 GB and oversubscription-percentage 0.9. The RM could schedule up to 45 GB 
of utilization. An oversubscription-percentage <= 0 would indicate the feature 
is turned off. 
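
The arithmetic in the second point can be sketched as follows. The class and method names are hypothetical; the only inputs taken from the comment are the advertised capacity, the oversubscription-percentage expressed as a fraction, and the rule that a value <= 0 means the feature is off.

```java
import java.util.OptionalDouble;

// Hypothetical helper; not an existing YARN class.
final class OversubscriptionLimit {
    /**
     * Utilization the RM may schedule up to on a node, or empty when
     * over-subscription is turned off (fraction <= 0).
     */
    static OptionalDouble utilizationLimitGb(double advertisedCapacityGb,
                                             double oversubscriptionFraction) {
        if (oversubscriptionFraction <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(advertisedCapacityGb * oversubscriptionFraction);
    }

    /** Room left for opportunistic containers at the current utilization. */
    static double headroomGb(double limitGb, double currentUtilizationGb) {
        return Math.max(0.0, limitGb - currentUtilizationGb);
    }
}
```

With the numbers above, an advertised capacity of 50 GB and a fraction of 0.9 give a limit of 45 GB. If the node currently reports 30 GB of aggregate container utilization, about 15 GB would remain for opportunistic containers.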

bq. What would be the first policy to implement? I guess we can define it in 
YARN-1015.
The simplest policy would likely be just assuming there are more resources on 
the node, and continue allocating with the same policies we use today for 
free/unallocated resources. 
This should work okay for the FairScheduler. I am less familiar with the 
intricate details of CS, but would think it should apply there as well. 
[~leftnoteasy] - thoughts? 



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2015-12-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072228#comment-15072228
 ] 

Wangda Tan commented on YARN-1011:
--

I just created YARN-4511 to track common scheduling policy for resource 
over-subscription.



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2015-12-27 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072279#comment-15072279
 ] 

Bikas Saha commented on YARN-1011:
--

In my prior experience, something like this is not practical without pro-active 
cpu management (which has been delegated to future work in the document). It is 
essential to run opportunistic tasks at lower OS cpu priority so that they 
never obstruct progress of normal tasks. Typically we will find that the 
machine is under-allocated the most in cpu usage since most processing has 
bursty cpu. When a normal task has a cpu burst then it should not have to contend 
with an opportunistic task since this will be detrimental to the expected 
performance of that task. Without this, jobs will not run predictably in the 
cluster. From what I have seen, users prefer predictability over most other 
things, i.e. having a 1 min job run in 1 min all the time vs making that job run 
in 30s 85% of the time but in 2 mins for 5% of the time, because that makes 
it really hard to establish SLAs. In fact, this is the litmus test for 
opportunistic scheduling. It should be able to raise the utilization of a 
cluster from (say 50%) to (say 75%) without affecting the latency of the jobs 
compared to when the cluster was running at 50%.

For memory, in fact, it's ok to share and reach 100% capacity but it's important 
to check that the machine does not start thrashing. Most well written tasks 
will run within their memory limits and start spilling etc. Opportunistic tasks 
are trying to occupy the memory that these tasks thought they could use but are 
not using (or that these tasks are keeping in buffer on purpose). The crucial 
thing to consider here is to look for stats that signify the onset of memory 
paging activity (or overall memory over-subscription at the OS level). At that 
point, even normal tasks that are within their limit will be adversely affected 
because the OS will start paging memory to disk. So we need to start 
proactively killing opportunistic tasks before such paging activity gets 
triggered.
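
A minimal sketch of such a guard is below. Everything in it is an assumption made for illustration: the class name, the two thresholds, and the choice of swap-out rate as the paging signal (on Linux this could be derived from the pswpout counter in /proc/vmstat, but the sketch just takes the rate as an input).

```java
// Hypothetical NM-side guard; names and thresholds are illustrative only.
final class PagingGuard {
    /**
     * Returns true when opportunistic containers should be killed:
     * either memory utilization has reached the safety threshold (so we
     * act before paging begins) or the OS is already swapping pages out
     * faster than the tolerated rate.
     */
    static boolean shouldKillOpportunistic(double memoryUtilization,
                                           long pagesSwappedOutPerSec,
                                           double memoryThreshold,
                                           long pagingThreshold) {
        return memoryUtilization >= memoryThreshold
            || pagesSwappedOutPerSec > pagingThreshold;
    }
}
```

The memory threshold gives the proactive behaviour asked for above; the paging check is only a backstop for the case where the onset was missed.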

Handling opportunistic tasks raises questions on the involvement of the AMs. 
Unless I missed something, this is not called out clearly in the doc. In that 
sense it would be instructive to consider opportunistic scheduling in a similar 
light as preemption. The app got a container that it should not have gotten at that 
time if we had been strict, but got it because we decided to loosen the strings 
(of queue capacity or machine capacity, respectively).
- will opportunistic containers be given only for containers that are 
beyond queue capacity such that we don't break any guarantees on their 
liveliness, i.e. an AM will not expect to lose any container that is within its 
queue capacity but opportunistic containers can be killed at any time.
- does the AM need to know that a newly allocated container was opportunistic. 
E.g. so that it does not schedule the highest priority work on that container. 
- will conversion of opportunistic containers to regular containers be 
automatically done by the RM? Will the RM notify the AM about such conversions?
- when terminating opportunistic containers will the RM ask the AM about which 
containers to kill? Given the above perf related scenarios this may not be a 
viable option.



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2015-12-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072302#comment-15072302
 ] 

Wangda Tan commented on YARN-1011:
--

Thanks,

bq. Makes sense. Doubt there is any scheduler-specific smarts here. If at all 
we need to do them separately, it is most likely because our scheduler 
abstractions are not clean.
Agree!



[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers

2015-12-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056483#comment-15056483
 ] 

Wangda Tan commented on YARN-1011:
--

Thanks [~kasha], count me in :)! I could help with reviewing/implementation.
