[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010301#comment-17010301 ] Brahma Reddy Battula commented on YARN-1011: [~haibochen], it looks like most of the JIRAs are closed. Any plan to merge to trunk? I am planning for the 3.3.0 release, so please let me know.

> [Umbrella] Schedule containers based on utilization of currently allocated containers
>
> Key: YARN-1011
> URL: https://issues.apache.org/jira/browse/YARN-1011
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Arun Murthy
> Assignee: Karthik Kambatla
> Priority: Major
> Attachments: patch-for-yarn-1011.patch, yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf, yarn-1011-design-v2.pdf, yarn-1011-design-v3.pdf
>
> Currently RM allocates containers and assumes resources allocated are utilized.
> RM can, and should, get to a point where it measures utilization of allocated containers and, if appropriate, allocate more (speculative?) containers.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626288#comment-16626288 ] Haibo Chen commented on YARN-1011: The tests look good locally. I have pushed my local branch upstream. Let me know if you see issues [~asuresh].
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626084#comment-16626084 ] Haibo Chen commented on YARN-1011: Sure. I am testing my local rebased branch. Will push it once the tests finish without failures.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624032#comment-16624032 ] Arun Suresh commented on YARN-1011: I was trying to rebase the branch with trunk and got a couple of merge conflicts, mostly with some FS* classes. [~haibochen], can you take a look?
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622767#comment-16622767 ] Haibo Chen commented on YARN-1011: I'm +1 on moving this.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622450#comment-16622450 ] Arun Suresh commented on YARN-1011: Planning on spending more cycles on this now. Looking at the sub-tasks, it looks like some of them are already committed - mostly the ones pertaining to ResourceUtilization plumbing and NM cgroups-based improvements. Would it be OK to move those into another umbrella JIRA?
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087794#comment-16087794 ] Wangda Tan commented on YARN-1011: Thanks [~kasha] / [~haibo.chen] for the proposal. I just left some comments in YARN-6808: https://issues.apache.org/jira/browse/YARN-6808?focusedCommentId=16087782&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16087782 They may be related to this feature as well, so could you please take a look when you have time? I think it might be better to move some JIRAs (e.g. YARN-1014/YARN-6674) to a new umbrella, "improve using opportunistic containers for normal use cases", as I suggested in YARN-6808. Thoughts?
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413696#comment-15413696 ] Karthik Kambatla commented on YARN-1011: [~anshul.pundir] - as I responded over email, we welcome all contributions. I am in the process of updating the design doc in consolidation with YARN-2883 so we can get the commits in. In the interim, I would encourage getting familiar with the contribution process by picking up some newbie or minor bugs.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409217#comment-15409217 ] Anshul Pundir commented on YARN-1011: Hi [~kasha], this feature is of considerable interest to the Hadoop team at my company. Wondering if I can collaborate with you on this and help get it in? -Anshul
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255967#comment-15255967 ] Karthik Kambatla commented on YARN-1011: The prototype I have been working on is here: https://github.com/kambatla/hadoop/commits/dev-1011-public. It is a little hacky, particularly on the NM side, and needs integration with YARN-2883 and other cgroup changes for a clean implementation. I have fixed the issues found in basic testing, and will be running some more involved tests in the next few weeks to identify any major design/implementation shortcomings. Will report back and attempt to answer any unanswered questions asked so far.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171244#comment-15171244 ] Bikas Saha commented on YARN-1011:
bq. If it absolutely wants a guaranteed container, we should allocate a guaranteed container and kill the opportunistic one. If it does not want, we can let the opportunistic container continue to run.
I get that. The question is: which of the two is the default behavior? Next question to consider: if we let the opportunistic container run and consider it part of guaranteed capacity, then what prevents the node from killing it when the node's resources actually get over-used and something needs to be killed by the node? And how does the rest of the system (YARN + app) react to losing a guaranteed container?
bq. The overall cluster utilization is an implementation detail, its sole purpose is to reduce the chances of running into cases that need cross-node promotion.
I am sorry, I could not understand how that is so.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171114#comment-15171114 ] Karthik Kambatla commented on YARN-1011:
bq. So we are going to add promotion notification to the AM RM protocol, right? By corollary, we would be adding a flag to the initial allocation that shows if it was guaranteed or opportunistic, right?
Yes and yes.
bq. Its very likely that an app may have a guaranteed and an opportunistic container. And when it gives up a guaranteed container then we will need to allocate another one to it. That may be the opportunistic container or a new one.
Yes, particularly when it has unmet demand outside of the opportunistic allocation. If it has no other demand, theoretically we should promote the opportunistic container. Promoting across nodes is not always desirable, so I feel we should let the app tell us what to do: if it absolutely wants a guaranteed container, we should allocate a guaranteed container and kill the opportunistic one; if not, we can let the opportunistic container continue to run.
bq. The relationship between overall cluster utilization and node over-allocation is not clear. Like you say, for an under allocated cluster, it would likely be easy to find a guaranteed container. So I am not sure if we should go ahead and make this tenuous link formal by adding it as a config in the code. Regardless of cluster allocation state, a node could be over-allocated.
The overall cluster utilization is an implementation detail; its sole purpose is to reduce the chances of running into cases that need cross-node promotion. Node over-allocation continues to be the primary configuration admins fiddle with.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157482#comment-15157482 ] Bikas Saha commented on YARN-1011:
2.3 The relationship between overall cluster utilization and node over-allocation is not clear. Like you say, for an under-allocated cluster it would likely be easy to find a guaranteed container. So I am not sure we should go ahead and make this tenuous link formal by adding it as a config in the code. Regardless of cluster allocation state, a node could be over-allocated.
3 So we are going to add promotion notification to the AM RM protocol, right? By corollary, we would be adding a flag to the initial allocation that shows whether it was guaranteed or opportunistic, right?
3.2 I agree that cross-node promotion is complex, but I am afraid it does not look like something that can be deferred, because that is likely not possible. It is very likely that an app may have both a guaranteed and an opportunistic container, and when it gives up a guaranteed container we will need to allocate another one to it; that may be the opportunistic container or a new one. So it is OK to defer any advanced stuff at this time, but at a minimum, for the sake of a complete logic definition, we will need some default behavior. The obvious default would be to ignore the opportunistic container and let it run until it finishes or is preempted (because the node becomes busy). This aligns with the philosophy of opportunistic allocation being a secondary scheduling loop. Perhaps this is what we had in mind; if so, my request is to call it out for the sake of completeness.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157253#comment-15157253 ] Karthik Kambatla commented on YARN-1011: [~elgoiri] - for work that we want to get into trunk/branch-2, I think we should file separate JIRAs so the rest of the community is aware of it. Filed YARN-4718 for the SchedulerNode changes (and assigned it to you), and YARN-4719 for the helper library. Let us discuss the details in the individual JIRAs.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157248#comment-15157248 ] Karthik Kambatla commented on YARN-1011:
bq. Could you clarify 2.2? What would trigger this if not nodeUpdate()?
We could just periodically go through all the nodes and allocate containers. This is very similar to continuous/asynchronous scheduling in the Fair/Capacity schedulers.
bq. 2.3. Whats the reasoning behind this? Over-allocating a node seems to be a local decision based on the nodes expected and actual utilization. So I would expect the logic to be something similar to 1) Node is already 100% allocated 2) Actual utilization is < 80% 3) Over-allocate to bring actual utilization ~=80%.
There is a node config that determines whether the node allows oversubscription and by how much. That said, the RM still has to decide when/where to allocate opportunistic containers. When the overall cluster utilization is low, it is highly likely the RM would find a guaranteed container soon after it allocates an opportunistic container for a ResourceRequest; by waiting for this utilization to be over a threshold, we avoid having to promote containers right after allocating them. This shouldn't be a problem in practice, because we expect over-allocation to help improve utilization on a fully-allocated cluster.
{quote}
3. What is the AM/RM interaction in this promotion?
3.2. Not clear what is actually happening here? Will new container be allocated and the opportunistic container allowed to continue till is exits or is preempted?
{quote}
We don't know yet. :) Promoting a container on the same node is fairly straightforward: the node just promotes the container, and the AM can be informed that a running container has been promoted, should it want to differentiate between opportunistic and guaranteed containers. I am not actively thinking about promotion across nodes. Given the additional complexity, I feel we should see some numbers before going further; the rest of the work is required anyway.
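The two gates described in the comment above - a per-node oversubscription config and a cluster-wide utilization threshold - could be sketched as follows. This is a hedged illustration only; the class and method names are assumptions, not existing YARN APIs, and resources are modeled as plain doubles instead of YARN's Resource type.

```java
// Illustrative sketch of the two over-allocation gates discussed above.
// All names here are hypothetical; resources are simplified to doubles.
public class OverAllocationPolicy {
    private final double nodeOversubscriptionRatio;   // per-node config, e.g. 1.5
    private final double clusterUtilizationThreshold; // e.g. 0.8 (80%)

    public OverAllocationPolicy(double nodeRatio, double clusterThreshold) {
        this.nodeOversubscriptionRatio = nodeRatio;
        this.clusterUtilizationThreshold = clusterThreshold;
    }

    /** Opportunistic headroom left on a node, given its capacity and actual utilization. */
    public double opportunisticHeadroom(double nodeCapacity, double nodeUtilization) {
        return Math.max(0.0, nodeCapacity * nodeOversubscriptionRatio - nodeUtilization);
    }

    /** Gate opportunistic allocation on overall cluster allocation being high enough. */
    public boolean shouldAllocateOpportunistic(double clusterAllocated, double clusterCapacity) {
        return clusterAllocated / clusterCapacity >= clusterUtilizationThreshold;
    }
}
```

With a ratio of 1.5, a fully allocated 100-unit node that is only 60% utilized would still have 90 units of opportunistic headroom, but the RM would only use it once the cluster as a whole is above the 80% allocation threshold.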
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155867#comment-15155867 ] Bikas Saha commented on YARN-1011:
2.3. What's the reasoning behind this? Over-allocating a node seems to be a local decision based on the node's expected and actual utilization, so I would expect the logic to be something like: 1) the node is already 100% allocated, 2) actual utilization is < 80%, 3) over-allocate to bring actual utilization to ~80%.
3. What is the AM/RM interaction in this promotion?
3.2. Not clear what is actually happening here. Will a new container be allocated and the opportunistic container allowed to continue till it exits or is preempted? How does all this interact with preemption?
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155829#comment-15155829 ] Inigo Goiri commented on YARN-1011: [~kasha], I think those can apply easily to {{CapacityScheduler}}. To get started, I would take a first stab in YARN-4511 with:
# Get the oversubscription ratio into {{SchedulerNode}}
# Add methods {{getUsedUtilizationResource()}} and {{getAvailableUtilizationResource()}} for {{containersUtilization}} (which is already available from YARN-3980 and would also use the oversubscription)
# Rename {{getUsedResource()}} to {{getUsedAllocationResource()}}, {{getAvailableResource()}} to {{getAvailableAllocationResource()}}, and {{getTotalResource()}} to {{getTotalAllocationResource()}} (I don't think these are external APIs, so it should be fine to change the names)
For your 2, 3, and 4, we can add them to {{FairScheduler}} in YARN-1015 as a first prototype and then move as much as possible to YARN-4511.
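The allocation/utilization accessor split proposed above could look roughly like the following sketch. It is an assumption-laden illustration: resources are simplified to longs instead of YARN's Resource type, and the class is not the real {{SchedulerNode}}.

```java
// Hypothetical sketch of the renamed SchedulerNode accessors proposed above,
// separating allocation-based accounting from utilization-based accounting.
public class SketchSchedulerNode {
    private final long totalAllocationResource;  // node capacity
    private final double oversubscriptionRatio;  // e.g. 1.5
    private long usedAllocationResource;         // allocated to containers
    private long usedUtilizationResource;        // actually utilized (cf. YARN-3980)

    public SketchSchedulerNode(long total, double oversubscriptionRatio) {
        this.totalAllocationResource = total;
        this.oversubscriptionRatio = oversubscriptionRatio;
    }

    public void allocate(long r) { usedAllocationResource += r; }
    public void reportUtilization(long r) { usedUtilizationResource = r; }

    // Allocation-based view (today's getUsedResource()/getAvailableResource()).
    public long getUsedAllocationResource() { return usedAllocationResource; }
    public long getAvailableAllocationResource() {
        return totalAllocationResource - usedAllocationResource;
    }
    public long getTotalAllocationResource() { return totalAllocationResource; }

    // Utilization-based view; headroom accounts for the oversubscription ratio.
    public long getUsedUtilizationResource() { return usedUtilizationResource; }
    public long getAvailableUtilizationResource() {
        return (long) (totalAllocationResource * oversubscriptionRatio)
            - usedUtilizationResource;
    }
}
```

The point of the split is visible in the headroom: a node can be fully allocated (zero allocation headroom) while still exposing utilization headroom for opportunistic containers.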
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155822#comment-15155822 ] Inigo Goiri commented on YARN-1011: Could you clarify 2.2? What would trigger this if not {{nodeUpdate()}}?
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155792#comment-15155792 ] Karthik Kambatla commented on YARN-1011: Looked more closely into how we would implement this for the FairScheduler. [~wangda], [~elgoiri] - could you please check whether this is similar to what is needed in the CapacityScheduler? I would like us to make changes in the common parts as much as possible, so the schedulers don't diverge further. While at it, there might be value in moving more of the common logic to common data structures.
# SchedulerNode needs to track utilized/unutilized resources in addition to allocated (called used today) and unallocated (called available today). It might make sense to update the names of the variables that exist today in trunk itself.
# It would be handy to have a single class (maybe NodeTracker or NodeListManager) that tracks properties of all nodes and exposes some convenient APIs: returning nodes matching a particular condition (filter), or returning a sorted list of NodeIds based on a custom comparator. We need the latter to iterate through nodes in the right order for continuous/asynchronous scheduling and the newly added opportunistic scheduling; the former could be used for locality and label matching in the future.
# SchedulerApplicationAttempt should expose unmetDemandGuaranteed and unmetDemandOpportunistic to be used by the two schedulers.
# There will be two queue hierarchies - one each for guaranteed and opportunistic allocation. This translates to adding a second list of child queues/apps in each queue. We could build on what we have in each scheduler, or reconcile some of this code into a common queue.
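The NodeTracker idea in point 2 above could be sketched as below. {{NodeInfo}} and all method names here are illustrative assumptions for this discussion, not existing YARN classes.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch of a NodeTracker: one class tracking per-node state,
// exposing a filter view and a custom-sorted view of node ids.
public class NodeTracker {
    public static class NodeInfo {
        final String nodeId;
        final long unusedResource; // e.g. unallocated memory in MB

        public NodeInfo(String nodeId, long unusedResource) {
            this.nodeId = nodeId;
            this.unusedResource = unusedResource;
        }
    }

    private final Map<String, NodeInfo> nodes = new HashMap<>();

    public void addOrUpdate(NodeInfo node) { nodes.put(node.nodeId, node); }

    /** Nodes matching a condition, e.g. for locality or label matching. */
    public List<NodeInfo> filter(Predicate<NodeInfo> condition) {
        return nodes.values().stream().filter(condition).collect(Collectors.toList());
    }

    /** Node ids in comparator order, e.g. most unused resources first. */
    public List<String> sortedNodeIds(Comparator<NodeInfo> comparator) {
        return nodes.values().stream().sorted(comparator)
            .map(n -> n.nodeId).collect(Collectors.toList());
    }
}
```

An opportunistic-scheduling loop would then ask for `sortedNodeIds` ordered by unused resources, while locality matching would use `filter`.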
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155692#comment-15155692 ] Karthik Kambatla commented on YARN-1011:
Had offline discussions with [~jlowe], [~nroberts], [~elgoiri], [~kkaranasos] and [~asuresh]. Take-aways:
# To ensure guaranteed containers continue to be allocated exactly as they are today, we leave that scheduling logic as is.
# A "second scheduler" is responsible for allocating opportunistic containers.
## This "second scheduler" could be another method called during node update, or a separate thread that runs asynchronously.
## Using an asynchronous thread lets us process nodes in order of unused resources instead of node-heartbeat order.
## Opportunistic scheduling could trigger only after cluster allocation exceeds a threshold - initially, we could hard-code it to 80% of cluster capacity.
# When the scheduler comes around to allocating a guaranteed container for a previously allocated opportunistic container, that container is promoted.
## Promotion on the same node is straightforward and always desirable.
## Promotion across nodes is more complicated and leads to resource wastage. However, not promoting could mean an application gets resources later than it would have with oversubscription turned off. Accordingly, we could have a policy to enable/disable cross-node promotion; to begin with, it would be disabled, and we could add the option of enabling it in the future.
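The 80% trigger suggested above is easy to sketch. A minimal, hypothetical helper (the function and parameter names are illustrative, not from any YARN patch) that the asynchronous opportunistic scheduler could consult before each pass:

```python
def should_schedule_opportunistic(cluster_allocated, cluster_capacity,
                                  trigger_threshold=0.8):
    """Over-allocate only once the cluster is mostly allocated.

    cluster_allocated and cluster_capacity are in the same units
    (e.g. memory in MB); 0.8 mirrors the hard-coded 80% suggested above.
    """
    if cluster_capacity <= 0:
        return False
    return cluster_allocated / cluster_capacity >= trigger_threshold
```

For example, a cluster with 85 of 100 units allocated would trigger opportunistic scheduling, while one at 50 of 100 would not.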
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124489#comment-15124489 ] Inigo Goiri commented on YARN-1011:
The second scheduling loop makes sense. I'd like the design doc to be updated with this new approach, along with a couple of examples of how containers would be started. I think the next step would be to start YARN-4511 and maybe create a new JIRA for the over-allocation scheduling in the NM. After that, we could try to implement the scheduling approach in YARN-1013 or YARN-1015. We are still missing the interface to mark containers/applications as supporting opportunistic execution; is this being tracked in any other JIRA?
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119825#comment-15119825 ] Karthik Kambatla commented on YARN-1011:
[~nroberts] - interesting ideas. The notion of two schedulers definitely frees us from the confines we have today. How about doing the following on node update (along the lines of the two-scheduler suggestion):
{code}
process_newly_launched_containers();
process_completed_containers();
while (guaranteed_resources_on_node_are_still_available) {
  app = pickApp();
  if (app.hasOpportunisticContainersRunningOnTheNode()) {
    promote_container_to_guaranteed();
  } else {
    allocate_guaranteed_container_as_we_do_today(); // includes reservation
  }
}
while (opportunistic_resources_on_node_are_still_available) {
  app = pickAppOkayWithOpportunisticContainers();
  allocate_opportunistic_container(); // reservations are not allowed
}
{code}
This way:
# Apps that can't tolerate opportunistic containers can state their preference for only guaranteed containers.
# For apps that are okay with opportunistic containers, a container is promoted only when the app deserves guaranteed containers.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119830#comment-15119830 ] Karthik Kambatla commented on YARN-1011:
BTW, sorry for the delay in circling back here. Was out of the country last week.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107009#comment-15107009 ] Nathan Roberts commented on YARN-1011:
bq. Welcome any thoughts/suggestions on handling promotion if we allow applications to ask for only guaranteed containers. I'll continue brainstorming. We want to have a simple mechanism, if possible; complex protocols seem to find a way to hoard bugs.
I agree that we want something simple, and this probably doesn't qualify, but below are some thoughts anyway. This seems like a difficult problem; maybe a webex would make sense at some point to go over the design and work through some of these issues.
Maybe we need to run two schedulers, conceptually anyway. One of them is exactly what we have today; call it the "GUARANTEED" scheduler. The second one is responsible for the "OPPORTUNISTIC" space. What I like about this sort of approach is that we aren't changing the way the GUARANTEED scheduler does things. It assigns containers in the same order as it always has, regardless of whether opportunistic containers are being allocated in the background. By having separate schedulers, we're not perturbing the way user limits, capacity limits, reservations, preemption, and other scheduler-specific fairness algorithms deal with opportunistic capacity (I'm concerned we'll have lots of bugs in this area). The only difference is that the OPPORTUNISTIC side might already be running a container when the GUARANTEED scheduler gets around to the same piece of work (the promotion problem). What I don't like is that it's obviously not simple.
- The OPPORTUNISTIC scheduler could behave very differently from the GUARANTEED scheduler (e.g. it could consider only applications in certain queues, heavily favor applications with quick-running containers, randomly select applications to fairly use OPPORTUNISTIC space, ignore reservations, ignore user limits, work extra hard to get good container locality, etc.).
- When the OPPORTUNISTIC scheduler launches a container, it modifies the ask to indicate this portion has been launched opportunistically; the size of the ask does not change (this means the application needs to be aware that it is launching an OPPORTUNISTIC container).
- As Bikas already mentioned, we have to promote opportunistic containers, even if it means shooting an opportunistic one and launching a guaranteed one somewhere else.
- If the GUARANTEED scheduler decides to assign a container y to a portion of an ask that has already been opportunistically launched as container x, the AM is asked to migrate container x to container y. If x and y are on the same host, great - the AM asks the NM to convert x to y (mostly bookkeeping); if not, the AM kills x and launches y. We probably need a new state to track the migration.
- Maybe locality would make the killing of opportunistic containers a rare event? If both schedulers are working hard to get locality (e.g. YARN-80 gets us to about 80% node-local), then the GUARANTEED scheduler would usually pick the same nodes as the OPPORTUNISTIC scheduler, resulting in very simple container conversions with no lost work.
- I don't see how we can get away from occasionally shooting an opportunistic container so that a guaranteed one can run somewhere else. Given that we want opportunistic space to be used for both SLA and non-SLA work, we can't wait around for a low-priority opportunistic container on a busy node. Ideally the OPPORTUNISTIC scheduler would be good at picking containers that almost never get shot.
- When the GUARANTEED scheduler assigns a container to a node, the over-allocation thresholds could be violated; in this case OPPORTUNISTIC containers on the node need to be shot. It would be good if this didn't happen when a simple conversion was going to occur anyway.
Given the complexities of this problem, we're going to experiment with a simpler approach of over-allocating up to 2-3x on memory, with the NM shooting containers (preemptable ones first) when resources are dangerously low. The over-allocation will be dynamic, based on current node usage (when the node is idle, no over-allocation; basically, there has to be some evidence that over-allocating will be successful before we actually over-allocate). This type of approach might not satisfy all use cases, but it might turn out to be very simple and mostly effective. We'll report back on how it works out.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098511#comment-15098511 ] Karthik Kambatla commented on YARN-1011:
[~nroberts], [~leftnoteasy] - reasonable concerns. I am looking into allowing an app to ask only for guaranteed containers. Scheduling will likely remain simple: in our loop, we just skip an application if it is not interested in opportunistic containers. Promotion, though, becomes tricky: we should hold off on promoting a container until all higher-"priority" applications that want only guaranteed containers get them. Welcome any thoughts/suggestions on handling promotion if we allow applications to ask for only guaranteed containers. I'll continue brainstorming. We want to have a simple mechanism, if possible; complex protocols seem to find a way to hoard bugs.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101106#comment-15101106 ] Wangda Tan commented on YARN-1011:
bq. Welcome any thoughts/suggestions on handling promotion if we allow applications to ask for only guaranteed containers. I'll continue brainstorming. We want to have a simple mechanism, if possible; complex protocols seem to find a way to hoard bugs.
Agree :)
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096938#comment-15096938 ] Nathan Roberts commented on YARN-1011:
Thanks for the update [~kasha]. I have a few questions, but I'll start at a high level since the answers may clear up others.
bq. On SchedulerNode#allocateContainer, mark the container OPPORTUNISTIC if allocating this container would take the allocation over the advertised capacity. And, keep track of opportunistic containers in a list separate from launchedContainers.
It sounds like any container from any queue (assuming CapacityScheduler here, but similar constructs exist in other schedulers) could be marked OPPORTUNISTIC. Does this mean a queue with SLA-sensitive jobs could get OPPORTUNISTIC containers? I just don't see how these jobs will be able to meet their SLAs given the uncertainty about the resources they'll actually get and the likelihood they'll get shot by the NM. This is precisely why there is the ability to designate queues as preemptable - SLA-sensitive jobs just don't do well in preemptable queues.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097009#comment-15097009 ] Inigo Goiri commented on YARN-1011:
[~kasha], can you add some examples to the doc for clarification? For example, say a node has 10 CPUs, yarn.nodemanager.overallocation.allocation-threshold is OT=0.5, and yarn.nodemanager.overallocation.preemption-threshold is PT=1.0. In the regular case, we have 2 containers with an allocation of 5 CPUs each, and each uses only 2.5 CPUs. With this proposal, we could add a third container with an allocation of 5 CPUs which also uses 2.5 CPUs. When would we preempt? How would this evolve when the utilization changes? How would it promote?
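This example can be made concrete under one plausible reading of the two thresholds. The config names come from the comment above; the semantics assumed here - over-allocate while utilization is below OT * capacity, preempt OPPORTUNISTIC containers once it reaches PT * capacity - are my interpretation, not settled design:

```python
def node_action(capacity, utilization, ot=0.5, pt=1.0):
    """Decide what the NM/RM pair would do for one node, one resource.

    ot: assumed meaning of the overallocation allocation-threshold
    pt: assumed meaning of the overallocation preemption-threshold
    (both as fractions of node capacity)
    """
    if utilization >= pt * capacity:
        return "PREEMPT_OPPORTUNISTIC"  # shed load: kill opportunistic containers
    if utilization < ot * capacity:
        return "CAN_OVERALLOCATE"       # room to place opportunistic containers
    return "HOLD"                       # neither over-allocate nor preempt
```

In the 10-CPU example: at 4.0 CPUs used the node could still be over-allocated; after the third container lands (7.5 CPUs used) it holds; preemption starts only when utilization reaches the full 10 CPUs.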
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096987#comment-15096987 ] Karthik Kambatla commented on YARN-1011:
bq. Does this mean a queue with SLA sensitive jobs could get OPPORTUNISTIC containers?
You bring up a good point. Let us consider an example: an app requests a container at time 0 that would run for 30 seconds. Today, let us say, the app would have gotten the container 10s from submission. With opportunistic scheduling, let us say we are able to schedule an OPPORTUNISTIC container at 5s. Now, if we hit resource contention within the next 5 seconds, we would just preempt this OPPORTUNISTIC container, and it would get scheduled at 10s as it would have otherwise. If the contention were to surface at 15s from submission, we would have to preempt and re-allocate, losing 5s compared to what would have happened today. I was hoping the ability to tune allocation and preemption thresholds would greatly reduce the likelihood of preemptions, but we can't rely on that for SLA jobs. I am wary of allowing applications to ask for only GUARANTEED containers, as that would complicate the scheduler significantly. Since applications are notified of the ExecutionType, is it okay to leave it up to them to reject OPPORTUNISTIC containers? We could add this to AMRMClient and MapReduce.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097690#comment-15097690 ] Wangda Tan commented on YARN-1011:
Thanks for updating the docs, [~kasha]. Most of the design doc makes sense to me. I have the same concern regarding the SLA of opportunistic containers: I'm not very familiar with CPU scheduling in cgroups. Do you know what will happen in the following example? Let's say a node (strict resource usage disabled) has 60% CPU utilization, consumed by 10 processes, each with CPU share = 1024. Now a new process arrives with CPU share = 1 that wants 100% of the CPU. Will the system give the 40% idle CPU to the new process, or can the new process get only a small proportion of the remaining idle resources? If the process can fully leverage idle resources regardless of CPU share, I'm fine with setting the lowest CPU share for opportunistic containers.
bq. Won't the scheduler just try to assign it to that application again? Seems like they'll fight one another. "Here's a container".."No, I don't like this container".."No really, you should like this container, take it."
I think this is a valid concern. [~kasha], do you think the scheduler should be smart enough to avoid allocating opportunistic containers to an app that doesn't want them?
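For what it's worth, my understanding of Linux CFS is that cpu.shares only arbitrates under contention; idle cycles go to any runnable task regardless of its weight. A toy progressive-filling simulation (not cgroup code - just an illustration of that property for the example above) suggests the share-1 process would indeed pick up the idle 40%:

```python
def cfs_allocate(capacity, tasks):
    """Progressive filling: each round, split the remaining CPU in proportion
    to shares, but never give a task more than it still demands.

    tasks: list of (shares, demand) pairs; returns per-task allocation."""
    alloc = [0.0] * len(tasks)
    active = [i for i in range(len(tasks)) if tasks[i][1] > 0]
    remaining = capacity
    while remaining > 1e-9 and active:
        total_shares = sum(tasks[i][0] for i in active)
        granted = 0.0
        still_active = []
        for i in active:
            fair = remaining * tasks[i][0] / total_shares
            give = min(fair, tasks[i][1] - alloc[i])
            alloc[i] += give
            granted += give
            if alloc[i] < tasks[i][1] - 1e-12:
                still_active.append(i)
        if granted < 1e-12:
            break
        remaining -= granted
        active = still_active
    return alloc

# Ten share-1024 processes together want 60% of the CPU; a new share-1
# process wants 100%. The low-share process ends up with the idle 40%.
allocation = cfs_allocate(1.0, [(1024, 0.06)] * 10 + [(1, 1.0)])
```

The simulation only models weights-under-contention; real CFS behavior on a specific kernel would of course need measurement.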
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097040#comment-15097040 ] Nathan Roberts commented on YARN-1011:
Won't the scheduler just try to assign it to that application again? Seems like they'll fight one another. "Here's a container".."No, I don't like this container".."No really, you should like this container, take it."
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085677#comment-15085677 ] Nathan Roberts commented on YARN-1011:
bq. This is one of the reasons I was proposing the notion of a max threshold which is less than 1. If the utilization goes to 100%, we clearly know there is contention. Since we measure resource utilization in resource-seconds (if not, we should update it), bursty spikes alone wouldn't take utilization over 100%. So, we shouldn't see a utilization greater than 100%.
Just to make sure I understand: when you say max threshold < 1, are you saying an NM could not advertise 48 vcores if there are only 24 vcores physically available? I think we have to support going above 1.0. We already go above 1.0 on our clusters, even without this feature. What I'm thinking this feature will allow us to do is go significantly above 1.0, especially for resources like memory, where we have to be much more careful about not hitting 100%.
One use case that I'm really hoping this feature can support is a batch cluster (loose SLAs) with very high utilization. For this use case, I'd like the following to be true:
- Nodes can be at 100% CPU, 100% network, or 100% disk for long periods of time (several minutes). Memory could get to something like 80% before corrective action would be required. During these periods, no containers get shot to shed load. Nodemanagers might reduce the available resources they advertise to the RM, but nothing would need to be killed.
- Both GUARANTEED and OPPORTUNISTIC containers get their fair share of resources. They're both drawing from the same capacity and user-limit from the RM's point of view, so I feel like they should be given their fair set of resources on the nodes they execute on. The real point of being designated OPPORTUNISTIC in this use case is that the NM knows which containers to kill when it needs to shed load.
Another use case is where you have a mixture of jobs, some with tight SLAs, some with looser SLAs. This one is mentioned in previous comments and is also very important. It requires a different set of thresholds and a different level of fairness controls. So I just think things have to be configurable enough to handle both types of clusters.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085738#comment-15085738 ] Karthik Kambatla commented on YARN-1011:
bq. Just to make sure I understand. When you say max threshold < 1 are you saying an NM could not advertise 48 vcores if there are only 24 vcores physically available?
You can continue to advertise more vcores. Consider a cluster with nodes of 1 physical core. Let us say each node advertises 10 *vcores*. Today, let us say your CPU utilization under these settings is 50% running 10 containers. All these containers in this context would be GUARANTEED containers. I am proposing we set a max threshold for the RM over-allocating containers to 95%. This essentially means the RM allocates OPPORTUNISTIC containers on this node (that has been previously fully allocated) until we hit the utilization threshold of 95% - say, running 19 containers. At this point, if one container's usage goes higher, taking us beyond 95%, we kill enough OPPORTUNISTIC containers to bring this under 95%. Maybe the max allowed threshold could be higher - 99%. I am wary of setting it to 100% unless we have some other way of differentiating "running comfortably at 100%" from "contention at 100%", because both look the same. Also, I am assuming people would be very happy with 95% utilization if we achieve that :)
bq. nodes can be at 100% CPU, 100% Network, or 100% Disk for long periods of time (several minutes). Memory could get to something like 80% before corrective action would be required.
I am beginning to see the need for different thresholds for different resources. While I wouldn't necessarily shoot for 100%, I can see someone configuring 95% CPU, 85% network (as this could spike significantly with shuffle etc.), 90% disk, and 80% memory. And we would stop over-allocating the moment we hit *any one* of these thresholds.
Should we keep it simple to begin with and have one config, adding other configs in the future? Or do you think the config-per-resource should be there from the get-go?
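The "stop the moment any one threshold is hit" rule is straightforward to express. A hypothetical per-resource check (the threshold values are just the ones floated above, not defaults from any patch):

```python
def can_overallocate(utilization, thresholds):
    """Both maps go from resource name to a fraction of node capacity.

    Over-allocation stops as soon as ANY resource crosses its threshold."""
    return all(utilization[r] < thresholds[r] for r in thresholds)

# Example thresholds from the discussion above (illustrative only).
thresholds = {"cpu": 0.95, "network": 0.85, "disk": 0.90, "memory": 0.80}
```

A node at 50% CPU but 90% network would not be over-allocated further, even though every other resource has headroom.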
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085840#comment-15085840 ] Karthik Kambatla commented on YARN-1011:
bq. Let's say job capacity is 1 container and the job asks for 2. It gets 1 normal container and 1 opportunistic container. Now it releases its 1 normal container. At this point, what happens to the opportunistic container? It is clearly running at lower priority on the node, and as such we are not giving the job its guaranteed capacity.
Momentarily, yes. The RM/NM ensemble (let us discuss that separately) realizes this and adjusts by promoting the opportunistic container. Is this different from what happens today? Today, the job is allocated one container since that is its capacity. Once that is done, it allocates another. Between the first one finishing and the second one launching, we are not giving the job its guaranteed capacity.
bq. The question is not about finding an optimal solution for this problem (and there may not be one). The issue here is to crisply define the semantics around scheduling in the design. Whatever the semantics are, we should clearly know what they are. IMO, the exact semantics of scheduling should be in the docs.
Agree. I'll add something to the design doc once we capture everyone's concerns/suggestions here on JIRA, and maybe we could iterate.
bq. Because of that complexity, I'm not 100% convinced that disfavoring OPPORTUNISTIC containers (e.g. low value for cpu_shares) is something that buys us very much.
I don't necessarily see it as disfavoring OPPORTUNISTIC containers. Without over-allocation, these containers wouldn't even have started. While we are optimizing for utilization and throughput, we are just making sure we don't adversely affect containers that were launched earlier with promises of isolation. The low value of cpu_shares only kicks in when the node is highly contended, and is intended to be a fail-safe. As long as there are free resources (which I believe is the most common case), these OPPORTUNISTIC containers should get a sizeable CPU share, no?
bq. So, hopefully we can make the policy quite configurable so that the amount of disfavoring can be tuned for various workloads.
I agree that we might eventually need a configurable policy, but making the policy configurable might not be straightforward. I am definitely open to input on simple ways of doing this. Also, it is hard to comment on the effectiveness of a simple-but-not-so-configurable policy without implementing it and running sample workloads against it. The simple policy I had in mind was:
# Update SchedulerNode#getAvailable to include resources that could be opportunistically allocated, i.e., max(what_it_says_today, threshold * resource). It should be easy to support per-resource thresholds here.
# At allocate time, label an allocation OPPORTUNISTIC if it takes the cumulative allocation over the advertised capacity.
# When space frees up on nodes, NMs send candidate containers for promotion on the heartbeat. The RM consults a policy to come up with a yes/no decision for each of these candidates. Initially, I would like the default to be yes without any reconsideration. This favors continuing the execution of a container over preempting it.
Based on what we see, we could tweak this simple policy or come up with more sophisticated policies.
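The first two steps of the simple policy above can be sketched as follows. This is a rough illustration, not the actual SchedulerNode code; reading "threshold * resource" as over-allocation headroom net of current utilization is my assumption, and all names are made up for the example:

```python
def available_for_scheduling(capacity, allocated, utilized, threshold):
    # Step 1: what the node reports today (capacity - allocated), or the
    # assumed over-allocation headroom (threshold * capacity - utilized),
    # whichever is larger.
    return max(capacity - allocated, threshold * capacity - utilized)

def execution_type(capacity, allocated, request):
    # Step 2: label the allocation OPPORTUNISTIC if it takes the cumulative
    # allocation over the advertised capacity.
    return "OPPORTUNISTIC" if allocated + request > capacity else "GUARANTEED"
```

For example, a fully allocated 100-unit node with utilization 40 and threshold 0.95 would still report 55 units available, and a 10-unit allocation placed there would be labeled OPPORTUNISTIC.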
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085868#comment-15085868 ] Karthik Kambatla commented on YARN-1011:
We might be better off calling this overallocation instead of oversubscription, as the latter could be mistaken for oversubscription through the yarn.nodemanager.resource.* configs. I'll go ahead and use overallocation in patches like the one for YARN-4512, unless someone expresses reservations here or on YARN-4512.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085872#comment-15085872 ] Karthik Kambatla commented on YARN-1011: BTW, if we agree on the simple policy, I believe we should be able to pull off a scheduler-agnostic implementation.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086326#comment-15086326 ] Bikas Saha commented on YARN-1011: -- Some of what I am saying emanates from prior experience with a different Hadoop-like system. You can read more about it here: http://research.microsoft.com/pubs/232978/osdi14-paper-boutin.pdf
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086325#comment-15086325 ] Bikas Saha commented on YARN-1011: -- bq. At this point what happens to the opportunistic container. It is clearly running at lower priority on the node and as such we are not giving the job its guaranteed capacity. Yes. The difference is that the opportunistic container may not be convertible into a normal container because that node is still over-allocated. So at this point, what should be done? Should this container be terminated and run somewhere else as normal (because capacity is now available)? Should some other container be preempted on this node to make this container normal? Should the RM allocate a normal container and give it to the app in addition to the running opportunistic container, in case the app can do the transfer internally? Also, with this feature in place, should we run all containers beyond guaranteed capacity as opportunistic containers? This would ensure that any excess containers that we give to a job will not affect performance of the guaranteed containers of other jobs. This would also make the scheduling and allocation more consistent, in that the guaranteed containers always run at normal priority and extra containers run at lower priority. The extra container could be extra over capacity (but without over-subscription) or extra over-subscription. Because of this I feel that running tasks at lower priority could be an independent (but related) work item. Staying on this topic, and adding configuration to it: it may make sense to add some way by which an application can say: do not oversubscribe nodes while my containers run on them. 
Putting cgroups or docker in this context, would these mechanisms support over-allocating resources like cpu or memory? bq. When space frees up on nodes, NMs send candidate containers for promotion on the heartbeat. That shouldn't be necessary since the RM will get to know about free capacity and run its scheduling cycle for that node - at which point it will be able to take action like allocating a new container or upgrading an existing one. There isn't anything the NM can tell the RM (which the RM does not already know) except for the current utilization of the node.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083223#comment-15083223 ] Karthik Kambatla commented on YARN-1011: We would run an opportunistic container on a node only if the actual utilization is less than the allocation by a margin bigger than the allocation of said opportunistic container. We reactively preempt the opportunistic container if the actual utilization goes over a threshold. To address spikes in usage where our reactive measures are too slow to kick in, we run the opportunistic containers at a strictly lower priority. bq. the app got opportunistic containers and their perf wasn't the same as normal containers - so it ran slower. As soon as we realize the perf is slower because the node has higher usage than we had anticipated, we preempt the container and retry allocation (guaranteed or opportunistic depending on the new cluster state). So, it shouldn't run slower for longer than our monitoring interval. Is this assumption okay? bq. However, things get complicated because a node with an opportunistic container may continue to run its normal containers while space frees up for guaranteed capacity on other nodes. The opportunistic container will continue to run on this node so long as it is getting the resources it needs. If there is any sort of resource contention, it is preempted and is up for allocation on one of the free nodes. bq. This would require that the system upgrade opportunistic containers in the same order as it would allocate containers. bq. IMO, the NM cannot make a local choice about upgrading its opportunistic containers because this is effectively a resource allocation decision and only the RM has the info to do that. The RM schedules the next highest priority "task" for which it couldn't find a guaranteed container as an opportunistic container. This task continues to run as long as it is getting enough resources. 
If there is no resource contention, the task continues to run. If guaranteed resources free up on the node it is running on, isn't it fair to promote the container to Guaranteed? After all, if the unused resources were not hidden behind other containers' allocations and were actually available as guaranteed capacity on that node initially, the RM would just have scheduled a guaranteed container in the first place. I should probably clarify that the proposal here targets those cases where users' estimates are significantly off from reality and there are enough free resources per node to run additional task(s) without causing any resource contention. Even though this is the norm, we want to guard against spikes in usage to avoid perf regressions. In practice, I expect admins to come up with a reasonable threshold for over-subscription: e.g. 0.8 - we only oversubscribe up to 80% of the capacity advertised through {{yarn.nodemanager.resource.*}}
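As a concrete illustration of the over-subscription threshold mentioned above (e.g. 0.8): a node-side admission check might look like the sketch below. The method name and the exact admission rule are assumptions for illustration, not settled YARN behavior.

```java
// Illustrative check for admitting an opportunistic container under an
// admin-configured over-subscription threshold. Names are assumptions.
class OversubscriptionCheck {
  // Admit only if current utilization plus the container's demand stays
  // within threshold * advertised capacity (e.g. threshold = 0.8).
  static boolean canRunOpportunistic(long utilizedMb, long containerMb,
                                     long advertisedMb, double threshold) {
    return utilizedMb + containerMb <= (long) (threshold * advertisedMb);
  }
}
```

For an 8 GB node with threshold 0.8, a 1 GB opportunistic container is admitted at 4 GB of actual utilization but rejected once utilization approaches 6 GB.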
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083226#comment-15083226 ] Karthik Kambatla commented on YARN-1011: Since the plan is to develop this on a branch, we could get started on some of the sub-tasks and adjust things as the design evolves. I am creating a YARN-1011 branch to track this work.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083683#comment-15083683 ] Bikas Saha commented on YARN-1011: -- Good points, but let me play the devil's advocate to get some more clarity :) bq. As soon as we realize the perf is slower because the node has higher usage than we had anticipated, we preempt the container and retry allocation (guaranteed or opportunistic depending on the new cluster state). So, it shouldn't run slower for longer than our monitoring interval. Is this assumption okay? How do we determine that the perf is slower? The CPU would never exceed 100% even under over-allocation. Is preempting always necessary? If we are sure that the OS is going to starve the opportunistic containers, then can we assume that, when the node is fully utilized, only our guaranteed containers are using resources? So we can let the opportunistic containers be, so that they can start soaking up excess capacity after the normal containers have stopped spiking. Perhaps some experiments will shed some light on this. bq. The opportunistic container will continue to run on this node so long as it is getting the resources it needs. If there is any sort of resource contention, it is preempted and is up for allocation on one of the free nodes. Let's say job capacity is 1 container and the job asks for 2. It gets 1 normal container and 1 opportunistic container. Now it releases its 1 normal container. At this point, what happens to the opportunistic container? It is clearly running at lower priority on the node and as such we are not giving the job its guaranteed capacity. The question is not about finding an optimal solution for this problem (and there may not be one). The issue here is to crisply define the semantics around scheduling in the design. Whatever the semantics are, we should clearly know what they are. IMO, the exact semantics of scheduling should be in the docs. bq. 
The RM schedules the next highest priority "task" for which it couldn't find a guaranteed container as an opportunistic container. This task continues to run as long as it is getting enough resources. If there is no resource contention, the task continues to run. If guaranteed resources free up on the node it is running, isn't it fair to promote the container to Guaranteed? Sure. And that's why the system should upgrade opportunistic containers in the order in which they were allocated. However, the decision must be made at the RM and not the NM, since the NMs don't know about total capacity, and multiple NMs locally upgrading their opportunistic containers might end up over-allocating for a job. Further, the queue sharing state may have changed since the opportunistic allocation, and hence assuming that the opportunistic container "would have" gotten that allocation anyway, at a later point in time, may not be valid. In summary, what we need in the document is a clear definition of the scheduling policy around this - whatever that policy may be.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083957#comment-15083957 ] Bikas Saha commented on YARN-1011: -- I agree with relying on natural container churn instead of preemption to avoid lost work, though the issue of clearly defining the scheduler policy still remains. bq. If we were oversubscribing 10X then I'd probably want it for sure, but if it's at most 2X capacity then worst case is a container only gets 50% of the resource it had requested. Obviously for something like memory this has to be closely controlled because going over the physical capabilities of the machine has very significant consequences. But for CPU, I'd definitely be inclined to live with the occasional 50% worst case for all containers, in order to avoid the 1/1024th worst case for OPPORTUNISTIC containers on a busy node. I did not understand this. Does this mean it's OK for normal containers to run 50% slower in the presence of opportunistic containers? If yes, then there are scenarios where this may not be a valid choice. E.g. when a cluster is running a mix of SLA and non-SLA jobs. Non-SLA jobs are OK if their containers get slowed down to increase cluster utilization by running opportunistic containers, because we are getting higher overall throughput. But SLA jobs are not OK with missing deadlines because their tasks ran 50% slower. IMO, the litmus test for a feature like this would be to take an existing cluster (with low utilization because tasks are asking for more resources than what they need 100% of the time), then turn this feature on and get better cluster utilization and throughput without affecting the existing workload - whatever the internal implementation details. Agree? bq. 50% of maximum-under-utilized resource of past 30 min for each NM can be used to allocate opportunistic containers. These are heuristics and may all be valid under different circumstances. What we should step back and see is the source of this optimization. Observation: the cluster is under-utilized despite being fully allocated. Possible reasons: 1) Tasks are incorrectly over-allocated. Will never use the resources they ask for and hence we can safely run additional opportunistic containers. So this feature is used to compensate for poorly configured applications. Probably a valid scenario but is it common? 2) Tasks are correctly allocated but don't use their capacity to the limit all the time. E.g. Terasort will use high CPU only during the sorting but not during the entire length of the job, yet its containers will ask for enough CPU to run the sort in the desired time. This is typical application behavior where resource usage varies over time. So this feature is used to soak up the fallow resources in the cluster while tasks are not using their quoted capacity. The arguments and assumptions we make need to be considered in the light of which of 1 or 2 is the common case and where this feature will be useful. While it's useful to have configuration knobs, for a complex dynamic feature like this that is basically reacting to runtime observations, it may be quite hard to configure this statically by hand. While some limits, like a max over-allocation limit, are easy and probably required to configure, we should look at making this feature work by itself instead of relying exclusively on configuration (hell :P) for users to make this feature usable.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083974#comment-15083974 ] Jason Lowe commented on YARN-1011: -- bq. Tasks are incorrectly over-allocated. Will never use the resources they ask for and hence we can safely run additional opportunistic containers. So this feature is used to compensate for poorly configured applications. Probably a valid scenario but is it common? In my experience this is fairly common. Users tend to twiddle with config values until something is working, then they don't bother to revisit until there's a problem. And it's easier to overallocate than to spend the time to carefully tune the task size. Even if the user is interested in tuning, they can't always tune optimally. Some examples are data skew or other task-specific issues where a few tasks need a lot of memory but the vast majority of the others do not. Many frameworks only allow the task sizes to be configured as a group, so the user has to run all the tasks in the group with the worst-case container size even though most of them don't need it. Pig on MapReduce is another example, where it will spawn multiple jobs but the user can only configure the memory settings once in the script and they apply to all jobs launched by the script. Therefore the user has to set it to the worst-case size across all the script's jobs, and all but one of the jobs runs with oversized map containers.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083880#comment-15083880 ] Nathan Roberts commented on YARN-1011: -- Very excited about this feature and agree that we should make this as simple as possible in the first go around. I have a couple of initial questions. bq. As soon as we realize the perf is slower because the node has higher usage than we had anticipated, we preempt the container and retry allocation (guaranteed or opportunistic depending on the new cluster state). So, it shouldn't run slower for longer than our monitoring interval. Is this assumption okay? This seems hard ([~bikassaha]'s comment above). All of this basically boils down to the fact that preempting a container means lost work, so the decision to preempt something shouldn't be taken lightly. For resources like memory we have to react quickly, and that's fine. But for things like CPU, I'm personally OK with latency on the order of single-digit minutes so that natural container churn almost always avoids preemption. Because of that complexity, I'm not 100% convinced that disfavoring OPPORTUNISTIC containers (e.g. low value for cpu_shares) is something that buys us very much. If we were oversubscribing 10X then I'd probably want it for sure, but if it's at most 2X capacity then worst case is a container only gets 50% of the resource it had requested. Obviously for something like memory this has to be closely controlled because going over the physical capabilities of the machine has very significant consequences. But for CPU, I'd definitely be inclined to live with the occasional 50% worst case for all containers, in order to avoid the 1/1024th worst case for OPPORTUNISTIC containers on a busy node. So, hopefully we can make the policy quite configurable so that the amount of disfavoring can be tuned for various workloads. bq. 
In practice, I expect admins to come up with a reasonable threshold for over-subscription: e.g. 0.8 - we only oversubscribe up to 80% of capacity advertised through yarn.nodemanager.resource.*. Thinking more about this, this threshold should have an upper limit - 0.95? Can we make this per-resource? (80% memory, 120% CPU)?
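The cpu_shares "disfavoring" discussed above could be sketched as below: mapping a disfavor factor onto a cgroup v1 cpu.shares weight, so opportunistic containers only lose CPU under contention. The knob name and the linear mapping are assumptions for illustration, not actual YARN configuration.

```java
// Sketch: map a "disfavor factor" to cgroup v1 cpu.shares, so opportunistic
// containers get a proportionally smaller CPU weight only when guaranteed
// containers compete for the CPU. The factor itself is a hypothetical knob.
class CpuShares {
  static final int DEFAULT_SHARES = 1024; // cgroup v1 default weight
  static final int MIN_SHARES = 2;        // cgroup v1 minimum allowed value

  // disfavorFactor = 1 leaves opportunistic containers at equal weight;
  // larger factors shrink their CPU share under contention, down to the
  // 1/1024th-style worst case Nathan describes at the maximum factor.
  static int opportunisticShares(int disfavorFactor) {
    return Math.max(MIN_SHARES, DEFAULT_SHARES / disfavorFactor);
  }
}
```

Note that cpu.shares is only a relative weight: with no contention, an opportunistic container at weight 2 can still use all idle CPU, which is why the disfavoring only matters on a busy node.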
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083905#comment-15083905 ] Wangda Tan commented on YARN-1011: -- bq. So, hopefully we can make the policy quite configurable so that the amount of disfavoring can be tuned for various workloads. I also +1 making this policy configurable. This is why I filed YARN-4511. Instead of a threshold on the "current" instantaneous usage, I think we can use a threshold on the "recorded" usage. For example, only 50% of the *maximum-under-utilized* resource of the past 30 min for each NM can be used to allocate opportunistic containers. This could reduce the number of opportunistic containers we can allocate, but it would also reduce the risk that opportunistic containers affect other containers. bq. Can we make this per-resource? (80% memory, 120% CPU)? +1
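The windowed heuristic above could be sketched as follows. This reads "maximum-under-utilized resource of the past 30 min" conservatively, as 50% of the smallest headroom observed over the window; the interpretation, the sampling scheme, and all names are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of sizing opportunistic capacity from recorded (windowed) usage
// rather than the instantaneous reading. Illustrative only.
class WindowedHeadroom {
  private final Deque<Long> samples = new ArrayDeque<>();
  private final int windowSize; // e.g. one sample per NM heartbeat over 30 min

  WindowedHeadroom(int windowSize) {
    this.windowSize = windowSize;
  }

  void addSample(long utilized) {
    samples.addLast(utilized);
    if (samples.size() > windowSize) {
      samples.removeFirst(); // slide the window forward
    }
  }

  // 50% of the smallest headroom seen in the window: the node had at least
  // this much free for the whole period, so the estimate is conservative.
  long opportunisticCapacity(long capacity) {
    long maxUtilized = 0;
    for (long u : samples) {
      maxUtilized = Math.max(maxUtilized, u);
    }
    return Math.max(0, (capacity - maxUtilized) / 2);
  }
}
```

Because the estimate keys off the worst (highest) utilization in the window, a single past spike shrinks the opportunistic capacity for the rest of the window, which is exactly the risk-reduction trade-off described in the comment.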
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084543#comment-15084543 ] Karthik Kambatla commented on YARN-1011: Dropping a quick note here. (Traveling - will answer other comments tomorrow.) bq. How do we determine that the perf is slower? The CPU would never exceed 100% even under over-allocation. This is one of the reasons I was proposing the notion of a max threshold which is less than 1 :) If the utilization goes to 100%, we clearly know there is contention. Since we measure resource utilization in resource-seconds (if not, we should update it), bursty spikes alone wouldn't take utilization over 100%. So, we shouldn't see a utilization greater than 100%. bq. Can we make this per-resource? (80% memory, 120% CPU)? I am open to per-resource configuration. That said, I am not too keen, especially if my above comment on utilization never going over 100% holds. bq. Tasks are incorrectly over-allocated. Will never use the resources they ask for and hence we can safely run additional opportunistic containers. So this feature is used to compensate for poorly configured applications. Probably a valid scenario but is it common? It is quite common for folks to borrow a Hive or Pig script from a colleague.
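A toy illustration of the "resource-seconds" point above, assuming utilization is averaged over the monitoring interval rather than sampled instantaneously (the interval length and sampling scheme are assumptions):

```java
// Averaging per-second CPU readings over the monitoring interval means a
// single one-second burst cannot by itself push the measured utilization
// over a contention threshold. Illustrative sketch only.
class IntervalUtilization {
  // Per-second CPU readings in [0.0, 1.0]; returns the interval average.
  static double averaged(double[] perSecondCpu) {
    double sum = 0;
    for (double s : perSecondCpu) {
      sum += s;
    }
    return sum / perSecondCpu.length;
  }
}
```

With a 10-second interval, one second at 100% on top of a 50% baseline only moves the measured utilization to 55%, well under e.g. a 0.8 threshold, which is why sustained contention, not a spike, triggers the reactive path.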
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081950#comment-15081950 ] Bikas Saha commented on YARN-1011: -- In Tez we always try to allocate the most important work to the next allocated container. So doing opportunistic containers without providing the AM with the ability to know about it and use it judiciously may not be something that can be delayed to a second phase. Being able to choose only guaranteed or non-guaranteed containers only covers half the problem (and probably the less relevant one), in which an application should always finish in 1min using guaranteed capacity but may sometimes finish in 30s because it got opportunistic containers. The other side is probably more important, where a regression is caused due to opportunistic containers. 1) the app got opportunistic containers and their perf wasn't the same as normal containers - so it ran slower. This may be mitigated by the system guaranteeing that only excess containers beyond guaranteed capacity would be opportunistic. This would require that the system upgrade opportunistic containers in the same order as it would allocate containers. However, things get complicated because a node with an opportunistic container may continue to run its normal containers while space frees up for guaranteed capacity on other nodes. At this point, which container becomes guaranteed - the new one on a free node or the opportunistic one that is already doing work? Which one should be preempted? 2) the app suffered because its guaranteed containers got slowed down due to competition from opportunistic containers. This needs strong support for lower priority resource consumption for opportunistic containers. IMO, the NM cannot make a local choice about upgrading its opportunistic containers because this is effectively a resource allocation decision and only the RM has the info to do that. 
The NM does not know if this would exceed guaranteed capacity and, in total, a bunch of NMs making this choice locally can lead to excessive over-allocation of guaranteed resources.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082305#comment-15082305 ] Subru Krishnan commented on YARN-1011: -- [~kasha], I had an offline discussion with [~curino] and [~chris.douglas] regarding auto promotion by NM. To be aligned with YARN-2877, we feel it will be good if the NM can express its preference to the RM and let the RM make the decision, as only it can ensure the global invariants based on the current state of the cluster. The preference can be based on whether the opportunistic container has been started or not, its resources have been localized or not, how long it has been running, how much progress it has made, etc.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15076750#comment-15076750 ] Karthik Kambatla commented on YARN-1011: Forgot to respond to one comment: bq. when terminating opportunistic containers will the RM ask the AM about which containers to kill? Don't think we should. NM --> RM --> AM --> NM is a long communication path. Our preemption should kick in much faster than that. What do you think of preempting the last opportunistic container that was started, since it is likely the farthest away from promotion?
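The "preempt the most recently started opportunistic container" idea above amounts to a per-node LIFO; a minimal sketch, with container ids and the class structure purely illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: track opportunistic containers per node in start order and preempt
// the most recently started first, since it has likely made the least
// progress and is farthest from promotion to GUARANTEED. Illustrative only.
class OpportunisticPreemption {
  private final Deque<String> runningInStartOrder = new ArrayDeque<>();

  void onStart(String containerId) {
    runningInStartOrder.addLast(containerId);
  }

  // Returns the container to preempt next, or null if none are running.
  String pickVictim() {
    return runningInStartOrder.pollLast();
  }
}
```

Keeping the decision local to the NM keeps the reaction path short, which is the point of the comment: waiting on an NM --> RM --> AM --> NM round trip would defeat fast reactive preemption.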
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15076749#comment-15076749 ] Karthik Kambatla commented on YARN-1011:

Thanks for chiming in, [~bikassaha].

bq. It is essential to run opportunistic tasks at lower OS cpu priority so that they never obstruct progress of normal tasks.
bq. In fact, this is the litmus test for opportunistic scheduling.

Good point. Guaranteed containers should get priority for resources: Opportunistic containers should only use left-over resources. We should do this for CPU, disk and network. I am not aware of the latest on disk and network isolation, but we should create sub-tasks for those too. /cc [~vvasudev]

bq. Handling opportunistic tasks raises questions on the involvement of the AMs.
bq. In that sense it would be instructive to consider opportunistic scheduling in a similar light as preemption.

I wasn't sure the AM needs to know a container's execution type: as you mention, this is very similar to preemption. From an AM's standpoint, the container would be preempted if those resources are not available to that application any more. In the case of preemption, this can happen if other high-priority queues have outstanding demand or the cluster lost a couple of nodes. Here, it is possible Guaranteed containers actually need the resources. In that sense, the AM doesn't have to do anything different for Guaranteed vs Opportunistic containers.

Predictability: allowing applications to specify only Guaranteed containers vs Guaranteed-or-Opportunistic containers should take care of this. However, between getting no resources and getting opportunistic resources, are there cases where the applications prefer the latter? The applications "should" get guaranteed containers at the same point in time irrespective of whether they use opportunistic resources in the interim.
Note that allowing applications to specify whether they are okay with getting opportunistic containers complicates the scheduling: the scheduler needs to look through the higher-priority apps that don't allow opportunistic containers before getting to those that do. And, when resources are available on that node, the RM will need to schedule containers for higher-priority apps, prolonging the duration for which opportunistic containers stay opportunistic. Given this complication, I would prefer we do not involve AMs in the decision-making process. Based on the need and use cases, we could revisit this at a later time. Note that YARN-4335 adds this to ResourceRequest for distributed scheduling, and even there they are not entirely sure if it needs to be a part.

bq. does the AM need to know that a newly allocated container was opportunistic. E.g. so that it does not schedule the highest priority work on that container.

Valid concern. Maybe we should inform the AM of whether a container is opportunistic, and later of when it gets promoted to guaranteed. That said, I am not sure this is essential to oversubscription being useful. Thoughts on punting it to Phase-2?

bq. will opportunistic containers be given only for containers that are beyond queue capacity such that we dont break any guarantees on their liveliness. ie. an AM will not expect to lose any container that is within its queue capacity but opportunistic containers can be killed at any time.

Yes. This probably needs to be made clear in the doc; will update it.

bq. will conversion of opportunistic containers to regular containers be automatically done by the RM?

By some combination of RM/NM, definitely yes. Initially, I thought the RM could be the only one doing this. The RM could keep track of opportunistic containers in SchedulerNode. Today, we already track launchedContainers. The scheduler could go through this list and promote containers before allocating new containers.
Does this add an unnecessary delay in the promotion, though? If the scheduler allocated opportunistic containers based on the same prioritization it uses for guaranteed containers, can the NM just promote the oldest opportunistic container running on that node and update the RM accordingly? Another thing to consider: the promotion process here should work with the one in YARN-2877. [~subru], [~kkaranasos], [~asuresh] - is it okay for the NM to automatically promote some opportunistic containers? Maybe we could add a flag to the launch context to differentiate between opportunistic containers that can be automatically promoted and those that cannot.
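A minimal sketch of the promote-before-allocate pass discussed above, under the assumption (not settled in this thread) that opportunistic containers are tracked per node in start order and promoted oldest-first while guaranteed capacity remains. All names are illustrative, not YARN code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: before allocating new containers on a node, promote
// queued opportunistic containers oldest-first while free guaranteed memory
// remains. A LinkedHashMap preserves insertion (start) order.
public class NodePromotion {
    public static List<String> promoteBeforeAllocate(
            LinkedHashMap<String, Integer> opportunisticByStartOrder, // id -> memory MB
            int freeGuaranteedMemMB) {
        List<String> promoted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : opportunisticByStartOrder.entrySet()) {
            if (e.getValue() <= freeGuaranteedMemMB) {
                freeGuaranteedMemMB -= e.getValue();
                promoted.add(e.getKey());
            } else {
                break; // keep the scheduler's ordering; don't skip ahead
            }
        }
        return promoted;
    }
}
```

Stopping at the first container that does not fit (rather than skipping ahead) mirrors the concern that promotion should follow the same prioritization the scheduler used for allocation.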
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072107#comment-15072107 ] Inigo Goiri commented on YARN-1011:

The doc looks good. I have a couple of questions:
# What would be the first policy to implement? I guess we can define it in YARN-1015.
# Would it make sense to make over-subscription a global property set by the RM instead of per-node?

I think we need a sub-task under this umbrella for the over-subscription property.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072240#comment-15072240 ] Karthik Kambatla commented on YARN-1011:

bq. For resource oversubscription enable/disable for individual nodes, I think it's very important since some nodes could be more important than others. Do you think it is fine to add a configuration item to each NM's yarn-site.xml?

That is exactly the intent. Let us continue this conversation on YARN-4512.

bq. For scheduler-side implementation, instead of modifying individual schedulers, I think we should try to add the over-subscription policy to the common scheduling layer since it doesn't sound very related to a specific scheduler implementation.

Makes sense. I doubt there is any scheduler-specific smarts here. If at all we need to do them separately, it is most likely because our scheduler abstractions are not clean.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072227#comment-15072227 ] Wangda Tan commented on YARN-1011:

Thanks [~kasha], and also comments from [~elgoiri]. Looked at the doc; it looks good. Some questions/comments:
- For resource oversubscription enable/disable for individual nodes, I think it's very important since some nodes could be more important than others. Do you think it is fine to add a configuration item to each NM's yarn-site.xml?
- For the scheduler-side implementation, instead of modifying individual schedulers, I think we should try to add the over-subscription policy to the common scheduling layer since it doesn't sound very related to a specific scheduler implementation.

I also agree that for the first implementation, we can simply assume nodes have more resources to use. CS shouldn't have an issue with this assumption.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072223#comment-15072223 ] Karthik Kambatla commented on YARN-1011:

bq. Would it make sense to make over-subscription a global property set by the RM instead of per-node?

Good question. I thought about it quite a bit. Here is my reasoning for doing it on the NM side; we can always switch to defining it on the RM if that makes more sense.
# Even if we have the knob on the RM, the node still has to support it: monitor the resource usage on the node and kill OPPORTUNISTIC containers if need be. On a cluster with NMs of different versions (say, during a rolling upgrade), the RM would have to keep track of NMs that support over-subscription, so we need some config for the NM anyway. Further, there could be node-specific conditions - hardware, other services running on the node, etc. - that affect the over-subscription capacity of the node. For instance, it might be okay to sign up for 90% of the advertised capacity on node A, but only 80% on node B. And this ability to soak up extra work could change over time.
# In terms of implementation, the node already sends its capacity and its aggregate-container-utilization. It might as well send an oversubscription-percentage over, which is interpreted as the fraction of its advertised capacity. E.g., a node with 64 GB memory could advertise its capacity as 50 GB and an oversubscription-percentage of 0.9; the RM could then schedule up to 45 GB of utilization. An oversubscription-percentage <= 0 would indicate the feature is turned off.

bq. What would be the first policy to implement? I guess we can define it in YARN-1015.

The simplest policy would likely be to just assume there are more resources on the node, and continue allocating with the same policies we use today for free/unallocated resources. This should work okay for the FairScheduler.
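The oversubscription arithmetic above (50 GB advertised x 0.9 = 45 GB of schedulable utilization, with a fraction <= 0 meaning "off") can be sketched as follows. Names are hypothetical, not YARN code:

```java
// Sketch of the arithmetic in the comment. A node advertises a capacity and
// an oversubscription fraction; the RM may schedule until utilization
// reaches capacity * fraction. A fraction <= 0 disables the feature, which
// this sketch models as zero utilization-based headroom.
public class Oversubscription {
    public static long schedulableUtilization(long advertisedCapacity,
                                              double oversubscriptionFraction) {
        if (oversubscriptionFraction <= 0) {
            return 0; // oversubscription turned off on this node
        }
        return (long) (advertisedCapacity * oversubscriptionFraction);
    }
}
```

Whether the unit is GB or MB does not matter, since the fraction is relative to the advertised capacity the node already reports.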
I am less familiar with the intricate details of CS, but would think it should apply there as well. [~leftnoteasy] - thoughts?
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072228#comment-15072228 ] Wangda Tan commented on YARN-1011: -- I just created YARN-4511 to track common scheduling policy for resource over-subscription.
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072279#comment-15072279 ] Bikas Saha commented on YARN-1011:

In my prior experience, something like this is not practical without pro-active CPU management (which has been delegated to future work in the document). It is essential to run opportunistic tasks at lower OS CPU priority so that they never obstruct progress of normal tasks. Typically we will find that the machine is under-utilized the most in CPU, since most processing has bursty CPU. When a normal task has a CPU burst, it should not have to contend with an opportunistic task, since that would be detrimental to the expected performance of that task. Without this, jobs will not run predictably in the cluster. From what I have seen, users prefer predictability over most other things, i.e. having a 1-min job run in 1 min all the time vs. making that job run in 30s 85% of the time but in 2 mins 5% of the time, because the latter makes it really hard to establish SLAs. In fact, this is the litmus test for opportunistic scheduling: it should be able to raise the utilization of a cluster from (say) 50% to (say) 75% without affecting the latency of the jobs compared to when the cluster was running at 50%.

For memory, in fact, it's okay to share and reach 100% capacity, but it's important to check that the machine does not start thrashing. Most well-written tasks will run within their memory limits and start spilling etc. Opportunistic tasks are trying to occupy the memory that these tasks thought they could use but are not using (or that these tasks are keeping in buffer on purpose). The crucial thing to consider here is to look for stats that signify the onset of memory paging activity (or overall memory over-subscription at the OS level). At that point, even normal tasks that are within their limit will be adversely affected because the OS will start paging memory to disk.
So we need to start proactively killing opportunistic tasks before such paging activity gets triggered.

Handling opportunistic tasks raises questions on the involvement of the AMs. Unless I missed something, this is not called out clearly in the doc. In that sense it would be instructive to consider opportunistic scheduling in a similar light as preemption: the app got a container that it would not have gotten at that time if we had been strict, but got it because we decided to loosen the strings (of queue capacity or machine capacity, respectively).
- will opportunistic containers be given only for containers that are beyond queue capacity, such that we don't break any guarantees on their liveliness? i.e. an AM will not expect to lose any container that is within its queue capacity, but opportunistic containers can be killed at any time.
- does the AM need to know that a newly allocated container was opportunistic? E.g. so that it does not schedule the highest-priority work on that container.
- will conversion of opportunistic containers to regular containers be automatically done by the RM? Will the RM notify the AM about such conversions?
- when terminating opportunistic containers, will the RM ask the AM about which containers to kill? Given the above perf-related scenarios this may not be a viable option.
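To make the "kill before paging sets in" idea concrete: a minimal, hypothetical guard that samples a cumulative major-page-fault counter (e.g. pgmajfault from /proc/vmstat on Linux) and flags when the fault rate crosses a threshold. The counter source, threshold, and names are assumptions for illustration, not YARN code:

```java
// Hypothetical sketch: detect the onset of paging from successive samples of
// a cumulative major-fault counter, and signal that opportunistic containers
// should start being killed once the fault rate exceeds a threshold.
public class PagingGuard {
    private long lastPgMajFault = -1; // previous cumulative sample; -1 = none yet

    // Returns true when the sampled major-fault rate (faults/sec) suggests
    // we should begin preempting opportunistic containers.
    public boolean shouldPreempt(long pgMajFault, long intervalSeconds,
                                 long faultsPerSecThreshold) {
        boolean trigger = false;
        if (lastPgMajFault >= 0 && intervalSeconds > 0) {
            long rate = (pgMajFault - lastPgMajFault) / intervalSeconds;
            trigger = rate > faultsPerSecThreshold;
        }
        lastPgMajFault = pgMajFault;
        return trigger;
    }
}
```

Acting on the fault *rate* rather than absolute memory usage matches the comment's point: 100% memory utilization is fine until the OS actually starts paging, at which point even normal tasks within their limits suffer.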
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072302#comment-15072302 ] Wangda Tan commented on YARN-1011:

Thanks,

bq. Makes sense. Doubt there is any scheduler-specific smarts here. If at all we need to do them separately, it is most likely because our scheduler abstractions are not clean.

Agree!
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056483#comment-15056483 ] Wangda Tan commented on YARN-1011: -- Thanks [~kasha], count me in :)! I could help with reviewing/implementation.