[jira] [Commented] (YARN-9656) Plugin to avoid scheduling jobs on node which are not in "schedulable" state, but are healthy otherwise.
[ https://issues.apache.org/jira/browse/YARN-9656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951393#comment-16951393 ] Mayank Bansal commented on YARN-9656: - [~wangda] We should not mark the full cluster unhealthy, otherwise it is very hard to distinguish between the unhealthy case and the stressed case. We would not want every node to be removed from the scheduling cycle, otherwise it becomes a cluster-wide outage. We would want to see how many nodes can be stressed in one cycle and avoid only that small number of nodes. > Plugin to avoid scheduling jobs on node which are not in "schedulable" state, > but are healthy otherwise. > > > Key: YARN-9656 > URL: https://issues.apache.org/jira/browse/YARN-9656 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Affects Versions: 2.9.1, 3.1.2 >Reporter: Prashant Golash >Assignee: Prashant Golash >Priority: Major > Attachments: 2.patch > > > Creating this Jira to get ideas from the community on whether this is something > helpful that can be done in YARN. Sometimes nodes go into a bad state, e.g. > a hardware problem (bad I/O, fan failure). In some other scenarios, if > CGroups are not enabled, nodes may be running very high on CPU and the jobs > scheduled on them will suffer. > > The idea is three-fold: > # Gather relevant metrics from node-managers and publish them in some form (e.g. an > exclude file). > # The RM loads the files and puts the nodes on the blacklist. > # Once a node becomes healthy again, it can be put back on the whitelist. > Various optimizations can be done here, but I would like to understand whether > this is something that could be helpful as an upstream feature in YARN. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
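For context, a minimal sketch of what such a plugin could look like on the RM side, under stated assumptions: the class names, the stress report, the 0.90 CPU threshold, and the per-cycle cap are all hypothetical illustrations and are not part of the attached 2.patch or of any existing YARN API. It mainly illustrates the point above about never excluding more than a small, bounded number of nodes per scheduling cycle.

{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Hypothetical sketch only -- not an existing YARN interface and not the attached patch.
 * Picks a bounded number of "stressed but otherwise healthy" nodes to skip in a
 * scheduling cycle, so a stressed cluster can never blacklist itself entirely.
 */
class StressedNodeFilter {

  /** Illustrative per-node report a NodeManager might publish (e.g. via an exclude file). */
  static class NodeStressReport {
    final String nodeId;
    final double cpuUtilization; // 0.0 - 1.0

    NodeStressReport(String nodeId, double cpuUtilization) {
      this.nodeId = nodeId;
      this.cpuUtilization = cpuUtilization;
    }
  }

  private final double cpuThreshold;      // assumed threshold, e.g. 0.90
  private final int maxExcludedPerCycle;  // cap so the whole cluster is never taken out

  StressedNodeFilter(double cpuThreshold, int maxExcludedPerCycle) {
    this.cpuThreshold = cpuThreshold;
    this.maxExcludedPerCycle = maxExcludedPerCycle;
  }

  /** Returns at most maxExcludedPerCycle node ids to skip in this scheduling cycle. */
  List<String> selectNodesToExclude(List<NodeStressReport> reports) {
    List<NodeStressReport> stressed = new ArrayList<>();
    for (NodeStressReport r : reports) {
      if (r.cpuUtilization >= cpuThreshold) {
        stressed.add(r);
      }
    }
    // Exclude the most stressed nodes first, but never more than the cap.
    stressed.sort(
        Comparator.comparingDouble((NodeStressReport r) -> r.cpuUtilization).reversed());
    List<String> excluded = new ArrayList<>();
    for (int i = 0; i < stressed.size() && i < maxExcludedPerCycle; i++) {
      excluded.add(stressed.get(i).nodeId);
    }
    return excluded;
  }
}
{code}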
[jira] [Updated] (YARN-4161) Capacity Scheduler : Assign single or multiple containers per heart beat driven by configuration
[ https://issues.apache.org/jira/browse/YARN-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-4161: Attachment: YARN-4161.patch Attaching patch > Capacity Scheduler : Assign single or multiple containers per heart beat > driven by configuration > > > Key: YARN-4161 > URL: https://issues.apache.org/jira/browse/YARN-4161 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-4161.patch > > > Capacity Scheduler right now schedules multiple containers per heart beat if > there are more resources available on the node. > This approach works fine, however in some cases it does not distribute the load > across the cluster, and hence the throughput of the cluster suffers. I am adding a > feature to drive this via configuration so that we can control the number > of containers assigned per heart beat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
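To picture the configuration-driven behavior described above, here is a hedged sketch. The property name, the default, and the node abstraction are assumptions made for illustration; they are not claims about the contents of YARN-4161.patch.

{code}
import org.apache.hadoop.conf.Configuration;

/**
 * Illustrative sketch of capping container assignments per node heartbeat.
 * The property name below is assumed for illustration only.
 */
class HeartbeatAssignmentLimiter {
  static final String MAX_ASSIGNMENTS_KEY =
      "yarn.scheduler.capacity.max-assignments-per-heartbeat"; // assumed name
  static final int DEFAULT_MAX_ASSIGNMENTS = 1; // 1 == single assignment per heartbeat

  /** Minimal stand-in for the scheduler's node view, for illustration only. */
  interface SchedulableNode {
    boolean hasAvailableResources();
    boolean tryAssignOneContainer();
  }

  private final int maxAssignments;

  HeartbeatAssignmentLimiter(Configuration conf) {
    this.maxAssignments = conf.getInt(MAX_ASSIGNMENTS_KEY, DEFAULT_MAX_ASSIGNMENTS);
  }

  /** Keep assigning on this node until resources run out or the configured cap is hit. */
  int assignContainers(SchedulableNode node) {
    int assigned = 0;
    while (assigned < maxAssignments && node.hasAvailableResources()) {
      if (!node.tryAssignOneContainer()) {
        break; // nothing could be placed this round; stop early
      }
      assigned++;
    }
    return assigned;
  }
}
{code}

Setting the cap to 1 spreads new containers across many node heartbeats (better load distribution), while a larger value packs more containers onto whichever node heartbeats first (higher per-heartbeat throughput).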
[jira] [Updated] (YARN-4161) Capacity Scheduler : Assign single or multiple containers per heart beat driven by configuration
[ https://issues.apache.org/jira/browse/YARN-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-4161: Description: Capacity Scheduler right now schedules multiple containers per heart beat if there are more resources available on the node. This approach works fine, however in some cases it does not distribute the load across the cluster, and hence the throughput of the cluster suffers. I am adding a feature to drive this via configuration so that we can control the number of containers assigned per heart beat. was:Capacity > Capacity Scheduler : Assign single or multiple containers per heart beat > driven by configuration > > > Key: YARN-4161 > URL: https://issues.apache.org/jira/browse/YARN-4161 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Mayank Bansal >Assignee: Mayank Bansal > > Capacity Scheduler right now schedules multiple containers per heart beat if > there are more resources available on the node. > This approach works fine, however in some cases it does not distribute the load > across the cluster, and hence the throughput of the cluster suffers. I am adding a > feature to drive this via configuration so that we can control the number > of containers assigned per heart beat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4161) Capacity Scheduler : Assign single or multiple containers per heart beat driven by configuration
[ https://issues.apache.org/jira/browse/YARN-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-4161: Description: Capacity > Capacity Scheduler : Assign single or multiple containers per heart beat > driven by configuration > > > Key: YARN-4161 > URL: https://issues.apache.org/jira/browse/YARN-4161 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Mayank Bansal >Assignee: Mayank Bansal > > Capacity -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4161) Capacity Scheduler : Assign single or multiple containers per heart beat driven by configuration
[ https://issues.apache.org/jira/browse/YARN-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-4161: Fix Version/s: (was: 2.1.0-beta) > Capacity Scheduler : Assign single or multiple containers per heart beat > driven by configuration > > > Key: YARN-4161 > URL: https://issues.apache.org/jira/browse/YARN-4161 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Mayank Bansal >Assignee: Mayank Bansal > > Capacity -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4161) Capacity Scheduler : Assign single or multiple containers per heart beat driven by configuration
[ https://issues.apache.org/jira/browse/YARN-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-4161: Description: (was: This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities.) > Capacity Scheduler : Assign single or multiple containers per heart beat > driven by configuration > > > Key: YARN-4161 > URL: https://issues.apache.org/jira/browse/YARN-4161 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Fix For: 2.1.0-beta > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4161) Capacity Scheduler : Assign single or multiple containers per heart beat driven by configuration
Mayank Bansal created YARN-4161: --- Summary: Capacity Scheduler : Assign single or multiple containers per heart beat driven by configuration Key: YARN-4161 URL: https://issues.apache.org/jira/browse/YARN-4161 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Mayank Bansal Assignee: Mayank Bansal This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4161) Capacity Scheduler : Assign single or multiple containers per heart beat driven by configuration
[ https://issues.apache.org/jira/browse/YARN-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-4161: Labels: (was: BB2015-05-TBR) > Capacity Scheduler : Assign single or multiple containers per heart beat > driven by configuration > > > Key: YARN-4161 > URL: https://issues.apache.org/jira/browse/YARN-4161 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Fix For: 2.1.0-beta > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-2.6-11.patch Uploading the 2.6 patch Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Labels: BB2015-05-TBR Attachments: YARN-2069-2.6-11.patch, YARN-2069-trunk-1.patch, YARN-2069-trunk-10.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch, YARN-2069-trunk-9.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
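For readers skimming the patch series, a hedged sketch of the rule this JIRA describes (stop preempting from a user once doing so would push that user below its computed user limit). All class, field, and method names below are invented for illustration; this is not the actual CapacityScheduler or ProportionalCapacityPreemptionPolicy code.

{code}
import java.util.List;

/**
 * Illustrative sketch: while rebalancing queue capacity, take containers from a
 * user only as long as the user stays at or above its computed user limit.
 * Names and the fixed container size are assumptions for illustration.
 */
class UserLimitAwarePreemption {

  static class UserUsage {
    final String user;
    long usedMemMb;                            // current usage in the queue
    final long userLimitMb;                    // computed user limit for the queue
    final List<String> preemptableContainers;  // ordered cheapest-to-kill first

    UserUsage(String user, long usedMemMb, long userLimitMb,
              List<String> preemptableContainers) {
      this.user = user;
      this.usedMemMb = usedMemMb;
      this.userLimitMb = userLimitMb;
      this.preemptableContainers = preemptableContainers;
    }
  }

  /** Pick containers to preempt from one user without violating its user limit. */
  static int preemptFromUser(UserUsage u, long stillNeededMb, long containerSizeMb,
                             List<String> toPreempt) {
    int taken = 0;
    for (String containerId : u.preemptableContainers) {
      boolean satisfied = (long) taken * containerSizeMb >= stillNeededMb;
      boolean wouldBreachUserLimit = u.usedMemMb - containerSizeMb < u.userLimitMb;
      if (satisfied || wouldBreachUserLimit) {
        break; // respect the user limit even if the queue still wants more back
      }
      toPreempt.add(containerId);
      u.usedMemMb -= containerSizeMb;
      taken++;
    }
    return taken;
  }
}
{code}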
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279056#comment-14279056 ] Mayank Bansal commented on YARN-2933: - Thanks [~jianhe] and [~wangda] for the review. bq. looks good overall, we should use priority.AMCONTAINER here ? The name was confusing, so I changed the names and updated accordingly. bq. it's better to use enum type instead of int in mockContainer, which can avoid call getValue() from enum. Priority is overridden differently in multiple tests, so I didn't want to change the signature of the functions; moreover, it's the same either way. Uploading the updated patch. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch, YARN-2933-7.patch, YARN-2933-8.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
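To make the short-term rule in this issue concrete, a minimal sketch is given below. The class, the candidate holder, and the label lookup are simplified stand-ins assumed for illustration; the real logic lives in ProportionalCapacityPreemptionPolicy and is not reproduced here.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Simplified sketch of the temporary rule discussed in this JIRA: when selecting
 * preemption candidates, skip containers on labeled nodes so that only no-label
 * capacity is rebalanced. Names are illustrative only.
 */
class NoLabelPreemptionSelector {

  /** Illustrative container holder. */
  static class Candidate {
    final String containerId;
    final String host;

    Candidate(String containerId, String host) {
      this.containerId = containerId;
      this.host = host;
    }
  }

  /** hostname -> labels on that node (empty set means no label); assumed to be provided. */
  private final Map<String, Set<String>> nodeLabels;

  NoLabelPreemptionSelector(Map<String, Set<String>> nodeLabels) {
    this.nodeLabels = nodeLabels;
  }

  List<Candidate> selectPreemptable(List<Candidate> overCapacityContainers) {
    List<Candidate> result = new ArrayList<>();
    for (Candidate c : overCapacityContainers) {
      Set<String> labels = nodeLabels.get(c.host);
      if (labels != null && !labels.isEmpty()) {
        continue; // short-term rule: never preempt from labeled nodes
      }
      result.add(c);
    }
    return result;
  }
}
{code}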
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-9.patch Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch, YARN-2933-7.patch, YARN-2933-8.patch, YARN-2933-9.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-8.patch Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch, YARN-2933-7.patch, YARN-2933-8.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277328#comment-14277328 ] Mayank Bansal commented on YARN-2933: - Thanks [~wangda] for the review. bq. 1) ProportionalCapacityPreemptionPolicy.setNodeLabels is too simple to be a method, it's better to remove it. Getters and setters are usually simple, but it's good practice to have them. I think we should keep it. bq. 2) It's better to use enum here instead of integer Done. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch, YARN-2933-7.patch, YARN-2933-8.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-7.patch Thanks [~wangda], [~jianhe] and [~sunilg] for the reviews. Updated the patch. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch, YARN-2933-7.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275989#comment-14275989 ] Mayank Bansal commented on YARN-2933: - This test failure is not due to this patch. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch, YARN-2933-7.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269784#comment-14269784 ] Mayank Bansal commented on YARN-2933: - Thanks [~wangda] and Sunil for the review. bq. In addition to previously comment, I think we put incorrect #container for each application when setLabelContainer=true. The usedResource or current in TestProportionalPreemptionPolicy actually means used resource of nodes without label. So if we want to have labeled container in an application, we should make it stay outside of usedResource. I don't think that's needed, as the basic functionality of the test is to demonstrate that we can skip labeled containers, so I think it does not matter. bq. And testSkipLabeledContainer is fully covered by testIdealAllocationForLabels. Since we have already checked #container preempted in each application in testIdealAllocationForLabels, which implies labeled containers are ignored. Agreed. bq. A minor suggest is rename setLabelContainer to setLabeledContainer Agreed. bq. An application's(if not specified any labels during submission time) containers, may fall in to nodes where it can be labelled or not labelled. Am I correct? No, as of now containers with no labels cannot go to labeled nodes. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-6.patch Attaching patch Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267962#comment-14267962 ] Mayank Bansal commented on YARN-2933: - Thanks [~wangda] for the review. 1. Fixed, I should have used it. 2. I think the getter and setter should be there. 3. Done. 4. Done. 5. The test is fixed. 6. The findbugs warning is not due to this patch. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-5.patch Updating patch. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-4.patch Thanks [~wangda] for review. I updated the patch based on the comments. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-3.patch Fixing javadoc warnings Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14257717#comment-14257717 ] Mayank Bansal commented on YARN-2933: - These findbugs warnings are not due to this patch. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-2.patch Thanks [~wangda] for the review. Makes sense. Updating the patch. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252629#comment-14252629 ] Mayank Bansal commented on YARN-2933: - These findbugs warnings and the test failure are not due to this patch. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal reassigned YARN-2933: --- Assignee: Mayank Bansal (was: Wangda Tan) Taking it over Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-1.patch Attaching patch for avoiding preemption for labeled containers. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179377#comment-14179377 ] Mayank Bansal commented on YARN-2647: - Hi [~sunilg], are you still working on this? Can I take it over if you are not looking at it? Thanks, Mayank Add yarn queue CLI to get queue info including labels of such queue --- Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2698) Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal reassigned YARN-2698: --- Assignee: Mayank Bansal (was: Wangda Tan) Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI --- Key: YARN-2698 URL: https://issues.apache.org/jira/browse/YARN-2698 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Mayank Bansal YARN RMAdminCLI and AdminService should have write API only, for other read APIs, they should be located at YARNCLI and RMClientService. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2698) Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179379#comment-14179379 ] Mayank Bansal commented on YARN-2698: - Taking it over. Thanks, Mayank Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI --- Key: YARN-2698 URL: https://issues.apache.org/jira/browse/YARN-2698 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Mayank Bansal YARN RMAdminCLI and AdminService should have write API only, for other read APIs, they should be located at YARNCLI and RMClientService. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2598) GHS should show N/A instead of null for the inaccessible information
[ https://issues.apache.org/jira/browse/YARN-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165715#comment-14165715 ] Mayank Bansal commented on YARN-2598: - Committed to branch-2, branch-2.6, and trunk. Thanks [~zjshen]. Thanks, Mayank GHS should show N/A instead of null for the inaccessible information Key: YARN-2598 URL: https://issues.apache.org/jira/browse/YARN-2598 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2598.1.patch, YARN-2598.2.patch When the user doesn't have access to an application, the app attempt information is not visible to the user. ClientRMService will output N/A, but GHS is showing null, which is not user-friendly. {code} 14/09/24 22:07:20 INFO impl.TimelineClientImpl: Timeline service address: http://nn.example.com:8188/ws/v1/timeline/ 14/09/24 22:07:20 INFO client.RMProxy: Connecting to ResourceManager at nn.example.com/240.0.0.11:8050 14/09/24 22:07:21 INFO client.AHSProxy: Connecting to Application History server at nn.example.com/240.0.0.11:10200 Application Report : Application-Id : application_1411586934799_0001 Application-Name : Sleep job Application-Type : MAPREDUCE User : hrt_qa Queue : default Start-Time : 1411586956012 Finish-Time : 1411586989169 Progress : 100% State : FINISHED Final-State : SUCCEEDED Tracking-URL : null RPC Port : -1 AM Host : null Aggregate Resource Allocation : N/A Diagnostics : null {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
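The change being committed is essentially a null-to-placeholder substitution when building a report for a caller who cannot see attempt details. A hedged sketch of that idea follows; the helper class and method names are invented here, and the real change lives in the history server's report-conversion code rather than in a standalone utility.

{code}
/**
 * Sketch of the null-to-"N/A" substitution described above. Illustrative only;
 * not the actual class or method names used by the fix.
 */
final class ReportFields {
  static final String NOT_AVAILABLE = "N/A";

  private ReportFields() {
  }

  /** Returns the value itself, or "N/A" when the value is not visible/available. */
  static String orNotAvailable(String value) {
    return value == null ? NOT_AVAILABLE : value;
  }
}

// Usage (illustrative): wrap every field that may be hidden from the caller.
//   String trackingUrl = ReportFields.orNotAvailable(maybeNullTrackingUrl);
//   String amHost      = ReportFields.orNotAvailable(maybeNullAmHost);
//   String diagnostics = ReportFields.orNotAvailable(maybeNullDiagnostics);
{code}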
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165795#comment-14165795 ] Mayank Bansal commented on YARN-2320: - Thanks [~zjshen] for the patch; overall it looks OK. A couple of points: 1) I think attempt and container, too, should have N/A instead of null. If you want to do it in a separate JIRA, that's fine too. 2) The latest patch needs rebasing. 3) What testing have you done on this patch? Once I have the rebased patch I will run the tests. Thanks, Mayank Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch, YARN-2320.2.patch, YARN-2320.3.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2670) Adding feedback capability to capacity scheduler from external systems
Mayank Bansal created YARN-2670: --- Summary: Adding feedback capability to capacity scheduler from external systems Key: YARN-2670 URL: https://issues.apache.org/jira/browse/YARN-2670 Project: Hadoop YARN Issue Type: New Feature Reporter: Mayank Bansal Assignee: Mayank Bansal The sheer growth in data volume and Hadoop cluster size make it a significant challenge to diagnose and locate problems in a production-level cluster environment efficiently and within a short period of time. Oftentimes, the distributed monitoring systems are not capable of detecting a problem well in advance when a large-scale Hadoop cluster starts to deteriorate in performance or becomes unavailable. Thus, incoming workloads, scheduled between the time when the cluster starts to deteriorate and the time when the problem is identified, suffer from longer execution times. As a result, both reliability and throughput of the cluster are reduced significantly. We address this problem by proposing a system called Astro, which consists of a predictive model and an extension to the Capacity scheduler. The predictive model in Astro takes into account a rich set of cluster behavioral information that is collected by monitoring processes and models it using machine learning algorithms to predict future behavior of the cluster. The Astro predictive model detects anomalies in the cluster and also identifies a ranked set of metrics that have contributed the most towards the problem. The Astro scheduler uses the prediction outcome and the list of metrics to decide whether it needs to move and reduce workloads from the problematic cluster nodes or to prevent additional workload allocations to them, in order to improve both throughput and reliability of the cluster. This JIRA is only for adding feedback capabilities to the Capacity Scheduler so that it can take feedback from external systems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152559#comment-14152559 ] Mayank Bansal commented on YARN-2320: - I think overall it looks OK; however, I still have to run it. Some small comments: shouldn't we use N/A in convertToApplicationAttemptReport instead of null? Similarly for convertToApplicationReport? Similarly for convertToContainerReport? Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch, YARN-2320.2.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2459: Attachment: YARN-2459-2.patch Attaching patch after offline discussion with [~jianhe] Thanks, Mayank RM crashes if App gets rejected for any reason and HA is enabled Key: YARN-2459 URL: https://issues.apache.org/jira/browse/YARN-2459 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-2459-1.patch, YARN-2459-2.patch If RM HA is enabled and used Zookeeper store for RM State Store. If for any reason Any app gets rejected and directly goes to NEW to FAILED then final transition makes that to RMApps and Completed Apps memory structure but that doesn't make it to State store. Now when RMApps default limit reaches it starts deleting apps from memory and store. In that case it try to delete this app from store and fails which causes RM to crash. Thanks, Mayank -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
Mayank Bansal created YARN-2459: --- Summary: RM crashes if App gets rejected for any reason and HA is enabled Key: YARN-2459 URL: https://issues.apache.org/jira/browse/YARN-2459 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.5.0 If RM HA is enabled and used Zookeeper store for RM State Store. If for any reason Any app gets rejected and directly goes to NEW to FAILED then final transition makes that to RMApps and Completed Apps memory structure but that doesn't make it to State store. Now when RMApps default limit reaches it starts deleting apps from memory and store. In that case it try to delete this app from store and fails which causes RM to crash. Thanks, Mayank -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2459: Attachment: YARN-2459-1.patch Updating patch, adding the app to the state store on the app-reject event to make RM memory and the state store consistent. Thanks, Mayank RM crashes if App gets rejected for any reason and HA is enabled Key: YARN-2459 URL: https://issues.apache.org/jira/browse/YARN-2459 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-2459-1.patch If RM HA is enabled and used Zookeeper store for RM State Store. If for any reason Any app gets rejected and directly goes to NEW to FAILED then final transition makes that to RMApps and Completed Apps memory structure but that doesn't make it to State store. Now when RMApps default limit reaches it starts deleting apps from memory and store. In that case it try to delete this app from store and fails which causes RM to crash. Thanks, Mayank -- This message was sent by Atlassian JIRA (v6.2#6252)
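The fix described in this update is about keeping the state store and the RM's in-memory completed-apps list consistent, so that a later purge never tries to delete state that was never written. A hedged sketch of that invariant is below; the class, interface, and method names are invented for illustration and are not the actual RMAppImpl/RMStateStore code paths.

{code}
/**
 * Illustrative sketch: an application rejected straight from NEW to FAILED is
 * still recorded in the state store, so a later removal from both memory and
 * store cannot fail (e.g. with a ZooKeeper NoNodeException) and crash the RM.
 * Names are invented for illustration.
 */
class AppRejectionHandler {

  /** Minimal stand-in for the RM state store, for illustration only. */
  interface StateStore {
    void storeApplication(String appId, String finalState);
    void removeApplication(String appId);
  }

  private final StateStore store;

  AppRejectionHandler(StateStore store) {
    this.store = store;
  }

  /** On rejection, persist the app before it joins the completed-apps list. */
  void onAppRejected(String appId, String diagnostics) {
    // Without this write the app exists only in memory; purging it later would
    // attempt to delete state that was never stored and fail.
    store.storeApplication(appId, "FAILED: " + diagnostics);
  }

  /** Later, when the completed-apps limit is exceeded, removal is now safe. */
  void onAppPurged(String appId) {
    store.removeApplication(appId);
  }
}
{code}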
[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2459: Description: If RM HA is enabled and used Zookeeper store for RM State Store. If for any reason Any app gets rejected and directly goes to NEW to FAILED then final transition makes that to RMApps and Completed Apps memory structure but that doesn't make it to State store. Now when RMApps default limit reaches it starts deleting apps from memory and store. In that case it try to delete this app from store and fails which causes RM to crash. Stack Trace 2014-08-24 18:43:04,603 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node phxaishdc9dn0360.phx.ebay.com:58458 is reserved by applica tion appattempt_1408727267637_12984_01 2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1408727267637_12984 on node: ph xaishdc9dn0816.phx.ebay.com:50443 2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Application application_1408727267637_12984 reserved container container_1408727267637_1 2984_01_003215 on node host: phxaishdc9dn0816.phx.ebay.com:50443 #containers=17 available=4224 used=63360, currently has 310 at priority 10; currentReservation 2618880 2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Updated reserved container container_1408727267637_12984_01_003215 on node host: phxai shdc9dn0816.phx.ebay.com:50443 #containers=17 available=4224 used=63360 for application org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp@2da03710 2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Reserved container application=application_1408727267637_12984 resource=memory:8448, vCores:1 queue=hdmi-set: capacity=0.2, absoluteCapacity=0.2, usedResources=memory:34293248, vCores:7092usedCapacity=1.4031365, absoluteUsedCapacity=0.28062728, numApps=12, numContainers=7092 usedCapacity=1.403 1365 absoluteUsedCapacity=0.28062728 used=memory:34293248, vCores:7092 cluster=memory:122202112, vCores:14584 2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node phxaishdc9dn0816.phx.ebay.com:50443 is reserved by applica tion appattempt_1408727267637_12984_01 2014-08-24 18:43:04,614 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:852) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:849) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:948) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:967) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:849) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:642) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:181) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:167) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:832) at
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-10.patch Fixing small bug. Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-10.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch, YARN-2069-trunk-9.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2032: Attachment: YARN-2032-branch2-2.patch Attaching updated patch for branch-2 Thanks, Mayank Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-branch-2-1.patch, YARN-2032-branch2-2.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081193#comment-14081193 ] Mayank Bansal commented on YARN-2069: - Hi [~wangda] , Thanks for your review comments. Updating the patch with the fix. Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-8.patch CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074940#comment-14074940 ] Mayank Bansal commented on YARN-2069: - Hi [~wangda], Thanks for your review comments. Let me explain what this algorithm is doing. Say queueA has 30% of the cluster capacity allocated to it but is currently using 50% of the cluster. Queue A has 5 users with a 20% user limit, so each user is using 10% of the cluster's capacity. Another queue, queueB, has 70% allocated capacity and is using 50%. Now another application is submitted to Queue B that needs 10% capacity, so 10% has to be claimed back from queue A: resToObtain = 10%, and the targeted user limit becomes 8% (this is always calculated from how much we need to claim back from the users). Based on the current algorithm, it takes 2% of resources from every user and leaves the balance with each of them. This also holds when users are not using equal amounts of resources: the algorithm takes more from the users who are using more, until each is balanced down to the targeted user limit. The algorithm also preempts the application that was submitted last; if user1 has 2 applications, it tries to take the maximum number of containers from the last-submitted application while leaving its AM container behind, though the user limit is honoured across all of that user's applications in the queue combined. The algorithm does not remove an AM container unless absolutely needed; it takes all the task containers first and only then considers AM containers for preemption. Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
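To make the arithmetic above concrete, here is a small, self-contained sketch of the per-user claw-back just described; the class, method, and map-based bookkeeping are illustrative only and are not the patch's actual code:
{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: trim each user toward the targeted user limit, taking
// more from users that are further above it, until resToObtain is satisfied.
public class UserClawbackSketch {
  static Map<String, Integer> computeClawback(Map<String, Integer> usedByUser,
      int targetUserLimit, int resToObtain) {
    Map<String, Integer> toPreempt = new LinkedHashMap<>();
    int remaining = resToObtain;
    for (Map.Entry<String, Integer> e : usedByUser.entrySet()) {
      if (remaining <= 0) {
        break;
      }
      // Only users above the targeted user limit give resources back.
      int take = Math.min(Math.max(0, e.getValue() - targetUserLimit), remaining);
      if (take > 0) {
        toPreempt.put(e.getKey(), take);
        remaining -= take;
      }
    }
    return toPreempt;
  }

  public static void main(String[] args) {
    // Queue A: 5 users at 10% each, targeted user limit 8%, resToObtain 10%.
    Map<String, Integer> used = new LinkedHashMap<>();
    for (int i = 1; i <= 5; i++) {
      used.put("user" + i, 10);
    }
    // Each user is 2% over the 8% target, so 2% is taken from each of them.
    System.out.println(computeClawback(used, 8, 10));
  }
}
{code}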
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073727#comment-14073727 ] Mayank Bansal commented on YARN-2069: - Thanks [~vinodkv] for the review. I have changed the patch to work from the targeted capacity for the queue, balancing it out against the users' resources. I also collapsed the two passes into a single pass. Please review it. Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-6.patch CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-7.patch Updated patch Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1408: Fix Version/s: 2.6.0 2.5.0 Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0, 2.6.0 Attachments: YARN-1408-branch-2.5-1.patch, Yarn-1408.1.patch, Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14062698#comment-14062698 ] Mayank Bansal commented on YARN-1408: - +1 Committing Thanks [~sunilg] for the patch. Thanks [~jianhe], [~vinodkv] and [~wangda] for the reviews. Thanks, Mayank Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14062879#comment-14062879 ] Mayank Bansal commented on YARN-1408: - Committed to trunk, branch 2 and branch-2.5. branch-2.5 needed some rebase , Updating the patch for branch-2.5 Thanks, Mayank Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-1408-branch-2.5-1.patch, Yarn-1408.1.patch, Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1408: Attachment: YARN-1408-branch-2.5-1.patch rebasing against branch 2.5 Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-1408-branch-2.5-1.patch, Yarn-1408.1.patch, Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060929#comment-14060929 ] Mayank Bansal commented on YARN-1408: - On [~jianhe]'s point: I think it's good to check that schedulerAttempt is not null before accessing it. Makes sense? Thanks, Mayank Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.10.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
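For illustration, a minimal sketch of the null guard being suggested; the lookup method name is hypothetical and only stands in for however the scheduler resolves the attempt:
{code}
// Sketch only: skip handling the container event if the scheduler no longer
// tracks the attempt (for example, it was already removed after preemption).
SchedulerApplicationAttempt schedulerAttempt =
    lookupAttemptForContainer(containerId);   // hypothetical lookup
if (schedulerAttempt == null) {
  LOG.info("Ignoring container " + containerId
      + " because its application attempt is no longer active");
  return;
}
schedulerAttempt.containerLaunchedOnNode(containerId, nodeId);
{code}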
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061320#comment-14061320 ] Mayank Bansal commented on YARN-1408: - [~sunilg], Can you check these test failures? Thanks, Mayank Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14059104#comment-14059104 ] Mayank Bansal commented on YARN-1408: - Thanks [~sunilg] for the patch. The patch looks good; can you check these test failures? Thanks, Mayank Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057170#comment-14057170 ] Mayank Bansal commented on YARN-2069: - I just verified: I rebased the patch, compiled, and tested it. The patch doesn't seem to be the problem. Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057200#comment-14057200 ] Mayank Bansal commented on YARN-1408: - Thanks [~sunilg] for the patch. The patch looks good; there are some minor comments: 1. Your current patch does not apply on trunk, please rebase on trunk. 2. There are a lot of unwanted formatting changes; can you please revert them? Some examples are as follows
{code}
- .currentTimeMillis());
+.currentTimeMillis());
{code}
{code}
-RMContainer rmContainer =
-new RMContainerImpl(container, attemptId, node.getNodeID(),
- applications.get(attemptId.getApplicationId()).getUser(), rmContext,
- status.getCreationTime());
+RMContainer rmContainer = new RMContainerImpl(container, attemptId,
+node.getNodeID(), applications.get(attemptId.getApplicationId())
+.getUser(), rmContext, status.getCreationTime());
{code}
Please check for this throughout the patch. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-5.patch Thanks [~wangda] and [~sunilg] for the review. I have addressed all of [~wangda]'s comments except the test-case one; as I discussed offline with wangda, the current test cases already cover both of the scenarios he explained. [~sunilg], I think [~wangda] already addressed your comments. Please review the latest patch. Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054648#comment-14054648 ] Mayank Bansal commented on YARN-2022: - Merged to Branch 2.5 Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0 Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14055598#comment-14055598 ] Mayank Bansal commented on YARN-2069: - Hi [~wangda], Good point, I missed it. I have updated the patch accordingly; please review. Hi [~sunilg], Previously as well we did not wait for one more cycle before starting preemption: the policy drops the reservation, counts it against resToObtain, and returns the rest of the containers to preempt, and I am following the same pattern. So essentially I drop the reservations, then try to balance the queue against user limits, and then take the remaining containers and send them for preemption. Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
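As a rough illustration of that ordering (not the actual ProportionalCapacityPreemptionPolicy code; dropReservation and preemptFromRunningContainers are hypothetical helpers):
{code}
// Sketch only: reservations are dropped first and counted against the amount
// to claim back; only the remainder is taken from running containers, which
// are then balanced against the targeted user limit for the queue.
Resource remaining = Resources.clone(resToObtain);
for (RMContainer reserved : reservedContainers) {
  if (Resources.lessThanOrEqual(rc, clusterResource, remaining, Resources.none())) {
    break;   // nothing left to claim back
  }
  dropReservation(reserved);                          // hypothetical helper
  Resources.subtractFrom(remaining, reserved.getReservedResource());
}
preemptFromRunningContainers(remaining);              // hypothetical helper
{code}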
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-4.patch CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053959#comment-14053959 ] Mayank Bansal commented on YARN-2069: - Hi [~wangda], Thanks for the review. I updated the patch, please take a look. Let me answer your questions.
bq. In ProportionalCapacityPreemptionPolicy,
bq. 1) balanceUserLimitsinQueueForPreemption()
bq. 1.1, I think there's a bug when multiple applications under a same user (say Jim) in a queue, and usage of Jim is over user-limit. Any of Jim's applications will be tried to be preempted (total-resource-used-by-Jim - user-limit). We should remember resourcesToClaimBackFromUser and initialRes for each user (not reset them when handling each application) And it's better to add test to make sure this behavior is correct.
We need to maintain the reverse order of application submission, which can only be done by iterating through the applications, because we want to preempt the last-submitted applications first.
bq. 1.2, Some debug logging should be removed like
Done
bq. 1.3, This check should be unnecessary
Done
bq. 2) preemptFrom
bq. I noticed this method will be called multiple times for a same application within a editSchedule() call.
bq. The reservedContainers will be calculated multiple times.
bq. An alternative way to do this is to cache
This method is effectively executed only once per application: we remove all reservations up front, and for apps whose reservations have already been removed the later calls are a no-op.
bq. In LeafQueue,
bq. 1) I think it's better to remember user limit, no need to compute it every time, add a method like getUserLimit() to leafQueue should be better.
That value is not static; it changes every time based on cluster utilization, which is why I calculate it each time.
bq. 1) Should we preempt containers equally from users when there're multiple users beyond user-limit in a queue?
No; it should be based on who is over the user limit and who submitted last. That is not strictly fair, but we want to preempt the last-submitted jobs first.
bq. 2) Should we preempt containers equally from applications in a same user? (Heap-like data structure maybe helpful to solve 1/2)
No, for the same reason as above.
bq. 3) Should user-limit preemption be configurable?
I think just configuring preemption itself is enough. Thoughts?
Thanks, Mayank Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
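A small sketch of the two ordering rules restated above (last-submitted application first, AM containers only as a last resort); the application collection and the markForPreemption helper are hypothetical, not the patch itself:
{code}
// Sketch only: walk the queue's applications in reverse submission order and,
// within an application, preempt task containers before touching its AM.
List<FiCaSchedulerApp> apps = new ArrayList<>(queueApplications);
Collections.reverse(apps);                  // last-submitted application first
for (FiCaSchedulerApp app : apps) {
  for (RMContainer c : app.getLiveContainers()) {
    if (c.isAMContainer()) {
      continue;                             // AM is preempted only as a last resort
    }
    markForPreemption(c);                   // hypothetical helper
  }
}
{code}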
[jira] [Updated] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2113: Summary: Add cross-user preemption within CapacityScheduler's leaf-queue (was: CS queue level preemption should respect user-limits) Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2113 URL: https://issues.apache.org/jira/browse/YARN-2113 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 2.5.0 This is different from (even if related to, and likely share code with) YARN-2069. YARN-2069 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Summary: CS queue level preemption should respect user-limits (was: Add cross-user preemption within CapacityScheduler's leaf-queue) CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2113: Description: Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. (was: This is different from (even if related to, and likely share code with) YARN-2069. YARN-2069 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities.) Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2113 URL: https://issues.apache.org/jira/browse/YARN-2113 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 2.5.0 Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Description: This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. was:Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-3.patch Rebasing and Updating the patch. Thanks, Mayank Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049529#comment-14049529 ] Mayank Bansal commented on YARN-2022: - + 1 committing Thanks [~sunilg] for the patch. Thanks [~vinodkv] and [~wangda] for the reviews. Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14048308#comment-14048308 ] Mayank Bansal commented on YARN-2022: - Thanks [~sunilg] for the patch.
{code}
public void setAMContainer(boolean isAMContainer) {
  this.isAMContainer = isAMContainer;
}
{code}
There should be a write lock around it as well. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
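A minimal sketch of what that suggestion could look like, assuming the same writeLock field that RMContainerImpl's other mutators use (illustrative, not the committed change):
{code}
// Sketch only: guard the mutable flag with the write lock so readers of
// isAMContainer() observe a consistent value.
public void setAMContainer(boolean isAMContainer) {
  try {
    writeLock.lock();
    this.isAMContainer = isAMContainer;
  } finally {
    writeLock.unlock();
  }
}
{code}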
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045386#comment-14045386 ] Mayank Bansal commented on YARN-2022: - You are using getAbsoluteMaximumCapacity() in your patch while calculating the AM resources, which seems wrong to me. I think you should be using getAbsoluteCapacity(), which is the configured capacity of the queue, not the max capacity of the queue. Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044217#comment-14044217 ] Mayank Bansal commented on YARN-2022: - Hi [~sunilg], If we don't use getAbsoluteCapacity() there is a possibility that the queue ends up running only AMs. Say the queue has 10% configured capacity, a max capacity of 100%, and a 10% AM percentage; with your approach 10 AMs could run in this queue, and if the cluster is fully utilized then only AMs would be running in this queue. Makes sense? Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
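To spell out the arithmetic behind this concern, here is a tiny self-contained example with assumed numbers (the 160 GB cluster size is illustrative; the getter names from the discussion are only echoed in comments):
{code}
// Sketch only: compare the AM resource headroom computed from the queue's
// max capacity versus its configured capacity. For a 10% queue on a 160 GB
// cluster, sizing by max capacity (100%) gives as much AM headroom as the
// queue's whole guaranteed share, so the queue can fill up with AMs alone.
public class AmLimitExample {
  public static void main(String[] args) {
    int clusterMB = 160 * 1024;
    float absCapacity = 0.10f;      // getAbsoluteCapacity()
    float absMaxCapacity = 1.0f;    // getAbsoluteMaximumCapacity()
    float amFraction = 0.10f;       // maximum AM resource percent

    int amLimitByMax = (int) (clusterMB * absMaxCapacity * amFraction); // 16384 MB
    int amLimitByCap = (int) (clusterMB * absCapacity * amFraction);    //  1638 MB
    System.out.println("AM limit via max capacity:        " + amLimitByMax + " MB");
    System.out.println("AM limit via configured capacity: " + amLimitByCap + " MB");
  }
}
{code}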
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044218#comment-14044218 ] Mayank Bansal commented on YARN-2181: - If we are adding this information to the web UI, then we should add it to the CLI and REST APIs as well. It would be inconsistent to add this info only to the web UI without changing the CLI/REST. Thanks, Mayank Add preemption info to RM Web UI Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page.png, queue page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app/queue, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-2.patch Thanks for the review. I don't have an easy way right now to separate YARN-2022 from this patch and run it through Jenkins, since I am changing the same code. I will rebase this patch once YARN-2022 is committed. Thanks, Mayank Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041435#comment-14041435 ] Mayank Bansal commented on YARN-2022: - Hi [~vinodkv], Is it OK with you if we commit this patch, given the concerns you raised before? I think we still need to avoid killing AMs, even once we have a patch that keeps applications alive when their AM gets killed. Please suggest. Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal reassigned YARN-2069: --- Assignee: Mayank Bansal (was: Vinod Kumar Vavilapalli) Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041617#comment-14041617 ] Mayank Bansal commented on YARN-2069: - Taking it over. Thanks, Mayank Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-1.patch Attaching patch Thanks, Mayank Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032918#comment-14032918 ] Mayank Bansal commented on YARN-2022: - Hi [~sunilg] Thanks for the patch. Overall it looks OK; however, I think we need to add a test case for the AM percentage per queue as well. Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030143#comment-14030143 ] Mayank Bansal commented on YARN-2022: - Hi [~vinodkv] What you are saying makes sense and I agree with it; however, I think we still need this patch, as it will ensure AMs are given the least priority for killing. Thoughts? [~sunilg] Thanks for the patch. Here are some high-level comments. {code} + public static final String SKIP_AM_CONTAINER_FROM_PREEMPTION = yarn.resourcemanager.monitor.capacity.preemption.skip_am_container; {code} Please run the formatter; this doesn't seem to be the standard line length. {code} +skipAMContainer = config.getBoolean(SKIP_AM_CONTAINER_FROM_PREEMPTION, +false); {code} By default it should be true, as we always want the AM to have the least priority. Did you run the test on the cluster? Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
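For illustration only, here is a minimal sketch (not taken from the attached patch) of reading that flag with a default of true via the standard Hadoop Configuration API; the constant mirrors the name quoted above, while the enclosing class is hypothetical:
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch only: read the skip-AM flag with a default of true, as suggested in the
// review comment above. The constant mirrors the name quoted from the patch; the
// enclosing class exists purely for illustration.
public class SkipAmPreemptionConfig {
  public static final String SKIP_AM_CONTAINER_FROM_PREEMPTION =
      "yarn.resourcemanager.monitor.capacity.preemption.skip_am_container";

  private final boolean skipAMContainer;

  public SkipAmPreemptionConfig(Configuration config) {
    // Defaulting to true keeps AM containers as the last preemption candidates
    // unless an operator explicitly opts out.
    this.skipAMContainer = config.getBoolean(SKIP_AM_CONTAINER_FROM_PREEMPTION, true);
  }

  public boolean shouldSkipAmContainers() {
    return skipAMContainer;
  }
}
{code}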
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14028174#comment-14028174 ] Mayank Bansal commented on YARN-2022: - Hi [~sunilg] Thanks for the patch. There is a small addition we need to make to the approach: we need to take the following parameters into account: yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity.queue-path.maximum-am-resource-percent. If the user has set yarn.scheduler.capacity.queue-path.maximum-am-resource-percent on the queue, then we cannot preempt an AM even if we haven't reached the full resource need from the queue. If the user didn't set that queue-level setting, then we need to check that we are not violating the yarn.scheduler.capacity.maximum-am-resource-percent constraint as well. If these two constraints are not violated and we still have some AMs that we need to kill, then yes, we can go with the approach you put in your patch. Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
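To make the ordering of those checks concrete, here is a rough, hedged sketch; the helper methods and the way the AM share is computed are assumptions for illustration only, not the CapacityScheduler API or the patch:
{code}
// Hypothetical sketch of the check ordering described in the comment above.
// None of these helpers exist in the CapacityScheduler API; they only
// illustrate the order in which the two AM-resource-percent constraints
// would be consulted before AM containers become preemption candidates.
boolean amPreemptionAllowed(String queuePath) {
  Float queueLimit = perQueueMaxAmPercentOrNull(queuePath); // <queue-path>.maximum-am-resource-percent, if configured
  if (queueLimit != null) {
    // A per-queue AM limit is configured: respect it and leave AMs alone.
    return false;
  }
  // Otherwise, only consider AMs once the cluster-wide constraint
  // (yarn.scheduler.capacity.maximum-am-resource-percent) would not be violated
  // and non-AM containers alone cannot satisfy the resource need.
  return !violatesClusterMaxAmPercent(queuePath) && nonAmContainersInsufficient(queuePath);
}
{code}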
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026789#comment-14026789 ] Mayank Bansal commented on YARN-2022: - Hi [~sunilg] Thanks for the update. We are in a rush to push the release; is there any possibility you can put up this simple patch today? If not, do you mind if I put up the patch? Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026795#comment-14026795 ] Mayank Bansal commented on YARN-2022: - for user limits there is already a jira YARN-2113 and I think [~wangda] is working on it. Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025581#comment-14025581 ] Mayank Bansal commented on YARN-2022: - Hi [~sunilg] and [~curino] [~vinodkv] and I were discussing making this simple: if we just don't kill AM containers, that would be easier and would work well. I think many frameworks (MR, Tez, etc.) depend on the last AM attempt. The only problem is if the queue is running only AMs; I think that can be avoided by the AM percentage per queue. Thoughts? Thanks, Mayank Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006459#comment-14006459 ] Mayank Bansal commented on YARN-2074: - Thanks [~jianhe] for the patch. Overall looks good. some nits {code} maxAppAttempts = attempts.size() {code} Can we use this? {code} maxAppAttempts == getAttemptFailureCount() {code} {code} public boolean isPreempted() { return getDiagnostics().contains(SchedulerUtils.PREEMPTED_CONTAINER); } {code} I think we need to compare the exit status (-102) instead of relying on string message. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
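As a minimal sketch of the suggested change, assuming the finished container's status object is available at that point (the standard ContainerExitStatus.PREEMPTED constant is -102); how the patch actually obtains the status is not shown here:
{code}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Sketch of the suggested check: compare the exit status against the
// PREEMPTED constant (-102) instead of matching on the diagnostics string.
public final class PreemptionCheck {
  private PreemptionCheck() {}

  public static boolean isPreempted(ContainerStatus status) {
    return status.getExitStatus() == ContainerExitStatus.PREEMPTED;
  }
}
{code}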
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006548#comment-14006548 ] Mayank Bansal commented on YARN-1408: - I agree with [~jianhe] and [~devaraj.k]. We should be able to preempt the container in the ALLOCATED state. bq. Today the resource request is decremented when container is allocated. we may change it to decrement the resource request only when the container is pulled by the AM ? I am not sure that's the right thing, as you don't want to run into other race conditions where a container has been allocated but the capacity has been given to some other AM. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capacity has been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006735#comment-14006735 ] Mayank Bansal commented on YARN-2074: - +1 LGTM Thanks, Mayank Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998941#comment-13998941 ] Mayank Bansal commented on YARN-2055: - YARN-2022 is about avoiding killing the AM; this issue is more about how we launch the AM after preemption, as there would be situations where you get some capacity for one heartbeat, then that capacity is reclaimed by the other queue, the AM is killed again, and the job fails. Based on the comments on YARN-2022, I don't see that this case has been handled there. Thanks, Mayank Preemption: Jobs are failing due to AMs are getting launched and killed multiple times -- Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal If Queue A does not have enough capacity to run AM, then AM will borrow capacity from queue B to run AM in that case AM will be killed if queue B will reclaim its capacity and again AM will be launched and killed again, in that case job will be failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2055) Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2055: Assignee: (was: Sunil G) Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times - Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Fix For: 2.1.0-beta Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2055) Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times
Mayank Bansal created YARN-2055: --- Summary: Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Sunil G Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2056: Description: We need to be able to disable preemption at individual queue level (was: If Queue A does not have enough capacity to run AM, then AM will borrow capacity from queue B to run AM in that case AM will be killed if queue B will reclaim its capacity and again AM will be launched and killed again, in that case job will be failed.) Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Fix For: 2.1.0-beta We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.2#6252)
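As a rough illustration of the requested per-queue switch, a sketch like the following could read such a flag; the property name yarn.scheduler.capacity.<queue-path>.disable_preemption is an assumption here, used only to show the shape of the feature, not a committed configuration key:
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch only: reading a hypothetical per-queue preemption switch, e.g.
//   yarn.scheduler.capacity.root.queueA.disable_preemption = true
// The property name is an assumption used to illustrate the feature request.
public final class QueuePreemptionSwitch {
  private static final String PREFIX = "yarn.scheduler.capacity.";
  private static final String SUFFIX = ".disable_preemption";

  private QueuePreemptionSwitch() {}

  public static boolean isPreemptionDisabled(Configuration conf, String queuePath) {
    // Preemption stays enabled unless the queue explicitly opts out.
    return conf.getBoolean(PREFIX + queuePath + SUFFIX, false);
  }
}
{code}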
[jira] [Updated] (YARN-2055) Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2055: Description: If Queue A does not have enough capacity to run AM, then AM will borrow capacity from queue B to run AM in that case AM will be killed if queue B will reclaim its capacity and again AM will be launched and killed again, in that case job will be failed. (was: Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs.) Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times - Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Fix For: 2.1.0-beta If Queue A does not have enough capacity to run AM, then AM will borrow capacity from queue B to run AM in that case AM will be killed if queue B will reclaim its capacity and again AM will be launched and killed again, in that case job will be failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2032: Attachment: YARN-2032-branch-2-1.patch Updating patch for branch-2 Thanks, Mayank Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-branch-2-1.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal reassigned YARN-2032: --- Assignee: Mayank Bansal (was: Vinod Kumar Vavilapalli) Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993000#comment-13993000 ] Mayank Bansal commented on YARN-2032: - Taking it over, as I am already working on it. Thanks, Mayank Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-304) RM Tracking Links for purged applications needs a long-term solution
[ https://issues.apache.org/jira/browse/YARN-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-304: --- Attachment: YARN-304-1.patch Attaching patch Thanks, Mayank RM Tracking Links for purged applications needs a long-term solution Key: YARN-304 URL: https://issues.apache.org/jira/browse/YARN-304 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.5 Reporter: Derek Dagit Assignee: Mayank Bansal Attachments: YARN-304-1.patch This JIRA is intended to track a proper long-term fix for the issue described in YARN-285. The following is from the original description: As applications complete, the RM tracks their IDs in a completed list. This list is routinely truncated to limit the total number of application remembered by the RM. When a user clicks the History for a job, either the browser is redirected to the application's tracking link obtained from the stored application instance. But when the application has been purged from the RM, an error is displayed. In very busy clusters the rate at which applications complete can cause applications to be purged from the RM's internal list within hours, which breaks the proxy URLs users have saved for their jobs. We would like the RM to provide valid tracking links persist so that users are not frustrated by broken links. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1809) Synchronize RM and Generic History Service Web-UIs
[ https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13946731#comment-13946731 ] Mayank Bansal commented on YARN-1809: - Thanks [~zjshen] for the patch bq. Yes, I think it should, but I prefer to put it in ApplicationBaseProtocol when ApplicationHistoryClientService has implemented DT related methods. The history protocol already has these methods, so we don't need to wait, as they have a dummy implementation for that. bq. ApplicationBaseProtocol and ApplicationContext are completely different things. ApplicationBaseProtocol is the PRC interface. Previously, I thought we should have a uniformed ApplicationContext: on the RM side, it wraps RMContext; while on the AHS side, it wraps ApplicationHistory. However, inspired by RMWebServices#getApps, I think the RPC interface is a better place to uniform the way of retrieving app info, so I created ApplicationBaseProtocol. And ApplicationContext is no longer useful. ApplicationBaseProtocol would be the base protocol for the client and history services; however, the application context is something different. The motivation for the context is to wrap RM and AHS application data, so I think having the context makes sense, as the protocol has a totally different motivation and will have different methods as well once we add the delegation methods to it. bq. I understand the big patch is desperate for review, but I've to do that because the patch is aiming to refactor the code to avoid duplicate web-UI code for RM and for AHS. The two webUI should share the common code path, and then display similarly. I am fine with this if this is something you want to do. {code} + * <p> + * The protocol between clients and the <code>ResourceManager</code> or + * <code>ApplicationHistoryServer</code> to get information on applications, + * application attempts and containers. + * </p> {code} This should say that it is a base protocol for the application client and history. Shouldn't we add @Idempotent to getallapplications as well? If we add the application context back, then we need to rebase the patch accordingly. Synchronize RM and Generic History Service Web-UIs -- Key: YARN-1809 URL: https://issues.apache.org/jira/browse/YARN-1809 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch, YARN-1809.6.patch After YARN-953, the web-UI of generic history service provides more information than that of RM, the details about app attempt and container. It's good to provide similar web-UIs, but retrieve the data from separate source, i.e., RM cache and history store respectively. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-304) RM Tracking Links for purged applications needs a long-term solution
[ https://issues.apache.org/jira/browse/YARN-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal reassigned YARN-304: -- Assignee: Mayank Bansal (was: Zhijie Shen) RM Tracking Links for purged applications needs a long-term solution Key: YARN-304 URL: https://issues.apache.org/jira/browse/YARN-304 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.5 Reporter: Derek Dagit Assignee: Mayank Bansal This JIRA is intended to track a proper long-term fix for the issue described in YARN-285. The following is from the original description: As applications complete, the RM tracks their IDs in a completed list. This list is routinely truncated to limit the total number of application remembered by the RM. When a user clicks the History for a job, either the browser is redirected to the application's tracking link obtained from the stored application instance. But when the application has been purged from the RM, an error is displayed. In very busy clusters the rate at which applications complete can cause applications to be purged from the RM's internal list within hours, which breaks the proxy URLs users have saved for their jobs. We would like the RM to provide valid tracking links persist so that users are not frustrated by broken links. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-304) RM Tracking Links for purged applications needs a long-term solution
[ https://issues.apache.org/jira/browse/YARN-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943545#comment-13943545 ] Mayank Bansal commented on YARN-304: Taking it over RM Tracking Links for purged applications needs a long-term solution Key: YARN-304 URL: https://issues.apache.org/jira/browse/YARN-304 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.5 Reporter: Derek Dagit Assignee: Mayank Bansal This JIRA is intended to track a proper long-term fix for the issue described in YARN-285. The following is from the original description: As applications complete, the RM tracks their IDs in a completed list. This list is routinely truncated to limit the total number of application remembered by the RM. When a user clicks the History for a job, either the browser is redirected to the application's tracking link obtained from the stored application instance. But when the application has been purged from the RM, an error is displayed. In very busy clusters the rate at which applications complete can cause applications to be purged from the RM's internal list within hours, which breaks the proxy URLs users have saved for their jobs. We would like the RM to provide valid tracking links persist so that users are not frustrated by broken links. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1809) Synchronize RM and Generic History Service Web-UIs
[ https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940879#comment-13940879 ] Mayank Bansal commented on YARN-1809: - Thanks [~zjshen] for the patch. Here are some comments: 1. Change the name from ApplicationInformationProtocol to something like ApplicationBaseProtocol. 2. Why can't we have the delegation-token-related APIs in the base protocol? 3. ApplicationHistoryClientService - why are we removing the protocol handler? I think we should keep it as it was. 4. I am not sure why we removed the ApplicationContext; I think ApplicationContext should be retained. Wouldn't it be good to have the following structure? bq. ApplicationContext derived from ApplicationBaseProtocol Thoughts? 5. There is a lot of refactoring in the patch, which is good, but we could have separated it into two JIRAs, which would keep the changes focused on the specific issue. Thoughts? Synchronize RM and Generic History Service Web-UIs -- Key: YARN-1809 URL: https://issues.apache.org/jira/browse/YARN-1809 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch After YARN-953, the web-UI of generic history service provides more information than that of RM, the details about app attempt and container. It's good to provide similar web-UIs, but retrieve the data from separate source, i.e., RM cache and history store respectively. -- This message was sent by Atlassian JIRA (v6.2#6252)
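For point 4 above, a hedged sketch of the suggested shape might look like the following; the getApplicationReport signature follows the public YARN protocol records, while the ApplicationContext interface itself is an assumption drawn from this comment, not from the attached patch:
{code}
import java.io.IOException;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationReportRequest;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationReportResponse;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch of the structure suggested in point 4: keep a shared base protocol for
// read-only application lookups, and retain ApplicationContext as a derived
// abstraction backed by RMContext on the RM side and the history store on the AHS side.
interface ApplicationBaseProtocol {
  GetApplicationReportResponse getApplicationReport(GetApplicationReportRequest request)
      throws YarnException, IOException;
  // ... other app/attempt/container lookups shared by RM and AHS
}

interface ApplicationContext extends ApplicationBaseProtocol {
  // RM implementation would wrap RMContext; AHS implementation would wrap ApplicationHistory.
}
{code}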
[jira] [Commented] (YARN-1809) Synchronize RM and Generic History Service Web-UIs
[ https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941234#comment-13941234 ] Mayank Bansal commented on YARN-1809: - I have tested this patch locally. It works OK with running apps; however, as soon as an app is finished, the URLs start giving errors, whereas they should be redirected to the AHS URLs. Thoughts? Thanks, Mayank Synchronize RM and Generic History Service Web-UIs -- Key: YARN-1809 URL: https://issues.apache.org/jira/browse/YARN-1809 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch After YARN-953, the web-UI of generic history service provides more information than that of RM, the details about app attempt and container. It's good to provide similar web-UIs, but retrieve the data from separate source, i.e., RM cache and history store respectively. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1690) Sending timeline entities+events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939598#comment-13939598 ] Mayank Bansal commented on YARN-1690: - Thanks [~zjshen] for the review bq. 1. Call it DSEvent? Done bq. 2. Chang it to Timeline Client? Done bq. 3. Typo on CLient Done bq. config is the member field of ApplicationMaster Done bq. 5. Please merge the following duplicate exception handling as well Done bq. 6. Again, please do not mention AHS here Done bq. 7. Please change publishContainerStartEvent, publishContainerEndEvent, publishApplicationAttemptEvent to static, which don't need to be per instance. Done bq. 8. Please apply for the following to all the added error logs. Done bq. 9. Please don't limit the output to 1. According to the args for this DS job, it should be 1 DS_APP_ATTEMPT entities and 2 DS_CONTAINER entities, which has 2 events each? And assert the number of returned entities/events? Done Sending timeline entities+events from Distributed shell Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1690) Sending timeline entities+events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1690: Attachment: YARN-1690-6.patch Attaching patch Thanks, Mayank Sending timeline entities+events from Distributed shell Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch, YARN-1690-6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1690) Sending timeline entities+events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1690: Attachment: YARN-1690-7.patch Attaching patch Sending timeline entities+events from Distributed shell Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch, YARN-1690-6.patch, YARN-1690-7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)