[jira] [Updated] (YARN-2932) Add entry for preemptable status to scheduler web UI and queue initialize/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2932: - Attachment: YARN-2932.v7.txt Version 7 of patch fixes new javadoc warnings. Sorry about that. Add entry for preemptable status to scheduler web UI and queue initialize/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt, YARN-2932.v4.txt, YARN-2932.v5.txt, YARN-2932.v6.txt, YARN-2932.v7.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2896) Server side PB changes for Priority Label Manager and Admin CLI support
[ https://issues.apache.org/jira/browse/YARN-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286249#comment-14286249 ] Eric Payne commented on YARN-2896: -- [~sunilg], it seems to me that labels can make things more confusing, not less, since different queues can have arbitrary names for the same concept. Using plain integers would also eliminate the need to add infrastructure for mapping, passing, and interpreting labels and priority numbers. YARN could always specify that priorities go from low to high, and each queue could then decide how high to go with the priority numbers. Also, it seems to me that a property definition like the following could specify the ACL for a particular priority level: {code} yarn.scheduler.capacity.root.queueA.5.acl=user1,user2 {code} Server side PB changes for Priority Label Manager and Admin CLI support --- Key: YARN-2896 URL: https://issues.apache.org/jira/browse/YARN-2896 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2896.patch, 0002-YARN-2896.patch, 0003-YARN-2896.patch, 0004-YARN-2896.patch Common changes: * PB support changes required for Admin APIs * PB support for File System store (Priority Label Store) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2896) Server side PB changes for Priority Label Manager and Admin CLI support
[ https://issues.apache.org/jira/browse/YARN-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14288445#comment-14288445 ] Eric Payne commented on YARN-2896: -- [~sunilg], [~leftnoteasy], and [~vinodkv], can we move this discussion to YARN-1963 in order to achieve a higher visibility? Server side PB changes for Priority Label Manager and Admin CLI support --- Key: YARN-2896 URL: https://issues.apache.org/jira/browse/YARN-2896 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2896.patch, 0002-YARN-2896.patch, 0003-YARN-2896.patch, 0004-YARN-2896.patch Common changes: * PB support changes required for Admin APIs * PB support for File System store (Priority Label Store) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3088) LinuxContainerExecutor.deleteAsUser can throw NPE if native executor returns an error
[ https://issues.apache.org/jira/browse/YARN-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne reassigned YARN-3088: Assignee: Eric Payne LinuxContainerExecutor.deleteAsUser can throw NPE if native executor returns an error - Key: YARN-3088 URL: https://issues.apache.org/jira/browse/YARN-3088 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.1-beta Reporter: Jason Lowe Assignee: Eric Payne If the native executor returns an error trying to delete a path as a particular user when dir==null, then the code can NPE trying to build a log message for the error. It blindly dereferences dir in the log message despite the code just above explicitly handling the cases when dir could be null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
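Regarding the NPE scenario in YARN-3088 described above, here is a minimal, self-contained sketch of the problematic pattern (hypothetical names; this is not the actual LinuxContainerExecutor code):
{code}
// Hypothetical illustration of the NPE pattern described above, not the
// actual LinuxContainerExecutor code: dir is checked for null when choosing
// what to delete, but blindly dereferenced when building the error message.
public class DeleteAsUserNpeDemo {
  public static void main(String[] args) {
    java.io.File dir = null;      // caller asked to delete "as much as possible"
    int returnCode = 1;           // pretend the native executor failed
    String target = (dir == null) ? "all user directories" : dir.getPath();
    System.out.println("deleting " + target);
    if (returnCode != 0) {
      // Throws NullPointerException when dir == null, which is the bug.
      System.err.println("delete failed for " + dir.getPath()
          + ", exit code: " + returnCode);
    }
  }
}
{code}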
[jira] [Assigned] (YARN-3089) LinuxContainerExecutor does not handle file arguments to deleteAsUser
[ https://issues.apache.org/jira/browse/YARN-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne reassigned YARN-3089: Assignee: Eric Payne LinuxContainerExecutor does not handle file arguments to deleteAsUser - Key: YARN-3089 URL: https://issues.apache.org/jira/browse/YARN-3089 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Blocker YARN-2468 added the deletion of individual logs that are aggregated, but this fails to delete log files when the LCE is being used. The LCE native executable assumes the paths being passed are directories, and the delete fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3074) Nodemanager dies when localizer runner tries to write to a full disk
[ https://issues.apache.org/jira/browse/YARN-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294130#comment-14294130 ] Eric Payne commented on YARN-3074: -- [~varun_saxena], Thanks for posting this patch. Rather than duplicating the catch blocks, I would like to see the {{catch}} blocks save off the exception and fserror, then process it during the {{finally}} block. So, what I'm suggesting is before the {{try}} block, add a {{Throwable}} variable: {code} Throwable t = null; {code} In the catch blocks, save the exception and error: {code} } catch (Exception e) { t = e; } catch (FSError fse) { t = fse; } {code} Then, move what used to be in the original {{catch (Exception e)}} block into the {{finally}} block surrounded by {code} if (t != null) { ... } {code} Also, please add a unit test. Nodemanager dies when localizer runner tries to write to a full disk Key: YARN-3074 URL: https://issues.apache.org/jira/browse/YARN-3074 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-3074.001.patch When a LocalizerRunner tries to write to a full disk it can bring down the nodemanager process. Instead of failing the whole process we should fail only the container and make a best attempt to keep going. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
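To make the YARN-3074 suggestion above concrete, here is a rough sketch of the consolidated try/catch/finally structure (method names are stand-ins, not the actual LocalizerRunner code):
{code}
import org.apache.hadoop.fs.FSError;

// Illustrative sketch of the suggested structure; not the actual LocalizerRunner code.
class LocalizerRunnerSketch {
  void run() {
    Throwable t = null;
    try {
      doLocalization();                  // hypothetical stand-in for the real work
    } catch (Exception e) {
      t = e;
    } catch (FSError fse) {
      t = fse;
    } finally {
      if (t != null) {
        // what used to be in the original catch (Exception e) block:
        // fail only the container and keep the NodeManager running
        handleLocalizationFailure(t);    // hypothetical helper
      }
    }
  }

  private void doLocalization() throws Exception { /* ... */ }

  private void handleLocalizationFailure(Throwable cause) { /* log and fail the container */ }
}
{code}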
[jira] [Commented] (YARN-2932) Add entry for preemptable status (enabled/disabled) to scheduler web UI and queue initialize/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295218#comment-14295218 ] Eric Payne commented on YARN-2932: -- Thank you for your input and review, [~leftnoteasy] Add entry for preemptable status (enabled/disabled) to scheduler web UI and queue initialize/refresh logging -- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Fix For: 2.7.0 Attachments: Screenshot.Queue.Preemption.Disabled.jpg, YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt, YARN-2932.v4.txt, YARN-2932.v5.txt, YARN-2932.v6.txt, YARN-2932.v7.txt, YARN-2932.v8.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290300#comment-14290300 ] Eric Payne commented on YARN-1963: -- +1 on using numbers and not labels. It seems that the use of labels adds more complexity in mapping, sending via PB, and converting back to numbers, and does not seem to add much clarity. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3089) LinuxContainerExecutor does not handle file arguments to deleteAsUser
[ https://issues.apache.org/jira/browse/YARN-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298637#comment-14298637 ] Eric Payne commented on YARN-3089: -- Thank you, [~sunilg], for your review of this patch. {quote} {code} int subDirEmptyStr = (subdir == NULL || subdir[0] == 0); {code} I think strlen(subdir) also has to be checked against 0, correct? {quote} Checking {{strlen(subdir) == 0}} would do exactly the same thing that {{subdir[0] == 0}} does, which is check that the first byte in the string is 0. Inside {{strlen}}, that test takes the form {{*s == '\0'}}, but it amounts to the same thing. Checking for an empty string directly, as the existing patch does, avoids the overhead of another function call. LinuxContainerExecutor does not handle file arguments to deleteAsUser - Key: YARN-3089 URL: https://issues.apache.org/jira/browse/YARN-3089 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Blocker Attachments: YARN-3089.v1.txt YARN-2468 added the deletion of individual logs that are aggregated, but this fails to delete log files when the LCE is being used. The LCE native executable assumes the paths being passed are directories, and the delete fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2932: - Attachment: YARN-2932.v2.txt Thanks very much, [~leftnoteasy], for your thorough review of this patch and for your helpful comments. {quote} 1) Since the QUEUE_PREEMPTION_DISABLED is an option for CS, I suggest making it a member of CapacitySchedulerConfiguration, like getUserLimitFactor/setUserLimit, etc. This will avoid some String operations. {quote} This is a good idea. I added {{isQueuePreemptable}} and {{setQueuePreemptable}}. For {{isQueuePreemptable}}, I needed to add a default value parameter because the default for the queue at a particular level should be whatever its parent's value is. {quote} 2) Rename {{context}} in {{AbstractCSQueue}} to a name like {{csContext}} since we have {{rmContext}} {quote} Renamed. {quote} 3) I suggest adding a member var like {{preemptable}} to {{AbstractCSQueue}}, instead of calling: {code} + @Private + public boolean isPreemptable() { +return context.getConfiguration().isPreemptable(getQueuePath()); + } {code} The implementation of {{CSConfiguration.isPreemptable(..)}} seems too complex to me. {{CSConfiguration}} should only care about values from the configuration file; such logic should be put in {{AbstractCSQueue.setupQueueConfigs(...)}} {quote} I moved the logic to {{AbstractCSQueue.setupQueueConfigs(...)}}, and you are right. It is much cleaner that way. Thanks! {quote} 4) It's better to keep the web UI name (preemptable) and configuration name (disable_preemption) consistent. I prefer preemptable personally. {quote} Yes, it is less confusing that way. In this patch, the only code that deals with the {{disable_preemption}} property is internal to the {{CSConfiguration}} methods. The APIs are now all asking whether or not the queue is preemptable. {quote} 5) {{testIsPreemptable}} should be a part of {{TestCapacityScheduler}} instead of putting it to {{TestProportionalCapacityPreemptionPolicy}}. {quote} Thanks. I moved {{testIsPreemptable}} to {{TestCapacityScheduler}}. However, since the interface for changing a queue's preemptability changed, there were also several changes to {{TestProportionalCapacityPreemptionPolicy}}. {quote} 6) In {{ProportionalCapacityPreemptionPolicy.cloneQueues}}, the preemptable field should be gotten from the Queue instead of from the configuration. {quote} Done. Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
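For reference, a rough sketch of what the {{CapacitySchedulerConfiguration}} accessors discussed above might look like (the property name and method bodies are assumptions based on this conversation, not the committed patch):
{code}
// Rough sketch of the accessors discussed above; assumptions, not the committed patch.
// The stored property is "disable_preemption", so the boolean is inverted
// relative to "preemptable".
public boolean isQueuePreemptable(String queuePath, boolean defaultVal) {
  String propName = getQueuePrefix(queuePath) + "disable_preemption";
  return !getBoolean(propName, !defaultVal);
}

public void setQueuePreemptable(String queuePath, boolean preemptable) {
  String propName = getQueuePrefix(queuePath) + "disable_preemption";
  setBoolean(propName, !preemptable);
}
{code}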
[jira] [Updated] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2932: - Attachment: YARN-2932.v3.txt Upmerged and uploading new patch (v3). Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275686#comment-14275686 ] Eric Payne commented on YARN-2932: -- [~leftnoteasy], thanks very much for your review and comments: bq. 1. Rename {{isQueuePreemptable}} to {{getQueuePreemptable}} for getter/setter consistency in {{CapacitySchedulerConfiguration}} Renamed. bq. 2. Should consider queue reinitialize when queue preemptable in configuration changes (See {{TestQueueParsing}}). And it's best to add a test to verify that. I'm sorry. I don't understand what you mean by the word "consider". Calling {{CapacityScheduler.reinitialize}} will follow the queue hierarchy down and eventually call {{AbstractCSQueue#setupQueueConfigs}} for every queue, so I don't think there is any additional code needed, unless I'm missing something. Were you just saying that I need to add a test case for that? {quote} 3. It's better to remove the {{defaultVal}} parameter in {{CapacitySchedulerConfiguration.isPreemptable}}: {code} public boolean isQueuePreemptable(String queue, boolean defaultVal) {code} And the default_value should be placed in {{CapacitySchedulerConfiguration}}, like other queue configuration options. I understand what you're trying to do is move some logic from the queue to {{CapacitySchedulerConfiguration}}, but I still think it's better to keep {{CapacitySchedulerConfiguration}} simple: it should just get values from the configuration file. {quote} The problem is that without the {{defaultVal}} parameter, {{AbstractCSQueue#isQueuePathHierarchyPreemptable}} can't tell if the queue has explicitly set its preemptability or if it is just returning the default. For example: {code} root: disable_preemption = true root.A: disable_preemption (the property is not set) root.B: disable_preemption = false (the property is explicitly set to false) {code} Let's say the {{getQueuePreemptable}} interface is changed to remove the {{defaultVal}} parameter, and that when {{getQueuePreemptable}} calls {{getBoolean}}, it uses {{false}} as the default. # {{getQueuePreemptable}} calls {{getBoolean}} on {{root}} ## {{getBoolean}} returns {{true}} because the {{disable_preemption}} property is set to {{true}} ## {{getQueuePreemptable}} inverts {{true}} and returns {{false}} (That is, {{root}} has preemption disabled, so it is not preemptable). # {{getQueuePreemptable}} calls {{getBoolean}} on {{root.A}} ## {{getBoolean}} returns {{false}} because there is no {{disable_preemption}} property set for this queue, so {{getBoolean}} returns the default. ## {{getQueuePreemptable}} inverts {{false}} and returns {{true}} # {{getQueuePreemptable}} calls {{getBoolean}} on {{root.B}} ## {{getBoolean}} returns {{false}} because the {{disable_preemption}} property is set to {{false}} for this queue ## {{getQueuePreemptable}} inverts {{false}} and returns {{true}} At this point, {{isQueuePathHierarchyPreemptable}} needs to know if it should use the default preemption from {{root}} or if it should use the value from each child queue. In the case of {{root.A}}, the value from {{root}} ({{false}}) should be used because {{root.A}} does not have the property set. In the case of {{root.B}}, the value should be the one returned for {{root.B}} ({{true}}) because it is explicitly set. But since {{root.A}} and {{root.B}} both returned {{true}}, {{isQueuePathHierarchyPreemptable}} can't tell the difference. Does that make sense?
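To illustrate the point above, here is how the hierarchy walk can use the {{defaultVal}} parameter against the example configuration (names are assumptions based on this discussion, not the committed patch):
{code}
// Sketch of the inheritance logic discussed above; assumed names, not the
// committed patch. Each queue passes its parent's effective value as the
// default, so a queue that does not set disable_preemption inherits the
// parent's setting, while an explicit setting (root.B here) overrides it.
boolean rootPreemptable = conf.isQueuePreemptable("root", true);              // false (disable_preemption=true)
boolean aPreemptable    = conf.isQueuePreemptable("root.A", rootPreemptable); // false (inherited from root)
boolean bPreemptable    = conf.isQueuePreemptable("root.B", rootPreemptable); // true  (explicitly preemptable)
{code}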
Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3074) Nodemanager dies when localizer runner tries to write to a full disk
[ https://issues.apache.org/jira/browse/YARN-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315147#comment-14315147 ] Eric Payne commented on YARN-3074: -- [~varun_saxena], Thank you for the updated patch! +1 Patch LGTM Nodemanager dies when localizer runner tries to write to a full disk Key: YARN-3074 URL: https://issues.apache.org/jira/browse/YARN-3074 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-3074.001.patch, YARN-3074.002.patch, YARN-3074.03.patch When a LocalizerRunner tries to write to a full disk it can bring down the nodemanager process. Instead of failing the whole process we should fail only the container and make a best attempt to keep going. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363790#comment-14363790 ] Eric Payne commented on YARN-1963: -- {quote} I think label-based and integer-based priorities are just two different ways to configure as well as API. No matter whether we choose to use label-based or integer-based priority, we should use integers only to implement the internal logic (like in CapacityScheduler). {quote} I think that is true; especially when passing priorities through protocol buffers, using integers is best. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: 0001-YARN-1963-prototype.patch, YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357600#comment-14357600 ] Eric Payne commented on YARN-1963: -- Thanks, [~sunilg], for your work on in-queue priorities. Along with [~nroberts], I'm confused about why priority labels are needed. As a user, I just need to know that the higher the number, the higher the priority. Then, I just need a way to see what priority each application is using and a way to set the priority of applications. To me, it just seems like labels will get in the way. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: 0001-YARN-1963-prototype.patch, YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
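From the client side, "just setting an integer priority" could look like the following sketch, which uses existing YARN records (the specific value is arbitrary):
{code}
// Sketch of a client setting an integer application priority; the value 5 is arbitrary.
// app is a YarnClientApplication obtained from YarnClient#createApplication().
ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
appContext.setPriority(Priority.newInstance(5));  // higher number = higher priority
{code}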
[jira] [Commented] (YARN-2498) Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355361#comment-14355361 ] Eric Payne commented on YARN-2498: -- Hi [~leftnoteasy]. Great job on this patch. I have one minor nit: Would you mind changing {{duductAvailableResourceAccordingToLabel}} to {{deductAvailableResourceAccordingToLabel}}? That is, {{duduct...}} should be {{deduct...}}. Respect labels in preemption policy of capacity scheduler - Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There are 3 stages in ProportionalCapacityPreemptionPolicy: # Recursively calculate {{ideal_assigned}} for each queue. This depends on available resources, resources used/pending in each queue, and the guaranteed capacity of each queue. # Mark to-be-preempted containers: for each over-satisfied queue, mark some containers that will be preempted. # Notify the scheduler about to-be-preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when there are resources available in the cluster, we shouldn't assign them to a queue (by increasing {{ideal_assigned}}) if the queue cannot access such labels. For #2, when we decide whether to preempt a container, we need to make sure the resource this container holds is *possibly* usable by a queue which is under-satisfied and has pending resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3275) Preemption happening on non-preemptable queues
Eric Payne created YARN-3275: Summary: Preemption happening on non-preemptable queues Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN:3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3275) Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340393#comment-14340393 ] Eric Payne commented on YARN-3275: -- This situation can happen under the following conditions: - All of the resources in the cluster are being used - {{QueueA}} is preemptable and over its absolute capacity (AKA guaranteed capacity) - {{QueueB}} is not preemptable, over its absolute capacity, also over its absolute max capacity (which can happen), and asking for more resources In the above scenario, {{ProportionalCapacityPreemptionPolicy}} will subtract {{QueueB}}'s ideal assigned value from its absolute max capacity value and get a negative number. This adjusts its ideally assigned resources downward by that amount, which results in that amount getting preempted. Regardless of the reason, if a queue is marked as non-preemptable, resources should never be preempted from that queue. Preemption happening on non-preemptable queues -- Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN:3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
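A small worked example of the arithmetic described above, using hypothetical numbers:
{code}
// Hypothetical numbers illustrating the arithmetic described above.
int queueBAbsMaxCapacity = 20;  // QueueB's absolute max capacity
int queueBIdealAssigned  = 25;  // QueueB is over its absolute max capacity
int delta = queueBAbsMaxCapacity - queueBIdealAssigned;  // -5
// ideal_assigned is adjusted downward by 5, so 5 resources end up being
// preempted from QueueB even though the queue is marked non-preemptable.
{code}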
[jira] [Resolved] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.
[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-2592. -- Resolution: Invalid Preemption can kill containers to fulfil need of already over-capacity queue. - Key: YARN-2592 URL: https://issues.apache.org/jira/browse/YARN-2592 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.1 Reporter: Eric Payne There are scenarios in which one over-capacity queue can cause preemption of another over-capacity queue. However, since killing containers may lose work, it doesn't make sense to me to kill containers to feed an already over-capacity queue. Consider the following: {code} root has A,B,C, total capacity = 90 A.guaranteed = 30, A.pending = 5, A.current = 40 B.guaranteed = 30, B.pending = 0, B.current = 50 C.guaranteed = 30, C.pending = 0, C.current = 0 {code} In this case, the queue preemption monitor will kill 5 resources from queue B so that queue A can pick them up, even though queue A is already over its capacity. This could lose any work that those containers in B had already done. Is there a use case for this behavior? It seems to me that if a queue is already over its capacity, it shouldn't destroy the work of other queues. If the over-capacity queue needs more resources, that seems to be a problem that should be solved by increasing its guarantee. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.
[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340241#comment-14340241 ] Eric Payne commented on YARN-2592: -- Closing this. It is expected that, as long as there are available resources, queue usage should grow evenly based on each queue's percentage of absolute capacity, and preemption can happen to fill this growth as long as absolute max capacity is not exceeded and the queues grow evenly. Preemption can kill containers to fulfil need of already over-capacity queue. - Key: YARN-2592 URL: https://issues.apache.org/jira/browse/YARN-2592 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.1 Reporter: Eric Payne There are scenarios in which one over-capacity queue can cause preemption of another over-capacity queue. However, since killing containers may lose work, it doesn't make sense to me to kill containers to feed an already over-capacity queue. Consider the following: {code} root has A,B,C, total capacity = 90 A.guaranteed = 30, A.pending = 5, A.current = 40 B.guaranteed = 30, B.pending = 0, B.current = 50 C.guaranteed = 30, C.pending = 0, C.current = 0 {code} In this case, the queue preemption monitor will kill 5 resources from queue B so that queue A can pick them up, even though queue A is already over its capacity. This could lose any work that those containers in B had already done. Is there a use case for this behavior? It seems to me that if a queue is already over its capacity, it shouldn't destroy the work of other queues. If the over-capacity queue needs more resources, that seems to be a problem that should be solved by increasing its guarantee. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3275) Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3275: - Description: YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN-3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. was: YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN:3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. Preemption happening on non-preemptable queues -- Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN-3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3275) Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3275: - Attachment: YARN-3275.v1.txt Preemption happening on non-preemptable queues -- Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3275.v1.txt YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN-3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3275) CapacityScheduler: Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3275: - Attachment: YARN-3275.v2.txt [~jlowe] and [~leftnoteasy], thank you for the reviews. Attached is an updated patch (v2) with your suggested changes. CapacityScheduler: Preemption happening on non-preemptable queues - Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Labels: capacity-scheduler Attachments: YARN-3275.v1.txt, YARN-3275.v2.txt YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN-3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3275) Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341626#comment-14341626 ] Eric Payne commented on YARN-3275: -- Thanks very much, [~leftnoteasy], for reviewing this issue. {quote} Actually, going over max capacity is possible: with a cluster of resource = 1000G, a queue can reach its max capacity, and after the cluster resource goes down to 100G, it can be over max capacity. In addition, a parent queue can go beyond max capacity as described in YARN-3243 whether or not the cluster resource changed. But a child queue can only go beyond max capacity when the cluster resource is reduced. {quote} It is possible that the total available capacity of the cluster dropped by some percentage, causing the leaf queue to go over its abs max cap by 5%. The cluster has a large number of nodes and memory, and that value is always changing slightly as nodes are lost and re-register. This may not account for the 5% overage we saw on the small leaf queue, because that total memory number isn't varying by 5%. {quote} we haven't defined that disable-preemption is more important than max-capacity. IMO, whether we should do this JIRA or not is still debatable. {quote} I see your point. In other words, it could be argued that the preemption monitor is doing the right thing. That is, when it sees that the queue is over its absolute max capacity (which should not happen), the preemption monitor is moving those resources back into the usable pool. However, the expectation of our users is that if they are running a job on a non-preemptable queue, their containers should never be preempted. From their point of view, it doesn't matter what the reason is; they expect the RM to obey the contract that says it will not preempt their resources. Preemption happening on non-preemptable queues -- Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3275.v1.txt YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN:3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3089) LinuxContainerExecutor does not handle file arguments to deleteAsUser
[ https://issues.apache.org/jira/browse/YARN-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3089: - Attachment: YARN-3089.v1.txt LinuxContainerExecutor does not handle file arguments to deleteAsUser - Key: YARN-3089 URL: https://issues.apache.org/jira/browse/YARN-3089 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Blocker Attachments: YARN-3089.v1.txt YARN-2468 added the deletion of individual logs that are aggregated, but this fails to delete log files when the LCE is being used. The LCE native executable assumes the paths being passed are directories, and the delete fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3540) Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler
Eric Payne created YARN-3540: Summary: Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler Key: YARN-3540 URL: https://issues.apache.org/jira/browse/YARN-3540 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Priority: Blocker We are seeing this happen when - an NM's disk goes bad during the creation of map output(s) - the reducer's fetcher can read the shuffle header and reserve the memory - but gets an IOException when trying to shuffle for InMemoryMapOutput - shuffle fetch retry is enabled -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517252#comment-14517252 ] Eric Payne commented on YARN-2004: -- [~sunilg], Thanks for all of the work you are doing for this important feature. {quote} queueA: default=low queueB: default=medium The type of apps which we run may vary from queueA to B. So keeping a different default priority for each queue will help handle such a case. Assume higher-priority apps often run in queueA, and medium-priority apps in queueB. Making the default priority different can help here. {quote} I don't know a lot about the fair scheduler, but I'm pretty sure that in the capacity scheduler, there is no way to make one queue a higher priority than another. There is no way to compare job priorities between queues. That is, you can't say that jobs running in queueA have a higher priority than jobs running in queueB. So, it only makes sense to compare priorities between jobs in the same queue. Am I missing something? Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 0006-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to application while doing scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for Application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
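As a sketch of the comparator change described in the JIRA summary above (class and accessor names are assumptions, not the actual patch):
{code}
// Sketch of a priority-aware application comparator along the lines of the
// description above; class and accessor names are assumptions, not the actual patch.
Comparator<FiCaSchedulerApp> applicationComparator =
    new Comparator<FiCaSchedulerApp>() {
      @Override
      public int compare(FiCaSchedulerApp a1, FiCaSchedulerApp a2) {
        Priority p1 = a1.getApplicationPriority();
        Priority p2 = a2.getApplicationPriority();
        if (p1 != null && p2 != null && !p1.equals(p2)) {
          // order the higher-priority application first (assumed convention)
          return p2.compareTo(p1);
        }
        // otherwise fall back to the existing logic: application ID comparison
        return a1.getApplicationId().compareTo(a2.getApplicationId());
      }
    };
{code}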
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524106#comment-14524106 ] Eric Payne commented on YARN-3097: -- Thanks, [~gtCarrera9], for your interest. Although I haven't made much progress on this yet, I do still plan on working on it in the near future. Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3097: - Attachment: YARN-3097.001.patch Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526577#comment-14526577 ] Eric Payne commented on YARN-3097: -- {quote} -1 The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {quote} Since the only change in this patch is to change an info log message to a debug log message, no tests were included. Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
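The change itself amounts to something like the following sketch (variable names are illustrative, not the exact diff):
{code}
// Sketch of the change described above; variable names are illustrative.
// The recovery message is demoted from info to debug and guarded so the
// string is only built when debug logging is enabled.
if (LOG.isDebugEnabled()) {
  LOG.debug("Recovering localized resource " + remotePath + " at " + localPath);
}
{code}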
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517545#comment-14517545 ] Eric Payne commented on YARN-2004: -- [~sunilg], bq. Hope you understood my comment about priority config across queue. Pls let me know your thoughts. I think you are referring to [~leftnoteasy]'s suggestion that a cluster-wide config should be added to put a cap on the maximum priorities allowed in the queue. Is that correct? I think that makes sense so that cluster admins can put a cap on the number of priorities within any given queue. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 0006-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to application while doing scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for Application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1519) check if sysconf is implemented before using it
[ https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-1519: - Attachment: YARN-1519.003.patch [~hsn], since I didn't hear back, I will go ahead and post the patch with the changes suggested by [~raviprak]. Thanks again for doing all of the work and testing on this patch. [~raviprak], will you please have a look? Thanks. check if sysconf is implemented before using it --- Key: YARN-1519 URL: https://issues.apache.org/jira/browse/YARN-1519 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.3.0 Reporter: Radim Kolar Assignee: Radim Kolar Labels: BB2015-05-TBR Attachments: YARN-1519.002.patch, YARN-1519.003.patch, nodemgr-sysconf.txt If the sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to a segfault because an invalid pointer gets passed to a libc function. Fix: enforce a minimum value of 1024; the same method is used in the hadoop-common native code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544404#comment-14544404 ] Eric Payne commented on YARN-2069: -- Hi [~mayank_bansal]. Thanks for working through the details related to this issue. I have one small nit. In {{LeafQueue#computeTargetedUserLimit}}, it does not look like the {{MIN}} and {{MAX}} variables are ever used. CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Labels: BB2015-05-TBR Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-10.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch, YARN-2069-trunk-9.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573670#comment-14573670 ] Eric Payne commented on YARN-3769: -- [~leftnoteasy] bq. If you think it's fine, could I take a shot at it? It sounds like it would work. It's fine with me if you want to work on that. Preemption occurring unnecessarily because preemption doesn't consider user limit - Key: YARN-3769 URL: https://issues.apache.org/jira/browse/YARN-3769 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0, 2.7.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne We are seeing the preemption monitor preempting containers from queue A and then seeing the capacity scheduler giving them immediately back to queue A. This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
Eric Payne created YARN-3769: Summary: Preemption occurring unnecessarily because preemption doesn't consider user limit Key: YARN-3769 URL: https://issues.apache.org/jira/browse/YARN-3769 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0, 2.6.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne We are seeing the preemption monitor preempting containers from queue A and then seeing the capacity scheduler giving them immediately back to queue A. This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573619#comment-14573619 ] Eric Payne commented on YARN-3769: -- The following configuration will cause this:
|| queue || capacity || max || pending || used || user limit ||
| root | 100 | 100 | 40 | 90 | N/A |
| A | 10 | 100 | 20 | 70 | 70 |
| B | 10 | 100 | 20 | 20 | 20 |
One app is running in each queue. Both apps are asking for more resources, but they have each reached their user limit, so even though both are asking for more and there are resources available, no more resources are allocated to either app. The preemption monitor will see that {{B}} is asking for a lot more resources, and it will see that {{B}} is more underserved than {{A}}, so the preemption monitor will try to make the queues balance by preempting resources (10, for example) from {{A}}.
|| queue || capacity || max || pending || used || user limit ||
| root | 100 | 100 | 50 | 80 | N/A |
| A | 10 | 100 | 30 | 60 | 70 |
| B | 10 | 100 | 20 | 20 | 20 |
However, when the capacity scheduler tries to give that container to the app in {{B}}, the app will recognize that it has no headroom, and refuse the container. So the capacity scheduler offers the container again to the app in {{A}}, which accepts it because it has headroom now, and the process starts over again. Note that this happens even when used cluster resources are below 100% because the used + pending for the cluster would put it above 100%. Preemption occurring unnecessarily because preemption doesn't consider user limit - Key: YARN-3769 URL: https://issues.apache.org/jira/browse/YARN-3769 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0, 2.7.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne We are seeing the preemption monitor preempting containers from queue A and then seeing the capacity scheduler giving them immediately back to queue A. This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
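Working through the numbers in the tables above, the headroom check that causes the bounce looks roughly like this (hypothetical helper code just to restate the arithmetic):
{code}
// Hypothetical helper code restating the arithmetic from the tables above:
// the app in QueueB is already at its user limit, so its headroom is zero
// and it refuses any container the scheduler offers it.
int userLimitB = 20;
int usedB = 20;
int headroomB = Math.max(0, userLimitB - usedB);  // 0
boolean appInBCanAccept = headroomB > 0;          // false
// The preempted resources therefore go back to the app in QueueA, which does
// have headroom again (user limit 70, used 60 after preemption), and the
// preempt-and-reallocate cycle repeats.
{code}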
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573664#comment-14573664 ] Eric Payne commented on YARN-3769: -- [~leftnoteasy], {quote} One thing I've thought for a while is adding a lazy preemption mechanism, which is: when a container is marked preempted and wait for max_wait_before_time, it becomes a can_be_killed container. If there's another queue can allocate on a node with can_be_killed container, such container will be killed immediately to make room the new containers. {quote} IIUC, in your proposal, the preemption monitor would mark the containers as preemptable, and then after some configurable wait period, the capacity scheduler would be the one to do the killing if it finds that it needs the resources on that node. Is my understanding correct? Preemption occurring unnecessarily because preemption doesn't consider user limit - Key: YARN-3769 URL: https://issues.apache.org/jira/browse/YARN-3769 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0, 2.7.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne We are seeing the preemption monitor preempting containers from queue A and then seeing the capacity scheduler giving them immediately back to queue A. This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603603#comment-14603603 ] Eric Payne commented on YARN-2004: -- Thanks, [~sunilg], for this fix. - {{SchedulerApplicationAttempt.java}}: {code} if (!getApplicationPriority().equals( ((SchedulerApplicationAttempt) other).getApplicationPriority())) { return getApplicationPriority().compareTo( ((SchedulerApplicationAttempt) other).getApplicationPriority()); } {code} -- Can {{getApplicationPriority}} return null? I see that {{SchedulerApplicationAttempt}} initializes {{appPriority}} to null. - {{CapacityScheduler.java}}: {code} if (!a1.getApplicationPriority().equals(a2.getApplicationPriority())) { return a1.getApplicationPriority().compareTo( a2.getApplicationPriority()); } {code} -- Same question about {{getApplicationPriority}} returning null. -- Also, can {{updateApplicationPriority}} call {{authenticateApplicationPriority}}? Seems like duplicate code to me. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 0006-YARN-2004.patch, 0007-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to application while doing scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for Application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590747#comment-14590747 ] Eric Payne commented on YARN-2902: -- Hi [~varun_saxena]. Thank you very much for working on and fixing this issue. We are looking forward to your next patch. Do you have an ETA for when that might be? Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Attachment: YARN-3978.004.patch Thanks very much [~jeagles] for your review and comments. {quote} It is not evident to me the changes made to TestClientRMService, TestChildQueueOrder, TestLeafQueue, TestReservations, and TestFifoScheduler. Could you help to explain those changes or remove them if they are extra? {quote} In each of these tests, the {{rmcontext}} is mocked before constructing a new instance of {{RMContainerImpl}}, but the {{getYarnConfiguration}} method is not handled. Since this patch adds a dependency on {{rmContext.getYarnConfiguration()}} in the constructor for {{RMContainerImpl}}, an explicit mock for {{getYarnConfiguration}} had to be added in these tests to prevent NPE. {quote} Please update the comment + // Store system metrics for all containers only when storeContainerMetaInfo + // is true. To indicate that AM metrics publishing are delayed until later in this scenario. {quote} Done {quote} Is there a better configuration name that could be used? save-container-meta-info doesn't convey that AM container info is still published if this flag is disabled. {quote} How about {{save-non-am-container-meta-info}}? I thought about {{save-only-am-container-meta-info}}, but then {{true}} would mean that publishing of non-am containers would be turned off, and I thought that was too confusing. Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch, YARN-3978.004.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
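For context, the kind of mock addition described above looks roughly like this (illustrative; variable names may differ from the actual test code):
{code}
// Illustrative sketch of the mock addition described above; variable names
// may differ from the actual tests. Without this stub,
// rmContext.getYarnConfiguration() returns null and the RMContainerImpl
// constructor throws an NPE.
RMContext rmContext = mock(RMContext.class);
YarnConfiguration conf = new YarnConfiguration();
when(rmContext.getYarnConfiguration()).thenReturn(conf);
{code}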
[jira] [Commented] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648175#comment-14648175 ] Eric Payne commented on YARN-3978: -- {{checkstyle}} indicates that {{YarnConfiguration.java}} is too long. I will not be fixing that as part of this JIRA. Everything else from the build seems to be okay. [~jeagles], can you please have a look at this patch? Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Affects Version/s: 2.8.0 2.7.1 Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Attachment: YARN-3978.003.patch Version 003 of patch. Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Attachment: YARN-3978.002.patch Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662133#comment-14662133 ] Eric Payne commented on YARN-3250: -- Just my 2 cents: I prefer {{yarn application --appId ApplicationId --setPriority value}} Support admin cli interface in for Application Priority --- Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Attachments: 0001-YARN-3250-V1.patch Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696957#comment-14696957 ] Eric Payne commented on YARN-4014: -- {code} +pw.println(" -appId <Application ID>   ApplicationId can be used with any other"); +pw.println("                            sub commands in future. Currently it is"); +pw.println("                            used along only with -set-priority"); ... + ApplicationId can be used with any other sub commands in future. + + Currently it is used along only with -set-priority); {code} This is a minor point, but in these 2 places, I would simply state something like the following: {{ID of the affected application.}} That way, when it is used in the future by other switches, the developer doesn't have to remember to change these statements. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
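For illustration only, the generic wording suggested above might render like this; the {{pw}} PrintWriter and the exact column alignment come from the patch under review, so treat this as a sketch rather than the final text:
{code}
import java.io.PrintWriter;

// Hypothetical rendering of the suggested help text; only "-appId" and the
// generic description come from the discussion above.
public class AppCliHelpSketch {
  static void printAppIdHelp(PrintWriter pw) {
    pw.println(" -appId <Application ID>         ID of the affected application.");
  }
}
{code}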
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Attachment: YARN-3978.001.patch Attaching version 001 of the patch. @jeagles, would you like to take a look? Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
Eric Payne created YARN-3978: Summary: Configurably turn off the saving of container info in Generic AHS Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Reporter: Eric Payne Assignee: Eric Payne Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14641768#comment-14641768 ] Eric Payne commented on YARN-3978: -- Use Case: A user launches an application on a secured cluster that runs for some time and then fails within the AM (perhaps due to OOM in the AM), leaving no history in the job history server. The user doesn't notice that the job has failed until after the application has dropped off of the RM's application store. At this point, if no information was stored in the Generic Application History Service, a user must rely on a privileged system administrator to access the AM logs for them. It is desirable to activate the Generic Application History service within the timeline server so that users can access their application's information even after the RM has forgotten about their application. This app information should be kept in the GAHS for 1 week, as is done, for example, for logs in the job history server. One way that the Generic AHS stores metadata about an application is in an Entity levelDB. This includes information about each container for each application. Based on my analysis, the levelDB size grows by at least 2500 bytes per container (uncompressed). This is a conservative estimate as the size could be much bigger based on the amount of diagnostic information associated with failed containers. On very large and busy clusters, the amount needed on the timeline server's local disk would be between 0.6 TB and 1.0 TB (uncompressed). Even if we assume 90% compression, that's still between 60 GB and 100 GB that will be needed on the local disk. In addition to this, between 80 GB and 143 GB of metadata (uncompressed) will need to be cleaned up every day from the levelDB, which will delay other processing in the timeline server. The proposal of this JIRA is to add a configuration property that controls whether or not the GAHS stores container information in the levelDB. With this change, I estimate that the local disk usage would be about 5700 bytes per job, or about 10 GB (uncompressed) per week. Additionally, the daily cleanup load would only be about 1.5 GB. Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Reporter: Eric Payne Assignee: Eric Payne Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
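As a rough sanity check of the numbers above, the following back-of-the-envelope sketch reproduces the stated orders of magnitude; the weekly container and job counts are assumptions chosen for illustration, not measured values:
{code}
// Back-of-the-envelope sizing check. Per-record sizes are taken from the
// analysis above; the weekly counts are assumed values for a large cluster.
public class GahsSizingSketch {
  public static void main(String[] args) {
    long bytesPerContainer = 2_500L;        // at least this much per container (uncompressed)
    long containersPerWeek = 300_000_000L;  // assumption for a very large, busy cluster
    double weeklyTb = bytesPerContainer * (double) containersPerWeek / 1e12;
    System.out.printf("with container info:    ~%.2f TB per week%n", weeklyTb); // ~0.75 TB

    long bytesPerJob = 5_700L;              // per-job metadata once container info is excluded
    long jobsPerWeek = 1_750_000L;          // assumption
    double weeklyGb = bytesPerJob * (double) jobsPerWeek / 1e9;
    System.out.printf("without container info: ~%.1f GB per week%n", weeklyGb); // ~10 GB
  }
}
{code}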
[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3905: - Attachment: YARN-3905.002.patch Fixing checkstyle bug. I forgot to remove the now-unused {{ContainerID}} import. Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch, YARN-3905.002.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3905: - Attachment: YARN-3905.001.patch Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996886#comment-14996886 ] Eric Payne commented on YARN-3769: -- bq. you don't need to do componentwiseMax here, since minPendingAndPreemptable <= headroom, and you can use subtractFrom to make code simpler. [~leftnoteasy], you are right, we do know that {{minPendingAndPreemptable <= headroom}}. Thanks for the catch. I will make those changes. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.005.patch [~leftnoteasy], Attaching YARN-3769.005.patch with the changes we discussed. I have another question that may be an enhancement: In {{LeafQueue#getTotalPendingResourcesConsideringUserLimit}}, the calculation of headroom is as follows in this patch: {code} Resource headroom = Resources.subtract( computeUserLimit(app, resources, user, partition, SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), user.getUsed(partition)); {code} Would it be more efficient to just do the following? {code} Resource headroom = Resources.subtract(user.getUserResourceLimit(), user.getUsed()); {code} > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch, YARN-3769.005.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI
[ https://issues.apache.org/jira/browse/YARN-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004297#comment-15004297 ] Eric Payne commented on YARN-4226: -- Since the {{preemptionDisabled}} tag has already shipped with the capacity scheduler's REST API, I don't think that changing the name of the tag is an option, since users may be relying on that key string. I see only the following options: # Change the value of {{true}} to {{disabled}} and {{false}} to {{enabled}} (which may not be an option either for the same reason changing the key is not an option) # Add a new key like {{preemptionStatus}} and have the values be {{enabled}} or {{disabled}} # Make no changes. Leave it the way that it is > Make capacity scheduler queue's preemption status REST API consistent with GUI > -- > > Key: YARN-4226 > URL: https://issues.apache.org/jira/browse/YARN-4226 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > > In the capacity scheduler GUI, the preemption status has the following form: > {code} > Preemption: disabled > {code} > However, the REST API shows the following for the same status: > {code} > "preemptionDisabled":true > {code} > The latter is confusing and should be consistent with the format in the GUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4354) Public resource localization fails with NPE
[ https://issues.apache.org/jira/browse/YARN-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004908#comment-15004908 ] Eric Payne commented on YARN-4354: -- +1 Thanks Jason for catching and fixing this. I also verified that the new test ({{TestLocalResourcesTrackerImpl#testReleaseWhileDownloading}}) passes with the fix and NPEs without it. And, I ran {{TestResourceLocalizationService}} (the above test that is failing) in my local build environment and it passes for me. > Public resource localization fails with NPE > --- > > Key: YARN-4354 > URL: https://issues.apache.org/jira/browse/YARN-4354 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.2 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-4354-unittest.patch, YARN-4354.001.patch, > YARN-4354.002.patch > > > I saw public localization on nodemanagers get stuck because it was constantly > rejecting requests to the thread pool executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.001.patch Attaching YARN-4225.001.patch for both trunk and branch-2.8 > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769-branch-2.7.005.patch Unit tests {{TestAMAuthorization}} {{TestClientRMTokens}} {{TestRM}} {{TestWorkPreservingRMRestart}} are all working for me in my local build environment. Attaching branch-2.7 patch, which is a little different, since the 2.7 preemption monitor doesn't consider labels. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769-branch-2.7.005.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, > YARN-3769.005.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.004.patch [~leftnoteasy], Thank you for your review, and sorry for the late reply. {quote} - Why this is needed? MAX_PENDING_OVER_CAPACITY. I think this could be problematic, for example, if a queue has capacity = 50, and it's usage is 10 and it has 45 pending resource, if we set MAX_PENDING_OVER_CAPACITY=0.1, the queue cannot preempt resource from other queue. {quote} Sorry for the poor naming convention. It is not really being used to check against the queue's capacity; it is used to check for a percentage over the currently used resources. I changed the name to {{MAX_PENDING_OVER_CURRENT}}. As you know, there are multiple reasons why preemption could unnecessarily preempt resources (I call it "flapping"), only one of which is the lack of consideration for user limit factor. Another is that an app could be requesting an 8-gig container, and the preemption monitor could conceivably preempt 8, one-gig containers, which would then be rejected by the requesting AM and potentially given right back to the preempted app. The {{MAX_PENDING_OVER_CURRENT}} buffer is an attempt to alleviate that particular flapping situation by giving a buffer zone above the currently used resources on a particular queue. This is to say that the preemption monitor shouldn't consider that queue B is asking for pending resources unless pending resources on queue B are above a configured percentage of currently used resources on queue B. If you want, we can pull this out and put it as part of a different JIRA so we can document and discuss that particular flapping situation separately. {quote} - In LeafQueue, it uses getHeadroom() to compute how many resource that the user can use. But I think it may not correct: ... For above queue status, headroom for a.a1 is 0 since queue-a's currentResourceLimit is 0. So instead of using headroom, I think you can use computed-user-limit - user.usage(partition) as the headroom. You don't need to consider queue's max capacity here, since we will consider queue's max capacity at following logic of PCPP. {quote} Yes, you are correct. {{getHeadroom}} could be calculating zero headroom when we don't want it to. And, I agree that we don't need to limit pending resources to max queue capacity when calculating pending resources. The concern for this fix is that user limit factor should be considered and limit the pending value. The max queue capacity will be considered during the offer stage of the preemption calculations. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
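A minimal sketch of the buffer idea described above, assuming the standard {{Resources}} helpers; the class, method, and parameter names are illustrative and are not taken from the actual patch:
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class PendingOverCurrentSketch {
  // Treat a queue as "asking" for resources only when its pending amount is
  // larger than a configured fraction of what it already uses, so that small
  // pending requests do not trigger preemption flapping.
  static boolean pendingExceedsBuffer(ResourceCalculator rc, Resource cluster,
      Resource queueUsed, Resource queuePending, float maxPendingOverCurrent) {
    Resource threshold = Resources.multiply(queueUsed, maxPendingOverCurrent);
    return Resources.greaterThan(rc, cluster, queuePending, threshold);
  }
}
{code}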
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987842#comment-14987842 ] Eric Payne commented on YARN-3769: -- Tests {{hadoop.yarn.server.resourcemanager.TestClientRMTokens}} and {{hadoop.yarn.server.resourcemanager.TestAMAuthorization}} are not failing for me in my own build environment. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947847#comment-14947847 ] Eric Payne commented on YARN-3769: -- Investigating test failures and checkstyle warnings > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: (was: YARN-3769.003.patch) > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.003.patch > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.003.patch YARN-3769-branch-2.7.003.patch [~leftnoteasy], Thanks for all of your help on this JIRA. Attaching version 003. {{YARN-3769.003.patch}} applies to both trunk and branch-2 {{YARN-3769-branch-2.7.003.patch}} applies to branch-2.7 > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
Eric Payne created YARN-3905: Summary: Application History Server UI NPEs when accessing apps run after RM restart Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.1, 2.7.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621281#comment-14621281 ] Eric Payne commented on YARN-3905: -- {{org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable}} constructs what it believes should be the AM container ID when creating a new {{GetContainerReportRequest}}. {code} // AM container is always the first container of the attempt final GetContainerReportRequest request = GetContainerReportRequest.newInstance(ContainerId.newContainerId( appAttemptReport.getApplicationAttemptId(), 1)); {code} - After the RM is restarted, container IDs contain an {{e##}} string, which the above code doesn't take into consideration - The AM container is not always _01 due to the way reservations work. We have seen non-first AM containers in practice. As a result of the above code, the container ID in the {{GetContainerReportRequest}} may not match the actual AM container ID before RM restart, and will not match those for jobs run after the RM is restarted. So, when {{ApplicationHistoryManagerImpl}} compares the ID of the passed container with its cache from the history store, it can't find a match and throws the NPE. In {{AppBlock#generateApplicationTable}}, instead of constructing the AM's container ID, I suggest using appAttemptReport#getAMContainerId: {code} final GetContainerReportRequest request = GetContainerReportRequest.newInstance( appAttemptReport.getAMContainerId()); {code} Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.001.branch-2.8.patch YARN-3769.001.branch-2.7.patch {quote} One thing I've thought for a while is adding a "lazy preemption" mechanism, which is: when a container is marked preempted and wait for max_wait_before_time, it becomes a "can_be_killed" container. If there's another queue can allocate on a node with "can_be_killed" container, such container will be killed immediately to make room the new containers. I will upload a design doc shortly for review. {quote} [~leftnoteasy], because it's been a couple of months since the last activity on this JIRA, would it be better to use this JIRA for the purpose of making the preemption monitor "user-limit" aware and open a separate JIRA to address a redesign? Towards that end, I am uploading a couple of patches: - {{YARN-3769.001.branch-2.7.patch}} is a patch to 2.7 (and also 2.6) which we have been using internally. This fix has dramatically reduced the instances of "ping-pong"-ing as I outlined in [the comment above|https://issues.apache.org/jira/browse/YARN-3769?focusedCommentId=14573619=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14573619]. - {{YARN-3769.001.branch-2.8.patch}} is similar to the fix made in 2.7, but it also takes into consideration node label partitions. Thanks for your help and please let me know what you think. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Wangda Tan > Attachments: YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724431#comment-14724431 ] Eric Payne commented on YARN-3769: -- bq. I didn't make any progress on this, assigned this to you. No problem. Thanks [~leftnoteasy]. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737684#comment-14737684 ] Eric Payne commented on YARN-3769: -- Thanks very much [~leftnoteasy]! I think the above is much more efficient, but I think it needs one small tweak. On this line: {code} userNameToHeadroom.get(app.getUser()) -= app.getPending(partition); {code} If {{app.getPending(partition)}} is larger than {{userNameToHeadroom.get(app.getUser())}}, then {{userNameToHeadroom.get(app.getUser())}} could easily go negative. I think what we may want is something like this: {code} Map<String, Resource> userNameToHeadroom; Resource userLimit = computeUserLimit(partition); Resource pendingAndPreemptable = 0; for (app in apps) { if (!userNameToHeadroom.contains(app.getUser())) { userNameToHeadroom.put(app.getUser(), userLimit - app.getUser().getUsed(partition)); } Resource minPendingAndPreemptable = min(userNameToHeadroom.get(app.getUser()), app.getPending(partition)); pendingAndPreemptable += minPendingAndPreemptable; userNameToHeadroom.get(app.getUser()) -= minPendingAndPreemptable; } return pendingAndPreemptable; {code} Also, I will work on adding a test case. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4217) Failed AM attempt retries on same failed host
Eric Payne created YARN-4217: Summary: Failed AM attempt retries on same failed host Key: YARN-4217 URL: https://issues.apache.org/jira/browse/YARN-4217 Project: Hadoop YARN Issue Type: Improvement Components: applications Affects Versions: 2.7.1 Reporter: Eric Payne This happens when the cluster is maxed out. One node is going bad, so everything that happens on it fails, so the bad node is never busy. Since the cluster is maxed out, when the RM looks for a node with available resources, it will always find the almost bad one because nothing can run on it so it has available resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4217) Failed AM attempt retries on same failed host
[ https://issues.apache.org/jira/browse/YARN-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940016#comment-14940016 ] Eric Payne commented on YARN-4217: -- One way to fix this would be by blacklisting the bad nodes. However, we need to be careful that the cure isn't worse than the disease. For example, Hadoop 0.20 had black/grey listing of nodes but it was often disabled because it caused more problems than it solved. We don't want one misconfigured pipeline spawning AMs/tasks that always fail to cause the RM to think all nodes are bad and bring the cluster to a halt. It's difficult to discern whether a failure was the node's fault or the job's fault (or sometimes neither was at fault). I think the best approach initially is to implement an application-specific blacklisting approach, where the RM will track bad nodes per application rather than across applications. That way an AM that isn't working on a node can be tried on another node, but a misconfigured/specialized AM won't break the node for other AMs/tasks that work just fine on that node. The drawback of course is that if the node really is totally bad then each application has to learn that separately. > Failed AM attempt retries on same failed host > - > > Key: YARN-4217 > URL: https://issues.apache.org/jira/browse/YARN-4217 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications >Affects Versions: 2.7.1 >Reporter: Eric Payne > > This happens when the cluster is maxed out. One node is going bad, so > everything that happens on it fails, so the bad node is never busy. Since the > cluster is maxed out, when the RM looks for a node with available resources, > it will always find the almost bad one because nothing can run on it so it > has available resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
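A minimal sketch of the per-application bookkeeping described above; it is only meant to show the shape of the idea, not the RM's actual blacklist implementation, and the class and method names are hypothetical:
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.NodeId;

public class PerAppBlacklistSketch {
  // Bad nodes are tracked per application, so one application's failures do
  // not poison scheduling decisions for every other application.
  private final Map<ApplicationId, Set<NodeId>> appBlacklists = new HashMap<>();

  void recordAmFailure(ApplicationId appId, NodeId node) {
    appBlacklists.computeIfAbsent(appId, id -> new HashSet<>()).add(node);
  }

  boolean isBlacklistedFor(ApplicationId appId, NodeId node) {
    return appBlacklists.getOrDefault(appId, Collections.emptySet()).contains(node);
  }
}
{code}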
[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942434#comment-14942434 ] Eric Payne commented on YARN-3216: -- Hi [~sunilg], [~leftnoteasy], and [~Naganarasimha]. Thank you all for the great work. Have you considered how the Max Application Master Resources will be presented in the GUI? I assume it will just be expressed in the existing Max Application Master Resources field under the partition-specific tab in the scheduler page. Is that correct? > Max-AM-Resource-Percentage should respect node labels > - > > Key: YARN-3216 > URL: https://issues.apache.org/jira/browse/YARN-3216 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3216.patch, 0002-YARN-3216.patch > > > Currently, max-am-resource-percentage considers default_partition only. When > a queue can access multiple partitions, we should be able to compute > max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769-branch-2.7.002.patch YARN-3769-branch-2.002.patch Thank you very much, [~leftnoteasy], for your suggestions and help reviewing this patch. I am attaching an updated patch (version 002) for both branch-2.7 and branch-2. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4217) Failed AM attempt retries on same failed host
[ https://issues.apache.org/jira/browse/YARN-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-4217. -- Resolution: Duplicate bq. Eric Payne - is this a duplicate of YARN-2005? [~vvasudev], yes it is. I did do a search, but I missed that one. Thanks a lot! > Failed AM attempt retries on same failed host > - > > Key: YARN-4217 > URL: https://issues.apache.org/jira/browse/YARN-4217 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications >Affects Versions: 2.7.1 >Reporter: Eric Payne > > This happens when the cluster is maxed out. One node is going bad, so > everything that happens on it fails, so the bad node is never busy. Since the > cluster is maxed out, when the RM looks for a node with available resources, > it will always find the almost bad one because nothing can run on it so it > has available resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI
Eric Payne created YARN-4226: Summary: Make capacity scheduler queue's preemption status REST API consistent with GUI Key: YARN-4226 URL: https://issues.apache.org/jira/browse/YARN-4226 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor In the capacity scheduler GUI, the preemption status has the following form: {code} Preemption: disabled {code} However, the REST API shows the following for the same status: {code} "preemptionDisabled":true {code} The latter is confusing and should be consistent with the format in the GUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Component/s: capacity scheduler > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: (was: YARN-3769-branch-2.002.patch) > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.7.002.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Summary: Add preemption status to yarn queue -status (was: Add preemption status to {{yarn queue -status}}) > Add preemption status to yarn queue -status > --- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4225) Add preemption status to {{yarn queue -status}}
Eric Payne created YARN-4225: Summary: Add preemption status to {{yarn queue -status}} Key: YARN-4225 URL: https://issues.apache.org/jira/browse/YARN-4225 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769-branch-2.002.patch > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Summary: Add preemption status to yarn queue -status for capacity scheduler (was: Add preemption status to yarn queue -status) > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: (was: YARN-4225.002.patch) > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.002.patch Attaching {{YARN-4225.002.patch}}, which changes {{getPreemptionDisabled()}} to return a {{Boolean}} and makes {{QueueCLI#printQueueInfo}} check for non-null before printing the queue status. Patch applies cleanly to trunk, branch-2, and branch-2.8. {quote} In General, what is the Hadoop policy when a newer client talks to an older server and the protobuf output is different than expected. Should we expose some form of the has method, or should we overload the get method as I described here? I would appreciate any additional feedback from the community in general (Vinod Kumar Vavilapalli, do you have any thoughts?) {quote} [~vinodkv], did you have a chance to think about this? [~jlowe], do you have any additional thoughts? > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
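To make the approach in the comment above concrete, here is a minimal, self-contained sketch of a getter that folds the protobuf {{has}} check into {{get}}, plus the corresponding null check on the CLI side. The {{QueueInfoProtoOrBuilder}} interface below is a stand-in for the generated protobuf type, and the method names are assumptions following the usual protobuf conventions; this illustrates the idea, it is not the actual patch.
{code}
// Stand-in for the generated protobuf interface; the real Hadoop types differ.
interface QueueInfoProtoOrBuilder {
  boolean hasPreemptionDisabled();
  boolean getPreemptionDisabled();
}

class QueueInfoSketch {
  private final QueueInfoProtoOrBuilder proto;

  QueueInfoSketch(QueueInfoProtoOrBuilder proto) {
    this.proto = proto;
  }

  // Returns null when the field is absent, e.g. when a newer client
  // talks to an older server that never set it.
  Boolean getPreemptionDisabled() {
    if (!proto.hasPreemptionDisabled()) {
      return null;
    }
    return proto.getPreemptionDisabled();
  }

  // CLI side (in the spirit of QueueCLI#printQueueInfo): only print the
  // preemption line when the status is actually known.
  static void printPreemptionStatus(Boolean preemptionDisabled,
      java.io.PrintWriter out) {
    if (preemptionDisabled != null) {
      out.println("Preemption : "
          + (preemptionDisabled ? "disabled" : "enabled"));
    }
  }
}
{code}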
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.003.patch Sorry, I mis-named the patch. Should have been {{YARN-4225.003.patch}} > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036560#comment-15036560 ] Eric Payne commented on YARN-4225: -- Thanks [~leftnoteasy], for your helpful comments. bq. Do you think is it better to return boolean? I'd prefer to return a default value (false) instead of return null This is the nature of the question that I have about the more general Hadoop policy, and which [~jlowe] and I were discussing in the comments above. Basically, the use case is that a newer client is querying an older server, so some of the newer protobuf entries that the client expects may not exist. In that case, we have two options that I can see: # The client exposes both the {{get}} protobuf method and the {{has}} protobuf method for the structure in question # We overload the {{get}} protobuf method to do the {{has}} checking internally and return NULL if the field doesn't exist. I actually prefer the second option because it exposes only one method. But, I would like to know the opinion of others and whether there is already a precedent for this use case. > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
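For illustration, a small sketch of how the two options above read from the caller's side; the interfaces and names below are hypothetical, not the actual {{QueueInfo}} API.
{code}
// Option 1: the record exposes both methods, and every caller pairs them.
interface WithHasMethod {
  boolean hasPreemptionDisabled();
  boolean getPreemptionDisabled();
}

// Option 2: a single getter does the has() check internally and returns
// null when the field never arrived from the (older) server.
interface WithOverloadedGetter {
  Boolean getPreemptionDisabled();
}

class CallerComparison {
  static String describe(WithHasMethod q) {
    return q.hasPreemptionDisabled()
        ? (q.getPreemptionDisabled() ? "disabled" : "enabled")
        : "unknown";
  }

  static String describe(WithOverloadedGetter q) {
    Boolean disabled = q.getPreemptionDisabled();
    return disabled == null ? "unknown" : (disabled ? "disabled" : "enabled");
  }
}
{code}
Either way the client has to handle the "unknown" case explicitly; the second option just concentrates that handling in one place.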
[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request
[ https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034633#comment-15034633 ] Eric Payne commented on YARN-4108: -- Thanks very much [~leftnoteasy] for creating this POC. Just a quick note: the patch no longer applies cleanly to trunk or branch-2.8. > CapacityScheduler: Improve preemption to preempt only those containers that > would satisfy the incoming request > -- > > Key: YARN-4108 > URL: https://issues.apache.org/jira/browse/YARN-4108 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4108-design-doc-v1.pdf, YARN-4108.poc.1.patch > > > This is a sibling JIRA for YARN-2154. We should make sure container preemption > is more effective. > *Requirements:* > 1) Can handle case of user-limit preemption > 2) Can handle case of resource placement requirements, such as: hard-locality > (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I > don't want to use rack1 and host\[1-3\]) > 3) Can handle preemption within a queue: cross user preemption (YARN-2113), > cross application preemption (such as priority-based (YARN-1963) / > fairness-based (YARN-3319)). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page
[ https://issues.apache.org/jira/browse/YARN-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4422: - Attachment: YARN-4422.001.patch Attaching {{YARN-4422.001.patch}}. [~jeagles] or [~jlowe], would you mind taking a look? The problem was that when the Applications page in the Generic AHS renders, it depends on a MASTER_CONTAINER_EVENT_INFO being in the AppAttemptReport. If it's not there, it gives up on trying to print the start time, node, or log links. The reason that information then appears when you click on the app attempt link is that when the Application Attempt page renders, it just gets the whole list of containers for the app attempt and prints that information for each one, including the AM container, but it still doesn't have an indication of which one is the AM container. The reason the MASTER_CONTAINER_EVENT_INFO isn't in the AppAttemptReport is that it is provided by the REGISTER event in the System Metrics Publisher, and since this use case never gets to the point of AM registration, the MASTER_CONTAINER_EVENT_INFO isn't there. However, in all of these cases, the AM container does get a FINISHED event. I fixed this by adding the MASTER_CONTAINER_EVENT_INFO to the FINISHED event. > Generic AHS sometimes doesn't show started, node, or logs on App page > - > > Key: YARN-4422 > URL: https://issues.apache.org/jira/browse/YARN-4422 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: AppAttemptPage no container or node.jpg, AppPage no logs > or node.jpg, YARN-4422.001.patch > > > Sometimes the AM container for an app isn't able to start the JVM. This can > happen if bogus JVM options are given to the AM container ( > {{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when > misconfiguring the AM container's environment variables > ({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz}}) > When the AM container for an app isn't able to start the JVM, the Application > page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and > {{Logs}} columns. It _does_ have links for each app attempt, and if you click > on one of them, you go to the Application Attempt page, where you can see all > containers with links to their logs and nodes, including the AM container. > But none of that shows up for the app attempts on the Application page. > Also, on the Application Attempt page, in the {{Application Attempt > Overview}} section, the {{AM Container}} value is {{null}} and the {{Node}} > value is {{N/A}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
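A rough sketch of the idea described above, kept self-contained rather than copied from the patch: when the attempt's FINISHED timeline event is built, also record which container was the AM, so the Applications page can render started/node/logs even if the AM never registered. The map keys and the helper below are placeholders, not the real System Metrics Publisher code.
{code}
import java.util.HashMap;
import java.util.Map;

class AttemptFinishedEventSketch {
  // Placeholder for the real MASTER_CONTAINER_EVENT_INFO constant.
  static final String MASTER_CONTAINER_EVENT_INFO = "MASTER_CONTAINER_EVENT_INFO";

  static Map<String, Object> buildFinishedEventInfo(String diagnostics,
      String finalStatus, String masterContainerId) {
    Map<String, Object> eventInfo = new HashMap<String, Object>();
    eventInfo.put("DIAGNOSTICS_INFO", diagnostics);   // placeholder key
    eventInfo.put("FINAL_STATUS", finalStatus);       // placeholder key
    // The fix: publish the AM container id on FINISHED as well, since the
    // REGISTER event that normally carries it may never have happened.
    eventInfo.put(MASTER_CONTAINER_EVENT_INFO, masterContainerId);
    return eventInfo;
  }
}
{code}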
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.004.patch Thanks very much [~leftnoteasy], for your review and helpful comments. {quote} I'm OK with both approach - existing one in latest patch or simply return false if there's no such field in proto. {quote} So, if I understand correctly, you are okay with {{QueueInfo#getPreemptionDisabled}} returning {{Boolean}} with the possibility of returning {{null}} if the field doesn't exist. With that understanding, I'm leaving that in the latest patch. {quote} 2) For QueueCLI, is it better to print "preemption is disabled/enabled" instead of "preemption status: disabled/enabled"? {quote} Actually, I think that leaving it as "Preemption : disabled/enabled" is more consistent with the way the other properties are displayed. What do you think? {quote} 3) Is it possible to add a simple test to verify end-to-end behavior? {quote} I added a couple of tests to {{TestYarnCLI}}. Good suggestion. > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
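The tests mentioned above live in {{TestYarnCLI}}; as a stand-alone illustration of the behaviour being verified (not the actual test code, which drives the real CLI), a toy JUnit test of the null-handling rule might look like this:
{code}
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class PreemptionStatusFormatTest {

  // Same rule as discussed in the thread: print nothing when the status
  // is unknown (null), otherwise "Preemption : disabled/enabled".
  static String format(Boolean preemptionDisabled) {
    if (preemptionDisabled == null) {
      return "";
    }
    return "Preemption : " + (preemptionDisabled ? "disabled" : "enabled");
  }

  @Test
  public void printsStatusWhenKnown() {
    assertEquals("Preemption : disabled", format(Boolean.TRUE));
    assertEquals("Preemption : enabled", format(Boolean.FALSE));
  }

  @Test
  public void omitsStatusWhenTalkingToOlderServer() {
    assertEquals("", format(null));
  }
}
{code}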
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049115#comment-15049115 ] Eric Payne commented on YARN-4225: -- I'd like to address the issues raised by the above pre-commit build: - Unit Tests: The following unit tests failed during the above pre-commit build, but they all pass for me in my local build environment: ||Test Name||Modified by this patch||Pre-commit failure|| |hadoop.yarn.client.api.impl.TestAMRMClient|No|Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0| |hadoop.yarn.client.api.impl.TestNMClient|No|Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0| |hadoop.yarn.client.api.impl.TestYarnClient|No|TEST TIMED OUT| |hadoop.yarn.client.cli.TestYarnCLI|Yes|Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0| |hadoop.yarn.client.TestGetGroups|No|java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "48cbb2d33ebc":8033; java.net.UnknownHostException| |hadoop.yarn.server.resourcemanager.TestAMAuthorization|No|java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "48cbb2d33ebc":8030; java.net.UnknownHostException| |hadoop.yarn.server.resourcemanager.TestClientRMTokens|No|java.lang.NullPointerException:| - Findbugs warnings: {{org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getPreemptionDisabled() has Boolean return type and returns explicit null At QueueInfoPBImpl.java:and returns explicit null At QueueInfoPBImpl.java:[line 402]}} This is a result of {{QueueInfo#getPreemptionDisabled}} returning a Boolean. Again, we could expose the {{hasPreemptionDisabled}} method and use that instead. - JavaDocs warnings/failures: I don't think these are caused by this patch: {{[WARNING] The requested profile "docs" could not be activated because it does not exist.}} {{[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.8.1:javadoc (default-cli) on project hadoop-yarn-server-resourcemanager: An error has occurred in JavaDocs report generation:}} {{...}} > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page
[ https://issues.apache.org/jira/browse/YARN-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4422: - Attachment: AppAttemptPage no container or node.jpg AppPage no logs or node.jpg > Generic AHS sometimes doesn't show started, node, or logs on App page > - > > Key: YARN-4422 > URL: https://issues.apache.org/jira/browse/YARN-4422 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: AppAttemptPage no container or node.jpg, AppPage no logs > or node.jpg > > > Sometimes the AM container for an app isn't able to start the JVM. This can > happen if bogus JVM options are given to the AM container ( > {{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when > misconfiguring the AM container's environment variables > ({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz}}) > When the AM container for an app isn't able to start the JVM, the Application > page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and > {{Logs}} columns. It _does_ have links for each app attempt, and if you click > on one of them, you go to the Application Attempt page, where you can see all > containers with links to their logs and nodes, including the AM container. > But none of that shows up for the app attempts on the Application page. > Also, on the Application Attempt page, in the {{Application Attempt > Overview}} section, the {{AM Container}} value is {{null}} and the {{Node}} > value is {{N/A}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page
Eric Payne created YARN-4422: Summary: Generic AHS sometimes doesn't show started, node, or logs on App page Key: YARN-4422 URL: https://issues.apache.org/jira/browse/YARN-4422 Project: Hadoop YARN Issue Type: Bug Reporter: Eric Payne Assignee: Eric Payne Sometimes the AM container for an app isn't able to start the JVM. This can happen if bogus JVM options are given to the AM container ( {{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when misconfiguring the AM container's environment variables ({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz}}) When the AM container for an app isn't able to start the JVM, the Application page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and {{Logs}} columns. It _does_ have links for each app attempt, and if you click on one of them, you go to the Application Attempt page, where you can see all containers with links to their logs and nodes, including the AM container. But none of that shows up for the app attempts on the Application page. Also, on the Application Attempt page, in the {{Application Attempt Overview}} section, the {{AM Container}} value is {{null}} and the {{Node}} value is {{N/A}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769-branch-2.6.001.patch Attaching {{YARN-3769-branch-2.6.001.patch}} for backport to branch-2.6. The TestLeafQueue unit test for multiple apps from multiple users had to be modified to allow all apps to be active at the same time, since the number of active apps is calculated differently in 2.6 than in 2.7. > Consider user limit when calculating total pending resource for preemption > policy in Capacity Scheduler > --- > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Fix For: 2.7.3 > > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.6.001.patch, YARN-3769-branch-2.7.002.patch, > YARN-3769-branch-2.7.003.patch, YARN-3769-branch-2.7.005.patch, > YARN-3769-branch-2.7.006.patch, YARN-3769-branch-2.7.007.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch, YARN-3769.005.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
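One plausible way for a {{TestLeafQueue}}-style test to keep every submitted app active under the 2.6 limit is to give the queue its full capacity for AM containers; this is an assumption about the approach, not taken from the branch-2.6 patch, and the queue path is illustrative.
{code}
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration;

class AllAppsActiveConfSketch {
  static CapacitySchedulerConfiguration build() {
    CapacitySchedulerConfiguration conf = new CapacitySchedulerConfiguration();
    // Letting AMs use the whole queue keeps the 2.6-style
    // active-application limit from deactivating later apps.
    conf.setMaximumApplicationMasterResourcePerQueuePercent("root.a", 1.0f);
    return conf;
  }
}
{code}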
[jira] [Commented] (YARN-4461) Redundant nodeLocalityDelay log in LeafQueue
[ https://issues.apache.org/jira/browse/YARN-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061298#comment-15061298 ] Eric Payne commented on YARN-4461: -- Thanks a lot, [~jlowe] and [~leftnoteasy]! > Redundant nodeLocalityDelay log in LeafQueue > > > Key: YARN-4461 > URL: https://issues.apache.org/jira/browse/YARN-4461 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Eric Payne >Priority: Trivial > Fix For: 2.8.0 > > Attachments: YARN-4461.001.patch > > > In LeafQueue#setupQueueConfigs there's a redundant log of nodeLocalityDelay: > {code} > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > "labels=" + labelStrBuilder.toString() + "\n" + > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061296#comment-15061296 ] Eric Payne commented on YARN-4225: -- Thanks a lot, [~leftnoteasy] > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch, YARN-4225.005.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057144#comment-15057144 ] Eric Payne commented on YARN-4225: -- bq. Could you check findbugs warning in latest Jenkins run is related or not? There's no link to findbugs result in latest Jenkins report, so I guess it's not related. [~leftnoteasy], is there something wrong with this build? I can get to https://builds.apache.org/job/PreCommit-YARN-Build/9968, but many of the other links in the comment above don't work. For example, https://builds.apache.org/job/PreCommit-YARN-Build/9968/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn-jdk1.8.0_66.txt gets a 404. I tried to get to the artifacts page, but that also comes up 404. I didn't find any findbugs report. > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch, YARN-4225.005.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.005.patch bq.Patch looks good, could you mark the findbugs warning needs to be skipped? Thanks a lot, [~leftnoteasy]. Attaching YARN-4225.005.patch with findbugs suppressed for {{org.apache.hadoop.yarn.api.records.impl.pb: NP_BOOLEAN_RETURN_NULL}} > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch, YARN-4225.005.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
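For reference, a findbugs suppression for this warning would typically be a filter entry along these lines; the enclosing findbugs-exclude.xml location and the exact match scope are assumptions, not the literal contents of the patch.
{code}
<Match>
  <Class name="org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl" />
  <Method name="getPreemptionDisabled" />
  <Bug pattern="NP_BOOLEAN_RETURN_NULL" />
</Match>
{code}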
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15062188#comment-15062188 ] Eric Payne commented on YARN-4225: -- Oh, one more thing, [~leftnoteasy]. I ran testpatch in my own build environment and it gave a +1 for the findbugs, so the above must be a glitch in the Apache pre-commit build (?). > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch, YARN-4225.005.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4461) Redundant nodeLocalityDelay log in LeafQueue
[ https://issues.apache.org/jira/browse/YARN-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne reassigned YARN-4461: Assignee: Eric Payne > Redundant nodeLocalityDelay log in LeafQueue > > > Key: YARN-4461 > URL: https://issues.apache.org/jira/browse/YARN-4461 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Eric Payne >Priority: Trivial > > In LeafQueue#setupQueueConfigs there's a redundant log of nodeLocalityDelay: > {code} > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > "labels=" + labelStrBuilder.toString() + "\n" + > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4461) Redundant nodeLocalityDelay log in LeafQueue
[ https://issues.apache.org/jira/browse/YARN-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060109#comment-15060109 ] Eric Payne commented on YARN-4461: -- The two failing tests above ({{TestClientRMTokens}} and {{TestAMAuthorization}}) both work for me in my local environment. > Redundant nodeLocalityDelay log in LeafQueue > > > Key: YARN-4461 > URL: https://issues.apache.org/jira/browse/YARN-4461 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Eric Payne >Priority: Trivial > Attachments: YARN-4461.001.patch > > > In LeafQueue#setupQueueConfigs there's a redundant log of nodeLocalityDelay: > {code} > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > "labels=" + labelStrBuilder.toString() + "\n" + > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page
[ https://issues.apache.org/jira/browse/YARN-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051061#comment-15051061 ] Eric Payne commented on YARN-4422: -- bq. Thanks! Will this fix address MAPREDUCE-5502 or MAPREDUCE-4428? It doesn't seem so, but would like to confirm. [~mingma], thanks for your interest. No, this JIRA does not fix the issue documented in MAPREDUCE-5502 or MAPREDUCE-4428. This JIRA only affects the Generic application history server's GUI and not the RM Application GUI. Also, as documented in those JIRAs, the problem is not a missing link in the GUI, but that the log history is missing altogether. > Generic AHS sometimes doesn't show started, node, or logs on App page > - > > Key: YARN-4422 > URL: https://issues.apache.org/jira/browse/YARN-4422 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Payne >Assignee: Eric Payne > Fix For: 3.0.0, 2.8.0, 2.7.3 > > Attachments: AppAttemptPage no container or node.jpg, AppPage no logs > or node.jpg, YARN-4422.001.patch > > > Sometimes the AM container for an app isn't able to start the JVM. This can > happen if bogus JVM options are given to the AM container ( > {{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when > misconfiguring the AM container's environment variables > ({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz}}) > When the AM container for an app isn't able to start the JVM, the Application > page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and > {{Logs}} columns. It _does_ have links for each app attempt, and if you click > on one of them, you go to the Application Attempt page, where you can see all > containers with links to their logs and nodes, including the AM container. > But none of that shows up for the app attempts on the Application page. > Also, on the Application Attempt page, in the {{Application Attempt > Overview}} section, the {{AM Container}} value is {{null}} and the {{Node}} > value is {{N/A}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4390) Consider container request size during CS preemption
[ https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-4390. -- Resolution: Duplicate Closing this ticket in favor of YARN-4108 > Consider container request size during CS preemption > > > Key: YARN-4390 > URL: https://issues.apache.org/jira/browse/YARN-4390 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.0.0, 2.8.0, 2.7.3 >Reporter: Eric Payne >Assignee: Eric Payne > > There are multiple reasons why preemption could unnecessarily preempt > containers. One is that an app could be requesting a large container (say > 8-GB), and the preemption monitor could conceivably preempt multiple > containers (say 8, 1-GB containers) in order to fill the large container > request. These smaller containers would then be rejected by the requesting AM > and potentially given right back to the preempted app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)