[jira] [Updated] (YARN-2932) Add entry for preemptable status to scheduler web UI and queue initialize/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2932: - Attachment: YARN-2932.v7.txt Version 7 of patch fixes new javadoc warnings. Sorry about that. Add entry for preemptable status to scheduler web UI and queue initialize/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt, YARN-2932.v4.txt, YARN-2932.v5.txt, YARN-2932.v6.txt, YARN-2932.v7.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2896) Server side PB changes for Priority Label Manager and Admin CLI support
[ https://issues.apache.org/jira/browse/YARN-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286249#comment-14286249 ] Eric Payne commented on YARN-2896: -- [~sunilg], it seems to me that labels can make things more confusing, not less, since different queues can have arbitrary names for the same concept. Using plain integers would also eliminate the need to add infrastructure for mapping, passing, and interpreting labels and priority numbers. YARN could always specify that priorities go from low to high, and each queue could then decide how high to go with the priority numbers. Also, it seems to me that a property definition like the following could specify the ACL for a particular priority level: {code} yarn.scheduler.capacity.root.queueA.5.acl=user1,user2 {code} Server side PB changes for Priority Label Manager and Admin CLI support --- Key: YARN-2896 URL: https://issues.apache.org/jira/browse/YARN-2896 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2896.patch, 0002-YARN-2896.patch, 0003-YARN-2896.patch, 0004-YARN-2896.patch Common changes: * PB support changes required for Admin APIs * PB support for File System store (Priority Label Store) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2896) Server side PB changes for Priority Label Manager and Admin CLI support
[ https://issues.apache.org/jira/browse/YARN-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14288445#comment-14288445 ] Eric Payne commented on YARN-2896: -- [~sunilg], [~leftnoteasy], and [~vinodkv], can we move this discussion to YARN-1963 in order to achieve a higher visibility? Server side PB changes for Priority Label Manager and Admin CLI support --- Key: YARN-2896 URL: https://issues.apache.org/jira/browse/YARN-2896 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2896.patch, 0002-YARN-2896.patch, 0003-YARN-2896.patch, 0004-YARN-2896.patch Common changes: * PB support changes required for Admin APIs * PB support for File System store (Priority Label Store) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3088) LinuxContainerExecutor.deleteAsUser can throw NPE if native executor returns an error
[ https://issues.apache.org/jira/browse/YARN-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne reassigned YARN-3088: Assignee: Eric Payne LinuxContainerExecutor.deleteAsUser can throw NPE if native executor returns an error - Key: YARN-3088 URL: https://issues.apache.org/jira/browse/YARN-3088 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.1-beta Reporter: Jason Lowe Assignee: Eric Payne If the native executor returns an error trying to delete a path as a particular user when dir==null, then the code can NPE trying to build a log message for the error. It blindly dereferences dir in the log message despite the code just above explicitly handling the cases when dir could be null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
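Regarding the NPE scenario in YARN-3088 described above, here is a minimal, self-contained sketch of the problematic pattern (hypothetical names; this is not the actual LinuxContainerExecutor code):
{code}
// Hypothetical illustration of the NPE pattern described above, not the
// actual LinuxContainerExecutor code: dir is checked for null when choosing
// what to delete, but blindly dereferenced when building the error message.
public class DeleteAsUserNpeDemo {
  public static void main(String[] args) {
    java.io.File dir = null;      // caller asked to delete "as much as possible"
    int returnCode = 1;           // pretend the native executor failed
    String target = (dir == null) ? "all user directories" : dir.getPath();
    System.out.println("deleting " + target);
    if (returnCode != 0) {
      // Throws NullPointerException when dir == null, which is the bug.
      System.err.println("delete failed for " + dir.getPath()
          + ", exit code: " + returnCode);
    }
  }
}
{code}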
[jira] [Assigned] (YARN-3089) LinuxContainerExecutor does not handle file arguments to deleteAsUser
[ https://issues.apache.org/jira/browse/YARN-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne reassigned YARN-3089: Assignee: Eric Payne LinuxContainerExecutor does not handle file arguments to deleteAsUser - Key: YARN-3089 URL: https://issues.apache.org/jira/browse/YARN-3089 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Blocker YARN-2468 added the deletion of individual logs that are aggregated, but this fails to delete log files when the LCE is being used. The LCE native executable assumes the paths being passed are directories, and the delete fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3074) Nodemanager dies when localizer runner tries to write to a full disk
[ https://issues.apache.org/jira/browse/YARN-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294130#comment-14294130 ] Eric Payne commented on YARN-3074: -- [~varun_saxena], Thanks for posting this patch. Rather than duplicating the catch blocks, I would like to see the {{catch}} blocks save off the exception and fserror, then process it during the {{finally}} block. So, what I'm suggesting is before the {{try}} block, add a {{Throwable}} variable: {code} Throwable t = null; {code} In the catch blocks, save the exception and error: {code} } catch (Exception e) { t = e; } catch (FSError fse) { t = fse; } {code} Then, move what used to be in the original {{catch (Exception e)}} block into the {{finally}} block surrounded by {code} if (t != null) { ... } {code} Also, please add a unit test. Nodemanager dies when localizer runner tries to write to a full disk Key: YARN-3074 URL: https://issues.apache.org/jira/browse/YARN-3074 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-3074.001.patch When a LocalizerRunner tries to write to a full disk it can bring down the nodemanager process. Instead of failing the whole process we should fail only the container and make a best attempt to keep going. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
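To make the YARN-3074 suggestion above concrete, here is a rough sketch of the consolidated try/catch/finally structure (method names are stand-ins, not the actual LocalizerRunner code):
{code}
import org.apache.hadoop.fs.FSError;

// Illustrative sketch of the suggested structure; not the actual LocalizerRunner code.
class LocalizerRunnerSketch {
  void run() {
    Throwable t = null;
    try {
      doLocalization();                  // hypothetical stand-in for the real work
    } catch (Exception e) {
      t = e;
    } catch (FSError fse) {
      t = fse;
    } finally {
      if (t != null) {
        // what used to be in the original catch (Exception e) block:
        // fail only the container and keep the NodeManager running
        handleLocalizationFailure(t);    // hypothetical helper
      }
    }
  }

  private void doLocalization() throws Exception { /* ... */ }

  private void handleLocalizationFailure(Throwable cause) { /* log and fail the container */ }
}
{code}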
[jira] [Commented] (YARN-2932) Add entry for preemptable status (enabled/disabled) to scheduler web UI and queue initialize/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295218#comment-14295218 ] Eric Payne commented on YARN-2932: -- Thank you for your input and review, [~leftnoteasy] Add entry for preemptable status (enabled/disabled) to scheduler web UI and queue initialize/refresh logging -- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Fix For: 2.7.0 Attachments: Screenshot.Queue.Preemption.Disabled.jpg, YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt, YARN-2932.v4.txt, YARN-2932.v5.txt, YARN-2932.v6.txt, YARN-2932.v7.txt, YARN-2932.v8.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290300#comment-14290300 ] Eric Payne commented on YARN-1963: -- +1 on using numbers and not labels. It seems that the use of labels adds more complexity in mapping, sending via PB, and converting back to numbers, and does not seem to add much clarity. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3089) LinuxContainerExecutor does not handle file arguments to deleteAsUser
[ https://issues.apache.org/jira/browse/YARN-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298637#comment-14298637 ] Eric Payne commented on YARN-3089: -- Thank you, [~sunilg], for your review of this patch. {quote} {code} int subDirEmptyStr = (subdir == NULL || subdir[0] == 0); {code} I think strlen(subdir) also has to be checked against 0, correct? {quote} Checking {{strlen(subdir) == 0}} would do exactly the same thing that {{subdir[0] == 0}} does, which is check that the first byte in the string is 0. Inside {{strlen}}, that test takes the form {{*s == '\0'}}, but it amounts to the same thing. Checking for an empty string directly, as the existing patch does, avoids the overhead of another function call. LinuxContainerExecutor does not handle file arguments to deleteAsUser - Key: YARN-3089 URL: https://issues.apache.org/jira/browse/YARN-3089 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Blocker Attachments: YARN-3089.v1.txt YARN-2468 added the deletion of individual logs that are aggregated, but this fails to delete log files when the LCE is being used. The LCE native executable assumes the paths being passed are directories, and the delete fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2932: - Attachment: YARN-2932.v2.txt Thanks very much, [~leftnoteasy], for your thorough review of this patch and for your helpful comments. {quote} 1) Since the QUEUE_PREEMPTION_DISABLED is an option for CS, I suggest making it a member of CapacitySchedulerConfiguration, like getUserLimitFactor/setUserLimit, etc. This will avoid some String operations. {quote} This is a good idea. I added {{isQueuePreemptable}} and {{setQueuePreemptable}}. For {{isQueuePreemptable}}, I needed to add a default value parameter because the default for the queue at a particular level should be whatever its parent's value is. {quote} 2) Rename {{context}} in {{AbstractCSQueue}} to a name like {{csContext}} since we have {{rmContext}} {quote} Renamed. {quote} 3) I suggest adding a member var like {{preemptable}} to {{AbstractCSQueue}}, instead of calling: {code} + @Private + public boolean isPreemptable() { +return context.getConfiguration().isPreemptable(getQueuePath()); + } {code} The implementation of {{CSConfiguration.isPreemptable(..)}} seems too complex to me. {{CSConfiguration}} should only care about values from the configuration file; such logic should be put in {{AbstractCSQueue.setupQueueConfigs(...)}} {quote} I moved the logic to {{AbstractCSQueue.setupQueueConfigs(...)}}, and you are right. It is much cleaner that way. Thanks! {quote} 4) It's better to keep the web UI name (preemptable) and configuration name (disable_preemption) consistent. I prefer preemptable personally. {quote} Yes, it is less confusing that way. In this patch, the only code that deals with the {{disable_preemption}} property is internal to the {{CSConfiguration}} methods. The APIs are now all asking whether or not the queue is preemptable. {quote} 5) {{testIsPreemptable}} should be a part of {{TestCapacityScheduler}} instead of putting it to {{TestProportionalCapacityPreemptionPolicy}}. {quote} Thanks. I moved {{testIsPreemptable}} to {{TestCapacityScheduler}}. However, since the interface for changing a queue's preemptability changed, there were also several changes to {{TestProportionalCapacityPreemptionPolicy}}. {quote} 6) In {{ProportionalCapacityPreemptionPolicy.cloneQueues}}, the preemptable field should be gotten from the Queue instead of from the configuration. {quote} Done. Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
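For reference, a rough sketch of what the {{CapacitySchedulerConfiguration}} accessors discussed above might look like (the property name and method bodies are assumptions based on this conversation, not the committed patch):
{code}
// Rough sketch of the accessors discussed above; assumptions, not the committed patch.
// The stored property is "disable_preemption", so the boolean is inverted
// relative to "preemptable".
public boolean isQueuePreemptable(String queuePath, boolean defaultVal) {
  String propName = getQueuePrefix(queuePath) + "disable_preemption";
  return !getBoolean(propName, !defaultVal);
}

public void setQueuePreemptable(String queuePath, boolean preemptable) {
  String propName = getQueuePrefix(queuePath) + "disable_preemption";
  setBoolean(propName, !preemptable);
}
{code}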
[jira] [Updated] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2932: - Attachment: YARN-2932.v3.txt Upmerged and uploading new patch (v3). Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275686#comment-14275686 ] Eric Payne commented on YARN-2932: -- [~leftnoteasy], thanks very much for your review and comments: bq. 1. Rename {{isQueuePreemptable}} to {{getQueuePreemptable}} for getter/setter consistency in {{CapacitySchedulerConfiguration}} Renamed. bq. 2. Should consider queue reinitialize when queue preemptable in configuration changes (See {{TestQueueParsing}}). And it's best to add a test to verify that. I'm sorry. I don't understand what you mean by the word "consider". Calling {{CapacityScheduler.reinitialize}} will follow the queue hierarchy down and eventually call {{AbstractCSQueue#setupQueueConfigs}} for every queue, so I don't think there is any additional code needed, unless I'm missing something. Were you just saying that I need to add a test case for that? {quote} 3. It's better to remove the {{defaultVal}} parameter in {{CapacitySchedulerConfiguration.isPreemptable}}: {code} public boolean isQueuePreemptable(String queue, boolean defaultVal) {code} And the default_value should be placed in {{CapacitySchedulerConfiguration}}, like other queue configuration options. I understand what you're trying to do is move some logic from the queue to {{CapacitySchedulerConfiguration}}, but I still think it's better to keep {{CapacitySchedulerConfiguration}} simple: it should just get values from the configuration file. {quote} The problem is that without the {{defaultVal}} parameter, {{AbstractCSQueue#isQueuePathHierarchyPreemptable}} can't tell if the queue has explicitly set its preemptability or if it is just returning the default. For example: {code} root: disable_preemption = true root.A: disable_preemption (the property is not set) root.B: disable_preemption = false (the property is explicitly set to false) {code} Let's say the {{getQueuePreemptable}} interface is changed to remove the {{defaultVal}} parameter, and that when {{getQueuePreemptable}} calls {{getBoolean}}, it uses {{false}} as the default. # {{getQueuePreemptable}} calls {{getBoolean}} on {{root}} ## {{getBoolean}} returns {{true}} because the {{disable_preemption}} property is set to {{true}} ## {{getQueuePreemptable}} inverts {{true}} and returns {{false}} (That is, {{root}} has preemption disabled, so it is not preemptable). # {{getQueuePreemptable}} calls {{getBoolean}} on {{root.A}} ## {{getBoolean}} returns {{false}} because there is no {{disable_preemption}} property set for this queue, so {{getBoolean}} returns the default. ## {{getQueuePreemptable}} inverts {{false}} and returns {{true}} # {{getQueuePreemptable}} calls {{getBoolean}} on {{root.B}} ## {{getBoolean}} returns {{false}} because the {{disable_preemption}} property is set to {{false}} for this queue ## {{getQueuePreemptable}} inverts {{false}} and returns {{true}} At this point, {{isQueuePathHierarchyPreemptable}} needs to know if it should use the default preemption from {{root}} or if it should use the value from each child queue. In the case of {{root.A}}, the value from {{root}} ({{false}}) should be used because {{root.A}} does not have the property set. In the case of {{root.B}}, the value should be the one returned for {{root.B}} ({{true}}) because it is explicitly set. But since {{root.A}} and {{root.B}} both returned {{true}}, {{isQueuePathHierarchyPreemptable}} can't tell the difference. Does that make sense?
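To illustrate the point above, here is how the hierarchy walk can use the {{defaultVal}} parameter against the example configuration (names are assumptions based on this discussion, not the committed patch):
{code}
// Sketch of the inheritance logic discussed above; assumed names, not the
// committed patch. Each queue passes its parent's effective value as the
// default, so a queue that does not set disable_preemption inherits the
// parent's setting, while an explicit setting (root.B here) overrides it.
boolean rootPreemptable = conf.isQueuePreemptable("root", true);              // false (disable_preemption=true)
boolean aPreemptable    = conf.isQueuePreemptable("root.A", rootPreemptable); // false (inherited from root)
boolean bPreemptable    = conf.isQueuePreemptable("root.B", rootPreemptable); // true  (explicitly preemptable)
{code}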
Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-2932.v1.txt, YARN-2932.v2.txt, YARN-2932.v3.txt YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3074) Nodemanager dies when localizer runner tries to write to a full disk
[ https://issues.apache.org/jira/browse/YARN-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315147#comment-14315147 ] Eric Payne commented on YARN-3074: -- [~varun_saxena], Thank you for the updated patch! +1 Patch LGTM Nodemanager dies when localizer runner tries to write to a full disk Key: YARN-3074 URL: https://issues.apache.org/jira/browse/YARN-3074 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-3074.001.patch, YARN-3074.002.patch, YARN-3074.03.patch When a LocalizerRunner tries to write to a full disk it can bring down the nodemanager process. Instead of failing the whole process we should fail only the container and make a best attempt to keep going. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363790#comment-14363790 ] Eric Payne commented on YARN-1963: -- {quote} I think label-based and integer-based priorities are just two different ways to configure as well as API. No matter whether we choose to use label-based or integer-based priority, we should use integers only to implement the internal logic (like in CapacityScheduler). {quote} I think that is true; especially when passing priorities through protocol buffers, using integers is best. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: 0001-YARN-1963-prototype.patch, YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357600#comment-14357600 ] Eric Payne commented on YARN-1963: -- Thanks, [~sunilg], for your work on in-queue priorities. Along with [~nroberts], I'm confused about why priority labels are needed. As a user, I just need to know that the higher the number, the higher the priority. Then, I just need a way to see what priority each application is using and a way to set the priority of applications. To me, it just seems like labels will get in the way. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: 0001-YARN-1963-prototype.patch, YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
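From the client side, "just setting an integer priority" could look like the following sketch, which uses existing YARN records (the specific value is arbitrary):
{code}
// Sketch of a client setting an integer application priority; the value 5 is arbitrary.
// app is a YarnClientApplication obtained from YarnClient#createApplication().
ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
appContext.setPriority(Priority.newInstance(5));  // higher number = higher priority
{code}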
[jira] [Commented] (YARN-2498) Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355361#comment-14355361 ] Eric Payne commented on YARN-2498: -- Hi [~leftnoteasy]. Great job on this patch. I have one minor nit: Would you mind changing {{duductAvailableResourceAccordingToLabel}} to {{deductAvailableResourceAccordingToLabel}}? That is, {{duduct...}} should be {{deduct...}}. Respect labels in preemption policy of capacity scheduler - Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There are 3 stages in ProportionalCapacityPreemptionPolicy: # Recursively calculate {{ideal_assigned}} for each queue. This depends on available resources, resources used/pending in each queue, and the guaranteed capacity of each queue. # Mark to-be-preempted containers: for each over-satisfied queue, mark some containers that will be preempted. # Notify the scheduler about to-be-preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when there are resources available in the cluster, we shouldn't assign them to a queue (by increasing {{ideal_assigned}}) if the queue cannot access such labels. For #2, when we decide whether to preempt a container, we need to make sure the resource this container holds is *possibly* usable by a queue which is under-satisfied and has pending resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3275) Preemption happening on non-preemptable queues
Eric Payne created YARN-3275: Summary: Preemption happening on non-preemptable queues Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN:3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3275) Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340393#comment-14340393 ] Eric Payne commented on YARN-3275: -- This situation can happen under the following conditions: - All of the resources in the cluster are being used - {{QueueA}} is preemptable and over its absolute capacity (AKA guaranteed capacity) - {{QueueB}} is not preemptable, over its absolute capacity, also over its absolute max capacity (which can happen), and asking for more resources In the above scenario, {{ProportionalCapacityPreemptionPolicy}} will subtract {{QueueB}}'s ideal assigned value from its absolute max capacity value and get a negative number. This adjusts its ideally assigned resources downward by that amount, which results in that amount getting preempted. Regardless of the reason, if a queue is marked as non-preemptable, resources should never be preempted from that queue. Preemption happening on non-preemptable queues -- Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN:3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
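A small worked example of the arithmetic described above, using hypothetical numbers:
{code}
// Hypothetical numbers illustrating the arithmetic described above.
int queueBAbsMaxCapacity = 20;  // QueueB's absolute max capacity
int queueBIdealAssigned  = 25;  // QueueB is over its absolute max capacity
int delta = queueBAbsMaxCapacity - queueBIdealAssigned;  // -5
// ideal_assigned is adjusted downward by 5, so 5 resources end up being
// preempted from QueueB even though the queue is marked non-preemptable.
{code}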
[jira] [Resolved] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.
[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-2592. -- Resolution: Invalid Preemption can kill containers to fulfil need of already over-capacity queue. - Key: YARN-2592 URL: https://issues.apache.org/jira/browse/YARN-2592 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.1 Reporter: Eric Payne There are scenarios in which one over-capacity queue can cause preemption of another over-capacity queue. However, since killing containers may lose work, it doesn't make sense to me to kill containers to feed an already over-capacity queue. Consider the following: {code} root has A,B,C, total capacity = 90 A.guaranteed = 30, A.pending = 5, A.current = 40 B.guaranteed = 30, B.pending = 0, B.current = 50 C.guaranteed = 30, C.pending = 0, C.current = 0 {code} In this case, the queue preemption monitor will kill 5 resources from queue B so that queue A can pick them up, even though queue A is already over its capacity. This could lose any work that those containers in B had already done. Is there a use case for this behavior? It seems to me that if a queue is already over its capacity, it shouldn't destroy the work of other queues. If the over-capacity queue needs more resources, that seems to be a problem that should be solved by increasing its guarantee. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.
[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340241#comment-14340241 ] Eric Payne commented on YARN-2592: -- Closing this. It is expected that, as long as there are available resources, queue usage should grow evenly based on each queue's percentage of absolute capacity, and preemption can happen to fill this growth as long as absolute max capacity is not exceeded and the queues grow evenly. Preemption can kill containers to fulfil need of already over-capacity queue. - Key: YARN-2592 URL: https://issues.apache.org/jira/browse/YARN-2592 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.1 Reporter: Eric Payne There are scenarios in which one over-capacity queue can cause preemption of another over-capacity queue. However, since killing containers may lose work, it doesn't make sense to me to kill containers to feed an already over-capacity queue. Consider the following: {code} root has A,B,C, total capacity = 90 A.guaranteed = 30, A.pending = 5, A.current = 40 B.guaranteed = 30, B.pending = 0, B.current = 50 C.guaranteed = 30, C.pending = 0, C.current = 0 {code} In this case, the queue preemption monitor will kill 5 resources from queue B so that queue A can pick them up, even though queue A is already over its capacity. This could lose any work that those containers in B had already done. Is there a use case for this behavior? It seems to me that if a queue is already over its capacity, it shouldn't destroy the work of other queues. If the over-capacity queue needs more resources, that seems to be a problem that should be solved by increasing its guarantee. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3275) Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3275: - Description: YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN-3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. was: YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN:3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. Preemption happening on non-preemptable queues -- Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN-3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3275) Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3275: - Attachment: YARN-3275.v1.txt Preemption happening on non-preemptable queues -- Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3275.v1.txt YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN-3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3275) CapacityScheduler: Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3275: - Attachment: YARN-3275.v2.txt [~jlowe] and [~leftnoteasy], thank you for the reviews. Attached is an updated patch (v2) with your suggested changes. CapacityScheduler: Preemption happening on non-preemptable queues - Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Labels: capacity-scheduler Attachments: YARN-3275.v1.txt, YARN-3275.v2.txt YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN-3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3275) Preemption happening on non-preemptable queues
[ https://issues.apache.org/jira/browse/YARN-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341626#comment-14341626 ] Eric Payne commented on YARN-3275: -- Thanks very much, [~leftnoteasy], for reviewing this issue. {quote} Actually, going over max capacity is possible: with a cluster of resource = 1000G, a queue can reach its max capacity, and after the cluster resource goes down to 100G, it can be over max capacity. In addition, a parent queue can go beyond max capacity as described in YARN-3243 whether or not the cluster resource changed. But a child queue can only go beyond max capacity when the cluster resource is reduced. {quote} It is possible that the total available capacity of the cluster dropped by some percentage, causing the leaf queue to go over its abs max cap by 5%. The cluster has a large number of nodes and memory, and that value is always changing slightly as nodes are lost and re-register. This may not account for the 5% overage we saw on the small leaf queue, because that total memory number isn't varying by 5%. {quote} we haven't defined that disable-preemption is more important than max-capacity. IMO, whether we should do this JIRA or not is still debatable. {quote} I see your point. In other words, it could be argued that the preemption monitor is doing the right thing. That is, when it sees that the queue is over its absolute max capacity (which should not happen), the preemption monitor is moving those resources back into the usable pool. However, the expectation of our users is that if they are running a job on a non-preemptable queue, their containers should never be preempted. From their point of view, it doesn't matter what the reason is; they expect the RM to obey the contract that says it will not preempt their resources. Preemption happening on non-preemptable queues -- Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3275.v1.txt YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN:3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3089) LinuxContainerExecutor does not handle file arguments to deleteAsUser
[ https://issues.apache.org/jira/browse/YARN-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3089: - Attachment: YARN-3089.v1.txt LinuxContainerExecutor does not handle file arguments to deleteAsUser - Key: YARN-3089 URL: https://issues.apache.org/jira/browse/YARN-3089 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Blocker Attachments: YARN-3089.v1.txt YARN-2468 added the deletion of individual logs that are aggregated, but this fails to delete log files when the LCE is being used. The LCE native executable assumes the paths being passed are directories, and the delete fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3540) Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler
Eric Payne created YARN-3540: Summary: Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler Key: YARN-3540 URL: https://issues.apache.org/jira/browse/YARN-3540 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Priority: Blocker We are seeing this happen when - an NM's disk goes bad during the creation of map output(s) - the reducer's fetcher can read the shuffle header and reserve the memory - but gets an IOException when trying to shuffle for InMemoryMapOutput - shuffle fetch retry is enabled -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517252#comment-14517252 ] Eric Payne commented on YARN-2004: -- [~sunilg], Thanks for all of the work you are doing for this important feature. {quote} queueA: default=low queueB: default=medium The type of apps which we run may vary from queueA to B. So keeping a different default priority for each queue will help handle such a case. Assume higher-priority apps often run in queueA, and medium-priority apps in queueB. Making the default priority different can help here. {quote} I don't know a lot about the fair scheduler, but I'm pretty sure that in the capacity scheduler, there is no way to make one queue a higher priority than another. There is no way to compare job priorities between queues. That is, you can't say that jobs running in queueA have a higher priority than jobs running in queueB. So, it only makes sense to compare priorities between jobs in the same queue. Am I missing something? Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 0006-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to application while doing scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for Application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
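As a sketch of the comparator change described in the JIRA summary above (class and accessor names are assumptions, not the actual patch):
{code}
// Sketch of a priority-aware application comparator along the lines of the
// description above; class and accessor names are assumptions, not the actual patch.
Comparator<FiCaSchedulerApp> applicationComparator =
    new Comparator<FiCaSchedulerApp>() {
      @Override
      public int compare(FiCaSchedulerApp a1, FiCaSchedulerApp a2) {
        Priority p1 = a1.getApplicationPriority();
        Priority p2 = a2.getApplicationPriority();
        if (p1 != null && p2 != null && !p1.equals(p2)) {
          // order the higher-priority application first (assumed convention)
          return p2.compareTo(p1);
        }
        // otherwise fall back to the existing logic: application ID comparison
        return a1.getApplicationId().compareTo(a2.getApplicationId());
      }
    };
{code}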
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524106#comment-14524106 ] Eric Payne commented on YARN-3097: -- Thanks, [~gtCarrera9], for your interest. Although I haven't made much progress on this yet, I do still plan on working on it in the near future. Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3097: - Attachment: YARN-3097.001.patch Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526577#comment-14526577 ] Eric Payne commented on YARN-3097: -- {quote} -1 The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {quote} Since the only change in this patch is to change an info log message to a debug log message, no tests were included. Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
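The change itself amounts to something like the following sketch (variable names are illustrative, not the exact diff):
{code}
// Sketch of the change described above; variable names are illustrative.
// The recovery message is demoted from info to debug and guarded so the
// string is only built when debug logging is enabled.
if (LOG.isDebugEnabled()) {
  LOG.debug("Recovering localized resource " + remotePath + " at " + localPath);
}
{code}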
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517545#comment-14517545 ] Eric Payne commented on YARN-2004: -- [~sunilg], bq. Hope you understood my comment about priority config across queue. Pls let me know your thoughts. I think you are referring to [~leftnoteasy]'s suggestion that a cluster-wide config should be added to put a cap on the maximum priorities allowed in the queue. Is that correct? I think that makes sense so that cluster admins can put a cap on the number of priorities within any given queue. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 0006-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to application while doing scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for Application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1519) check if sysconf is implemented before using it
[ https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-1519: - Attachment: YARN-1519.003.patch [~hsn], since I didn't hear back, I will go ahead and post the patch with the changes suggested by [~raviprak]. Thanks again for doing all of the work and testing on this patch. [~raviprak], will you please have a look? Thanks. check if sysconf is implemented before using it --- Key: YARN-1519 URL: https://issues.apache.org/jira/browse/YARN-1519 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.3.0 Reporter: Radim Kolar Assignee: Radim Kolar Labels: BB2015-05-TBR Attachments: YARN-1519.002.patch, YARN-1519.003.patch, nodemgr-sysconf.txt If the sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to a segfault because an invalid pointer gets passed to a libc function. Fix: enforce a minimum value of 1024; the same method is used in the hadoop-common native code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544404#comment-14544404 ] Eric Payne commented on YARN-2069: -- Hi [~mayank_bansal]. Thanks for working through the details related to this issue. I have one small nit. In {{LeafQueue#computeTargetedUserLimit}}, it does not look like the {{MIN}} and {{MAX}} variables are ever used. CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Labels: BB2015-05-TBR Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-10.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch, YARN-2069-trunk-9.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573670#comment-14573670 ] Eric Payne commented on YARN-3769: -- [~leftnoteasy] bq. If you think it's fine, could I take a shot at it? It sounds like it would work. It's fine with me if you want to work on that. Preemption occurring unnecessarily because preemption doesn't consider user limit - Key: YARN-3769 URL: https://issues.apache.org/jira/browse/YARN-3769 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0, 2.7.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne We are seeing the preemption monitor preempting containers from queue A and then seeing the capacity scheduler giving them immediately back to queue A. This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
Eric Payne created YARN-3769: Summary: Preemption occurring unnecessarily because preemption doesn't consider user limit Key: YARN-3769 URL: https://issues.apache.org/jira/browse/YARN-3769 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0, 2.6.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne We are seeing the preemption monitor preempting containers from queue A and then seeing the capacity scheduler giving them immediately back to queue A. This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573619#comment-14573619 ] Eric Payne commented on YARN-3769: -- The following configuration will cause this:
|| queue || capacity || max || pending || used || user limit ||
| root | 100 | 100 | 40 | 90 | N/A |
| A | 10 | 100 | 20 | 70 | 70 |
| B | 10 | 100 | 20 | 20 | 20 |
One app is running in each queue. Both apps are asking for more resources, but they have each reached their user limit, so even though both are asking for more and there are resources available, no more resources are allocated to either app. The preemption monitor will see that {{B}} is asking for a lot more resources, and it will see that {{B}} is more underserved than {{A}}, so the preemption monitor will try to make the queues balance by preempting resources (10, for example) from {{A}}.
|| queue || capacity || max || pending || used || user limit ||
| root | 100 | 100 | 50 | 80 | N/A |
| A | 10 | 100 | 30 | 60 | 70 |
| B | 10 | 100 | 20 | 20 | 20 |
However, when the capacity scheduler tries to give that container to the app in {{B}}, the app will recognize that it has no headroom, and refuse the container. So the capacity scheduler offers the container again to the app in {{A}}, which accepts it because it has headroom now, and the process starts over again. Note that this happens even when used cluster resources are below 100% because the used + pending for the cluster would put it above 100%. Preemption occurring unnecessarily because preemption doesn't consider user limit - Key: YARN-3769 URL: https://issues.apache.org/jira/browse/YARN-3769 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0, 2.7.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne We are seeing the preemption monitor preempting containers from queue A and then seeing the capacity scheduler giving them immediately back to queue A. This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
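Working through the numbers in the tables above, the headroom check that causes the bounce looks roughly like this (hypothetical helper code just to restate the arithmetic):
{code}
// Hypothetical helper code restating the arithmetic from the tables above:
// the app in QueueB is already at its user limit, so its headroom is zero
// and it refuses any container the scheduler offers it.
int userLimitB = 20;
int usedB = 20;
int headroomB = Math.max(0, userLimitB - usedB);  // 0
boolean appInBCanAccept = headroomB > 0;          // false
// The preempted resources therefore go back to the app in QueueA, which does
// have headroom again (user limit 70, used 60 after preemption), and the
// preempt-and-reallocate cycle repeats.
{code}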
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573664#comment-14573664 ] Eric Payne commented on YARN-3769: -- [~leftnoteasy], {quote} One thing I've thought for a while is adding a lazy preemption mechanism, which is: when a container is marked preempted and wait for max_wait_before_time, it becomes a can_be_killed container. If there's another queue can allocate on a node with can_be_killed container, such container will be killed immediately to make room the new containers. {quote} IIUC, in your proposal, the preemption monitor would mark the containers as preemptable, and then after some configurable wait period, the capacity scheduler would be the one to do the killing if it finds that it needs the resources on that node. Is my understanding correct? Preemption occurring unnecessarily because preemption doesn't consider user limit - Key: YARN-3769 URL: https://issues.apache.org/jira/browse/YARN-3769 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0, 2.7.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne We are seeing the preemption monitor preempting containers from queue A and then seeing the capacity scheduler giving them immediately back to queue A. This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603603#comment-14603603 ] Eric Payne commented on YARN-2004: -- Thanks, [~sunilg], for this fix. - {{SchedulerApplicationAttempt.java}}: {code} if (!getApplicationPriority().equals( ((SchedulerApplicationAttempt) other).getApplicationPriority())) { return getApplicationPriority().compareTo( ((SchedulerApplicationAttempt) other).getApplicationPriority()); } {code} -- Can {{getApplicationPriority}} return null? I see that {{SchedulerApplicationAttempt}} initializes {{appPriority}} to null. - {{CapacityScheduler.java}}: {code} if (!a1.getApplicationPriority().equals(a2.getApplicationPriority())) { return a1.getApplicationPriority().compareTo( a2.getApplicationPriority()); } {code} -- Same question about {{getApplicationPriority}} returning null. -- Also, can {{updateApplicationPriority}} call {{authenticateApplicationPriority}}? Seems like duplicate code to me. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 0006-YARN-2004.patch, 0007-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to application while doing scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for Application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590747#comment-14590747 ] Eric Payne commented on YARN-2902: -- Hi [~varun_saxena]. Thank you very much for working on and fixing this issue. We are looking forward to your next patch. Do you have an ETA for when that might be? Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Attachment: YARN-3978.004.patch Thanks very much [~jeagles] for your review and comments. {quote} It is not evident to me the changes made to TestClientRMService, TestChildQueueOrder, TestLeafQueue, TestReservations, and TestFifoScheduler. Could you help to explain those changes or remove them if they are extra? {quote} In each of these tests, the {{rmcontext}} is mocked before constructing a new instance of {{RMContainerImpl}}, but the {{getYarnConfiguration}} method is not handled. Since this patch adds a dependency on {{rmContext.getYarnConfiguration()}} in the constructor for {{RMContainerImpl}}, an explicit mock for {{getYarnConfiguration}} had to be added in these tests to prevent NPE. {quote} Please update the comment + // Store system metrics for all containers only when storeContainerMetaInfo + // is true. To indicate that AM metrics publishing are delayed until later in this scenario. {quote} Done {quote} Is there a better configuration name that could be used? save-container-meta-info doesn't convey that AM container info is still published if this flag is disabled. {quote} How about {{save-non-am-container-meta-info}}? I thought about {{save-only-am-container-meta-info}}, but then {{true}} would mean that publishing of non-am containers would be turned off, and I thought that was too confusing. Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch, YARN-3978.004.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
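For context, the kind of mock addition described above looks roughly like this (illustrative; variable names may differ from the actual test code):
{code}
// Illustrative sketch of the mock addition described above; variable names
// may differ from the actual tests. Without this stub,
// rmContext.getYarnConfiguration() returns null and the RMContainerImpl
// constructor throws an NPE.
RMContext rmContext = mock(RMContext.class);
YarnConfiguration conf = new YarnConfiguration();
when(rmContext.getYarnConfiguration()).thenReturn(conf);
{code}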
[jira] [Commented] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648175#comment-14648175 ] Eric Payne commented on YARN-3978: -- {{checkstyle}} indicates that {{YarnConfiguration.java}} is too long. I will not be fixing that as part of this JIRA. Everything else from the build seems to be okay. [~jeagles], can you please have a look at this patch? Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Affects Version/s: 2.8.0 2.7.1 Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Attachment: YARN-3978.003.patch Version 003 of patch. Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Attachment: YARN-3978.002.patch Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662133#comment-14662133 ] Eric Payne commented on YARN-3250: -- Just my 2 cents: I prefer {{yarn application --appId ApplicationId --setPriority value}} Support admin cli interface in for Application Priority --- Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Attachments: 0001-YARN-3250-V1.patch Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696957#comment-14696957 ] Eric Payne commented on YARN-4014: -- {code} +pw.println(" -appId <Application ID>   ApplicationId can be used with any other"); +pw.println("                            sub commands in future. Currently it is"); +pw.println("                            used along only with -set-priority"); ... + ApplicationId can be used with any other sub commands in future. + + Currently it is used along only with -set-priority); {code} This is a minor point, but in these 2 places, I would simply state something like the following: {{ID of the affected application.}} That way, when it is used in the future by other switches, the developer doesn't have to remember to change these statements. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
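For illustration only, the generic wording suggested above might render like this; the {{pw}} PrintWriter and the exact column alignment come from the patch under review, so treat this as a sketch rather than the final text:
{code}
import java.io.PrintWriter;

// Hypothetical rendering of the suggested help text; only "-appId" and the
// generic description come from the discussion above.
public class AppCliHelpSketch {
  static void printAppIdHelp(PrintWriter pw) {
    pw.println(" -appId <Application ID>         ID of the affected application.");
  }
}
{code}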
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3978: - Attachment: YARN-3978.001.patch Attaching version 001 of the patch. @jeagles, would you like to take a look? Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
Eric Payne created YARN-3978: Summary: Configurably turn off the saving of container info in Generic AHS Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Reporter: Eric Payne Assignee: Eric Payne Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14641768#comment-14641768 ] Eric Payne commented on YARN-3978: -- Use Case: A user launches an application on a secured cluster that runs for some time and then fails within the AM (perhaps due to OOM in the AM), leaving no history in the job history server. The user doesn't notice that the job has failed until after the application has dropped off of the RM's application store. At this point, if no information was stored in the Generic Application History Service, a user must rely on a privileged system administrator to access the AM logs for them. It is desirable to activate the Generic Application History service within the timeline server so that users can access their application's information even after the RM has forgotten about their application. This app information should be kept in the GAHS for 1 week, as is done, for example, for logs in the job history server. One way that the Generic AHS stores metadata about an application is in an Entity levelDB. This includes information about each container for each application. Based on my analysis, the levelDB size grows by at least 2500 bytes per container (uncompressed). This is a conservative estimate as the size could be much bigger based on the amount of diagnostic information associated with failed containers. On very large and busy clusters, the amount needed on the timeline server's local disk would be between 0.6 TB and 1.0 TB (uncompressed). Even if we assume 90% compression, that's still between 60 GB and 100 GB that will be needed on the local disk. In addition to this, between 80 GB and 143 GB of metadata (uncompressed) will need to be cleaned up every day from the levelDB, which will delay other processing in the timeline server. The proposal of this JIRA is to add a configuration property that controls whether or not the GAHS stores container information in the levelDB. With this change, I estimate that the local disk usage would be about 5700 bytes per job, or about 10 GB (uncompressed) per week. Additionally, the daily cleanup load would only be about 1.5 GB. Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Reporter: Eric Payne Assignee: Eric Payne Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
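As a rough sanity check of the numbers above, the following back-of-the-envelope sketch reproduces the stated orders of magnitude; the weekly container and job counts are assumptions chosen for illustration, not measured values:
{code}
// Back-of-the-envelope sizing check. Per-record sizes are taken from the
// analysis above; the weekly counts are assumed values for a large cluster.
public class GahsSizingSketch {
  public static void main(String[] args) {
    long bytesPerContainer = 2_500L;        // at least this much per container (uncompressed)
    long containersPerWeek = 300_000_000L;  // assumption for a very large, busy cluster
    double weeklyTb = bytesPerContainer * (double) containersPerWeek / 1e12;
    System.out.printf("with container info:    ~%.2f TB per week%n", weeklyTb); // ~0.75 TB

    long bytesPerJob = 5_700L;              // per-job metadata once container info is excluded
    long jobsPerWeek = 1_750_000L;          // assumption
    double weeklyGb = bytesPerJob * (double) jobsPerWeek / 1e9;
    System.out.printf("without container info: ~%.1f GB per week%n", weeklyGb); // ~10 GB
  }
}
{code}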
[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3905: - Attachment: YARN-3905.002.patch Fixing checkstyle bug. I forgot to remove the now-unused {{ContainerID}} import. Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch, YARN-3905.002.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3905: - Attachment: YARN-3905.001.patch Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996886#comment-14996886 ] Eric Payne commented on YARN-3769: -- bq. you don't need to do componentwiseMax here, since minPendingAndPreemptable <= headroom, and you can use subtractFrom to make code simpler. [~leftnoteasy], you are right, we do know that {{minPendingAndPreemptable <= headroom}}. Thanks for the catch. I will make those changes. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.005.patch [~leftnoteasy], Attaching YARN-3769.005.patch with the changes we discussed. I have another question that may be an enhancement: In {{LeafQueue#getTotalPendingResourcesConsideringUserLimit}}, the calculation of headroom is as follows in this patch: {code} Resource headroom = Resources.subtract( computeUserLimit(app, resources, user, partition, SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), user.getUsed(partition)); {code} Would it be more efficient to just do the following? {code} Resource headroom = Resources.subtract(user.getUserResourceLimit(), user.getUsed()); {code} > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch, YARN-3769.005.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI
[ https://issues.apache.org/jira/browse/YARN-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004297#comment-15004297 ] Eric Payne commented on YARN-4226: -- Since the {{preemptionDisabled}} tag has already shipped with the capacity scheduler's REST API, I don't think that changing the name of the tag is an option, since users may be relying on that key string. I see only the following options: # Change the value of {{true}} to {{disabled}} and {{false}} to {{enabled}} (which may not be an option either for the same reason changing the key is not an option) # Add a new key like {{preemptionStatus}} and have the values be {{enabled}} or {{disabled}} # Make no changes. Leave it the way that it is > Make capacity scheduler queue's preemption status REST API consistent with GUI > -- > > Key: YARN-4226 > URL: https://issues.apache.org/jira/browse/YARN-4226 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > > In the capacity scheduler GUI, the preemption status has the following form: > {code} > Preemption: disabled > {code} > However, the REST API shows the following for the same status: > {code} > "preemptionDisabled":true > {code} > The latter is confusing and should be consistent with the format in the GUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4354) Public resource localization fails with NPE
[ https://issues.apache.org/jira/browse/YARN-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004908#comment-15004908 ] Eric Payne commented on YARN-4354: -- +1 Thanks Jason for catching and fixing this. I also verified that the new test ({{TestLocalResourcesTrackerImpl#testReleaseWhileDownloading}}) passes with the fix and NPEs without it. And, I ran {{TestResourceLocalizationService}} (the above test that is failing) in my local build environment and it passes for me. > Public resource localization fails with NPE > --- > > Key: YARN-4354 > URL: https://issues.apache.org/jira/browse/YARN-4354 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.2 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-4354-unittest.patch, YARN-4354.001.patch, > YARN-4354.002.patch > > > I saw public localization on nodemanagers get stuck because it was constantly > rejecting requests to the thread pool executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.001.patch Attaching YARN-4225.001.patch for both trunk and branch-2.8 > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769-branch-2.7.005.patch Unit tests {{TestAMAuthorization}} {{TestClientRMTokens}} {{TestRM}} {{TestWorkPreservingRMRestart}} are all working for me in my local build environment. Attaching branch-2.7 patch, which is a little different, since the 2.7 preemption monitor doesn't consider labels. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769-branch-2.7.005.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, > YARN-3769.005.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.004.patch [~leftnoteasy], Thank you for your review, and sorry for the late reply. {quote} - Why this is needed? MAX_PENDING_OVER_CAPACITY. I think this could be problematic, for example, if a queue has capacity = 50, and it's usage is 10 and it has 45 pending resource, if we set MAX_PENDING_OVER_CAPACITY=0.1, the queue cannot preempt resource from other queue. {quote} Sorry for the poor naming convention. It is not really being used to check against the queue's capacity; it is used to check for a percentage over the currently used resources. I changed the name to {{MAX_PENDING_OVER_CURRENT}}. As you know, there are multiple reasons why preemption could unnecessarily preempt resources (I call it "flapping"), only one of which is the lack of consideration for user limit factor. Another is that an app could be requesting an 8-gig container, and the preemption monitor could conceivably preempt 8, one-gig containers, which would then be rejected by the requesting AM and potentially given right back to the preempted app. The {{MAX_PENDING_OVER_CURRENT}} buffer is an attempt to alleviate that particular flapping situation by giving a buffer zone above the currently used resources on a particular queue. This is to say that the preemption monitor shouldn't consider that queue B is asking for pending resources unless pending resources on queue B are above a configured percentage of currently used resources on queue B. If you want, we can pull this out and put it as part of a different JIRA so we can document and discuss that particular flapping situation separately. {quote} - In LeafQueue, it uses getHeadroom() to compute how many resource that the user can use. But I think it may not correct: ... For above queue status, headroom for a.a1 is 0 since queue-a's currentResourceLimit is 0. So instead of using headroom, I think you can use computed-user-limit - user.usage(partition) as the headroom. You don't need to consider queue's max capacity here, since we will consider queue's max capacity at following logic of PCPP. {quote} Yes, you are correct. {{getHeadroom}} could be calculating zero headroom when we don't want it to. And, I agree that we don't need to limit pending resources to max queue capacity when calculating pending resources. The concern for this fix is that user limit factor should be considered and limit the pending value. The max queue capacity will be considered during the offer stage of the preemption calculations. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
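A minimal sketch of the buffer idea described above, assuming the standard {{Resources}} helpers; the class, method, and parameter names are illustrative and are not taken from the actual patch:
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class PendingOverCurrentSketch {
  // Treat a queue as "asking" for resources only when its pending amount is
  // larger than a configured fraction of what it already uses, so that small
  // pending requests do not trigger preemption flapping.
  static boolean pendingExceedsBuffer(ResourceCalculator rc, Resource cluster,
      Resource queueUsed, Resource queuePending, float maxPendingOverCurrent) {
    Resource threshold = Resources.multiply(queueUsed, maxPendingOverCurrent);
    return Resources.greaterThan(rc, cluster, queuePending, threshold);
  }
}
{code}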
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987842#comment-14987842 ] Eric Payne commented on YARN-3769: -- Tests {{hadoop.yarn.server.resourcemanager.TestClientRMTokens}} and {{hadoop.yarn.server.resourcemanager.TestAMAuthorization}} are not failing for me in my own build environment. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947847#comment-14947847 ] Eric Payne commented on YARN-3769: -- Investigating test failures and checkstyle warnings > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: (was: YARN-3769.003.patch) > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.003.patch > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.003.patch YARN-3769-branch-2.7.003.patch [~leftnoteasy], Thanks for all of your help on this JIRA. Attaching version 003. {{YARN-3769.003.patch}} applies to both trunk and branch-2 {{YARN-3769-branch-2.7.003.patch}} applies to branch-2.7 > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
Eric Payne created YARN-3905: Summary: Application History Server UI NPEs when accessing apps run after RM restart Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.1, 2.7.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621281#comment-14621281 ] Eric Payne commented on YARN-3905: -- {{org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable}} constructs what it believes should be the AM container ID when creating a new {{GetContainerReportRequest}}. {code} // AM container is always the first container of the attempt final GetContainerReportRequest request = GetContainerReportRequest.newInstance(ContainerId.newContainerId( appAttemptReport.getApplicationAttemptId(), 1)); {code} - After the RM is restarted, container IDs contain an {{e##}} string, which the above code doesn't take into consideration - The AM container is not always _01 due to the way reservations work. We have seen non-first AM containers in practice. As a result of the above code, the container ID in the {{GetContainerReportRequest}} may not match the actual AM container ID before RM restart, and will not match those for jobs run after the RM is restarted. So, when {{ApplicationHistoryManagerImpl}} compares the ID of the passed container with its cache from the history store, it can't find a match and throws the NPE. In {{AppBlock#generateApplicationTable}}, instead of constructing the AM's container ID, I suggest using appAttemptReport#getAMContainerId: {code} final GetContainerReportRequest request = GetContainerReportRequest.newInstance( appAttemptReport.getAMContainerId()); {code} Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769.001.branch-2.8.patch YARN-3769.001.branch-2.7.patch {quote} One thing I've thought for a while is adding a "lazy preemption" mechanism, which is: when a container is marked preempted and wait for max_wait_before_time, it becomes a "can_be_killed" container. If there's another queue can allocate on a node with "can_be_killed" container, such container will be killed immediately to make room the new containers. I will upload a design doc shortly for review. {quote} [~leftnoteasy], because it's been a couple of months since the last activity on this JIRA, would it be better to use this JIRA for the purpose of making the preemption monitor "user-limit" aware and open a separate JIRA to address a redesign? Towards that end, I am uploading a couple of patches: - {{YARN-3769.001.branch-2.7.patch}} is a patch to 2.7 (and also 2.6) which we have been using internally. This fix has dramatically reduced the instances of "ping-pong"-ing as I outlined in [the comment above|https://issues.apache.org/jira/browse/YARN-3769?focusedCommentId=14573619=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14573619]. - {{YARN-3769.001.branch-2.8.patch}} is similar to the fix made in 2.7, but it also takes into consideration node label partitions. Thanks for your help and please let me know what you think. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Wangda Tan > Attachments: YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724431#comment-14724431 ] Eric Payne commented on YARN-3769: -- bq. I didn't make any progress on this, assigned this to you. No problem. Thanks [~leftnoteasy]. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737684#comment-14737684 ] Eric Payne commented on YARN-3769: -- Thanks very much [~leftnoteasy]! I think the above is much more efficient, but I think it needs one small tweak. On this line: {code} userNameToHeadroom.get(app.getUser()) -= app.getPending(partition); {code} If {{app.getPending(partition)}} is larger than {{userNameToHeadroom.get(app.getUser())}}, then {{userNameToHeadroom.get(app.getUser())}} could easily go negative. I think what we may want is something like this: {code} Map<String, Resource> userNameToHeadroom; Resource userLimit = computeUserLimit(partition); Resource pendingAndPreemptable = 0; for (app in apps) { if (!userNameToHeadroom.contains(app.getUser())) { userNameToHeadroom.put(app.getUser(), userLimit - app.getUser().getUsed(partition)); } Resource minPendingAndPreemptable = min(userNameToHeadroom.get(app.getUser()), app.getPending(partition)); pendingAndPreemptable += minPendingAndPreemptable; userNameToHeadroom.get(app.getUser()) -= minPendingAndPreemptable; } return pendingAndPreemptable; {code} Also, I will work on adding a test case. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4217) Failed AM attempt retries on same failed host
Eric Payne created YARN-4217: Summary: Failed AM attempt retries on same failed host Key: YARN-4217 URL: https://issues.apache.org/jira/browse/YARN-4217 Project: Hadoop YARN Issue Type: Improvement Components: applications Affects Versions: 2.7.1 Reporter: Eric Payne This happens when the cluster is maxed out. One node is going bad, so everything that happens on it fails, so the bad node is never busy. Since the cluster is maxed out, when the RM looks for a node with available resources, it will always find the almost bad one because nothing can run on it so it has available resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4217) Failed AM attempt retries on same failed host
[ https://issues.apache.org/jira/browse/YARN-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940016#comment-14940016 ] Eric Payne commented on YARN-4217: -- One way to fix this would be by blacklisting the bad nodes. However, we need to be careful that the cure isn't worse than the disease. For example, Hadoop 0.20 had black/grey listing of nodes but it was often disabled because it caused more problems than it solved. We don't want one misconfigured pipeline spawning AMs/tasks that always fail to cause the RM to think all nodes are bad and bring the cluster to a halt. It's difficult to discern whether a failure was the node's fault or the job's fault (or sometimes neither was at fault). I think the best approach initially is to implement an application-specific blacklisting approach, where the RM will track bad nodes per application rather than across applications. That way an AM that isn't working on a node can be tried on another node, but a misconfigured/specialized AM won't break the node for other AMs/tasks that work just fine on that node. The drawback of course is that if the node really is totally bad then each application has to learn that separately. > Failed AM attempt retries on same failed host > - > > Key: YARN-4217 > URL: https://issues.apache.org/jira/browse/YARN-4217 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications >Affects Versions: 2.7.1 >Reporter: Eric Payne > > This happens when the cluster is maxed out. One node is going bad, so > everything that happens on it fails, so the bad node is never busy. Since the > cluster is maxed out, when the RM looks for a node with available resources, > it will always find the almost bad one because nothing can run on it so it > has available resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
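A minimal sketch of the per-application bookkeeping described above; it is only meant to show the shape of the idea, not the RM's actual blacklist implementation, and the class and method names are hypothetical:
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.NodeId;

public class PerAppBlacklistSketch {
  // Bad nodes are tracked per application, so one application's failures do
  // not poison scheduling decisions for every other application.
  private final Map<ApplicationId, Set<NodeId>> appBlacklists = new HashMap<>();

  void recordAmFailure(ApplicationId appId, NodeId node) {
    appBlacklists.computeIfAbsent(appId, id -> new HashSet<>()).add(node);
  }

  boolean isBlacklistedFor(ApplicationId appId, NodeId node) {
    return appBlacklists.getOrDefault(appId, Collections.emptySet()).contains(node);
  }
}
{code}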
[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942434#comment-14942434 ] Eric Payne commented on YARN-3216: -- Hi [~sunilg], [~leftnoteasy], and [~Naganarasimha]. Thank you all for the great work. Have you considered how the Max Application Master Resources will be presented in the GUI? I assume it will just be expressed in the existing Max Application Master Resources field under the partition-specific tab in the scheduler page. Is that correct? > Max-AM-Resource-Percentage should respect node labels > - > > Key: YARN-3216 > URL: https://issues.apache.org/jira/browse/YARN-3216 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3216.patch, 0002-YARN-3216.patch > > > Currently, max-am-resource-percentage considers default_partition only. When > a queue can access multiple partitions, we should be able to compute > max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769-branch-2.7.002.patch YARN-3769-branch-2.002.patch Thank you very much, [~leftnoteasy], for your suggestions and help reviewing this patch. I am attaching an updated patch (version 002) for both branch-2.7 and branch-2. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4217) Failed AM attempt retries on same failed host
[ https://issues.apache.org/jira/browse/YARN-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-4217. -- Resolution: Duplicate bq. Eric Payne - is this a duplicate of YARN-2005? [~vvasudev], yes it is. I did do a search, but I missed that one. Thanks a lot! > Failed AM attempt retries on same failed host > - > > Key: YARN-4217 > URL: https://issues.apache.org/jira/browse/YARN-4217 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications >Affects Versions: 2.7.1 >Reporter: Eric Payne > > This happens when the cluster is maxed out. One node is going bad, so > everything that happens on it fails, so the bad node is never busy. Since the > cluster is maxed out, when the RM looks for a node with available resources, > it will always find the almost bad one because nothing can run on it so it > has available resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI
Eric Payne created YARN-4226: Summary: Make capacity scheduler queue's preemption status REST API consistent with GUI Key: YARN-4226 URL: https://issues.apache.org/jira/browse/YARN-4226 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor In the capacity scheduler GUI, the preemption status has the following form: {code} Preemption: disabled {code} However, the REST API shows the following for the same status: {code} "preemptionDisabled":true {code} The latter is confusing and should be consistent with the format in the GUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Component/s: capacity scheduler > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: (was: YARN-3769-branch-2.002.patch) > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.7.002.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Summary: Add preemption status to yarn queue -status (was: Add preemption status to {{yarn queue -status}}) > Add preemption status to yarn queue -status > --- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4225) Add preemption status to {{yarn queue -status}}
Eric Payne created YARN-4225: Summary: Add preemption status to {{yarn queue -status}} Key: YARN-4225 URL: https://issues.apache.org/jira/browse/YARN-4225 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769-branch-2.002.patch > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Summary: Add preemption status to yarn queue -status for capacity scheduler (was: Add preemption status to yarn queue -status) > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: (was: YARN-4225.002.patch) > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.002.patch Attaching {{YARN-4225.002.patch}}, which changes {{getPreemptionDisabled()}} to return a {{Boolean}} and makes {{QueueCLI#printQueueInfo}} check for non-null before printing the queue status. Patch applies cleanly to trunk, branch-2, and branch-2.8. {quote} In General, what is the Hadoop policy when a newer client talks to an older server and the protobuf output is different than expected. Should we expose some form of the has method, or should we overload the get method as I described here? I would appreciate any additional feedback from the community in general (Vinod Kumar Vavilapalli, do you have any thoughts?) {quote} [~vinodkv], did you have a chance to think about this? [~jlowe], do you have any additional thoughts? > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
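To make the approach in the comment above concrete, here is a minimal, self-contained sketch of a getter that folds the protobuf {{has}} check into {{get}}, plus the corresponding null check on the CLI side. The {{QueueInfoProtoOrBuilder}} interface below is a stand-in for the generated protobuf type, and the method names are assumptions following the usual protobuf conventions; this illustrates the idea, it is not the actual patch.
{code}
// Stand-in for the generated protobuf interface; the real Hadoop types differ.
interface QueueInfoProtoOrBuilder {
  boolean hasPreemptionDisabled();
  boolean getPreemptionDisabled();
}

class QueueInfoSketch {
  private final QueueInfoProtoOrBuilder proto;

  QueueInfoSketch(QueueInfoProtoOrBuilder proto) {
    this.proto = proto;
  }

  // Returns null when the field is absent, e.g. when a newer client
  // talks to an older server that never set it.
  Boolean getPreemptionDisabled() {
    if (!proto.hasPreemptionDisabled()) {
      return null;
    }
    return proto.getPreemptionDisabled();
  }

  // CLI side (in the spirit of QueueCLI#printQueueInfo): only print the
  // preemption line when the status is actually known.
  static void printPreemptionStatus(Boolean preemptionDisabled,
      java.io.PrintWriter out) {
    if (preemptionDisabled != null) {
      out.println("Preemption : "
          + (preemptionDisabled ? "disabled" : "enabled"));
    }
  }
}
{code}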
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.003.patch Sorry, I mis-named the patch. Should have been {{YARN-4225.003.patch}} > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036560#comment-15036560 ] Eric Payne commented on YARN-4225: -- Thanks [~leftnoteasy], for your helpful comments. bq. Do you think is it better to return boolean? I'd prefer to return a default value (false) instead of return null This is the nature of the question that I have about the more general Hadoop policy, and which [~jlowe] and I were discussing in the comments above. Basically, the use case is that a newer client is querying an older server, so some of the newer protobuf entries that the client expects may not exist. In that case, we have two options that I can see: # The client exposes both the {{get}} protobuf method and the {{has}} protobuf method for the structure in question # We overload the {{get}} protobuf method to do the {{has}} checking internally and return NULL if the field doesn't exist. I actually prefer the second option because it exposes only one method. But, I would like to know the opinion of others and whether there is already a precedent for this use case. > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
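For illustration, a small sketch of how the two options above read from the caller's side; the interfaces and names below are hypothetical, not the actual {{QueueInfo}} API.
{code}
// Option 1: the record exposes both methods, and every caller pairs them.
interface WithHasMethod {
  boolean hasPreemptionDisabled();
  boolean getPreemptionDisabled();
}

// Option 2: a single getter does the has() check internally and returns
// null when the field never arrived from the (older) server.
interface WithOverloadedGetter {
  Boolean getPreemptionDisabled();
}

class CallerComparison {
  static String describe(WithHasMethod q) {
    return q.hasPreemptionDisabled()
        ? (q.getPreemptionDisabled() ? "disabled" : "enabled")
        : "unknown";
  }

  static String describe(WithOverloadedGetter q) {
    Boolean disabled = q.getPreemptionDisabled();
    return disabled == null ? "unknown" : (disabled ? "disabled" : "enabled");
  }
}
{code}
Either way the client has to handle the "unknown" case explicitly; the second option just concentrates that handling in one place.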
[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request
[ https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034633#comment-15034633 ] Eric Payne commented on YARN-4108: -- Thanks very much [~leftnoteasy] for creating this POC. Just a quick note: the patch no longer applies cleanly to trunk or branch-2.8. > CapacityScheduler: Improve preemption to preempt only those containers that > would satisfy the incoming request > -- > > Key: YARN-4108 > URL: https://issues.apache.org/jira/browse/YARN-4108 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4108-design-doc-v1.pdf, YARN-4108.poc.1.patch > > > This is a sibling JIRA for YARN-2154. We should make sure container preemption > is more effective. > *Requirements:* > 1) Can handle case of user-limit preemption > 2) Can handle case of resource placement requirements, such as: hard-locality > (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I > don't want to use rack1 and host\[1-3\]) > 3) Can handle preemption within a queue: cross user preemption (YARN-2113), > cross application preemption (such as priority-based (YARN-1963) / > fairness-based (YARN-3319)). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page
[ https://issues.apache.org/jira/browse/YARN-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4422: - Attachment: YARN-4422.001.patch Attaching {{YARN-4422.001.patch}}. [~jeagles] or [~jlowe], would you mind taking a look? The problem was that when the Applications page in the Generic AHS renders, it depends on a MASTER_CONTAINER_EVENT_INFO being in the AppAttemptReport. If it's not there, it gives up on trying to print the start time, node, or log links. The reason that information then appears when you click on the app attempt link is that when the Application Attempt page renders, it just gets the whole list of containers for the app attempt and prints that information for each one, including the AM container, but it still doesn't have an indication of which one is the AM container. The reason the MASTER_CONTAINER_EVENT_INFO isn't in the AppAttemptReport is that it is provided by the REGISTER event in the System Metrics Publisher, and since this use case never gets to the point of AM registration, the MASTER_CONTAINER_EVENT_INFO isn't there. However, in all of these cases, the AM container does get a FINISHED event. I fixed this by adding the MASTER_CONTAINER_EVENT_INFO to the FINISHED event. > Generic AHS sometimes doesn't show started, node, or logs on App page > - > > Key: YARN-4422 > URL: https://issues.apache.org/jira/browse/YARN-4422 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: AppAttemptPage no container or node.jpg, AppPage no logs > or node.jpg, YARN-4422.001.patch > > > Sometimes the AM container for an app isn't able to start the JVM. This can > happen if bogus JVM options are given to the AM container ( > {{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when > misconfiguring the AM container's environment variables > ({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz}}) > When the AM container for an app isn't able to start the JVM, the Application > page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and > {{Logs}} columns. It _does_ have links for each app attempt, and if you click > on one of them, you go to the Application Attempt page, where you can see all > containers with links to their logs and nodes, including the AM container. > But none of that shows up for the app attempts on the Application page. > Also, on the Application Attempt page, in the {{Application Attempt > Overview}} section, the {{AM Container}} value is {{null}} and the {{Node}} > value is {{N/A}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
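A rough sketch of the idea described above, kept self-contained rather than copied from the patch: when the attempt's FINISHED timeline event is built, also record which container was the AM, so the Applications page can render started/node/logs even if the AM never registered. The map keys and the helper below are placeholders, not the real System Metrics Publisher code.
{code}
import java.util.HashMap;
import java.util.Map;

class AttemptFinishedEventSketch {
  // Placeholder for the real MASTER_CONTAINER_EVENT_INFO constant.
  static final String MASTER_CONTAINER_EVENT_INFO = "MASTER_CONTAINER_EVENT_INFO";

  static Map<String, Object> buildFinishedEventInfo(String diagnostics,
      String finalStatus, String masterContainerId) {
    Map<String, Object> eventInfo = new HashMap<String, Object>();
    eventInfo.put("DIAGNOSTICS_INFO", diagnostics);   // placeholder key
    eventInfo.put("FINAL_STATUS", finalStatus);       // placeholder key
    // The fix: publish the AM container id on FINISHED as well, since the
    // REGISTER event that normally carries it may never have happened.
    eventInfo.put(MASTER_CONTAINER_EVENT_INFO, masterContainerId);
    return eventInfo;
  }
}
{code}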
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.004.patch Thanks very much [~leftnoteasy], for your review and helpful comments. {quote} I'm OK with both approach - existing one in latest patch or simply return false if there's no such field in proto. {quote} So, if I understand correctly, you are okay with {{QueueInfo#getPreemptionDisabled}} returning {{Boolean}} with the possibility of returning {{null}} if the field doesn't exist. With that understanding, I'm leaving that in the latest patch. {quote} 2) For QueueCLI, is it better to print "preemption is disabled/enabled" instead of "preemption status: disabled/enabled"? {quote} Actually, I think that leaving it as "Preemption : disabled/enabled" is more consistent with the way the other properties are displayed. What do you think? {quote} 3) Is it possible to add a simple test to verify end-to-end behavior? {quote} I added a couple of tests to {{TestYarnCLI}}. Good suggestion. > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
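The tests mentioned above live in {{TestYarnCLI}}; as a stand-alone illustration of the behaviour being verified (not the actual test code, which drives the real CLI), a toy JUnit test of the null-handling rule might look like this:
{code}
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class PreemptionStatusFormatTest {

  // Same rule as discussed in the thread: print nothing when the status
  // is unknown (null), otherwise "Preemption : disabled/enabled".
  static String format(Boolean preemptionDisabled) {
    if (preemptionDisabled == null) {
      return "";
    }
    return "Preemption : " + (preemptionDisabled ? "disabled" : "enabled");
  }

  @Test
  public void printsStatusWhenKnown() {
    assertEquals("Preemption : disabled", format(Boolean.TRUE));
    assertEquals("Preemption : enabled", format(Boolean.FALSE));
  }

  @Test
  public void omitsStatusWhenTalkingToOlderServer() {
    assertEquals("", format(null));
  }
}
{code}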
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049115#comment-15049115 ] Eric Payne commented on YARN-4225: -- I'd like to address the issues raised by the above pre-commit build: - Unit Tests: The following unit tests failed during the above pre-commit build, but they all pass for me in my local build environment: ||Test Name||Modified by this patch||Pre-commit failure|| |hadoop.yarn.client.api.impl.TestAMRMClient|No|Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0| |hadoop.yarn.client.api.impl.TestNMClient|No|Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0| |hadoop.yarn.client.api.impl.TestYarnClient|No|TEST TIMED OUT| |hadoop.yarn.client.cli.TestYarnCLI|Yes|Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0| |hadoop.yarn.client.TestGetGroups|No|java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "48cbb2d33ebc":8033; java.net.UnknownHostException| |hadoop.yarn.server.resourcemanager.TestAMAuthorization|No|java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "48cbb2d33ebc":8030; java.net.UnknownHostException| |hadoop.yarn.server.resourcemanager.TestClientRMTokens|No|java.lang.NullPointerException:| - Findbugs warnings: {{org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getPreemptionDisabled() has Boolean return type and returns explicit null At QueueInfoPBImpl.java:and returns explicit null At QueueInfoPBImpl.java:[line 402]}} This is a result of {{QueueInfo#getPreemptionDisabled}} returning a Boolean. Again, we could expose the {{hasPreemptionDisabled}} method and use that instead. - JavaDocs warnings/failures: I don't think these are caused by this patch: {{[WARNING] The requested profile "docs" could not be activated because it does not exist.}} {{[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.8.1:javadoc (default-cli) on project hadoop-yarn-server-resourcemanager: An error has occurred in JavaDocs report generation:}} {{...}} > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page
[ https://issues.apache.org/jira/browse/YARN-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4422: - Attachment: AppAttemptPage no container or node.jpg AppPage no logs or node.jpg > Generic AHS sometimes doesn't show started, node, or logs on App page > - > > Key: YARN-4422 > URL: https://issues.apache.org/jira/browse/YARN-4422 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: AppAttemptPage no container or node.jpg, AppPage no logs > or node.jpg > > > Sometimes the AM container for an app isn't able to start the JVM. This can > happen if bogus JVM options are given to the AM container ( > {{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when > misconfiguring the AM container's environment variables > ({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz}}) > When the AM container for an app isn't able to start the JVM, the Application > page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and > {{Logs}} columns. It _does_ have links for each app attempt, and if you click > on one of them, you go to the Application Attempt page, where you can see all > containers with links to their logs and nodes, including the AM container. > But none of that shows up for the app attempts on the Application page. > Also, on the Application Attempt page, in the {{Application Attempt > Overview}} section, the {{AM Container}} value is {{null}} and the {{Node}} > value is {{N/A}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page
Eric Payne created YARN-4422: Summary: Generic AHS sometimes doesn't show started, node, or logs on App page Key: YARN-4422 URL: https://issues.apache.org/jira/browse/YARN-4422 Project: Hadoop YARN Issue Type: Bug Reporter: Eric Payne Assignee: Eric Payne Sometimes the AM container for an app isn't able to start the JVM. This can happen if bogus JVM options are given to the AM container ( {{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when misconfiguring the AM container's environment variables ({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz}}) When the AM container for an app isn't able to start the JVM, the Application page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and {{Logs}} columns. It _does_ have links for each app attempt, and if you click on one of them, you go to the Application Attempt page, where you can see all containers with links to their logs and nodes, including the AM container. But none of that shows up for the app attempts on the Application page. Also, on the Application Attempt page, in the {{Application Attempt Overview}} section, the {{AM Container}} value is {{null}} and the {{Node}} value is {{N/A}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769-branch-2.6.001.patch Attaching {{YARN-3769-branch-2.6.001.patch}} for backport to branch-2.6. The TestLeafQueue unit test for multiple apps from multiple users had to be modified to allow all apps to be active at the same time, since the number of active apps is calculated differently in 2.6 than in 2.7. > Consider user limit when calculating total pending resource for preemption > policy in Capacity Scheduler > --- > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Fix For: 2.7.3 > > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.6.001.patch, YARN-3769-branch-2.7.002.patch, > YARN-3769-branch-2.7.003.patch, YARN-3769-branch-2.7.005.patch, > YARN-3769-branch-2.7.006.patch, YARN-3769-branch-2.7.007.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch, YARN-3769.005.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
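One plausible way for a {{TestLeafQueue}}-style test to keep every submitted app active under the 2.6 limit is to give the queue its full capacity for AM containers; this is an assumption about the approach, not taken from the branch-2.6 patch, and the queue path is illustrative.
{code}
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration;

class AllAppsActiveConfSketch {
  static CapacitySchedulerConfiguration build() {
    CapacitySchedulerConfiguration conf = new CapacitySchedulerConfiguration();
    // Letting AMs use the whole queue keeps the 2.6-style
    // active-application limit from deactivating later apps.
    conf.setMaximumApplicationMasterResourcePerQueuePercent("root.a", 1.0f);
    return conf;
  }
}
{code}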
[jira] [Commented] (YARN-4461) Redundant nodeLocalityDelay log in LeafQueue
[ https://issues.apache.org/jira/browse/YARN-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061298#comment-15061298 ] Eric Payne commented on YARN-4461: -- Thanks a lot, [~jlowe] and [~leftnoteasy]! > Redundant nodeLocalityDelay log in LeafQueue > > > Key: YARN-4461 > URL: https://issues.apache.org/jira/browse/YARN-4461 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Eric Payne >Priority: Trivial > Fix For: 2.8.0 > > Attachments: YARN-4461.001.patch > > > In LeafQueue#setupQueueConfigs there's a redundant log of nodeLocalityDelay: > {code} > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > "labels=" + labelStrBuilder.toString() + "\n" + > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061296#comment-15061296 ] Eric Payne commented on YARN-4225: -- Thanks a lot, [~leftnoteasy] > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch, YARN-4225.005.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057144#comment-15057144 ] Eric Payne commented on YARN-4225: -- bq. Could you check findbugs warning in latest Jenkins run is related or not? There's no link to findbugs result in latest Jenkins report, so I guess it's not related. [~leftnoteasy], is there something wrong with this build? I can get to https://builds.apache.org/job/PreCommit-YARN-Build/9968, but many of the other links in the comment above don't work. For example, https://builds.apache.org/job/PreCommit-YARN-Build/9968/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn-jdk1.8.0_66.txt gets a 404. I tried to get to the artifacts page, but that also comes up 404. I didn't find any findbugs report. > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch, YARN-4225.005.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.005.patch bq.Patch looks good, could you mark the findbugs warning needs to be skipped? Thanks a lot, [~leftnoteasy]. Attaching YARN-4225.005.patch with findbugs suppressed for {{org.apache.hadoop.yarn.api.records.impl.pb: NP_BOOLEAN_RETURN_NULL}} > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch, YARN-4225.005.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
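For reference, a findbugs suppression for this warning would typically be a filter entry along these lines; the enclosing findbugs-exclude.xml location and the exact match scope are assumptions, not the literal contents of the patch.
{code}
<Match>
  <Class name="org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl" />
  <Method name="getPreemptionDisabled" />
  <Bug pattern="NP_BOOLEAN_RETURN_NULL" />
</Match>
{code}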
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15062188#comment-15062188 ] Eric Payne commented on YARN-4225: -- Oh, one more thing, [~leftnoteasy]. I ran testpatch in my own build environment and it gave a +1 for the findbugs, so the above must be a glitch in the Apache pre-commit build (?). > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch, YARN-4225.005.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4461) Redundant nodeLocalityDelay log in LeafQueue
[ https://issues.apache.org/jira/browse/YARN-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne reassigned YARN-4461: Assignee: Eric Payne > Redundant nodeLocalityDelay log in LeafQueue > > > Key: YARN-4461 > URL: https://issues.apache.org/jira/browse/YARN-4461 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Eric Payne >Priority: Trivial > > In LeafQueue#setupQueueConfigs there's a redundant log of nodeLocalityDelay: > {code} > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > "labels=" + labelStrBuilder.toString() + "\n" + > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4461) Redundant nodeLocalityDelay log in LeafQueue
[ https://issues.apache.org/jira/browse/YARN-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060109#comment-15060109 ] Eric Payne commented on YARN-4461: -- The two failing tests above ({{TestClientRMTokens}} and {{TestAMAuthorization}}) both work for me in my local environment. > Redundant nodeLocalityDelay log in LeafQueue > > > Key: YARN-4461 > URL: https://issues.apache.org/jira/browse/YARN-4461 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Eric Payne >Priority: Trivial > Attachments: YARN-4461.001.patch > > > In LeafQueue#setupQueueConfigs there's a redundant log of nodeLocalityDelay: > {code} > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > "labels=" + labelStrBuilder.toString() + "\n" + > "nodeLocalityDelay = " + nodeLocalityDelay + "\n" + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page
[ https://issues.apache.org/jira/browse/YARN-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051061#comment-15051061 ] Eric Payne commented on YARN-4422: -- bq. Thanks! Will this fix address MAPREDUCE-5502 or MAPREDUCE-4428? It doesn't seem so, but would like to confirm. [~mingma], thanks for your interest. No, this JIRA does not fix the issue documented in MAPREDUCE-5502 or MAPREDUCE-4428. This JIRA only affects the Generic application history server's GUI and not the RM Application GUI. Also, as documented in those JIRAs, the problem is not a missing link in the GUI, but that the log history is missing altogether. > Generic AHS sometimes doesn't show started, node, or logs on App page > - > > Key: YARN-4422 > URL: https://issues.apache.org/jira/browse/YARN-4422 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Payne >Assignee: Eric Payne > Fix For: 3.0.0, 2.8.0, 2.7.3 > > Attachments: AppAttemptPage no container or node.jpg, AppPage no logs > or node.jpg, YARN-4422.001.patch > > > Sometimes the AM container for an app isn't able to start the JVM. This can > happen if bogus JVM options are given to the AM container ( > {{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when > misconfiguring the AM container's environment variables > ({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz}}) > When the AM container for an app isn't able to start the JVM, the Application > page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and > {{Logs}} columns. It _does_ have links for each app attempt, and if you click > on one of them, you go to the Application Attempt page, where you can see all > containers with links to their logs and nodes, including the AM container. > But none of that shows up for the app attempts on the Application page. > Also, on the Application Attempt page, in the {{Application Attempt > Overview}} section, the {{AM Container}} value is {{null}} and the {{Node}} > value is {{N/A}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4390) Consider container request size during CS preemption
[ https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-4390. -- Resolution: Duplicate Closing this ticket in favor of YARN-4108 > Consider container request size during CS preemption > > > Key: YARN-4390 > URL: https://issues.apache.org/jira/browse/YARN-4390 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.0.0, 2.8.0, 2.7.3 >Reporter: Eric Payne >Assignee: Eric Payne > > There are multiple reasons why preemption could unnecessarily preempt > containers. One is that an app could be requesting a large container (say > 8-GB), and the preemption monitor could conceivably preempt multiple > containers (say 8, 1-GB containers) in order to fill the large container > request. These smaller containers would then be rejected by the requesting AM > and potentially given right back to the preempted app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)