[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784719#comment-15784719
 ] 

Bibin A Chundatt commented on YARN-6031:


[~templedf]

{quote}
so that when using -remove-application-from-state-store you know what you're 
purging.
{quote}

Another concern about removing the application from the store is that an already 
running AM would be left in a dormant state. If instead we allow the application 
to recover and make sure it is killed from the scheduler, the application gets 
killed with an explicit reason.

Currently, on the MR side, if the AM is running and requests a non-available 
label, the MR AM kills the application; the same could be implemented in all 
applications.
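For illustration only, a minimal sketch (not the actual MR AM code) of how a 
generic AM heartbeat loop could fail fast on such a rejection, assuming the 
RM-side InvalidLabelResourceRequestException is propagated back through 
AMRMClient#allocate; the class name below is made up:
{code:java}
// Illustrative sketch only -- LabelAwareHeartbeat is a made-up class name.
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException;

public class LabelAwareHeartbeat {
  public static void heartbeatOnce(
      AMRMClient<AMRMClient.ContainerRequest> amRmClient) throws Exception {
    try {
      amRmClient.allocate(0.5f);   // regular AM heartbeat with pending requests
    } catch (InvalidLabelResourceRequestException e) {
      // Node labels were disabled or the requested label is gone: fail fast
      // with an explicit reason instead of leaving the AM dormant.
      amRmClient.unregisterApplicationMaster(FinalApplicationStatus.FAILED,
          "Requested node label is unavailable: " + e.getMessage(), null);
      throw e;   // let the AM shut itself down
    }
  }
}
{code}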


> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed due to that application had 
> node label expression specified while node label has been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5709) Cleanup leader election configs and pluggability

2016-12-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784442#comment-15784442
 ] 

Junping Du commented on YARN-5709:
--

Interesting... why do these javadoc warnings appear only against JDK v1.8?

> Cleanup leader election configs and pluggability
> 
>
> Key: YARN-5709
> URL: https://issues.apache.org/jira/browse/YARN-5709
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: yarn-5709-branch-2.8.01.patch, 
> yarn-5709-branch-2.8.02.patch, yarn-5709-branch-2.8.03.patch, 
> yarn-5709-branch-2.8.patch, yarn-5709-wip.2.patch, yarn-5709.1.patch, 
> yarn-5709.2.patch, yarn-5709.3.patch, yarn-5709.4.patch
>
>
> While reviewing YARN-5677 and YARN-5694, I noticed we could make the 
> curator-based election code cleaner. It is nicer to get this fixed in 2.8 
> before we ship it, but this can be done at a later time as well. 
> # By EmbeddedElector, we meant it was running as part of the RM daemon. Since 
> the Curator-based elector is also running embedded, I feel the code should be 
> checking for {{!curatorBased}} instead of {{isEmbeddedElector}}
> # {{LeaderElectorService}} should probably be named 
> {{CuratorBasedEmbeddedElectorService}} or some such.
> # The code that initializes the elector should be at the same place 
> irrespective of whether it is curator-based or not. 
> # We seem to be caching the CuratorFramework instance in RM. It makes more 
> sense for it to be in RMContext. If others are okay with it, we might even be 
> better of having {{RMContext#getCurator()}} method to lazily create the 
> curator framework and then cache it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2016-12-28 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784417#comment-15784417
 ] 

zhangshilong commented on YARN-4090:


Would you please tell me which YARN version you used?
In trunk, FairScheduler.getQueueUserAclInfo does not lock the FSQueue object; 
the FSQueue object is locked only in decResourceUsage and incrResourceUsage.
FairScheduler:
{code:java}
  @Override
  public List<QueueUserACLInfo> getQueueUserAclInfo() {
    UserGroupInformation user;
    try {
      user = UserGroupInformation.getCurrentUser();
    } catch (IOException ioe) {
      return new ArrayList<QueueUserACLInfo>();
    }

    return queueMgr.getRootQueue().getQueueUserAclInfo(user);
  }
{code}
FSParentQueue.java
{code:java}
  @Override
  public List<QueueUserACLInfo> getQueueUserAclInfo(UserGroupInformation user) {
    List<QueueUserACLInfo> userAcls = new ArrayList<>();

    // Add queue acls
    userAcls.add(getUserAclInfo(user));

    // Add children queue acls
    readLock.lock();
    try {
      for (FSQueue child : childQueues) {
        userAcls.addAll(child.getQueueUserAclInfo(user));
      }
    } finally {
      readLock.unlock();
    }

    return userAcls;
  }
{code}

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, sampling1.jpg, 
> sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2016-12-28 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784392#comment-15784392
 ] 

zhangshilong commented on YARN-4090:


[~xinxianyin] [~yufeigu], this optimization works very well in our environment; 
I hope to continue work on this issue.

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, sampling1.jpg, 
> sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved

2016-12-28 Thread Tao Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784388#comment-15784388
 ] 

Tao Yang commented on YARN-6029:


Thanks [~wangda].
Updated the priority to Critical and attached a new patch for review.
This patch also needs to re-indent the code inside the new synchronized blocks; 
the diff below is shown without the whitespace changes:
{code}
   @Override
-  public synchronized CSAssignment assignContainers(Resource clusterResource,
+  public CSAssignment assignContainers(Resource clusterResource,
   FiCaSchedulerNode node, ResourceLimits currentResourceLimits,
   SchedulingMode schedulingMode) {
+synchronized (this) {
   updateCurrentResourceLimits(currentResourceLimits, clusterResource);

   if (LOG.isDebugEnabled()) {
@@ -906,6 +907,7 @@ public synchronized CSAssignment assignContainers(Resource 
clusterResource,
   }

   setPreemptionAllowed(currentResourceLimits, node.getPartition());
+}

 // Check for reserved resources
 RMContainer reservedContainer = node.getReservedContainer();
@@ -923,6 +925,7 @@ public synchronized CSAssignment assignContainers(Resource 
clusterResource,
   }
 }

+synchronized (this) {
   // if our queue cannot access this node, just return
   if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY
   && !accessibleToPartition(node.getPartition())) {
@@ -1019,6 +1022,7 @@ public synchronized CSAssignment 
assignContainers(Resource clusterResource,

   return CSAssignment.NULL_ASSIGNMENT;
 }
+  }
{code}
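As a self-contained illustration of the lock-splitting pattern in the diff above 
(toy classes only, not the actual CapacityScheduler code): queue-local work is 
done under the instance lock, the call up to the parent is made without holding 
it, and the remaining queue-local work re-acquires it, which breaks the ABBA 
cycle described in this issue:
{code:java}
// Toy classes only, not CapacityScheduler code.
public class ChildQueue {
  private final ParentQueue parent;
  private int localState;

  ChildQueue(ParentQueue parent) { this.parent = parent; }

  public void assign() {
    synchronized (this) {      // phase 1: child-only state under the child lock
      localState++;
    }
    parent.release(this);      // call upward WITHOUT holding the child lock
    synchronized (this) {      // phase 2: child-only state again
      localState--;
    }
  }

  // Analogue of LeafQueue#getQueueUserAclInfo: needs the child lock.
  public synchronized int snapshot() { return localState; }
}

class ParentQueue {
  // Analogue of a synchronized ParentQueue method that then needs the child
  // lock (e.g. getQueueUserAclInfo iterating over children).
  public synchronized void release(ChildQueue child) {
    // Safe: the caller dropped the child lock before calling us, so a thread
    // holding this parent lock cannot be blocked against it.
    child.snapshot();
  }
}
{code}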

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by 
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to 
> release a reserved container
> --
>
> Key: YARN-6029
> URL: https://issues.apache.org/jira/browse/YARN-6029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-6029.001.patch, YARN-6029.002.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls 
> YarnClient#getQueueAclsInfo) just at the moment that 
> LeafQueue#assignContainers is called and before notifying parent queue to 
> release resource (should release a reserved container), then ResourceManager 
> can deadlock. I found this problem on our testing environment for hadoop2.8.
> Reproduce the deadlock in chronological order
> * 1. Thread A (ResourceManager Event Processor) calls synchronized 
> LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized 
> ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue 
> root), iterates over children queue acls and is blocked when calling 
> synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of 
> queue root.a is hold by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being 
> completed and is blocked when invoking synchronized 
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of 
> queue root is hold by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be 
> removed to solve this problem, since this method appears to not affect fields 
> of LeafQueue instance.
> Attach patch with UT for review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved co

2016-12-28 Thread Tao Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-6029:
---
Attachment: YARN-6029.002.patch

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by 
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to 
> release a reserved container
> --
>
> Key: YARN-6029
> URL: https://issues.apache.org/jira/browse/YARN-6029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-6029.001.patch, YARN-6029.002.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls 
> YarnClient#getQueueAclsInfo) just at the moment that 
> LeafQueue#assignContainers is called and before notifying parent queue to 
> release resource (should release a reserved container), then ResourceManager 
> can deadlock. I found this problem on our testing environment for hadoop2.8.
> Reproduce the deadlock in chronological order
> * 1. Thread A (ResourceManager Event Processor) calls synchronized 
> LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized 
> ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue 
> root), iterates over children queue acls and is blocked when calling 
> synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of 
> queue root.a is hold by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being 
> completed and is blocked when invoking synchronized 
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of 
> queue root is hold by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be 
> removed to solve this problem, since this method appears to not affect fields 
> of LeafQueue instance.
> Attach patch with UT for review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Ying Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784349#comment-15784349
 ] 

Ying Zhang edited comment on YARN-6031 at 12/29/16 3:03 AM:


{quote}
Do you think we can make the log message a bit more explicit, i.e. say that the 
failure was because node labels have been disabled and point out the property 
that the admin should use to disable/enable node labels?
{quote}
Hi [~templedf], the following error message will be printed in RM log:
{noformat}
2016-12-28 01:00:22,694 WARN  resourcemanager.RMAppManager 
(RMAppManager.java:validateAndCreateResourceRequest(400)) - RM app submission 
failed in validating AM resource request for application application_xx
org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid 
resource request, node label not enabled but request contains label expression
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:396)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:341)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:321)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:439)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
... ...
2016-12-28 01:00:22,694 ERROR resourcemanager.RMAppManager 
(RMAppManager.java:recover(455)) - Failed to recover application 
application_xx
{noformat}
The first error message is printed by the check that we fail in the first place; 
the second is printed by the code added in the patch. I think this gives enough 
of a hint about the root cause.


was (Author: ying zhang):
{quote}
Do you think we can make the log message a bit more explicit, i.e. say that the 
failure was because node labels have been disabled and point out the property 
that the admin should use to disable/enable node labels?
{quote}
Hi [~templedf], the following error message will be printed in RM log:
{noformat}
2016-12-28 01:00:22,694 WARN  resourcemanager.RMAppManager 
(RMAppManager.java:validateAndCreateResourceRequest(400)) - RM app submission 
failed in validating AM resource request for application 
application_1482915192452_0001
org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid 
resource request, node label not enabled but request contains label expression
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:396)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:341)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:321)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:439)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
... ...
2016-12-28 01:00:22,694 ERROR resourcemanager.RMAppManager 
(RMAppManager.java:recover(455)) - Failed to recover application 
application_1482915192452_0001
{noformat}


> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not 

[jira] [Updated] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved co

2016-12-28 Thread Tao Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-6029:
---
Priority: Critical  (was: Blocker)

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by 
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to 
> release a reserved container
> --
>
> Key: YARN-6029
> URL: https://issues.apache.org/jira/browse/YARN-6029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls 
> YarnClient#getQueueAclsInfo) just at the moment that 
> LeafQueue#assignContainers is called and before notifying parent queue to 
> release resource (should release a reserved container), then ResourceManager 
> can deadlock. I found this problem on our testing environment for hadoop2.8.
> Reproduce the deadlock in chronological order
> * 1. Thread A (ResourceManager Event Processor) calls synchronized 
> LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized 
> ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue 
> root), iterates over children queue acls and is blocked when calling 
> synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of 
> queue root.a is hold by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being 
> completed and is blocked when invoking synchronized 
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of 
> queue root is hold by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be 
> removed to solve this problem, since this method appears to not affect fields 
> of LeafQueue instance.
> Attach patch with UT for review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Ying Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784349#comment-15784349
 ] 

Ying Zhang edited comment on YARN-6031 at 12/29/16 3:00 AM:


{quote}
Do you think we can make the log message a bit more explicit, i.e. say that the 
failure was because node labels have been disabled and point out the property 
that the admin should use to disable/enable node labels?
{quote}
Hi [~templedf], the following error message will be printed in RM log:
{noformat}
2016-12-28 01:00:22,694 WARN  resourcemanager.RMAppManager 
(RMAppManager.java:validateAndCreateResourceRequest(400)) - RM app submission 
failed in validating AM resource request for application 
application_1482915192452_0001
org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid 
resource request, node label not enabled but request contains label expression
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:396)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:341)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:321)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:439)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
... ...
2016-12-28 01:00:22,694 ERROR resourcemanager.RMAppManager 
(RMAppManager.java:recover(455)) - Failed to recover application 
application_1482915192452_0001
{noformat}



was (Author: ying zhang):
{quote}
Do you think we can make the log message a bit more explicit, i.e. say that the 
failure was because node labels have been disabled and point out the property 
that the admin should use to disable/enable node labels?
{quote}
Hi [~templedf], the following error message will be printed in RM log:
2016-12-28 01:00:22,694 WARN  resourcemanager.RMAppManager 
(RMAppManager.java:validateAndCreateResourceRequest(400)) - RM app submission 
failed in validating AM resource request for application 
application_1482915192452_0001
org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid 
resource request, node label not enabled but request contains label expression
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:396)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:341)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:321)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:439)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
... ...
2016-12-28 01:00:22,694 ERROR resourcemanager.RMAppManager 
(RMAppManager.java:recover(455)) - Failed to recover application 
application_1482915192452_0001


> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 

[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Ying Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784349#comment-15784349
 ] 

Ying Zhang commented on YARN-6031:
--

{quote}
Do you think we can make the log message a bit more explicit, i.e. say that the 
failure was because node labels have been disabled and point out the property 
that the admin should use to disable/enable node labels?
{quote}
Hi [~templedf], the following error message will be printed in RM log:
2016-12-28 01:00:22,694 WARN  resourcemanager.RMAppManager 
(RMAppManager.java:validateAndCreateResourceRequest(400)) - RM app submission 
failed in validating AM resource request for application 
application_1482915192452_0001
org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid 
resource request, node label not enabled but request contains label expression
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:396)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:341)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:321)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:439)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
... ...
2016-12-28 01:00:22,694 ERROR resourcemanager.RMAppManager 
(RMAppManager.java:recover(455)) - Failed to recover application 
application_1482915192452_0001


> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed due to that application had 
> node label expression specified while node label has been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Ying Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784332#comment-15784332
 ] 

Ying Zhang commented on YARN-6031:
--

So what's the next move? I'm a little confused. Are we going to address this 
issue with the general approach proposed by [~templedf] and [~sunilg]?

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed due to that application had 
> node label expression specified while node label has been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Ying Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784327#comment-15784327
 ] 

Ying Zhang commented on YARN-6031:
--

{quote}
We could ignore/reset labels to default in resourcerequest when nodelabels are 
disabled.
{quote}
Agree with Daniel; resetting might not be a good idea. The admin should be aware 
of the failure and take proper action.
{quote}
IIUC ignore validation on recovery also should work.
{quote}
I was thinking the same thing at first (see my comment on YARN-4465). Then I 
came to agree with what [~sunilg] said, for the same reason as above: we should 
not hide the failure or a wrong configuration.
{quote}
IMHO should be acceptable since any application submitted with labels when 
feature is disabled gets rejected.
{quote}
Yes, you're right. I agree with the current approach; I just want to clarify so 
that everyone is on the same page :-)







> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed due to that application had 
> node label expression specified while node label has been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5685) Non-embedded HA failover is broken

2016-12-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784191#comment-15784191
 ] 

Karthik Kambatla commented on YARN-5685:


On YARN-5709, Jian made the point that it is unlikely we will add a 
non-embedded leader elector, and we could always add a config then. 

I am fine with removing the config for embedded. 

> Non-embedded HA failover is broken
> --
>
> Key: YARN-5685
> URL: https://issues.apache.org/jira/browse/YARN-5685
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.0, 3.0.0-alpha1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
>  Labels: oct16-hard
> Attachments: YARN-5685.001.patch, YARN-5685.002.patch
>
>
> If HA is enabled with automatic failover enabled and embedded failover 
> disabled, all RMs all come up in standby state.  To make one of them active, 
> the {{--forcemanual}} flag must be used when manually triggering the state 
> change.  Should the active go down, the standby will not become active and 
> must be manually transitioned with the {{--forcemanual}} flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5709) Cleanup leader election configs and pluggability

2016-12-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784177#comment-15784177
 ] 

Hadoop QA commented on YARN-5709:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
16s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
27s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
28s{color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
19s{color} | {color:green} branch-2.8 passed with JDK v1.8.0_111 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
29s{color} | {color:green} branch-2.8 passed with JDK v1.7.0_121 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
8s{color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
29s{color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
32s{color} | {color:green} branch-2.8 passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
26s{color} | {color:red} hadoop-yarn-server-resourcemanager in branch-2.8 
failed with JDK v1.8.0_111. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
48s{color} | {color:green} branch-2.8 passed with JDK v1.7.0_121 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  
8s{color} | {color:green} the patch passed with JDK v1.8.0_111 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
35s{color} | {color:green} the patch passed with JDK v1.7.0_121 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
35s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 41s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 4 new + 321 unchanged - 9 fixed = 325 total (was 330) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
1s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
22s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed 
with JDK v1.8.0_111. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed with JDK v1.7.0_121 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
31s{color} | {color:green} hadoop-yarn-api in the patch passed with JDK 
v1.7.0_121. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 25s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_121. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
20s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}190m  8s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_111 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | 

[jira] [Updated] (YARN-5830) Avoid preempting AM containers

2016-12-28 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu updated YARN-5830:
---
Attachment: YARN-5830.002.patch

[~kasha], I uploaded a new patch addressing all your comments. YARN-6038 has 
been created and the TODO in this patch has been updated.

> Avoid preempting AM containers
> --
>
> Key: YARN-5830
> URL: https://issues.apache.org/jira/browse/YARN-5830
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Karthik Kambatla
>Assignee: Yufei Gu
> Attachments: YARN-5830.001.patch, YARN-5830.002.patch
>
>
> While considering containers for preemption, avoid AM containers unless 
> absolutely necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved

2016-12-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784103#comment-15784103
 ] 

Wangda Tan commented on YARN-6029:
--

In addition, I suggest downgrading the severity to Critical to unblock 2.8, 
since this happens only rarely.

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by 
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to 
> release a reserved container
> --
>
> Key: YARN-6029
> URL: https://issues.apache.org/jira/browse/YARN-6029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls 
> YarnClient#getQueueAclsInfo) just at the moment that 
> LeafQueue#assignContainers is called and before notifying parent queue to 
> release resource (should release a reserved container), then ResourceManager 
> can deadlock. I found this problem on our testing environment for hadoop2.8.
> Reproduce the deadlock in chronological order
> * 1. Thread A (ResourceManager Event Processor) calls synchronized 
> LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized 
> ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue 
> root), iterates over children queue acls and is blocked when calling 
> synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of 
> queue root.a is hold by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being 
> completed and is blocked when invoking synchronized 
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of 
> queue root is hold by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be 
> removed to solve this problem, since this method appears to not affect fields 
> of LeafQueue instance.
> Attach patch with UT for review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved

2016-12-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784099#comment-15784099
 ] 

Wangda Tan commented on YARN-6029:
--

bq. I'm not clear about this. Is it worth to ensure consistency of acls through 
reducing the efficiency of scheduler?
It is going to be inefficient; previously getQueueInfo held the scheduler lock 
and that caused problems.

bq. We also noticed that it doesn't hold the lock of LeafQueue instance when 
updating acls (CapacityScheduler#setQueueAcls) so that current logic doesn't 
guarantee the consistency of acls.
Yeah, you're correct...
I think we could directly get the queue ACL info from the CS by invoking 
authorizer#checkPermissions, and we can have a separate lock to protect 
permission get/set. cc: [~jianhe]
But this should be a separate patch, since we need to fix getQueueInfo as 
well.

I think we can go ahead and fix the locks inside LQ#assignContainers; thoughts?
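A purely hypothetical sketch of that "separate lock to protect permission 
get/set" idea (class and field names are illustrative, not from any actual 
patch):
{code:java}
// Hypothetical sketch only: a dedicated lock for queue ACLs, independent of
// the scheduler and queue instance locks.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueueAclStore {
  private final Map<String, String> aclsByQueue = new HashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Called from refreshQueues / setQueueAcls-style code paths.
  public void setAcl(String queue, String acl) {
    lock.writeLock().lock();
    try {
      aclsByQueue.put(queue, acl);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Called from getQueueUserAclInfo-style code paths; never touches the
  // scheduler or queue instance locks, so it cannot participate in the
  // LeafQueue/ParentQueue deadlock described in this issue.
  public String getAcl(String queue) {
    lock.readLock().lock();
    try {
      return aclsByQueue.get(queue);
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}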

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by 
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to 
> release a reserved container
> --
>
> Key: YARN-6029
> URL: https://issues.apache.org/jira/browse/YARN-6029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls 
> YarnClient#getQueueAclsInfo) just at the moment that 
> LeafQueue#assignContainers is called and before notifying parent queue to 
> release resource (should release a reserved container), then ResourceManager 
> can deadlock. I found this problem on our testing environment for hadoop2.8.
> Reproduce the deadlock in chronological order
> * 1. Thread A (ResourceManager Event Processor) calls synchronized 
> LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized 
> ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue 
> root), iterates over children queue acls and is blocked when calling 
> synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of 
> queue root.a is hold by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being 
> completed and is blocked when invoking synchronized 
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of 
> queue root is hold by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be 
> removed to solve this problem, since this method appears to not affect fields 
> of LeafQueue instance.
> Attach patch with UT for review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5685) Non-embedded HA failover is broken

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784070#comment-15784070
 ] 

Daniel Templeton commented on YARN-5685:


I would look at it the other way around.  The default assumption should be 
embedded leader election.  If an out-of-process election is added, it will be 
the new and different thing, and it should then have a config param that 
enables it.  Today with no alternative, having a config param for embedded 
makes no sense.  The two options are: cluster works and cluster broken.

If you're really convinced that leadership election is something that will 
change again soon, then I'd still deprecate the embedded option and add a new 
option that selects the election type: "embedded" or "external", with the 
"external" value only added once that's actually a thing.  Heck, throw in 
"manual" as a valid value, and we can also deprecate the config param to enable 
automatic failover.
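A purely hypothetical sketch of such an election-type knob; the property name 
and values below are illustrative only, not an existing YARN configuration:
{code:java}
// Hypothetical only: the property name and its values are made up.
import org.apache.hadoop.conf.Configuration;

public class ElectorSelector {
  static final String ELECTOR_TYPE = "yarn.resourcemanager.ha.elector.type";

  public static String chooseElector(Configuration conf) {
    String type = conf.get(ELECTOR_TYPE, "embedded");
    switch (type) {
      case "embedded": return "use the in-process (Curator-based) elector";
      case "external": return "defer to an out-of-process elector";
      case "manual":   return "no automatic failover; admin transitions RMs";
      default:
        throw new IllegalArgumentException("Unknown elector type: " + type);
    }
  }
}
{code}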

> Non-embedded HA failover is broken
> --
>
> Key: YARN-5685
> URL: https://issues.apache.org/jira/browse/YARN-5685
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.0, 3.0.0-alpha1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
>  Labels: oct16-hard
> Attachments: YARN-5685.001.patch, YARN-5685.002.patch
>
>
> If HA is enabled with automatic failover enabled and embedded failover 
> disabled, all RMs all come up in standby state.  To make one of them active, 
> the {{--forcemanual}} flag must be used when manually triggering the state 
> change.  Should the active go down, the standby will not become active and 
> must be manually transitioned with the {{--forcemanual}} flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-5830) Avoid preempting AM containers

2016-12-28 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784024#comment-15784024
 ] 

Yufei Gu edited comment on YARN-5830 at 12/28/16 11:58 PM:
---

[~kasha], thanks for the review. 
The high-level approach (a rough sketch follows below):
1. On the first node, sort the containersToCheck list by putting all non-AM 
containers first.
2. Check the containers one by one. If we are lucky and find a solution without 
any AM container, just return that container list. If we are not so lucky and 
the solution includes an AM container, we record it in {{potentialContainers}} 
and {{break}} to move on to the next node, because we may find a solution 
without an AM container there.
3. Repeat steps 1 and 2 for each node; the only difference is that once we 
already have a solution with an AM container, we can {{break}} the loop as soon 
as we meet the first AM container.
4. Return {{potentialContainers}} if we cannot find any solution without AM 
containers.

With this approach I was trying to avoid sorting AM and non-AM containers 
across nodes, since that could be very inefficient when the number of nodes is 
huge. One concern is that we may have to check many nodes if many of them host 
AM containers, which may not be efficient; we could add a threshold to avoid 
that.
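A minimal, self-contained sketch of the selection steps above (types are 
simplified placeholders, not the FairScheduler classes in the patch):
{code:java}
// Sketch only: simplified placeholder types, not FairScheduler code.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PreemptionSketch {
  static class Container {
    final boolean isAM;
    final int memoryMb;
    Container(boolean isAM, int memoryMb) {
      this.isAM = isAM;
      this.memoryMb = memoryMb;
    }
  }

  /** Returns containers to preempt, preferring an AM-free solution. */
  static List<Container> selectCandidates(List<List<Container>> nodes,
      int neededMb) {
    List<Container> potentialContainers = null;  // best AM-based solution so far
    for (List<Container> containersToCheck : nodes) {
      // Step 1: non-AM containers first on this node.
      containersToCheck.sort(Comparator.comparing(c -> c.isAM));
      List<Container> chosen = new ArrayList<>();
      int freed = 0;
      boolean usesAm = false;
      for (Container c : containersToCheck) {
        if (c.isAM) {
          usesAm = true;
          // Step 3: with an AM-based solution in hand, stop at the first AM.
          if (potentialContainers != null) break;
        }
        chosen.add(c);
        freed += c.memoryMb;
        if (freed >= neededMb) break;
      }
      if (freed >= neededMb) {
        if (!usesAm) return chosen;  // Step 2: AM-free solution wins outright.
        if (potentialContainers == null) potentialContainers = chosen;
      }
    }
    return potentialContainers;  // Step 4: fall back to the AM-based solution.
  }
}
{code}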


was (Author: yufeigu):
[~kasha], thanks for the review. 
The high-level approach:
1. In first nodes, sort the containersToCheck list by putting all non-AM 
containers first. 
2. We check containers one by one. If we are lucky to found a solution without 
any AM container, just return the container list. But if we are not so lucky 
and found a solution with AM container, we record this solution in 
{{potentialContainers}} and {{break}} here to move to next node because we may 
find an solution without AM container in next node.
3. Repeat step 1 and 2 for each node, the only difference is if we've already 
got a solution with AM container, we can {{break}} the loop once we meet first 
AM container. 
4. Return the {{potentialContainers}} if we cannot find any solution without AM 
containers.

One of my concern is that we should check many nodes if there are many nodes 
with AM containers, which may be not efficient. We can put a threshold to avoid 
it.

> Avoid preempting AM containers
> --
>
> Key: YARN-5830
> URL: https://issues.apache.org/jira/browse/YARN-5830
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Karthik Kambatla
>Assignee: Yufei Gu
> Attachments: YARN-5830.001.patch
>
>
> While considering containers for preemption, avoid AM containers unless 
> absolutely necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5556) Support for deleting queues without requiring a RM restart

2016-12-28 Thread Naganarasimha Garla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784035#comment-15784035
 ] 

Naganarasimha Garla commented on YARN-5556:
---

Thanks Xuan, will start working on it and update the patch at the
earliest...




> Support for deleting queues without requiring a RM restart
> --
>
> Key: YARN-5556
> URL: https://issues.apache.org/jira/browse/YARN-5556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Xuan Gong
>Assignee: Naganarasimha G R
> Attachments: YARN-5556.v1.001.patch, YARN-5556.v1.002.patch, 
> YARN-5556.v1.003.patch, YARN-5556.v1.004.patch
>
>
> Today, we could add or modify queues without restarting the RM, via a CS 
> refresh. But for deleting queue, we have to restart the ResourceManager. We 
> could support for deleting queues without requiring a RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-5830) Avoid preempting AM containers

2016-12-28 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784024#comment-15784024
 ] 

Yufei Gu edited comment on YARN-5830 at 12/28/16 11:53 PM:
---

[~kasha], thanks for the review. 
The high-level approach:
1. For the first node, sort the containersToCheck list by putting all non-AM 
containers first.
2. We check containers one by one. If we are lucky enough to find a solution 
without any AM container, we just return the container list. But if we are not so 
lucky and find a solution with an AM container, we record this solution in 
{{potentialContainers}} and {{break}} here to move to the next node, because we 
may find a solution without an AM container on the next node.
3. Repeat steps 1 and 2 for each node. The only difference is that once we already 
have a solution with an AM container, we can {{break}} the loop as soon as we meet 
the first AM container.
4. Return the {{potentialContainers}} if we cannot find any solution without AM 
containers.

One concern is that we may need to check many nodes if many of them host AM 
containers, which may not be efficient. We could put a threshold on the number of 
nodes checked to avoid that.


was (Author: yufeigu):
[~kasha], thanks for the review. 
The high-level approach:
1. In first nodes, sort the containersToCheck list by putting all non-AM 
containers first. 
2. We check containers one by one. If we are lucky to found a solution without 
any AM container, just return the container list. But not lucky and we find a 
solution with AM container, we record this solution in {{potentialContainers}} 
and {{break}} here to move to next node because we may find an solution without 
AM container in next node.
3. Repeat step 1 and 2 for each node, the only difference is if we've already 
got a solution with AM container, we can {{break}} the loop once we meet first 
AM container. 
4. Return the {{potentialContainers}} if we cannot find any solution without AM 
containers.

> Avoid preempting AM containers
> --
>
> Key: YARN-5830
> URL: https://issues.apache.org/jira/browse/YARN-5830
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Karthik Kambatla
>Assignee: Yufei Gu
> Attachments: YARN-5830.001.patch
>
>
> While considering containers for preemption, avoid AM containers unless 
> absolutely necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-5709) Cleanup leader election configs and pluggability

2016-12-28 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5709:
---
Comment: was deleted

(was: Also, looks to me like this is committed to branch-2, but not trunk.)

> Cleanup leader election configs and pluggability
> 
>
> Key: YARN-5709
> URL: https://issues.apache.org/jira/browse/YARN-5709
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: yarn-5709-branch-2.8.01.patch, 
> yarn-5709-branch-2.8.02.patch, yarn-5709-branch-2.8.03.patch, 
> yarn-5709-branch-2.8.patch, yarn-5709-wip.2.patch, yarn-5709.1.patch, 
> yarn-5709.2.patch, yarn-5709.3.patch, yarn-5709.4.patch
>
>
> While reviewing YARN-5677 and YARN-5694, I noticed we could make the 
> curator-based election code cleaner. It is nicer to get this fixed in 2.8 
> before we ship it, but this can be done at a later time as well. 
> # By EmbeddedElector, we meant it was running as part of the RM daemon. Since 
> the Curator-based elector is also running embedded, I feel the code should be 
> checking for {{!curatorBased}} instead of {{isEmbeddedElector}}
> # {{LeaderElectorService}} should probably be named 
> {{CuratorBasedEmbeddedElectorService}} or some such.
> # The code that initializes the elector should be at the same place 
> irrespective of whether it is curator-based or not. 
> # We seem to be caching the CuratorFramework instance in RM. It makes more 
> sense for it to be in RMContext. If others are okay with it, we might even be 
> better of having {{RMContext#getCurator()}} method to lazily create the 
> curator framework and then cache it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5830) Avoid preempting AM containers

2016-12-28 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784024#comment-15784024
 ] 

Yufei Gu commented on YARN-5830:


[~kasha], thanks for the review. 
The high-level approach:
1. In first nodes, sort the containersToCheck list by putting all non-AM 
containers first. 
2. We check containers one by one. If we are lucky to found a solution without 
any AM container, just return the container list. But not lucky and we find a 
solution with AM container, we record this solution in {{potentialContainers}} 
and {{break}} here to move to next node because we may find an solution without 
AM container in next node.
3. Repeat step 1 and 2 for each node, the only difference is if we've already 
got a solution with AM container, we can {{break}} the loop once we meet first 
AM container. 
4. Return the {{potentialContainers}} if we cannot find any solution without AM 
containers.

> Avoid preempting AM containers
> --
>
> Key: YARN-5830
> URL: https://issues.apache.org/jira/browse/YARN-5830
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Karthik Kambatla
>Assignee: Yufei Gu
> Attachments: YARN-5830.001.patch
>
>
> While considering containers for preemption, avoid AM containers unless 
> absolutely necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5556) Support for deleting queues without requiring a RM restart

2016-12-28 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784018#comment-15784018
 ] 

Xuan Gong commented on YARN-5556:
-

[~Naganarasimha]
Given that all the dependent patches have been committed, could you rebase the 
patch, please?

> Support for deleting queues without requiring a RM restart
> --
>
> Key: YARN-5556
> URL: https://issues.apache.org/jira/browse/YARN-5556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Xuan Gong
>Assignee: Naganarasimha G R
> Attachments: YARN-5556.v1.001.patch, YARN-5556.v1.002.patch, 
> YARN-5556.v1.003.patch, YARN-5556.v1.004.patch
>
>
> Today, we could add or modify queues without restarting the RM, via a CS 
> refresh. But for deleting queue, we have to restart the ResourceManager. We 
> could support for deleting queues without requiring a RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5755) Enhancements to STOP queue handling

2016-12-28 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong resolved YARN-5755.
-
   Resolution: Duplicate
Fix Version/s: 3.0.0-alpha2
   2.9.0

> Enhancements to STOP queue handling
> ---
>
> Key: YARN-5755
> URL: https://issues.apache.org/jira/browse/YARN-5755
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Fix For: 2.9.0, 3.0.0-alpha2
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5755) Enhancements to STOP queue handling

2016-12-28 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784014#comment-15784014
 ] 

Xuan Gong commented on YARN-5755:
-

Closing this as a duplicate. The issue has already been handled in YARN-5756.

> Enhancements to STOP queue handling
> ---
>
> Key: YARN-5755
> URL: https://issues.apache.org/jira/browse/YARN-5755
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5987) NM configured command to collect heap dump of preempted container

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784003#comment-15784003
 ] 

Daniel Templeton commented on YARN-5987:


Thanks, [~miklos.szeg...@cloudera.com]. Is it possible to leverage any of the 
work on YARN-2261 for this JIRA?

> NM configured command to collect heap dump of preempted container
> -
>
> Key: YARN-5987
> URL: https://issues.apache.org/jira/browse/YARN-5987
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
> Attachments: YARN-5987.000.patch, YARN-5987.001.patch
>
>
> The node manager can kill a container, if it exceeds the assigned memory 
> limits. It would be nice to have a configuration entry to set up a command 
> that can collect additional debug information, if needed. The collected 
> information can be used for root cause analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications

2016-12-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783996#comment-15783996
 ] 

Hudson commented on YARN-4882:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #11051 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/11051/])
YARN-4882. Change the log level to DEBUG for recovering completed (rkanter: rev 
f216276d2164c6564632c571fd3adbb03bc8b3e4)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java


> Change the log level to DEBUG for recovering completed applications
> ---
>
> Key: YARN-4882
> URL: https://issues.apache.org/jira/browse/YARN-4882
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Daniel Templeton
>  Labels: oct16-easy
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: YARN-4882.001.patch, YARN-4882.002.patch, 
> YARN-4882.003.patch, YARN-4882.004.patch, YARN-4882.005.patch
>
>
> I think for recovering completed applications no need to log as INFO, rather 
> it can be made it as DEBUG.  The problem seen from large cluster is if any 
> issue happens during RM start up and continuously switching , then  RM logs 
> are filled with most with recovering applications only. 
> There are 6 lines are logged for 1 applications as I shown in below logs, 
> then consider RM default value for max-completed applications is 10K. So for 
> each switch 10K*6=60K lines will be added which is not useful I feel.
> {noformat}
> 2016-03-01 10:20:59,077 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority 
> level is set to application:application_1456298208485_21507
> 2016-03-01 10:20:59,094 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering 
> app: application_1456298208485_21507 with 1 attempts and final state = 
> FINISHED
> 2016-03-01 10:20:59,100 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Recovering attempt: appattempt_1456298208485_21507_01 with final state: 
> FINISHED
> 2016-03-01 10:20:59,107 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1456298208485_21507_01 State change from NEW to FINISHED
> 2016-03-01 10:20:59,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1456298208485_21507 State change from NEW to FINISHED
> 2016-03-01 10:20:59,112 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith   
> OPERATION=Application Finished - Succeeded  TARGET=RMAppManager 
> RESULT=SUCCESS  APPID=application_1456298208485_21507
> {noformat}
> The main problem is missing important information's from the logs before RM 
> unstable. Even though log roll back is 50 or 100, in a short period all these 
> logs will be rolled out and all the logs contains only RM switching 
> information that too recovering applications!!. 
> I suggest at least completed applications recovery should be logged as DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783989#comment-15783989
 ] 

Daniel Templeton commented on YARN-4882:


Thanks, [~rkanter]!

> Change the log level to DEBUG for recovering completed applications
> ---
>
> Key: YARN-4882
> URL: https://issues.apache.org/jira/browse/YARN-4882
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Daniel Templeton
>  Labels: oct16-easy
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: YARN-4882.001.patch, YARN-4882.002.patch, 
> YARN-4882.003.patch, YARN-4882.004.patch, YARN-4882.005.patch
>
>
> I think for recovering completed applications no need to log as INFO, rather 
> it can be made it as DEBUG.  The problem seen from large cluster is if any 
> issue happens during RM start up and continuously switching , then  RM logs 
> are filled with most with recovering applications only. 
> There are 6 lines are logged for 1 applications as I shown in below logs, 
> then consider RM default value for max-completed applications is 10K. So for 
> each switch 10K*6=60K lines will be added which is not useful I feel.
> {noformat}
> 2016-03-01 10:20:59,077 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority 
> level is set to application:application_1456298208485_21507
> 2016-03-01 10:20:59,094 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering 
> app: application_1456298208485_21507 with 1 attempts and final state = 
> FINISHED
> 2016-03-01 10:20:59,100 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Recovering attempt: appattempt_1456298208485_21507_01 with final state: 
> FINISHED
> 2016-03-01 10:20:59,107 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1456298208485_21507_01 State change from NEW to FINISHED
> 2016-03-01 10:20:59,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1456298208485_21507 State change from NEW to FINISHED
> 2016-03-01 10:20:59,112 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith   
> OPERATION=Application Finished - Succeeded  TARGET=RMAppManager 
> RESULT=SUCCESS  APPID=application_1456298208485_21507
> {noformat}
> The main problem is missing important information's from the logs before RM 
> unstable. Even though log roll back is 50 or 100, in a short period all these 
> logs will be rolled out and all the logs contains only RM switching 
> information that too recovering applications!!. 
> I suggest at least completed applications recovery should be logged as DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5258) Document Use of Docker with LinuxContainerExecutor

2016-12-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783980#comment-15783980
 ] 

Hadoop QA commented on YARN-5258:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
56s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
17s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 16m 31s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | YARN-5258 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12844992/YARN-5258.004.patch |
| Optional Tests |  asflicense  mvnsite  xml  |
| uname | Linux f800184c7201 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 
17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 9ca54f4 |
| modules | C: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site 
U: . |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/14487/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Document Use of Docker with LinuxContainerExecutor
> --
>
> Key: YARN-5258
> URL: https://issues.apache.org/jira/browse/YARN-5258
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
>  Labels: oct16-easy
> Attachments: YARN-5258.001.patch, YARN-5258.002.patch, 
> YARN-5258.003.patch, YARN-5258.004.patch
>
>
> There aren't currently any docs that explain how to configure Docker and all 
> of its various options aside from reading all of the JIRAs.  We need to 
> document the configuration, use, and troubleshooting, along with helpful 
> examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications

2016-12-28 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783961#comment-15783961
 ] 

Robert Kanter commented on YARN-4882:
-

+1

> Change the log level to DEBUG for recovering completed applications
> ---
>
> Key: YARN-4882
> URL: https://issues.apache.org/jira/browse/YARN-4882
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Daniel Templeton
>  Labels: oct16-easy
> Attachments: YARN-4882.001.patch, YARN-4882.002.patch, 
> YARN-4882.003.patch, YARN-4882.004.patch, YARN-4882.005.patch
>
>
> I think for recovering completed applications no need to log as INFO, rather 
> it can be made it as DEBUG.  The problem seen from large cluster is if any 
> issue happens during RM start up and continuously switching , then  RM logs 
> are filled with most with recovering applications only. 
> There are 6 lines are logged for 1 applications as I shown in below logs, 
> then consider RM default value for max-completed applications is 10K. So for 
> each switch 10K*6=60K lines will be added which is not useful I feel.
> {noformat}
> 2016-03-01 10:20:59,077 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority 
> level is set to application:application_1456298208485_21507
> 2016-03-01 10:20:59,094 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering 
> app: application_1456298208485_21507 with 1 attempts and final state = 
> FINISHED
> 2016-03-01 10:20:59,100 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Recovering attempt: appattempt_1456298208485_21507_01 with final state: 
> FINISHED
> 2016-03-01 10:20:59,107 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1456298208485_21507_01 State change from NEW to FINISHED
> 2016-03-01 10:20:59,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1456298208485_21507 State change from NEW to FINISHED
> 2016-03-01 10:20:59,112 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith   
> OPERATION=Application Finished - Succeeded  TARGET=RMAppManager 
> RESULT=SUCCESS  APPID=application_1456298208485_21507
> {noformat}
> The main problem is missing important information's from the logs before RM 
> unstable. Even though log roll back is 50 or 100, in a short period all these 
> logs will be rolled out and all the logs contains only RM switching 
> information that too recovering applications!!. 
> I suggest at least completed applications recovery should be logged as DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications

2016-12-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783954#comment-15783954
 ] 

Hadoop QA commented on YARN-4882:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
17s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 0 new + 252 unchanged - 2 fixed = 252 total (was 254) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 38m 59s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
17s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 59m 51s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMRestart |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | YARN-4882 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12844988/YARN-4882.005.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 271abc327851 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 
21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 9ca54f4 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/14486/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/14486/testReport/ |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/14486/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.

[jira] [Updated] (YARN-6038) Check other resource requests if cannot match the first one while identifying containers to preempt

2016-12-28 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu updated YARN-6038:
---
Issue Type: Sub-task  (was: Improvement)
Parent: YARN-5990

> Check other resource requests if cannot match the first one while identifying 
> containers to preempt
> ---
>
> Key: YARN-6038
> URL: https://issues.apache.org/jira/browse/YARN-6038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Yufei Gu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6038) Check other resource requests if cannot match the first one while identifying containers to preempt

2016-12-28 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu updated YARN-6038:
---
Summary: Check other resource requests if cannot match the first one while 
identifying containers to preempt  (was: Check other resource requests if 
cannot match the first one while identify containers to preempt)

> Check other resource requests if cannot match the first one while identifying 
> containers to preempt
> ---
>
> Key: YARN-6038
> URL: https://issues.apache.org/jira/browse/YARN-6038
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Yufei Gu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6038) Check other resource requests if cannot match the first one while identify containers to preempt

2016-12-28 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu updated YARN-6038:
---
Summary: Check other resource requests if cannot match the first one while 
identify containers to preempt  (was: Check other resource requests if cannot 
match the first one)

> Check other resource requests if cannot match the first one while identify 
> containers to preempt
> 
>
> Key: YARN-6038
> URL: https://issues.apache.org/jira/browse/YARN-6038
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Yufei Gu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5258) Document Use of Docker with LinuxContainerExecutor

2016-12-28 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5258:
---
Attachment: YARN-5258.004.patch

Looks like I missed [~tangzhankun]'s comments before.  This patch addresses 
those and a couple of other minor issues I just caught.  [~sidharta-s] or 
[~vvasudev], any comments?

> Document Use of Docker with LinuxContainerExecutor
> --
>
> Key: YARN-5258
> URL: https://issues.apache.org/jira/browse/YARN-5258
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
>  Labels: oct16-easy
> Attachments: YARN-5258.001.patch, YARN-5258.002.patch, 
> YARN-5258.003.patch, YARN-5258.004.patch
>
>
> There aren't currently any docs that explain how to configure Docker and all 
> of its various options aside from reading all of the JIRAs.  We need to 
> document the configuration, use, and troubleshooting, along with helpful 
> examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6038) Check other resource requests if cannot match the first one

2016-12-28 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu updated YARN-6038:
---
Summary: Check other resource requests if cannot match the first one  (was: 
Check other resource requests if we can't match the first one)

> Check other resource requests if cannot match the first one
> ---
>
> Key: YARN-6038
> URL: https://issues.apache.org/jira/browse/YARN-6038
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Yufei Gu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5849) Automatically create YARN control group for pre-mounted cgroups

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783916#comment-15783916
 ] 

Daniel Templeton commented on YARN-5849:


Latest patch looks good to me.  [~bibinchundatt], any additional comments?

> Automatically create YARN control group for pre-mounted cgroups
> ---
>
> Key: YARN-5849
> URL: https://issues.apache.org/jira/browse/YARN-5849
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.7.3, 3.0.0-alpha1, 3.0.0-alpha2
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Minor
> Attachments: YARN-5849.000.patch, YARN-5849.001.patch, 
> YARN-5849.002.patch, YARN-5849.003.patch, YARN-5849.004.patch, 
> YARN-5849.005.patch, YARN-5849.006.patch, YARN-5849.007.patch, 
> YARN-5849.008.patch
>
>
> Yarn can be launched with linux-container-executor.cgroups.mount set to 
> false. It will search for the cgroup mount paths set up by the administrator 
> parsing the /etc/mtab file. You can also specify 
> resource.percentage-physical-cpu-limit to limit the CPU resources assigned to 
> containers.
> linux-container-executor.cgroups.hierarchy is the root of the settings of all 
> YARN containers. If this is specified but not created YARN will fail at 
> startup:
> Caused by: java.io.FileNotFoundException: 
> /cgroups/cpu/hadoop-yarn/cpu.cfs_period_us (Permission denied)
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.updateCgroup(CgroupsLCEResourcesHandler.java:263)
> This JIRA is about automatically creating YARN control group in the case 
> above. It reduces the cost of administration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5554) MoveApplicationAcrossQueues does not check user permission on the target queue

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783908#comment-15783908
 ] 

Daniel Templeton commented on YARN-5554:


Let's get this thing closed out.  A few more comments:

* In {{ClientRMService}}, {code}Server.getRemoteAddress(), null, 
targetQueue)||{code} should have a space before the pipes
* In the new {{QueueACLsManager.checkAccess()}}, I'd really appreciate a 
comment that sums up the previous discussion on this JIRA so that the next 
person is less confused than I was
* In {{TestClientRMService.getQueueAclManager()}}, the {{answer()}} method in 
the anonymous inner class should have an {{@Override}} annotation.  Also, I 
think you'll run into problems with Java 7 and the non-final parameters being 
used inside the anonymous inner class (a rough sketch follows after this list)
* Same comments for {{createClientRMServiceForMoveApplicationRequest()}}, plus 
you shouldn't need the suppress warnings annotation now that YARN-4457 is in.  
You will need it in branch-2, though.
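
To illustrate the third point (the {{@Override}} annotation and the Java 7 
final-capture requirement), here is a tiny hypothetical example assuming Mockito's 
{{Answer}}; this is not the actual test code:

{code}
import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

class MoveAclAnswerSketch {
  // Sketch only: the anonymous class overrides answer(), so it should carry
  // @Override; the captured parameter must be declared final to compile on Java 7.
  static Answer<Boolean> grantWhenAllowed(final boolean hasAccessToTargetQueue) {
    return new Answer<Boolean>() {
      @Override
      public Boolean answer(InvocationOnMock invocation) throws Throwable {
        return hasAccessToTargetQueue;
      }
    };
  }
}
{code}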

> MoveApplicationAcrossQueues does not check user permission on the target queue
> --
>
> Key: YARN-5554
> URL: https://issues.apache.org/jira/browse/YARN-5554
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: Haibo Chen
>Assignee: Wilfred Spiegelenburg
>  Labels: oct16-medium
> Attachments: YARN-5554.10.patch, YARN-5554.11.patch, 
> YARN-5554.2.patch, YARN-5554.3.patch, YARN-5554.4.patch, YARN-5554.5.patch, 
> YARN-5554.6.patch, YARN-5554.7.patch, YARN-5554.8.patch, YARN-5554.9.patch
>
>
> moveApplicationAcrossQueues operation currently does not check user 
> permission on the target queue. This incorrectly allows one user to move 
> his/her own applications to a queue that the user has no access to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5554) MoveApplicationAcrossQueues does not check user permission on the target queue

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783850#comment-15783850
 ] 

Daniel Templeton commented on YARN-5554:


Yep, I noticed that as well.  The {{remoteAddress}} and {{forwardedAddress}} 
parameters that are the reason for the special casing are actually ignored 
under the covers, resulting in all the schedulers doing the same thing anyway.

> MoveApplicationAcrossQueues does not check user permission on the target queue
> --
>
> Key: YARN-5554
> URL: https://issues.apache.org/jira/browse/YARN-5554
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: Haibo Chen
>Assignee: Wilfred Spiegelenburg
>  Labels: oct16-medium
> Attachments: YARN-5554.10.patch, YARN-5554.11.patch, 
> YARN-5554.2.patch, YARN-5554.3.patch, YARN-5554.4.patch, YARN-5554.5.patch, 
> YARN-5554.6.patch, YARN-5554.7.patch, YARN-5554.8.patch, YARN-5554.9.patch
>
>
> moveApplicationAcrossQueues operation currently does not check user 
> permission on the target queue. This incorrectly allows one user to move 
> his/her own applications to a queue that the user has no access to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4882) Change the log level to DEBUG for recovering completed applications

2016-12-28 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-4882:
---
Attachment: YARN-4882.005.patch

Changed the colon in the attempt message to an equals.
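
For context, the kind of line being changed looks roughly like this (a sketch, not 
the exact patch; the LOG and field names are illustrative):

{code}
// Recovery logging demoted from INFO to DEBUG; note "final state =" rather
// than "final state:" as mentioned above.
if (LOG.isDebugEnabled()) {
  LOG.debug("Recovering attempt: " + applicationAttemptId
      + " with final state = " + recoveredFinalState);
}
{code}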

> Change the log level to DEBUG for recovering completed applications
> ---
>
> Key: YARN-4882
> URL: https://issues.apache.org/jira/browse/YARN-4882
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Daniel Templeton
>  Labels: oct16-easy
> Attachments: YARN-4882.001.patch, YARN-4882.002.patch, 
> YARN-4882.003.patch, YARN-4882.004.patch, YARN-4882.005.patch
>
>
> I think for recovering completed applications no need to log as INFO, rather 
> it can be made it as DEBUG.  The problem seen from large cluster is if any 
> issue happens during RM start up and continuously switching , then  RM logs 
> are filled with most with recovering applications only. 
> There are 6 lines are logged for 1 applications as I shown in below logs, 
> then consider RM default value for max-completed applications is 10K. So for 
> each switch 10K*6=60K lines will be added which is not useful I feel.
> {noformat}
> 2016-03-01 10:20:59,077 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority 
> level is set to application:application_1456298208485_21507
> 2016-03-01 10:20:59,094 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering 
> app: application_1456298208485_21507 with 1 attempts and final state = 
> FINISHED
> 2016-03-01 10:20:59,100 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Recovering attempt: appattempt_1456298208485_21507_01 with final state: 
> FINISHED
> 2016-03-01 10:20:59,107 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1456298208485_21507_01 State change from NEW to FINISHED
> 2016-03-01 10:20:59,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1456298208485_21507 State change from NEW to FINISHED
> 2016-03-01 10:20:59,112 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith   
> OPERATION=Application Finished - Succeeded  TARGET=RMAppManager 
> RESULT=SUCCESS  APPID=application_1456298208485_21507
> {noformat}
> The main problem is missing important information's from the logs before RM 
> unstable. Even though log roll back is 50 or 100, in a short period all these 
> logs will be rolled out and all the logs contains only RM switching 
> information that too recovering applications!!. 
> I suggest at least completed applications recovery should be logged as DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5709) Cleanup leader election configs and pluggability

2016-12-28 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5709:
---
Attachment: yarn-5709-branch-2.8.03.patch

Forgot we were talking about branch-2.8, so the suppress-warnings annotation is 
still needed.  In this patch I converted the javadoc comment into a regular 
comment.  It passes test-patch for me.  We'll see.

> Cleanup leader election configs and pluggability
> 
>
> Key: YARN-5709
> URL: https://issues.apache.org/jira/browse/YARN-5709
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: yarn-5709-branch-2.8.01.patch, 
> yarn-5709-branch-2.8.02.patch, yarn-5709-branch-2.8.03.patch, 
> yarn-5709-branch-2.8.patch, yarn-5709-wip.2.patch, yarn-5709.1.patch, 
> yarn-5709.2.patch, yarn-5709.3.patch, yarn-5709.4.patch
>
>
> While reviewing YARN-5677 and YARN-5694, I noticed we could make the 
> curator-based election code cleaner. It is nicer to get this fixed in 2.8 
> before we ship it, but this can be done at a later time as well. 
> # By EmbeddedElector, we meant it was running as part of the RM daemon. Since 
> the Curator-based elector is also running embedded, I feel the code should be 
> checking for {{!curatorBased}} instead of {{isEmbeddedElector}}
> # {{LeaderElectorService}} should probably be named 
> {{CuratorBasedEmbeddedElectorService}} or some such.
> # The code that initializes the elector should be at the same place 
> irrespective of whether it is curator-based or not. 
> # We seem to be caching the CuratorFramework instance in RM. It makes more 
> sense for it to be in RMContext. If others are okay with it, we might even be 
> better of having {{RMContext#getCurator()}} method to lazily create the 
> curator framework and then cache it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5275) Timeline application page cannot be loaded when no application submitted/running on the cluster after HADOOP-9613

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783754#comment-15783754
 ] 

Daniel Templeton commented on YARN-5275:


Ping, [~sunilg]...

> Timeline application page cannot be loaded when no application 
> submitted/running on the cluster after HADOOP-9613
> -
>
> Key: YARN-5275
> URL: https://issues.apache.org/jira/browse/YARN-5275
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha1
>Reporter: Tsuyoshi Ozawa
>Priority: Critical
>
> After HADOOP-9613, Timeline Web UI has a problem reported by [~leftnoteasy] 
> and [~sunilg]
> {quote}
> when no application submitted/running on the cluster, applications page 
> cannot be loaded. 
> {quote}
> We should investigate the reason and fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783757#comment-15783757
 ] 

Daniel Templeton commented on YARN-2962:


No worries.  I'm happy to help with the review.

> ZKRMStateStore: Limit the number of znodes under a znode
> 
>
> Key: YARN-2962
> URL: https://issues.apache.org/jira/browse/YARN-2962
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: YARN-2962.01.patch, YARN-2962.04.patch, 
> YARN-2962.05.patch, YARN-2962.2.patch, YARN-2962.3.patch
>
>
> We ran into this issue where we were hitting the default ZK server message 
> size configs, primarily because the message had too many znodes even though 
> they individually they were all small.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-4401) A failed app recovery should not prevent the RM from starting

2016-12-28 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton resolved YARN-4401.

Resolution: Won't Fix

This JIRA is superseded by YARN-6035, YARN-6036, and YARN-6037, which capture 
the same idea in a more supportable way.

> A failed app recovery should not prevent the RM from starting
> -
>
> Key: YARN-4401
> URL: https://issues.apache.org/jira/browse/YARN-4401
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4401.001.patch
>
>
> There are many different reasons why an app recovery could fail with an 
> exception, causing the RM start to be aborted.  If that happens the RM will 
> fail to start.  Presumably, the reason the RM is trying to do a recovery is 
> that it's the standby trying to fill in for the active.  Failing to come up 
> defeats the purpose of the HA configuration.  Instead of preventing the RM 
> from starting, a failed app recovery should log an error and skip the 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783546#comment-15783546
 ] 

Daniel Templeton edited comment on YARN-6031 at 12/28/16 7:22 PM:
--

bq. IIUC ignore validation on recovery also should work.

Then you end up with unschedulable apps in the system, which can't be good.

bq. We could ignore/reset labels to default in resourcerequest when nodelabels 
are disabled.

The issue there is that the labels may have some important meaning to the job, 
so defaulting the labels may be bad.  As applications can have side-effects, I 
think it's better to have the failure up front than let the application 
potentially fail somewhere down the line.  The admin is then immediately made 
aware that he screwed up by disabling a feature that was still in use.


was (Author: templedf):
bq. IIUC ignore validation on recovery also should work.

Then you end up with unschedulable apps in the system, which can't be good.

bg. We could ignore/reset labels to default in resourcerequest when nodelabels 
are disabled.

The issue there is that the labels may have some important meaning to the job, 
so defaulting the labels may be bad.  As applications can have side-effects, I 
think it's better to have the failure up front than let the application 
potentially fail somewhere down the line.  The admin is then immediately made 
aware that he screwed up by disabling a feature that was still in use.

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed due to that application had 
> node label expression specified while node label has been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783546#comment-15783546
 ] 

Daniel Templeton commented on YARN-6031:


bq. IIUC ignore validation on recovery also should work.

Then you end up with unschedulable apps in the system, which can't be good.

bg. We could ignore/reset labels to default in resourcerequest when nodelabels 
are disabled.

The issue there is that the labels may have some important meaning to the job, 
so defaulting the labels may be bad.  As applications can have side-effects, I 
think it's better to have the failure up front than let the application 
potentially fail somewhere down the line.  The admin is then immediately made 
aware that he screwed up by disabling a feature that was still in use.

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed due to that application had 
> node label expression specified while node label has been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5831) Propagate allowPreemptionFrom flag all the way down to the app

2016-12-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783486#comment-15783486
 ] 

Karthik Kambatla commented on YARN-5831:


Thanks for working on this, Yufei. 

Comments on the patch:
# FSAppAttempt: 
## Should we have a method called isPreemptable() in Schedulable and override 
it? Also, we could maybe add unit tests for isPreemptable() separately then? (A 
rough sketch follows after this list.)
## canContainerBePreempted(): Nothing to do with this patch, but I wonder if we 
should check isPreemptable() first? 
# updatePreemptionVariables refactor: I like the idea of not recursing for 
updating preemption variables. 
## We should make sure that queue.init() is called in a pre-order fashion on 
config update. Right now, we rely on the iterator (queues.values()). Is that 
guaranteed to show items in preorder? 
## How do we guard against future changes where one updates a parent preemption 
config outside of config changes and that does not propagate to children? How 
likely are such changes? Do we even need to worry about them now? 
## Nothing to do with this patch, but we should likely rename FSQueue#init to 
FSQueue#reinit?
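
A rough sketch of the {{isPreemptable()}} idea from the first point; the shapes 
below are simplified assumptions, not the actual Schedulable/FSQueue/FSAppAttempt 
code:

{code}
// Sketch only: expose preemptability on Schedulable and derive the app's value
// from its queue once, instead of recursing to the root on every container check.
interface Schedulable {
  boolean isPreemptable();
}

class FSQueueSketch implements Schedulable {
  private boolean preemptable = true;  // set from the allowPreemptionFrom config

  void reinit(boolean allowPreemptionFrom, FSQueueSketch parent) {
    // A child is preemptable only if its config allows it AND its parent is.
    preemptable = allowPreemptionFrom
        && (parent == null || parent.isPreemptable());
  }

  @Override
  public boolean isPreemptable() {
    return preemptable;
  }
}

class FSAppAttemptSketch implements Schedulable {
  private final FSQueueSketch queue;

  FSAppAttemptSketch(FSQueueSketch queue) {
    this.queue = queue;
  }

  @Override
  public boolean isPreemptable() {
    // Propagated from the queue, so no walk up the queue hierarchy per check.
    return queue.isPreemptable();
  }
}
{code}
Note that {{reinit}} above only gives the right answer if a parent's {{reinit}} 
runs before its children's, which is exactly the pre-order question raised in the 
second point.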


> Propagate allowPreemptionFrom flag all the way down to the app
> --
>
> Key: YARN-5831
> URL: https://issues.apache.org/jira/browse/YARN-5831
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Karthik Kambatla
>Assignee: Yufei Gu
> Attachments: YARN-5831.001.patch, YARN-5831.002.patch
>
>
> FairScheduler allows disallowing preemption from a queue. When checking if 
> preemption for an application is allowed, the new preemption code recurses 
> all the way to the root queue to check this flag. 
> Propagating this information all the way to the app will be more efficient. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6038) Check other resource requests if we can't match the first one

2016-12-28 Thread Yufei Gu (JIRA)
Yufei Gu created YARN-6038:
--

 Summary: Check other resource requests if we can't match the first 
one
 Key: YARN-6038
 URL: https://issues.apache.org/jira/browse/YARN-6038
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Yufei Gu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6037) Add an option to yarn resourcemanager CLI to list all applications that would cause a recovery failure

2016-12-28 Thread Daniel Templeton (JIRA)
Daniel Templeton created YARN-6037:
--

 Summary: Add an option to yarn resourcemanager CLI to list all 
applications that would cause a recovery failure
 Key: YARN-6037
 URL: https://issues.apache.org/jira/browse/YARN-6037
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Daniel Templeton


Today the RM will fail and exit on the first application recovery failure.  It 
is often desirable to know the full list of applications that would fail 
recovery.  This JIRA proposes to add a CLI option to have the RM complete 
recovery regardless of failures and then exit after printing the full list of 
applications that could not be recovered along with reasons.
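
As a rough sketch of the accumulate-and-report behavior proposed here (the types and method names below are hypothetical placeholders, not the real RMAppManager code):

{code}
// Hypothetical sketch: collect recovery failures instead of exiting on the first one.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecoveryDryRun {
  static class AppState {
    final String id;
    AppState(String id) { this.id = id; }
  }

  // Placeholder for the real per-application recovery/validation logic.
  static void recoverApplication(AppState app) throws Exception { }

  static Map<String, String> recoverAll(List<AppState> apps) {
    Map<String, String> failures = new LinkedHashMap<>();
    for (AppState app : apps) {
      try {
        recoverApplication(app);
      } catch (Exception e) {
        // Remember the failure and keep going so the full list can be reported.
        failures.put(app.id, String.valueOf(e.getMessage()));
      }
    }
    failures.forEach((id, reason) ->
        System.err.println("Recovery would fail for " + id + ": " + reason));
    return failures;
  }
}
{code}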



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783397#comment-15783397
 ] 

Bibin A Chundatt edited comment on YARN-6031 at 12/28/16 6:31 PM:
--

As [~sunilg] mentioned earlier, ignoring the application could create a stale 
application in the state store.

[~Ying Zhang] IIUC, ignoring validation on recovery should also work.

{code}
  private static void validateResourceRequest(ResourceRequest resReq,
  Resource maximumResource, QueueInfo queueInfo, RMContext rmContext)
  throws InvalidResourceRequestException {
Configuration conf = rmContext.getYarnConfiguration();
// If Node label is not enabled throw exception
if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) {
  String labelExp = resReq.getNodeLabelExpression();
  if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp)
  || null == labelExp)) {
throw new InvalidLabelResourceRequestException(
"Invalid resource request, node label not enabled "
+ "but request contains label expression");
  }
}
{code}

Thoughts??

{quote}
The current fact is (with or without this fix): application submitted with node 
label expression explicitly specified will fail during recovery
{quote}
IMHO this should be acceptable, since any application submitted with labels while 
the feature is disabled gets rejected.

Solution 2:
We could ignore/reset labels to the default in the ResourceRequest when node labels 
are disabled. I haven't looked at the impact of the same. 
Elaborate testing would be needed to see how metrics are impacted. 
The disadvantage is that the client will never get to know that the reset happened 
on the RM side.

YARN-4562 will try to handle ignoring the loading of label configuration when disabled.

[~templedf] I do agree that the admin would require some way to get application 
info when recovery fails so that a bulk update in the state store is possible.



was (Author: bibinchundatt):
As [~sunilg] mentioned earlier, ignoring the application could create a stale 
application in the state store.

[~Ying Zhang] IIUC, ignoring validation on recovery should also work.

{code}
  private static void validateResourceRequest(ResourceRequest resReq,
  Resource maximumResource, QueueInfo queueInfo, RMContext rmContext)
  throws InvalidResourceRequestException {
Configuration conf = rmContext.getYarnConfiguration();
// If Node label is not enabled throw exception
if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) {
  String labelExp = resReq.getNodeLabelExpression();
  if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp)
  || null == labelExp)) {
throw new InvalidLabelResourceRequestException(
"Invalid resource request, node label not enabled "
+ "but request contains label expression");
  }
}
{code}

Thoughts??

{quote}
The current fact is (with or without this fix): application submitted with node 
label expression explicitly specified will fail during recovery
{quote}
IMHO this should be acceptable, since any application submitted with labels while 
the feature is disabled gets rejected.

Solution 2:
We could ignore/reset labels to the default in the ResourceRequest when node labels 
are disabled. I haven't looked at the impact of the same. 
Elaborate testing would be needed to see how metrics are impacted. 
The disadvantage is that the client will never get to know that the reset happened 
on the RM side.

YARN-4562 will try to handle ignoring the loading of label configuration when disabled.

[~templedf] I do agree that the admin would require some way to dump application 
info when recovery fails so that a bulk update in the state store is possible.


> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> 

[jira] [Created] (YARN-6036) Add -show-application-info option to yarn resourcemanager CLI

2016-12-28 Thread Daniel Templeton (JIRA)
Daniel Templeton created YARN-6036:
--

 Summary: Add -show-application-info option to yarn resourcemanager 
CLI
 Key: YARN-6036
 URL: https://issues.apache.org/jira/browse/YARN-6036
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: client
Reporter: Daniel Templeton


If an application has failed, the admin can purge it with 
{{-remove-application-from-state-store}}, but she has no information at that 
time about what the offending job is.  This JIRA proposes to add a 
{{-show-application-info}} switch to the resourcemanager CLI that will print 
the application information without needing a running RM so that the admin can 
make an informed decision.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6035) Add -force-recovery option to yarn resourcemanager

2016-12-28 Thread Daniel Templeton (JIRA)
Daniel Templeton created YARN-6035:
--

 Summary: Add -force-recovery option to yarn resourcemanager
 Key: YARN-6035
 URL: https://issues.apache.org/jira/browse/YARN-6035
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: client
Reporter: Daniel Templeton


If multiple applications cannot be recovered, the admin is forced to repeatedly 
attempt to start the RM, check the logs, and purge the offending app.  Worse, 
the admin has no information about the app that was purged, other than the ID.

This JIRA proposes to add a {{-force-recovery}} option to the resourcemanager 
CLI that will automatically purge any apps that fail to recover after dumping 
the full application profile to a log.  It would be nice if the option prompted the 
user with an "are you sure?" confirmation before continuing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-5824) Verify app starvation under custom preemption thresholds and timeouts

2016-12-28 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu reassigned YARN-5824:
--

Assignee: Yufei Gu

> Verify app starvation under custom preemption thresholds and timeouts
> -
>
> Key: YARN-5824
> URL: https://issues.apache.org/jira/browse/YARN-5824
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Karthik Kambatla
>Assignee: Yufei Gu
>
> YARN-5783 adds basic tests to verify applications are identified to be 
> starved. This JIRA is to add more advanced tests for different values of 
> preemption thresholds and timeouts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783397#comment-15783397
 ] 

Bibin A Chundatt edited comment on YARN-6031 at 12/28/16 6:08 PM:
--

As [~sunilg] mentioned earlier, ignoring the application could create a stale 
application in the state store.

[~Ying Zhang] IIUC, ignoring validation on recovery should also work.

{code}
  private static void validateResourceRequest(ResourceRequest resReq,
  Resource maximumResource, QueueInfo queueInfo, RMContext rmContext)
  throws InvalidResourceRequestException {
Configuration conf = rmContext.getYarnConfiguration();
// If Node label is not enabled throw exception
if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) {
  String labelExp = resReq.getNodeLabelExpression();
  if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp)
  || null == labelExp)) {
throw new InvalidLabelResourceRequestException(
"Invalid resource request, node label not enabled "
+ "but request contains label expression");
  }
}
{code}

Thoughts??

{quote}
The current fact is (with or without this fix): application submitted with node 
label expression explicitly specified will fail during recovery
{quote}
IMHO this should be acceptable, since any application submitted with labels while 
the feature is disabled gets rejected.

Solution 2:
We could ignore/reset labels to the default in the ResourceRequest when node labels 
are disabled. I haven't looked at the impact of the same. 
Elaborate testing would be needed to see how metrics are impacted. 
The disadvantage is that the client will never get to know that the reset happened 
on the RM side.

YARN-4562 will try to handle ignoring the loading of label configuration when disabled.

[~templedf] I do agree that the admin would require some way to dump application 
info when recovery fails so that a bulk update in the state store is possible.



was (Author: bibinchundatt):
As [~sunilg] mentioned earlier, ignoring the application could create a stale 
application in the state store.

[~Ying Zhang] IIUC, ignoring validation on recovery should also work.

{code}
  private static void validateResourceRequest(ResourceRequest resReq,
  Resource maximumResource, QueueInfo queueInfo, RMContext rmContext)
  throws InvalidResourceRequestException {
Configuration conf = rmContext.getYarnConfiguration();
// If Node label is not enabled throw exception
if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) {
  String labelExp = resReq.getNodeLabelExpression();
  if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp)
  || null == labelExp)) {
throw new InvalidLabelResourceRequestException(
"Invalid resource request, node label not enabled "
+ "but request contains label expression");
  }
}
{code}

Thoughts??

{quote}
The current fact is (with or without this fix): application submitted with node 
label expression explicitly specified will fail during recovery
{quote}
IMHO this should be acceptable, since any application submitted with labels while 
the feature is disabled gets rejected.

Solution 2:
We could ignore/reset labels to the default in the ResourceRequest when node labels 
are disabled. I haven't looked at the impact of the same. 
Elaborate testing would be needed to see how metrics are impacted.

YARN-4562 will try to handle ignoring the loading of label configuration when disabled.

[~templedf] I do agree that the admin would require some way to dump application 
info when recovery fails so that a bulk update in the state store is possible.


> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> 

[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783397#comment-15783397
 ] 

Bibin A Chundatt commented on YARN-6031:


As [~sunilg] mentioned earlier, ignoring the application could create a stale 
application in the state store.

[~Ying Zhang] IIUC, ignoring validation on recovery should also work.

{code}
  private static void validateResourceRequest(ResourceRequest resReq,
  Resource maximumResource, QueueInfo queueInfo, RMContext rmContext)
  throws InvalidResourceRequestException {
Configuration conf = rmContext.getYarnConfiguration();
// If Node label is not enabled throw exception
if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) {
  String labelExp = resReq.getNodeLabelExpression();
  if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp)
  || null == labelExp)) {
throw new InvalidLabelResourceRequestException(
"Invalid resource request, node label not enabled "
+ "but request contains label expression");
  }
}
{code}

Thoughts??

{quote}
The current fact is (with or without this fix): application submitted with node 
label expression explicitly specified will fail during recovery
{quote}
IMHO this should be acceptable, since any application submitted with labels while 
the feature is disabled gets rejected.

Solution 2:
We could ignore/reset labels to the default in the ResourceRequest when node labels 
are disabled. I haven't looked at the impact of the same. 
Elaborate testing would be needed to see how metrics are impacted.

YARN-4562 will try to handle ignoring the loading of label configuration when disabled.

[~templedf] I do agree that the admin would require some way to dump application 
info when recovery fails so that a bulk update in the state store is possible.
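
A rough sketch of the "skip validation on recovery" idea above; the isRecovery flag and the method signature here are hypothetical, and the real SchedulerUtils code differs:

{code}
// Hypothetical sketch only; the real SchedulerUtils/RMAppManager signatures differ.
class LabelValidationSketch {
  static void validateNodeLabel(String labelExpression, boolean nodeLabelsEnabled,
      boolean isRecovery) {
    if (nodeLabelsEnabled || labelExpression == null || labelExpression.isEmpty()) {
      return; // nothing to validate
    }
    if (isRecovery) {
      // On recovery, tolerate the stale label expression instead of failing the RM:
      // log and continue (or reset the expression to NO_LABEL / the default).
      System.out.println("Ignoring node label '" + labelExpression
          + "' during recovery: node labels are disabled");
      return;
    }
    throw new IllegalArgumentException("Invalid resource request, node label not "
        + "enabled but request contains label expression");
  }
}
{code}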


> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels have been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5969) FairShareComparator: Cache value of getResourceUsage for better performance

2016-12-28 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783392#comment-15783392
 ] 

Yufei Gu commented on YARN-5969:


Absolutely! [~zsl2007], thanks for working on this. Any contribution to the 
community is welcome!

> FairShareComparator: Cache value of getResourceUsage for better performance
> ---
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: 20161206.patch, 20161222.patch, YARN-5969.patch, 
> apprunning_after.png, apprunning_before.png, 
> containerAllocatedDelta_before.png, containerAllocated_after.png, 
> pending_after.png, pending_before.png
>
>
> In the FairShareComparator class, the performance of the getResourceUsage() function 
> is very poor; it can be executed more than 100,000,000 times per second.
> In our scenario, it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.
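
To illustrate the "reduce call count" idea, a simplified comparator that snapshots each schedulable's usage once per sort instead of re-reading it on every comparison; the types below are stand-ins, not the real FairScheduler classes:

{code}
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.Map;

// Stand-in types; the real FairShareComparator compares Schedulable fair-share ratios.
class CachedUsageComparator implements Comparator<CachedUsageComparator.App> {
  interface App {
    long getResourceUsage(); // the expensive call in the real scheduler
  }

  private final Map<App, Long> usageCache = new IdentityHashMap<>();

  // Snapshot usage once per sort instead of on every compare() invocation.
  void refresh(Iterable<App> apps) {
    usageCache.clear();
    for (App app : apps) {
      usageCache.put(app, app.getResourceUsage());
    }
  }

  @Override
  public int compare(App a, App b) {
    return Long.compare(usageCache.get(a), usageCache.get(b));
  }
}
{code}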



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5257) Fix unreleased resources and null dereferences

2016-12-28 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783376#comment-15783376
 ] 

Yufei Gu commented on YARN-5257:


Thanks [~rkanter] for the review and commit.

> Fix unreleased resources and null dereferences
> --
>
> Key: YARN-5257
> URL: https://issues.apache.org/jira/browse/YARN-5257
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: YARN-5257.001.patch
>
>
> The following code contains potential problems:
> {code}
> Unreleased Resource: Streams  TopCLI.java:738
> Unreleased Resource: Streams  Graph.java:189
> Unreleased Resource: Streams  CgroupsLCEResourcesHandler.java:291
> Unreleased Resource: Streams  UnmanagedAMLauncher.java:195
> Unreleased Resource: Streams  CGroupsHandlerImpl.java:319
> Unreleased Resource: Streams  TrafficController.java:629
> Null Dereference  ApplicationImpl.java:465
> Null Dereference  VisualizeStateMachine.java:52
> Null Dereference  ContainerImpl.java:1089
> Null Dereference  QueueManager.java:219
> Null Dereference  QueueManager.java:232
> Null Dereference  ResourceLocalizationService.java:1016
> Null Dereference  ResourceLocalizationService.java:1023
> Null Dereference  ResourceLocalizationService.java:1040
> Null Dereference  ResourceLocalizationService.java:1052
> Null Dereference  ProcfsBasedProcessTree.java:802
> Null Dereference  TimelineClientImpl.java:639
> Null Dereference  LocalizedResource.java:206
> {code}
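
For context, the usual Java fixes for these two classes of findings are try-with-resources for the unreleased streams and an explicit null check before dereferencing; the snippet below is illustrative only, not the actual patch:

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class ResourceHygieneExamples {
  // "Unreleased Resource: Streams" style fix: the stream is closed even on exceptions.
  static String readFirstLine(String path) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
      return reader.readLine();
    }
  }

  // "Null Dereference" style fix: check the possibly-null result before using it.
  static int firstLineLength(String path) throws IOException {
    String line = readFirstLine(path); // may be null for an empty file
    return line == null ? 0 : line.length();
  }
}
{code}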



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5798) Handle FSPreemptionThread crashing due to a RuntimeException

2016-12-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783375#comment-15783375
 ] 

Karthik Kambatla commented on YARN-5798:


Oh, and in RMFatalEventType, instead of calling it OTHERS, we should likely 
have a more descriptive name. 

> Handle FSPreemptionThread crashing due to a RuntimeException
> 
>
> Key: YARN-5798
> URL: https://issues.apache.org/jira/browse/YARN-5798
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Yufei Gu
> Attachments: YARN-5798.001.patch
>
>
> YARN-5605 added an FSPreemptionThread. The run() method catches 
> InterruptedException. If this were to run into a RuntimeException, the 
> preemption thread would crash. We should probably fail the RM itself (or 
> transition to standby) when this happens. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5906) Update AppSchedulingInfo to use SchedulingPlacementSet

2016-12-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783373#comment-15783373
 ] 

Hudson commented on YARN-5906:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11050 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/11050/])
YARN-5906. Update AppSchedulingInfo to use SchedulingPlacementSet. (sunilg: rev 
9ca54f4810de182195263bd594afb56dab564105)
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/LocalitySchedulingPlacementSet.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/SchedulingPlacementSet.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimitsByPartition.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java


> Update AppSchedulingInfo to use SchedulingPlacementSet
> --
>
> Key: YARN-5906
> URL: https://issues.apache.org/jira/browse/YARN-5906
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 3.0.0-alpha2
>
> Attachments: YARN-5906.1.patch, YARN-5906.2.patch, YARN-5906.3.patch, 
> YARN-5906.4.patch, YARN-5906.5.patch
>
>
> Currently AppSchedulingInfo simply stores resource requests, and the scheduler makes 
> decisions according to the stored resource requests. For example, CS/FS use 
> slightly different approaches to get pending resource requests and make delay 
> scheduling decisions. 
> There are several benefits to moving the pending resource request data structure 
> to SchedulingPlacementSet:
> 1) Delay scheduling logic should be agnostic to the scheduler; for example, CS 
> supports count-based delay and FS supports both count-based and time-based 
> delay. Ideally the scheduler should be able to choose which delay scheduling 
> policy to use.
> 2) In addition to 1), YARN-4902 has a proposal to support pluggable delay 
> scheduling behavior in addition to locality-based (host->rack->offswitch), 
> which requires more flexibility.
> 3) To make YARN-4902 real, instead of directly adding the new 
> resource request API to the client, we can make the scheduler use it internally to 
> make sure it is well defined. And AppSchedulingInfo/SchedulingPlacementSet 
> will be the perfect place to isolate which ResourceRequest implementation to 
> use.
> 4) Different scheduling requirements need different behaviors when checking the 
> ResourceRequest table.
> This JIRA is the 1st patch of several refactorings; it moves all 
> ResourceRequest data structures and logic to SchedulingPlacementSet. We need the 
> following changes to make it better structured:
> - Make delay scheduling a plugin of SchedulingPlacementSet
> - After YARN-4902 gets committed, change SchedulingPlacementSet to use 
> YARN-4902 internally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5798) Handle FSPreemptionThread crashing due to a RuntimeException

2016-12-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783370#comment-15783370
 ] 

Karthik Kambatla commented on YARN-5798:


Thanks for picking this up, Yufei. 

Comments on the patch:
# Interrupting the thread is the only way to stop the thread. So, I think we 
should just return (and not transition to standby) for InterruptedException.
# For RTEs, how about using a custom {{Thread.UncaughtExceptionHandler}} (see the 
sketch below)? Other threads in FairScheduler do not handle RTEs either. We could 
use the same handler for all these threads. It might not be a bad idea to increase 
the scope of this JIRA and include those as well. 
# We should also add tests. 
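
A minimal sketch of the shared {{Thread.UncaughtExceptionHandler}} idea from point 2; the "fail the RM or transition to standby" callback is a placeholder, not the real RM API:

{code}
// Sketch of a shared handler for critical scheduler threads; RM-specific reactions are placeholders.
class CriticalThreadExceptionHandler implements Thread.UncaughtExceptionHandler {
  private final Runnable failOrTransitionToStandby; // hypothetical callback into the RM

  CriticalThreadExceptionHandler(Runnable failOrTransitionToStandby) {
    this.failOrTransitionToStandby = failOrTransitionToStandby;
  }

  @Override
  public void uncaughtException(Thread t, Throwable e) {
    System.err.println("Critical thread " + t.getName() + " crashed: " + e);
    failOrTransitionToStandby.run();
  }
}

// Usage: install the same handler on every critical FairScheduler thread, e.g.
//   Thread preemptionThread = new Thread(preemptionLoop, "FSPreemptionThread");
//   preemptionThread.setUncaughtExceptionHandler(
//       new CriticalThreadExceptionHandler(() -> { /* fail RM or go standby */ }));
{code}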

> Handle FSPreemptionThread crashing due to a RuntimeException
> 
>
> Key: YARN-5798
> URL: https://issues.apache.org/jira/browse/YARN-5798
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Yufei Gu
> Attachments: YARN-5798.001.patch
>
>
> YARN-5605 added an FSPreemptionThread. The run() method catches 
> InterruptedException. If this were to run into a RuntimeException, the 
> preemption thread would crash. We should probably fail the RM itself (or 
> transition to standby) when this happens. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783232#comment-15783232
 ] 

Daniel Templeton commented on YARN-6031:


I agree that {{-force-recovery}} could cause a significant information loss, 
but it's something that the admin has to do explicitly, and it's only 
application information, so it's not the end of the world.  With a 
{{-dump-application-information}} option, the admin has the choice to either 1) 
look at each app that fails the recovery and decide whether to purge it or do 
something else (like turn node labels back on), or 2) do a bulk purge with 
{{-force-recovery}}.

It might also be good to have another option, something like 
{{-dry-run-recovery}}, that would tell the admin the IDs of all the 
applications that will fail during recovery so that she doesn't have to keep 
doing them one at a time.  In fact, I could even see making that the default 
behavior before failing the resource manager.

In any case, I don't think the approach proposed in this JIRA, to just ignore 
the failed app, is going to work out.


> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels have been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783133#comment-15783133
 ] 

Daniel Templeton commented on YARN-5931:


I agree, but it should be "collection", not "collections".

> Document timeout interfaces CLI and REST APIs
> -
>
> Key: YARN-5931
> URL: https://issues.apache.org/jira/browse/YARN-5931
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: ResourceManagerRest.html, YARN-5931.0.patch, 
> YARN-5931.1.patch, YARN-5931.2.patch, YARN-5931.3.patch, YarnCommands.html
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5906) Update AppSchedulingInfo to use SchedulingPlacementSet

2016-12-28 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783106#comment-15783106
 ] 

Sunil G commented on YARN-5906:
---

Test case failures are unrelated. Will commit in a short while.

> Update AppSchedulingInfo to use SchedulingPlacementSet
> --
>
> Key: YARN-5906
> URL: https://issues.apache.org/jira/browse/YARN-5906
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-5906.1.patch, YARN-5906.2.patch, YARN-5906.3.patch, 
> YARN-5906.4.patch, YARN-5906.5.patch
>
>
> Currently AppSchedulingInfo simply stores resource requests, and the scheduler makes 
> decisions according to the stored resource requests. For example, CS/FS use 
> slightly different approaches to get pending resource requests and make delay 
> scheduling decisions. 
> There are several benefits to moving the pending resource request data structure 
> to SchedulingPlacementSet:
> 1) Delay scheduling logic should be agnostic to the scheduler; for example, CS 
> supports count-based delay and FS supports both count-based and time-based 
> delay. Ideally the scheduler should be able to choose which delay scheduling 
> policy to use.
> 2) In addition to 1), YARN-4902 has a proposal to support pluggable delay 
> scheduling behavior in addition to locality-based (host->rack->offswitch), 
> which requires more flexibility.
> 3) To make YARN-4902 real, instead of directly adding the new 
> resource request API to the client, we can make the scheduler use it internally to 
> make sure it is well defined. And AppSchedulingInfo/SchedulingPlacementSet 
> will be the perfect place to isolate which ResourceRequest implementation to 
> use.
> 4) Different scheduling requirements need different behaviors when checking the 
> ResourceRequest table.
> This JIRA is the 1st patch of several refactorings; it moves all 
> ResourceRequest data structures and logic to SchedulingPlacementSet. We need the 
> following changes to make it better structured:
> - Make delay scheduling a plugin of SchedulingPlacementSet
> - After YARN-4902 gets committed, change SchedulingPlacementSet to use 
> YARN-4902 internally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs

2016-12-28 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783083#comment-15783083
 ] 

Rohith Sharma K S commented on YARN-5931:
-

I think this sentence is more meaningful: "When you run a GET operation on 
this resource, a collections of ApplicationTimeout object is returned"

> Document timeout interfaces CLI and REST APIs
> -
>
> Key: YARN-5931
> URL: https://issues.apache.org/jira/browse/YARN-5931
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: ResourceManagerRest.html, YARN-5931.0.patch, 
> YARN-5931.1.patch, YARN-5931.2.patch, YARN-5931.3.patch, YarnCommands.html
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783073#comment-15783073
 ] 

Sunil G commented on YARN-6031:
---

Yes, makes sense. This is more or less work for the admin then. I am not so sure 
whether the RM can take a call and remove it internally; it may be costly, as we 
are deleting a user record. But if the necessary information is pushed to the logs, 
then we may be good to remove the data internally. Thoughts?

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels have been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783025#comment-15783025
 ] 

Daniel Templeton commented on YARN-6031:


bq. max_applications may hit and valid apps may get emitted at some point of 
time.

Exactly my concern.  So far, the answer has been that the recovery should just 
fail, and the admin should clear the app.  See your comments in YARN-4401.

Personally, I think that's a bad experience for the admin, but resolving it 
will require a bit of infrastructure work to make something useful happen.  I 
think a {{-force-recovery}} option that purges the bad apps after dumping full 
info to the logs would be a good start.  It would also help to have an option 
to dump the info about an app from the state store without having the RM 
running, so that when using {{-remove-application-from-state-store}} you know 
what you're purging.

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels have been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783002#comment-15783002
 ] 

Daniel Templeton commented on YARN-5931:


Thanks for the update.  Two more things:

* This one is still outstanding: "When you run a GET operation on this 
resource, you can obtain a collection of Application Timeout Objects." should 
be "When you run a GET operation on this resource, a collection of Application 
Timeout Objects is returned."
* In the Cluster Applications API, you say that a collection "are" returned.  
It should be "is."

> Document timeout interfaces CLI and REST APIs
> -
>
> Key: YARN-5931
> URL: https://issues.apache.org/jira/browse/YARN-5931
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: ResourceManagerRest.html, YARN-5931.0.patch, 
> YARN-5931.1.patch, YARN-5931.2.patch, YARN-5931.3.patch, YarnCommands.html
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782995#comment-15782995
 ] 

Sunil G commented on YARN-6031:
---

Yes [~templedf], you are correct. We will end up having many flaky apps in the 
state store. I also had an offline chat with [~bibinchundatt], and there may be a 
potential problem with that too: max_applications may hit and valid apps may 
get emitted at some point of time.
We can forcefully remove the app from the state store; however, we may lose 
information about it, as it is a failed app. With clear logging, we can evict such 
apps from the state store. I am not so sure whether we need to delete immediately 
or can put it to an async monitor to delete later. Thoughts?

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels have been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782984#comment-15782984
 ] 

Daniel Templeton commented on YARN-6031:


Yep, tests are needed.  Love the long explanatory comment.  Do you think we can 
make the log message a bit more explicit, i.e. say that the failure was because 
node labels have been disabled and point out the property that the admin should 
use to disable/enable node labels?

Also, what happens to the app in the state store?  If we fail to recover it and 
just ignore it, it will sit there forever, I suspect.  It's probably a bad 
thing if the RM and state store don't agree on what apps are active.

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels have been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782814#comment-15782814
 ] 

Sunil G edited comment on YARN-6031 at 12/28/16 1:04 PM:
-

Thanks [~Ying Zhang],

The overall approach makes sense to me. You are basically checking for the 
label-disabled case etc. only when an exception is thrown. It's more or less the 
same as the earlier suggestion, so it's OK.

However, a few more points. With this patch, the app recovery will now continue.
- The app's AM resource request may not have had any specific node label, but 
other containers may have. Since we send InvalidResourceRequest in those cases, 
it might be fine for now.
- In the above case, what will happen to ongoing containers (w.r.t. the RM's data 
structures)? If it's a running app w/o any outstanding requests, we might need 
to consider the running containers as NO_LABEL. I am not sure whether this 
happens as of today; I'll also check.

These could be out of scope for this ticket. However, let's check opinions from 
other folks as well.

*Note*: Please add more test cases to cover this patch.


was (Author: sunilg):
Thanks [~Ying Zhang],

The overall approach makes sense to me. You are basically checking for the 
label-disabled case etc. only when an exception is thrown. It's more or less the 
same as the earlier suggestion, so it's OK.

However, a few more points. With this patch, the app recovery will now continue.
- The app's AM resource request may not have had any specific node label, but 
other containers may have. Since we send InvalidResourceRequest in those cases, 
it might be fine for now.
- In the above case, what will happen to ongoing containers? If it's a running app 
w/o any outstanding requests, we might need to do a label change for the running 
containers.

These could be out of scope for this ticket. However, let's check opinions from 
other folks as well.

NB: Please add more test cases to cover this patch.

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels have been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782814#comment-15782814
 ] 

Sunil G commented on YARN-6031:
---

Thanks [~Ying Zhang],

The overall approach makes sense to me. You are basically checking for the 
label-disabled case etc. only when an exception is thrown. It's more or less the 
same as the earlier suggestion, so it's OK.

However, a few more points. With this patch, the app recovery will now continue.
- The app's AM resource request may not have had any specific node label, but 
other containers may have. Since we send InvalidResourceRequest in those cases, 
it might be fine for now.
- In the above case, what will happen to ongoing containers? If it's a running app 
w/o any outstanding requests, we might need to do a label change for the running 
containers.

These could be out of scope for this ticket. However, let's check opinions from 
other folks as well.

NB: Please add more test cases to cover this patch.

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels have been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6024) Capacity Scheduler 'continuous reservation looking' doesn't work when sum of queue's used and reserved resources is equal to max

2016-12-28 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782613#comment-15782613
 ] 

Sunil G commented on YARN-6024:
---

Committed {{YARN-6024.001.patch}} to trunk/branch-2/branch-2.8 and 
{{YARN-6024-branch-2.7.001.patch}} to branch-2.7. Thanks [~leftnoteasy] for the 
patch and thanks [~Ying Zhang] for additional review.

> Capacity Scheduler 'continuous reservation looking' doesn't work when sum of 
> queue's used and reserved resources is equal to max
> 
>
> Key: YARN-6024
> URL: https://issues.apache.org/jira/browse/YARN-6024
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6024-branch-2.7.001.patch, 
> YARN-6024-branch-2.7.001.patch, YARN-6024.001.patch
>
>
> Found one corner case where continuous reservation looking doesn't work:
> when the queue's used resources equal its max, the queue's capacity check fails.






[jira] [Commented] (YARN-5719) Enforce a C standard for native container-executor

2016-12-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782602#comment-15782602
 ] 

Hudson commented on YARN-5719:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11049 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/11049/])
YARN-5719. Enforce a C standard for native container-executor. (vvasudev: rev 
972da46cb48725ad49d3e0a033742bd1a8228f51)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/CMakeLists.txt


> Enforce a C standard for native container-executor
> --
>
> Key: YARN-5719
> URL: https://issues.apache.org/jira/browse/YARN-5719
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Reporter: Chris Douglas
>Assignee: Chris Douglas
> Fix For: 3.0.0-alpha2
>
> Attachments: YARN-5719.000.patch
>
>
> The {{container-executor}} build should declare the C standard it uses.






[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Ying Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782566#comment-15782566
 ] 

Ying Zhang edited comment on YARN-6031 at 12/28/16 10:22 AM:
-

Uploaded a patch based on [~leftnoteasy]'s comment on YARN-4465: swallow the 
InvalidResourceRequestException when recovering, fail the recovery only for this 
application and print an error message, then let the rest of the recovery 
continue.

[~sunilg], your suggestion also makes sense to me. Actually, the code change for 
your approach would be made in the same place as in this patch, with a small 
modification: in recover(), inside the for loop, if the conditions are met, skip 
calling "recoverApplication" and log a message like "skip recovering 
application ..." instead. The difference is that with that approach we would 
always check for these conditions even though this is not the normal case, 
while with the approach in the patch we only need to react when the exception 
actually happens. I'm OK with either approach since the overhead is not that big.

Let's see what others think:-) [~leftnoteasy], [~bibinchundatt]

Just to clarify, the current behavior is (with or without this fix): an 
application submitted with a node label expression explicitly specified will 
fail during recovery, while an application submitted without a node label 
expression will succeed, regardless of whether there is a default node label 
expression for the target queue. This is due to the following code snippet: the 
call to "checkQueueLabelInLabelManager", which checks whether the node label 
exists in the node label manager (the node label manager has no labels at all 
when node labels are disabled), is skipped during recovery:
{code:title=SchedulerUtils.java|borderStyle=solid}
  public static void normalizeAndValidateRequest(ResourceRequest resReq,
  Resource maximumResource, String queueName, YarnScheduler scheduler,
  boolean isRecovery, RMContext rmContext, QueueInfo queueInfo)
  throws InvalidResourceRequestException {
... ...

SchedulerUtils.normalizeNodeLabelExpressionInRequest(resReq, queueInfo);
if (!isRecovery) {
  validateResourceRequest(resReq, maximumResource, queueInfo, rmContext);  
// calling checkQueueLabelInLabelManager
}
{code}

This is not exactly the same as what happens when submitting a job in the 
normal case (i.e., not during recovery). In the normal case, when there is a 
default node label expression defined for the queue and node labels are 
disabled, the application will also get rejected due to an invalid resource 
request even if it doesn't specify a node label expression. I believe this will 
be fixed once YARN-4652 is addressed.
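
To make the approach in the patch concrete, here is a minimal, hypothetical 
sketch of the exception-swallowing idea (simplified names, not the actual 
RMAppManager code or the attached patch):

{code:java}
// Hypothetical sketch of "swallow the exception during recovery": fail only the
// offending application and keep recovering the rest. All names are illustrative.
import java.util.List;

public class RecoverySketch {
  static class InvalidResourceRequestException extends Exception {
    InvalidResourceRequestException(String message) { super(message); }
  }

  interface ApplicationStateData {
    String getApplicationId();
  }

  void recover(List<ApplicationStateData> storedApps) {
    for (ApplicationStateData app : storedApps) {
      try {
        recoverApplication(app);
      } catch (InvalidResourceRequestException e) {
        // Only this application's recovery fails; RM startup continues.
        System.err.println("Skipping recovery of " + app.getApplicationId()
            + ": " + e.getMessage());
      }
    }
  }

  void recoverApplication(ApplicationStateData app)
      throws InvalidResourceRequestException {
    // In the real code path this ends up in
    // SchedulerUtils.normalizeAndValidateRequest(), which may throw when the
    // stored app has a node label expression but node labels are disabled.
  }

  public static void main(String[] args) {
    new RecoverySketch().recover(
        List.of(() -> "application_1482912000000_0001"));
  }
}
{code}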
 


was (Author: ying zhang):
Uploaded a patch based on [~leftnoteasy]'s comment on YARN-4465: swallow the 
InvalidResourceRequestException when recovering, fail the recovery only for this 
application and print an error message, then let the rest of the recovery 
continue.

[~sunilg], your suggestion also makes sense to me. Actually, the code change for 
your approach would be made in the same place as in this patch, with a small 
modification: in recover(), inside the for loop, if the conditions are met, skip 
calling "recoverApplication" and log a message like "skip recovering 
application ..." instead. The difference is that with that approach we would 
always check for these conditions even though this is not the normal case, 
while with the approach in the patch we only need to react when the exception 
actually happens. I'm OK with either approach since the overhead is not that big.

Let's see what others think:-) [~leftnoteasy], [~bibinchundatt]

Just to clarify, the current behavior is (with or without this fix): an 
application submitted with a node label expression explicitly specified will 
fail during recovery, while an application submitted without a node label 
expression will succeed, regardless of whether there is a default node label 
expression for the target queue. This is due to the following code snippet: the 
call to "checkQueueLabelInLabelManager", which checks whether the node label 
exists in the node label manager (the node label manager has no labels at all 
when node labels are disabled), is skipped during recovery:
{code:title=SchedulerUtils.java|borderStyle=solid}
  public static void normalizeAndValidateRequest(ResourceRequest resReq,
  Resource maximumResource, String queueName, YarnScheduler scheduler,
  boolean isRecovery, RMContext rmContext, QueueInfo queueInfo)
  throws InvalidResourceRequestException {
... ...

SchedulerUtils.normalizeNodeLabelExpressionInRequest(resReq, queueInfo);
if (!isRecovery) {
  validateResourceRequest(resReq, maximumResource, queueInfo, rmContext);  
// calling checkQueueLabelInLabelManager
}
{code}

 

> Application recovery failed after disabling node label
> 

[jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved container

2016-12-28 Thread Tao Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782578#comment-15782578
 ] 

Tao Yang commented on YARN-6029:


Thanks [~wangda] & [~Naganarasimha] !
{quote}
Agree but IIUC based on 2.8 code its less dependent on locking of child queue 
as acls are updated during reinitialization all the queues at one shot
{quote}
We also noticed that it doesn't hold the lock of the LeafQueue instance when 
updating acls (CapacityScheduler#setQueueAcls), so the current logic doesn't 
guarantee the consistency of the acls.
{quote}
So to ensure acls are returned appropriately i presume we should be holding the 
lock on CS.getQueueUserAclInfo which is not happening currently in 2.8.
{quote}
I'm not clear about this. Is it worth ensuring the consistency of the acls at 
the cost of reducing the efficiency of the scheduler?
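
To make the locking pattern easier to follow, here is a minimal, self-contained 
sketch (hypothetical Parent/Leaf classes, not the real CapacityScheduler code) of 
the opposite-order lock acquisition described in this issue; the fix corresponds 
to dropping synchronized from the leaf's read-only getQueueUserAclInfo:

{code:java}
// Hypothetical sketch of the lock ordering: Thread A holds the leaf lock and
// then needs the parent lock, while Thread B holds the parent lock and then
// needs the leaf lock. Removing "synchronized" from the leaf's read-only
// method breaks the cycle.
import java.util.Collections;
import java.util.List;

public class DeadlockSketch {
  static class Parent {
    final Leaf child = new Leaf(this);
    synchronized List<String> getQueueUserAclInfo() {
      return child.getQueueUserAclInfo();          // parent lock -> leaf method
    }
    synchronized void internalReleaseResource() {  // requires the parent lock
      // release bookkeeping would happen here
    }
  }

  static class Leaf {
    private final Parent parent;
    Leaf(Parent parent) { this.parent = parent; }
    synchronized void assignContainers() {
      parent.internalReleaseResource();            // leaf lock -> parent lock
    }
    // Fixed version: not synchronized, so Thread B never blocks on the leaf
    // lock while holding the parent lock.
    List<String> getQueueUserAclInfo() {
      return Collections.singletonList("root.a: SUBMIT_APPLICATIONS");
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Parent root = new Parent();
    Thread schedulerThread = new Thread(root.child::assignContainers);
    Thread ipcHandlerThread = new Thread(root::getQueueUserAclInfo);
    schedulerThread.start();
    ipcHandlerThread.start();
    schedulerThread.join();   // with the fix in place, both threads terminate
    ipcHandlerThread.join();
  }
}
{code}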

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by 
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to 
> release a reserved container
> --
>
> Key: YARN-6029
> URL: https://issues.apache.org/jira/browse/YARN-6029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls 
> YarnClient#getQueueAclsInfo) just at the moment that LeafQueue#assignContainers 
> is called and before it notifies the parent queue to release resources (it 
> should release a reserved container), the ResourceManager can deadlock. I found 
> this problem in our testing environment for Hadoop 2.8.
> Reproducing the deadlock, in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls synchronized 
> LeafQueue#assignContainers (it takes the LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized 
> ParentQueue#getQueueUserAclInfo (it takes the ParentQueue instance lock of queue 
> root), iterates over the children's queue acls, and is blocked when calling 
> synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of 
> queue root.a is held by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being 
> completed and is blocked when invoking the synchronized 
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of 
> queue root is held by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be 
> removed to solve this problem, since this method does not appear to touch 
> fields of the LeafQueue instance.
> Attaching a patch with a UT for review.






[jira] [Updated] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Ying Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying Zhang updated YARN-6031:
-
Attachment: YARN-6031.001.patch

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node labels, restart the RM, configure CS properly, and run some jobs;
> Disable node labels, restart the RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels had been disabled.






[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label

2016-12-28 Thread Ying Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782566#comment-15782566
 ] 

Ying Zhang commented on YARN-6031:
--

Uploaded a patch based on [~leftnoteasy]'s comment on YARN-4465: swallow the 
InvalidResourceRequestException when recovering, fail the recovery only for this 
application and print an error message, then let the rest of the recovery 
continue.

[~sunilg], your suggestion also makes sense to me. Actually, the code change for 
your approach would be made in the same place as in this patch, with a small 
modification: in recover(), inside the for loop, if the conditions are met, skip 
calling "recoverApplication" and log a message like "skip recovering 
application ..." instead. The difference is that with that approach we would 
always check for these conditions even though this is not the normal case, 
while with the approach in the patch we only need to react when the exception 
actually happens. I'm OK with either approach since the overhead is not that big.

Let's see what others think:-) [~leftnoteasy], [~bibinchundatt]

Just to clarify, the current behavior is (with or without this fix): an 
application submitted with a node label expression explicitly specified will 
fail during recovery, while an application submitted without a node label 
expression will succeed, regardless of whether there is a default node label 
expression for the target queue. This is due to the following code snippet: the 
call to "checkQueueLabelInLabelManager", which checks whether the node label 
exists in the node label manager (the node label manager has no labels at all 
when node labels are disabled), is skipped during recovery:
{code:title=SchedulerUtils.java|borderStyle=solid}
  public static void normalizeAndValidateRequest(ResourceRequest resReq,
  Resource maximumResource, String queueName, YarnScheduler scheduler,
  boolean isRecovery, RMContext rmContext, QueueInfo queueInfo)
  throws InvalidResourceRequestException {
... ...

SchedulerUtils.normalizeNodeLabelExpressionInRequest(resReq, queueInfo);
if (!isRecovery) {
  validateResourceRequest(resReq, maximumResource, queueInfo, rmContext);  
// calling checkQueueLabelInLabelManager
}
{code}

 

> Application recovery failed after disabling node label
> --
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node labels, restart the RM, configure CS properly, and run some jobs;
> Disable node labels, restart the RM, and the following exception is thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels had been disabled.






[jira] [Created] (YARN-6034) Add support for better logging in container-executor

2016-12-28 Thread Varun Vasudev (JIRA)
Varun Vasudev created YARN-6034:
---

 Summary: Add support for better logging in container-executor
 Key: YARN-6034
 URL: https://issues.apache.org/jira/browse/YARN-6034
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev


Currently, the container-executor doesn't have any support for log levels, etc. 
Add support for better logging and for setting log levels via the invoker or a 
config file.






[jira] [Created] (YARN-6033) Add support for sections in container-executor configuration file

2016-12-28 Thread Varun Vasudev (JIRA)
Varun Vasudev created YARN-6033:
---

 Summary: Add support for sections in container-executor 
configuration file
 Key: YARN-6033
 URL: https://issues.apache.org/jira/browse/YARN-6033
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev









[jira] [Updated] (YARN-5719) Enforce a C standard for native container-executor

2016-12-28 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-5719:

Assignee: Chris Douglas

> Enforce a C standard for native container-executor
> --
>
> Key: YARN-5719
> URL: https://issues.apache.org/jira/browse/YARN-5719
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Reporter: Chris Douglas
>Assignee: Chris Douglas
> Fix For: 3.0.0-alpha2
>
> Attachments: YARN-5719.000.patch
>
>
> The {{container-executor}} build should declare the C standard it uses.






[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs

2016-12-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782520#comment-15782520
 ] 

Hadoop QA commented on YARN-5931:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
14s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
52s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  1m 
12s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
41s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
10s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  4m 
40s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 48s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 1 new + 206 unchanged - 0 fixed = 207 total (was 206) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  1m 
 5s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 3 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
1s{color} | {color:red} The patch 12 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
34s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
36s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 39m 
45s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
11s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |

[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs

2016-12-28 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782376#comment-15782376
 ] 

Sunil G commented on YARN-5931:
---

Thanks [~rohithsharma]
Looks fine. I will wait for [~templedf] also.

> Document timeout interfaces CLI and REST APIs
> -
>
> Key: YARN-5931
> URL: https://issues.apache.org/jira/browse/YARN-5931
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: ResourceManagerRest.html, YARN-5931.0.patch, 
> YARN-5931.1.patch, YARN-5931.2.patch, YARN-5931.3.patch, YarnCommands.html
>
>







[jira] [Updated] (YARN-5931) Document timeout interfaces CLI and REST APIs

2016-12-28 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-5931:

Attachment: YARN-5931.3.patch

Updated patch fixing review comments

> Document timeout interfaces CLI and REST APIs
> -
>
> Key: YARN-5931
> URL: https://issues.apache.org/jira/browse/YARN-5931
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: ResourceManagerRest.html, YARN-5931.0.patch, 
> YARN-5931.1.patch, YARN-5931.2.patch, YARN-5931.3.patch, YarnCommands.html
>
>



