[jira] [Commented] (YARN-6524) Avoid storing unnecessary information in the Memory for the finished apps

2017-04-25 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982974#comment-15982974
 ] 

Jason Lowe commented on YARN-6524:
--

This is a duplicate of YARN-65.

> Avoid storing unnecessary information in the Memory for the finished apps
> -
>
> Key: YARN-6524
> URL: https://issues.apache.org/jira/browse/YARN-6524
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 2.7.3
>Reporter: Naganarasimha G R
>
> Avoid storing unnecessary information in the Memory for the finished apps
> In a cluster with a large number of finished apps, significant memory is 
> required to store unused information related to the AM's container launch, 
> such as localized resources, tokens, etc. 
> In one such scenario we had around 9k finished apps, each with 257 
> LocalResources amounting to 108 KB per app; just those 9k apps consumed 
> nearly 0.8 GB of memory. On low-end machines this would create a resource 
> crunch in the RM.






[jira] [Resolved] (YARN-6524) Avoid storing unnecessary information in the Memory for the finished apps

2017-04-25 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-6524.
--
Resolution: Duplicate

> Avoid storing unnecessary information in the Memory for the finished apps
> -
>
> Key: YARN-6524
> URL: https://issues.apache.org/jira/browse/YARN-6524
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 2.7.3
>Reporter: Naganarasimha G R
>
> Avoid storing unnecessary information in the Memory for the finished apps
> In a cluster with a large number of finished apps, significant memory is 
> required to store unused information related to the AM's container launch, 
> such as localized resources, tokens, etc. 
> In one such scenario we had around 9k finished apps, each with 257 
> LocalResources amounting to 108 KB per app; just those 9k apps consumed 
> nearly 0.8 GB of memory. On low-end machines this would create a resource 
> crunch in the RM.






[jira] [Commented] (YARN-5892) Capacity Scheduler: Support user-specific minimum user limit percent

2017-04-24 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981164#comment-15981164
 ] 

Jason Lowe commented on YARN-5892:
--

I don't understand imposing a hard limit of weight < 100/MULP.  For example, 
take a queue whose max capacity is far larger than its normal capacity and 
whose MULP is 100%.  With that hard limit, nobody can have a weight > 1.  
However, two or three separate users can all fit in that queue with their 
minimum limit due to queue growth beyond normal capacity, yet we're not going 
to let a single, weighted user do the same?  I don't see why there needs to be 
an upper limit on the weight -- it's the same as that many separate users 
showing up.  At some point, yes, the weight could be high enough that the 
queue could never meet the minimum limit of that user, but the same occurs 
when that many users show up at the same time as well.  Some users do not get 
their minimum because there's too much demand.
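
A quick numeric illustration of that scenario (a sketch with made-up values, 
not taken from any actual configuration):
{code}
// Illustrative numbers only: queue capacity 10% of a 1000GB cluster,
// max capacity 100% of the cluster, MULP 100%.
public class WeightLimitExample {
  public static void main(String[] args) {
    int queueCapacityGB = 100;     // normal capacity: 10% of cluster
    int queueMaxCapacityGB = 1000; // max capacity: whole cluster
    int mulp = 100;                // minimum-user-limit-percent

    // Minimum guarantee for one unweighted user:
    int minUserShareGB = queueCapacityGB * mulp / 100; // 100 GB

    // The proposed hard limit weight < 100/MULP caps every weight at 1.
    // Yet three separate users can all receive their minimum because the
    // queue can grow toward its max capacity:
    System.out.println(3 * minUserShareGB <= queueMaxCapacityGB); // true

    // So a single user with weight 3 (a 300 GB minimum) fits just as well.
  }
}
{code}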


> Capacity Scheduler: Support user-specific minimum user limit percent
> 
>
> Key: YARN-5892
> URL: https://issues.apache.org/jira/browse/YARN-5892
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: Active users highlighted.jpg, YARN-5892.001.patch, 
> YARN-5892.002.patch, YARN-5892.003.patch, YARN-5892.004.patch, 
> YARN-5892.005.patch, YARN-5892.006.patch, YARN-5892.007.patch, 
> YARN-5892.008.patch, YARN-5892.009.patch, YARN-5892.010.patch
>
>
> Currently, in the capacity scheduler, the {{minimum-user-limit-percent}} 
> property is per queue. A cluster admin should be able to set the minimum user 
> limit percent on a per-user basis within the queue.
> This functionality is needed so that when intra-queue preemption is enabled 
> (YARN-4945 / YARN-2113), some users can be deemed as more important than 
> other users, and resources from VIP users won't be as likely to be preempted.
> For example, if the {{getstuffdone}} queue has a MULP of 25 percent, but user 
> {{jane}} is a power user of queue {{getstuffdone}} and needs to be guaranteed 
> 75 percent, the properties for {{getstuffdone}} and {{jane}} would look like 
> this:
> {code}
> <property>
>   <name>yarn.scheduler.capacity.root.getstuffdone.minimum-user-limit-percent</name>
>   <value>25</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.root.getstuffdone.jane.minimum-user-limit-percent</name>
>   <value>75</value>
> </property>
> {code}






[jira] [Commented] (YARN-3839) Quit throwing NMNotYetReadyException

2017-04-21 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15978705#comment-15978705
 ] 

Jason Lowe commented on YARN-3839:
--

Please see my [earlier 
comment|https://issues.apache.org/jira/browse/YARN-3839?focusedCommentId=15975315=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15975315].
  The patch is malformed for the {{patch}} command:
{noformat}
$ patch -p1 < YARN-3839.003.patch 
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ContainerManagementProtocol.java
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManager.java
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/DummyContainerManager.java
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerResync.java
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/BaseContainerManagerTest.java
patch:  malformed patch at line 369: @@ -75,6 +76,10 @@
{noformat}

The first malformed patch hunk is this one:
{noformat}
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-serv
er-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/container
manager/BaseContainerManagerTest.java b/hadoop-yarn-project/hadoop-yarn/hadoop-y
arn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/s
erver/nodemanager/containermanager/BaseContainerManagerTest.java
index ad0a831..8de5678 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-node
manager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager
/BaseContainerManagerTest.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-node
manager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager
/BaseContainerManagerTest.java
@@ -65,6 +65,7 @@
 import org.apache.hadoop.yarn.server.nodemanager.Context;
 import org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor;
 import org.apache.hadoop.yarn.server.nodemanager.DeletionService;
 import org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService;
 import org.apache.hadoop.yarn.server.nodemanager.LocalRMInterface;
 import org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService;
{noformat}

Note that the hunk contains no added, changed, or deleted lines.  The patch 
will need to be regenerated; you can test it with the {{patch}} command against 
a clean checkout of trunk before posting, to confirm the QA bot will be able to 
apply it.

> Quit throwing NMNotYetReadyException
> 
>
> Key: YARN-3839
> URL: https://issues.apache.org/jira/browse/YARN-3839
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Karthik Kambatla
>Assignee: Manikandan R
> Attachments: YARN-3839.001.patch, YARN-3839.002.patch, 
> YARN-3839.003.patch
>
>
> Quit throwing NMNotYetReadyException when NM has not yet registered with the 
> RM.






[jira] [Commented] (YARN-6501) FSSchedulerNode.java fails to compile with JDK7

2017-04-20 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976822#comment-15976822
 ] 

Jason Lowe commented on YARN-6501:
--

+1 committing this.

> FSSchedulerNode.java fails to compile with JDK7
> ---
>
> Key: YARN-6501
> URL: https://issues.apache.org/jira/browse/YARN-6501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.0
>Reporter: John Zhuge
>Assignee: John Zhuge
> Attachments: YARN-6501.branch-2.001.patch
>
>
> {noformat}
> [ERROR] 
> /Users/jzhuge/hadoop-commit/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerNode.java:[183,18]
>  cannot find symbol
> [ERROR] symbol:   method 
> putIfAbsent(org.apache.hadoop.yarn.api.records.ApplicationAttemptId,org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt)
> [ERROR] location: variable appIdToAppMap of type 
> java.util.Map<org.apache.hadoop.yarn.api.records.ApplicationAttemptId,org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt>
> [ERROR] 
> /Users/jzhuge/hadoop-commit/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerNode.java:[184,29]
>  cannot find symbol
> [ERROR] symbol:   method 
> putIfAbsent(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt,org.apache.hadoop.yarn.api.records.Resource)
> [ERROR] location: variable resourcesPreemptedForApp of type 
> java.util.Map<org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt,org.apache.hadoop.yarn.api.records.Resource>
> {noformat}
> {{Map#putIfAbsent}} was introduced in JDK 8.
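
For reference, a minimal JDK7-compatible sketch of what {{putIfAbsent}} does 
for maps that never store null values (illustrative code, not the actual fix 
in the attached patch):
{code}
import java.util.HashMap;
import java.util.Map;

public class PutIfAbsentJdk7 {
  // JDK7-compatible equivalent of map.putIfAbsent(key, value) for maps
  // without null values: insert only when the key is absent, and return
  // the previously mapped value (or null if there was none).
  static <K, V> V putIfAbsent(Map<K, V> map, K key, V value) {
    V existing = map.get(key);
    if (existing == null) {
      map.put(key, value);
    }
    return existing;
  }

  public static void main(String[] args) {
    Map<String, Integer> m = new HashMap<String, Integer>();
    putIfAbsent(m, "appattempt_1", 1);
    putIfAbsent(m, "appattempt_1", 2);         // no-op: key already mapped
    System.out.println(m.get("appattempt_1")); // prints 1
  }
}
{code}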






[jira] [Commented] (YARN-3839) Quit throwing NMNotYetReadyException

2017-04-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975315#comment-15975315
 ] 

Jason Lowe commented on YARN-3839:
--

The QA bot is probably having issues with the patch since there are a number of 
patch hunks that have no actual changes, e.g.:
{noformat}
@@ -65,6 +65,7 @@
 import org.apache.hadoop.yarn.server.nodemanager.Context;
 import org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor;
 import org.apache.hadoop.yarn.server.nodemanager.DeletionService;
 import org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService;
 import org.apache.hadoop.yarn.server.nodemanager.LocalRMInterface;
 import org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService;
@@ -75,6 +76,10 @@
 import 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application;
 import 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationState;
 import 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container;
 import org.apache.hadoop.yarn.server.nodemanager.metrics.NodeManagerMetrics;
 import 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMNullStateStoreService;
 import 
org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager;
{noformat}


> Quit throwing NMNotYetReadyException
> 
>
> Key: YARN-3839
> URL: https://issues.apache.org/jira/browse/YARN-3839
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Karthik Kambatla
>Assignee: Manikandan R
> Attachments: YARN-3839.001.patch, YARN-3839.002.patch
>
>
> Quit throwing NMNotYetReadyException when NM has not yet registered with the 
> RM.






[jira] [Commented] (YARN-6272) TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently

2017-04-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974876#comment-15974876
 ] 

Jason Lowe commented on YARN-6272:
--

I've also seen this stacktrace on 2.8:
{noformat}
java.lang.AssertionError: expected:<1> but was:<2>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.doContainerResourceChange(TestAMRMClient.java:920)
at 
org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.testAMRMClientWithContainerResourceChange(TestAMRMClient.java:813)
{noformat}

In the above case, it looks like the nodemanager happened to be heartbeating 
just as the app made the allocate call that asked for the increase request.  In 
that case it was able to process both the increase and the decrease in the same 
heartbeat which the test explicitly does not expect.

The test itself is very fragile.  It launches a full minicluster and uses 
hardcoded sleeps sprinkled in various places, hoping asynchronous events have 
been processed in the interim.  That not only leads directly to flaky tests but 
also slows down the unit test unnecessarily.  Either the test needs to be made 
more tolerant of all the asynchronous activity, or it should ditch the 
minicluster and explicitly manage the cluster heartbeating.  The former can be 
done by having the test poll via app allocate heartbeats until it gets all the 
responses it needs rather than assume which heartbeats will get which 
responses.  The latter can be done by using MockRM, MockNM, and drain 
dispatchers so the test knows exactly which heartbeats have been completely 
processed and thus which app allocate calls will get the appropriate 
responses.  The latter approach would also eliminate the need for any arbitrary 
polling/sleeping intervals and speed up the test significantly.
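
A rough sketch of the polling approach, assuming the test's existing 
{{amClient}} and a hypothetical {{collectResponses()}} helper that extracts 
whatever the test asserts on from each allocate response:
{code}
// Sketch only: poll allocate heartbeats until all expected responses have
// arrived, instead of assuming which heartbeat will carry them.
// collectResponses() is hypothetical; the 30s deadline is illustrative.
private List<Object> pollForResponses(int expectedCount) throws Exception {
  List<Object> received = new ArrayList<Object>();
  long deadline = System.currentTimeMillis() + 30 * 1000L;
  while (received.size() < expectedCount
      && System.currentTimeMillis() < deadline) {
    AllocateResponse response = amClient.allocate(0.1f);
    received.addAll(collectResponses(response)); // hypothetical extractor
    Thread.sleep(100); // brief pause between allocate heartbeats
  }
  return received;
}
{code}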


> TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently
> -
>
> Key: YARN-6272
> URL: https://issues.apache.org/jira/browse/YARN-6272
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Ray Chiang
>
> I'm seeing this unit test fail fairly often in trunk:
> testAMRMClientWithContainerResourceChange(org.apache.hadoop.yarn.client.api.impl.TestAMRMClient)
>   Time elapsed: 5.113 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<0>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.doContainerResourceChange(TestAMRMClient.java:1087)
> at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.testAMRMClientWithContainerResourceChange(TestAMRMClient.java:963)






[jira] [Commented] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue

2017-04-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974679#comment-15974679
 ] 

Jason Lowe commented on YARN-2113:
--

The more I think about this, I believe it is completely correct to preempt 
containers youngest to oldest until the next container would put us at or below 
the user limit.  Essentially what we're doing is "rewinding" the scheduler 
decisions for this user until the last container that was legitimately 
allocated given the current user limit.  The scheduler always allocates one 
container beyond the user limit since it checks if the user is currently <= the 
limit _before_ it tacks on the new container.  I don't think we should consider 
older containers since the order the user allocated them in (i.e.: oldest to 
youngest) was "legal" given their current user limit.  It's only the containers 
that started beyond the user limit that are "bonus" and are candidates for 
preemption.

So I don't see the need for a configurable deadzone or checking something with 
the minimum allocation.  It looks like we simply kill youngest to oldest until 
killing the next container would put the user <= their limit.
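
A minimal sketch of that policy, assuming hypothetical accessors for the 
user's usage, limit, and age-ordered containers (the {{Resources}} utility 
calls are real; everything else is a stand-in for the scheduler's internal 
bookkeeping):
{code}
// Sketch: preempt youngest to oldest, stopping when the *next* kill would
// put the user at or below the limit.  getCurrentUsage(), getUserLimit(),
// getContainersYoungestFirst(), and markForPreemption() are hypothetical.
Resource usage = user.getCurrentUsage();
Resource limit = user.getUserLimit();
for (RMContainer c : user.getContainersYoungestFirst()) {
  Resource afterKill = Resources.subtract(usage, c.getAllocatedResource());
  if (Resources.lessThanOrEqual(rc, clusterResource, afterKill, limit)) {
    break; // this kill would rewind past the last legitimate allocation
  }
  markForPreemption(c);
  usage = afterKill;
}
{code}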


> Add cross-user preemption within CapacityScheduler's leaf-queue
> ---
>
> Key: YARN-2113
> URL: https://issues.apache.org/jira/browse/YARN-2113
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sunil G
> Attachments: 
> TestNoIntraQueuePreemptionIfBelowUserLimitAndDifferentPrioritiesWithExtraUsers.txt,
>  YARN-2113.0001.patch, YARN-2113.0002.patch, YARN-2113.0003.patch, 
> YARN-2113.0004.patch, YARN-2113.0005.patch, YARN-2113.0006.patch, 
> YARN-2113.0007.patch, YARN-2113.v0.patch
>
>
> Preemption today only works across queues and moves around resources across 
> queues per demand and usage. We should also have user-level preemption within 
> a queue, to balance capacity across users in a predictable manner.






[jira] [Commented] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue

2017-04-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973638#comment-15973638
 ] 

Jason Lowe commented on YARN-2113:
--

Is a deadzone the proper way to fix this?  I'm thinking of a case where the 
user has a particularly large container, larger than the dead zone.  It will 
still flap in this case, correct?  Seems like we should preempt not until we 
fall below the user limit but instead until the _next_ container we would 
preempt would put the user at or below their limit.  The scheduler essentially 
entitles a user to one container beyond the user limit.  If we preempt down to 
a point at or below the user's limit then we've gone one container too far, and 
the scheduler could very well turn around and give the container right back.

Preempting down to one container before we meet or dip below the user limit has 
the advantage that there's not yet another config to set up correctly.  However 
it brings up an interesting scenario where killing the youngest container 
would lower the utilization below the user's limit but killing older, smaller 
containers would not.

> Add cross-user preemption within CapacityScheduler's leaf-queue
> ---
>
> Key: YARN-2113
> URL: https://issues.apache.org/jira/browse/YARN-2113
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sunil G
> Attachments: 
> TestNoIntraQueuePreemptionIfBelowUserLimitAndDifferentPrioritiesWithExtraUsers.txt,
>  YARN-2113.0001.patch, YARN-2113.0002.patch, YARN-2113.0003.patch, 
> YARN-2113.0004.patch, YARN-2113.0005.patch, YARN-2113.0006.patch, 
> YARN-2113.0007.patch, YARN-2113.v0.patch
>
>
> Preemption today only works across queues and moves around resources across 
> queues per demand and usage. We should also have user-level preemption within 
> a queue, to balance capacity across users in a predictable manner.






[jira] [Commented] (YARN-6467) CSQueueMetrics needs to update the current metrics for default partition only

2017-04-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973466#comment-15973466
 ] 

Jason Lowe commented on YARN-6467:
--

bq. I thought of segregating partition based queue metrics in a different jira

I'm totally OK with fixing the queue metrics so they only show the default 
partition in this jira, assuming those metrics aren't doing anything sane today 
in light of multiple partitions.  We can defer adding the partition dimension 
to the queue metrics in a separate JIRA.  I had already assumed that was the 
case based on this JIRA's title.


> CSQueueMetrics needs to update the current metrics for default partition only
> -
>
> Key: YARN-6467
> URL: https://issues.apache.org/jira/browse/YARN-6467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha2
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
> Attachments: YARN-6467.001.patch
>
>
> As a followup to YARN-6195, we need to update existing metrics to only 
> default Partition.






[jira] [Commented] (YARN-5892) Capacity Scheduler: Support user-specific minimum user limit percent

2017-04-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973168#comment-15973168
 ] 

Jason Lowe commented on YARN-5892:
--

bq. Also, weight of users applies to hard limit of user (user limit factor) as 
well. This is a gray area to me, since it may cause some issue of resource 
planning (one more factor apply to maximum resource of user).

I think the weight needs to apply to the user limit factor as well.  
Semantically a user with a weight of 2 should be equivalent to spreading that 
user's load across two "normal" users.  That means a user of weight 2 should 
get twice the normal limit factor, since two users who both hit their ULF means 
twice the ULF load was allocated to the queue.  If we don't apply the weight to 
the ULF as well then the math isn't consistent -- the 2x user isn't exactly 
like having two users sharing a load.
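
A quick arithmetic sketch of that consistency argument (illustrative values 
only):
{code}
// Illustrative check that a weight-2 user stays equivalent to two normal
// users only if the weight is also applied to the user limit factor (ULF).
public class WeightUlfExample {
  public static void main(String[] args) {
    float queueCapacityGB = 100f;
    float ulf = 1.0f;  // user-limit-factor: per-user cap = ulf * capacity
    float weight = 2.0f;

    // Two normal users who both hit their ULF can together consume:
    float twoUsersCapGB = 2 * ulf * queueCapacityGB;          // 200 GB

    // So the weighted user's cap must scale by the weight, or the
    // "weight-2 user == two users" equivalence breaks:
    float weightedCapGB = weight * ulf * queueCapacityGB;     // 200 GB

    System.out.println(twoUsersCapGB == weightedCapGB);       // true
  }
}
{code}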


> Capacity Scheduler: Support user-specific minimum user limit percent
> 
>
> Key: YARN-5892
> URL: https://issues.apache.org/jira/browse/YARN-5892
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: Active users highlighted.jpg, YARN-5892.001.patch, 
> YARN-5892.002.patch, YARN-5892.003.patch, YARN-5892.004.patch, 
> YARN-5892.005.patch, YARN-5892.006.patch, YARN-5892.007.patch, 
> YARN-5892.008.patch, YARN-5892.009.patch
>
>
> Currently, in the capacity scheduler, the {{minimum-user-limit-percent}} 
> property is per queue. A cluster admin should be able to set the minimum user 
> limit percent on a per-user basis within the queue.
> This functionality is needed so that when intra-queue preemption is enabled 
> (YARN-4945 / YARN-2113), some users can be deemed as more important than 
> other users, and resources from VIP users won't be as likely to be preempted.
> For example, if the {{getstuffdone}} queue has a MULP of 25 percent, but user 
> {{jane}} is a power user of queue {{getstuffdone}} and needs to be guaranteed 
> 75 percent, the properties for {{getstuffdone}} and {{jane}} would look like 
> this:
> {code}
>   
> 
> yarn.scheduler.capacity.root.getstuffdone.minimum-user-limit-percent
> 25
>   
>   
> 
> yarn.scheduler.capacity.root.getstuffdone.jane.minimum-user-limit-percent
> 75
>   
> {code}






[jira] [Commented] (YARN-5892) Capacity Scheduler: Support user-specific minimum user limit percent

2017-04-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972821#comment-15972821
 ] 

Jason Lowe commented on YARN-5892:
--

I'm +1 for weight == 0.  As long as it doesn't break the code (e.g.: division 
by zero, etc.) and does something semantically consistent with weights then I 
don't see why we should disallow it.  
A practical use of this could be to essentially "pause" a user in a queue -- it 
won't reject the user's app submissions like changing the queue ACLs would, but 
the user will get very little to no resources until the weight becomes non-zero.

I'm also +1 for having the weight be less than 1.  It looks like it works with 
the patch now, and I worry that the longer we keep support for it out of the 
codebase the more difficult it can become to introduce it later.  People will 
see in the existing code that it cannot be less than 1 and end up assuming 
(explicitly or implicitly) that it never will.


> Capacity Scheduler: Support user-specific minimum user limit percent
> 
>
> Key: YARN-5892
> URL: https://issues.apache.org/jira/browse/YARN-5892
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: Active users highlighted.jpg, YARN-5892.001.patch, 
> YARN-5892.002.patch, YARN-5892.003.patch, YARN-5892.004.patch, 
> YARN-5892.005.patch, YARN-5892.006.patch, YARN-5892.007.patch, 
> YARN-5892.008.patch, YARN-5892.009.patch
>
>
> Currently, in the capacity scheduler, the {{minimum-user-limit-percent}} 
> property is per queue. A cluster admin should be able to set the minimum user 
> limit percent on a per-user basis within the queue.
> This functionality is needed so that when intra-queue preemption is enabled 
> (YARN-4945 / YARN-2113), some users can be deemed as more important than 
> other users, and resources from VIP users won't be as likely to be preempted.
> For example, if the {{getstuffdone}} queue has a MULP of 25 percent, but user 
> {{jane}} is a power user of queue {{getstuffdone}} and needs to be guaranteed 
> 75 percent, the properties for {{getstuffdone}} and {{jane}} would look like 
> this:
> {code}
>   
> 
> yarn.scheduler.capacity.root.getstuffdone.minimum-user-limit-percent
> 25
>   
>   
> 
> yarn.scheduler.capacity.root.getstuffdone.jane.minimum-user-limit-percent
> 75
>   
> {code}






[jira] [Commented] (YARN-6480) Timeout is too aggressive for TestAMRestart.testPreemptedAMRestartOnRMRestart

2017-04-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969614#comment-15969614
 ] 

Jason Lowe commented on YARN-6480:
--

+1 lgtm.  Committing this.

> Timeout is too aggressive for TestAMRestart.testPreemptedAMRestartOnRMRestart
> -
>
> Key: YARN-6480
> URL: https://issues.apache.org/jira/browse/YARN-6480
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
> Attachments: YARN-6480.001.patch
>
>
> The timeout is set to 20 seconds, but the test regularly takes 15 seconds on 
> my machine. Under any load it could time out. 






[jira] [Commented] (YARN-2985) YARN should support to delete the aggregated logs for Non-MapReduce applications

2017-04-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969088#comment-15969088
 ] 

Jason Lowe commented on YARN-2985:
--

Doing a config for branch-2 seems reasonable.

bq. my understanding is that the timeline server is supposed to replace the 
JHS, even for deployments that only run MR jobs

This is news to me.  The timeline server has no UI, just REST APIs, so there 
minimally needs to be something that provides the javascript necessary for the 
client browser to render a MapReduce-aware UI.  I haven't seen that in trunk, 
and without it the MapReduce JHS must still be running if there's going to be a 
MapReduce UI for completed jobs.

Even without the timeline server completely replacing the MR JHS, the deletion 
service can still be moved in trunk without a config under the following 
conditions:
* The timeline server is considered a critical server that always needs to be 
running (or we simply document that it must be used when log aggregation is 
enabled)
* There's an equivalent way to refresh the config options like can be done with 
the deletion service in the MR JHS today.

> YARN should support to delete the aggregated logs for Non-MapReduce 
> applications
> 
>
> Key: YARN-2985
> URL: https://issues.apache.org/jira/browse/YARN-2985
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: log-aggregation, nodemanager
>Affects Versions: 2.8.0
>Reporter: Xu Yang
>Assignee: Steven Rand
> Attachments: YARN-2985-branch-2-001.patch
>
>
> Before Hadoop 2.6, the LogAggregationService was started in the NodeManager, 
> but the AggregatedLogDeletionService was started in MapReduce's 
> JobHistoryServer. Therefore, non-MapReduce applications can aggregate their 
> logs to HDFS but cannot delete those logs. The NodeManager needs to take 
> over the function of aggregated log deletion.






[jira] [Assigned] (YARN-3839) Quit throwing NMNotYetReadyException

2017-04-14 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-3839:


Assignee: Manikandan R

> Quit throwing NMNotYetReadyException
> 
>
> Key: YARN-3839
> URL: https://issues.apache.org/jira/browse/YARN-3839
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Karthik Kambatla
>Assignee: Manikandan R
>
> Quit throwing NMNotYetReadyException when NM has not yet registered with the 
> RM.






[jira] [Updated] (YARN-5617) AMs only intended to run one attempt can be run more than once

2017-04-13 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-5617:
-
Attachment: YARN-5617.003.patch

Updated the patch.

> AMs only intended to run one attempt can be run more than once
> --
>
> Key: YARN-5617
> URL: https://issues.apache.org/jira/browse/YARN-5617
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-5617.001.patch, YARN-5617.002.patch, 
> YARN-5617.003.patch
>
>
> There are times when a user only wants to run an application with one 
> attempt.  Examples would be cases where the second AM attempt is not prepared 
> to handle recovery or will accidentally corrupt state (e.g.: by re-executing 
> something from scratch that should not be).  Prior to YARN-614 setting the 
> max attempts to 1 would guarantee the app ran at most one attempt, but now it 
> can run more than one attempt if the attempts fail due to a fault not 
> attributed to the application.






[jira] [Commented] (YARN-3839) Quit throwing NMNotYetReadyException

2017-04-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968022#comment-15968022
 ] 

Jason Lowe commented on YARN-3839:
--

My understanding is the same.  It looks like the existing cases when we throw 
it will already be covered by the NMToken or ContainerToken so we know whether 
the launch is valid or not.  As Vinod pointed out we still need to keep the 
NMNotYetReadyException class around for compatibility with clients but the NM 
would stop throwing the exception.


> Quit throwing NMNotYetReadyException
> 
>
> Key: YARN-3839
> URL: https://issues.apache.org/jira/browse/YARN-3839
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Karthik Kambatla
>
> Quit throwing NMNotYetReadyException when NM has not yet registered with the 
> RM.






[jira] [Commented] (YARN-2985) YARN should support to delete the aggregated logs for Non-MapReduce applications

2017-04-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967748#comment-15967748
 ] 

Jason Lowe commented on YARN-2985:
--

Based on the description of this JIRA, I think there's some confusion here.  
Aggregated logs are deleted for non-MapReduce applications as long as the 
deletion service is running, whether that deletion service is hosted by the 
MapReduce job history server or somewhere else.  That's why the proposed patch 
is so small -- it's simply reusing the same code the JHS is already running.  
The log deletion service looks at the remote log directory in HDFS.  It doesn't 
filter the list of application logs it finds there based on whether it thinks 
the app is MapReduce or not, rather it just treats them as generic 
applications.  It happens to run in the MapReduce history server, but it is 
_not_ MapReduce-specific.  If users don't want to run MapReduce applications 
but want to do log aggregation then they just need to run the MapReduce history 
server.  They won't use it for MapReduce job history since there are no 
MapReduce jobs, but that server will perform aggregated log retention for *all* 
applications.

Therefore this JIRA is really about adding the ability to relocate the 
aggregated log deletion service from the MapReduce job history server to the 
YARN timeline server.  We don't want two of these things running in the cluster 
if someone has deployed the MapReduce history server and the YARN timeline 
server.  That could lead to error messages in the logs as one of them goes to 
traverse/delete the logs just as the other is already deleting them.  However 
we also don't want to just rip it out of the MapReduce history server and move 
it to the timeline server because the timeline server is still an optional 
server in YARN.

So we either need a way for the user to specify where they want the deletion 
service to run, whether that's the legacy location in the MapReduce history 
server (since they aren't going to run a timeline server which is still an 
optional YARN server) or in the timeline server.  Or we need to just declare 
the timeline server a mandatory server to run (at least for log aggregation 
support) and move it from one to the other.

In addition the MapReduce history server supports dynamic refresh of the log 
deletion service configs, and it would be nice not to lose that ability when it 
is hosted in the timeline server.  That could be a separate JIRA unless we're 
ripping it out of the JHS.  If it can only run in the timeline server then we 
would lose refresh functionality unless that JIRA was completed.

As for unit tests, I agree the existing tests for the deletion service cover 
the correctness of the service itself, so we just need unit tests for the 
timeline server and MapReduce JHS to verify each is starting the deletion 
service or not starting the service based on how the cluster is configured.

> YARN should support to delete the aggregated logs for Non-MapReduce 
> applications
> 
>
> Key: YARN-2985
> URL: https://issues.apache.org/jira/browse/YARN-2985
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: log-aggregation, nodemanager
>Affects Versions: 2.8.0
>Reporter: Xu Yang
>Assignee: Steven Rand
> Attachments: YARN-2985-branch-2-001.patch
>
>
> Before Hadoop 2.6, the LogAggregationService is started in NodeManager. But 
> the AggregatedLogDeletionService is started in mapreduce`s JobHistoryServer. 
> Therefore, the Non-MapReduce application can aggregate their logs to HDFS, 
> but can not delete those logs. Need the NodeManager take over the function of 
> aggregated log deletion.






[jira] [Commented] (YARN-6461) TestRMAdminCLI has very low test timeouts

2017-04-11 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964424#comment-15964424
 ] 

Jason Lowe commented on YARN-6461:
--

+1 lgtm.  Committing this.

> TestRMAdminCLI has very low test timeouts
> -
>
> Key: YARN-6461
> URL: https://issues.apache.org/jira/browse/YARN-6461
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.8.0
>Reporter: Jason Lowe
>Assignee: Eric Badger
> Attachments: YARN-6461.001.patch
>
>
> TestRMAdminCLI has only 500 millisecond timeouts on many of the unit tests.  
> If the test machine or VM is loaded/slow then the tests can report a false 
> positive.
> I'm not sure these tests need explicit timeouts.






[jira] [Commented] (YARN-6195) Export UsedCapacity and AbsoluteUsedCapacity to JMX

2017-04-11 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964375#comment-15964375
 ] 

Jason Lowe commented on YARN-6195:
--

I do not think a separate HADOOP JIRA is necessary here.  Committing this.

> Export UsedCapacity and AbsoluteUsedCapacity to JMX
> ---
>
> Key: YARN-6195
> URL: https://issues.apache.org/jira/browse/YARN-6195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics, yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Benson Qiu
>Assignee: Benson Qiu
> Attachments: YARN-6195.001.patch, YARN-6195.002.patch, 
> YARN-6195.003.patch, YARN-6195.004.patch, YARN-6195.005.patch
>
>
> `usedCapacity` and `absoluteUsedCapacity` are currently not exposed via JMX. 






[jira] [Created] (YARN-6461) TestRMAdminCLI has very low test timeouts

2017-04-10 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6461:


 Summary: TestRMAdminCLI has very low test timeouts
 Key: YARN-6461
 URL: https://issues.apache.org/jira/browse/YARN-6461
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.8.0
Reporter: Jason Lowe


TestRMAdminCLI has only 500 millisecond timeouts on many of the unit tests.  If 
the test machine or VM is loaded/slow then the tests can report a false 
positive.

I'm not sure these tests need explicit timeouts.
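
For context, the fragile pattern looks roughly like this (a sketch; the test 
names and bodies are illustrative):
{code}
import org.junit.Test;

public class TimeoutExample {
  // The problematic pattern: an aggressive hardcoded per-test timeout.
  // On a loaded machine or VM, a healthy run can exceed 500ms and fail.
  @Test(timeout = 500)
  public void testWithAggressiveTimeout() throws Exception {
    // ... exercise the CLI ...
  }

  // Safer: drop the explicit timeout (or raise it substantially) and rely
  // on the build-level test timeout to catch genuine hangs.
  @Test
  public void testWithoutTimeout() throws Exception {
    // ... exercise the CLI ...
  }
}
{code}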






[jira] [Commented] (YARN-6456) Isolation of Docker containers In LinuxContainerExecutor

2017-04-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962801#comment-15962801
 ] 

Jason Lowe commented on YARN-6456:
--

bq. DockerLinuxContainerRuntime mounts containerLocalDirs 
nm-local-dir/usercache/user/appcache/application_1491598755372_0011/ and 
userLocalDirs nm-local-dir/usercache/user/

The application directories are needed so the container can deposit output for 
subsequent tasks to pick up via an auxiliary service (e.g.: maps leaving 
intermediate data so the MapReduce shuffle handler can serve it to reducers).  
In addition the application filecache directory is not sufficient, as it misses 
distributed cache resources that have visibility PRIVATE (instead of 
APPLICATION).


> Isolation of Docker containers In LinuxContainerExecutor
> 
>
> Key: YARN-6456
> URL: https://issues.apache.org/jira/browse/YARN-6456
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Miklos Szegedi
>
> One reason to use Docker containers is to be able to isolate different 
> workloads, even if they run as the same user.
> I have noticed some issues in the current design:
> 1. DockerLinuxContainerRuntime mounts containerLocalDirs 
> {{nm-local-dir/usercache/user/appcache/application_1491598755372_0011/}} and 
> userLocalDirs {{nm-local-dir/usercache/user/}}, so that a container can see 
> and modify the files of another container. I think the application file cache 
> directory should be enough for the container to run in most of the cases.
> 2. The whole cgroups directory is mounted. Would the container directory be 
> enough?
> 3. There is no way to enforce exclusive use of Docker for all containers. 
> There should be an option so that it is the admin, not the user, who 
> requires Docker to be used.






[jira] [Commented] (YARN-6451) Create a monitor to check whether we maintain RM (scheduling) invariants

2017-04-07 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961022#comment-15961022
 ] 

Jason Lowe commented on YARN-6451:
--

Interesting idea.  For some of these invariants, would it make more sense to 
put an assert-like hook in the metric code itself?  I'm thinking why hope that 
a periodic interval happens to catch the metric being negative when we can have 
the metric itself protest when someone tries to set it below zero?  As a bonus, 
we'd have access to the stacktrace that triggered it.

I could see this periodic approach being really useful for more complicated 
expressions like validating stats across users, across queues, etc. where it's 
tricky/expensive to evaluate it on a single metric update.
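
A minimal sketch of the assert-on-update idea ({{NonNegativeGauge}} is a 
hypothetical wrapper, not an existing Hadoop metrics class):
{code}
// Sketch: have the metric protest at the exact update that would take it
// negative, so the offending stacktrace is captured at the source.
class NonNegativeGauge {
  private final String name;
  private long value;

  NonNegativeGauge(String name) {
    this.name = name;
  }

  synchronized void incr(long delta) {
    long next = value + delta;
    if (next < 0) {
      // Could also just log the stacktrace rather than throw.
      throw new IllegalStateException(
          "Invariant violated: metric " + name + " would become " + next);
    }
    value = next;
  }
}
{code}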

> Create a monitor to check whether we maintain RM (scheduling) invariants
> 
>
> Key: YARN-6451
> URL: https://issues.apache.org/jira/browse/YARN-6451
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-6451.v0.patch, YARN-6451.v1.patch
>
>
> For SLS runs, as well as for live test clusters (and maybe prod), it would be 
> useful to have a mechanism to continuously check whether core invariants of 
> the RM/Scheduler are respected (e.g., no priority inversions, fairness mostly 
> respected, certain latencies within expected range, etc..)






[jira] [Commented] (YARN-6195) Export UsedCapacity and AbsoluteUsedCapacity to JMX

2017-04-07 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960991#comment-15960991
 ] 

Jason Lowe commented on YARN-6195:
--

Thanks for updating the patch!  At first I thought we had a potential for an 
NPE in CSQueueUtils since there's this comment at the top:
{code}
  /**
   * Update partitioned resource usage, if nodePartition == null, will update
   * used resource for all partitions of this queue.
   */
  public static void updateUsedCapacity(final ResourceCalculator rc,
{code}

However in practice all the callers translate a null label to the no label enum 
so we're good.  Looks like we'd have NPE problems even before this patch if 
nodePartition really was null, so that's a bad comment unrelated to this patch.
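
For reference, the caller-side normalization looks roughly like this 
({{RMNodeLabelsManager.NO_LABEL}} is the real no-label constant; the 
surrounding code is paraphrased):
{code}
// Callers normalize a null partition to the no-label constant before
// calling updateUsedCapacity, so the null case never actually reaches it.
String partition = (nodePartition == null)
    ? RMNodeLabelsManager.NO_LABEL  // the empty "no label" partition
    : nodePartition;
{code}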

+1 for the latest patch.  I'll commit this early next week if there are no 
objections.

> Export UsedCapacity and AbsoluteUsedCapacity to JMX
> ---
>
> Key: YARN-6195
> URL: https://issues.apache.org/jira/browse/YARN-6195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics, yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Benson Qiu
>Assignee: Benson Qiu
> Attachments: YARN-6195.001.patch, YARN-6195.002.patch, 
> YARN-6195.003.patch, YARN-6195.004.patch, YARN-6195.005.patch
>
>
> `usedCapacity` and `absoluteUsedCapacity` are currently not exposed via JMX. 






[jira] [Commented] (YARN-6443) Allow for Priority order relaxing in favor of improved node/rack locality

2017-04-07 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960755#comment-15960755
 ] 

Jason Lowe commented on YARN-6443:
--

Ah, so this apparently is describing a problem that can only occur if scheduler 
keys are being used?  I'm not sure we need a flag here.  Seems like we simply 
should not guarantee that allocations are returned within a priority group in 
the order they are requested -- they can be returned in any order.  It 
certainly worked that way without scheduler keys.  If you need ordering, that's 
what priorities are for.  In that sense I see this not as an enhancement but 
rather a bugfix.  Or am I misunderstanding the problem?

> Allow for Priority order relaxing in favor of improved node/rack locality 
> --
>
> Key: YARN-6443
> URL: https://issues.apache.org/jira/browse/YARN-6443
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, fairscheduler
>Reporter: Arun Suresh
>Assignee: Hitesh Sharma
>
> Currently the schedulers examine an application's pending requests in 
> priority order. This JIRA proposes to introduce a flag (either via the 
> ApplicationMasterService::registerApplication() or via some scheduler 
> configuration) to favor an ordering that is biased toward the node that is 
> currently heartbeating, by relaxing the priority constraint.






[jira] [Commented] (YARN-6288) Exceptions during aggregated log writes are mishandled

2017-04-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959782#comment-15959782
 ] 

Jason Lowe commented on YARN-6288:
--

+1 for the branch-2.8 patch as well.  The unit test failures are unrelated.

Committing this.

> Exceptions during aggregated log writes are mishandled
> --
>
> Key: YARN-6288
> URL: https://issues.apache.org/jira/browse/YARN-6288
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.8.0, 2.7.3
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Critical
>  Labels: supportability
> Attachments: YARN-6288.01.patch, YARN-6288.02.patch, 
> YARN-6288.03.patch, YARN-6288.04.patch, YARN-6288-branch-2.8-01.patch
>
>
> In AppLogAggregatorImpl.java, if an exception occurs while writing a 
> container log to the remote filesystem, the exception is mishandled and 
> effectively ignored.
> https://github.com/apache/hadoop/blob/f59e36b4ce71d3019ab91b136b6d7646316954e7/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java#L398






[jira] [Updated] (YARN-6288) Exceptions during aggregated log writes are mishandled

2017-04-06 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-6288:
-
Affects Version/s: 2.8.0
   2.7.3
 Priority: Critical  (was: Minor)
  Summary: Exceptions during aggregated log writes are mishandled  
(was: Refactor AppLogAggregatorImpl#uploadLogsForContainers)
 Target Version/s: 2.8.1
  Component/s: log-aggregation
   Issue Type: Bug  (was: Improvement)

Updating the priority per the discussion on YARN-3760

+1 for the trunk patch.  [~ajisakaa] would you mind creating a patch against 
branch-2.8 as well?

> Exceptions during aggregated log writes are mishandled
> --
>
> Key: YARN-6288
> URL: https://issues.apache.org/jira/browse/YARN-6288
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.8.0, 2.7.3
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Critical
>  Labels: supportability
> Attachments: YARN-6288.01.patch, YARN-6288.02.patch, 
> YARN-6288.03.patch, YARN-6288.04.patch
>
>
> In AppLogAggregatorImpl.java, if an exception occurs while writing a 
> container log to the remote filesystem, the exception is mishandled and 
> effectively ignored.
> https://github.com/apache/hadoop/blob/f59e36b4ce71d3019ab91b136b6d7646316954e7/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java#L398






[jira] [Commented] (YARN-6344) Rethinking OFF_SWITCH locality in CapacityScheduler

2017-04-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957849#comment-15957849
 ] 

Jason Lowe commented on YARN-6344:
--

I'd prefer a configured rack locality delay of zero means no additional rack 
delay, but I see that is semantically different than disabling it altogether.  
Specifying a rack locality delay of zero means it will _not_ scale the node 
locality delay based on the request/cluster sizes like it does today, whereas 
setting it to -1 will.  In that sense it's not purely an additional delay.  
Given I don't know the complete backstory on the reasoning behind why it 
behaves the way it does for node locality delay, I can see the desire to leave 
the existing behavior unchanged when this new setting isn't configured.

Patch looks good to me.


> Rethinking OFF_SWITCH locality in CapacityScheduler
> ---
>
> Key: YARN-6344
> URL: https://issues.apache.org/jira/browse/YARN-6344
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Konstantinos Karanasos
>Assignee: Konstantinos Karanasos
> Attachments: YARN-6344.001.patch, YARN-6344.002.patch, 
> YARN-6344.003.patch, YARN-6344.004.patch
>
>
> When relaxing locality from node to rack, the {{node-locality-parameter}} is 
> used: when scheduling opportunities for a scheduler key are more than the 
> value of this parameter, we relax locality and try to assign the container to 
> a node in the corresponding rack.
> On the other hand, when relaxing locality to off-switch (i.e., assign the 
> container anywhere in the cluster), we are using a {{localityWaitFactor}}, 
> which is computed based on the number of outstanding requests for a specific 
> scheduler key, which is divided by the size of the cluster. 
> In case of applications that request containers in big batches (e.g., 
> traditional MR jobs), and for relatively small clusters, the 
> localityWaitFactor does not affect relaxing locality much.
> However, in case of applications that request containers in small batches, 
> this load factor takes a very small value, which leads to assigning 
> off-switch containers too soon. This situation is even more pronounced in big 
> clusters.
> For example, if an application requests only one container per request, the 
> locality will be relaxed after a single missed scheduling opportunity.
> The purpose of this JIRA is to rethink the way we are relaxing locality for 
> off-switch assignments.
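
To make the arithmetic above concrete, a simplified sketch of the factor 
computation (names paraphrased from the scheduler; numbers illustrative):
{code}
public class LocalityWaitFactorExample {
  public static void main(String[] args) {
    int clusterNodes = 1000;

    // localityWaitFactor is roughly min(outstandingRequests/clusterNodes, 1),
    // and off-switch is allowed after clusterNodes * factor missed
    // scheduling opportunities.
    System.out.println(threshold(2000, clusterNodes)); // big batch: 1000.0
    System.out.println(threshold(1, clusterNodes));    // small batch: 1.0
  }

  static float threshold(int outstandingRequests, int clusterNodes) {
    float factor = Math.min((float) outstandingRequests / clusterNodes, 1f);
    return clusterNodes * factor;
  }
}
{code}
With one outstanding request on a 1000-node cluster, the threshold works out 
to a single missed scheduling opportunity, which is exactly the premature 
off-switch relaxation described above.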






[jira] [Updated] (YARN-6450) TestContainerManagerWithLCE requires override for each new test added to ContainerManagerTest

2017-04-05 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-6450:
-
Attachment: YARN-6450.001.patch

Using {{Assume.assumeTrue(shouldRunTest())}} in the existing setup function 
will automatically skip the tests if the LCE hasn't been configured.  Then we 
don't need to revisit this code every time a new unit test is added to 
ContainerManagerTest.

Attaching a patch that implements this approach.
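
A minimal sketch of the approach (JUnit4's {{Assume}} is the real mechanism; 
method names follow the existing tests):
{code}
import org.junit.Assume;
import org.junit.Before;

// In the LCE test subclass: skip every inherited test up front when the
// LCE binary has not been configured, instead of overriding each test.
@Before
@Override
public void setup() throws Exception {
  Assume.assumeTrue("LCE binary path is not passed, skipping test",
      shouldRunTest());
  super.setup();
}
{code}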

> TestContainerManagerWithLCE requires override for each new test added to 
> ContainerManagerTest
> -
>
> Key: YARN-6450
> URL: https://issues.apache.org/jira/browse/YARN-6450
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-6450.001.patch
>
>
> Every test in TestContainerManagerWithLCE looks like this:
> {code}
>   @Override
>   public void testSomething() throws Exception {
> // Don't run the test if the binary is not available.
> if (!shouldRunTest()) {
>   LOG.info("LCE binary path is not passed. Not running the test");
>   return;
> }
> LOG.info("Running something");
> super.testSomething();
>   }
> {code}
> If a new test is added to ContainerManagerTest, then by default 
> TestContainerManagerWithLCE will fail when the LCE has not been configured.  
> This is an unnecessary maintenance burden.






[jira] [Created] (YARN-6450) TestContainerManagerWithLCE requires override for each new test added to ContainerManagerTest

2017-04-05 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6450:


 Summary: TestContainerManagerWithLCE requires override for each 
new test added to ContainerManagerTest
 Key: YARN-6450
 URL: https://issues.apache.org/jira/browse/YARN-6450
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe
Assignee: Jason Lowe


Every test in TestContainerManagerWithLCE looks like this:
{code}
  @Override
  public void testSomething() throws Exception {
// Don't run the test if the binary is not available.
if (!shouldRunTest()) {
  LOG.info("LCE binary path is not passed. Not running the test");
  return;
}
LOG.info("Running something");
super.testSomething();
  }
{code}

If a new test is added to ContainerManagerTest, then by default 
TestContainerManagerWithLCE will fail when the LCE has not been configured.  
This is an unnecessary maintenance burden.







[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit

2017-04-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957290#comment-15957290
 ] 

Jason Lowe commented on YARN-6403:
--

+1 for the latest trunk and 2.8 patches.  Committing this.

> Invalid local resource request can raise NPE and make NM exit
> -
>
> Key: YARN-6403
> URL: https://issues.apache.org/jira/browse/YARN-6403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
> Attachments: YARN-6403.001.patch, YARN-6403.002.patch, 
> YARN-6403.004.patch, YARN-6403.branch-2.8.003.patch, 
> YARN-6403.branch-2.8.004.patch, YARN-6403.branch-2.8.004.patch
>
>
> Recently we found this problem in our testing environment. The app that 
> caused this problem added an invalid local resource request (with no location) 
> into ContainerLaunchContext like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
> LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
> System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
> ContainerLaunchContext.newInstance(localResources, environment,
>   vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app didn't expect that. This 
> mistake caused several NMs to exit with the NPE below; they couldn't restart until 
> the nm recovery dirs were deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> An NPE occurred when creating the LocalResourceRequest instance for the invalid 
> resource request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>   throws URISyntaxException {
> this(resource.getResource().toPath(),  //NPE occurred here
> resource.getTimestamp(),
> resource.getType(),
> resource.getVisibility(),
> resource.getPattern());
>   }
> {code}
> We can't guarantee the validity of local resource requests now, but we could 
> avoid damaging the cluster. Perhaps we can verify the resource both in 
> ContainerLaunchContext and LocalResourceRequest? Please feel free to give 
> your suggestions.






[jira] [Updated] (YARN-6436) TestSchedulingPolicy#testParseSchedulingPolicy timeout is too low

2017-04-05 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-6436:
-
Fix Version/s: 2.8.1

I committed this to branch-2.8 as well.

> TestSchedulingPolicy#testParseSchedulingPolicy timeout is too low
> -
>
> Key: YARN-6436
> URL: https://issues.apache.org/jira/browse/YARN-6436
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Jason Lowe
>Assignee: Eric Badger
> Fix For: 2.9.0, 2.8.1
>
> Attachments: YARN-6436.001.patch, YARN-6436.002.patch
>
>
> The timeout for testParseSchedulingPolicy is only one second.  An I/O hiccup 
> on a VM can make this test fail for the wrong reasons.






[jira] [Commented] (YARN-6443) Allow for Priority order relaxing in favor of improved node/rack locality

2017-04-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956947#comment-15956947
 ] 

Jason Lowe commented on YARN-6443:
--

Could you elaborate a bit on the use case where an application bothers to 
specify different priorities for requests but is OK with having a lower 
priority request be allocated first because locality was better?  Now that 
there are scheduler keys to help match allocations back to their original 
requests, it seems priority really can be a priority rather than a hack to help 
match them.  If the app doesn't really care what order two different kinds of 
requests are given in as long as they have good locality, why not just submit 
them at the same priority instead of having this extra flag?

> Allow for Priority order relaxing in favor of improved node/rack locality 
> --
>
> Key: YARN-6443
> URL: https://issues.apache.org/jira/browse/YARN-6443
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, fairscheduler
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>
> Currently the Schedulers examine an application's pending Requests in Priority 
> order. This JIRA proposes to introduce a flag (either via 
> ApplicationMasterService::registerApplication() or via some Scheduler 
> configuration) to favor an ordering that is biased toward the node that is 
> currently heartbeating, by relaxing the priority constraint.






[jira] [Commented] (YARN-6195) Export UsedCapacity and AbsoluteUsedCapacity to JMX

2017-04-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956922#comment-15956922
 ] 

Jason Lowe commented on YARN-6195:
--

I'm totally OK with just reporting the default partition's stats in these new 
fields if everyone agrees that's a reasonable thing to do until QueueMetrics is 
partitioned.  As a bonus, it's significantly cheaper to compute when there are 
many queues configured.

> Export UsedCapacity and AbsoluteUsedCapacity to JMX
> ---
>
> Key: YARN-6195
> URL: https://issues.apache.org/jira/browse/YARN-6195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics, yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Benson Qiu
>Assignee: Benson Qiu
> Attachments: YARN-6195.001.patch, YARN-6195.002.patch, 
> YARN-6195.003.patch
>
>
> `usedCapacity` and `absoluteUsedCapacity` are currently not exposed via JMX. 






[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit

2017-04-04 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-6403:
-
Attachment: YARN-6403.branch-2.8.004.patch

Thanks for updating the patch!  Looks good to me.  Posting the same branch-2.8 
patch again so Jenkins can comment on it, as it will only comment on one patch 
at a time if many are posted at once.


> Invalid local resource request can raise NPE and make NM exit
> -
>
> Key: YARN-6403
> URL: https://issues.apache.org/jira/browse/YARN-6403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
> Attachments: YARN-6403.001.patch, YARN-6403.002.patch, 
> YARN-6403.004.patch, YARN-6403.branch-2.8.003.patch, 
> YARN-6403.branch-2.8.004.patch, YARN-6403.branch-2.8.004.patch
>
>
> Recently we found this problem in our testing environment. The app that 
> caused this problem added an invalid local resource request (with no location) 
> into ContainerLaunchContext like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
> LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
> System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
> ContainerLaunchContext.newInstance(localResources, environment,
>   vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app didn't expect that. This 
> mistake caused several NMs to exit with the NPE below; they couldn't restart until 
> the nm recovery dirs were deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> An NPE occurred when creating the LocalResourceRequest instance for the invalid 
> resource request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>   throws URISyntaxException {
> this(resource.getResource().toPath(),  //NPE occurred here
> resource.getTimestamp(),
> resource.getType(),
> resource.getVisibility(),
> resource.getPattern());
>   }
> {code}
> We can't guarantee the validity of local resource requests now, but we could 
> avoid damaging the cluster. Perhaps we can verify the resource both in 
> ContainerLaunchContext and LocalResourceRequest? Please feel free to give 
> your suggestions.






[jira] [Commented] (YARN-6436) TestSchedulingPolicy#testParseSchedulingPolicy timeout is too low

2017-04-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1594#comment-1594
 ] 

Jason Lowe commented on YARN-6436:
--

Do we even need a timeout for this test?

> TestSchedulingPolicy#testParseSchedulingPolicy timeout is too low
> -
>
> Key: YARN-6436
> URL: https://issues.apache.org/jira/browse/YARN-6436
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Jason Lowe
>Assignee: Eric Badger
> Attachments: YARN-6436.001.patch
>
>
> The timeout for testParseSchedulingPolicy is only one second.  An I/O hiccup 
> on a VM can make this test fail for the wrong reasons.






[jira] [Commented] (YARN-6195) Export UsedCapacity and AbsoluteUsedCapacity to JMX

2017-04-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1592#comment-1592
 ] 

Jason Lowe commented on YARN-6195:
--

This seems like a reasonable approach to take until the node label dimensions 
are added to the QueueMetrics, but it would be nice to hear the thoughts from 
[~leftnoteasy] or [~sunilg].

I'll commit this by the end of this week if I don't hear any objections.

> Export UsedCapacity and AbsoluteUsedCapacity to JMX
> ---
>
> Key: YARN-6195
> URL: https://issues.apache.org/jira/browse/YARN-6195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics, yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Benson Qiu
>Assignee: Benson Qiu
> Attachments: YARN-6195.001.patch, YARN-6195.002.patch, 
> YARN-6195.003.patch
>
>
> `usedCapacity` and `absoluteUsedCapacity` are currently not exposed via JMX. 






[jira] [Updated] (YARN-6437) TestSignalContainer#testSignalRequestDeliveryToNM fails intermittently

2017-04-04 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-6437:
-
Attachment: YARN-6437.001.patch

Patch that accumulates the received containers across all the allocate calls in 
the test.
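
A hedged sketch of the fix (variable and helper names assumed from the 
description, not the actual test code):
{code}
// Accumulate allocated containers across scheduling heartbeats instead of
// overwriting the list with each allocate response.
List<Container> conts = new ArrayList<>();
while (conts.size() < expectedNumContainers) {
  nm1.nodeHeartbeat(true);
  AllocateResponse resp = am.allocate(
      new ArrayList<ResourceRequest>(), new ArrayList<ContainerId>());
  conts.addAll(resp.getAllocatedContainers());  // was: conts = ...
  Thread.sleep(100);
}
{code}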

> TestSignalContainer#testSignalRequestDeliveryToNM fails intermittently
> --
>
> Key: YARN-6437
> URL: https://issues.apache.org/jira/browse/YARN-6437
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.8.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-6437.001.patch
>
>
> testSignalRequestDeliveryToNM can fail if the containers are returned across 
> multiple scheduling heartbeats.  The loop waiting for all the containers 
> should be accumulating the containers but instead is smashing the same list 
> of containers with whatever the allocate call returns.






[jira] [Created] (YARN-6437) TestSignalContainer#testSignalRequestDeliveryToNM fails intermittently

2017-04-04 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6437:


 Summary: TestSignalContainer#testSignalRequestDeliveryToNM fails 
intermittently
 Key: YARN-6437
 URL: https://issues.apache.org/jira/browse/YARN-6437
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.8.0
Reporter: Jason Lowe
Assignee: Jason Lowe


testSignalRequestDeliveryToNM can fail if the containers are returned across 
multiple scheduling heartbeats.  The loop waiting for all the containers should 
be accumulating the containers but instead is smashing the same list of 
containers with whatever the allocate call returns.






[jira] [Created] (YARN-6436) TestSchedulingPolicy#testParseSchedulingPolicy timeout is too low

2017-04-04 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6436:


 Summary: TestSchedulingPolicy#testParseSchedulingPolicy timeout is 
too low
 Key: YARN-6436
 URL: https://issues.apache.org/jira/browse/YARN-6436
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe


The timeout for testParseSchedulingPolicy is only one second.  An I/O hiccup on 
a VM can make this test fail for the wrong reasons.






[jira] [Commented] (YARN-6406) Garbage Collect unused SchedulerRequestKeys

2017-03-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951542#comment-15951542
 ] 

Jason Lowe commented on YARN-6406:
--

Yep, the refcount was only added because of the possibility of the two types of 
requests.  When there are multiple refs to the key, we can't assume removing 
the last of one type removes all references to the key.  If there is only one 
type that can reference the scheduler key then we don't need to refcount it 
separately.
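
To illustrate the point, a hedged sketch of that refcounting (hypothetical 
names, not the actual AppSchedulingInfo code):
{code}
// Each scheduler key is refcounted across both request types; the key is
// retired only once no request of any type still references it.
int refs = schedulerKeyRefs.merge(schedulerKey, -1, Integer::sum);
if (refs <= 0) {
  schedulerKeyRefs.remove(schedulerKey);
  schedulerKeys.remove(schedulerKey);  // safe: no remaining references
}
{code}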


> Garbage Collect unused SchedulerRequestKeys
> ---
>
> Key: YARN-6406
> URL: https://issues.apache.org/jira/browse/YARN-6406
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>
> YARN-5540 introduced some optimizations to remove satisfied SchedulerKeys 
> from the AppSchedulingInfo. It looks like after YARN-6040, 
> SchedulerRequestKeys are removed only if the Application sends a 
> 0-numContainers request, whereas earlier the outstanding schedulerKeys were 
> also removed as soon as a container was allocated.
> An additional optimization we were hoping to include is to remove the 
> ResourceRequests themselves once numContainers == 0, since we see in our 
> clusters that the RM heap space consumption increases drastically due to a 
> large number of ResourceRequests with 0 num containers.






[jira] [Commented] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue

2017-03-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951538#comment-15951538
 ] 

Jason Lowe commented on YARN-2113:
--

The answer is no to both questions.  Both users are below the 100% MULP so 
nobody should be cross-user preempted.  Note that in your example, U2 will be 
the bully when it comes to getting more user limit since they have the highest 
priority app.  Until that app stops asking, it will have first right of refusal 
for any resources offered to the queue.  The key difference is that we aren't 
shooting resources to help satisfy the request, so U2 will still have to wait 
for resources to be freed voluntarily since no other users in the queue are 
above their MULP.

To be clear, I definitely see why someone may want the proposed behavior.  It 
all comes down to what people think priority means in light of multiple users.  
In some cases, we may just want the priority to reorder the apps but not 
actively shoot things as part of that reordering (i.e.: no shooting below 
MULP).  In other cases I could see users expecting the scheduler to make room 
for the app regardless of what other users are doing with respect to their 
limit (i.e.: shooting even if below MULP).  I was chatting about this with 
[~nroberts] and he mentioned it was in some ways similar to timesharing 
priorities and realtime priorities in Linux.  Timesharing priorities are 
somewhat nice to each other, but realtime ones are not.  One way to have both 
behaviors is to partition the priorities into those two spaces, i.e.: so 
important that we will shoot *anything* lower to make room priorities and 
"regular" priorities where we don't shoot users below the MULP.  I'm not sure 
that's really the way to go, but it is one solution that I believe would allow 
users/admins to choose which behavior they want for priorities vs. user limits.


> Add cross-user preemption within CapacityScheduler's leaf-queue
> ---
>
> Key: YARN-2113
> URL: https://issues.apache.org/jira/browse/YARN-2113
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sunil G
> Attachments: 
> TestNoIntraQueuePreemptionIfBelowUserLimitAndDifferentPrioritiesWithExtraUsers.txt,
>  YARN-2113.0001.patch, YARN-2113.0002.patch, YARN-2113.0003.patch, 
> YARN-2113.0004.patch, YARN-2113.v0.patch
>
>
> Preemption today only works across queues and moves around resources across 
> queues per demand and usage. We should also have user-level preemption within 
> a queue, to balance capacity across users in a predictable manner.






[jira] [Commented] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue

2017-03-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951445#comment-15951445
 ] 

Jason Lowe commented on YARN-2113:
--

bq. once user submit app3 with highest priority, it should get resource from 
other apps if user3's UL doesn't meet.

I do not agree.  The problem here is that the other users _are not above their 
MULP_.  I thought we were all in agreement that users who stay below their MULP 
are not going to be preempted in-queue.  If they are, then how do they protect 
themselves from users who decide to jack up their priorities in arbitrary ways? 
 The only way is if they, too, participate in the priority arms race.

Let's think about how this is going to play out in practice.  The default queue 
config is 100% MULP.  Alice shows up and submits a variety of apps that overall 
consume 30% of the queue.  Then Bob comes along later and submits a single, 
large app that's big enough to saturate the queue.  Bob happens to submit that 
app with a slightly higher priority.  If the proposed behavior is in place, 
Bob's app will decimate *all* of Alice's active apps.  Is that really desired?  
What do admins tell users that are worried about getting their stuff preempted? 
 How does a user tailor their load so they won't be a victim of preemption?  
What guarantee can they rely on?

> Add cross-user preemption within CapacityScheduler's leaf-queue
> ---
>
> Key: YARN-2113
> URL: https://issues.apache.org/jira/browse/YARN-2113
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sunil G
> Attachments: 
> TestNoIntraQueuePreemptionIfBelowUserLimitAndDifferentPrioritiesWithExtraUsers.txt,
>  YARN-2113.0001.patch, YARN-2113.0002.patch, YARN-2113.0003.patch, 
> YARN-2113.0004.patch, YARN-2113.v0.patch
>
>
> Preemption today only works across queues and moves around resources across 
> queues per demand and usage. We should also have user-level preemption within 
> a queue, to balance capacity across users in a predictable manner.






[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit

2017-03-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951182#comment-15951182
 ] 

Jason Lowe commented on YARN-6403:
--

Thanks for updating the patch!

bq. TestApplicationClientProtocolRecords is not exist in branch-2.8, so is it 
ok to place the UT for client-side in 
TestPBImplRecords#testContainerLaunchContextPBImpl?

I'd rather the change appear in the same file so that any subsequent 
modifications to the code can be cherry-picked.  Therefore I agree we need a 
new patch for branch-2.8 so it can add the new 
TestApplicationClientProtocolRecords file.  Alternatively we can go with just 
one patch where it adds a new TestContainerLaunchContextPBImpl file that has 
the test.

Otherwise changes in the 2.8 patch look good.  There will need to be a patch 
for trunk at a minimum.  We'll need a separate one for branch-2.8 if the test 
goes in TestApplicationClientProtocolRecords instead of a new 
TestContainerLaunchContextPBImpl file.  Either works for me.

> Invalid local resource request can raise NPE and make NM exit
> -
>
> Key: YARN-6403
> URL: https://issues.apache.org/jira/browse/YARN-6403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
> Attachments: YARN-6403.001.patch, YARN-6403.002.patch, 
> YARN-6403.branch-2.8.003.patch
>
>
> Recently we found this problem in our testing environment. The app that 
> caused this problem added an invalid local resource request (with no location) 
> into ContainerLaunchContext like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
> LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
> System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
> ContainerLaunchContext.newInstance(localResources, environment,
>   vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app didn't expect that. This 
> mistake caused several NMs to exit with the NPE below; they couldn't restart until 
> the nm recovery dirs were deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> An NPE occurred when creating the LocalResourceRequest instance for the invalid 
> resource request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>   throws URISyntaxException {
> this(resource.getResource().toPath(),  //NPE occurred here
> resource.getTimestamp(),
> resource.getType(),
> resource.getVisibility(),
> resource.getPattern());
>   }
> {code}
> We can't guarantee the validity of local resource requests now, but we could 
> avoid damaging the cluster. Perhaps we can verify the resource both in 
> ContainerLaunchContext and LocalResourceRequest? Please feel free to give 
> your suggestions.





[jira] [Commented] (YARN-6411) Clean up the overwrite of createDispatcher() in subclass of MockRM

2017-03-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951087#comment-15951087
 ] 

Jason Lowe commented on YARN-6411:
--

+1 lgtm.  Committing this.

> Clean up the overwrite of createDispatcher() in subclass of MockRM
> --
>
> Key: YARN-6411
> URL: https://issues.apache.org/jira/browse/YARN-6411
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: resourcemanager
>Affects Versions: 2.9.0, 3.0.0-alpha2
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Minor
> Attachments: YARN-6411.001.patch, YARN-6411.002.patch
>
>
> MockRM creates an object of {{DrainDispatcher}} in YARN-3102. We don't need to 
> do the same thing in its subclasses.
> {code}
>   @Override
>   protected Dispatcher createDispatcher() {
> return new DrainDispatcher();
>   }
> {code}






[jira] [Commented] (YARN-6354) LeveldbRMStateStore can parse invalid keys when recovering reservations

2017-03-30 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949637#comment-15949637
 ] 

Jason Lowe commented on YARN-6354:
--

The TestRMRestart failure is unrelated.

> LeveldbRMStateStore can parse invalid keys when recovering reservations
> ---
>
> Key: YARN-6354
> URL: https://issues.apache.org/jira/browse/YARN-6354
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-6354.001.patch
>
>
> When trying to upgrade an RM to 2.8 it fails with a 
> StringIndexOutOfBoundsException trying to load reservation state.






[jira] [Commented] (YARN-6411) Clean up the overwrite of createDispatcher() in subclass of MockRM

2017-03-30 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949230#comment-15949230
 ] 

Jason Lowe commented on YARN-6411:
--

Thanks for the patch!  Looks good overall.  It would be nice to clean up the 
checkstyle issues: the now-unused Dispatcher/DrainDispatcher imports and the 
eclipsed fields.

> Clean up the overwrite of createDispatcher() in subclass of MockRM
> --
>
> Key: YARN-6411
> URL: https://issues.apache.org/jira/browse/YARN-6411
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: resourcemanager
>Affects Versions: 2.9.0, 3.0.0-alpha2
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Minor
> Attachments: YARN-6411.001.patch
>
>
> MockRM creates an object of {{DrainDispatcher}} in YARN-3102. We don't need to 
> do the same thing in its subclasses.
> {code}
>   @Override
>   protected Dispatcher createDispatcher() {
> return new DrainDispatcher();
>   }
> {code}






[jira] [Assigned] (YARN-6403) Invalid local resource request can raise NPE and make NM exit

2017-03-30 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-6403:


Assignee: Tao Yang
Target Version/s: 2.8.1

Submitting patch so Jenkins can comment on it as well.

bq. For the client-side change, IIUIC the generated protobuf code won't throws 
NPE for this case actually.

The generated protobuf code won't throw NPE for _this_ particular case, but it 
does throw NPE for _other_ fields that you try to set directly to null.  For 
example, if one tried to call setLocalResources(null) on the protobuf (not the 
PBImpl) then the generated protobuf code explicitly throws NPE.  As such, I 
believe it's appropriate to throw NPE in our client check code as well rather 
than a generic RuntimeException.  It's a minor point since the net effect will 
be similar for the client in either case.

The {{localResources != null}} check in checkLocalResources is not necessary 
since the calling code explicitly checks for it already.

The error message should be a bit more specific.  It just logs the local 
resource as a string, but unfortunately that won't log the fact that the 
resource itself is null.  We should change "Got invalid local resource" to 
something like "Null resource URL for local resource".  Throwing NPE instead of 
RuntimeException would at least hint to the user that there's a problem with a 
null field here, but we should also be more explicit in the error message to 
help them along.
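
For illustration, a hedged sketch of what that client-side check could look 
like (method name assumed):
{code}
private void checkLocalResources(Map<String, LocalResource> localResources) {
  for (Map.Entry<String, LocalResource> rsrc : localResources.entrySet()) {
    if (rsrc.getValue() == null || rsrc.getValue().getResource() == null) {
      // NPE with a specific message, mirroring the generated protobuf
      // setters and telling the user exactly which field was null.
      throw new NullPointerException(
          "Null resource URL for local resource " + rsrc.getKey());
    }
  }
}
{code}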

This test code:
{code}
boolean throwsException = false;
try {
[]
  containerLaunchContext.setLocalResources(localResources);
} catch (Throwable e) {
  throwsException = true;
  Assert.assertTrue(e.getMessage().contains("Got invalid local resource"));
}
Assert.assertTrue(throwsException);
{code}
can be simplified to this:
{code}
try {
[]
  containerLaunchContext.setLocalResources(localResources);
  Assert.fail("setting an invalid local resource should be an error!");
} catch (RuntimeException e) {
  Assert.assertTrue(e.getMessage().contains("Got invalid local resource"));
}
{code}
Note that we should be checking for the specific exception type we are throwing 
in the test rather than Throwable, since this is essentially part of the client 
API.

testClientFailureWithInvalidResource does not belong in ContainerManagerImpl 
since it has nothing to do with ContainerManagerImpl.  It's really a test for 
ContainerLaunchContextPBImpl and should be moved to an appropriate place in the 
yarn-common project.  TestApplicationClientProtocolRecords looks like a decent 
place since it already has another test for ContainerLaunchContextPBImpl 
there.  The unit test method should be renamed to something more appropriate 
when moved there, like testCLCPBImplNullResource.


> Invalid local resource request can raise NPE and make NM exit
> -
>
> Key: YARN-6403
> URL: https://issues.apache.org/jira/browse/YARN-6403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
> Attachments: YARN-6403.001.patch, YARN-6403.002.patch
>
>
> Recently we found this problem in our testing environment. The app that 
> caused this problem added an invalid local resource request (with no location) 
> into ContainerLaunchContext like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
> LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
> System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
> ContainerLaunchContext.newInstance(localResources, environment,
>   vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app didn't expect that. This 
> mistake caused several NMs to exit with the NPE below; they couldn't restart until 
> the nm recovery dirs were deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> 

[jira] [Updated] (YARN-6354) LeveldbRMStateStore can parse invalid keys when recovering reservations

2017-03-29 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-6354:
-
Attachment: YARN-6354.001.patch

Patch that adds a termination check for the reservation key traversal loop and 
a unit test.
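
A hedged sketch of the termination check (prefix constant and helper names 
assumed, not the actual patch):
{code}
// Stop iterating once keys no longer carry the reservation prefix rather
// than blindly parsing whatever keys happen to follow in the database.
iter.seek(bytes(RM_RESERVATION_KEY_PREFIX));
while (iter.hasNext()) {
  Entry<byte[], byte[]> entry = iter.next();
  String key = asString(entry.getKey());
  if (!key.startsWith(RM_RESERVATION_KEY_PREFIX)) {
    break;  // walked out of the reservation keyspace
  }
  loadReservationState(key, entry.getValue());  // hypothetical helper
}
{code}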

> LeveldbRMStateStore can parse invalid keys when recovering reservations
> ---
>
> Key: YARN-6354
> URL: https://issues.apache.org/jira/browse/YARN-6354
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Jason Lowe
> Attachments: YARN-6354.001.patch
>
>
> When trying to upgrade an RM to 2.8 it fails with a 
> StringIndexOutOfBoundsException trying to load reservation state.






[jira] [Assigned] (YARN-6354) LeveldbRMStateStore can parse invalid keys when recovering reservations

2017-03-29 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-6354:


Assignee: Jason Lowe

> LeveldbRMStateStore can parse invalid keys when recovering reservations
> ---
>
> Key: YARN-6354
> URL: https://issues.apache.org/jira/browse/YARN-6354
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-6354.001.patch
>
>
> When trying to upgrade an RM to 2.8 it fails with a 
> StringIndexOutOfBoundsException trying to load reservation state.






[jira] [Commented] (YARN-6195) Export UsedCapacity and AbsoluteUsedCapacity to JMX

2017-03-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947730#comment-15947730
 ] 

Jason Lowe commented on YARN-6195:
--

Latest patch lgtm, with the caveat that I don't think we can really support 
used capacity and absolute used capacity in the queue metrics without having 
per-partition queue metrics.  Using a max across partitions seems like a 
reasonable value to report given we're trying to squash multiple values into a 
single field.
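
A hedged sketch of the max-across-partitions squash (method names assumed):
{code}
// Report the maximum used capacity across all partitions as the single
// JMX value, since QueueMetrics is not partitioned yet.
float maxUsed = 0f;
for (String partition : queueCapacities.getExistingNodeLabels()) {
  maxUsed = Math.max(maxUsed, queueCapacities.getUsedCapacity(partition));
}
metrics.setUsedCapacity(maxUsed);
{code}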

[~leftnoteasy] do you have any concerns about using a max-across-partitions 
approach here?  If not then I think this is ready to go.

> Export UsedCapacity and AbsoluteUsedCapacity to JMX
> ---
>
> Key: YARN-6195
> URL: https://issues.apache.org/jira/browse/YARN-6195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics, yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Benson Qiu
>Assignee: Benson Qiu
> Attachments: YARN-6195.001.patch, YARN-6195.002.patch, 
> YARN-6195.003.patch
>
>
> `usedCapacity` and `absoluteUsedCapacity` are currently not exposed via JMX. 






[jira] [Commented] (YARN-6401) terminating signal should be able to specify per application to support graceful-stop

2017-03-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947656#comment-15947656
 ] 

Jason Lowe commented on YARN-6401:
--

Ah, sorry.  I was thinking it was ignoring SIGTERM and thus not cleaning up 
because it would get killed by the subsequent SIGKILL.  Instead it sounds like 
it _is_ responding to SIGTERM but not cleaning up.  Isn't that a bit odd?  The 
whole point of SIGTERM is to request a shutdown of the process rather than 
forcing one.

I'm not an httpd expert, so I started digging into the docs to try to 
understand why it wouldn't do something sane with TERM but does with a 
non-standard signal like WINCH.  Turns out it does handle TERM, but it's 
aggressive such that in-progress requests may be interrupted/canceled.  WINCH 
only advises things to exit, which sounds like active requests could continue 
to be processed but the listen port is no longer monitored so no new requests 
will be processed.

What worries me here is that we can still end up with a disorderly shutdown 
even if YARN sent WINCH instead of TERM. The default delay between the TERM and 
KILL signals is relatively short,  which is why the processing httpd does for 
TERM seems more appropriate here.  If a request could take hundreds of 
milliseconds to process then the KILL is going to arrive too soon after the 
WINCH signal unless the delay between the two signals is widened.  However that 
delay is not a per-app setting, and making it a per-app setting would cause a 
DoS problem.  Containers are often killed because YARN needs the container to 
leave in a timely manner (e.g.: container running beyond limits, preemption, 
etc.).

So I still think this is something better handled by the application framework 
(in this case Slider) rather than YARN.  MapReduce has a similar example.  
MapReduce jobs can be killed via YARN, but it's harsh and things are often lost 
when this occurs.  That's why the {{mapred job -kill}} command first tries to 
kill the job by contacting the AM and requesting it to do an orderly shutdown 
outside of YARN, and only falls back on YARN to terminate the containers if the 
job is unresponsive to the kill request.  I think the same thing applies here.  
If we really want an orderly shutdown to httpd so we won't kill outstanding 
requests (even if they can take a while) then Slider (or some layer on top of 
Slider) should support sending the WINCH signals to the containers for the app 
and then the app can terminate when all containers have completed their 
shutdown.  Then the application can implement an arbitrary, 
application-specific shutdown sequence and timing.  If YARN needs to do the 
killing directly then we cannot wait an arbitrary amount of time for the app to 
clean up and shut down gracefully.

I think YARN will still need some support to send the WINCH signal in either 
case.  Currently containers can be sent signals after YARN-1897, but it's only 
a restricted subset that can be translated cross-platform.  That would need to 
be extended to support more arbitrary signals like WINCH.

> terminating signal should be able to specify per application to support 
> graceful-stop
> -
>
> Key: YARN-6401
> URL: https://issues.apache.org/jira/browse/YARN-6401
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: kyungwan nam
>
> When stopping a container, YARN first sends SIGTERM to the process.
> After a while, it sends SIGKILL if the process is still alive.
> The above process is always the same for any application.
> But for a graceful stop, it sometimes needs to send another signal instead of 
> SIGTERM.
> For instance, if apache httpd is running on slider, SIGWINCH should be sent 
> to stop it gracefully.
> The way to stop gracefully depends on the application.
> It would be good if we could define a terminating signal per application.






[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers

2017-03-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947326#comment-15947326
 ] 

Jason Lowe commented on YARN-6168:
--

This sounds like the RM isn't waiting long enough for all the live NMs to 
report in before reporting the live containers to the app.  Technically it 
would have to wait up to the full NM expiry interval before it could know for 
sure no more containers are going to be reported by late-heartbeating NMs, so 
once fix would be to hold off AM restarts of container-preserving apps after an 
RM restart until the NM expiry interval has passed since restart.  However I 
don't know if apps are willing to wait that long before their AM recovers.  If 
not then there is always going to be the possibility that not all live 
containers are reported when the AM restarts and registers if an NM ends jup 
heartbeating late.




> Restarted RM may not inform AM about all existing containers
> 
>
> Key: YARN-6168
> URL: https://issues.apache.org/jira/browse/YARN-6168
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>
> There appears to be a race condition when an RM is restarted. I had a 
> situation where the RMs and AM were down, but NMs and app containers were 
> still running. When I restarted the RM, the AM restarted, registered with the 
> RM, and received its list of existing containers before the NMs had reported 
> all of their containers to the RM. The AM was only told about some of the 
> app's existing containers.






[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit

2017-03-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947283#comment-15947283
 ] 

Jason Lowe commented on YARN-6403:
--

Sorry, I completely missed the server-side change in ContainerImpl.

I'm not sure that's the correct place to make the server-side change because 
it's happening so late in the container lifecycle.  It would be better if we 
simply failed the container launch request _immediately_ rather than wait until 
the container transitions all the way to the localizing state.  That way the 
client gets immediate feedback that their request was malformed rather than 
wondering why their container launch mysteriously failed sometime later.

I think it's more appropriate to have 
ContainerManagerImpl#startContainerInternal sanity check the request (which it 
already does to some degree, just not for the local resources) and throw a 
YarnException if the request is malformed.  That way the client will receive a 
failed container start response to their start request, so they will 
immediately know their request was bad.
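
For example, a hedged sketch of such a check in startContainerInternal 
(placement and message wording assumed):
{code}
// Reject a malformed launch request up front so the client sees a failed
// start-container response instead of a mysterious failure later on.
Map<String, LocalResource> rsrcs = launchContext.getLocalResources();
if (rsrcs != null) {
  for (Map.Entry<String, LocalResource> rsrc : rsrcs.entrySet()) {
    if (rsrc.getValue() == null || rsrc.getValue().getResource() == null) {
      throw new YarnException("Null resource URL for local resource "
          + rsrc.getKey() + " : " + rsrc.getValue());
    }
  }
}
{code}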

It would be good to add unit tests for both the client and server changes.

> Invalid local resource request can raise NPE and make NM exit
> -
>
> Key: YARN-6403
> URL: https://issues.apache.org/jira/browse/YARN-6403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
> Attachments: YARN-6403.001.patch
>
>
> Recently we found this problem in our testing environment. The app that 
> caused this problem added an invalid local resource request (with no location) 
> into ContainerLaunchContext like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
> LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
> System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
> ContainerLaunchContext.newInstance(localResources, environment,
>   vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app didn't expect that. This 
> mistake caused several NMs to exit with the NPE below; they couldn't restart until 
> the nm recovery dirs were deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> An NPE occurred when creating the LocalResourceRequest instance for the invalid 
> resource request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>   throws URISyntaxException {
> this(resource.getResource().toPath(),  //NPE occurred here
> resource.getTimestamp(),
> resource.getType(),
> resource.getVisibility(),
> resource.getPattern());
>   }
> {code}
> We can't guarantee the validity of local resource requests now, but we could 
> avoid damaging the cluster. Perhaps we can verify the resource both in 
> ContainerLaunchContext and LocalResourceRequest? Please feel free to give 
> your suggestions.




[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit

2017-03-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947251#comment-15947251
 ] 

Jason Lowe commented on YARN-6403:
--

Thanks for the patch!

This patch is changing the client code but not the server code.  A client who 
doesn't have the fix or a malicious client can still construct a malformed 
protobuf that is missing the resource location.  Minimally the server needs to 
validate the request.  The client-side change is nice to have but technically 
not necessary to fix the issue.

Nit: Speaking of the client side change, I think NullPointerException is more 
appropriate to throw in this case.  That's what the generated protobuf code 
already throws when trying to set protobuf fields to null.


> Invalid local resource request can raise NPE and make NM exit
> -
>
> Key: YARN-6403
> URL: https://issues.apache.org/jira/browse/YARN-6403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
> Attachments: YARN-6403.001.patch
>
>
> Recently we found this problem in our testing environment. The app that 
> caused this problem added an invalid local resource request (with no location) 
> into ContainerLaunchContext like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
> LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
> System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
> ContainerLaunchContext.newInstance(localResources, environment,
>   vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app didn't expect that. This 
> mistake caused several NMs to exit with the NPE below; they couldn't restart until 
> the nm recovery dirs were deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The NPE occurred when creating the LocalResourceRequest instance for the 
> invalid resource request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>   throws URISyntaxException {
> this(resource.getResource().toPath(),  //NPE occurred here
> resource.getTimestamp(),
> resource.getType(),
> resource.getVisibility(),
> resource.getPattern());
>   }
> {code}
> We can't guarantee the validity of the local resource request now, but we 
> could avoid damaging the cluster. Perhaps we can verify the resource in both 
> ContainerLaunchContext and LocalResourceRequest? Please feel free to give 
> your suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6406) Garbage Collect unused SchedulerRequestKeys

2017-03-28 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946028#comment-15946028
 ] 

Jason Lowe commented on YARN-6406:
--

I haven't dug into YARN-6040, but in general I'm a big +1 for having the RM 
aggressively remove bookkeeping entries that aren't necessary to improve 
lookup/iteration performance in addition to reducing the heap pressure.  That 
was the whole idea behind YARN-5540.  I don't see why we would need to keep 
scheduler keys or requests around once there are no more containers to allocate 
for them.

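The shape of the cleanup I have in mind (a sketch only; the {{requests}} and {{schedulerKeys}} maps are stand-ins for the real AppSchedulingInfo bookkeeping):

{code}
// On each allocation, decrement the outstanding count and drop the
// entry entirely once nothing is left to allocate for that key.
ResourceRequest req = requests.get(schedulerKey);
req.setNumContainers(req.getNumContainers() - 1);
if (req.getNumContainers() == 0) {
  requests.remove(schedulerKey);      // frees heap
  schedulerKeys.remove(schedulerKey); // shrinks the iteration set
}
{code}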

> Garbage Collect unused SchedulerRequestKeys
> ---
>
> Key: YARN-6406
> URL: https://issues.apache.org/jira/browse/YARN-6406
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>
> YARN-5540 introduced some optimizations to remove satisfied SchedulerKeys 
> from the AppSchedulingInfo. It looks like after YARN-6040, 
> SchedulerRequestKeys are removed only if the application sends a request 
> with 0 numContainers, while earlier the outstanding scheduler keys were 
> also removed as soon as a container was allocated.
> An additional optimization we were hoping to include is to remove the 
> ResourceRequest itself once numContainers == 0, since we see in our 
> clusters that the RM heap space consumption increases drastically due to a 
> large number of ResourceRequests with 0 num containers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit

2017-03-28 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945116#comment-15945116
 ] 

Jason Lowe commented on YARN-6403:
--

The NM should definitely be hardened against malformed data being sent from the 
untrusted clients.  The NM should just fail the container launch if fields 
necessary to perform the launch are missing from the request.

> Invalid local resource request can raise NPE and make NM exit
> -
>
> Key: YARN-6403
> URL: https://issues.apache.org/jira/browse/YARN-6403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>
> Recently we found this problem in our testing environment. The app that 
> caused this problem added an invalid local resource request (with no 
> location) to the ContainerLaunchContext like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
> LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
> System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
> ContainerLaunchContext.newInstance(localResources, environment,
>   vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app did not expect that. 
> This mistake caused several NMs to exit with the NPE below; they could not 
> restart until the NM recovery dirs were deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The NPE occurred when creating the LocalResourceRequest instance for the 
> invalid resource request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>   throws URISyntaxException {
> this(resource.getResource().toPath(),  //NPE occurred here
> resource.getTimestamp(),
> resource.getType(),
> resource.getVisibility(),
> resource.getPattern());
>   }
> {code}
> We can't guarantee the validity of the local resource request now, but we 
> could avoid damaging the cluster. Perhaps we can verify the resource in both 
> ContainerLaunchContext and LocalResourceRequest? Please feel free to give 
> your suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6359) TestRM#testApplicationKillAtAcceptedState fails rarely due to race condition

2017-03-27 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944122#comment-15944122
 ] 

Jason Lowe commented on YARN-6359:
--

+1 lgtm.  Will commit this tomorrow if there are no objections.

> TestRM#testApplicationKillAtAcceptedState fails rarely due to race condition
> 
>
> Key: YARN-6359
> URL: https://issues.apache.org/jira/browse/YARN-6359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-6359.001.patch, YARN-6359.002.patch, 
> YARN-6359.003.patch
>
>
> We've seen (very rarely) a test failure in 
> {{TestRM#testApplicationKillAtAcceptedState}}
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRM.testApplicationKillAtAcceptedState(TestRM.java:645)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2113) Add cross-user preemption within CapacityScheduler's leaf-queue

2017-03-27 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944117#comment-15944117
 ] 

Jason Lowe commented on YARN-2113:
--

I agree with Eric here.  I see priority as a way for a user to change the order 
applications are considered for assigning free resources, but the priority is 
moot if the user is above their limit in the queue.  In practice user limit 
essentially trumps app priority, so I believe the preemption policy should 
behave similarly.  Otherwise users can abuse their limits and perform a form of 
DoS attack on other users by artificially inflating their app priorities.

> Add cross-user preemption within CapacityScheduler's leaf-queue
> ---
>
> Key: YARN-2113
> URL: https://issues.apache.org/jira/browse/YARN-2113
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sunil G
> Attachments: YARN-2113.0001.patch, YARN-2113.0002.patch, 
> YARN-2113.0003.patch, YARN-2113.v0.patch
>
>
> Preemption today only works across queues and moves around resources across 
> queues per demand and usage. We should also have user-level preemption within 
> a queue, to balance capacity across users in a predictable manner.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6401) terminating signal should be able to specify per application to support graceful-stop

2017-03-27 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943307#comment-15943307
 ] 

Jason Lowe commented on YARN-6401:
--

Is this something YARN needs to support directly?  This seems straightforward 
to solve on the application side by wrapping the application launch with a 
front-end shell that traps SIGTERM and translates it into the desired signal 
for the "real" process.

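For example, a wrapper along these lines (the script and the httpd invocation are purely illustrative; this is not an existing YARN feature) would translate the NM's SIGTERM into SIGWINCH:

{code}
#!/bin/bash
# Hypothetical front-end wrapper: trap the NM's SIGTERM and forward the
# graceful-stop signal the real server wants instead.
httpd -DFOREGROUND &
child=$!
trap 'kill -WINCH "$child"' TERM
# wait returns early when the trap fires; wait again for the child to exit
wait "$child"
wait "$child"
{code}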
> terminating signal should be able to specify per application to support 
> graceful-stop
> -
>
> Key: YARN-6401
> URL: https://issues.apache.org/jira/browse/YARN-6401
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: kyungwan nam
>
> When stopping a container, the NM first sends SIGTERM to the process,
> then sends SIGKILL after a while if the process is still alive.
> This procedure is always the same for every application.
> However, to stop gracefully, some applications need a different signal 
> instead of SIGTERM.
> For instance, when Apache httpd is running on Slider, SIGWINCH must be sent 
> to stop it gracefully.
> The way to stop gracefully depends on the application.
> It would be good if we could define the terminating signal per application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6217) TestLocalCacheDirectoryManager test timeout is too aggressive

2017-03-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930513#comment-15930513
 ] 

Jason Lowe commented on YARN-6217:
--

+1 lgtm.  Committing this.

> TestLocalCacheDirectoryManager test timeout is too aggressive
> -
>
> Key: YARN-6217
> URL: https://issues.apache.org/jira/browse/YARN-6217
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Jason Lowe
>Assignee: Miklos Szegedi
> Attachments: YARN-6217.000.patch, YARN-6217.001.patch
>
>
> TestLocalCacheDirectoryManager#testDirectoryStateChangeFromFullToNonFull has 
> only a one second timeout.  If the test machine hits an I/O hiccup it can 
> fail.  The test timeout is too aggressive, and I question whether this test 
> even needs an explicit timeout specified.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6359) TestRM#testApplicationKillAtAcceptedState fails rarely due to race condition

2017-03-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930358#comment-15930358
 ] 

Jason Lowe commented on YARN-6359:
--

Thanks for the report and patch!

The timeout in the loop is 80 seconds, but there's a 60 second timeout for the 
entire test which seems weird.  Is that why the loop doesn't check if the 
timeout occurred after it completes?  It'd be nice to use 
GenericTestUtils#waitFor to have it check for timeouts, do the stacktrace if it 
does timeout, etc.

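A minimal sketch of what that could look like (the polled condition is hypothetical; the point is the bounded, self-reporting wait):

{code}
import java.util.concurrent.TimeoutException;
import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;

// Inside the test: poll every 100ms for up to 80s, and fail with a
// TimeoutException rather than silently falling through on timeout.
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    return rm.getRMContext().getRMApps().size() == 1;  // hypothetical condition
  }
}, 100, 80000);
{code}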
> TestRM#testApplicationKillAtAcceptedState fails rarely due to race condition
> 
>
> Key: YARN-6359
> URL: https://issues.apache.org/jira/browse/YARN-6359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-6359.001.patch
>
>
> We've seen (very rarely) a test failure in 
> {{TestRM#testApplicationKillAtAcceptedState}}
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRM.testApplicationKillAtAcceptedState(TestRM.java:645)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6217) TestLocalCacheDirectoryManager test timeout is too aggressive

2017-03-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930346#comment-15930346
 ] 

Jason Lowe commented on YARN-6217:
--

I tend to agree.  Originally there was an edict to put a timeout on each test 
because the build wasn't doing a good job of handling tests that timed out.  
However that has since been fixed.  An explicit test timeout still makes a lot 
of sense when the test has a good chance of deadlocking when broken (e.g.: 
needing to carefully synchronize a number of threads, wait for barriers, doing 
a polling loop, etc.), but I don't think that's the case with the tests here.

> TestLocalCacheDirectoryManager test timeout is too aggressive
> -
>
> Key: YARN-6217
> URL: https://issues.apache.org/jira/browse/YARN-6217
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Jason Lowe
>Assignee: Miklos Szegedi
> Attachments: YARN-6217.000.patch
>
>
> TestLocalCacheDirectoryManager#testDirectoryStateChangeFromFullToNonFull has 
> only a one second timeout.  If the test machine hits an I/O hiccup it can 
> fail.  The test timeout is too aggressive, and I question whether this test 
> even needs an explicit timeout specified.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files

2017-03-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930205#comment-15930205
 ] 

Jason Lowe commented on YARN-6315:
--

I tried to run this in an end-to-end test and found it doesn't work in 
practice.  I was under the mistaken impression that the size specified in the 
LocalResourceRequest was used to verify the correct file was being localized, 
but that's not the case.  It only uses the _timestamp_ to verify the correct 
version of the file is being downloaded.  The size is ignored.  In my case the 
request actually contained the value -1 for the size, so it always thought the 
size mismatched and would re-localize the file.  That's not good.

I thought we could pivot from the (now untrustworthy) size in the request to 
the size in the LocalizedResource. That's a value the NM computes directly 
during localization, so that will be correct.  However this is the size of the 
entire directory containing the localized resource (whether that's a file, 
archive, or directory), so it includes extra things like the .crc file from 
LocalFileSystem, etc.  In order to match the sizes we'd have to do the same 
logic being done by the localizer which is a DU of the directory.  That's going 
to be too expensive to do for every local resource lookup by a container launch.


> Improve LocalResourcesTrackerImpl#isResourcePresent to return false for 
> corrupted files
> ---
>
> Key: YARN-6315
> URL: https://issues.apache.org/jira/browse/YARN-6315
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3, 2.8.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-6315.001.patch, YARN-6315.002.patch, 
> YARN-6315.003.patch, YARN-6315.004.patch
>
>
> We currently check if a resource is present by making sure that the file 
> exists locally. There can be a case where the LocalizationTracker thinks that 
> it has the resource if the file exists but with size 0 or less than the 
> "expected" size of the LocalResource. This JIRA tracks the change to harden 
> the isResourcePresent call to address that case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6354) LeveldbRMStateStore can parse invalid keys when recovering reservations

2017-03-16 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-6354:
-
Priority: Major  (was: Critical)
 Summary: LeveldbRMStateStore can parse invalid keys when recovering 
reservations  (was: RM fails to upgrade to 2.8 with leveldb state store)

I found another instance where a rolling upgrade to 2.8 with leveldb did work 
successfully, so I dug a bit deeper into why this doesn't always fail.  It 
turns out that normally the reservation state keys happen to be the last keys 
in the database and therefore it works.  If the database happens to have any 
relatively short keys after the reservation keys then it breaks.  My local dev 
database had some short, lowercase keys leftover in it from some prior work, 
and that's how I ran into the issue.

Since it looks like this happens to not be a problem for now with "normal" RM 
leveldb databases I lowered the severity and updated the headline accordingly.

> LeveldbRMStateStore can parse invalid keys when recovering reservations
> ---
>
> Key: YARN-6354
> URL: https://issues.apache.org/jira/browse/YARN-6354
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Jason Lowe
>
> When trying to upgrade an RM to 2.8 it fails with a 
> StringIndexOutOfBoundsException trying to load reservation state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6354) RM fails to upgrade to 2.8 with leveldb state store

2017-03-16 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928843#comment-15928843
 ] 

Jason Lowe commented on YARN-6354:
--

Sample stacktrace:
{noformat}
2017-03-16 15:17:26,616 INFO  [main] service.AbstractService 
(AbstractService.java:noteFailure(272)) - Service ResourceManager failed in 
state STARTED; cause: java.lang.StringIndexOutOfBoundsException: String index 
out of range: -17
java.lang.StringIndexOutOfBoundsException: String index out of range: -17
at java.lang.String.substring(String.java:1931)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore.loadReservationState(LeveldbRMStateStore.java:289)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore.loadState(LeveldbRMStateStore.java:274)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:690)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1097)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1137)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1133)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1133)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1173)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1338)
{noformat}

This was broken by YARN-3736.  The recovery code is seeking to the 
RM_RESERVATION_KEY_PREFIX but failing to verify that the keys it sees in the 
loop actually have that key prefix.  Here's the relevant code:
{code}
  iter = new LeveldbIterator(db);
  iter.seek(bytes(RM_RESERVATION_KEY_PREFIX));
  while (iter.hasNext()) {
Entry<byte[], byte[]> entry = iter.next();
String key = asString(entry.getKey());

String planReservationString =
key.substring(RM_RESERVATION_KEY_PREFIX.length());
String[] parts = planReservationString.split(SEPARATOR);
if (parts.length != 2) {
  LOG.warn("Incorrect reservation state key " + key);
  continue;
}
{code}

The only way to terminate this loop is when the iterator runs out of keys, 
therefore the iteration loop will scan through *all* the keys in the database 
starting at the reservation key to the end.  If any key encountered is too 
short then we'll get the out of bounds exception when we try to do the 
substring.  
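A minimal sketch of the missing guard (not the committed fix): since leveldb iterates keys in sorted order, the first key without the prefix marks the end of the reservation range.

{code}
while (iter.hasNext()) {
  Entry<byte[], byte[]> entry = iter.next();
  String key = asString(entry.getKey());
  if (!key.startsWith(RM_RESERVATION_KEY_PREFIX)) {
    break;  // past the reservation key range; stop scanning
  }
  String planReservationString =
      key.substring(RM_RESERVATION_KEY_PREFIX.length());
  // ... parse parts as before ...
}
{code}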

Pinging [~adhoot] and [~asuresh] who were involved in YARN-3736.

> RM fails to upgrade to 2.8 with leveldb state store
> ---
>
> Key: YARN-6354
> URL: https://issues.apache.org/jira/browse/YARN-6354
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Jason Lowe
>Priority: Critical
>
> When trying to upgrade an RM to 2.8 it fails with a 
> StringIndexOutOfBoundsException trying to load reservation state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6354) RM fails to upgrade to 2.8 with leveldb state store

2017-03-16 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6354:


 Summary: RM fails to upgrade to 2.8 with leveldb state store
 Key: YARN-6354
 URL: https://issues.apache.org/jira/browse/YARN-6354
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.8.0
Reporter: Jason Lowe
Priority: Critical


When trying to upgrade an RM to 2.8 it fails with a 
StringIndexOutOfBoundsException trying to load reservation state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files

2017-03-16 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928292#comment-15928292
 ] 

Jason Lowe commented on YARN-6315:
--

Thanks for updating the patch!

Catching Exception is too wide of a net here, IMHO.  It masks serious issues 
like SecurityException (which is not a normal I/O permission denied type of 
error), NullPointerException, IllegalArgumentException, 
UnsupportedOperationException, etc.  If the operation really is unsupported 
then this is going to think every resource is missing after it localizes it 
which isn't good.  It would be a dist cache that doesn't cache.  We should just 
catch NoSuchFileException and IOException.  In the no such file case we can 
simply log it isn't there, but in the IOException case since we don't really 
know what happened we should log the full exception trace rather than just the 
exception message to give proper context for debug.

Nit: The attributes variable declaration should be as close to the usage as 
possible.  It only needs to be just before the {{try}} block rather than 
outside the {{if}} block.

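Roughly what I have in mind (a sketch only; the method name and the LOG field are assumed, and both failure cases return false like the old exists() call did):

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

private boolean isFilePresentWithSize(String pathStr, long expectedSize) {
  Path path = Paths.get(pathStr);
  try {
    BasicFileAttributes attrs =
        Files.readAttributes(path, BasicFileAttributes.class);
    return attrs.size() == expectedSize;
  } catch (NoSuchFileException e) {
    // Expected case: the resource simply isn't there any more.
    LOG.info("Resource " + path + " is no longer present");
    return false;
  } catch (IOException e) {
    // Unknown failure: log the full trace to give context for debugging.
    LOG.warn("Failed to check resource " + path, e);
    return false;
  }
}
{code}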

> Improve LocalResourcesTrackerImpl#isResourcePresent to return false for 
> corrupted files
> ---
>
> Key: YARN-6315
> URL: https://issues.apache.org/jira/browse/YARN-6315
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3, 2.8.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-6315.001.patch, YARN-6315.002.patch, 
> YARN-6315.003.patch
>
>
> We currently check if a resource is present by making sure that the file 
> exists locally. There can be a case where the LocalizationTracker thinks that 
> it has the resource if the file exists but with size 0 or less than the 
> "expected" size of the LocalResource. This JIRA tracks the change to harden 
> the isResourcePresent call to address that case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6349) Container kill request from AM can be lost if container is still recovering

2017-03-16 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6349:


 Summary: Container kill request from AM can be lost if container 
is still recovering
 Key: YARN-6349
 URL: https://issues.apache.org/jira/browse/YARN-6349
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Jason Lowe


If container recovery takes an excessive amount of time (e.g.: HDFS is slow) 
then the NM could start servicing requests before all containers have 
recovered.  If an AM tries to kill a container while it is still recovering 
then this kill request could be lost.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6349) Container kill request from AM can be lost if container is still recovering

2017-03-16 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928185#comment-15928185
 ] 

Jason Lowe commented on YARN-6349:
--

See YARN-4051 for related discussion.

> Container kill request from AM can be lost if container is still recovering
> ---
>
> Key: YARN-6349
> URL: https://issues.apache.org/jira/browse/YARN-6349
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jason Lowe
>
> If container recovery takes an excessive amount of time (e.g.: HDFS is slow) 
> then the NM could start servicing requests before all containers have 
> recovered.  If an AM tries to kill a container while it is still recovering 
> then this kill request could be lost.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes

2017-03-16 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928139#comment-15928139
 ] 

Jason Lowe commented on YARN-4051:
--

+1 for the branch-2 patch as well.  The unit test failure appears to be 
unrelated, and the test passes for me locally with the patch applied.

Committing this.

> ContainerKillEvent lost when container is still recovering and application 
> finishes
> ---
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch, 
> YARN-4051.08.patch-branch-2
>
>
> As in YARN-4050, the NM event dispatcher is blocked and the container is in 
> the NEW state; when we finish the application, the container stays alive 
> even after the NM event dispatcher is unblocked.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes

2017-03-15 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926276#comment-15926276
 ] 

Jason Lowe commented on YARN-4051:
--

+1 for the latest patch, however it doesn't apply to branch-2.  Could you 
provide a patch for branch-2 as well?

> ContainerKillEvent lost when container is still recovering and application 
> finishes
> ---
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch
>
>
> As in YARN-4050, the NM event dispatcher is blocked and the container is in 
> the NEW state; when we finish the application, the container stays alive 
> even after the NM event dispatcher is unblocked.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files

2017-03-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925152#comment-15925152
 ] 

Jason Lowe commented on YARN-6315:
--

Thanks for the patch!  Looks good overall, just some minor nits:

Extra semicolon on "import java.nio.file.Files;;"

The try/catch block covers more than necessary.  Ideally it would not cover the 
checkLocalResource call.

IOExceptions are treated like the resource is there.  The prior exists call 
that this is replacing would return false if an exception occurred.

I think it might be useful to emit at least an info message (if not warn) 
indicating a resource we thought was there is no longer there, or if it's 
corrupted what the size diffs are.  That could help debugging cases where a 
nodemanager keeps relocalizing when it shouldn't.

static Mockito.when import is not near the other static Mockito imports


> Improve LocalResourcesTrackerImpl#isResourcePresent to return false for 
> corrupted files
> ---
>
> Key: YARN-6315
> URL: https://issues.apache.org/jira/browse/YARN-6315
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3, 2.8.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-6315.001.patch, YARN-6315.002.patch
>
>
> We currently check if a resource is present by making sure that the file 
> exists locally. There can be a case where the LocalizationTracker thinks that 
> it has the resource if the file exists but with size 0 or less than the 
> "expected" size of the LocalResource. This JIRA tracks the change to harden 
> the isResourcePresent call to address that case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6325) ParentQueue and LeafQueue with same name can cause queue name based operations to fail

2017-03-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15924952#comment-15924952
 ] 

Jason Lowe commented on YARN-6325:
--

I don't have the full backstory on queue name requirements, but I agree it 
seems like a bug given the ambiguity in some APIs.  However, since most of the 
user-facing APIs are only used for leaf queues (that's where the apps actually 
run), I can see how the parent/leaf conflict potential was missed.

I do worry that if we suddenly enforce something we didn't before that we would 
break some user's long-standing setup.  Seems like something we should fix for 
3.x going forward, but not sure it's worth the compatibility risk in 2.x.  
Thoughts?

> ParentQueue and LeafQueue with same name can cause queue name based 
> operations to fail
> --
>
> Key: YARN-6325
> URL: https://issues.apache.org/jira/browse/YARN-6325
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Jonathan Hung
> Attachments: capacity-scheduler.xml, Screen Shot 2017-03-13 at 
> 2.28.30 PM.png
>
>
> For example, configure capacity scheduler with two leaf queues: {{root.a.a1}} 
> and {{root.b.a}}, with {{yarn.scheduler.capacity.root.queues}} as {{b,a}} (in 
> that order).
> Then add a mapping e.g. {{u:username:a}} to {{capacity-scheduler.xml}} and 
> call {{refreshQueues}}. Operation fails with {noformat}refreshQueues: 
> java.io.IOException: Failed to re-init queues : mapping contains invalid or 
> non-leaf queue a
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.logAndWrapException(AdminService.java:866)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:391)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
>   at 
> org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2653)
> Caused by: java.io.IOException: Failed to re-init queues : mapping contains 
> invalid or non-leaf queue a
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:404)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:396)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:386)
>   ... 10 more
> Caused by: java.io.IOException: mapping contains invalid or non-leaf queue a
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getUserGroupMappingPlacementRule(CapacityScheduler.java:547)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updatePlacementRules(CapacityScheduler.java:571)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:595)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:400)
>   ... 12 more
> {noformat}
> Part of the issue is that the {{queues}} map in 
> {{CapacitySchedulerQueueManager}} stores queues by queue name. We could do 
> one of a few things:
> # Disallow ParentQueues and LeafQueues to have the same queue name. (this 
> breaks compatibility)
> # Store queues by queue path instead of queue name. But this might require 
> changes in lots of places, e.g. in this case the queue-mappings would have to 
> map to a queue path instead of a queue name (which also breaks compatibility)
> and possibly others.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-3884) App History status not updated when RMContainer transitions from RESERVED to KILLED

2017-03-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15924637#comment-15924637
 ] 

Jason Lowe commented on YARN-3884:
--

If only nodemanagers are reporting then allocations that are never launched 
would also be missed (i.e.: RM hands the AM a bunch of containers, but the AM 
sits on them for a few minutes and releases them without ever launching them).  
App frameworks that perform container reuse will always have this to some 
degree as the allocation races with live containers finishing and getting 
reused, eliminating the need for the allocation that just arrived.

This all comes down to what kind of view we're trying to capture.  If the user 
wants to see what the impact was to the physical nodes then having only the 
nodes report makes sense.  If we need to capture total footprint including 
times when no physical container was running yet then only the RM can report 
that picture.  However I'm not sure the RM needs to report that total picture 
only in terms of individual containers.  It could instead post periodic events 
reporting the aggregate footprint of the app (i.e.: the same kind of metrics added 
by YARN-415).  We can grab the individual stats of containers that actually 
ran, so subtracting that from the aggregate footprint total gets us the 
aggregate "overhead" in terms of reservations and unlaunched container 
allocations.  Since we're reporting on the order of applications rather than 
containers (something I'd expect the RM to be doing anyway for other reasons) 
then this seems like a reasonable load for the RM to bear and still gets us the 
rollup chargeback metrics.  Thoughts?

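In rough terms (no such API exists today; the names below are illustrative):

{code}
// Aggregate "overhead" = RM-side total app footprint (YARN-415-style
// metrics posted periodically by the RM) minus what actually ran
// (summed from NM-reported per-container stats).
long overheadMBSeconds =
    rmAggregateMemoryMBSeconds - sumOfLaunchedContainerMBSeconds;
long overheadVcoreSeconds =
    rmAggregateVcoreSeconds - sumOfLaunchedContainerVcoreSeconds;
{code}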
> App History status not updated when RMContainer transitions from RESERVED to 
> KILLED
> ---
>
> Key: YARN-3884
> URL: https://issues.apache.org/jira/browse/YARN-3884
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: Suse11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>  Labels: oct16-easy
> Attachments: 0001-YARN-3884.patch, Apphistory Container Status.jpg, 
> Elapsed Time.jpg, Test Result-Container status.jpg, YARN-3884.0002.patch, 
> YARN-3884.0003.patch, YARN-3884.0004.patch, YARN-3884.0005.patch, 
> YARN-3884.0006.patch, YARN-3884.0007.patch, YARN-3884.0008.patch
>
>
> Setup
> ===
> 1 NM 3072 16 cores each
> Steps to reproduce
> ===
> 1.Submit apps  to Queue 1 with 512 mb 1 core
> 2.Submit apps  to Queue 2 with 512 mb and 5 core
> lots of containers get reserved and unreserved in this case 
> {code}
> 2015-07-02 20:45:31,169 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e24_1435849994778_0002_01_13 Container Transitioned from NEW to 
> RESERVED
> 2015-07-02 20:45:31,170 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Reserved container  application=application_1435849994778_0002 
> resource= queue=QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=, 
> usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1, 
> numContainers=5 usedCapacity=1.6410257 absoluteUsedCapacity=0.65625 
> used= cluster=
> 2015-07-02 20:45:31,170 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting assigned queue: root.QueueA stats: QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=, 
> usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, 
> numContainers=6
> 2015-07-02 20:45:31,170 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.96875 
> absoluteUsedCapacity=0.96875 used= 
> cluster=
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e24_1435849994778_0001_01_14 Container Transitioned from NEW to 
> ALLOCATED
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
> OPERATION=AM Allocated ContainerTARGET=SchedulerApp 
> RESULT=SUCCESS  APPID=application_1435849994778_0001
> CONTAINERID=container_e24_1435849994778_0001_01_14
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: 
> Assigned container container_e24_1435849994778_0001_01_14 of capacity 
>  on host host-10-19-92-117:64318, which has 6 
> containers,  used and  available 
> after allocation
> 2015-07-02 

[jira] [Updated] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes

2017-03-13 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-4051:
-
Summary: ContainerKillEvent lost when container is still recovering and 
application finishes  (was: ContainerKillEvent is lost when container is  In 
New State and is recovering)

Thanks for updating the patch!

I'm OK with fixing the lost kill-from-AM event in a separate JIRA, but I 
adjusted the headline of this one to avoid confusion.

Should we use NMNotYetReadyException in the case where the AM tries to kill a 
container still recovering?  We already throw it in similar situations where 
the NM isn't ready to handle the request.

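Something like this sketch (illustrative only, not the actual change) at the stop-container entry point would make the AM retry instead of silently losing the kill:

{code}
import org.apache.hadoop.yarn.exceptions.NMNotYetReadyException;

// Hypothetical guard in the NM's stop-container path.
if (container.isRecovering()) {
  throw new NMNotYetReadyException("Container " + containerId
      + " is still recovering, try again later");
}
{code}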
Nits:
- ",because " should be " because "
- ContainerImpl#isRecovering should check recoveredStatus before container 
state since recoveredStatus is the cheaper check and likely to avoid a 
subsequent state check and corresponding lock acquisition.


> ContainerKillEvent lost when container is still recovering and application 
> finishes
> ---
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch
>
>
> As in YARN-4050, the NM event dispatcher is blocked and the container is in 
> the NEW state; when we finish the application, the container stays alive 
> even after the NM event dispatcher is unblocked.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6195) Export UsedCapacity and AbsoluteUsedCapacity to JMX

2017-03-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15922687#comment-15922687
 ] 

Jason Lowe commented on YARN-6195:
--

Pinging [~leftnoteasy] and [~sunilg] to see if there's an opinion on how best 
to accommodate used capacity metrics before the metrics are split along label 
partitions.

> Export UsedCapacity and AbsoluteUsedCapacity to JMX
> ---
>
> Key: YARN-6195
> URL: https://issues.apache.org/jira/browse/YARN-6195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics, yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Benson Qiu
>Assignee: Benson Qiu
> Attachments: YARN-6195.001.patch, YARN-6195.002.patch
>
>
> `usedCapacity` and `absoluteUsedCapacity` are currently not exposed via JMX. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6195) Export UsedCapacity and AbsoluteUsedCapacity to JMX

2017-03-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907541#comment-15907541
 ] 

Jason Lowe commented on YARN-6195:
--

Thanks for updating the patch!

I discovered the queue metrics are doing something somewhat sane for labels 
before this change, see YARN-4484.  We probably should do something similar, 
reporting the maximum used% like available is reporting the maximum available.

The findbugs complaint is valid and needs to be fixed.

Nit: We could make the CSQueueUtils call even simpler by retrieving the minimum 
allocation and queue usage from the queue rather than passing them separately.

> Export UsedCapacity and AbsoluteUsedCapacity to JMX
> ---
>
> Key: YARN-6195
> URL: https://issues.apache.org/jira/browse/YARN-6195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics, yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Benson Qiu
>Assignee: Benson Qiu
> Attachments: YARN-6195.001.patch, YARN-6195.002.patch
>
>
> `usedCapacity` and `absoluteUsedCapacity` are currently not exposed via JMX. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6321) TestResources test timeouts are too aggressive

2017-03-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905561#comment-15905561
 ] 

Jason Lowe commented on YARN-6321:
--

+1 lgtm.  Committing this.

> TestResources test timeouts are too aggressive
> --
>
> Key: YARN-6321
> URL: https://issues.apache.org/jira/browse/YARN-6321
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Jason Lowe
>Assignee: Eric Badger
> Attachments: YARN-6321.001.patch
>
>
> TestResources is using 1 second timeouts which can cause a spurious test 
> failure when the test machine hits an I/O hiccup or other temporary slowness.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6310) OutputStreams in AggregatedLogFormat.LogWriter can be left open upon exceptions

2017-03-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905407#comment-15905407
 ] 

Jason Lowe commented on YARN-6310:
--

+1 lgtm.  Committing this.

> OutputStreams in AggregatedLogFormat.LogWriter can be left open upon 
> exceptions
> ---
>
> Key: YARN-6310
> URL: https://issues.apache.org/jira/browse/YARN-6310
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0-alpha2
>Reporter: Haibo Chen
>Assignee: Haibo Chen
> Attachments: YARN-6310.01.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6321) TestResources test timeouts are too aggressive

2017-03-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905372#comment-15905372
 ] 

Jason Lowe commented on YARN-6321:
--

Sample stacktrace from a timeout:
{noformat}
java.lang.Exception: test timed out after 1000 milliseconds
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:152)
at sun.misc.Resource.getBytes(Resource.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:462)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.getConstructor(Class.java:1825)
at 
org.apache.hadoop.yarn.factories.impl.pb.RecordFactoryPBImpl.newRecordInstance(RecordFactoryPBImpl.java:62)
at org.apache.hadoop.yarn.util.Records.newRecord(Records.java:36)
at 
org.apache.hadoop.yarn.api.records.Resource.newInstance(Resource.java:68)
at 
org.apache.hadoop.yarn.util.resource.TestResources.createResource(TestResources.java:28)
at 
org.apache.hadoop.yarn.util.resource.TestResources.testCompareToWithNoneResource(TestResources.java:43)
{noformat}

Clearly the test is just doing startup-related classpath stuff which happened 
to take a bit longer than normal.

> TestResources test timeouts are too aggressive
> --
>
> Key: YARN-6321
> URL: https://issues.apache.org/jira/browse/YARN-6321
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Jason Lowe
>
> TestResources is using 1 second timeouts which can cause a spurious test 
> failure when the test machine hits an I/O hiccup or other temporary slowness.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6321) TestResources test timeouts are too aggressive

2017-03-10 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6321:


 Summary: TestResources test timeouts are too aggressive
 Key: YARN-6321
 URL: https://issues.apache.org/jira/browse/YARN-6321
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe


TestResources is using 1 second timeouts which can cause a spurious test 
failure when the test machine hits an I/O hiccup or other temporary slowness.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6195) Export UsedCapacity and AbsoluteUsedCapacity to JMX

2017-03-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905293#comment-15905293
 ] 

Jason Lowe commented on YARN-6195:
--

What really needs to happen is a metrics-per-partition similar to how node 
labels were addressed in the scheduler web UI.  In the short term I'd probably 
be OK with the metrics just reflecting the no label case until the per-label 
metrics work is done.  Of course there needs to be TODO comments indicating 
it's not correct when labels are used.

Looking at the patch closer I think the APIs need to be fixed.  Before this 
patch, CSQueue's setUsedCapacity and setAbsoluteUsedCapacity methods were never 
called.  That makes sense, given that the used capacity of a queue is not 
something an external entity should be telling the queue.  The used capacity 
naturally falls out of what the queue is doing internally via container 
allocations and releases, and not because some external entity tells it the 
used capacity should be X%.  I realize it was just a hack to get to the queue's 
metrics, but it's confusing at best.  Given the interfaces aren't used, we 
should rip those out and eliminate that confusion rather than build on it.

Instead of passing the queue to the CSQueueUtils updating method, I think we 
can simply pass the queue metrics instead.  Then that can update both the 
QueueCapacities and the QueueMetrics in the same update.  The metrics will 
still be broken for labels as they are today, but we can add a TODO and file a 
JIRA to fix that going forward.  Actually instead of passing the QueueMetrics 
as an additional parameter, we could pass the CSQueue instead of the 
QueueCapacities argument and retrieve both the capacities and metrics objects 
via trivial accessors on the queue.  That's what other CSQueueUtils methods are 
already doing, e.g.: updateQueueStatistics.

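A rough shape of what I'm suggesting (hypothetical helper, not existing code):

{code}
// Derive both the capacities and the metrics from the queue itself,
// mirroring what CSQueueUtils#updateQueueStatistics already does.
static void updateUsedCapacityMetrics(CSQueue queue, Resource clusterResource) {
  QueueCapacities capacities = queue.getQueueCapacities();
  QueueMetrics metrics = queue.getMetrics();
  // ... compute used% from the queue's usage and the cluster resource,
  // then update capacities and the matching metrics gauges in one place ...
}
{code}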


> Export UsedCapacity and AbsoluteUsedCapacity to JMX
> ---
>
> Key: YARN-6195
> URL: https://issues.apache.org/jira/browse/YARN-6195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics, yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Benson Qiu
>Assignee: Benson Qiu
> Attachments: YARN-6195.001.patch
>
>
> `usedCapacity` and `absoluteUsedCapacity` are currently not exposed via JMX. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6165) Intra-queue preemption occurs even when preemption is turned off for a specific queue.

2017-03-08 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902110#comment-15902110
 ] 

Jason Lowe commented on YARN-6165:
--

+1 lgtm.  The TestRMRestart failure is unrelated and will be fixed by 
YARN-5548.  I'll fixup the whitespace nit during the commit.

> Intra-queue preemption occurs even when preemption is turned off for a 
> specific queue.
> --
>
> Key: YARN-6165
> URL: https://issues.apache.org/jira/browse/YARN-6165
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.0.0-alpha2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-6165.001.patch
>
>
> Intra-queue preemption occurs even when preemption is turned on for the whole 
> cluster ({{yarn.resourcemanager.scheduler.monitor.enable == true}}) but 
> turned off for a specific queue 
> ({{yarn.scheduler.capacity.root.queue1.disable_preemption == true}}).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6165) Intra-queue preemption occurs even when preemption is turned off for a specific queue.

2017-03-08 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901915#comment-15901915
 ] 

Jason Lowe commented on YARN-6165:
--

I'd like to take a quick look, and this needs a Jenkins run anyway.

> Intra-queue preemption occurs even when preemption is turned off for a 
> specific queue.
> --
>
> Key: YARN-6165
> URL: https://issues.apache.org/jira/browse/YARN-6165
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.0.0-alpha2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-6165.001.patch
>
>
> Intra-queue preemption occurs even when preemption is turned on for the whole 
> cluster ({{yarn.resourcemanager.scheduler.monitor.enable == true}}) but 
> turned off for a specific queue 
> ({{yarn.scheduler.capacity.root.queue1.disable_preemption == true}}).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4236) Metric for aggregated resources allocation per queue

2017-03-08 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901600#comment-15901600
 ] 

Jason Lowe commented on YARN-4236:
--

Oops, spoke too soon.  Just before committing I noticed there's a place that was 
missed.  There are two QueueMetrics#allocateResources methods, and the patch 
only increments the new aggregate metrics in one of the cases.  The case that 
was missed is used for the scenario where an existing container is changing its 
allocation size.  Arguably an increase in container size probably should be 
treated like an allocation of the delta for aggregate calculation purposes.  
For a decrease in size, that's sort of like a release of the delta size, and 
aggregate calculations ignore releases.
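
As a sketch of that accounting rule (the field and method names are stand-ins, 
not the real QueueMetrics members):
{code}
// Stand-in for the aggregate counters: a container size increase counts
// as an allocation of the delta, while a decrease, like a plain release,
// leaves the aggregates untouched.
final class AggregateAllocationSketch {
  long aggregateMemoryMBAllocated;
  long aggregateVcoresAllocated;

  void onContainerResize(long oldMemMB, long newMemMB,
                         int oldVcores, int newVcores) {
    if (newMemMB > oldMemMB) {
      aggregateMemoryMBAllocated += newMemMB - oldMemMB;
    }
    if (newVcores > oldVcores) {
      aggregateVcoresAllocated += newVcores - oldVcores;
    }
    // negative deltas ignored: aggregate allocation metrics never decrement
  }
}
{code}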

> Metric for aggregated resources allocation per queue
> 
>
> Key: YARN-4236
> URL: https://issues.apache.org/jira/browse/YARN-4236
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, scheduler
>Reporter: Chang Li
>Assignee: Chang Li
>  Labels: oct16-medium
> Attachments: YARN-4236.2.patch, YARN-4236-3.patch, YARN-4236.patch
>
>
> We currently track allocated memory and allocated vcores per queue but we 
> don't have a good rate metric on how fast we're allocating these things. In 
> other words, a flat line in allocatedmb could equally mean no new containers 
> are being allocated, or that we keep allocating containers and freeing exactly 
> what we allocate each time. Adding a resources-allocated-per-second metric per 
> queue would give us better insight into the rate 
> of resource churn on a queue. Based on this aggregated resource allocation 
> per queue we can easily have some tools to measure the rate of resource 
> allocation per queue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4236) Metric for aggregated resources allocation per queue

2017-03-08 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901575#comment-15901575
 ] 

Jason Lowe commented on YARN-4236:
--

+1 lgtm.  Committing this.

> Metric for aggregated resources allocation per queue
> 
>
> Key: YARN-4236
> URL: https://issues.apache.org/jira/browse/YARN-4236
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, scheduler
>Reporter: Chang Li
>Assignee: Chang Li
>  Labels: oct16-medium
> Attachments: YARN-4236.2.patch, YARN-4236-3.patch, YARN-4236.patch
>
>
> We currently track allocated memory and allocated vcores per queue but we 
> don't have a good rate metric on how fast we're allocating these things. In 
> other words, a flat line in allocatedmb could equally mean no new containers 
> are being allocated, or that we keep allocating containers and freeing exactly 
> what we allocate each time. Adding a resources-allocated-per-second metric per 
> queue would give us better insight into the rate 
> of resource churn on a queue. Based on this aggregated resource allocation 
> per queue we can easily have some tools to measure the rate of resource 
> allocation per queue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2017-03-08 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901553#comment-15901553
 ] 

Jason Lowe commented on YARN-4051:
--

Thanks for updating the patch!  In the future, please don't delete patches and 
re-upload them with the same name.  It can lead to very confusing cases where 
Jenkins comments on a patch that happens to have the same name as one of the 
current attachments but isn't actually the patch that was tested.

The following code won't actually cause it to ignore the FINISH_APPS event.  
The {{continue}} in the for loop is degenerate, so all this does is log 
warnings; otherwise the logic is semantically unchanged:
{code}
for (Container container : app.getContainers().values()) {
  if (container.isRecovering()) {
LOG.warn("drop FINISH_APPS event to " + appID + "because container "
+ container.getContainerId() + "is recovering");
continue;
  }
}
{code}
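
By contrast, something along these lines (a sketch, not a prescription) would 
actually drop the event while any container is still recovering:
{code}
// Sketch: actually skip handling the FINISH_APPS event if any container
// of the app is still recovering, instead of a no-op continue.
boolean anyRecovering = false;
for (Container container : app.getContainers().values()) {
  if (container.isRecovering()) {
    LOG.info("Ignoring FINISH_APPS event for " + appID + " because container "
        + container.getContainerId() + " is recovering");
    anyRecovering = true;
    break;
  }
}
if (anyRecovering) {
  return; // drop the event for now rather than killing mid-recovery
}
{code}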

Also this shouldn't be a warning since it's not actually wrong when this 
happens, correct?  Similarly the warn log when ignoring the FINISH_CONTAINERS 
event seems like that should just be an info log at best.

I'm also wondering about the scenario where the kill event is coming in from an 
AM and not the RM.  If a container is still in the recovering state when we 
open up the client service for new requests it seems a client (e.g.: AM) could 
come in and ask for a still-recovering container to be killed.  I think the 
container process will be orphaned if that occurs, since the NM will mistakenly 
believe the container has not been launched yet.
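
A guard along these lines in the client-facing stop path could close that 
window (the deferred-kill queue here is purely illustrative, not existing NM 
code):
{code}
// Illustrative only: defer client-requested kills for containers that are
// still recovering so the real process is killed once known, not orphaned.
if (container.isRecovering()) {
  pendingKillsAfterRecovery.add(containerId); // hypothetical deferred queue
} else {
  dispatcher.getEventHandler().handle(
      new ContainerKillEvent(containerId, exitCode, diagnostics));
}
{code}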

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, YARN-4051.06.patch
>
>
> As in YARN-4050, the NM event dispatcher is blocked and the container is in 
> the NEW state; when we finish the application, the container stays alive even 
> after the NM event dispatcher is unblocked.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6292) YARN log aggregation doesn't support HDFS/ViewFs namespace other than what is specified in fs.defaultFS

2017-03-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898091#comment-15898091
 ] 

Jason Lowe commented on YARN-6292:
--

What version is involved here?  Is this a duplicate of YARN-3269?

> YARN log aggregation doesn't support HDFS/ViewFs namespace other than what is 
> specified in fs.defaultFS
> ---
>
> Key: YARN-6292
> URL: https://issues.apache.org/jira/browse/YARN-6292
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ang Zhang
>Priority: Minor
>
> 2017-03-02 17:59:13,688 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.IllegalArgumentException: Wrong FS: viewfs://ns-default/tmp/logs, 
> expected: hdfs://nameservice1
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:657)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1215)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1211)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1211)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:194)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:321)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:445)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6195) Export UsedCapacity and AbsoluteUsedCapacity to JMX

2017-03-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898083#comment-15898083
 ] 

Jason Lowe commented on YARN-6195:
--

Thanks for the patch, [~benson.qiu]!

I'm confused about how labels are supposed to interact with the JMX metrics.  
There's only one metric for used capacity in a queue even if there are labels?  
If the used proportion for one label is updated, does it update the same metric 
as for other labels or no label?  The metric will always reflect the last update 
given at the time the metric is examined, meaning there could be wildly 
different results from reading to reading even if things aren't moving that 
much on the cluster in reality, e.g.: one reading gets the "fizgig" node 
label's usage while another gets the default/no label usage.
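
To make the clobbering concern concrete (the setter below is the one this 
patch proposes, used here purely for illustration):
{code}
// One shared gauge across labels means each per-label update overwrites
// the previous one, so a JMX reading only reflects whichever label
// happened to update last.
metrics.setUsedCapacity(usedCapacityFor("fizgig")); // reading A
metrics.setUsedCapacity(usedCapacityFor(""));       // overwrites reading A
{code}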

MetricsRegistry.java: "mutable long float" and "mutable float integer" should 
both be "mutable float"

The tabs flagged by the whitespace check need to be removed.


> Export UsedCapacity and AbsoluteUsedCapacity to JMX
> ---
>
> Key: YARN-6195
> URL: https://issues.apache.org/jira/browse/YARN-6195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, yarn
>Affects Versions: 3.0.0-alpha3
>Reporter: Benson Qiu
>Assignee: Benson Qiu
> Attachments: YARN-6195.001.patch
>
>
> `usedCapacity` and `absoluteUsedCapacity` are currently not available as JMX metrics.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6276) Now container kill mechanism may lead process leak

2017-03-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897373#comment-15897373
 ] 

Jason Lowe commented on YARN-6276:
--

Processes escaping from the session is a known problem.  If that's the "leak" 
being discussed here then this seems like a duplicate of YARN-2904.

> Now container kill mechanism may lead process leak
> --
>
> Key: YARN-6276
> URL: https://issues.apache.org/jira/browse/YARN-6276
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha2
>Reporter: Feng Yuan
>Assignee: Feng Yuan
>
> When the bash process is killed, YarnChild may not respond because of a full GC, 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6274) Documentation refers to incorrect nodemanager health checker interval property

2017-03-03 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-6274:
-
Summary: Documentation refers to incorrect nodemanager health checker 
interval property  (was: One error in the documentation of hadoop 2.7.3)

> Documentation refers to incorrect nodemanager health checker interval property
> --
>
> Key: YARN-6274
> URL: https://issues.apache.org/jira/browse/YARN-6274
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 2.7.3
>Reporter: Charles Zhang
>Assignee: Weiwei Yang
>Priority: Trivial
>  Labels: beginner, documentation, easyfix
> Attachments: YARN-6274.01.patch
>
>
> I think one parameter in the "Monitoring Health of NodeManagers" section of 
> "Cluster Setup" is wrong.  The parameter 
> "yarn.nodemanager.health-checker.script.interval-ms" should be 
> "yarn.nodemanager.health-checker.interval-ms".  See 
> http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/ClusterSetup.html.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6274) One error in the documentation of hadoop 2.7.3

2017-03-03 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-6274:
-
Fix Version/s: (was: 2.7.3)

Thanks for the report, [~rebeyond1218], and for the patch, [~cheersyang]!  

Note that in the future the Fix Version should only be set by the committer 
when the patch is committed, as that field indicates what versions actually 
have the fix.  The Target Version field should be used to indicate which 
versions are being targeted for the fix.

+1 patch lgtm.  Committing this.

> One error in the documentation of hadoop 2.7.3
> --
>
> Key: YARN-6274
> URL: https://issues.apache.org/jira/browse/YARN-6274
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 2.7.3
>Reporter: Charles Zhang
>Assignee: Weiwei Yang
>Priority: Trivial
>  Labels: beginner, documentation, easyfix
> Attachments: YARN-6274.01.patch
>
>
> I think one parameter in the "Monitoring Health of NodeManagers" section of 
> "Cluster Setup" is wrong.  The parameter 
> "yarn.nodemanager.health-checker.script.interval-ms" should be 
> "yarn.nodemanager.health-checker.interval-ms".  See 
> http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/ClusterSetup.html.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6276) Now container kill mechanism may lead process leak

2017-03-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894382#comment-15894382
 ] 

Jason Lowe commented on YARN-6276:
--

When the nodemanager kills a container it first sends a SIGTERM to the session 
followed shortly afterwards by a SIGKILL.  It should not matter what the 
process is doing, since if it ignores the SIGTERM then the subsequent SIGKILL 
will kill it.  Unlike SIGTERM, SIGKILL is not catchable by the receiving 
process.  Could you elaborate a bit more on how GC activity is involved?
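
To make the described sequence concrete, a minimal standalone sketch (this is 
not the actual ContainerExecutor code):
{code}
import java.util.concurrent.TimeUnit;

public class SessionKillSketch {
  // SIGTERM the whole session, wait a grace period, then SIGKILL, which
  // the process cannot catch or ignore.  A negative pid argument makes
  // /bin/kill signal the entire process group.
  public static void killSession(long sessionId, long graceMs)
      throws Exception {
    new ProcessBuilder("kill", "-TERM", "--", "-" + sessionId)
        .inheritIO().start().waitFor();
    TimeUnit.MILLISECONDS.sleep(graceMs);
    new ProcessBuilder("kill", "-KILL", "--", "-" + sessionId)
        .inheritIO().start().waitFor();
  }
}
{code}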

> Now container kill mechanism may lead process leak
> --
>
> Key: YARN-6276
> URL: https://issues.apache.org/jira/browse/YARN-6276
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha2
>Reporter: Feng Yuan
>Assignee: Feng Yuan
>
> When the bash process is killed, YarnChild may not respond because of a full GC, 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6263) NMTokenSecretManagerInRM.createAndGetNMToken is not thread safe

2017-03-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892647#comment-15892647
 ] 

Jason Lowe commented on YARN-6263:
--

+1 lgtm.  The unit test failures do not appear to be related, and the tests 
pass locally for me with the patch applied.  I'll commit this tomorrow if there 
are no objections.

> NMTokenSecretManagerInRM.createAndGetNMToken is not thread safe
> ---
>
> Key: YARN-6263
> URL: https://issues.apache.org/jira/browse/YARN-6263
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0-alpha2
>Reporter: Haibo Chen
>Assignee: Haibo Chen
> Attachments: YARN-6263.01.patch
>
>
> NMTokenSecretManagerInRM.createAndGetNMToken modifies values of a 
> ConcurrentHashMap, which are of type HashTable, but it only acquires the 
> read lock.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3884) App History status not updated when RMContainer transitions from RESERVED to KILLED

2017-02-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879197#comment-15879197
 ] 

Jason Lowe commented on YARN-3884:
--

+1 for only publishing metrics for "real" containers that an application can 
act upon.  I'm not sure what the use-case is for publishing reserved container 
details unless it's for RM scheduler debugging.  Apps can't act upon reserved 
containers since they don't even know they exist.  A scheduler doesn't even 
need to implement reservations with containers, so what would that scheduler 
post if reserved container events are required?

bq.  Btw, ATSv2 do not track these containers by default because container 
metrics are published by NodeManager.

So ATSv2 will not publish any metric for a container that was allocated to an 
app but never launched?

> App History status not updated when RMContainer transitions from RESERVED to 
> KILLED
> ---
>
> Key: YARN-3884
> URL: https://issues.apache.org/jira/browse/YARN-3884
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: Suse11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>  Labels: oct16-easy
> Attachments: 0001-YARN-3884.patch, Apphistory Container Status.jpg, 
> Elapsed Time.jpg, Test Result-Container status.jpg, YARN-3884.0002.patch, 
> YARN-3884.0003.patch, YARN-3884.0004.patch, YARN-3884.0005.patch, 
> YARN-3884.0006.patch, YARN-3884.0007.patch, YARN-3884.0008.patch
>
>
> Setup
> ===
> 1 NM with 3072 MB and 16 cores each
> Steps to reproduce
> ===
> 1. Submit apps to Queue 1 with 512 MB, 1 core
> 2. Submit apps to Queue 2 with 512 MB and 5 cores
> Lots of containers get reserved and unreserved in this case:
> {code}
> 2015-07-02 20:45:31,169 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e24_1435849994778_0002_01_13 Container Transitioned from NEW to 
> RESERVED
> 2015-07-02 20:45:31,170 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Reserved container  application=application_1435849994778_0002 
> resource= queue=QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=, 
> usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1, 
> numContainers=5 usedCapacity=1.6410257 absoluteUsedCapacity=0.65625 
> used= cluster=
> 2015-07-02 20:45:31,170 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting assigned queue: root.QueueA stats: QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=, 
> usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, 
> numContainers=6
> 2015-07-02 20:45:31,170 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.96875 
> absoluteUsedCapacity=0.96875 used= 
> cluster=
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e24_1435849994778_0001_01_14 Container Transitioned from NEW to 
> ALLOCATED
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
> OPERATION=AM Allocated ContainerTARGET=SchedulerApp 
> RESULT=SUCCESS  APPID=application_1435849994778_0001
> CONTAINERID=container_e24_1435849994778_0001_01_14
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: 
> Assigned container container_e24_1435849994778_0001_01_14 of capacity 
>  on host host-10-19-92-117:64318, which has 6 
> containers,  used and  available 
> after allocation
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1435849994778_0001_01 
> container=Container: [ContainerId: 
> container_e24_1435849994778_0001_01_14, NodeId: host-10-19-92-117:64318, 
> NodeHttpAddress: host-10-19-92-117:65321, Resource: , 
> Priority: 20, Token: null, ] queue=default: capacity=0.2, 
> absoluteCapacity=0.2, usedResources=, 
> usedCapacity=2.0846906, absoluteUsedCapacity=0.4166, numApps=1, 
> numContainers=5 clusterResource=
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting assigned queue: root.default stats: default: capacity=0.2, 
> absoluteCapacity=0.2, 

[jira] [Created] (YARN-6217) TestLocalCacheDirectoryManager test timeout is too aggressive

2017-02-22 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6217:


 Summary: TestLocalCacheDirectoryManager test timeout is too 
aggressive
 Key: YARN-6217
 URL: https://issues.apache.org/jira/browse/YARN-6217
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe


TestLocalCacheDirectoryManager#testDirectoryStateChangeFromFullToNonFull has 
only a one-second timeout.  If the test machine hits an I/O hiccup it can fail. 
The test timeout is too aggressive, and I question whether this test even 
needs an explicit timeout specified.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6214) NullPointer Exception while querying timeline server API

2017-02-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878482#comment-15878482
 ] 

Jason Lowe commented on YARN-6214:
--

It's a little difficult to line up the source with that stacktrace.  The report 
says it's happening on 2.7.1, but I could not get the line numbers to match up 
on that release.  My guess at this point is that there is at least one app on 
the cluster that has not set an application type (i.e.: app type is null) and 
therefore this code in WebServices.java is going to NPE when it tries to 
dereference the application type to trim it:
{code}
  if (checkAppTypes &&
      !appTypes.contains(
          StringUtils.toLowerCase(appReport.getApplicationType().trim()))) {
{code}

Looks like there's a missing null check on that.  It would be good to verify 
there are results in the original, non-filtered query that are returning 
"null", empty, or missing  tags for an application which would explain 
why we're hitting the NPE when we go to filter on it.
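
One possible null-guarded shape, as a sketch (whether a typeless app should 
match any filter is a behavior decision for the actual fix):
{code}
String appType = appReport.getApplicationType();
if (checkAppTypes
    && (appType == null
        || !appTypes.contains(StringUtils.toLowerCase(appType.trim())))) {
  continue; // skip apps with no type or a non-matching type
}
{code}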

> NullPointer Exception while querying timeline server API
> 
>
> Key: YARN-6214
> URL: https://issues.apache.org/jira/browse/YARN-6214
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.1
>Reporter: Ravi Teja Chilukuri
>
> The apps API works fine and gives all applications, including MapReduce and Tez
> http://:8188/ws/v1/applicationhistory/apps
> But when queried with application types via these APIs, it fails with a 
> NullPointerException.
> http://:8188/ws/v1/applicationhistory/apps?applicationTypes=TEZ
> http://:8188/ws/v1/applicationhistory/apps?applicationTypes=MAPREDUCE
> NullPointerException: java.lang.NullPointerException
> Blocked on this issue as we are not able to run analytics on the tez job 
> counters on the prod jobs. 
> Timeline Logs:
> |2017-02-22 11:47:57,183 WARN  webapp.GenericExceptionHandler 
> (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApps(WebServices.java:195)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices.getApps(AHSWebServices.java:96)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
> Complete stacktrace:
> http://pastebin.com/bRgxVabf



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6191) CapacityScheduler preemption by container priority can be problematic for MapReduce

2017-02-16 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870131#comment-15870131
 ] 

Jason Lowe commented on YARN-6191:
--

Thanks, Chris!  Having the AM react to the preemption message in the heartbeat 
will definitely help a lot for common cases, even if it doesn't do any 
work-conserving logic and just kills the reducers.

However there's still an issue because the preemption message is too general.  
For example, if the message says "going to preempt 60GB of resources" and the 
AM kills 10 reducers that are 6GB each on 6 different nodes, the RM can still 
kill the maps because the RM needed 60GB of contiguous resources.  Fixing that 
requires the preemption message to be more expressive/specific so the AM knows 
that its actions will indeed prevent the preemption of other containers.
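
Purely as an illustration of "more expressive/specific" (this is a 
hypothetical message shape, not the actual PreemptionMessage API):
{code}
// Hypothetical: telling the AM which node needs a contiguous block lets
// it free the right resources, rather than killing 6GB on six different
// nodes to satisfy a 60GB target.
class ContiguousPreemptionHint {
  NodeId nodeId;          // node where the contiguous allocation is needed
  Resource amountNeeded;  // size of the contiguous block on that node
  long deadlineMs;        // how long before the RM preempts on its own
}
{code}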

I still wonder about the logic of preferring lower container priorities 
regardless of how long they've been running.  I'm not sure container priority 
always translates well to how important a container is to the application, and 
we might be better served by preferring to minimize total lost work regardless 
of container priority.

> CapacityScheduler preemption by container priority can be problematic for 
> MapReduce
> ---
>
> Key: YARN-6191
> URL: https://issues.apache.org/jira/browse/YARN-6191
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Jason Lowe
>
> A MapReduce job with thousands of reducers and just a couple of maps left to 
> go was running in a preemptable queue.  Periodically other queues would get 
> busy and the RM would preempt some resources from the job, but it _always_ 
> picked the job's map tasks first because they use the lowest priority 
> containers.  Even though the reducers had a shorter running time, most were 
> spared but the maps were always shot.  Since the map tasks ran for a longer 
> time than the preemption period, the job was in a perpetual preemption loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6177) Yarn client should exit with an informative error message if an incompatible Jersey library is used at client

2017-02-16 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870096#comment-15870096
 ] 

Jason Lowe commented on YARN-6177:
--

The best effort setting was primarily targeting the scenario where the timeline 
server is down and not wanting jobs to fail as a result.  It wasn't intended to 
mask jar conflicts like this.

Maybe I'm misunderstanding option #1 from above, but we should not be changing 
the best-effort default value.  That's an incompatible and surprising change.  

When someone sets best-effort to true then they desire the timeline server to 
work but don't want transient errors to block the job's progress.  This case is 
not a transient error -- the timeline client is never going to work with that 
jar conflict in place.  Therefore I agree with [~gtCarrera9] that we should not 
mask this error.  If it's common then I think we should check for it and 
provide a useful error message stating the user needs to address the classpath 
conflict or disable the timeline client, but the error should still be fatal.
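
For example, something along these lines (a sketch that assumes the conflict 
manifests as a missing Jersey 1 class; this is not the committed patch):
{code}
// Probe for the Jersey 1 class the timeline client needs and fail fast
// with an actionable message instead of an obscure linkage error later.
try {
  Class.forName("com.sun.jersey.api.client.Client");
} catch (ClassNotFoundException e) {
  throw new YarnRuntimeException(
      "Incompatible Jersey version on the classpath.  Fix the classpath or "
      + "set yarn.timeline-service.enabled=false to disable the timeline "
      + "client.", e);
}
{code}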

> Yarn client should exit with an informative error message if an incompatible 
> Jersey library is used at client
> -
>
> Key: YARN-6177
> URL: https://issues.apache.org/jira/browse/YARN-6177
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha2
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
> Attachments: spark2-job-output-after-besteffort.out, 
> spark2-job-output-after.out, spark2-job-output-before.out, 
> YARN-6177.01.patch, YARN-6177.02.patch, YARN-6177.03.patch, 
> YARN-6177.04.patch, YARN-6177.05.patch
>
>
> Per discussion in YARN-5271, let's provide an error message suggesting the 
> user disable the timeline service, instead of disabling it for them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


