[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-04-02 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393351#comment-14393351
 ] 

zhihai xu commented on YARN-3415:
-

[~sandyr], thanks for the review. The latest patch YARN-3415.002.patch is 
rebased on the latest code base and it passed the Jenkins test. Let me know 
whether you have any more comments on the patch.

 Non-AM containers can be counted towards amResourceUsage of a fairscheduler 
 queue
 -

 Key: YARN-3415
 URL: https://issues.apache.org/jira/browse/YARN-3415
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.6.0
Reporter: Rohit Agarwal
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3415.000.patch, YARN-3415.001.patch, 
 YARN-3415.002.patch


 We encountered this problem while running a spark cluster. The 
 amResourceUsage for a queue became artificially high and then the cluster got 
 deadlocked because the maxAMShare constraint kicked in and no new AM got 
 admitted to the cluster.
 I have described the problem in detail here: 
 https://github.com/apache/spark/pull/5233#issuecomment-87160289
 In summary - the condition for adding the container's memory towards 
 amResourceUsage is fragile. It depends on the number of live containers 
 belonging to the app. We saw that the spark AM went down without explicitly 
 releasing its requested containers and then one of those containers' memory 
 was counted towards amResource.
 cc - [~sandyr]





[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-04-01 Thread Rohit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391738#comment-14391738
 ] 

Rohit Agarwal commented on YARN-3415:
-

+1



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-04-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391972#comment-14391972
 ] 

Hadoop QA commented on YARN-3415:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12708850/YARN-3415.002.patch
  against trunk revision 4d14816.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7196//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7196//console

This message is automatically generated.



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-04-01 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391728#comment-14391728
 ] 

zhihai xu commented on YARN-3415:
-

[~ragarwal], thanks for the review. I uploaded a new patch YARN-3415.002.patch 
which addressed your comment.




[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-04-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391218#comment-14391218
 ] 

Sandy Ryza commented on YARN-3415:
--

+1



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-04-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391624#comment-14391624
 ] 

Sandy Ryza commented on YARN-3415:
--

[~ragarwal] did you have any more comments before I commit this?



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-04-01 Thread Rohit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391674#comment-14391674
 ] 

Rohit Agarwal commented on YARN-3415:
-

It looks good.

I have one minor comment:
{code}
+    // non-AM container should be allocated
+    // check non-AM container allocation is not rejected
+    // due to queue MaxAMShare limitation.
+    assertEquals("Application5's AM should have 1 container",
+        1, app5.getLiveContainers().size());
{code}
The message in the {{assertEquals}} here should be 'Application5 should have 1 
container', because the AM has expired at this point.
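For reference, the assertion with the suggested message would read roughly as 
follows; this is only a sketch of the wording change, the surrounding test code 
is unchanged:
{code}
// The AM container has already expired here, so the single live container
// is the newly allocated non-AM container, not the AM.
assertEquals("Application5 should have 1 container",
    1, app5.getLiveContainers().size());
{code}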



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388262#comment-14388262
 ] 

Hadoop QA commented on YARN-3415:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12708381/YARN-3415.001.patch
  against trunk revision b5a22e9.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7167//console

This message is automatically generated.



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-31 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388285#comment-14388285
 ] 

zhihai xu commented on YARN-3415:
-

[~sandyr], that is a very good idea to move the call to setAMResource that's 
currently in FairScheduler next to the call to getQueue().addAMResourceUsage().
The new patch YARN-3415.001.patch addresses this and also addresses your first 
two comments.

[~ragarwal], thanks for the review.
First, I want to clarify that the AM resource usage is not changed when the AM 
container completes; it only changes when the application attempt is removed 
from the scheduler, which calls FSLeafQueue#removeApp.
So currently the "Check that AM resource usage becomes 0" step is done after all 
application attempts are removed:
{code}
assertEquals("Queue1's AM resource usage should be 0",
    0, queue1.getAmResourceUsage().getMemory());
{code}

bq. Add a non-AM container to app5. Handle the nodeUpdate event - check that 
the number of live containers is 2.
The old code already had this test for app1; that test passes even without the 
patch:
{code}
// Still can run non-AM container
createSchedulingRequestExistingApplication(1024, 1, attId1);
scheduler.update();
scheduler.handle(updateEvent);
assertEquals("Application1 should have two running containers",
    2, app1.getLiveContainers().size());
{code}

I think the scenario you describe arises when the non-AM container allocation is 
delayed until after the AM container has finished, which leaves 0 live 
containers. My test simulates completing the AM container before the non-AM 
container is allocated; the old code then increases the AM resource usage when 
the non-AM container is allocated. So without the patch, the test fails.



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388387#comment-14388387
 ] 

Hadoop QA commented on YARN-3415:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12708400/YARN-3415.001.patch
  against trunk revision b5a22e9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7170//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7170//console

This message is automatically generated.



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-30 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387425#comment-14387425
 ] 

Sandy Ryza commented on YARN-3415:
--

This looks mostly reasonable.  A few comments:
* In FSAppAttempt, can we change the "If this container is used to run AM" 
comment to "If not running unmanaged, the first container we allocate is always 
the AM. Update the leaf queue's AM usage"?
* The four lines of comment in FSLeafQueue could be reduced to "If isAMRunning 
is true, we're not running an unmanaged AM."
* Would it make sense to move the call to setAMResource that's currently in 
FairScheduler next to the call to getQueue().addAMResourceUsage(), so that the 
queue and attempt resource usage get updated at the same time? (A rough sketch 
of this idea follows below.)
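A minimal sketch of what such a grouped update in FSAppAttempt could look like, 
using only the accessors already mentioned in this thread (setAMResource, 
addAMResourceUsage, isAmRunning, getUnmanagedAM); it is illustrative only, not 
the committed patch:
{code}
// Sketch: when the first (AM) container of a managed application is
// allocated, record the AM size on the attempt and add it to the leaf
// queue's AM usage in the same step, so the two never drift apart.
if (!isAmRunning() && !getUnmanagedAM()) {
  setAMResource(container.getResource());
  getQueue().addAMResourceUsage(container.getResource());
  setAmRunning(true);
}
{code}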




[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-30 Thread Rohit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387763#comment-14387763
 ] 

Rohit Agarwal commented on YARN-3415:
-

I don't understand the newly added tests:
{code}
+    // request non-AM container for app5
+    createSchedulingRequestExistingApplication(1024, 1, attId5);
+    assertEquals("Application5's AM should have 1 container",
+        1, app5.getLiveContainers().size());
+    // complete AM container before non-AM container is allocated.
+    // spark application hit this situation.
+    RMContainer amContainer5 =
+        (RMContainer) app5.getLiveContainers().toArray()[0];
+    ContainerExpiredSchedulerEvent containerExpired =
+        new ContainerExpiredSchedulerEvent(amContainer5.getContainerId());
+    scheduler.handle(containerExpired);
+    assertEquals("Application5's AM should have 0 container",
+        0, app5.getLiveContainers().size());
+    assertEquals("Queue1's AM resource usage should be 2048 MB memory",
+        2048, queue1.getAmResourceUsage().getMemory());
+    scheduler.update();
+    scheduler.handle(updateEvent);
+    // non-AM container should be allocated
+    // check non-AM container allocation is not rejected
+    // due to queue MaxAMShare limitation.
+    assertEquals("Application5's AM should have 1 container",
+        1, app5.getLiveContainers().size());
+    // check non-AM container allocation won't affect queue AmResourceUsage
+    assertEquals("Queue1's AM resource usage should be 2048 MB memory",
+        2048, queue1.getAmResourceUsage().getMemory());
{code}
Just before this block, I can see that the AM for app5 is already running and 
is taking 2048 MB.
So, in my opinion, the tests should be like:
- Add a non-AM container to app5. Handle the nodeUpdate event - check that the 
number of live containers is 2.
- Kill the AM container. Handle the events. Check that the number of live 
containers becomes 1. Check that AM resource usage becomes 0.



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386042#comment-14386042
 ] 

Hadoop QA commented on YARN-3415:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12708063/YARN-3415.000.patch
  against trunk revision 3d9132d.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.server.resourcemanager.TestRMHA
  org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication
  org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
  org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7143//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7143//console

This message is automatically generated.



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-29 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386048#comment-14386048
 ] 

zhihai xu commented on YARN-3415:
-

The test failures are due to HADOOP-11754:
{code}
org.mortbay.jetty.webapp.WebAppContext@63eaebdd{/,jar:file:/home/jenkins/.m2/repository/org/apache/hadoop/hadoop-yarn-common/3.0.0-SNAPSHOT/hadoop-yarn-common-3.0.0-SNAPSHOT.jar!/webapps/cluster}
javax.servlet.ServletException: java.lang.RuntimeException: Could not read 
signature secret file: /home/jenkins/hadoop-http-auth-signature-secret
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.initializeSecretProvider(AuthenticationFilter.java:266)
{code}



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-29 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385672#comment-14385672
 ] 

zhihai xu commented on YARN-3415:
-

[~ragarwal], thanks for the comment.
bq. 1. If the above approach is valid - why do we need the getLiveContainers() 
check at all?
Totally agree. If we check !isAmRunning(), the getLiveContainers() check is 
redundant.

bq. 2. I don't see any place where we are setting amRunning to false once it is 
set to true. Should we do that for completeness?
We don't need to set it back to false, because each FSAppAttempt has only one 
AM; once the FSAppAttempt is removed, it will be garbage collected.

bq. 3. Why is there no getUnmanagedAM() check in removeApp where we are 
subtracting from amResourceUsage. I think the conditions for adding and 
subtracting amResourceUsage should be similar as much as possible.
Totally agree, it would be better to check getUnmanagedAM() there for 
readability (a hedged sketch of such a symmetric condition follows the code 
below).
Currently it works because we check getUnmanagedAM() when we call setAMResource 
in FairScheduler#allocate, so if getUnmanagedAM() is true, app.getAMResource() 
will return Resources.none().
We can also remove the app.getAMResource() != null check, because the 
following code guarantees it never returns null:
{code}
  private Resource _get(String label, ResourceType type) {
    try {
      readLock.lock();
      UsageByLabel usage = usages.get(label);
      if (null == usage) {
        return Resources.none();
      }
      return normalize(usage.resArr[type.idx]);
    } finally {
      readLock.unlock();
    }
  }
{code}
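
For illustration of point 3, a hedged sketch of what a symmetric condition in 
FSLeafQueue#removeApp might look like; the method body and field names here are 
assumptions, not the actual FSLeafQueue code:
{code}
// Sketch (assumed structure): subtract the queue's AM usage under a
// condition that mirrors the one used when addAMResourceUsage() was
// called for the AM container.
public boolean removeApp(FSAppAttempt app) {
  boolean wasRunnable = runnableApps.remove(app);
  if (!wasRunnable) {
    nonRunnableApps.remove(app);
  }
  if (app.isAmRunning() && !app.getUnmanagedAM()) {
    Resources.subtractFrom(amResourceUsage, app.getAMResource());
  }
  return wasRunnable;
}
{code}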

About my previous comment:
bq. It looks like we should also check isAmRunning at FairScheduler#allocate
Checking isAmRunning in FairScheduler#allocate is not necessary, because apart 
from the AM container, all other containers of an FSAppAttempt are allocated 
through the AM's allocate calls; once the AM container has finished, 
FairScheduler#allocate will not be called again for that attempt.

I will upload a patch with a test case for this issue.



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-29 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386003#comment-14386003
 ] 

zhihai xu commented on YARN-3415:
-

I uploaded a patch, YARN-3415.000.patch, for review.
The patch fixes two bugs and makes 4 minor code optimizations.
Bugs fixed:
1. Check whether the AM is already running before calling addAMResourceUsage. 
We should only call addAMResourceUsage when the AM is not running yet. Without 
this fix, the test fails because the queue's AmResourceUsage is changed by a 
non-AM container.
2. Don't apply the queue MaxAMShare limitation to non-AM containers. Without 
this fix, the test fails because the non-AM container allocation is rejected 
due to the MaxAMShare limitation. (A hedged sketch of the intent of these two 
fixes is included after this list.)

Code optimizations:
1. Remove the redundant getLiveContainers().size() check when calling 
addAMResourceUsage in FSAppAttempt.
2. Remove the redundant getLiveContainers().size() check when checking the 
queue MaxAMShare (canRunAppAM) in FSAppAttempt.
3. Remove the redundant app.getAMResource() check in FSLeafQueue#removeApp. I 
didn't check app.getUnmanagedAM() here; instead I added the comment "AmRunning 
is set to true only when getUnmanagedAM() is false." But checking 
app.getUnmanagedAM() is also ok for me.
4. Check application.isAmRunning() instead of 
application.getLiveContainers().isEmpty() in FairScheduler#allocate, because 
application.getLiveContainers() consumes much more CPU than 
application.isAmRunning(), and FairScheduler#allocate is called very frequently.
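
A hedged sketch of the intent of the two bug fixes in FSAppAttempt, pieced 
together from the snippets quoted elsewhere in this thread; it is illustrative 
only, not the literal patch:
{code}
// Fix 2 (sketch): only the first (AM) container of a managed application
// is subject to the queue's maxAMShare limit, so gate the check on
// isAmRunning() instead of the live-container count.
if (!isAmRunning() && !getUnmanagedAM()) {
  if (!getQueue().canRunAppAM(getAMResource())) {
    return Resources.none();
  }
}

// Fix 1 (sketch): only the AM container is counted towards the queue's
// amResourceUsage; later non-AM containers leave it unchanged.
if (!isAmRunning() && !getUnmanagedAM()) {
  getQueue().addAMResourceUsage(container.getResource());
  setAmRunning(true);
}
{code}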




[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385347#comment-14385347
 ] 

Sandy Ryza commented on YARN-3415:
--

Thanks for filing this [~ragarwal] and for taking this up [~zxu].  This seems 
like a fairly serious issue.



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-28 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385192#comment-14385192
 ] 

zhihai xu commented on YARN-3415:
-

It looks like we should also check isAmRunning at FairScheduler#allocate 
{code}
if (!application.getUnmanagedAM() && ask.size() == 1
    && application.getLiveContainers().isEmpty()) {
  application.setAMResource(ask.get(0).getCapability());
}
{code}
and FSAppAttempt#assignContainer
{code}
if (getLiveContainers().size() == 0 && !getUnmanagedAM()) {
  if (!getQueue().canRunAppAM(getAMResource())) {
    return Resources.none();
  }
}
{code}



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-28 Thread Rohit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385202#comment-14385202
 ] 

Rohit Agarwal commented on YARN-3415:
-

 if (!isAmRunning() && getLiveContainers().size() == 1 && !getUnmanagedAM()) {

A few points:
# If the above approach is valid - why do we need the {{getLiveContainers()}} 
check at all?
# I don't see any place where we are setting {{amRunning}} to {{false}} once it 
is set to {{true}}. Should we do that for completeness?
# Why is there no {{getUnmanagedAM()}} check in {{removeApp}}, where we are 
subtracting from {{amResourceUsage}}? I think the conditions for adding and 
subtracting {{amResourceUsage}} should be kept as similar as possible.



[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue

2015-03-28 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385185#comment-14385185
 ] 

zhihai xu commented on YARN-3415:
-

I can work on this issue. I read the problem description at 
https://github.com/apache/spark/pull/5233#issuecomment-87160289.
It looks like the issue can be fixed by checking whether the AM is already 
running before calling addAMResourceUsage.
We should only call addAMResourceUsage when the AM is not running yet:
{code}
if (!isAmRunning() && getLiveContainers().size() == 1 &&
    !getUnmanagedAM()) {
  getQueue().addAMResourceUsage(container.getResource());
  setAmRunning(true);
}
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)