[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393351#comment-14393351 ] zhihai xu commented on YARN-3415: - [~sandyr], thanks for the review, The latest patch YARN-3415.002.patch is rebased on the latest code base and it passed the Jenkins test. Let me know whether you have more comments for the patch. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch, > YARN-3415.002.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391972#comment-14391972 ] Hadoop QA commented on YARN-3415: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708850/YARN-3415.002.patch against trunk revision 4d14816. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7196//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7196//console This message is automatically generated. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch, > YARN-3415.002.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391738#comment-14391738 ] Rohit Agarwal commented on YARN-3415: - +1 > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch, > YARN-3415.002.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391728#comment-14391728 ] zhihai xu commented on YARN-3415: - [~ragarwal], thanks for the review. I uploaded a new patch YARN-3415.002.patch which addressed your comment. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch, > YARN-3415.002.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391674#comment-14391674 ] Rohit Agarwal commented on YARN-3415: - It looks good. I have one minor comment: {code} +// non-AM container should be allocated +// check non-AM container allocation is not rejected +// due to queue MaxAMShare limitation. +assertEquals("Application5's AM should have 1 container", +1, app5.getLiveContainers().size()); {code} The message in the {{assertEquals}} here should be 'Application5 should have 1 container." Because the AM is expired at this point. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391624#comment-14391624 ] Sandy Ryza commented on YARN-3415: -- [~ragarwal] did you have any more comments before I commit this? > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391218#comment-14391218 ] Sandy Ryza commented on YARN-3415: -- +1 > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388387#comment-14388387 ] Hadoop QA commented on YARN-3415: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708400/YARN-3415.001.patch against trunk revision b5a22e9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7170//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7170//console This message is automatically generated. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388285#comment-14388285 ] zhihai xu commented on YARN-3415: - [~sandyr], that is a very good idea to move the call to setAMResource that's currently in FairScheduler next to the call to getQueue().addAMResourceUsage(). The new patch YARN-3415.001.patch addressed this issue and it also addressed your first two comments. [~ragarwal], thanks for the review. First I want to clarify the AM resource usage won't be changed when the AM container is completed, It will only be changed when the application attempt is removed from scheduler, which will call FSLeafQueue#removeApp. So currently "Check that AM resource usage becomes 0" is done after all application attempts are removed. {code} assertEquals("Queue1's AM resource usage should be 0", 0, queue1.getAmResourceUsage().getMemory()); {code} bq. Add a non-AM container to app5. Handle the nodeUpdate event - check that the number of live containers is 2. The old code already had this test for app1, the test can pass without the patch. {code} // Still can run non-AM container createSchedulingRequestExistingApplication(1024, 1, attId1); scheduler.update(); scheduler.handle(updateEvent); assertEquals("Application1 should have two running containers", 2, app1.getLiveContainers().size()); {code} I think your issue is due to the non-AM container allocation is delayed after AM container is finished, which cause 0 LiveContainers. My test simulates "complete AM container before non-AM container is allocated", the old code will increase the AM resource usage when non-AM container is allocated. So without the patch, the test will fail. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388262#comment-14388262 ] Hadoop QA commented on YARN-3415: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708381/YARN-3415.001.patch against trunk revision b5a22e9. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7167//console This message is automatically generated. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch, YARN-3415.001.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387763#comment-14387763 ] Rohit Agarwal commented on YARN-3415: - I don't understand the newly added tests: {code} +// request non-AM container for app5 +createSchedulingRequestExistingApplication(1024, 1, attId5); +assertEquals("Application5's AM should have 1 container", +1, app5.getLiveContainers().size()); +// complete AM container before non-AM container is allocated. +// spark application hit this situation. +RMContainer amContainer5 = (RMContainer)app5.getLiveContainers().toArray()[0]; +ContainerExpiredSchedulerEvent containerExpired = +new ContainerExpiredSchedulerEvent(amContainer5.getContainerId()); +scheduler.handle(containerExpired); +assertEquals("Application5's AM should have 0 container", +0, app5.getLiveContainers().size()); +assertEquals("Queue1's AM resource usage should be 2048 MB memory", +2048, queue1.getAmResourceUsage().getMemory()); +scheduler.update(); +scheduler.handle(updateEvent); +// non-AM container should be allocated +// check non-AM container allocation is not rejected +// due to queue MaxAMShare limitation. +assertEquals("Application5's AM should have 1 container", +1, app5.getLiveContainers().size()); +// check non-AM container allocation won't affect queue AmResourceUsage +assertEquals("Queue1's AM resource usage should be 2048 MB memory", +2048, queue1.getAmResourceUsage().getMemory()); {code} Just before this block, I can see that the AM for app5 is already running and is taking 2048MB. So, in my opinion, the tests should be like: - Add a non-AM container to app5. Handle the nodeUpdate event - check that the number of live containers is 2. - kill the AM container. Handle the events. Check that number of live containers becomes 1. Check that AM resource usage becomes 0. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387425#comment-14387425 ] Sandy Ryza commented on YARN-3415: -- This looks mostly reasonable. A few comments: * In FSAppAttempt, can we change the "If this container is used to run AM" comment to "If not running unmanaged, the first container we allocate is always the AM. Update the leaf queue's AM usage"? * The four lines of comment in FSLeafQueue could be reduced to "If isAMRunning is true, we're no running an unmanaged AM." * Would it make sense to move the call to setAMResource that's currently in FairScheduler next to the call to getQueue().addAMResourceUsage() so that the queue and attempt resource usage get updated at the same time? > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386048#comment-14386048 ] zhihai xu commented on YARN-3415: - The test failure is due to HADOOP-11754 {code} org.mortbay.jetty.webapp.WebAppContext@63eaebdd{/,jar:file:/home/jenkins/.m2/repository/org/apache/hadoop/hadoop-yarn-common/3.0.0-SNAPSHOT/hadoop-yarn-common-3.0.0-SNAPSHOT.jar!/webapps/cluster} javax.servlet.ServletException: java.lang.RuntimeException: Could not read signature secret file: /home/jenkins/hadoop-http-auth-signature-secret at org.apache.hadoop.security.authentication.server.AuthenticationFilter.initializeSecretProvider(AuthenticationFilter.java:266) {code} > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386042#comment-14386042 ] Hadoop QA commented on YARN-3415: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708063/YARN-3415.000.patch against trunk revision 3d9132d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMHA org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7143//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7143//console This message is automatically generated. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386003#comment-14386003 ] zhihai xu commented on YARN-3415: - I uploaded a patch YARN-3415.000.patch for review. The patch fixed two bugs and did 4 minor code optimizations. bugs fixed: 1. Checking whether the AM is running before call addAMResourceUsage. We should only addAMResourceUsage when AM is not running. Without this fix, the test will fail because queue AmResourceUsage is changed by non-AM container. 2. Don’t check non-AM container for queue MaxAMShare limitation. Without this fix, the test will fail because non-AM container allocation is rejected due to MaxAMShare limitation. code optimizations: 1. remove redundant check for getLiveContainers().size() when addAMResourceUsage in FSAppAttempt. 2. remove redundant check for getLiveContainers().size() when check queue MaxAMShare(canRunAppAM) in FSAppAttempt. 3. remove redundant check for app.getAMResource() in FSLeafQueue#removeApp. I didn’t check app.getUnmanagedAM() here, instead I add comments: AmRunning is set to true only when getUnmanagedAM() is false. But checking app.getUnmanagedAM() is also ok for me. 4. check application.isAmRunning() instead of application.getLiveContainers().isEmpty() in FairScheduler#allocate. Because application.getLiveContainers() will consume much more CPU power than application.isAmRunning(). FairScheduler#allocate is a function which will be called very frequently. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3415.000.patch > > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385672#comment-14385672 ] zhihai xu commented on YARN-3415: - [~ragarwal], thanks for the comment. bq. 1. If the above approach is valid - why do we need the getLiveContainers() check at all? totally agree, If we check !isAmRunning(), getLiveContainers() check is redundant. bq. 2. I don't see any place where we are setting amRunning to false once it is set to true. Should we do that for completeness? We don't need to set it false. because each FSAppAttempt has only one AM, once FSAppAttempt is removed, it will be garbage collected. bq. 3. Why is there no getUnmanagedAM() check in removeApp where we are subtracting from amResourceUsage. I think the conditions for adding and subtracting amResourceUsage should be similar as much as possible. totally agree, it will be better to check getUnmanagedAM() for readability. Currently it works, because we check getUnmanagedAM() when we setAMResource in FairScheduler#allocate. So if getUnmanagedAM() is true, app.getAMResource() will return Resources.none(). And also we can remove the check app.getAMResource() != null because the following code will guarantee it will not return null. {code} private Resource _get(String label, ResourceType type) { try { readLock.lock(); UsageByLabel usage = usages.get(label); if (null == usage) { return Resources.none(); } return normalize(usage.resArr[type.idx]); } finally { readLock.unlock(); } } {code} About my previous comments bq. It looks like we should also check isAmRunning at FairScheduler#allocate Checking isAmRunning at FairScheduler#allocate is not necessary. Because except AM container, all other containers for FSAppAttempt will be allocated by AM. once AM container is finished, no more FairScheduler#allocate will be called. I will upload a patch with a test case for this issue. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu >Priority: Critical > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385347#comment-14385347 ] Sandy Ryza commented on YARN-3415: -- Thanks for filing this [~ragarwal] and for taking this up [~zxu]. This seems like a fairly serious issue. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385202#comment-14385202 ] Rohit Agarwal commented on YARN-3415: - > if (!isAmRunning() && getLiveContainers().size() == 1 && !getUnmanagedAM()) { Few points: # If the above approach is valid - why do we need the {{getLiveContainers()}} check at all? # I don't see any place where we are setting {{amRunning}} to {{false}} once it is set to {{true}}. Should we do that for completeness? # Why is there no {{getUnmanagedAM()}} check in {{removeApp}} where we are subtracting from {{amResourceUsage}}. I think the conditions for adding and subtracting {{amResourceUsage}} should be similar as much as possible. > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385192#comment-14385192 ] zhihai xu commented on YARN-3415: - It looks like we should also check isAmRunning at FairScheduler#allocate {code} if (!application.getUnmanagedAM() && ask.size() == 1 && application.getLiveContainers().isEmpty()) { application.setAMResource(ask.get(0).getCapability()); } {code} and FSAppAttempt#assignContainer {code} if (getLiveContainers().size() == 0 && !getUnmanagedAM()) { if (!getQueue().canRunAppAM(getAMResource())) { return Resources.none(); } } {code} > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal >Assignee: zhihai xu > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385185#comment-14385185 ] zhihai xu commented on YARN-3415: - I can work on this issue, I read the problem at https://github.com/apache/spark/pull/5233#issuecomment-87160289. It looks like the issue can be fixed by checking whether the AM is running before call addAMResourceUsage. We should only addAMResourceUsage when AM is not running. {code} if (!isAmRunning() && getLiveContainers().size() == 1 && !getUnmanagedAM()) { getQueue().addAMResourceUsage(container.getResource()); setAmRunning(true); } {code} > Non-AM containers can be counted towards amResourceUsage of a fairscheduler > queue > - > > Key: YARN-3415 > URL: https://issues.apache.org/jira/browse/YARN-3415 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Rohit Agarwal > > We encountered this problem while running a spark cluster. The > amResourceUsage for a queue became artificially high and then the cluster got > deadlocked because the maxAMShare constrain kicked in and no new AM got > admitted to the cluster. > I have described the problem in detail here: > https://github.com/apache/spark/pull/5233#issuecomment-87160289 > In summary - the condition for adding the container's memory towards > amResourceUsage is fragile. It depends on the number of live containers > belonging to the app. We saw that the spark AM went down without explicitly > releasing its requested containers and then one of those containers memory > was counted towards amResource. > cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)