[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836516#comment-17836516 ]

mumu commented on YARN-3415:
----------------------------

[~zxu] Hello, I am using version 2.7.2, and after merging this patch the original issue no longer occurs. However, I now see cases where AM Used Resources is greater than AM Max Resources and tasks can still be submitted.

> Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-3415
>                 URL: https://issues.apache.org/jira/browse/YARN-3415
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: Rohit Agarwal
>            Assignee: Zhihai Xu
>            Priority: Critical
>             Fix For: 2.8.0, 3.0.0-alpha1
>
>         Attachments: YARN-3415.000.patch, YARN-3415.001.patch, YARN-3415.002.patch
>
> We encountered this problem while running a Spark cluster. The amResourceUsage for a queue became artificially high, and then the cluster got deadlocked because the maxAMShare constraint kicked in and no new AM got admitted to the cluster.
> I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289
> In summary, the condition for adding a container's memory towards amResourceUsage is fragile: it depends on the number of live containers belonging to the app. We saw that the Spark AM went down without explicitly releasing its requested containers, and then one of those containers' memory was counted towards amResourceUsage.
> cc - [~sandyr]
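One plausible explanation for mumu's observation (and not necessarily a failure of this patch) is that the Fair Scheduler enforces maxAMShare only when a new AM is admitted, against a cap derived from the queue's current fair share; if the fair share later shrinks, already-running AMs are not evicted, so the displayed AM usage can sit above the displayed cap. The standalone sketch below illustrates that dynamic with plain memory values; all names are hypothetical and this is not the Hadoop source.

{code:java}
/**
 * Minimal standalone sketch (hypothetical names) of a maxAMShare-style
 * admission check. It shows how "AM Used" can end up above "AM Max":
 * the limit is only consulted at admission time, and the cap is
 * recomputed from the queue's *current* fair share.
 */
class AmShareCheckSketch {
  float maxAMShare = 0.5f;     // assumed maxAMShare setting for the queue
  long amResourceUsageMb = 0;  // memory charged to running AMs

  /** Admission-time check; already-running AMs are never re-checked. */
  boolean canRunAppAM(long fairShareMb, long amResourceMb) {
    long maxAMResourceMb = (long) (fairShareMb * maxAMShare);
    return amResourceUsageMb + amResourceMb <= maxAMResourceMb;
  }

  void admit(long amResourceMb) { amResourceUsageMb += amResourceMb; }

  public static void main(String[] args) {
    AmShareCheckSketch q = new AmShareCheckSketch();
    // With a fair share of 8192 MB, two 2048 MB AMs pass the check...
    if (q.canRunAppAM(8192, 2048)) q.admit(2048);
    if (q.canRunAppAM(8192, 2048)) q.admit(2048);
    // ...but if the fair share later drops to 4096 MB, the cap becomes
    // 2048 MB while amResourceUsage is 4096 MB: "AM Used" > "AM Max",
    // yet nothing evicts the AMs that are already running.
    System.out.println("used=" + q.amResourceUsageMb
        + " max=" + (long) (4096 * q.maxAMShare));
  }
}
{code}

If that is what is happening, it would be worth checking whether the queue's fair share changed between AM admission and the moment the UI was read, rather than a regression in the amResourceUsage accounting itself.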
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394313#comment-14394313 ]

Hudson commented on YARN-3415:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #152 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/152/])
YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
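For readers tracing the commit, the sketch below reconstructs the fragile pre-patch heuristic ("an app's first live container is its AM") and the shape of the fix suggested by the touched files (FSAppAttempt, FSLeafQueue, FairScheduler): a sticky per-attempt flag so AM resources are charged exactly once. All names are hypothetical and the scenario is simplified; this is a hedged reconstruction, not the literal diff (the real fix also has to skip unmanaged AMs).

{code:java}
/**
 * Hypothetical, simplified reconstruction of the accounting bug and the
 * shape of the fix. Not the actual Hadoop source.
 */
class AmAccountingSketch {
  long amResourceUsageMb = 0;   // queue-level AM resource usage
  int liveContainers = 0;
  boolean amRunning = false;    // post-fix: sticky per-attempt AM flag
  final boolean useFix;

  AmAccountingSketch(boolean useFix) { this.useFix = useFix; }

  void onAllocate(long mb) {
    liveContainers++;
    if (useFix) {
      if (!amRunning) {               // charged exactly once per attempt
        amResourceUsageMb += mb;
        amRunning = true;
      }
    } else if (liveContainers == 1) { // fragile: true again whenever the
      amResourceUsageMb += mb;        // live count passes back through 1
    }
  }

  void onRelease() { liveContainers--; }

  public static void main(String[] args) {
    for (boolean fix : new boolean[] {false, true}) {
      AmAccountingSketch app = new AmAccountingSketch(fix);
      app.onAllocate(1024);  // AM container: correctly charged
      app.onAllocate(4096);  // executor container
      app.onRelease();       // AM dies; its container is removed
      app.onRelease();       // executor eventually released
      app.onAllocate(4096);  // late allocation for the attempt: the live
                             // count is 1 again, so pre-fix it is charged
      System.out.println((fix ? "fixed: " : "fragile: ")
          + app.amResourceUsageMb + " MB charged as AM usage");
    }
  }
}
{code}

Under the fragile heuristic the queue's amResourceUsage never comes back down to reality, which is exactly how the maxAMShare check ends up rejecting every new AM and deadlocking the cluster, as described in the issue.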
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394335#comment-14394335 ]

Hudson commented on YARN-3415:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #886 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/886/])
YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394507#comment-14394507 ]

Hudson commented on YARN-3415:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #143 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/143/])
YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394522#comment-14394522 ]

Hudson commented on YARN-3415:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2084 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2084/])
YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394822#comment-14394822 ]

Hudson commented on YARN-3415:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #153 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/153/])
YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394876#comment-14394876 ]

Hudson commented on YARN-3415:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2102 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2102/])
YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393412#comment-14393412 ]

Hudson commented on YARN-3415:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #7497 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7497/])
YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393476#comment-14393476 ]

zhihai xu commented on YARN-3415:
---------------------------------

Thanks [~ragarwal] for the valuable feedback and for filing this issue. Thanks [~sandyr] for the valuable feedback and for committing the patch! Greatly appreciated.