[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629410#comment-14629410 ]

wangfeng commented on YARN-2809:

Applying this patch to hadoop 2.6.0 failed. Console output:

patch -u -p0 < YARN-2809-v3.patch
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
Hunk #1 succeeded at 984 (offset -16 lines).
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
Hunk #1 FAILED at 22.
Hunk #2 succeeded at 33 (offset -4 lines).
Hunk #3 succeeded at 71 (offset -5 lines).
Hunk #4 succeeded at 105 (offset -5 lines).
Hunk #5 succeeded at 266 (offset -10 lines).
Hunk #6 succeeded at 338 (offset -10 lines).
1 out of 6 hunks FAILED -- saving rejects to file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java.rej
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java

Implement workaround for linux kernel panic when removing cgroup

Key: YARN-2809
URL: https://issues.apache.org/jira/browse/YARN-2809
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.6.0
Environment: RHEL 6.4
Reporter: Nathan Roberts
Assignee: Nathan Roberts
Fix For: 2.7.0
Attachments: YARN-2809-v2.patch, YARN-2809-v3.patch, YARN-2809.patch

Some older versions of linux have a bug that can cause a kernel panic when the LCE attempts to remove a cgroup. It is a race condition, so it is rare, but on a cluster of a few thousand nodes it can result in a couple of panics per day.

This is the commit that likely (haven't verified) fixes the problem in linux: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-2.6.39.y&id=068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267

Details will be added in comments.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323883#comment-14323883 ]

Hudson commented on YARN-2809:

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #97 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/97/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316360#comment-14316360 ]

Hudson commented on YARN-2809:

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2052 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2052/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316277#comment-14316277 ]

Hudson commented on YARN-2809:

FAILURE: Integrated in Hadoop-Hdfs-trunk #2033 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2033/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316331#comment-14316331 ]

Hudson commented on YARN-2809:

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #102 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/102/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315987#comment-14315987 ]

Hudson commented on YARN-2809:

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #101 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/101/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316009#comment-14316009 ]

Hudson commented on YARN-2809:

FAILURE: Integrated in Hadoop-Yarn-trunk #835 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/835/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314500#comment-14314500 ]

Hudson commented on YARN-2809:

FAILURE: Integrated in Hadoop-trunk-Commit #7063 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7063/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309920#comment-14309920 ]

Jason Lowe commented on YARN-2809:

+1 lgtm. Will commit this early next week if there are no objections.
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309343#comment-14309343 ]

Hadoop QA commented on YARN-2809:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12697032/YARN-2809-v2.patch
against trunk revision 1425e3d.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6535//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6535//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6535//console

This message is automatically generated.
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225085#comment-14225085 ]

Hadoop QA commented on YARN-2809:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12683621/YARN-2809.patch
against trunk revision 61a2510.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5934//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5934//console

This message is automatically generated.
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198560#comment-14198560 ]

Nathan Roberts commented on YARN-2809:

Stack trace:
{noformat}
[8150d4a8] ? panic+0xa7/0x16f
[815116d4] ? oops_end+0xe4/0x100
[81046bfb] ? no_context+0xfb/0x260
[81449058] ? dev_hard_start_xmit+0x308/0x530
[81046e85] ? __bad_area_nosemaphore+0x125/0x1e0
[812773a9] ? cpumask_next_and+0x29/0x50
[81046f53] ? bad_area_nosemaphore+0x13/0x20
[810476b1] ? __do_page_fault+0x321/0x480
[81056881] ? update_curr+0xe1/0x1f0
[81065905] ? enqueue_entity+0x125/0x410
[810524e3] ? set_next_buddy+0x43/0x50
[810570e0] ? check_preempt_wakeup+0x1c0/0x260
[81065ceb] ? enqueue_task_fair+0xfb/0x100
[8105230c] ? check_preempt_curr+0x7c/0x90
[815135fe] ? do_page_fault+0x3e/0xa0
[815109b5] ? page_fault+0x25/0x30
[81056b19] ? update_cfs_shares+0x29/0x170
[81065363] ? dequeue_entity+0x113/0x2e0
[810664da] ? dequeue_task_fair+0x6a/0x130
[81055ebe] ? dequeue_task+0x8e/0xb0
[81055f03] ? deactivate_task+0x23/0x30
[8150dc99] ? thread_return+0x127/0x76e
[810e6e1e] ? call_rcu+0xe/0x10
[8107196f] ? release_task+0x33f/0x4b0
[81073837] ? do_exit+0x5b7/0x870
[81073b48] ? do_group_exit+0x58/0xd0
[81088e36] ? get_signal_to_deliver+0x1f6/0x460
[8100a265] ? do_signal+0x75/0x800
[810dc675] ? __audit_syscall_exit+0x265/0x290
[8100aa80] ? do_notify_resume+0x90/0xc0
[8100b341] ? int_signal+0x12/0x17
{noformat}

What's happening is that CgroupsLCEResourcesHandler is attempting to delete the cgroup before all the tasks within the cgroup have exited (explained later). It tries every 20ms to remove the cgroup until successful, or a timeout (default 1 second) expires. Sometimes these attempts hit a race within the kernel where the last task has not completely finished tearing down, yet it is far enough down that the cgroup is able to be removed. This leaves a NULL pointer around which results in the panic.
The kernel has been fixed and most recent distributions will have the fix. However, there are older kernel versions out there that would benefit from a simple workaround. The proposed workaround is to wait until the tasks file within the cgroup is empty, and then delay a small amount of time before attempting to delete the cgroup.

One question is why there are still tasks in the cgroup. I don't have a complete answer here and some of the details may be slightly off, but I do know the following. The process tree within a mapreduce cgroup looks like bash -c - java ... When map or reduce processing is complete, the AM is informed, and it then informs the NM so that the container can be torn down. A SIGTERM is sent to the session (bash is the session leader). bash is much quicker at exiting than everything else, so it exits first; its parent (container-executor) gets a SIGCHLD and starts cleaning up. The cleanup includes removing the cgroup, which gets us into the race described above.
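
For readers following the thread, the workaround proposed in Nathan Roberts' comment (wait until the cgroup's tasks file is empty, then pause briefly before deleting the cgroup) might look roughly like the sketch below. This is only an illustration under stated assumptions and is not the code from the YARN-2809 patches: the class name, method names, and parameters are hypothetical, and the 20 ms and 1 second values used in main simply mirror the retry interval and default timeout mentioned in the comment. The sketch only decides when removal is safe to attempt; the removal itself is left to the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Illustrative sketch only; hypothetical names, not the YARN-2809 patch.
class CgroupRemovalSketch {

  // True when the cgroup's "tasks" file lists no task IDs (or no longer exists).
  static boolean tasksFileIsEmpty(Path cgroupDir) throws IOException {
    try {
      for (String line : Files.readAllLines(cgroupDir.resolve("tasks"))) {
        if (!line.trim().isEmpty()) {
          return false; // at least one task is still attached
        }
      }
      return true;
    } catch (NoSuchFileException e) {
      return true; // cgroup already gone, nothing left to wait for
    }
  }

  // Polls every pollMs until the tasks file is empty, then sleeps settleMs so
  // the last task can finish tearing down. Returns true if removal can be
  // attempted, false if timeoutMs expired while tasks were still listed.
  static boolean waitUntilSafeToRemove(Path cgroupDir, long timeoutMs,
                                       long pollMs, long settleMs)
      throws IOException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!tasksFileIsEmpty(cgroupDir)) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // give up; the caller should not remove the cgroup yet
      }
      Thread.sleep(pollMs);
    }
    Thread.sleep(settleMs); // small extra delay before the caller deletes
    return true;
  }

  public static void main(String[] args) throws Exception {
    // A temporary directory stands in for a real cgroup directory.
    Path dir = Files.createTempDirectory("cgroup-sketch");
    Path tasks = dir.resolve("tasks");
    Files.write(tasks, "1234\n".getBytes());

    // Simulate the last task exiting a little later by emptying the file.
    Thread exiter = new Thread(() -> {
      try {
        Thread.sleep(60);
        Files.write(tasks, new byte[0]);
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    });
    exiter.start();

    boolean safe = waitUntilSafeToRemove(dir, 1000, 20, 20);
    exiter.join();
    System.out.println("safe to remove: " + safe);
  }
}
```

The point of separating the wait from the delete is that the delete is never attempted while tasks are still listed, which is the window the comment identifies as triggering the kernel race. How long the extra delay needs to be on an affected kernel is not something this sketch can establish.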