[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-07-16 Thread wangfeng (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629410#comment-14629410 ]

wangfeng commented on YARN-2809:


The patch failed to apply to Hadoop 2.6.0. Console output:
 patch -u -p0 < YARN-2809-v3.patch

patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
Hunk #1 succeeded at 984 (offset -16 lines).
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
Hunk #1 FAILED at 22.
Hunk #2 succeeded at 33 (offset -4 lines).
Hunk #3 succeeded at 71 (offset -5 lines).
Hunk #4 succeeded at 105 (offset -5 lines).
Hunk #5 succeeded at 266 (offset -10 lines).
Hunk #6 succeeded at 338 (offset -10 lines).
1 out of 6 hunks FAILED -- saving rejects to file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java.rej
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java

 Implement workaround for linux kernel panic when removing cgroup
 

 Key: YARN-2809
 URL: https://issues.apache.org/jira/browse/YARN-2809
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
 Environment:  RHEL 6.4
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Fix For: 2.7.0

 Attachments: YARN-2809-v2.patch, YARN-2809-v3.patch, YARN-2809.patch


 Some older versions of Linux have a bug that can cause a kernel panic when 
 the LCE (LinuxContainerExecutor) attempts to remove a cgroup. It is a race 
 condition, so it is fairly rare, but on a cluster of a few thousand nodes it 
 can result in a couple of panics per day.
 This is the commit that likely fixes the problem in Linux (not verified): 
 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-2.6.39.y&id=068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267
 Details will be added in comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-02-17 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323883#comment-14323883 ]

Hudson commented on YARN-2809:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #97 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/97/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-02-11 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316360#comment-14316360 ]

Hudson commented on YARN-2809:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2052 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2052/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/CHANGES.txt


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-02-11 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316277#comment-14316277 ]

Hudson commented on YARN-2809:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2033 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2033/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-02-11 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316331#comment-14316331 ]

Hudson commented on YARN-2809:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #102 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/102/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-02-11 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315987#comment-14315987 ]

Hudson commented on YARN-2809:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #101 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/101/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-02-11 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316009#comment-14316009 ]

Hudson commented on YARN-2809:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #835 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/835/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-02-10 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314500#comment-14314500 ]

Hudson commented on YARN-2809:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7063 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7063/])
YARN-2809. Implement workaround for linux kernel panic when removing cgroup. Contributed by Nathan Roberts (jlowe: rev 3f5431a22fcef7e3eb9aceeefe324e5b7ac84049)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-02-06 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309920#comment-14309920 ]

Jason Lowe commented on YARN-2809:
--

+1 lgtm.  Will commit this early next week if there are no objections.


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-02-06 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309343#comment-14309343 ]

Hadoop QA commented on YARN-2809:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12697032/YARN-2809-v2.patch
  against trunk revision 1425e3d.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6535//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6535//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6535//console

This message is automatically generated.


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2014-11-25 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225085#comment-14225085 ]

Hadoop QA commented on YARN-2809:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12683621/YARN-2809.patch
  against trunk revision 61a2510.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5934//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5934//console

This message is automatically generated.


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2014-11-05 Thread Nathan Roberts (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198560#comment-14198560 ]

Nathan Roberts commented on YARN-2809:
--

Stack trace:
 {noformat}
[8150d4a8] ? panic+0xa7/0x16f
 [815116d4] ? oops_end+0xe4/0x100
 [81046bfb] ? no_context+0xfb/0x260
 [81449058] ? dev_hard_start_xmit+0x308/0x530
 [81046e85] ? __bad_area_nosemaphore+0x125/0x1e0
 [812773a9] ? cpumask_next_and+0x29/0x50
 [81046f53] ? bad_area_nosemaphore+0x13/0x20
 [810476b1] ? __do_page_fault+0x321/0x480
 [81056881] ? update_curr+0xe1/0x1f0
 [81065905] ? enqueue_entity+0x125/0x410
 [810524e3] ? set_next_buddy+0x43/0x50
 [810570e0] ? check_preempt_wakeup+0x1c0/0x260
 [81065ceb] ? enqueue_task_fair+0xfb/0x100
 [8105230c] ? check_preempt_curr+0x7c/0x90
 [815135fe] ? do_page_fault+0x3e/0xa0
 [815109b5] ? page_fault+0x25/0x30
 [81056b19] ? update_cfs_shares+0x29/0x170
 [81065363] ? dequeue_entity+0x113/0x2e0
 [810664da] ? dequeue_task_fair+0x6a/0x130
 [81055ebe] ? dequeue_task+0x8e/0xb0
 [81055f03] ? deactivate_task+0x23/0x30
 [8150dc99] ? thread_return+0x127/0x76e
 [810e6e1e] ? call_rcu+0xe/0x10
 [8107196f] ? release_task+0x33f/0x4b0
 [81073837] ? do_exit+0x5b7/0x870
 [81073b48] ? do_group_exit+0x58/0xd0
 [81088e36] ? get_signal_to_deliver+0x1f6/0x460
 [8100a265] ? do_signal+0x75/0x800
 [810dc675] ? __audit_syscall_exit+0x265/0x290
 [8100aa80] ? do_notify_resume+0x90/0xc0
 [8100b341] ? int_signal+0x12/0x17
{noformat}
What's happening is that CgroupsLCEResourcesHandler attempts to delete the 
cgroup before all the tasks within the cgroup have exited (explained later). It 
retries every 20ms until the removal succeeds or a timeout (default 1 second) 
expires. Sometimes these attempts hit a race within the kernel where the last 
task has not completely finished tearing down, yet is far enough along that the 
cgroup can be removed. This leaves a NULL pointer around, which results in the 
panic.

The kernel has been fixed and most recent distributions will have the fix. 
However, there are older kernel versions out there that would benefit from a 
simple workaround. The proposed workaround is to wait until the tasks file 
within the cgroup is empty, and then delay a small amount of time before 
attempting to delete the cgroup. 
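
The proposed workaround can be sketched roughly as follows. This is a minimal illustration, not the code in the attached patches: the class name, method names, and the settle-delay parameter are hypothetical, and it assumes the tasks file sits directly under the cgroup directory. The 20ms retry interval and 1 second timeout are the values mentioned above. Treating a missing tasks file as empty is a simplification made for the sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Hadoop.
final class CgroupDeleteSketch {

    // Returns true when the cgroup's tasks file lists no PIDs.
    // Assumption for this sketch: a missing tasks file counts as empty.
    static boolean tasksFileEmpty(Path cgroupDir) throws IOException {
        try {
            for (String line : Files.readAllLines(cgroupDir.resolve("tasks"))) {
                if (!line.trim().isEmpty()) {
                    return false; // at least one task is still attached
                }
            }
            return true;
        } catch (NoSuchFileException e) {
            return true;
        }
    }

    // Tries to remove the cgroup directory, but only once the tasks file is
    // empty and after an extra settle delay, so the last exiting task has
    // time to finish tearing down. Returns false if the timeout expires.
    static boolean deleteCgroup(Path cgroupDir, long timeoutMs, long retryMs,
                                long settleMs)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (tasksFileEmpty(cgroupDir)) {
                Thread.sleep(settleMs);
                // On cgroupfs, removing the cgroup is an rmdir of its directory.
                if (cgroupDir.toFile().delete()) {
                    return true;
                }
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(retryMs);
        }
    }
}
```

A caller might use something like `CgroupDeleteSketch.deleteCgroup(path, 1000, 20, 5)`, where the 5ms settle delay is an arbitrary placeholder. The key difference from the current behavior is that no removal is attempted while the tasks file still lists a PID.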

One question is why there are still tasks in the cgroup. I don't have a complete 
answer and some of the details may be slightly off, but I do know the 
following. The process tree within a mapreduce cgroup looks like bash -c -> 
java ... 
When map or reduce processing is complete, the AM is informed, which then 
informs the NM so that the container can be torn down. A SIGTERM is sent to the 
session (bash is the session leader). bash exits much more quickly than 
everything else, so its parent (container-executor) gets a SIGCHLD and starts 
cleaning up. This includes removing the cgroup, which gets us into the race 
described above. 
