[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789102#comment-13789102 ] Hudson commented on YARN-465: - SUCCESS: Integrated in Hadoop-Yarn-trunk #356 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/356/]) YARN-465. fix coverage org.apache.hadoop.yarn.server.webproxy. Contributed by Aleksey Gorshkov and Andrey Klochkov (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530095) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Andrey Klochkov Fix For: 2.3.0 Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy: patch YARN-465-trunk.patch for trunk, patch YARN-465-branch-2.patch for branch-2, patch YARN-465-branch-0.23.patch for branch-0.23. There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy and touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789204#comment-13789204 ] Hudson commented on YARN-465: - FAILURE: Integrated in Hadoop-Hdfs-trunk #1546 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1546/]) YARN-465. fix coverage org.apache.hadoop.yarn.server.webproxy. Contributed by Aleksey Gorshkov and Andrey Klochkov (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530095) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Andrey Klochkov Fix For: 2.3.0 Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy: patch YARN-465-trunk.patch for trunk, patch YARN-465-branch-2.patch for branch-2, patch YARN-465-branch-0.23.patch for branch-0.23. There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy and touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789212#comment-13789212 ] Hudson commented on YARN-465: - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1572 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1572/]) YARN-465. fix coverage org.apache.hadoop.yarn.server.webproxy. Contributed by Aleksey Gorshkov and Andrey Klochkov (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530095) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Andrey Klochkov Fix For: 2.3.0 Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy: patch YARN-465-trunk.patch for trunk, patch YARN-465-branch-2.patch for branch-2, patch YARN-465-branch-0.23.patch for branch-0.23. There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy and touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch The patch changes the deleteCgroup() method to retry the delete in a loop (retrying every 20 ms) until it succeeds or it times out (500 ms). Also, this is done for all containers, not only for AM containers. It also introduces a configuration knob for the timeout. Other changes, such as the method signatures and the initConfig() method, are there to enable unit testing of the new logic. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
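A minimal sketch of the retry-on-timeout approach described above, assuming a 20 ms retry interval and a configurable timeout defaulting to 500 ms (the class and method names here are illustrative, not taken from the actual patch):
{code}
import java.io.File;

// Illustrative only, not the actual patch: retry the cgroup directory delete
// until it succeeds or the configured timeout elapses, instead of giving up
// after a single attempt.
class CgroupDeleteRetrySketch {
  static boolean deleteWithRetry(String cgroupPath, long timeoutMs) {
    File cgroup = new File(cgroupPath);
    long start = System.currentTimeMillis();
    boolean deleted = cgroup.delete();
    while (!deleted && System.currentTimeMillis() - start < timeoutMs) {
      try {
        Thread.sleep(20);  // 20 ms retry interval, per the patch summary above
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
      deleted = cgroup.delete();
    }
    return deleted;
  }
}
{code}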
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Target Version/s: 2.2.1 LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789262#comment-13789262 ] Hadoop QA commented on YARN-1284: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607362/YARN-1284.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2144//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2144//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2144//console This message is automatically generated. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789355#comment-13789355 ] Hadoop QA commented on YARN-1284: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607374/YARN-1284.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2145//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2145//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2145//console This message is automatically generated. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789429#comment-13789429 ] Hadoop QA commented on YARN-1284: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607388/YARN-1284.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2146//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2146//console This message is automatically generated. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-250. - Resolution: Won't Fix Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789585#comment-13789585 ] Bikas Saha commented on YARN-250: - Seemed useful. Any reason for a won't fix? Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789605#comment-13789605 ] Sandy Ryza commented on YARN-250: - My original thinking was that this would be useful for allowing applications to be moved between Fair Scheduler queues from the command line. But my current thinking on that capability is that it would probably be useful to have for all schedulers. I thought other use cases would come up, but nothing has jumped out at me recently. If there are other things a mechanism like this would be useful for, I'm certainly not against adding it. Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789644#comment-13789644 ] Bikas Saha commented on YARN-250: - Will this functionality be added to the fair scheduler? If yes, then how would it be exposed in the RM without some discovery logic? Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1285) Inconsistency of default yarn.acl.enable value
Zhijie Shen created YARN-1285: - Summary: Inconsistency of default yarn.acl.enable value Key: YARN-1285 URL: https://issues.apache.org/jira/browse/YARN-1285 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen In yarn-default.xml, yarn.acl.enable is true while in YarnConfiguration, DEFAULT_YARN_ACL_ENABLE is false. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789687#comment-13789687 ] Sandy Ryza commented on YARN-250: - The mechanism this JIRA proposed would make most sense for features specific to a single scheduler. As queue-moving functionality is probably something that the Capacity Scheduler would like at some point as well, I think it would make more sense to directly add it as a ResourceManager API. Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
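For illustration only (YARN-250 was resolved as Won't Fix, so no such API exists in YARN): a hypothetical sketch of the key/value scheduler RPC that the description proposes, with all names invented here:
{code}
import java.util.Map;

// Hypothetical illustration of the proposed mechanism: a client sends an
// opaque set of key/value pairs, the RM hands them to the active scheduler,
// and the scheduler returns key/value pairs (or a string) in response.
public interface SchedulerCommandProtocolSketch {
  Map<String, String> invokeSchedulerCommand(Map<String, String> args);
}
{code}
Under this sketch, a scheduler-specific command line tool could, for example, pass a queue name and an application id to move an application between Fair Scheduler queues without adding new RM plumbing for each feature.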
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789769#comment-13789769 ] Alejandro Abdelnur commented on YARN-1284: -- Tested in a cluster using cgroups; it works as expected, both the delete and the timeouts. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1283: Attachment: YARN-1283.20131008.1.patch Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows an incorrect "The url to track the job" value. Currently, it prints http://RM:httpsport/proxy/application_1381162886563_0001/ instead of https://RM:httpsport/proxy/application_1381162886563_0001/; http://hostname:8088/proxy/application_1381162886563_0001/ is invalid. hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1381162886563_0001 13/10/07 18:39:40 INFO impl.YarnClientImpl: Submitted application application_1381162886563_0001 to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.Job: The url to track the job: http://hostname:8088/proxy/application_1381162886563_0001/ 13/10/07 18:39:40 INFO mapreduce.Job: Running job: job_1381162886563_0001 13/10/07 18:39:46 INFO mapreduce.Job: Job job_1381162886563_0001 running in uber mode : false 13/10/07 18:39:46 INFO mapreduce.Job: map 0% reduce 0% 13/10/07 18:39:53 INFO mapreduce.Job: map 100% reduce 0% 13/10/07 18:39:58 INFO mapreduce.Job: map 100% reduce 100% 13/10/07 18:39:58 INFO mapreduce.Job: Job job_1381162886563_0001 completed successfully 13/10/07 18:39:58 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=26 FILE: Number of bytes written=177279 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=48 HDFS: Number of bytes written=0 HDFS: Number of read operations=1 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job Counters Launched map tasks=1 Launched reduce tasks=1 Other local map
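Not the actual YARN-1283 patch: a minimal sketch of the scheme selection the fix needs, assuming the policy is read from the yarn.http.policy configuration key mentioned in the report (the helper class and method names are illustrative):
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: build the proxy tracking URL with a scheme that matches
// the configured HTTP policy instead of hard-coding "http://".
class TrackingUrlSketch {
  static String trackingUrl(Configuration conf, String proxyHostPort, String appId) {
    String policy = conf.get("yarn.http.policy", "HTTP_ONLY");
    String scheme = "HTTPS_ONLY".equals(policy) ? "https://" : "http://";
    return scheme + proxyHostPort + "/proxy/" + appId + "/";
  }
}
{code}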
[jira] [Commented] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789784#comment-13789784 ] Omkar Vinit Joshi commented on YARN-1283: - Fixing the javadoc warning. Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows an incorrect "The url to track the job" value. Currently, it prints http://RM:httpsport/proxy/application_1381162886563_0001/ instead of https://RM:httpsport/proxy/application_1381162886563_0001/; http://hostname:8088/proxy/application_1381162886563_0001/ is invalid. hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1381162886563_0001 13/10/07 18:39:40 INFO impl.YarnClientImpl: Submitted application application_1381162886563_0001 to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.Job: The url to track the job: http://hostname:8088/proxy/application_1381162886563_0001/ 13/10/07 18:39:40 INFO mapreduce.Job: Running job: job_1381162886563_0001 13/10/07 18:39:46 INFO mapreduce.Job: Job job_1381162886563_0001 running in uber mode : false 13/10/07 18:39:46 INFO mapreduce.Job: map 0% reduce 0% 13/10/07 18:39:53 INFO mapreduce.Job: map 100% reduce 0% 13/10/07 18:39:58 INFO mapreduce.Job: map 100% reduce 100% 13/10/07 18:39:58 INFO mapreduce.Job: Job job_1381162886563_0001 completed successfully 13/10/07 18:39:58 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=26 FILE: Number of bytes written=177279 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=48 HDFS: Number of bytes written=0 HDFS: Number of read operations=1 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job Counters Launched map tasks=1 Launched reduce
[jira] [Created] (YARN-1286) Schedulers doesn't check whether ACL is enabled or not when adding an application
Zhijie Shen created YARN-1286: - Summary: Schedulers doesn't check whether ACL is enabled or not when adding an application Key: YARN-1286 URL: https://issues.apache.org/jira/browse/YARN-1286 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen The schedulers don't check whether ACLs are enabled when an application is added. However, QueueACLsManager does check this for ClientRMService when getting application(s) and killing an application. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789854#comment-13789854 ] Hadoop QA commented on YARN-1283: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607454/YARN-1283.20131008.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapred.TestClientServiceDelegate org.apache.hadoop.mapred.TestJobCleanup The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapreduce.v2.TestUberAM The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2147//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2147//console This message is automatically generated. Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows an incorrect "The url to track the job" value. Currently, it prints http://RM:httpsport/proxy/application_1381162886563_0001/ instead of https://RM:httpsport/proxy/application_1381162886563_0001/; http://hostname:8088/proxy/application_1381162886563_0001/ is invalid. hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. 
Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated.
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789862#comment-13789862 ] Tsuyoshi OZAWA commented on YARN-556: - Hi Bikas, can you share the current status of this JIRA? RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789861#comment-13789861 ] Sandy Ryza commented on YARN-1284: -- A few nits. Otherwise LGTM.
{code}
+  //package private for testing purposes
+  private long deleteCgroupTimeout;
+  Clock clock;
{code}
The comment should go before the second variable. Also, there should be a space after the //.
{code}
+  //visible for testing
{code}
Should the VisibleForTesting annotation be used? This is in two places.
{code}
+    LOG.debug("deleteCgroup: " + cgroupPath);
{code}
Should be surrounded by if (LOG.isDebugEnabled()).
{code}
+    //file exists
{code}
Space after //?
LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
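For clarity, the guarded-logging and annotation forms suggested in the review comment above would look roughly like this (a sketch of the suggestions, not the actual patch):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import com.google.common.annotations.VisibleForTesting;

// Sketch of the review suggestions: use the Guava @VisibleForTesting
// annotation instead of a bare comment, and guard debug logging so the
// string concatenation is skipped when DEBUG is disabled.
class ReviewSuggestionsSketch {
  private static final Log LOG = LogFactory.getLog(ReviewSuggestionsSketch.class);

  @VisibleForTesting
  long deleteCgroupTimeout;

  void logDelete(String cgroupPath) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("deleteCgroup: " + cgroupPath);
    }
  }
}
{code}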
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789864#comment-13789864 ] Sandy Ryza commented on YARN-1284: -- Oh, and also:
{code}
+if (! new File(cgroupPath).delete()) {
+  LOG.warn("Unable to delete cgroup at: " + cgroupPath +
+      ", tried to delete for " + deleteCgroupTimeout + " ms");
+}
{code}
If the file was already deleted, delete() will return false and we'll log the warning even though nothing went wrong. Instead, we should just check if (!deleted). LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
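A minimal sketch of the suggested check (hedged; it assumes the surrounding retry loop records its outcome in a {{deleted}} flag, which is not shown here):
{code}
// Rely on the outcome of the retry loop instead of issuing one more delete():
// File.delete() returns false for a path an earlier attempt already removed,
// which would produce a spurious warning.
if (!deleted) {
  LOG.warn("Unable to delete cgroup at: " + cgroupPath
      + ", tried to delete for " + deleteCgroupTimeout + " ms");
}
{code}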
[jira] [Commented] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789876#comment-13789876 ] Sandy Ryza commented on YARN-321: - Was a design doc ever written up for this? The HistoryStorageDemo.java is a good start for understanding some of the interfaces, but it would be helpful to have something that explains things like what the Application History Service's role is, how it interacts with the RM, and key differences and similarities with the Job History Server. Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789919#comment-13789919 ] Bikas Saha commented on YARN-556: - Thanks for the reminder. Based on the attached proposal, I am going to create sub-tasks of this jira. Contributors are free to pick up those tasks. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1283: Attachment: YARN-1283.20131008.2.patch Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch, YARN-1283.20131008.2.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows incorrect The url to track the job. Currently, its printing http://RM:httpsport/proxy/application_1381162886563_0001/ instead https://RM:httpsport/proxy/application_1381162886563_0001/ http://hostname:8088/proxy/application_1381162886563_0001/ is invalid hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1381162886563_0001 13/10/07 18:39:40 INFO impl.YarnClientImpl: Submitted application application_1381162886563_0001 to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.Job: The url to track the job: http://hostname:8088/proxy/application_1381162886563_0001/ 13/10/07 18:39:40 INFO mapreduce.Job: Running job: job_1381162886563_0001 13/10/07 18:39:46 INFO mapreduce.Job: Job job_1381162886563_0001 running in uber mode : false 13/10/07 18:39:46 INFO mapreduce.Job: map 0% reduce 0% 13/10/07 18:39:53 INFO mapreduce.Job: map 100% reduce 0% 13/10/07 18:39:58 INFO mapreduce.Job: map 100% reduce 100% 13/10/07 18:39:58 INFO mapreduce.Job: Job job_1381162886563_0001 completed successfully 13/10/07 18:39:58 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=26 FILE: Number of bytes written=177279 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=48 HDFS: Number of bytes written=0 HDFS: Number of read operations=1 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job Counters Launched map tasks=1 Launched reduce tasks=1
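The shape of the fix is to derive the tracking-URL scheme from the configured policy rather than hard-coding http://. A minimal sketch, assuming a hypothetical helper that reads yarn.http.policy directly (the actual patch may plumb this through existing HTTP configuration utilities instead):
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: pick the tracking-URL scheme based on yarn.http.policy.
public final class TrackingUrlScheme {
  private TrackingUrlScheme() {}

  public static String schemePrefix(Configuration conf) {
    String policy = conf.get("yarn.http.policy", "HTTP_ONLY");
    return "HTTPS_ONLY".equals(policy) ? "https://" : "http://";
  }
}
{code}
With HTTPS_ONLY configured, the proxy URL would then be built as schemePrefix(conf) + host + ":" + port + "/proxy/" + appId + "/" instead of always starting with http://.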
[jira] [Commented] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789949#comment-13789949 ] Omkar Vinit Joshi commented on YARN-1283: - MAPREDUCE-5552 is tracking TestJobCleanup failure Fixed other test case. Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch, YARN-1283.20131008.2.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows incorrect The url to track the job. Currently, its printing http://RM:httpsport/proxy/application_1381162886563_0001/ instead https://RM:httpsport/proxy/application_1381162886563_0001/ http://hostname:8088/proxy/application_1381162886563_0001/ is invalid hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1381162886563_0001 13/10/07 18:39:40 INFO impl.YarnClientImpl: Submitted application application_1381162886563_0001 to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.Job: The url to track the job: http://hostname:8088/proxy/application_1381162886563_0001/ 13/10/07 18:39:40 INFO mapreduce.Job: Running job: job_1381162886563_0001 13/10/07 18:39:46 INFO mapreduce.Job: Job job_1381162886563_0001 running in uber mode : false 13/10/07 18:39:46 INFO mapreduce.Job: map 0% reduce 0% 13/10/07 18:39:53 INFO mapreduce.Job: map 100% reduce 0% 13/10/07 18:39:58 INFO mapreduce.Job: map 100% reduce 100% 13/10/07 18:39:58 INFO mapreduce.Job: Job job_1381162886563_0001 completed successfully 13/10/07 18:39:58 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=26 FILE: Number of bytes written=177279 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=48 HDFS: Number of bytes written=0 HDFS: Number of read operations=1 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch Addressing Sandy's comments. Reworked the while-loop logic using a do-while block; it seems a bit cleaner that way. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
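A hedged sketch of what such a do-while spin could look like, shown as a method body; the retry interval is a placeholder, the field names come from the snippets quoted earlier, and this is not the patch verbatim:
{code}
// Spin on deletion with a small sleep, and give up once the timeout expires.
boolean deleted = false;
long start = clock.getTime();
do {
  deleted = new java.io.File(cgroupPath).delete();
  if (!deleted) {
    try {
      Thread.sleep(20);
    } catch (InterruptedException ie) {
      // keep trying until the timeout expires
    }
  }
} while (!deleted && (clock.getTime() - start) < deleteCgroupTimeout);
if (!deleted) {
  LOG.warn("Unable to delete cgroup at: " + cgroupPath
      + ", tried to delete for " + deleteCgroupTimeout + " ms");
}
{code}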
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789978#comment-13789978 ] Alejandro Abdelnur commented on YARN-1284: -- For the record, I've spent a couple of hours trying an alternate approach suggested by [~rvs] while chatting offline about this. His suggestion was to initialize a trash cgroup next to the container cgroups and, when a container is cleaned up, transition the container's tasks to the trash cgroup, doing the equivalent of a {{cat container/tasks > trash/tasks}}. I tried doing that, but it seems some of the Java IO native calls make a system call which is not supported by the cgroups filesystem implementation, and I was getting the following stack trace:
{code}
java.io.IOException: Argument list too long
java.io.IOException: Argument list too long
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:318)
 at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:80)
 ...
{code}
Given this, besides the fact that I didn't get it to work properly, I would not be comfortable with this approach as it may behave differently on different Linux versions. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
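For context, the failing bulk copy roughly corresponds to the sketch below. This is hedged: the paths and method wrapper are illustrative, and the one-PID-per-write behaviour of the cgroup tasks pseudo-file is the usual explanation for the "Argument list too long" error rather than something verified here.
{code}
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IOUtils;

class CgroupTaskMoveSketch {
  // Java equivalent of "cat container/tasks > trash/tasks": a single bulk write.
  // The cgroups tasks file generally accepts only one PID per write(), so the
  // bulk write issued by copyBytes fails as seen in the stack trace above.
  static void moveTasksToTrash(String containerTasks, String trashTasks)
      throws IOException {
    FileInputStream in = new FileInputStream(containerTasks);
    FileOutputStream out = new FileOutputStream(trashTasks);
    try {
      IOUtils.copyBytes(in, out, 4096, false);
    } finally {
      in.close();
      out.close();
    }
  }
}
{code}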
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789977#comment-13789977 ] Hadoop QA commented on YARN-1284: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607499/YARN-1284.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2149//console This message is automatically generated. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch Updating patch with one last change (which was not in my git cache): the default timeout is now 1000 ms (up from 500 ms). While testing this in a 4-node cluster running pi 500 500, there was one occurrence of a leftover container cgroup because of a timeout. This was done in a cluster running in VMs, which would explain hitting the 500 ms timeout, but I'd still rather bump it up given that the wait breaks as soon as the cgroup is deleted and the attempts happen every 20 ms. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1258) Allow configuring the Fair Scheduler root queue
[ https://issues.apache.org/jira/browse/YARN-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789989#comment-13789989 ] Alejandro Abdelnur commented on YARN-1258: -- LGTM, just one minor thing: instead of doing:
{code}
if (!(node instanceof Element)) {
  continue;
}
{code}
I would do:
{code}
if (node instanceof Element) {
  // ALL THE REST OF THE FOR-LOOP BLOCK
}
{code}
+1 after this and a Jenkins +1. Allow configuring the Fair Scheduler root queue --- Key: YARN-1258 URL: https://issues.apache.org/jira/browse/YARN-1258 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.1.1-beta Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1258.patch This would be useful for acls, maxRunningApps, scheduling modes, etc. The allocation file should be able to accept both: * An implicit root queue * A root queue at the top of the hierarchy with all queues under/inside of it -- This message was sent by Atlassian JIRA (v6.1#6144)
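For concreteness, the two shapes side by side in a hedged sketch; it assumes the surrounding code iterates a DOM NodeList, and the loop bodies are illustrative rather than the actual allocation-file parsing code:
{code}
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class AllocLoopShapes {
  // Current shape: skip non-element nodes with an early continue.
  static void withContinue(NodeList nodes) {
    for (int i = 0; i < nodes.getLength(); i++) {
      Node node = nodes.item(i);
      if (!(node instanceof Element)) {
        continue;
      }
      Element element = (Element) node;
      // ... rest of the for-loop block ...
    }
  }

  // Suggested shape: wrap the loop body in the positive check instead.
  static void withPositiveCheck(NodeList nodes) {
    for (int i = 0; i < nodes.getLength(); i++) {
      Node node = nodes.item(i);
      if (node instanceof Element) {
        Element element = (Element) node;
        // ... rest of the for-loop block ...
      }
    }
  }
}
{code}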
[jira] [Commented] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790011#comment-13790011 ] Hadoop QA commented on YARN-1283: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607494/YARN-1283.20131008.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapred.TestJobCleanup org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapreduce.v2.TestUberAM {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2148//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2148//console This message is automatically generated. Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch, YARN-1283.20131008.2.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows incorrect The url to track the job. Currently, its printing http://RM:httpsport/proxy/application_1381162886563_0001/ instead https://RM:httpsport/proxy/application_1381162886563_0001/ http://hostname:8088/proxy/application_1381162886563_0001/ is invalid hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. 
Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
[jira] [Created] (YARN-1287) Consolidate MockClocks
Sandy Ryza created YARN-1287: Summary: Consolidate MockClocks Key: YARN-1287 URL: https://issues.apache.org/jira/browse/YARN-1287 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza A bunch of different tests have near-identical implementations of MockClock: TestFairScheduler, TestFSSchedulerApp, and TestCgroupsLCEResourcesHandler, for example. They should be consolidated into a single MockClock. -- This message was sent by Atlassian JIRA (v6.1#6144)
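A minimal sketch of what the consolidated class could look like, assuming it implements org.apache.hadoop.yarn.util.Clock; the existing per-test copies may differ in small ways:
{code}
import org.apache.hadoop.yarn.util.Clock;

// A controllable clock for tests: time advances only when the test says so.
public class MockClock implements Clock {
  private long time;

  public MockClock() {
    this(0);
  }

  public MockClock(long initialTime) {
    this.time = initialTime;
  }

  @Override
  public long getTime() {
    return time;
  }

  // Advance the clock by the given number of milliseconds.
  public void tick(long ms) {
    time += ms;
  }
}
{code}
A test would then construct the scheduler or resources handler with this clock and call tick() to simulate elapsed time instead of sleeping.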
[jira] [Updated] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-321: Attachment: ApplicationHistoryServiceHighLevel.pdf Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790047#comment-13790047 ] Sandy Ryza commented on YARN-321: - Thanks Vinod and Zhijie. Didn't see the comment. I'm going to attach your outline as a pdf to make it a little easier for passers-by to learn about. Here's the google doc it came from if you want to edit: https://docs.google.com/document/d/1cNsdGyLuagR8lzfeQrAclOAd-AdkVwgST6OG8Zzp43M/edit#heading=h.15p1lkmmm9g8 Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1258) Allow configuring the Fair Scheduler root queue
[ https://issues.apache.org/jira/browse/YARN-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1258: - Attachment: YARN-1258-1.patch Allow configuring the Fair Scheduler root queue --- Key: YARN-1258 URL: https://issues.apache.org/jira/browse/YARN-1258 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.1.1-beta Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1258-1.patch, YARN-1258.patch This would be useful for acls, maxRunningApps, scheduling modes, etc. The allocation file should be able to accept both: * An implicit root queue * A root queue at the top of the hierarchy with all queues under/inside of it -- This message was sent by Atlassian JIRA (v6.1#6144)