[jira] [Commented] (YARN-2161) Fix build on macosx: YARN parts
[ https://issues.apache.org/jira/browse/YARN-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178004#comment-14178004 ] Allen Wittenauer commented on YARN-2161: I've re-opened YARN-2701. Fix build on macosx: YARN parts --- Key: YARN-2161 URL: https://issues.apache.org/jira/browse/YARN-2161 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.6.0 Attachments: YARN-2161.v1.patch, YARN-2161.v2.patch When compiling on macosx with -Pnative, there are several warnings and errors; fixing these would help Hadoop developers working in a macosx environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178007#comment-14178007 ] Allen Wittenauer commented on YARN-2701: FWIW, I've re-opened this JIRA because I'm -1 on the suggested code fix. It was clearly done for patch expediency reasons rather than as a proper fix. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor to do startLocalizer, we use the native code in container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We use a check-then-create approach to create the appDir under /usercache, but if two containers try to do this at the same time, a race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
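For illustration only, here is a minimal Java sketch of the create-then-tolerate pattern that closes the check-then-create window described above. It is not the actual container-executor.c fix (that code is C and runs setuid); the class and paths below are made-up examples.
{code}
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RaceTolerantMkdir {
  // Create the directory unconditionally and treat "already exists" as
  // success, so losing the race to another container is harmless.
  static void ensureDir(Path dir) throws IOException {
    try {
      Files.createDirectory(dir);
    } catch (FileAlreadyExistsException e) {
      if (!Files.isDirectory(dir)) {
        throw e;  // something that is not a directory occupies the path: real error
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path appDir = Paths.get(System.getProperty("java.io.tmpdir"), "usercache-demo");
    ensureDir(appDir);
    ensureDir(appDir);  // the second call is a no-op instead of a failure
  }
}
{code}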
[jira] [Commented] (YARN-2707) Potential null dereference in FSDownload
[ https://issues.apache.org/jira/browse/YARN-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178026#comment-14178026 ] Hadoop QA commented on YARN-2707: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676034/YARN-2707.v01.patch against trunk revision 171f237. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5479//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5479//console This message is automatically generated. Potential null dereference in FSDownload Key: YARN-2707 URL: https://issues.apache.org/jira/browse/YARN-2707 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Gera Shegalov Priority: Minor Attachments: YARN-2707.v01.patch Here is related code in call(): {code} Pattern pattern = null; String p = resource.getPattern(); if (p != null) { pattern = Pattern.compile(p); } unpack(new File(dTmp.toUri()), new File(dFinal.toUri()), pattern); {code} In unpack(): {code} RunJar.unJar(localrsrc, dst, pattern); {code} unJar() would dereference the pattern without checking whether it is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
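As a hedged, self-contained sketch of the kind of guard the issue calls for (the helper name is invented and this is not FSDownload's actual code): map a missing pattern to a match-everything pattern before anything dereferences it.
{code}
import java.util.regex.Pattern;

class PatternGuardDemo {
  // Turn a possibly-null pattern string into a non-null Pattern so downstream
  // code can dereference it safely; ".*" preserves the "unpack everything" default.
  static Pattern patternOrMatchAll(String p) {
    return p != null ? Pattern.compile(p) : Pattern.compile(".*");
  }

  public static void main(String[] args) {
    System.out.println(patternOrMatchAll(null).matcher("lib/foo.jar").matches());            // true
    System.out.println(patternOrMatchAll("lib/.*").matcher("classes/Bar.class").matches());  // false
  }
}
{code}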
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Attachment: YARN-2198.15.patch .15.patch rebased to current trunk, .gitignore conflict resolved Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM running as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low-privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC, etc. My proposal, though, would be to use Windows LPC (Local Procedure Calls), which is a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils, which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication, and the privileged NT service can use the authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178057#comment-14178057 ] Hadoop QA commented on YARN-2198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676038/YARN-2198.15.patch against trunk revision 171f237. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5480//console This message is automatically generated. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2279) Add UTs to cover timeline server authentication
[ https://issues.apache.org/jira/browse/YARN-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2279: - Assignee: Zhijie Shen Add UTs to cover timeline server authentication --- Key: YARN-2279 URL: https://issues.apache.org/jira/browse/YARN-2279 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: test Attachments: YARN-2279.1.patch Currently, timeline server authentication is lacking unit tests. We have to verify each incremental patch manually. It's good to add some unit tests here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2279) Add UTs to cover timeline server authentication
[ https://issues.apache.org/jira/browse/YARN-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2279: -- Attachment: YARN-2279.1.patch After refactoring the authentication code, the end-to-end test has already been added. Use this JIRA to attach a patch that enhances the test coverage: 1. Verify the put domain API 2. Verify the encrypted channel. Add UTs to cover timeline server authentication --- Key: YARN-2279 URL: https://issues.apache.org/jira/browse/YARN-2279 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Labels: test Attachments: YARN-2279.1.patch Currently, timeline server authentication is lacking unit tests. We have to verify each incremental patch manually. It's good to add some unit tests here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Attachment: (was: YARN-2198.15.patch) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, YARN-2198.14.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Attachment: YARN-2198.15.patch Reload .15.patch with TestLCE fix Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2279) Add UTs to cover timeline server authentication
[ https://issues.apache.org/jira/browse/YARN-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178099#comment-14178099 ] Hadoop QA commented on YARN-2279: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676042/YARN-2279.1.patch against trunk revision 171f237. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5481//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5481//console This message is automatically generated. Add UTs to cover timeline server authentication --- Key: YARN-2279 URL: https://issues.apache.org/jira/browse/YARN-2279 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: test Attachments: YARN-2279.1.patch Currently, timeline server authentication is lacking unit tests. We have to verify each incremental patch manually. It's good to add some unit tests here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178125#comment-14178125 ] Hadoop QA commented on YARN-2198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676048/YARN-2198.15.patch against trunk revision 171f237. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5482//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5482//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5482//console This message is automatically generated. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. 
The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178218#comment-14178218 ] Remus Rusanu commented on YARN-2198: The 2 new hadoop-common Findbugs are unrelated to the patch: - Inconsistent synchronization of org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.delegationTokenSequenceNumber; locked 71% of time - Dereference of the result of readLine() without nullcheck in org.apache.hadoop.tracing.SpanReceiverHost.getUniqueLocalTraceFileName() Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Affects Version/s: (was: 2.4.0) 2.5.1 Support bandwidth enforcement for containers while reading from HDFS Key: YARN-2681 URL: https://issues.apache.org/jira/browse/YARN-2681 Project: Hadoop YARN Issue Type: New Feature Components: capacityscheduler, nodemanager, resourcemanager Affects Versions: 2.5.1 Environment: Linux Reporter: cntic Attachments: Traffic Control Design.png To read/write data from HDFS on a data node, applications establish TCP/IP connections with the datanode. HDFS reads can be controlled by configuring the Linux Traffic Control (TC) subsystem on the data node to apply filters to the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be set on the container node (an HDFS read is incoming data for the container) - Since the HDFS data node is a single process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
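To make task 3 concrete, here is a rough sketch of the kind of egress-shaping rules such an enforcer might emit on the datanode. The device name, rates, class ids and the exact tc grammar are illustrative assumptions, not the attached patch's implementation; the program only builds and prints the commands.
{code}
import java.util.List;

class TrafficControlRuleDemo {
  // Build (but do not execute) example tc/htb commands that cap egress from
  // the datanode toward one container's address:port pair.
  static List<String> egressCapCommands(String dev, String containerAddr,
      int containerPort, String rate) {
    return List.of(
        "tc qdisc add dev " + dev + " root handle 1: htb default 10",
        "tc class add dev " + dev + " parent 1: classid 1:10 htb rate 10gbit",
        "tc class add dev " + dev + " parent 1: classid 1:20 htb rate " + rate,
        "tc filter add dev " + dev + " parent 1: protocol ip prio 1 u32"
            + " match ip dst " + containerAddr + "/32"
            + " match ip dport " + containerPort + " 0xffff flowid 1:20");
  }

  public static void main(String[] args) {
    egressCapCommands("eth0", "10.0.0.5", 45678, "50mbit").forEach(System.out::println);
  }
}
{code}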
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Attachment: HADOOP-2681.patch Support bandwidth enforcement for containers while reading from HDFS Key: YARN-2681 URL: https://issues.apache.org/jira/browse/YARN-2681 Project: Hadoop YARN Issue Type: New Feature Components: capacityscheduler, nodemanager, resourcemanager Affects Versions: 2.5.1 Environment: Linux Reporter: cntic Attachments: HADOOP-2681.patch, Traffic Control Design.png To read/write data from HDFS on a data node, applications establish TCP/IP connections with the datanode. HDFS reads can be controlled by configuring the Linux Traffic Control (TC) subsystem on the data node to apply filters to the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be set on the container node (an HDFS read is incoming data for the container) - Since the HDFS data node is a single process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178309#comment-14178309 ] Hudson commented on YARN-2673: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #719 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/719/]) YARN-2673. Made timeline client put APIs retry if ConnectException happens. Contributed by Li Lu. (zjshen: rev 89427419a3c5eaab0f73bae98d675979b9efab5f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
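For illustration, a minimal sketch of the retry-on-ConnectException behavior the commit message describes. The helper name, signature and fixed sleep below are invented; the real policy and its configuration keys live in TimelineClientImpl and yarn-default.xml as touched by the commit.
{code}
import java.net.ConnectException;
import java.util.concurrent.Callable;

class RetryOnConnectDemo {
  // Retry the operation up to maxRetries extra times, sleeping between attempts,
  // but only when the failure is a ConnectException (e.g. the server is not up yet).
  static <T> T callWithRetries(Callable<T> op, int maxRetries, long intervalMs)
      throws Exception {
    ConnectException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return op.call();
      } catch (ConnectException e) {
        last = e;
        Thread.sleep(intervalMs);
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    String reply = callWithRetries(() -> "ok", 3, 100L);  // trivially succeeds on the first try
    System.out.println(reply);
  }
}
{code}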
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178305#comment-14178305 ] Hudson commented on YARN-2701: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #719 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/719/]) YARN-2701. Potential race condition in startLocalizer when using LinuxContainerExecutor. Contributed by Xuan Gong (jianhe: rev 2839365f230165222f63129979ea82ada79ec56e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java Missing file for YARN-2701 (jianhe: rev 4fa1fb3193bf39fcb1bd7f8f8391a78f69c3c302) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockContainerLocalizer.java Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178310#comment-14178310 ] Hudson commented on YARN-2717: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #719 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/719/]) YARN-2717. Avoided duplicate logging when container logs are not found. Contributed by Xuan Gong. (zjshen: rev 171f2376d23d51b61b9c9b3804ee86dbd4de033a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java * hadoop-yarn-project/CHANGES.txt containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178303#comment-14178303 ] Hudson commented on YARN-1879: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #719 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/719/]) Missing file for YARN-1879 (jianhe: rev 4a78a752286effbf1a0d8695325f9d7464a09fb4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestApplicationMasterServiceProtocolOnHA.java Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Fix For: 2.6.0 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178322#comment-14178322 ] Hadoop QA commented on YARN-2681: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676069/HADOOP-2681.patch against trunk revision 171f237. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 20 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5483//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5483//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5483//console This message is automatically generated. Support bandwidth enforcement for containers while reading from HDFS Key: YARN-2681 URL: https://issues.apache.org/jira/browse/YARN-2681 Project: Hadoop YARN Issue Type: New Feature Components: capacityscheduler, nodemanager, resourcemanager Affects Versions: 2.5.1 Environment: Linux Reporter: cntic Attachments: HADOOP-2681.patch, Traffic Control Design.png To read/write data from HDFS on data node, applications establise TCP/IP connections with the datanode. The HDFS read can be controled by setting Linux Traffic Control (TC) subsystem on the data node to make filters on appropriate connections. The current cgroups net_cls concept can not be applied on the node where the container is launched, netheir on data node since: - TC hanldes outgoing bandwidth only, so it can be set on container node (HDFS read = incoming data for the container) - Since HDFS data node is handled by only one process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend Resource model to define bandwidth enforcement rate 2) Monitor TCP/IP connection estabilised by container handling process and its child processes 3) Set Linux Traffic Control rules on data node base on address:port pairs in order to enforce bandwidth of outgoing data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178422#comment-14178422 ] Hudson commented on YARN-1879: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1908 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1908/]) Missing file for YARN-1879 (jianhe: rev 4a78a752286effbf1a0d8695325f9d7464a09fb4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestApplicationMasterServiceProtocolOnHA.java Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Fix For: 2.6.0 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178424#comment-14178424 ] Hudson commented on YARN-2701: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1908 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1908/]) YARN-2701. Potential race condition in startLocalizer when using LinuxContainerExecutor. Contributed by Xuan Gong (jianhe: rev 2839365f230165222f63129979ea82ada79ec56e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java Missing file for YARN-2701 (jianhe: rev 4fa1fb3193bf39fcb1bd7f8f8391a78f69c3c302) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockContainerLocalizer.java Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178429#comment-14178429 ] Hudson commented on YARN-2717: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1908 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1908/]) YARN-2717. Avoided duplicate logging when container logs are not found. Contributed by Xuan Gong. (zjshen: rev 171f2376d23d51b61b9c9b3804ee86dbd4de033a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java * hadoop-yarn-project/CHANGES.txt containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178427#comment-14178427 ] Hudson commented on YARN-2582: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1908 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1908/]) YARN-2582. Fixed Log CLI and Web UI for showing aggregated logs of LRS. Contributed Xuan Gong. (zjshen: rev e90718fa5a0e7c18592af61534668acebb9db51b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/LogsCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestLogsCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/logaggregation/TestAggregatedLogsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogAggregationUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/log/AggregatedLogsBlock.java Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have changed the log layout to support log aggregation for Long Running Services. The log CLI and the related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178428#comment-14178428 ] Hudson commented on YARN-2673: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1908 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1908/]) YARN-2673. Made timeline client put APIs retry if ConnectException happens. Contributed by Li Lu. (zjshen: rev 89427419a3c5eaab0f73bae98d675979b9efab5f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178494#comment-14178494 ] Hudson commented on YARN-2701: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1933/]) YARN-2701. Potential race condition in startLocalizer when using LinuxContainerExecutor. Contributed by Xuan Gong (jianhe: rev 2839365f230165222f63129979ea82ada79ec56e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java * hadoop-yarn-project/CHANGES.txt Missing file for YARN-2701 (jianhe: rev 4fa1fb3193bf39fcb1bd7f8f8391a78f69c3c302) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockContainerLocalizer.java Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178497#comment-14178497 ] Hudson commented on YARN-2582: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1933/]) YARN-2582. Fixed Log CLI and Web UI for showing aggregated logs of LRS. Contributed Xuan Gong. (zjshen: rev e90718fa5a0e7c18592af61534668acebb9db51b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/LogsCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestLogsCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/logaggregation/TestAggregatedLogsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/log/AggregatedLogsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogAggregationUtils.java * hadoop-yarn-project/CHANGES.txt Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178499#comment-14178499 ] Hudson commented on YARN-2717: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1933/]) YARN-2717. Avoided duplicate logging when container logs are not found. Contributed by Xuan Gong. (zjshen: rev 171f2376d23d51b61b9c9b3804ee86dbd4de033a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178492#comment-14178492 ] Hudson commented on YARN-1879: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1933/]) Missing file for YARN-1879 (jianhe: rev 4a78a752286effbf1a0d8695325f9d7464a09fb4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestApplicationMasterServiceProtocolOnHA.java Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Fix For: 2.6.0 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2719) Windows: Wildcard classpath variables not expanded against resources contained in archives
Craig Welch created YARN-2719: - Summary: Windows: Wildcard classpath variables not expanded against resources contained in archives Key: YARN-2719 URL: https://issues.apache.org/jira/browse/YARN-2719 Project: Hadoop YARN Issue Type: Bug Reporter: Craig Welch Assignee: Craig Welch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
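For context, a minimal sketch of the "classpath manifest jar" mechanism the description refers to: an otherwise empty jar whose MANIFEST.MF Class-Path attribute carries the expanded classpath entries. The file names are examples and this is not the NodeManager's actual implementation.
{code}
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

class ClasspathJarDemo {
  // Write a jar containing nothing but a manifest whose Class-Path attribute
  // lists the (space-separated, relative) classpath entries.
  static void writeClasspathJar(String jarPath, String... entries) throws IOException {
    Manifest manifest = new Manifest();
    Attributes attrs = manifest.getMainAttributes();
    attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
    attrs.put(Attributes.Name.CLASS_PATH, String.join(" ", entries));
    try (JarOutputStream jos = new JarOutputStream(new FileOutputStream(jarPath), manifest)) {
      // no jar entries are needed: the manifest is the payload
    }
  }

  public static void main(String[] args) throws IOException {
    writeClasspathJar("classpath.jar", "lib/a.jar", "lib/b.jar", "classes/");
  }
}
{code}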
[jira] [Created] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
Craig Welch created YARN-2720: - Summary: Windows: Wildcard classpath variables not expanded against resources contained in archives Key: YARN-2720 URL: https://issues.apache.org/jira/browse/YARN-2720 Project: Hadoop YARN Issue Type: Bug Reporter: Craig Welch Assignee: Craig Welch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Add Cgroup support for RedHat 7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178586#comment-14178586 ] Wei Yan commented on YARN-2194: --- Thanks for your comments, [~beckham007]. bq. startSystemdSlice/stopSystemdSlice needs root privilege? Yes, systemctl start/stop slice needs root privilege. bq. Let container-executor to run sudo systemctl start ? You mean adding start/stop slice functions to the container-executor, and letting SystemdLCEResourceHandler invoke these functions? Add Cgroup support for RedHat 7 --- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2194-1.patch In previous versions of RedHat, we could build custom cgroup hierarchies with the cgconfig command from the libcgroup package. As of RedHat 7, the libcgroup package is deprecated and its use is not recommended since it can easily create conflicts with the default cgroup hierarchy. systemd is provided and recommended for cgroup management. We need to add support for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
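A rough sketch of what "have a privileged helper (or a sudo rule) start/stop a slice" could look like from the Java side; the slice name, the use of sudo, and the helper class itself are assumptions for illustration, not the design in the attached patch.
{code}
import java.io.IOException;

class SystemdSliceDemo {
  // Start or stop a systemd slice via sudo; this requires an appropriate
  // sudoers entry, since systemctl start/stop needs root privilege.
  static int runSystemctl(String verb, String sliceName)
      throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder("sudo", "systemctl", verb, sliceName);
    pb.inheritIO();
    return pb.start().waitFor();
  }

  public static void main(String[] args) throws Exception {
    System.exit(runSystemctl("start", "hadoop-yarn-demo.slice"));
  }
}
{code}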
[jira] [Updated] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated YARN-2720: Component/s: nodemanager Target Version/s: 2.6.0 Windows: Wildcard classpath variables not expanded against resources contained in archives -- Key: YARN-2720 URL: https://issues.apache.org/jira/browse/YARN-2720 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Craig Welch Assignee: Craig Welch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178608#comment-14178608 ] Naganarasimha G R commented on YARN-2495: - Hi [~aw], bq. I don't think you understand the use case at all. In fact, it's clear you need to re-read the sample script. It does not get updated with every new JDK. It's smart enough to update the label regardless of the JDK that is installed... I meant that the script needs to be modified when a new label set is introduced. For example, if the admin has currently configured JDK labels and later wants to add a label for some native library version, *the admin will know all the valid native library versions in the system (or this list can be gathered automatically), so while modifying the script he will be able to configure the valid labels too*. bq. which means the only friction point for operations is going to be updating this 'valid label list' on the RM. Maintenance-wise it might become difficult. For example, once the valid JDK labels are loaded, admins will forget about this feature; later on, based on requirements, some other admin might update the JDK without being aware that such a script exists which updates the labels based on the JDK or native library version. So he might miss updating the valid labels, and that node might not be usable, or wrong labels will be tagged to it because the new labels were not added. So I feel Allen's scenario needs to be addressed. As [~Wangda] suggested, I feel centralized label validation can be made configurable. Please provide your opinion on this. Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2719) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch resolved YARN-2719. --- Resolution: Duplicate Windows: Wildcard classpath variables not expanded against resources contained in archives -- Key: YARN-2719 URL: https://issues.apache.org/jira/browse/YARN-2719 Project: Hadoop YARN Issue Type: Bug Reporter: Craig Welch Assignee: Craig Welch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2720: -- Attachment: YARN-2720.2.patch Patch which tracks unexpanded wildcard classpath entries and adds them to the final classpath which is used when the container is launched Windows: Wildcard classpath variables not expanded against resources contained in archives -- Key: YARN-2720 URL: https://issues.apache.org/jira/browse/YARN-2720 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2720.2.patch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-90: --- Summary: NodeManager should identify failed disks becoming good again (was: NodeManager should identify failed disks becoming good back again) +1 latest patch lgtm as well. Committing this. NodeManager should identify failed disks becoming good again Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178708#comment-14178708 ] Allen Wittenauer commented on YARN-2701: So let's summarize the current situation: * previous code had a potential race condition. this is bad. * reverting the code broke the portability for OSes that don't yet support the relatively new \*at routines. this is equally bad, as it breaks a significant segment of the developer community. There is a middle ground here that solves both of these problems: introduce \*at routines as a compile-time dependency. We should be able to detect if the current libc has mkdirat, opendirat, etc and, if not, compile our own in from sources like Free/Net/OpenBSD's implementation. Let's revert the revert, then build a new patch that does the above. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178714#comment-14178714 ] Hudson commented on YARN-90: FAILURE: Integrated in Hadoop-trunk-Commit #6301 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6301/]) YARN-90. NodeManager should identify failed disks becoming good again. Contributed by Varun Vasudev (jlowe: rev 6f2028bd1514d90b831f889fd0ee7f2ba5c15000) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDirectoryCollection.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/TestNonAggregatingLogHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java NodeManager should identify failed disks becoming good again Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2694) Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2694: - Attachment: YARN-2694-20141021-1.patch Attached updated patch fixed test failure of TestContainerAllocation Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch, YARN-2694-20141021-1.patch Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest with multiple node labels will make user limit computation becomes tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2720: -- Attachment: YARN-2720.3.patch It's hacky and not always possible to pass back the additional classpath info using the environment, so change the createJarWithClassPath signature to return an array Windows: Wildcard classpath variables not expanded against resources contained in archives -- Key: YARN-2720 URL: https://issues.apache.org/jira/browse/YARN-2720 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2720.2.patch, YARN-2720.3.patch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
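A simplified sketch of the shape described in this comment: rather than smuggling the leftover wildcard entries back through the environment, return them alongside the classpath-jar path so the launcher can append them to the container's final classpath. Names and structure here are illustrative, not the actual FileUtil.createJarWithClassPath signature.
{code}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ClasspathJarResultSketch {

  // Returns a two-element array: [0] = path of the classpath jar to create,
  // [1] = wildcard entries that could not be expanded yet (for example,
  // archives not linked into place when the jar is built), to be appended to
  // the container's classpath at launch time.
  static String[] buildClasspathJar(String inputClassPath, String jarPath) {
    List<String> expanded = new ArrayList<String>();
    List<String> unexpandedWildcards = new ArrayList<String>();

    for (String entry : inputClassPath.split(File.pathSeparator)) {
      if (entry.endsWith("*")) {
        File dir = new File(entry.substring(0, entry.length() - 1));
        File[] matches = dir.isDirectory() ? dir.listFiles() : null;
        if (matches == null || matches.length == 0) {
          // Nothing on disk yet: keep the wildcard so the launcher can still
          // put it on the final classpath.
          unexpandedWildcards.add(entry);
          continue;
        }
        for (File match : matches) {
          expanded.add(match.getAbsolutePath());
        }
      } else {
        expanded.add(entry);
      }
    }
    // A real implementation would now write `expanded` into the manifest
    // Class-Path of the jar at jarPath (as in the manifest sketch earlier).
    return new String[] {jarPath, String.join(File.pathSeparator, unexpandedWildcards)};
  }
}
{code}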
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178761#comment-14178761 ] Vinod Kumar Vavilapalli commented on YARN-2715: --- Quick comments on the patch - Similar to the configuration that you changed, we also need to get rid of RM_WEBAPP_DELEGATION_TOKEN_AUTH_FILTER (is this a compatible change?)? - ResourceManager.startWepApp() used to allow loading common auth-filter, we now require our custom filter for various reasons? /cc [~vvasudev] - Test-case: -- There's a ref to timeline-service in the patch -- No need to start the entire mini-cluster - starting RM is enough? Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
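A hedged sketch of the fix direction named in the issue description (source hadoop.proxyuser first, then let yarn.resourcemanager.webapp.proxyuser override), using only Hadoop's long-standing Configuration iteration API. The helper name is hypothetical; this is not the YARN-2715 patch.
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class ProxyUserConfSketch {

  // Computes the effective webapp proxy-user properties: seed them from the
  // common hadoop.proxyuser.* settings, then let any explicit
  // yarn.resourcemanager.webapp.proxyuser.* values override. This way an
  // absent webapp prefix no longer wipes out the RPC-side settings.
  static Map<String, String> effectiveWebappProxyUserProps(Configuration conf) {
    Map<String, String> result = new HashMap<String, String>();
    for (Map.Entry<String, String> e : conf) {
      if (e.getKey().startsWith("hadoop.proxyuser.")) {
        result.put("yarn.resourcemanager.webapp.proxyuser."
            + e.getKey().substring("hadoop.proxyuser.".length()), e.getValue());
      }
    }
    for (Map.Entry<String, String> e : conf) {
      if (e.getKey().startsWith("yarn.resourcemanager.webapp.proxyuser.")) {
        result.put(e.getKey(), e.getValue());
      }
    }
    return result;
  }
}
{code}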
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178762#comment-14178762 ] Allen Wittenauer commented on YARN-2495: bq. while modifying the script he will be able to configure the valid labels too. The script can be updated *independently* of changing the running configuration files. Changing the XML configs will also require a *coordinated* reconfigure of the RM. That isn't realistic, especially for things such as rolling upgrades. HA RM, of course, makes the situation even worse. Additionally, I'm sure the label validation code will spam the RM logs every time it gets an invalid label, which is pretty much a 'please fill the log directory' action. The *only* scenario I can think of where label validation has a practical use is if AMs and/or containers are allowed to inject labels. But that should be a different control structure altogether and have zero impact on administrator-controlled labels. bq. Maintenance-wise it might become difficult. For example, Label validation actually makes your example worse because now the labels disappear completely. Is it a problem with the script or is it a problem with the label definition? bq. I feel centralized label validation can be made configurable. Please provide your opinion on this. Just disable it completely. I'm still waiting to hear what practical application this bug would have. Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2720: -- Attachment: YARN-2720.4.patch Updated version with unit test modification and white space fixes Windows: Wildcard classpath variables not expanded against resources contained in archives -- Key: YARN-2720 URL: https://issues.apache.org/jira/browse/YARN-2720 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178794#comment-14178794 ] Kannan Rajah commented on YARN-2468: Xuan, I looked through the code changes and have a question about uploading logs for unfinished containers. Let's say we have already uploaded syslog for a container at time T1. At time T2, the container is still running and when the log aggregation is triggered again, will it re-upload the same syslog file? That seems to be the case. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2468.1.patch, YARN-2468.10.patch, YARN-2468.11.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated YARN-2720: Hadoop Flags: Reviewed +1 for the patch, pending Jenkins run. I've verified that this works in my environment with a few test runs. Thank you for fixing this, Craig. Windows: Wildcard classpath variables not expanded against resources contained in archives -- Key: YARN-2720 URL: https://issues.apache.org/jira/browse/YARN-2720 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178809#comment-14178809 ] Xuan Gong commented on YARN-2468: - [~rkannan82] bq. Xuan, I looked through the code changes and have a question about uploading logs for unfinished containers. Let's say we have already uploaded syslog for a container at time T1. At time T2, the container is still running and when the log aggregation is triggered again, will it re-upload the same syslog file? That seems to be the case. It will not. Every time after we do the log aggregation, we save the information for each aggregated log file as (containerId.toString() + "_" + file.getName() + "_" + file.lastModified()). So, in the next run, before we start to upload logs, we check whether the log file already exists in the saved aggregated-log-file cache (uploadedFileMeta in AppLogAggregatorImpl); if it exists, we skip it. Otherwise, we upload it. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2468.1.patch, YARN-2468.10.patch, YARN-2468.11.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
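A minimal sketch of the skip-if-already-uploaded check described above, keyed by container id, file name, and last-modified time. Field and class names here are illustrative, not the actual AppLogAggregatorImpl code.
{code}
import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class UploadedLogCacheSketch {

  // Keys of log files already shipped in earlier aggregation cycles.
  private final Set<String> uploadedFileMeta = new HashSet<String>();

  private static String key(String containerId, File file) {
    return containerId + "_" + file.getName() + "_" + file.lastModified();
  }

  // Returns true if the file should be uploaded in this cycle; add() returns
  // false when exactly the same (unmodified) file was uploaded before.
  public boolean markForUpload(String containerId, File logFile) {
    return uploadedFileMeta.add(key(containerId, logFile));
  }
}
{code}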
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2709: Attachment: YARN-2709-102114.patch Hi [~zjshen], I updated my patch according to your comments. Specifically: 1. TimelineClientConnectionRetry is only used in tests, so I added a @VisibleForTesting tag to it. I set the other two to private. 2, 4, 7. Fixed. 3, 6. I changed the unit test code to use Kerberos, and now the mock is not necessary, so I merged getDelegationTokenInternal into run(). 5. Fixed, but I thought you meant TestTimelineClient? (I didn't add any new imports in TimelineClientImpl.) Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
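For context, a generic sketch of the retry behavior this patch adds: retry the timeline operation at a fixed interval while the server is unreachable, giving up after a configured number of attempts. Class and field names are illustrative and do not mirror the patch's TimelineClientConnectionRetry.
{code}
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.Callable;

public class TimelineRetrySketch {

  private final int maxRetries;      // a negative value could mean "retry forever"
  private final long retryIntervalMs;

  public TimelineRetrySketch(int maxRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  // Runs the operation, sleeping and retrying while the server is unreachable.
  public <T> T retryOn(Callable<T> op) throws Exception {
    int attempts = 0;
    while (true) {
      try {
        return op.call();
      } catch (ConnectException e) {
        attempts++;
        if (maxRetries >= 0 && attempts > maxRetries) {
          throw new IOException(
              "Timeline server still unreachable after " + attempts + " attempts", e);
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
{code}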
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178864#comment-14178864 ] Kannan Rajah commented on YARN-2468: Thanks. But what about the case where the file was modified. Let's say 10 more lines were added to the syslog file. Doesn't it upload the full file again? Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2468.1.patch, YARN-2468.10.patch, YARN-2468.11.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178869#comment-14178869 ] Xuan Gong commented on YARN-2468: - bq. Thanks. But what about the case where the file was modified. Let's say 10 more lines were added to the syslog file. Doesn't it upload the full file again? This is a precondition: we rely on the user's logging framework (such as log4j) to roll the logs over, and users need to set up their logging framework properly. On our side, we upload every log in our log dirs. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2468.1.patch, YARN-2468.10.patch, YARN-2468.11.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178867#comment-14178867 ] Vinod Kumar Vavilapalli commented on YARN-2715: --- More comments - AdminService should also start recognizing the new proxyuser prefix. It will be useful to refactor the proxy-user handling into a common method that is used in both RM and AdminService. - Add comments in RMAuthenticationFilterInitializer about why we are having special handling of proxy-users. Please ignore my other comments about filter. We need to fix them separately, outlining below - Similar to the configuration that you changed, we also need to get rid of RM_WEBAPP_DELEGATION_TOKEN_AUTH_FILTER (is this a compatible change?)? - ResourceManager.startWepApp() used to allow loading common auth-filter, we now require our custom filter for various reasons? /cc Varun Vasudev - RMAuthenticationFilterInitializer.configPrefix doesn't need to be a class variable. - RMAuthenticationFilterInitializer and TimelineAuthenticationFilterInitializer share a lot of code, they can be refactored. I'll file tickets for the above. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2694) Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178885#comment-14178885 ] Hadoop QA commented on YARN-2694: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676130/YARN-2694-20141021-1.patch against trunk revision 6f2028b. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5484//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5484//console This message is automatically generated. Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch, YARN-2694-20141021-1.patch Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest with multiple node labels will make user limit computation becomes tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178894#comment-14178894 ] Hadoop QA commented on YARN-2720: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676140/YARN-2720.4.patch against trunk revision 6f2028b. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5485//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5485//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5485//console This message is automatically generated. Windows: Wildcard classpath variables not expanded against resources contained in archives -- Key: YARN-2720 URL: https://issues.apache.org/jira/browse/YARN-2720 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178897#comment-14178897 ] Chris Nauroth commented on YARN-2720: - The Findbugs warnings are unrelated. I'll commit this. Windows: Wildcard classpath variables not expanded against resources contained in archives -- Key: YARN-2720 URL: https://issues.apache.org/jira/browse/YARN-2720 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178921#comment-14178921 ] Hadoop QA commented on YARN-2709: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676151/YARN-2709-102114.patch against trunk revision 4e134a0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5486//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5486//console This message is automatically generated. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178926#comment-14178926 ] Hudson commented on YARN-2720: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6303 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6303/]) YARN-2720. Windows: Wildcard classpath variables not expanded against resources contained in archives. Contributed by Craig Welch. (cnauroth: rev 6637e3cf95b3a9be8d6b9cd66bc849a0607e8ed5) * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Classpath.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFileUtil.java Windows: Wildcard classpath variables not expanded against resources contained in archives -- Key: YARN-2720 URL: https://issues.apache.org/jira/browse/YARN-2720 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.6.0 Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch On windows there are limitations to the length of command lines and environment variables which prevent placing all classpath resources into these elements. Instead, a jar containing only a classpath manifest is created to provide the classpath. During this process wildcard references are expanded by inspecting the filesystem. Since archives are extracted to a different location and linked into the final location after the classpath jar is created, resources referred to via wildcards which exist in localized archives (.zip, tar.gz) are not added to the classpath manifest jar. Since these entries are removed from the final classpath for the container they are not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178992#comment-14178992 ] Zhijie Shen commented on YARN-2715: --- Vinod, thanks for the comments. I agree with them; let's do the filter refactoring in a separate JIRA. Here are the responses to the comments related to this one. bq. Test-case: Fixed the issues in the test cases and moved them into the RM submodule. bq. AdminService should also start recognizing the new proxyuser prefix. Refactored the code that processes the RM proxy-user configs, and made both RM and AdminService refer to it. In AdminService, the refresh request now also sources yarn-site.xml for RM-specific configs. bq. Add comments in RMAuthenticationFilterInitializer about why we are having special handling of proxy-users. Added a comment there. I uploaded a new patch accordingly. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2715: -- Attachment: YARN-2715.3.patch Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179001#comment-14179001 ] Kannan Rajah commented on YARN-2468: Makes sense. Sorry, but I have just one last question, not completely relevant to this JIRA though. Is there any ongoing effort to write the logs directly to HDFS instead of this two-phase approach? If not, can you point out the reasons? The work being done to take care of the lifecycle of these logs seems fairly complex and also potentially adds performance overhead to the cluster. So I am interested in understanding the rationale. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2468.1.patch, YARN-2468.10.patch, YARN-2468.11.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
Jian He created YARN-2721: - Summary: Race condition: ZKRMStateStore retry logic may throw NodeExist exception Key: YARN-2721 URL: https://issues.apache.org/jira/browse/YARN-2721 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Blindly retrying operations in zookeeper will not work for non-idempotent operations (like create znode). The reason is that the client can do a create znode, but the response may not be returned because the server can die or timeout. In case of retrying the create znode, it will throw a NODE_EXISTS exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179067#comment-14179067 ] Jian He commented on YARN-2721: --- Curator should handle the retry properly which is addressed in YARN-2716. As a temporary fix, we can simply ignore the potential NodeExist exception for now. Creating a patch. Race condition: ZKRMStateStore retry logic may throw NodeExist exception - Key: YARN-2721 URL: https://issues.apache.org/jira/browse/YARN-2721 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Blindly retrying operations in zookeeper will not work for non-idempotent operations (like create znode). The reason is that the client can do a create znode, but the response may not be returned because the server can die or timeout. In case of retrying the create znode, it will throw a NODE_EXISTS exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
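A minimal sketch of the temporary fix described here, using the plain ZooKeeper client API: treat NODE_EXISTS on a (possibly retried) create as success, since the znode may have been created by an earlier attempt whose response was lost. This is an illustration only, not the actual ZKRMStateStore change.
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CreateIgnoringNodeExists {

  // Creates the znode, treating NODE_EXISTS as success: on a retried create,
  // the node may already have been created by an earlier attempt whose
  // response was lost.
  static void createIfAbsent(ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // Likely our own earlier create; safe to ignore for now (see YARN-2716
      // for the proper Curator-based handling).
    }
  }
}
{code}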
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179107#comment-14179107 ] Zhijie Shen commented on YARN-2709: --- Almost good to me. Some nits: 1. Can you add a comment to say the following config is to bypass the issue in HADOOP-11215. {code} conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, kerberos); {code} 2. For both retry test cases, set newMaxRetries to 5 and newIntervalMs to 500? Make sure it's able to retry multiple times? {code} int newMaxRetries = 1; long newIntervalMs = 1500; {coe} 3. token is an unused var {code} TokenTimelineDelegationTokenIdentifier token = client.getDelegationToken( UserGroupInformation.getCurrentUser().getShortUserName()); {code} 4. You can directly change connectionRetry to default visibility (no private modifier) because the test class is in the same package, and mark it @VisibleForTesting. {code} @Private @VisibleForTesting public TimelineClientConnectionRetry getConnectionRetry() { return connectionRetry; } {code} 5. Retried is not thread safe, but it should be fine if it is not used for unit test. Would you please add a comment? {code} // Indicates if retries happened last time @Private @VisibleForTesting public boolean retried = false; {code} Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179107#comment-14179107 ] Zhijie Shen edited comment on YARN-2709 at 10/21/14 9:10 PM: - Almost good to me. Some nits: 1. Can you add a comment to say the following config is to bypass the issue in HADOOP-11215. {code} conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, kerberos); {code} 2. For both retry test cases, set newMaxRetries to 5 and newIntervalMs to 500? Make sure it's able to retry multiple times? {code} int newMaxRetries = 1; long newIntervalMs = 1500; {code} 3. token is an unused var {code} TokenTimelineDelegationTokenIdentifier token = client.getDelegationToken( UserGroupInformation.getCurrentUser().getShortUserName()); {code} 4. You can directly change connectionRetry to default visibility (no private modifier) because the test class is in the same package, and mark it \@VisibleForTesting. {code} @Private @VisibleForTesting public TimelineClientConnectionRetry getConnectionRetry() { return connectionRetry; } {code} 5. Retried is not thread safe, but it should be fine if it is not used for unit test. Would you please add a comment? {code} // Indicates if retries happened last time @Private @VisibleForTesting public boolean retried = false; {code} was (Author: zjshen): Almost good to me. Some nits: 1. Can you add a comment to say the following config is to bypass the issue in HADOOP-11215. {code} conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, kerberos); {code} 2. For both retry test cases, set newMaxRetries to 5 and newIntervalMs to 500? Make sure it's able to retry multiple times? {code} int newMaxRetries = 1; long newIntervalMs = 1500; {coe} 3. token is an unused var {code} TokenTimelineDelegationTokenIdentifier token = client.getDelegationToken( UserGroupInformation.getCurrentUser().getShortUserName()); {code} 4. You can directly change connectionRetry to default visibility (no private modifier) because the test class is in the same package, and mark it @VisibleForTesting. {code} @Private @VisibleForTesting public TimelineClientConnectionRetry getConnectionRetry() { return connectionRetry; } {code} 5. Retried is not thread safe, but it should be fine if it is not used for unit test. Would you please add a comment? {code} // Indicates if retries happened last time @Private @VisibleForTesting public boolean retried = false; {code} Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2709: Attachment: YARN-2709-102114-1.patch Hi [~zjshen], I've addressed your comments in this patch. If you have time please feel free to have a look at it. Thanks! Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114-1.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179162#comment-14179162 ] Hadoop QA commented on YARN-2715: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676165/YARN-2715.3.patch against trunk revision ac56b06. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5487//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5487//console This message is automatically generated. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179165#comment-14179165 ] Craig Welch commented on YARN-2505: --- [~ksumit] I do like the idea of being able to add node label(s) to multiple nodes, it seems natural/like it would be useful. I would think that should be a post with a (list of) label(s) and a list of node ids. I'm not sure if it will go in the first iteration but it makes sense to me to have it. I don't think there's a compelling purpose at the moment for a node label type, it's a string/textual label and I think it is sensible to just model it as such. Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179213#comment-14179213 ] Hadoop QA commented on YARN-2709: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676183/YARN-2709-102114-1.patch against trunk revision 4baca31. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5489//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5489//console This message is automatically generated. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114-1.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179218#comment-14179218 ] Hadoop QA commented on YARN-2721: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676172/YARN-2721.1.patch against trunk revision b85919f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5488//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5488//console This message is automatically generated. Race condition: ZKRMStateStore retry logic may throw NodeExist exception - Key: YARN-2721 URL: https://issues.apache.org/jira/browse/YARN-2721 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2721.1.patch Blindly retrying operations in zookeeper will not work for non-idempotent operations (like create znode). The reason is that the client can do a create znode, but the response may not be returned because the server can die or timeout. In case of retrying the create znode, it will throw a NODE_EXISTS exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
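The description above explains why blindly retrying a znode create is unsafe: the first create may have succeeded even though its response was lost. A common way to make a retried create effectively idempotent with the plain ZooKeeper client is sketched below; this is only an illustration of the pattern, not the patch attached to this JIRA:
{code}
// Illustrative only: treat NodeExistsException as success when the create is
// known to be a retry, since our own earlier attempt may have created the node.
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;

public class IdempotentCreate {
  public static void createIfNeeded(ZooKeeper zk, String path, byte[] data,
      List<ACL> acls, boolean isRetry) throws KeeperException, InterruptedException {
    try {
      zk.create(path, data, acls, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // If this call is a retry, the node may have been created by our own
      // earlier attempt whose response was lost; swallow the exception.
      if (!isRetry) {
        throw e;
      }
    }
  }
}
{code}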
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2709: Attachment: YARN-2709-102114-2.patch Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114-2.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2709: Attachment: (was: YARN-2709-102114-1.patch) Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114-2.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2448) RM should expose the resource types considered during scheduling when AMs register
[ https://issues.apache.org/jira/browse/YARN-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179235#comment-14179235 ] Maxim Ivanov commented on YARN-2448: [~vvasudev], just curious, what is this patch aiming to achieve? Surely the requirements of a specific application shouldn't change depending on what the scheduler is taking into consideration when doing its job of allocating resources. As [~kkambatl] suggested, the AM can submit everything it knows about the resources it needs, and then the scheduler can safely ignore the requests it doesn't know about. RM should expose the resource types considered during scheduling when AMs register -- Key: YARN-2448 URL: https://issues.apache.org/jira/browse/YARN-2448 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2448.0.patch, apache-yarn-2448.1.patch, apache-yarn-2448.2.patch The RM should expose the name of the ResourceCalculator being used when AMs register, as part of the RegisterApplicationMasterResponse. This will allow applications to make better decisions when scheduling. MapReduce, for example, only looks at memory when deciding its scheduling, even though the RM could potentially be using the DominantResourceCalculator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2505: -- Attachment: YARN-2505.3.patch Attached is a preview patch - it is incomplete, lacking tests, and not all planned interfaces are present yet, but a basic set is. A couple of minor changes wrt the plan a couple of posts above - to keep consistency with the rest of the service interface I'm sticking with plural names everywhere (/nodes/, /node-labels/) and deferring (perhaps permanently...) a couple of items which seem duplicative/purely for completeness but not really needed.
DONE
POST .../cluster/node-labels (serialized data) adds multiple labels in an operation
GET .../cluster/node-labels returns multiple labels as serialized data (all labels)
POST .../cluster/nodes/id/labels (serialized data) adds multiple labels to a node in an operation
GET .../cluster/nodes/id/labels returns serialized set of all labels for node
TODO NOW
DELETE .../cluster/node-labels/a deletes an existing node label, a
DELETE .../cluster/nodes/id/labels/a deletes label a from node id
PUT .../cluster/node-labels/a creates a new node label, a
PUT .../cluster/nodes/id/labels/a adds label a to node id
JUST DEFERRING FOR THE MOMENT - seems like a good idea, though
POST label to multiple nodes
DEFERRING - DUPLICATIVE
GET .../cluster/node-labels/a return value indicates presence or absence of a
GET .../cluster/nodes/id/labels/a indicates existence of label on node by return value
Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.3.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
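For readers unfamiliar with the RM REST conventions, the endpoints in the DONE list above could be exercised roughly as follows. This is a hedged sketch: the /ws/v1 prefix, the RM address, and the JSON payload shape are assumptions, since the comment does not show the serialized form and the patch is still a preview:
{code}
// Hedged sketch of exercising the proposed node-labels endpoints with plain
// HttpURLConnection. Paths follow the DONE/TODO list above; the JSON payload
// shape and the /ws/v1 prefix are assumptions.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class NodeLabelsRestClient {
  public static void main(String[] args) throws Exception {
    String rm = "http://rm-host:8088";  // assumed RM web address

    // POST .../cluster/node-labels : add multiple labels in one operation
    HttpURLConnection post = (HttpURLConnection)
        new URL(rm + "/ws/v1/cluster/node-labels").openConnection();
    post.setRequestMethod("POST");
    post.setRequestProperty("Content-Type", "application/json");
    post.setDoOutput(true);
    byte[] body = "{\"nodeLabels\":[\"gpu\",\"large-mem\"]}"
        .getBytes(StandardCharsets.UTF_8);
    try (OutputStream out = post.getOutputStream()) {
      out.write(body);
    }
    System.out.println("POST node-labels -> " + post.getResponseCode());

    // GET .../cluster/node-labels : list all labels
    HttpURLConnection get = (HttpURLConnection)
        new URL(rm + "/ws/v1/cluster/node-labels").openConnection();
    System.out.println("GET node-labels -> " + get.getResponseCode());
  }
}
{code}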
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179280#comment-14179280 ] Hadoop QA commented on YARN-2709: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676202/YARN-2709-102114-2.patch against trunk revision 4baca31. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5490//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5490//console This message is automatically generated. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114-2.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
Wei Yan created YARN-2722: - Summary: Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle Key: YARN-2722 URL: https://issues.apache.org/jira/browse/YARN-2722 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179281#comment-14179281 ] Zhijie Shen commented on YARN-2709: --- +1 for the last patch. Will commit it. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114-2.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179288#comment-14179288 ] Hudson commented on YARN-2709: -- FAILURE: Integrated in Hadoop-trunk-Commit #6307 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6307/]) YARN-2709. Made timeline client getDelegationToken API retry if ConnectException happens. Contributed by Li Lu. (zjshen: rev b2942762d7f76d510ece5621c71116346a6b12f6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/CHANGES.txt Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, YARN-2709-102114-2.patch, YARN-2709-102114.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179377#comment-14179377 ] Mayank Bansal commented on YARN-2647: - Hi [~sunilg], are you still working on this? Can I take it over if you are not looking at it? Thanks, Mayank Add yarn queue CLI to get queue info including labels of such queue --- Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2698) Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal reassigned YARN-2698: --- Assignee: Mayank Bansal (was: Wangda Tan) Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI --- Key: YARN-2698 URL: https://issues.apache.org/jira/browse/YARN-2698 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Mayank Bansal YARN RMAdminCLI and AdminService should have write API only, for other read APIs, they should be located at YARNCLI and RMClientService. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2698) Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179379#comment-14179379 ] Mayank Bansal commented on YARN-2698: - taking it over Thanks, Mayank Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI --- Key: YARN-2698 URL: https://issues.apache.org/jira/browse/YARN-2698 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Mayank Bansal YARN RMAdminCLI and AdminService should have write API only, for other read APIs, they should be located at YARNCLI and RMClientService. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port
Phil D'Amore created YARN-2723: -- Summary: rmadmin -replaceLabelsOnNode does not correctly parse port Key: YARN-2723 URL: https://issues.apache.org/jira/browse/YARN-2723 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Phil D'Amore There is an off-by-one issue in RMAdminCLI.java (line 457): port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":"))); should probably be: port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); Currently attempting to add a label to a node with a port specified looks like this: [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode node.example.com:45454,test-label replaceLabelsOnNode: For input string: ":45454" Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 node2:port,label1,label2]] It appears to be trying to parse the ':' as part of the integer because the substring index is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
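The off-by-one reported above is easy to see in isolation: substring(indexOf(":")) keeps the colon, so Integer.valueOf receives ":45454". A tiny standalone illustration (a hypothetical helper, not the actual RMAdminCLI code):
{code}
// Small illustration of the off-by-one described above. Hypothetical helper,
// not the actual RMAdminCLI code.
public class NodeIdParser {
  public static int parsePort(String nodeIdStr) {
    int colon = nodeIdStr.indexOf(':');
    if (colon < 0) {
      return 0;  // no port given
    }
    // substring(colon) would include the ':' itself and make Integer.valueOf
    // fail with "For input string: \":45454\""; skipping one character past
    // the colon yields just the digits.
    return Integer.valueOf(nodeIdStr.substring(colon + 1));
  }

  public static void main(String[] args) {
    System.out.println(parsePort("node.example.com:45454"));  // prints 45454
  }
}
{code}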
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179398#comment-14179398 ] Vinod Kumar Vavilapalli commented on YARN-2715: --- Looks better now. A couple of comments: - TestRMAdminService: can you change it to also use the YARN property names also? - Can we move processRMProxyUsersConf to somewhere else? Say RMServerUtils? Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
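The fix described in this issue is essentially configuration layering: the HTTP interface should source the common hadoop.proxyuser.* settings first and then let yarn.resourcemanager.webapp.proxyuser.* override them. A minimal sketch of that layering with Hadoop's Configuration API follows; the helper name and exact copy semantics are assumptions, not the committed RMServerUtils code:
{code}
// Minimal sketch, assuming the layering described above: mirror the common
// hadoop.proxyuser.* keys into the RM webapp namespace, then let explicit
// yarn.resourcemanager.webapp.proxyuser.* keys win.
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class ProxyUserConfLayering {
  private static final String COMMON_PREFIX = "hadoop.proxyuser.";
  private static final String RM_WEBAPP_PREFIX =
      "yarn.resourcemanager.webapp.proxyuser.";

  public static Configuration layer(Configuration conf) {
    Configuration layered = new Configuration(conf);
    // First pass: mirror the common proxyuser keys into the webapp namespace.
    for (Map.Entry<String, String> e : conf) {
      if (e.getKey().startsWith(COMMON_PREFIX)) {
        String suffix = e.getKey().substring(COMMON_PREFIX.length());
        layered.set(RM_WEBAPP_PREFIX + suffix, e.getValue());
      }
    }
    // Second pass: explicit webapp keys, if any, override the mirrored ones.
    for (Map.Entry<String, String> e : conf) {
      if (e.getKey().startsWith(RM_WEBAPP_PREFIX)) {
        layered.set(e.getKey(), e.getValue());
      }
    }
    return layered;
  }
}
{code}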
[jira] [Updated] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2505: -- Attachment: YARN-2505.4.patch WIP update - some items were moved to [YARN-2503] and so are removed from here, some tests are now done Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
Sumit Mohanty created YARN-2724: --- Summary: If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
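The reporter's suggested output boils down to one invariant: the declared LogLength must match the bytes that actually follow, so an unreadable file should be recorded with the length of the error text rather than the length of the original file. A minimal sketch of that idea (not the real AggregatedLogFormat writer):
{code}
// Minimal sketch of the fix suggested above: when a log file cannot be read,
// write the error text and declare ITS length, so LogLength always matches
// what actually follows. Not the real AggregatedLogFormat code.
import java.io.DataOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class LogEntryWriter {
  public static void writeEntry(DataOutputStream out, File logFile)
      throws IOException {
    byte[] contents;
    try {
      contents = Files.readAllBytes(logFile.toPath());
    } catch (IOException e) {
      // Unreadable file: substitute the error message for the contents so the
      // declared length and the payload stay consistent.
      contents = ("Error aggregating log file. Log file : "
          + logFile.getAbsolutePath() + " (" + e.getMessage() + ")")
          .getBytes(StandardCharsets.UTF_8);
    }
    out.writeBytes("LogType: " + logFile.getName() + "\n");
    out.writeBytes("LogLength: " + contents.length + "\n");
    out.writeBytes("Log Contents:\n");
    out.write(contents);
    out.writeBytes("\n");
  }
}
{code}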
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179438#comment-14179438 ] Jian He commented on YARN-2198: --- Hi [~rusanu], +1 for the latest patch. Looks like it's conflicting with trunk again. Could you update ? I'd like to commit this after that. sorry for the repeated updating. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2715: -- Attachment: YARN-2715.4.patch Thanks for the comments. I uploaded a new patch which address the two comments. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, YARN-2715.4.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2495: Attachment: YARN-2495_20141022.1.patch Hi All, uploading a WIP patch, just to share the approach...
Completed
* User can set labels in each NM (by setting yarn-site.xml or using the script suggested by Allen Wittenauer)
* NM will send labels to RM via the ResourceTracker API
* RM will set labels in NodeLabelManager when the NM registers/updates labels
Pending:
* No test cases written yet, and the test cases modified to resolve compilation issues could perhaps be done in a better way.
* As per the design doc there was a requirement to specifically support either distributed or centralized configuration, but I was not sure how to get it done, as the current design does not seem to be specific to central or distributed and a class was configured to identify the NodeLabelsManager.
* Currently I have not completely ensured that node labels are sent only when the labels got from the script differ from the labels last successfully updated to the RM. The responses of node heartbeat and node register can be used for this. Yet to finish.
* Configuration has been added to validate for centralized labels, but the approach for this needs further discussion.
Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495_20141022.1.patch Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
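The two label sources mentioned in the WIP patch above (a static list in yarn-site.xml or an admin-supplied script) can be sketched as below. The property names are placeholders, not the keys used by the actual patch:
{code}
// Sketch of the two label sources described above: a static list from
// yarn-site.xml or the trimmed stdout of an admin script. The property names
// are hypothetical placeholders.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

public class NodeLabelsProvider {
  public static Set<String> getLabels(Configuration conf)
      throws IOException, InterruptedException {
    Set<String> labels = new HashSet<String>();
    String script = conf.get("yarn.nodemanager.node-labels.script");  // hypothetical key
    if (script != null) {
      // Run the admin script and treat its comma-separated stdout as labels.
      Process p = new ProcessBuilder(script).start();
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = r.readLine()) != null) {
          for (String l : line.trim().split(",")) {
            if (!l.isEmpty()) {
              labels.add(l);
            }
          }
        }
      }
      p.waitFor();
    } else {
      // hypothetical key for a static, comma-separated label list
      labels.addAll(Arrays.asList(
          conf.getTrimmedStrings("yarn.nodemanager.node-labels", new String[0])));
    }
    return labels;
  }
}
{code}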
[jira] [Commented] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time
[ https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179479#comment-14179479 ] Jian He commented on YARN-1915: --- bq. However there's some confusion as to how the client token master key should be sent to the RM (e.g.: via container credentials, via the current method, etc.) thanks Jason, I read through the discussion. I prefer setting via container credential as it's the common way to pass the credential/tokens to both AM container and non-AM container. I'm also OK to get the current patch in first. ClientToAMTokenMasterKey should be provided to AM at launch time Key: YARN-1915 URL: https://issues.apache.org/jira/browse/YARN-1915 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Hitesh Shah Assignee: Jason Lowe Priority: Blocker Attachments: YARN-1915.patch, YARN-1915v2.patch, YARN-1915v3.patch Currently, the AM receives the key as part of registration. This introduces a race where a client can connect to the AM when the AM has not received the key. Current Flow: 1) AM needs to start the client listening service in order to get host:port and send it to the RM as part of registration 2) RM gets the port info in register() and transitions the app to RUNNING. Responds back with client secret to AM. 3) User asks RM for client token. Gets it and pings the AM. AM hasn't received client secret from RM and so RPC itself rejects the request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179491#comment-14179491 ] Hadoop QA commented on YARN-2715: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676234/YARN-2715.4.patch against trunk revision b294276. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5491//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5491//console This message is automatically generated. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, YARN-2715.4.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port
[ https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-2723: --- Assignee: Naganarasimha G R rmadmin -replaceLabelsOnNode does not correctly parse port -- Key: YARN-2723 URL: https://issues.apache.org/jira/browse/YARN-2723 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Phil D'Amore Assignee: Naganarasimha G R There is an off-by-one issue in RMAdminCLI.java (line 457): port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":"))); should probably be: port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); Currently attempting to add a label to a node with a port specified looks like this: [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode node.example.com:45454,test-label replaceLabelsOnNode: For input string: ":45454" Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 node2:port,label1,label2]] It appears to be trying to parse the ':' as part of the integer because the substring index is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179507#comment-14179507 ] Vinod Kumar Vavilapalli commented on YARN-2495: --- This is very useful to get in for 2.6, [~leftnoteasy]/[~Naganarasimha] how feasible is it? Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495_20141022.1.patch Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179509#comment-14179509 ] Vinod Kumar Vavilapalli commented on YARN-2715: --- Looks good, +1. Checking this in. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, YARN-2715.4.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179514#comment-14179514 ] Hudson commented on YARN-2715: -- FAILURE: Integrated in Hadoop-trunk-Commit #6308 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6308/]) YARN-2715. Fixed ResourceManager to respect common configurations for proxy users/groups beyond just the YARN level config. Contributed by Zhijie Shen. (vinodkv: rev c0e034336c85296be6f549d88d137fb2b2b79a15) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMProxyUsersConf.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * hadoop-yarn-project/CHANGES.txt Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, YARN-2715.4.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-2724: --- Assignee: Xuan Gong If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time
[ https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179528#comment-14179528 ] Vinod Kumar Vavilapalli commented on YARN-1915: --- bq. Yes, I thought the ugi mangling was gone, but the AMRMToken is indeed manually removed. I had a JIRA for fixing this, so that NMs themselves will remove it for non-AM containers, will find it. bq. I'm assuming there was a valid reason why the secret is passed in the registration response, perhaps for future functionality. The secret used to be in env. We moved it to registration because of security issues in Windows. bq. However there's some confusion as to how the client token master key should be sent to the RM (e.g.: via container credentials, via the current method, etc.). We can deprecate the key returning in response and instead put it inside container credentials. The credentials is unfortunately named as 'tokens' - it was always token so far. We could deprecate tokens too and instead move to credentials ala CredentialsInfo for web-services. The wait in the current patch is worrisome *only* if we have large number of clients pinging in and blocking RPC handlers. This doesn't happen in practice though, I'm okay getting it in for 2.6. ClientToAMTokenMasterKey should be provided to AM at launch time Key: YARN-1915 URL: https://issues.apache.org/jira/browse/YARN-1915 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Hitesh Shah Assignee: Jason Lowe Priority: Blocker Attachments: YARN-1915.patch, YARN-1915v2.patch, YARN-1915v3.patch Currently, the AM receives the key as part of registration. This introduces a race where a client can connect to the AM when the AM has not received the key. Current Flow: 1) AM needs to start the client listening service in order to get host:port and send it to the RM as part of registration 2) RM gets the port info in register() and transitions the app to RUNNING. Responds back with client secret to AM. 3) User asks RM for client token. Gets it and pings the AM. AM hasn't received client secret from RM and so RPC itself rejects the request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
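For context on the "via container credentials" option discussed above, this is roughly how an AM can enumerate the tokens and secret keys it was launched with; the secret-key alias below is a placeholder, and this does not reflect the mechanism ultimately chosen in this JIRA:
{code}
// Illustration only: how an AM can inspect the credentials it was launched
// with (this is where a secret passed "via container credentials" would show
// up). The alias name is a placeholder.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class DumpContainerCredentials {
  public static void main(String[] args) throws IOException {
    Credentials creds = UserGroupInformation.getCurrentUser().getCredentials();
    for (Token<? extends TokenIdentifier> t : creds.getAllTokens()) {
      System.out.println("token kind: " + t.getKind());
    }
    // A master key shipped as a secret key would be looked up by its alias;
    // the alias here is purely hypothetical.
    byte[] key = creds.getSecretKey(new Text("clientToAMTokenMasterKey"));
    System.out.println("master key present: " + (key != null));
  }
}
{code}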
[jira] [Updated] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2701: Attachment: YARN-2701.addendum.1.patch Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, YARN-2701.addendum.1.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179560#comment-14179560 ] Xuan Gong commented on YARN-2701: - [~aw] Thanks for the summary. Let us not revert the current code. I uploaded an addendum patch. In this patch, I revert the current mkdirs codes to the codes which were committed in YARN-2161. Also I made some necessary changes to solve the race condition issue. If you can review it, that will be very helpful. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, YARN-2701.addendum.1.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
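The underlying problem here is the check-then-create pattern (stat, then mkdir), which two localizers can race through. Purely as an illustration of the idempotent alternative, and in Java rather than the native container-executor code: attempt the create and tolerate "already exists":
{code}
// Java illustration of the race-free pattern; the real fix lives in
// container-executor.c, this only shows the idea.
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SafeMkdir {
  public static void ensureDir(String dir) throws IOException {
    Path p = Paths.get(dir);
    try {
      Files.createDirectory(p);
    } catch (FileAlreadyExistsException e) {
      // Another container's localizer won the race; that is fine as long as
      // the existing entry really is a directory.
      if (!Files.isDirectory(p)) {
        throw e;
      }
    }
  }
}
{code}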
[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179563#comment-14179563 ] Zhijie Shen commented on YARN-2721: --- +1 straightforward change. Let's make the complete solution in YARN-2716. Will commit the patch. Race condition: ZKRMStateStore retry logic may throw NodeExist exception - Key: YARN-2721 URL: https://issues.apache.org/jira/browse/YARN-2721 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2721.1.patch Blindly retrying operations in zookeeper will not work for non-idempotent operations (like create znode). The reason is that the client can do a create znode, but the response may not be returned because the server can die or timeout. In case of retrying the create znode, it will throw a NODE_EXISTS exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179573#comment-14179573 ] Hudson commented on YARN-2721: -- FAILURE: Integrated in Hadoop-trunk-Commit #6309 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6309/]) YARN-2721. Suppress NodeExist exception thrown by ZKRMStateStore when it retries creating znode. Contributed by Jian He. (zjshen: rev 7e3b5e6f5cb4945b4fab27e8a83d04280df50e17) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * hadoop-yarn-project/CHANGES.txt Race condition: ZKRMStateStore retry logic may throw NodeExist exception - Key: YARN-2721 URL: https://issues.apache.org/jira/browse/YARN-2721 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2721.1.patch Blindly retrying operations in zookeeper will not work for non-idempotent operations (like create znode). The reason is that the client can do a create znode, but the response may not be returned because the server can die or timeout. In case of retrying the create znode, it will throw a NODE_EXISTS exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2710: - Attachment: TestResourceTrackerOnHA-output.2.txt I could reproduced same issue about TestResourceTrackerOnHA - it's intermittent failure, and it happens rarely. Attaching log on my local. RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: TestResourceTrackerOnHA-output.2.txt, org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failure like, it can be happened in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port
[ https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2723: - Issue Type: Sub-task (was: Bug) Parent: YARN-2492 rmadmin -replaceLabelsOnNode does not correctly parse port -- Key: YARN-2723 URL: https://issues.apache.org/jira/browse/YARN-2723 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Phil D'Amore Assignee: Naganarasimha G R There is an off-by-one issue in RMAdminCLI.java (line 457): port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":"))); should probably be: port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); Currently attempting to add a label to a node with a port specified looks like this: [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode node.example.com:45454,test-label replaceLabelsOnNode: For input string: ":45454" Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 node2:port,label1,label2]] It appears to be trying to parse the ':' as part of the integer because the substring index is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)