[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179673#comment-14179673 ] Wangda Tan commented on YARN-2495: -- [~Naganarasimha], [~aw], let me first give you an overview of what we need to do to support labels in the capacity scheduler; that will help you better understand why we need central node label validation now. In the existing capacity scheduler (patch of YARN-2496), we support specifying which labels each queue can access (to make sure important resources can only be used by privileged users) and the proportion of resource per label (e.g. the "marketing" queue can access 80% of GPU resource). Now, if users want to leverage the capacity scheduler changes, they *MUST* specify 1) which labels the queue can access and 2) the proportion of resource the queue can access for each label. Back to the central node label validation discussion: without it, we cannot get the capacity scheduler to work for now (a user cannot specify capacity for an unknown node-label for a queue, etc.). So I still insist on having central node label validation for both centralized/distributed node label configuration, at least for the 2.6 release. This might change in the future; I suggest moving the option to disable central node label validation to a separate task for further discussion. And I've looked at the patch uploaded by [~Naganarasimha]; thanks for this WIP patch. I took a quick glance at it and have several suggestions: - Per the comments above, do not change {{CommonNodeLabelsManager}}; move the changes that disable central node label validation to a separate patch for further discussion. - Make this patch contain a {{NodeLabelProvider}} only, and create separate JIRAs for {{ScriptNodeLabelProvider}} and an implementation that reads node labels from yarn-site.xml, for easier review. 
> Allow admin specify labels in each NM (Distributed configuration) > - > > Key: YARN-2495 > URL: https://issues.apache.org/jira/browse/YARN-2495 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R > Attachments: YARN-2495_20141022.1.patch > > > The target of this JIRA is to allow admins to specify labels on each NM. This covers: > - User can set labels on each NM (by setting yarn-site.xml or using a script > as suggested by [~aw]) > - NM will send labels to RM via the ResourceTracker API > - RM will set labels in NodeLabelManager when an NM registers/updates labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
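To make the preceding comment concrete, here is a minimal sketch of the capacity-scheduler.xml settings a user *MUST* provide per queue. The property-name pattern follows the abbreviated properties quoted later in this digest (YARN-2726); the queue name "marketing" and label name "gpu" are illustrative assumptions, not values from the actual patch.

```xml
<!-- Sketch only: queue "marketing" and label "gpu" are made-up examples. -->
<property>
  <!-- 1) which labels the queue can access -->
  <name>yarn.scheduler.capacity.root.marketing.accessible-node-labels</name>
  <value>gpu</value>
</property>
<property>
  <!-- 2) the proportion of the label's resource this queue can access -->
  <name>yarn.scheduler.capacity.root.marketing.accessible-node-labels.gpu.capacity</name>
  <value>80</value>
</property>
```

Central validation matters here because the scheduler must reject a capacity entry for a label the cluster does not actually have.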
[jira] [Updated] (YARN-2398) TestResourceTrackerOnHA crashes
[ https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2398: - Attachment: (was: TestResourceTrackerOnHA-output.txt) > TestResourceTrackerOnHA crashes > --- > > Key: YARN-2398 > URL: https://issues.apache.org/jira/browse/YARN-2398 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jason Lowe > > TestResourceTrackerOnHA is currently crashing and failing trunk builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2398) TestResourceTrackerOnHA crashes
[ https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179797#comment-14179797 ] Tsuyoshi OZAWA commented on YARN-2398: -- Rohith, Wangda, yeah, thanks for pointing that out. The log I attached looks unrelated to the issue Jason mentioned. Removing it. > TestResourceTrackerOnHA crashes > --- > > Key: YARN-2398 > URL: https://issues.apache.org/jira/browse/YARN-2398 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jason Lowe > > TestResourceTrackerOnHA is currently crashing and failing trunk builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179802#comment-14179802 ] cntic commented on YARN-2681: - The concept is described at http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Attachments: HADOOP-2681.patch, Traffic Control Design.png > > > To read/write data from HDFS on a data node, applications establish TCP/IP > connections with the datanode. HDFS reads can be controlled by configuring the > Linux Traffic Control (TC) subsystem on the data node to filter the appropriate connections. > The current cgroups net_cls concept cannot be applied on the node where the > container is launched, nor on the data node, since: > - TC handles outgoing bandwidth only, so it cannot be set on the container node > (an HDFS read is incoming data for the container) > - Since the HDFS data node is handled by only one process, it is not possible > to use net_cls to separate connections from different containers to the > datanode. > Tasks: > 1) Extend the Resource model to define a bandwidth enforcement rate > 2) Monitor TCP/IP connections established by the container-handling process and > its child processes > 3) Set Linux Traffic Control rules on the data node based on address:port pairs in > order to enforce bandwidth of outgoing data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
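Task 3 of the proposal above (building TC rules on the data node from address:port pairs) can be sketched as follows. This is not the attached patch: the device name, qdisc handle, and class id are illustrative assumptions, and the u32 match shape is one common way to steer traffic by destination address and port into a rate-limited class.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: turn a container's datanode-connection endpoints into Linux
 * Traffic Control filter commands that classify the outgoing HDFS-read
 * traffic (datanode -> container) into a rate-limited class.
 * "eth0", handle "1:" and class "1:10" are made-up example values.
 */
public class TcFilterSketch {
    static List<String> filtersFor(String dev, String classId,
                                   List<String> addrPortPairs) {
        List<String> cmds = new ArrayList<>();
        for (String pair : addrPortPairs) {
            String[] ap = pair.split(":");   // "addr:port" -> [addr, port]
            cmds.add("tc filter add dev " + dev
                   + " parent 1: protocol ip prio 1 u32"
                   + " match ip dst " + ap[0]        // container's address
                   + " match ip dport " + ap[1] + " 0xffff"
                   + " flowid " + classId);          // rate-limited class
        }
        return cmds;
    }

    public static void main(String[] args) {
        // One monitored connection from a container endpoint (example values)
        for (String c : filtersFor("eth0", "1:10",
                List.of("10.0.0.5:50010"))) {
            System.out.println(c);
        }
    }
}
```

The enforcement rate itself (task 1) would live in the class definition (e.g. an htb class) that `flowid` points at; the filters only do the per-connection classification.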
[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179831#comment-14179831 ] Hudson commented on YARN-2721: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/720/]) YARN-2721. Suppress NodeExist exception thrown by ZKRMStateStore when it retries creating znode. Contributed by Jian He. (zjshen: rev 7e3b5e6f5cb4945b4fab27e8a83d04280df50e17) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java > Race condition: ZKRMStateStore retry logic may throw NodeExist exception > - > > Key: YARN-2721 > URL: https://issues.apache.org/jira/browse/YARN-2721 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Fix For: 2.6.0 > > Attachments: YARN-2721.1.patch > > > Blindly retrying operations in zookeeper will not work for non-idempotent > operations (like create znode). The reason is that the client can do a create > znode, but the response may not be returned because the server can die or > timeout. In case of retrying the create znode, it will throw a NODE_EXISTS > exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
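The failure mode described above can be shown with a small self-contained model. This is a toy simulation, not the real ZooKeeper client or the actual ZKRMStateStore patch: a "server" applies the create but loses the reply, and the retry loop then treats the resulting node-exists error as success, in the spirit of the fix.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Toy model of why blindly retrying a non-idempotent create fails:
 * the first attempt succeeds server-side but the response is lost,
 * so the retry sees "node exists" for its own earlier create.
 */
public class RetryCreateDemo {
    static class NodeExistsException extends Exception {}
    static class LostResponseException extends Exception {}

    static final Set<String> znodes = new HashSet<>();
    static boolean dropNextResponse = true; // simulate one lost reply

    /** "Server-side" create: applies the change, then may lose the reply. */
    static void create(String path)
            throws NodeExistsException, LostResponseException {
        if (!znodes.add(path)) {
            throw new NodeExistsException();   // node was created earlier
        }
        if (dropNextResponse) {
            dropNextResponse = false;
            throw new LostResponseException(); // change applied, reply lost
        }
    }

    /** Retry loop in the spirit of the fix: suppress NodeExists on retry. */
    static boolean createWithRetry(String path) {
        boolean retrying = false;
        while (true) {
            try {
                create(path);
                return true;
            } catch (LostResponseException e) {
                retrying = true;               // connection trouble: retry
            } catch (NodeExistsException e) {
                // On a retry this is the echo of our own earlier create,
                // so treat it as success; first-attempt conflicts still fail.
                return retrying;
            }
        }
    }

    public static void main(String[] args) {
        if (!createWithRetry("/rmstore/app_1")) {
            throw new AssertionError("retried create should count as success");
        }
        System.out.println("OK");
    }
}
```

A naive retry loop without the NodeExists branch would propagate the exception even though the znode was created successfully, which is exactly the race the issue reports.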
[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179829#comment-14179829 ] Hudson commented on YARN-2720: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/720/]) YARN-2720. Windows: Wildcard classpath variables not expanded against resources contained in archives. Contributed by Craig Welch. (cnauroth: rev 6637e3cf95b3a9be8d6b9cd66bc849a0607e8ed5) * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFileUtil.java * hadoop-yarn-project/CHANGES.txt * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Classpath.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java > Windows: Wildcard classpath variables not expanded against resources > contained in archives > -- > > Key: YARN-2720 > URL: https://issues.apache.org/jira/browse/YARN-2720 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Craig Welch >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch > > > On windows there are limitations to the length of command lines and > environment variables which prevent placing all classpath resources into > these elements. Instead, a jar containing only a classpath manifest is > created to provide the classpath. During this process wildcard references > are expanded by inspecting the filesystem. 
Since archives are extracted to a > different location and linked into the final location after the classpath jar > is created, resources referred to via wildcards which exist in localized > archives (.zip, tar.gz) are not added to the classpath manifest jar. Since > these entries are removed from the final classpath for the container they are > not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179835#comment-14179835 ] Hudson commented on YARN-2709: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/720/]) YARN-2709. Made timeline client getDelegationToken API retry if ConnectException happens. Contributed by Li Lu. (zjshen: rev b2942762d7f76d510ece5621c71116346a6b12f6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/CHANGES.txt > Add retry for timeline client getDelegationToken method > --- > > Key: YARN-2709 > URL: https://issues.apache.org/jira/browse/YARN-2709 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Fix For: 2.6.0 > > Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, > YARN-2709-102114-2.patch, YARN-2709-102114.patch > > > As mentioned in YARN-2673, we need to add retry mechanism to timeline client > for secured clusters. This means if the timeline server is not available, a > timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179832#comment-14179832 ] Hudson commented on YARN-2715: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/720/]) YARN-2715. Fixed ResourceManager to respect common configurations for proxy users/groups beyond just the YARN level config. Contributed by Zhijie Shen. (vinodkv: rev c0e034336c85296be6f549d88d137fb2b2b79a15) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMProxyUsersConf.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java > Proxy user is problem for RPC interface if > yarn.resourcemanager.webapp.proxyuser is not set. 
> > > Key: YARN-2715 > URL: https://issues.apache.org/jira/browse/YARN-2715 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > Fix For: 2.6.0 > > Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, > YARN-2715.4.patch > > > After YARN-2656, if people set hadoop.proxyuser for the client<-->RM RPC > interface, it's not going to work, because ProxyUsers#sip is a singleton per > daemon. After YARN-2656, RM has both channels that want to set this > configuration: RPC and HTTP. The RPC interface sets it first by reading > hadoop.proxyuser, but it is overwritten by the HTTP interface, which sets it to > empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. > The fix for it could be similar to what we've done for YARN-2676: make the > HTTP interface source hadoop.proxyuser first anyway, then > yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
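The precedence the fix describes can be illustrated with plain maps (this is not the real Hadoop Configuration API, just a toy showing the ordering): source the common hadoop.proxyuser.* settings first, then apply yarn.resourcemanager.webapp.proxyuser.* overrides, so an absent webapp config no longer wipes out the RPC-level settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy illustration of the proxy-user precedence fix: common settings
 * first, webapp-specific settings override, absence of the latter
 * leaves the former intact.
 */
public class ProxyUserPrecedence {
    static Map<String, String> effectiveProxyUsers(Map<String, String> conf) {
        Map<String, String> out = new LinkedHashMap<>();
        final String common = "hadoop.proxyuser.";
        final String webapp = "yarn.resourcemanager.webapp.proxyuser.";
        // 1. source the common hadoop.proxyuser.* settings first
        conf.forEach((k, v) -> {
            if (k.startsWith(common)) out.put(k.substring(common.length()), v);
        });
        // 2. webapp-specific keys override; if none exist, step 1 survives
        conf.forEach((k, v) -> {
            if (k.startsWith(webapp)) out.put(k.substring(webapp.length()), v);
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hadoop.proxyuser.oozie.hosts", "host1");
        // no yarn.resourcemanager.webapp.proxyuser.* keys set at all
        System.out.println(effectiveProxyUsers(conf)); // oozie.hosts survives
    }
}
```

Before the fix, the buggy order was the reverse: the HTTP side replaced the shared singleton with the (empty) webapp-only view, clobbering what RPC had set.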
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179830#comment-14179830 ] Hudson commented on YARN-90: SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/720/]) YARN-90. NodeManager should identify failed disks becoming good again. Contributed by Varun Vasudev (jlowe: rev 6f2028bd1514d90b831f889fd0ee7f2ba5c15000) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDirectoryCollection.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/TestNonAggregatingLogHandler.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java > NodeManager should identify failed disks becoming good again > > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, > apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, > apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. 
This JIRA is to improve NodeManager to > reuse good disks (which could have been bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2725) Adding retry requests about ZKRMStateStore
Tsuyoshi OZAWA created YARN-2725: Summary: Adding retry requests about ZKRMStateStore Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA YARN-2721 found a race condition for ZK-specific retry semantics. We should add tests covering the retry of requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179842#comment-14179842 ] Tsuyoshi OZAWA commented on YARN-2721: -- Good job, Jian. Created YARN-2725 for adding tests to cover these cases. > Race condition: ZKRMStateStore retry logic may throw NodeExist exception > - > > Key: YARN-2721 > URL: https://issues.apache.org/jira/browse/YARN-2721 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Fix For: 2.6.0 > > Attachments: YARN-2721.1.patch > > > Blindly retrying operations in zookeeper will not work for non-idempotent > operations (like create znode). The reason is that the client can do a create > znode, but the response may not be returned because the server can die or > timeout. In case of retrying the create znode, it will throw a NODE_EXISTS > exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2725: - Summary: Adding test cases of retrying requests about ZKRMStateStore (was: Adding retry requests about ZKRMStateStore) > Adding test cases of retrying requests about ZKRMStateStore > --- > > Key: YARN-2725 > URL: https://issues.apache.org/jira/browse/YARN-2725 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tsuyoshi OZAWA > > YARN-2721 found a race condition for ZK-specific retry semantics. We should > add tests covering the retry of requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
Phil D'Amore created YARN-2726: -- Summary: CapacityScheduler should explicitly log when an accessible label has no capacity Key: YARN-2726 URL: https://issues.apache.org/jira/browse/YARN-2726 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Phil D'Amore Priority: Minor Given: - Node label defined: test-label - Two queues defined: a, b - label accessibility and capacity defined as follows (properties abbreviated for readability): root.a.accessible-node-labels = test-label root.a.accessible-node-labels.test-label.capacity = 100 If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack trace with the following error buried within: "Illegal capacity of -1.0 for label=test-label in queue=root.b" This of course occurs because test-label is accessible to b due to inheritance from the root, and -1 is the UNDEFINED value. To my mind this might not be obvious to the admin, and the error message which results does not help guide someone to the source of the issue. I propose that this situation be updated so that when the capacity on an accessible label is undefined, it is explicitly called out instead of falling through to the illegal capacity check. Something like: {code} if (capacity == UNDEFINED) { throw new IllegalArgumentException("Configuration issue: " + " label=" + label + " is accessible from queue=" + queue + " but has no capacity set."); } {code} I'll leave it to better judgement than mine as to whether I'm throwing the appropriate exception there. I think this check should be added to both getNodeLabelCapacities and getMaximumNodeLabelCapacities in CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2692) ktutil test hanging on some machines/ktutil versions
[ https://issues.apache.org/jira/browse/YARN-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179903#comment-14179903 ] Hudson commented on YARN-2692: -- FAILURE: Integrated in Hadoop-trunk-Commit #6310 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6310/]) YARN-2692 ktutil test hanging on some machines/ktutil versions (stevel) (stevel: rev 85a88649c3f3fb7280aa511b2035104bcef28a6f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/RegistryTestHelper.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/TestSecureLogins.java > ktutil test hanging on some machines/ktutil versions > > > Key: YARN-2692 > URL: https://issues.apache.org/jira/browse/YARN-2692 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Fix For: 2.6.0 > > Attachments: YARN-2692-001.patch > > > a couple of the registry security tests run native {{ktutil}}; this is > primarily to debug the keytab generation. [~cnauroth] reports that some > versions of {{kinit}} hang. Fix: rm the tests. [YARN-2689] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179919#comment-14179919 ] Hudson commented on YARN-2721: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/]) YARN-2721. Suppress NodeExist exception thrown by ZKRMStateStore when it retries creating znode. Contributed by Jian He. (zjshen: rev 7e3b5e6f5cb4945b4fab27e8a83d04280df50e17) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * hadoop-yarn-project/CHANGES.txt > Race condition: ZKRMStateStore retry logic may throw NodeExist exception > - > > Key: YARN-2721 > URL: https://issues.apache.org/jira/browse/YARN-2721 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Fix For: 2.6.0 > > Attachments: YARN-2721.1.patch > > > Blindly retrying operations in zookeeper will not work for non-idempotent > operations (like create znode). The reason is that the client can do a create > znode, but the response may not be returned because the server can die or > timeout. In case of retrying the create znode, it will throw a NODE_EXISTS > exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179917#comment-14179917 ] Hudson commented on YARN-2720: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/]) YARN-2720. Windows: Wildcard classpath variables not expanded against resources contained in archives. Contributed by Craig Welch. (cnauroth: rev 6637e3cf95b3a9be8d6b9cd66bc849a0607e8ed5) * hadoop-yarn-project/CHANGES.txt * hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFileUtil.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Classpath.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java > Windows: Wildcard classpath variables not expanded against resources > contained in archives > -- > > Key: YARN-2720 > URL: https://issues.apache.org/jira/browse/YARN-2720 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Craig Welch >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch > > > On windows there are limitations to the length of command lines and > environment variables which prevent placing all classpath resources into > these elements. Instead, a jar containing only a classpath manifest is > created to provide the classpath. During this process wildcard references > are expanded by inspecting the filesystem. 
Since archives are extracted to a > different location and linked into the final location after the classpath jar > is created, resources referred to via wildcards which exist in localized > archives (.zip, tar.gz) are not added to the classpath manifest jar. Since > these entries are removed from the final classpath for the container they are > not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179921#comment-14179921 ] Hudson commented on YARN-2715: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/]) YARN-2715. Fixed ResourceManager to respect common configurations for proxy users/groups beyond just the YARN level config. Contributed by Zhijie Shen. (vinodkv: rev c0e034336c85296be6f549d88d137fb2b2b79a15) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMProxyUsersConf.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java > Proxy user is problem for RPC interface if > yarn.resourcemanager.webapp.proxyuser is not set. 
> > > Key: YARN-2715 > URL: https://issues.apache.org/jira/browse/YARN-2715 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > Fix For: 2.6.0 > > Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, > YARN-2715.4.patch > > > After YARN-2656, if people set hadoop.proxyuser for the client<-->RM RPC > interface, it's not going to work, because ProxyUsers#sip is a singleton per > daemon. After YARN-2656, RM has both channels that want to set this > configuration: RPC and HTTP. The RPC interface sets it first by reading > hadoop.proxyuser, but it is overwritten by the HTTP interface, which sets it to > empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. > The fix for it could be similar to what we've done for YARN-2676: make the > HTTP interface source hadoop.proxyuser first anyway, then > yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179918#comment-14179918 ] Hudson commented on YARN-90: FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/]) YARN-90. NodeManager should identify failed disks becoming good again. Contributed by Varun Vasudev (jlowe: rev 6f2028bd1514d90b831f889fd0ee7f2ba5c15000) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/TestNonAggregatingLogHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDirectoryCollection.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java > NodeManager should identify failed disks becoming good again > > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, > apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, > apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. 
But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. This JIRA is to improve NodeManager to > reuse good disks (which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179924#comment-14179924 ] Hudson commented on YARN-2709: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/]) YARN-2709. Made timeline client getDelegationToken API retry if ConnectException happens. Contributed by Li Lu. (zjshen: rev b2942762d7f76d510ece5621c71116346a6b12f6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java > Add retry for timeline client getDelegationToken method > --- > > Key: YARN-2709 > URL: https://issues.apache.org/jira/browse/YARN-2709 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Fix For: 2.6.0 > > Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, > YARN-2709-102114-2.patch, YARN-2709-102114.patch > > > As mentioned in YARN-2673, we need to add retry mechanism to timeline client > for secured clusters. This means if the timeline server is not available, a > timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2683) document registry config options
[ https://issues.apache.org/jira/browse/YARN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2683: - Attachment: YARN-2683-002.patch correct patch as applied to branch-2 > document registry config options > > > Key: YARN-2683 > URL: https://issues.apache.org/jira/browse/YARN-2683 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-2683-001.patch, YARN-2683-002.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Add to {{yarn-site}} a page on registry configuration parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179929#comment-14179929 ] Mit Desai commented on YARN-2724: - The problem here is that the file length is calculated before the file is even opened. The log aggregator reads and records the length of the log file to be aggregated, then tries to read the file contents. If it does not have permission to access the file, it gets "Permission denied", just like what is seen here. Which application were you running when you encountered this error? If this happens with a specific application, the NM user needs access to the log files that application creates. Since log aggregation is done by the NM user, granting it permission to read the generated log files should fix this issue. > If an unreadable file is encountered during log aggregation then aggregated > file in HDFS badly formed > - > > Key: YARN-2724 > URL: https://issues.apache.org/jira/browse/YARN-2724 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 2.5.1 >Reporter: Sumit Mohanty >Assignee: Xuan Gong > > Look into the log output snippet. It looks like there is an issue during > aggregation when an unreadable file is encountered. Likely, this results in > bad encoding. > {noformat} > LogType: command-13.json > LogLength: 13934 > Log Contents: > Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json > (Permission denied)command-3.json13983Error aggregating log file. 
Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json > (Permission denied) > > errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: > [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K->15575K(184320K), > 0.0488700 secs] 163840K->15575K(1028096K), 0.0492510 secs] [Times: user=0.06 > sys=0.01, real=0.05 secs] > 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: > [ParNew: 179415K->11865K(184320K), 0.0941310 secs] 179415K->17228K(1028096K), > 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] > 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: > 95.187: [ParNew: 175705K->12802K(184320K), 0.0466420 secs] > 181068K->18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, > real=0.04 secs] > {noformat} > Specifically, look at the text after the exception text. There should be two > more entries for log files but none exist. This is likely due to the fact > that command-13.json is expected to be of length 13934 but it is not, as the > file was never read. > I think it should have been > {noformat} > LogType: command-13.json > LogLength: > Log Contents: > Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json > (Permission denied)command-3.json13983Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json > (Permission denied) > {noformat} > {noformat} > LogType: errors-3.txt > LogLength:0 > Log Contents: > {noformat} > {noformat} > LogType:gc.log > LogLength:??? 
> Log Contents: > ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: > [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
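The malformed-output mechanism described in the YARN-2724 comment above (the LogLength header is taken from file metadata before the read is attempted, so a permission failure leaves a header promising bytes the record does not contain) can be modeled with a short sketch. The class and method names here are hypothetical illustrations, not the actual Hadoop AggregatedLogFormat code:

```java
// Hypothetical model of the aggregation bug: the header length comes from a
// stat of the file before the read, so an unreadable file yields a header
// that promises more bytes than the record actually contains.
public class AggregationSketch {
    /** Builds one aggregated record: LogType/LogLength header plus body. */
    public static String aggregate(String name, long statedLength,
                                   boolean readable, String contents) {
        String body = readable
            ? contents
            : "Error aggregating log file. Log file : " + name + " (Permission denied)";
        // Bug: the header uses statedLength even when the body is the error text.
        return "LogType: " + name + "\nLogLength: " + statedLength
             + "\nLog Contents:\n" + body;
    }

    /** True if the declared length matches the bytes actually emitted. */
    public static boolean wellFormed(String name, long statedLength,
                                     boolean readable, String contents) {
        String record = aggregate(name, statedLength, readable, contents);
        String body = record.substring(record.indexOf("Log Contents:\n") + 14);
        return body.length() == statedLength;
    }

    public static void main(String[] args) {
        // Readable file: header and body agree.
        System.out.println(wellFormed("ok.log", 5, true, "hello"));
        // Unreadable file: the header still claims 13934 bytes, so any reader
        // that trusts it desynchronizes, exactly as in the report above.
        System.out.println(wellFormed("command-13.json", 13934, false, null));
    }
}
```

A reader consuming such records by trusting LogLength would swallow the following entries, which is why the entries after the exception text are missing.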
[jira] [Assigned] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-2726: --- Assignee: Naganarasimha G R > CapacityScheduler should explicitly log when an accessible label has no > capacity > > > Key: YARN-2726 > URL: https://issues.apache.org/jira/browse/YARN-2726 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Phil D'Amore >Assignee: Naganarasimha G R >Priority: Minor > > Given: > - Node label defined: test-label > - Two queues defined: a, b > - label accessibility and capacity defined as follows (properties > abbreviated for readability): > root.a.accessible-node-labels = test-label > root.a.accessible-node-labels.test-label.capacity = 100 > If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack > trace with the following error buried within: > "Illegal capacity of -1.0 for label=test-label in queue=root.b" > This of course occurs because test-label is accessible to b due to > inheritance from the root, and -1 is the UNDEFINED value. To my mind this > might not be obvious to the admin, and the error message which results does > not help guide someone to the source of the issue. > I propose that this situation be updated so that when the capacity on an > accessible label is undefined, it is explicitly called out instead of falling > through to the illegal capacity check. Something like: > {code} > if (capacity == UNDEFINED) { > throw new IllegalArgumentException("Configuration issue: " + " label=" + > label + " is accessible from queue=" + queue + " but has no capacity set."); > } > {code} > I'll leave it to better judgement than mine as to whether I'm throwing the > appropriate exception there. I think this check should be added to both > getNodeLabelCapacities and getMaximumNodeLabelCapacities in > CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
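The validation proposed in YARN-2726 above can be sketched as a standalone method. The UNDEFINED sentinel mirrors the -1.0 in the reported error message; this is a hypothetical illustration, and the real change would live in CapacitySchedulerConfiguration rather than this class:

```java
// Sketch of the proposed check: fail fast with a message that names the
// inherited-accessibility cause instead of "Illegal capacity of -1.0".
public class LabelCapacityCheck {
    static final float UNDEFINED = -1.0f;

    public static void checkLabelCapacity(String queue, String label, float capacity) {
        if (capacity == UNDEFINED) {
            // The clearer diagnostic proposed in the comment above.
            throw new IllegalArgumentException("Configuration issue: label=" + label
                + " is accessible from queue=" + queue + " but has no capacity set.");
        }
        if (capacity < 0 || capacity > 100) {
            // The generic check the admin currently falls through to.
            throw new IllegalArgumentException("Illegal capacity of " + capacity
                + " for label=" + label + " in queue=" + queue);
        }
    }

    public static void main(String[] args) {
        checkLabelCapacity("root.a", "test-label", 100f); // explicitly configured: fine
        try {
            // root.b inherits access to test-label but sets no capacity.
            checkLabelCapacity("root.b", "test-label", UNDEFINED);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With this ordering, the admin sees which queue/label pair is missing a capacity setting instead of a bare -1.0 complaint.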
[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems
[ https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179953#comment-14179953 ] Steve Loughran commented on YARN-2700: -- logs {code} 2014-10-21 03:25:26,022 [NIOServerCxn.Factory:localhost/127.0.0.1:0] INFO server.NIOServerCnxnFactory (NIOServerCnxnFactory.java:run(197)) - Accepted socket connection from /127.0.0.1:49869 2014-10-21 03:25:26,024 [JUnit-SendThread(127.0.0.1:49864)] DEBUG zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(892)) - Session establishment request sent on 127.0.0.1/127.0.0.1:49864 Found KeyTab Found KerberosKey for zookeeper/localh...@example.com Found KerberosKey for zookeeper/localh...@example.com Found KerberosKey for zookeeper/localh...@example.com Found KerberosKey for zookeeper/localh...@example.com Found KerberosKey for zookeeper/localh...@example.com 2014-10-21 03:25:26,035 [NIOServerCxn.Factory:localhost/127.0.0.1:0] INFO server.ZooKeeperServer (ZooKeeperServer.java:processConnectRequest(868)) - Client attempting to establish new session at /127.0.0.1:49869 2014-10-21 03:25:26,039 [SyncThread:0] INFO persistence.FileTxnLog (FileTxnLog.java:append(199)) - Creating new log file: log.1 2014-10-21 03:25:26,057 [SyncThread:0] INFO server.ZooKeeperServer (ZooKeeperServer.java:finishSessionInit(617)) - Established session 0x149323d6882 with negotiated timeout 6 for client /127.0.0.1:49869 2014-10-21 03:25:26,059 [JUnit-SendThread(127.0.0.1:49864)] INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server 127.0.0.1/127.0.0.1:49864, sessionid = 0x149323d6882, negotiated timeout = 6 Found ticket for zookee...@example.com to go to krbtgt/example@example.com expiring on Wed Oct 22 03:25:25 PDT 2014 Entered Krb5Context.initSecContext with state=STATE_NEW Found ticket for zookee...@example.com to go to krbtgt/example@example.com expiring on Wed Oct 22 03:25:25 PDT 2014 Service ticket not found in the subject 
KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database at sun.security.krb5.KrbTgsRep.(KrbTgsRep.java:73) at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:192) at sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:203) at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:309) at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:115) at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:454) at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:641) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179) at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193) at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:366) at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:363) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:362) at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:348) at org.apache.zookeeper.client.ZooKeeperSaslClient.sendSaslPacket(ZooKeeperSaslClient.java:420) at org.apache.zookeeper.client.ZooKeeperSaslClient.initialize(ZooKeeperSaslClient.java:458) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1013) Caused by: KrbException: Identifier doesn't match expected value (906) at sun.security.krb5.internal.KDCRep.init(KDCRep.java:143) at sun.security.krb5.internal.TGSRep.init(TGSRep.java:66) at sun.security.krb5.internal.TGSRep.(TGSRep.java:61) at sun.security.krb5.KrbTgsRep.(KrbTgsRep.java:55) ... 
18 more 2014-10-21 03:25:26,145 [JUnit-SendThread(127.0.0.1:49864)] ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(384)) - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - Server not found in Kerberos database)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. 2014-10-21 03:25:26,146 [JUnit-SendThread(127.0.0.1:49864)] ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1015)) - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sa
[jira] [Commented] (YARN-2683) document registry config options
[ https://issues.apache.org/jira/browse/YARN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179958#comment-14179958 ] Hadoop QA commented on YARN-2683: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676323/YARN-2683-002.patch against trunk revision 85a8864. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5492//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5492//console This message is automatically generated. 
> document registry config options > > > Key: YARN-2683 > URL: https://issues.apache.org/jira/browse/YARN-2683 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-2683-001.patch, YARN-2683-002.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Add to {{yarn-site}} a page on registry configuration parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Description: To read/write data from HDFS on a data node, applications establish TCP/IP connections with the datanode. HDFS reads can be controlled by configuring the Linux Traffic Control (TC) subsystem on the data node to apply filters to the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be applied on the container node (an HDFS read is incoming data for the container) - Since the HDFS data node is handled by a single process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data Concept: http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf was: To read/write data from HDFS on a data node, applications establish TCP/IP connections with the datanode. HDFS reads can be controlled by configuring the Linux Traffic Control (TC) subsystem on the data node to apply filters to the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be applied on the container node (an HDFS read is incoming data for the container) - Since the HDFS data node is handled by a single process, it is not possible to use net_cls to separate connections from different containers to the datanode. 
Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Attachments: HADOOP-2681.patch, Traffic Control Design.png > > > To read/write data from HDFS on a data node, applications establish TCP/IP > connections with the datanode. HDFS reads can be controlled by configuring the > Linux Traffic Control (TC) subsystem on the data node to apply filters to the > appropriate connections. > The current cgroups net_cls concept cannot be applied on the node where the > container is launched, nor on the data node, since: > - TC handles outgoing bandwidth only, so it cannot be applied on the container > node (an HDFS read is incoming data for the container) > - Since the HDFS data node is handled by a single process, it is not possible > to use net_cls to separate connections from different containers to the > datanode. > Tasks: > 1) Extend the Resource model to define a bandwidth enforcement rate > 2) Monitor TCP/IP connections established by the container-handling process and > its child processes > 3) Set Linux Traffic Control rules on the data node based on address:port pairs > in order to enforce the bandwidth of outgoing data > Concept: > http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
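Task 3 of YARN-2681 above (per-connection Traffic Control rules on the data node) could issue `tc` commands along these lines. This is a hedged sketch: the device name, class ids, rate, and the choice of a u32 filter matching the container's address:port are illustrative assumptions, and the commands are built as strings rather than executed so the shape of the rules is easy to inspect:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of shaping a container's HDFS reads by filtering the
// datanode's *outgoing* traffic per destination address:port. Device, class
// ids, and rates are hypothetical; a real patch would derive them from the
// YARN configuration and run them via a privileged executor.
public class TcRuleSketch {
    /** tc commands that cap traffic to one container connection at rateMbit. */
    public static List<String> rulesFor(String dev, String containerAddr,
                                        int containerPort, int rateMbit, int classId) {
        return Arrays.asList(
            // Root HTB qdisc; unfiltered traffic falls through to class 1:10.
            "tc qdisc add dev " + dev + " root handle 1: htb default 10",
            // Per-container class carrying the enforced rate.
            "tc class add dev " + dev + " parent 1: classid 1:" + classId
                + " htb rate " + rateMbit + "mbit",
            // u32 filter: match the connection by destination IP and port.
            "tc filter add dev " + dev + " protocol ip parent 1: prio 1 u32"
                + " match ip dst " + containerAddr + "/32"
                + " match ip dport " + containerPort + " 0xffff"
                + " flowid 1:" + classId);
    }

    public static void main(String[] args) {
        for (String cmd : rulesFor("eth0", "10.0.0.5", 50010, 10, 20)) {
            System.out.println(cmd);
        }
    }
}
```

Because TC only shapes egress, the rules live on the datanode side of the connection, which is exactly why the description rules out net_cls on the container's own node.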
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Attachment: YARN-2198.16.patch .16.patch rebased to current trunk and resolves the conflict from YARN-2720 > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, > YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or a > member of a local Administrators group. Since the process in question is > the NodeManager, the requirement translates to the entire NM running as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. 
My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Attachment: HADOOP-2681.patch - fix findbugs warnings - for testing purposes: + TC class rate can be given by reading an HDFS file defined in the YARN configuration + TC class burst can be defined in the configuration; otherwise a default value will be set when the TC class is added > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control > Design.png > > > To read/write data from HDFS on a data node, applications establish TCP/IP > connections with the datanode. HDFS reads can be controlled by configuring the > Linux Traffic Control (TC) subsystem on the data node to apply filters to the > appropriate connections. > The current cgroups net_cls concept cannot be applied on the node where the > container is launched, nor on the data node, since: > - TC handles outgoing bandwidth only, so it cannot be applied on the container > node (an HDFS read is incoming data for the container) > - Since the HDFS data node is handled by a single process, it is not possible > to use net_cls to separate connections from different containers to the > datanode. > Tasks: > 1) Extend the Resource model to define a bandwidth enforcement rate > 2) Monitor TCP/IP connections established by the container-handling process and > its child processes > 3) Set Linux Traffic Control rules on the data node based on address:port pairs > in order to enforce the bandwidth of outgoing data > Concept: > http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems
[ https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2700: - Attachment: YARN-2700-001.patch Patch. The problem (as explained by Chris Nauroth) is that Windows doesn't reverse-resolve (rDNS) 127.0.0.1 to localhost, so the principals there need to use the raw IP address. > TestSecureRMRegistryOperations failing on windows: auth problems > > > Key: YARN-2700 > URL: https://issues.apache.org/jira/browse/YARN-2700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.6.0 > Environment: Windows Server, Win7 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-2700-001.patch > > > TestSecureRMRegistryOperations failing on windows: unable to create the root > /registry path with permissions problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180012#comment-14180012 ] Hudson commented on YARN-2715: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/]) YARN-2715. Fixed ResourceManager to respect common configurations for proxy users/groups beyond just the YARN level config. Contributed by Zhijie Shen. (vinodkv: rev c0e034336c85296be6f549d88d137fb2b2b79a15) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMProxyUsersConf.java > Proxy user is problem for RPC interface if > yarn.resourcemanager.webapp.proxyuser is not 
set. > > > Key: YARN-2715 > URL: https://issues.apache.org/jira/browse/YARN-2715 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > Fix For: 2.6.0 > > Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, > YARN-2715.4.patch > > > After YARN-2656, if people set hadoop.proxyuser for the client<-->RM RPC > interface, it's not going to work, because ProxyUsers#sip is a singleton per > daemon. After YARN-2656, RM has both channels that want to set this > configuration: RPC and HTTP. RPC interface sets it first by reading > hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to > empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. > The fix for it could be similar to what we've done for YARN-2676: make the > HTTP interface anyway source hadoop.proxyuser first, then > yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
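The precedence described in the YARN-2715 fix above (source the common hadoop.proxyuser settings first, then overlay yarn.resourcemanager.webapp.proxyuser) can be sketched with plain java.util.Properties. This is an illustrative model of the merge order only, not the actual ResourceManager code:

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical model of the YARN-2715 merge order: common hadoop.proxyuser.*
// entries are read first, then webapp-specific entries override them, so an
// absent webapp config no longer wipes out the RPC-side settings.
public class ProxyUserMerge {
    static final String COMMON = "hadoop.proxyuser.";
    static final String WEBAPP = "yarn.resourcemanager.webapp.proxyuser.";

    public static Map<String, String> effectiveProxyConf(Properties conf) {
        Map<String, String> merged = new TreeMap<>();
        for (String k : conf.stringPropertyNames()) {   // pass 1: common prefix
            if (k.startsWith(COMMON)) {
                merged.put(k.substring(COMMON.length()), conf.getProperty(k));
            }
        }
        for (String k : conf.stringPropertyNames()) {   // pass 2: webapp wins
            if (k.startsWith(WEBAPP)) {
                merged.put(k.substring(WEBAPP.length()), conf.getProperty(k));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("hadoop.proxyuser.oozie.hosts", "host1");
        conf.setProperty("hadoop.proxyuser.oozie.groups", "users");
        // No webapp overrides present: the common settings survive intact,
        // instead of being replaced by an empty set.
        System.out.println(effectiveProxyConf(conf));
    }
}
```

The pre-fix behavior corresponds to running only the second pass: with no webapp entries defined, the merged map would come out empty, which is exactly the overwrite the issue describes.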
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180010#comment-14180010 ] Hudson commented on YARN-90: FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/]) YARN-90. NodeManager should identify failed disks becoming good again. Contributed by Varun Vasudev (jlowe: rev 6f2028bd1514d90b831f889fd0ee7f2ba5c15000) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDirectoryCollection.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/TestNonAggregatingLogHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java > NodeManager should identify failed disks becoming good again > > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, > apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, > apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. 
This JIRA is to improve NodeManager to > reuse good disks (which could have been bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180011#comment-14180011 ] Hudson commented on YARN-2721: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/]) YARN-2721. Suppress NodeExist exception thrown by ZKRMStateStore when it retries creating znode. Contributed by Jian He. (zjshen: rev 7e3b5e6f5cb4945b4fab27e8a83d04280df50e17) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java > Race condition: ZKRMStateStore retry logic may throw NodeExist exception > - > > Key: YARN-2721 > URL: https://issues.apache.org/jira/browse/YARN-2721 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Fix For: 2.6.0 > > Attachments: YARN-2721.1.patch > > > Blindly retrying operations in zookeeper will not work for non-idempotent > operations (like create znode). The reason is that the client can do a create > znode, but the response may not be returned because the server can die or > timeout. In case of retrying the create znode, it will throw a NODE_EXISTS > exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
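The fix described above can be illustrated with a self-contained toy (this is not ZooKeeper's real client API; the {{Store}} class, its methods, and the lost-response simulation are all hypothetical stand-ins): retry the create, and suppress the node-exists error only when it follows a failed earlier attempt, since in that case our own create already took effect server-side.

```java
import java.util.HashMap;
import java.util.Map;

public class CreateRetrySketch {
    static class NodeExistsException extends Exception {}

    /** Toy store whose create() is non-idempotent, like creating a znode. */
    static class Store {
        private final Map<String, byte[]> nodes = new HashMap<>();
        private boolean dropNextResponse; // simulate a lost server response

        Store(boolean dropNextResponse) { this.dropNextResponse = dropNextResponse; }

        void create(String path, byte[] data) throws NodeExistsException {
            if (nodes.containsKey(path)) {
                throw new NodeExistsException();
            }
            nodes.put(path, data); // the create takes effect server-side...
            if (dropNextResponse) {
                dropNextResponse = false;
                // ...but the client never sees the response, so it retries.
                throw new RuntimeException("connection lost");
            }
        }

        boolean exists(String path) { return nodes.containsKey(path); }
    }

    /**
     * Retry loop mirroring the YARN-2721 fix: a node-exists error seen on a
     * retry means our earlier create already succeeded, so suppress it.
     */
    static void createWithRetry(Store store, String path, byte[] data) {
        boolean retried = false;
        while (true) {
            try {
                store.create(path, data);
                return;
            } catch (NodeExistsException e) {
                if (retried) {
                    return; // our own earlier create won; treat as success
                }
                throw new IllegalStateException("node existed before first attempt", e);
            } catch (RuntimeException e) {
                retried = true; // transient failure: retry the create
            }
        }
    }

    public static void main(String[] args) {
        Store store = new Store(true);
        createWithRetry(store, "/rmstore/app_1", new byte[] {1});
        System.out.println(store.exists("/rmstore/app_1")); // prints "true"
    }
}
```

In the real ZKRMStateStore the analogous signal is ZooKeeper's KeeperException.NodeExistsException, suppressed inside the store's retry logic.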
[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives
[ https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180009#comment-14180009 ] Hudson commented on YARN-2720: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/]) YARN-2720. Windows: Wildcard classpath variables not expanded against resources contained in archives. Contributed by Craig Welch. (cnauroth: rev 6637e3cf95b3a9be8d6b9cd66bc849a0607e8ed5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Classpath.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java * hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFileUtil.java * hadoop-yarn-project/CHANGES.txt > Windows: Wildcard classpath variables not expanded against resources > contained in archives > -- > > Key: YARN-2720 > URL: https://issues.apache.org/jira/browse/YARN-2720 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Craig Welch >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch > > > On windows there are limitations to the length of command lines and > environment variables which prevent placing all classpath resources into > these elements. Instead, a jar containing only a classpath manifest is > created to provide the classpath. During this process wildcard references > are expanded by inspecting the filesystem. 
Since archives are extracted to a > different location and linked into the final location after the classpath jar > is created, resources referred to via wildcards which exist in localized > archives (.zip, tar.gz) are not added to the classpath manifest jar. Since > these entries are removed from the final classpath for the container they are > not on the container's classpath as they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
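The manifest-only "classpath jar" the description refers to can be sketched with plain JDK classes (class and method names here are illustrative, not the NodeManager's actual helper): the jar carries no class files, only a MANIFEST.MF whose Class-Path header lists the already-expanded entries. Wildcards must be expanded by listing the filesystem before this jar is written, which is exactly why entries that only appear after archives are linked in get missed.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ClasspathJarSketch {
    /** Write a jar whose only payload is a Class-Path manifest header. */
    static File writeClasspathJar(File outDir, String... entries) throws IOException {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        // Class-Path entries are space-separated relative URLs.
        attrs.put(Attributes.Name.CLASS_PATH, String.join(" ", entries));
        File jar = new File(outDir, "classpath.jar");
        try (JarOutputStream jos = new JarOutputStream(new FileOutputStream(jar), manifest)) {
            // No jar entries needed; the manifest alone carries the classpath.
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("cpjar").toFile();
        File jar = writeClasspathJar(dir, "lib/a.jar", "lib/b.jar");
        System.out.println(jar.isFile()); // prints "true"
    }
}
```

The JVM resolves those Class-Path entries relative to the jar's own location, which keeps the launch command short on Windows.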
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180015#comment-14180015 ] Hudson commented on YARN-2709: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/]) YARN-2709. Made timeline client getDelegationToken API retry if ConnectException happens. Contributed by Li Lu. (zjshen: rev b2942762d7f76d510ece5621c71116346a6b12f6) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java > Add retry for timeline client getDelegationToken method > --- > > Key: YARN-2709 > URL: https://issues.apache.org/jira/browse/YARN-2709 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Fix For: 2.6.0 > > Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, > YARN-2709-102114-2.patch, YARN-2709-102114.patch > > > As mentioned in YARN-2673, we need to add retry mechanism to timeline client > for secured clusters. This means if the timeline server is not available, a > timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2714) Localizer thread might stuck if NM is OOM
[ https://issues.apache.org/jira/browse/YARN-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180042#comment-14180042 ] Ming Ma commented on YARN-2714: --- Thanks Zhihai for the information. Yes, setting the RPC timeout at the hadoop-common layer will address the issue. As for the other suggestions, they might be good to have even with the RPC timeout. We can open separate jiras if necessary. > Localizer thread might stuck if NM is OOM > - > > Key: YARN-2714 > URL: https://issues.apache.org/jira/browse/YARN-2714 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ming Ma > > When the NM JVM runs out of memory, it is normally an uncaught exception and the > process will exit. But the RPC server used by the node manager catches > OutOfMemoryError to give GC a chance to catch up, so the NM doesn't need to exit > and can recover from the OutOfMemoryError situation. > However, in some rare situations when this happens, one of the NM localizer > threads didn't get the RPC response from the node manager and just waited there. > The node manager RPC server doesn't respond because the RPC > server responder thread swallowed the OutOfMemoryError and didn't process the > outstanding RPC response. On the RPC client side, the RPC timeout is set to 0, > and it relies on Ping to detect RPC server availability. 
> {noformat} > Thread 481 (LocalizerRunner for container_1413487737702_2948_01_013383): > State: WAITING > Blocked count: 27 > Waited count: 84 > Waiting on org.apache.hadoop.ipc.Client$Call@6be5add3 > Stack: > java.lang.Object.wait(Native Method) > java.lang.Object.wait(Object.java:503) > org.apache.hadoop.ipc.Client.call(Client.java:1396) > org.apache.hadoop.ipc.Client.call(Client.java:1363) > > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > com.sun.proxy.$Proxy36.heartbeat(Unknown Source) > > org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62) > > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235) > > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) > > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:107) > > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:995) > {noformat} > The consequence of this depends on which ContainerExecutor the NM uses. If it > uses DefaultContainerExecutor, given its startLocalizer method is > synchronized, it will block other localizer threads. If you use > LinuxContainerExecutor, at least other localizer threads can still proceed. > But in theory it can slowly drain all available localizer threads. > There are a couple of ways to fix it. Some of these fixes are complementary. > 1. Fix it at the hadoop-common layer. It seems an RPC server hosted by worker > services such as the NM doesn't really need to catch OutOfMemoryError; the > service JVM can just exit. Even for the NN and RM, given we have HA, it might > be ok to do so. > 2. 
Set the RPC timeout at the HadoopYarnProtoRPC layer so that all YARN clients will > time out if the RPC server drops the response. > 3. Fix it at the YARN localization service. For example, > a) Fix DefaultContainerExecutor so that synchronization isn't required for > the startLocalizer method. > b) The download executor thread used by ContainerLocalizer currently catches all > exceptions. We can fix ContainerLocalizer so that when the download executor > thread catches OutOfMemoryError, it exits its host process. > IMHO, fixing it at the RPC server layer is better as it addresses other scenarios. > Appreciate any input others might have. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Attachment: yarn-site.xml 1) Configuration for testing HDFS Bandwidth Enforcement (yarn-site.xml) - Enable enforcement: yarn.nodemanager.hdfs-bandwidth-enforcement.enable = true - Port on which Datanodes are listening: yarn.nodemanager.hdfs-bandwidth-enforcement.port = 50010 - Device list of the data node machine: yarn.nodemanager.hdfs-bandwidth-enforcement.devices = lo, eth0 - Interval for checking new tc config from persistence (ms): yarn.nodemanager.hdfs-bandwidth-enforcement.check-tc-config-interval = 1000 - Since only the API for Resource has been upgraded to get/set HDFS Bandwidth Enforcement, but the ResourceRequest has not been implemented yet, for testing purposes the rate and burst used to define the tc class can be given in the YARN configuration file: + test rate will be written to an HDFS file: yarn.nodemanager.hdfs-bandwidth-enforcement.test-rate-file = test-rate-file + test rate file example content: 30mbps (rate unit: kbps, mbps, kbit, mbit. See also: http://lartc.org/manpages/tc.txt) + test burst: yarn.nodemanager.hdfs-bandwidth-enforcement.test-burst = 50Kb 2) The patch is tested by running TestDFSIO-read > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control > Design.png, yarn-site.xml > > > To read/write data from HDFS on a data node, applications establish TCP/IP > connections with the datanode. The HDFS read can be controlled by setting up the > Linux Traffic Control (TC) subsystem on the data node to apply filters to the > appropriate connections. 
> The current cgroups net_cls concept cannot be applied on the node where the > container is launched, nor on the data node, since: > - TC handles outgoing bandwidth only, so it cannot be set on the container node > (HDFS read = incoming data for the container) > - Since the HDFS data node is handled by only one process, it is not possible > to use net_cls to separate connections from different containers to the > datanode. > Tasks: > 1) Extend the Resource model to define the bandwidth enforcement rate > 2) Monitor TCP/IP connections established by the container handling process and > its child processes > 3) Set Linux Traffic Control rules on the data node based on address:port pairs in > order to enforce bandwidth of outgoing data > Concept: > http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
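Gathered into yarn-site.xml form, the test configuration listed in the comment above would look like the following (property names and values are exactly those given in the comment; this is just the XML rendering):

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.hdfs-bandwidth-enforcement.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.hdfs-bandwidth-enforcement.port</name>
    <value>50010</value>
  </property>
  <property>
    <name>yarn.nodemanager.hdfs-bandwidth-enforcement.devices</name>
    <value>lo,eth0</value>
  </property>
  <property>
    <name>yarn.nodemanager.hdfs-bandwidth-enforcement.check-tc-config-interval</name>
    <value>1000</value>
  </property>
  <property>
    <name>yarn.nodemanager.hdfs-bandwidth-enforcement.test-rate-file</name>
    <value>test-rate-file</value>
  </property>
  <property>
    <name>yarn.nodemanager.hdfs-bandwidth-enforcement.test-burst</name>
    <value>50Kb</value>
  </property>
</configuration>
```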
[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180047#comment-14180047 ] Hadoop QA commented on YARN-2681: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676351/yarn-site.xml against trunk revision 85a8864. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5496//console This message is automatically generated. > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control > Design.png, yarn-site.xml > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180061#comment-14180061 ] Sunil G commented on YARN-2647: --- Hi [~mayank_bansal] Sorry for the delay here; I have done some groundwork on this. I was taking the mapred queue CLI changes to YARN, namely using *GetQueueInfoRequest* and *GetQueueInfoResponse*. I have added the node-label-related information to the response object and want to take it back to the client. As for APIs, YarnClientImpl already has APIs like getQueueInfo and getQueueAclsInfo, etc. I wanted to merge all these under a "yarn queue" command followed by the queue name. The options can be *queue-acl*, *node-label*, or *all* (which prints all information in queueInfo). I may need one more day to upload this patch; kindly suggest if the approach is fine, and also if it's needed before that. > Add yarn queue CLI to get queue info including labels of such queue > --- > > Key: YARN-2647 > URL: https://issues.apache.org/jira/browse/YARN-2647 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Wangda Tan >Assignee: Sunil G > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180066#comment-14180066 ] Hadoop QA commented on YARN-2198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676333/YARN-2198.16.patch against trunk revision 85a8864. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5493//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5493//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5493//console This message is automatically generated. 
> Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, > YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However, this > executor requires the process launching the container to be LocalSystem or a > member of the local Administrators group. Since the process in question is > the NodeManager, the requirement translates to the entire NM running as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low-privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). 
The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems
[ https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180069#comment-14180069 ] Hadoop QA commented on YARN-2700: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676344/YARN-2700-001.patch against trunk revision 85a8864. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5495//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5495//console This message is automatically generated. > TestSecureRMRegistryOperations failing on windows: auth problems > > > Key: YARN-2700 > URL: https://issues.apache.org/jira/browse/YARN-2700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.6.0 > Environment: Windows Server, Win7 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-2700-001.patch > > > TestSecureRMRegistryOperations failing on windows: unable to create the root > /registry path with permissions problems. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Attachment: yarn-site.xml.example > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control > Design.png, yarn-site.xml.example > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems
[ https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated YARN-2700: Hadoop Flags: Reviewed +1 for the patch, pending Jenkins. Thanks for the fix, Steve. > TestSecureRMRegistryOperations failing on windows: auth problems > > > Key: YARN-2700 > URL: https://issues.apache.org/jira/browse/YARN-2700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.6.0 > Environment: Windows Server, Win7 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-2700-001.patch > > > TestSecureRMRegistryOperations failing on windows: unable to create the root > /registry path with permissions problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Attachment: (was: yarn-site.xml) > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control > Design.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems
[ https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180086#comment-14180086 ] Chris Nauroth commented on YARN-2700: - bq. ...pending Jenkins... Never mind. It looks like Jenkins and I had a race condition commenting. :-) You have a full +1 from me now. > TestSecureRMRegistryOperations failing on windows: auth problems > > > Key: YARN-2700 > URL: https://issues.apache.org/jira/browse/YARN-2700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.6.0 > Environment: Windows Server, Win7 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-2700-001.patch > > > TestSecureRMRegistryOperations failing on windows: unable to create the root > /registry path with permissions problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180089#comment-14180089 ] Hadoop QA commented on YARN-2681: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676356/yarn-site.xml.example against trunk revision 85a8864. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5497//console This message is automatically generated. > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control > Design.png, yarn-site.xml.example > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180119#comment-14180119 ] zhihai xu commented on YARN-2701: - One nit in the addendum patch: Can we change {code} if (stat(path, &sb) == 0) { if (check_dir(path, sb.st_mode, perm, 1) == -1) { return -1; } return 0; } {code} to {code} if (stat(path, &sb) == 0) { return check_dir(path, sb.st_mode, perm, 1); } {code} > Potential race condition in startLocalizer when using LinuxContainerExecutor > -- > > Key: YARN-2701 > URL: https://issues.apache.org/jira/browse/YARN-2701 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Xuan Gong >Assignee: Xuan Gong >Priority: Blocker > Fix For: 2.6.0 > > Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, > YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, > YARN-2701.addendum.1.patch > > > When using LinuxContainerExecutor to do startLocalizer, we use the native code in > container-executor.c. > {code} > if (stat(npath, &sb) != 0) { >if (mkdir(npath, perm) != 0) { > {code} > We use a check-then-create method to create the appDir under /usercache. > But if two containers try to do this at the same time, a race > condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180130#comment-14180130 ] Jason Lowe commented on YARN-2010: -- We recently ran into a case where an application tried to recover with an expired token and the InvalidToken exception thrown by the delegation token secret manager for this application prevented the RM from coming up. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Karthik Kambatla >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch, yarn-2010-3.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. 
> {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
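The resilience being asked for here — one unrecoverable app attempt should not keep the RM from becoming active — can be sketched as a per-application try/catch in the recovery loop. This is a hypothetical standalone sketch: the class and method names are illustrative and this is not the actual RMAppManager code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of per-application fault isolation during RM recovery.
// recoverAll continues past individual failures so the RM can still become
// active; "corrupt" stands in for apps whose stored credentials are invalid
// (e.g. the "Missing argument" SecretKeySpec failure in the stack trace).
public class ResilientRecovery {
    static List<String> recoverAll(List<String> appIds, Set<String> corrupt) {
        List<String> failedApps = new ArrayList<>();
        for (String appId : appIds) {
            try {
                if (corrupt.contains(appId)) {
                    throw new IllegalArgumentException("Missing argument");
                }
                // normal per-app recovery would restore the attempt here
            } catch (RuntimeException e) {
                // Isolate the failure: mark this app failed and keep going
                // instead of aborting the whole transition to active.
                failedApps.add(appId);
            }
        }
        return failedApps;
    }
}
```

With this shape, a corrupt app is reported as failed while every other app still recovers, which matches the issue's thesis that "YARN should still start".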
[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180134#comment-14180134 ] Hadoop QA commented on YARN-2681: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676343/HADOOP-2681.patch against trunk revision 85a8864. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 3 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5494//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5494//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5494//console This message is automatically generated. 
> Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control > Design.png, yarn-site.xml.example > > > To read/write data from HDFS on a data node, applications establish TCP/IP > connections with the datanode. The HDFS read can be controlled by configuring the > Linux Traffic Control (TC) subsystem on the data node to apply filters to the > appropriate connections. > The current cgroups net_cls concept cannot be applied on the node where the > container is launched, nor on the data node, since: > - TC handles outgoing bandwidth only, so it cannot be set on the container node > (HDFS read = incoming data for the container) > - Since the HDFS data node is handled by only one process, it is not possible > to use net_cls to separate connections from different containers to the > datanode. > Tasks: > 1) Extend the Resource model to define a bandwidth enforcement rate > 2) Monitor TCP/IP connections established by the container handling process and > its child processes > 3) Set Linux Traffic Control rules on the data node based on address:port pairs in > order to enforce the bandwidth of outgoing data > Concept: > http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180142#comment-14180142 ] Karthik Kambatla commented on YARN-2010: I should have an updated patch with tests later today. Would be nice to fix this for 2.6. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Karthik Kambatla >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch, yarn-2010-3.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 
8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180144#comment-14180144 ] Ming Ma commented on YARN-2578: --- Yeah, it is more than just * -> RM, it could be * -> NM and * -> AM. Agree it is better to fix it at the hadoop common layer. From HDFS-4858, it looks like the concern with fixing it at the hadoop common layer is the test coverage. Is there any follow-up on hadoop common? Perhaps we can fix the hadoop common layer so that the rpc timeout is still off by default; but if ping is set to false, then the rpc timeout will be set to the ping value in the code Karthik refers to. In that way, YARN and MR don't need to change and people can experiment with the rpc timeout. After enough test coverage, we can then set the ping default value to false. > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg > Attachments: YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected as expected. The NM should then re-register > with the new active RM. This re-register takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck.
> Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DN's fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. The main thread > which should be used in the re-register is the "Node Status Updater" This > thread is stuck in: > {code} > "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in > Object.wait() [0x7f5a51fc1000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at java.lang.Object.wait(Object.java:503) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at org.apache.hadoop.ipc.Client.call(Client.java:1362) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > {code} > The client connection which goes through the proxy can be traced back to the > ResourceTrackerPBClientImpl. The generated proxy does not time out and we > should be using a version which takes the RPC timeout (from the > configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
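The fallback described in the comment above — leave the rpc timeout off by default, but derive it from the ping interval when ping is disabled — can be sketched as a small selection rule. This is a hypothetical illustration of the proposed policy only; the class and method names are invented and this is not the actual org.apache.hadoop.ipc.Client logic.

```java
// Hypothetical sketch of the timeout-selection rule proposed above:
// an explicitly configured rpc timeout always wins; otherwise the timeout
// stays off (0) while ping is enabled, and falls back to the ping interval
// when ping is disabled, so a dead connection cannot block a caller forever.
public class RpcTimeoutPolicy {
    static int effectiveTimeoutMs(boolean pingEnabled, int pingIntervalMs,
                                  int configuredTimeoutMs) {
        if (configuredTimeoutMs > 0) {
            return configuredTimeoutMs; // explicit setting always wins
        }
        return pingEnabled ? 0 : pingIntervalMs;
    }
}
```

Under this rule the stuck nodeHeartbeat call in the trace above would eventually time out once ping is disabled, instead of waiting indefinitely on the Client$Call monitor.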
[jira] [Updated] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2701: Attachment: YARN-2701.addendum.2.patch > Potential race condition in startLocalizer when using LinuxContainerExecutor > -- > > Key: YARN-2701 > URL: https://issues.apache.org/jira/browse/YARN-2701 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Xuan Gong >Assignee: Xuan Gong >Priority: Blocker > Fix For: 2.6.0 > > Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, > YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, > YARN-2701.addendum.1.patch, YARN-2701.addendum.2.patch > > > When using LinuxContainerExecutor to do startLocalizer, we use the native code in > container-executor.c. > {code} > if (stat(npath, &sb) != 0) { >if (mkdir(npath, perm) != 0) { > {code} > We use a check-then-create method to create the appDir under /usercache. > But if two containers try to do this at the same time, a race > condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180174#comment-14180174 ] Xuan Gong commented on YARN-2701: - [~zxu] Thanks for reviewing this patch again. The new patch addresses your comment. > Potential race condition in startLocalizer when using LinuxContainerExecutor > -- > > Key: YARN-2701 > URL: https://issues.apache.org/jira/browse/YARN-2701 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Xuan Gong >Assignee: Xuan Gong >Priority: Blocker > Fix For: 2.6.0 > > Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, > YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, > YARN-2701.addendum.1.patch, YARN-2701.addendum.2.patch > > > When using LinuxContainerExecutor to do startLocalizer, we use the native code in > container-executor.c. > {code} > if (stat(npath, &sb) != 0) { >if (mkdir(npath, perm) != 0) { > {code} > We use a check-then-create method to create the appDir under /usercache. > But if two containers try to do this at the same time, a race > condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
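For context, the usual remedy for this kind of check-then-create race is to attempt the create unconditionally and treat "already exists" as success. A Java analog of that pattern is sketched below as an illustration only — the real fix for this issue lives in the native container-executor.c, and the class here is hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Java analog of the container-executor race: two containers can both see
// the appDir as missing, and the loser's mkdir then fails. Creating first
// and tolerating FileAlreadyExistsException makes the call idempotent.
public class SafeMkdir {
    static boolean ensureDir(Path dir) {
        try {
            Files.createDirectory(dir);
            return true; // we created it
        } catch (FileAlreadyExistsException e) {
            // Another process won the race; that is fine as long as the
            // existing entry really is a directory.
            return Files.isDirectory(dir);
        } catch (IOException e) {
            return false; // genuine I/O failure (permissions, missing parent)
        }
    }

    // Demonstrates idempotence: a second ensureDir on the same path succeeds.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("usercache").resolve("appdir");
            return ensureDir(dir) && ensureDir(dir);
        } catch (IOException e) {
            return false;
        }
    }
}
```

The second call simulates the losing container: instead of a hard failure, it observes the directory already exists and proceeds.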
[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port
[ https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180206#comment-14180206 ] Wangda Tan commented on YARN-2723: -- [~Naganarasimha], are there any updates on this patch? It should be a one-line fix with a new test; if you didn't start on it, I can take it over. Thanks. > rmadmin -replaceLabelsOnNode does not correctly parse port > -- > > Key: YARN-2723 > URL: https://issues.apache.org/jira/browse/YARN-2723 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Phil D'Amore >Assignee: Naganarasimha G R > > There is an off-by-one issue in RMAdminCLI.java (line 457): > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":"))); > should probably be: > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); > Currently attempting to add a label to a node with a port specified looks > like this: > [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode > node.example.com:45454,test-label > replaceLabelsOnNode: For input string: ":45454" > Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 > node2:port,label1,label2]] > It appears to be trying to parse the ':' as part of the integer because the > substring index is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port
[ https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180244#comment-14180244 ] Naganarasimha G R commented on YARN-2723: - It's almost done; I will attach the patch in an hour. > rmadmin -replaceLabelsOnNode does not correctly parse port > -- > > Key: YARN-2723 > URL: https://issues.apache.org/jira/browse/YARN-2723 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Phil D'Amore >Assignee: Naganarasimha G R > > There is an off-by-one issue in RMAdminCLI.java (line 457): > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":"))); > should probably be: > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); > Currently attempting to add a label to a node with a port specified looks > like this: > [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode > node.example.com:45454,test-label > replaceLabelsOnNode: For input string: ":45454" > Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 > node2:port,label1,label2]] > It appears to be trying to parse the ':' as part of the integer because the > substring index is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
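The off-by-one described in this issue is easy to reproduce standalone. A minimal sketch — the `parsePort` helper is hypothetical and only mirrors the logic at RMAdminCLI.java line 457, it is not the actual class:

```java
// Hypothetical standalone illustration of the RMAdminCLI off-by-one.
// substring(idx) keeps the leading ":" and makes Integer.valueOf throw
// NumberFormatException ("For input string: \":45454\"");
// substring(idx + 1) starts just past the separator and parses cleanly.
public class NodeIdParser {
    static int parsePort(String nodeIdStr) {
        int idx = nodeIdStr.indexOf(":");
        return Integer.valueOf(nodeIdStr.substring(idx + 1));
    }
}
```

With the `+ 1`, "node.example.com:45454" yields the port 45454 rather than the NumberFormatException shown in the report.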
[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port
[ https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180248#comment-14180248 ] Wangda Tan commented on YARN-2723: -- Thanks :) > rmadmin -replaceLabelsOnNode does not correctly parse port > -- > > Key: YARN-2723 > URL: https://issues.apache.org/jira/browse/YARN-2723 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Phil D'Amore >Assignee: Naganarasimha G R > > There is an off-by-one issue in RMAdminCLI.java (line 457): > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":"))); > should probably be: > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); > Currently attempting to add a label to a node with a port specified looks > like this: > [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode > node.example.com:45454,test-label > replaceLabelsOnNode: For input string: ":45454" > Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 > node2:port,label1,label2]] > It appears to be trying to parse the ':' as part of the integer because the > substring index is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180304#comment-14180304 ] Wangda Tan commented on YARN-2647: -- [~sunilg], Thanks for working on this item. bq. Namely using the GetQueueInfoRequest and GetQueueInfoResponse. I think we may not need an extra PB object; {{org.apache.hadoop.yarn.api.records.QueueInfo}} already has the labels and default-label-expression fields. bq. I wanted to merge all these under "yarn queue " command followed by queue name. Agree on merging all of these to "yarn queue". I think by default, "yarn queue -list" should list all information of all queues. And if the user wants to see a specific queue, he/she can use "yarn queue -list ". Extra options like -queue-acl and -node-label can be applied if he/she wants to see only some specific field(s). Wangda > Add yarn queue CLI to get queue info including labels of such queue > --- > > Key: YARN-2647 > URL: https://issues.apache.org/jira/browse/YARN-2647 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Wangda Tan >Assignee: Sunil G > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port
[ https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2723: Attachment: YARN-2723.20141023.1.patch attaching patch for this issue > rmadmin -replaceLabelsOnNode does not correctly parse port > -- > > Key: YARN-2723 > URL: https://issues.apache.org/jira/browse/YARN-2723 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Phil D'Amore >Assignee: Naganarasimha G R > Attachments: YARN-2723.20141023.1.patch > > > There is an off-by-one issue in RMAdminCLI.java (line 457): > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":"))); > should probably be: > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); > Currently attempting to add a label to a node with a port specified looks > like this: > [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode > node.example.com:45454,test-label > replaceLabelsOnNode: For input string: ":45454" > Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 > node2:port,label1,label2]] > It appears to be trying to parse the ':' as part of the integer because the > substring index is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed
Naganarasimha G R created YARN-2727: --- Summary: In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed Key: YARN-2727 URL: https://issues.apache.org/jira/browse/YARN-2727 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Minor In the org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display, "yarn.node-labels.fs-store.uri" is being used instead of "yarn.node-labels.fs-store.root-dir". Some modifications to the description are also needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed
[ https://issues.apache.org/jira/browse/YARN-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2727: Attachment: YARN-2727.20141023.1.patch attaching patch > In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", > "yarn.node-labels.fs-store.uri" is being displayed > > > Key: YARN-2727 > URL: https://issues.apache.org/jira/browse/YARN-2727 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Minor > Attachments: YARN-2727.20141023.1.patch > > > In org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display instead of > "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is > being used > And also some modifications for the description -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180437#comment-14180437 ] Jian He commented on YARN-2198: --- Not sure if the test failure is related. Re-triggering Jenkins. > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, > YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or a > member of a local Administrators group. Since the process in question is > the NodeManager, the requirement translates to the entire NM running as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. 
My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed
[ https://issues.apache.org/jira/browse/YARN-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180456#comment-14180456 ] Wangda Tan commented on YARN-2727: -- [~Naganarasimha], Thanks for the patch, some comments: 1) bq. +port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); As a convention, please leave a space before and after "+" 2) {code} +// no labels, should fail +args = new String[] { "-replaceLabelsOnNode" }; +assertTrue(0 != rmAdminCLI.run(args)); + +// no labels, should fail +args = +new String[] { "-replaceLabelsOnNode", +"-directlyAccessNodeLabelStore" }; +assertTrue(0 != rmAdminCLI.run(args)); {code} These two checks are already covered by {{testReplaceLabelsOnNode}} Thanks, > In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", > "yarn.node-labels.fs-store.uri" is being displayed > > > Key: YARN-2727 > URL: https://issues.apache.org/jira/browse/YARN-2727 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Minor > Attachments: YARN-2727.20141023.1.patch > > > In org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display instead of > "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is > being used > And also some modifications for the description -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2728) Support for disabling the Centralized NodeLabel validation in Distributed Node Label Configuration setup
Naganarasimha G R created YARN-2728: --- Summary: Support for disabling the Centralized NodeLabel validation in Distributed Node Label Configuration setup Key: YARN-2728 URL: https://issues.apache.org/jira/browse/YARN-2728 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Naganarasimha G R Currently, without a Central List of Valid Labels, the Capacity Scheduler will not be able to work (a user cannot specify capacity for an unknown node-label for a queue, etc.). But without disabling the central label validation, the Distributed Node Label configuration feature is not complete, so we need to support this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port
[ https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180458#comment-14180458 ] Wangda Tan commented on YARN-2723: -- [~Naganarasimha] Thanks for the patch, some comments: 1) bq. + port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); As convention, please leave a space before and after "+" 2) {code} +// no labels, should fail +args = new String[] { "-replaceLabelsOnNode" }; +assertTrue(0 != rmAdminCLI.run(args)); + +// no labels, should fail +args = +new String[] { "-replaceLabelsOnNode", +"-directlyAccessNodeLabelStore" }; +assertTrue(0 != rmAdminCLI.run(args)); {code} These two checks were already included in testReplaceLabelsOnNode. Thanks, > rmadmin -replaceLabelsOnNode does not correctly parse port > -- > > Key: YARN-2723 > URL: https://issues.apache.org/jira/browse/YARN-2723 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Phil D'Amore >Assignee: Naganarasimha G R > Attachments: YARN-2723.20141023.1.patch > > > There is an off-by-one issue in RMAdminCLI.java (line 457): > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":"))); > should probably be: > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); > Currently attempting to add a label to a node with a port specified looks > like this: > [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode > node.example.com:45454,test-label > replaceLabelsOnNode: For input string: ":45454" > Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 > node2:port,label1,label2]] > It appears to be trying to parse the ':' as part of the integer because the > substring index is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
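For readers following along, the off-by-one in the quoted description can be reproduced and fixed in isolation. This is a minimal standalone sketch with hypothetical names, not the actual RMAdminCLI code:

```java
// Minimal sketch of the host:port parsing bug and its fix.
// parsePortBuggy mirrors the original RMAdminCLI line; parsePortFixed adds the +1.
public class NodeIdPortParse {
    // Buggy: the substring starts AT the ':', so Integer.valueOf sees ":45454".
    static int parsePortBuggy(String nodeIdStr) {
        return Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")));
    }

    // Fixed: skip past the ':' before parsing the port.
    static int parsePortFixed(String nodeIdStr) {
        return Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":") + 1));
    }

    public static void main(String[] args) {
        String nodeId = "node.example.com:45454";
        System.out.println(parsePortFixed(nodeId)); // 45454
        try {
            parsePortBuggy(nodeId);
        } catch (NumberFormatException e) {
            // Reproduces the reported failure: For input string: ":45454"
            System.out.println(e.getMessage());
        }
    }
}
```

Running the sketch prints the parsed port and then the same `For input string: ":45454"` error shown in the bug report.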
[jira] [Commented] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed
[ https://issues.apache.org/jira/browse/YARN-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180457#comment-14180457 ] Wangda Tan commented on YARN-2727: -- Oh sorry, this comment is for YARN-2723, please ignore above comment > In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", > "yarn.node-labels.fs-store.uri" is being displayed > > > Key: YARN-2727 > URL: https://issues.apache.org/jira/browse/YARN-2727 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Minor > Attachments: YARN-2727.20141023.1.patch > > > In org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display instead of > "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is > being used > And also some modifications for the description -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
Naganarasimha G R created YARN-2729: --- Summary: Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup Key: YARN-2729 URL: https://issues.apache.org/jira/browse/YARN-2729 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems
[ https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180468#comment-14180468 ] Hudson commented on YARN-2700: -- FAILURE: Integrated in Hadoop-trunk-Commit #6313 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6313/]) YARN-2700 TestSecureRMRegistryOperations failing on windows: auth problems (stevel: rev 90e5ca24fbd3bb2da2a3879cc9b73f0b1d7f3e03) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/AbstractSecureRegistryTest.java > TestSecureRMRegistryOperations failing on windows: auth problems > > > Key: YARN-2700 > URL: https://issues.apache.org/jira/browse/YARN-2700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.6.0 > Environment: Windows Server, Win7 >Reporter: Steve Loughran >Assignee: Steve Loughran > Fix For: 2.6.0 > > Attachments: YARN-2700-001.patch > > > TestSecureRMRegistryOperations failing on windows: unable to create the root > /registry path with permissions problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2729: Attachment: YARN-2729.20141023-1.patch Attaching the WIP patch for this part of the issue... > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup > --- > > Key: YARN-2729 > URL: https://issues.apache.org/jira/browse/YARN-2729 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-2729.20141023-1.patch > > > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port
[ https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180484#comment-14180484 ] Hadoop QA commented on YARN-2723: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676406/YARN-2723.20141023.1.patch against trunk revision d67214f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5498//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5498//console This message is automatically generated. 
> rmadmin -replaceLabelsOnNode does not correctly parse port > -- > > Key: YARN-2723 > URL: https://issues.apache.org/jira/browse/YARN-2723 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Phil D'Amore >Assignee: Naganarasimha G R > Attachments: YARN-2723.20141023.1.patch > > > There is an off-by-one issue in RMAdminCLI.java (line 457): > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":"))); > should probably be: > port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1)); > Currently attempting to add a label to a node with a port specified looks > like this: > [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode > node.example.com:45454,test-label > replaceLabelsOnNode: For input string: ":45454" > Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 > node2:port,label1,label2]] > It appears to be trying to parse the ':' as part of the integer because the > substring index is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180493#comment-14180493 ] Hadoop QA commented on YARN-2198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676333/YARN-2198.16.patch against trunk revision d67214f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarnTests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5499//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5499//console This message is automatically generated. 
> Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, > YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or a > member of the a local Administrators group. Since the process in question is > the NodeManager, the requirement translates to the entire NM to run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). 
The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed
[ https://issues.apache.org/jira/browse/YARN-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180494#comment-14180494 ] Hadoop QA commented on YARN-2727: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676407/YARN-2727.20141023.1.patch against trunk revision d67214f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.TestApplicationMasterServiceProtocolOnHA org.apache.hadoop.yarn.client.TestGetGroups The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarnTests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5500//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5500//console This message is automatically generated. 
> In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", > "yarn.node-labels.fs-store.uri" is being displayed > > > Key: YARN-2727 > URL: https://issues.apache.org/jira/browse/YARN-2727 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Minor > Attachments: YARN-2727.20141023.1.patch > > > In org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display instead of > "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is > being used > And also some modifications for the description -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2495: Attachment: YARN-2495.20141023-1.patch As per Wangda's comments: 1> raised a new jira for "disable central node label configuration" 2> removed modifications for CommonNodeLabelsManager from the current jira 3> moved ScriptNodeLabelProvider to a separate jira for better review. Currently attached a WIP patch (the earlier patch was bifurcated into 2 jiras, YARN-2495 and YARN-2729). Will update with the actual patch at the earliest. > Allow admin specify labels in each NM (Distributed configuration) > - > > Key: YARN-2495 > URL: https://issues.apache.org/jira/browse/YARN-2495 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R > Attachments: YARN-2495.20141023-1.patch, YARN-2495_20141022.1.patch > > > Target of this JIRA is to allow admin specify labels in each NM, this covers > - User can set labels in each NM (by setting yarn-site.xml or using script > suggested by [~aw]) > - NM will send labels to RM via ResourceTracker API > - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2724: -- Target Version/s: (was: 2.5.1) bq. As the log aggregation is done by NM user, giving it the permissions to access the generated log file should fix this issue. Agreed. I guess the problem that YARN should address is to surface the issue with aggregation to the end-user - right now it's not clear what really happened. > If an unreadable file is encountered during log aggregation then aggregated > file in HDFS badly formed > - > > Key: YARN-2724 > URL: https://issues.apache.org/jira/browse/YARN-2724 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 2.5.1 >Reporter: Sumit Mohanty >Assignee: Xuan Gong > > Look into the log output snippet. It looks like there is an issue during > aggregation when an unreadable file is encountered. Likely, this results in > bad encoding. > {noformat} > LogType: command-13.json > LogLength: 13934 > Log Contents: > Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json > (Permission denied)command-3.json13983Error aggregating log file. 
Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json > (Permission denied) > > errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: > [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K->15575K(184320K), > 0.0488700 secs] 163840K->15575K(1028096K), 0.0492510 secs] [Times: user=0.06 > sys=0.01, real=0.05 secs] > 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: > [ParNew: 179415K->11865K(184320K), 0.0941310 secs] 179415K->17228K(1028096K), > 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] > 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: > 95.187: [ParNew: 175705K->12802K(184320K), 0.0466420 secs] > 181068K->18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, > real=0.04 secs] > {noformat} > Specifically, look at the text after the exception text. There should be two > more entries for log files but none exist. This is likely due to the fact > that command-13.json is expected to be of length 13934 but its is not as the > file was never read. > I think, it should have been > {noformat} > LogType: command-13.json > LogLength: > Log Contents: > Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json > (Permission denied)command-3.json13983Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json > (Permission denied) > {noformat} > {noformat} > LogType: errors-3.txt > LogLength:0 > Log Contents: > {noformat} > {noformat} > LogType:gc.log > LogLength:??? 
> Log Contents: > ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: > [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2503) Changes in RM Web UI to better show labels to end users
[ https://issues.apache.org/jira/browse/YARN-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2503: - Description: Include but not limited to: - Show labels of nodes in RM/nodes page - Show labels of queue in RM/scheduler page was: Include but not limited to: - Show labels of nodes in RM/nodes page - Show labels of queue in RM/scheduler page - Warn user/admin if capacity of queue cannot be guaranteed according to mis config of labels. > Changes in RM Web UI to better show labels to end users > --- > > Key: YARN-2503 > URL: https://issues.apache.org/jira/browse/YARN-2503 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2503.patch > > > Include but not limited to: > - Show labels of nodes in RM/nodes page > - Show labels of queue in RM/scheduler page -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2503) Changes in RM Web UI to better show labels to end users
[ https://issues.apache.org/jira/browse/YARN-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2503: - Attachment: YARN-2503-20141022-1.patch Attached an updated patch > Changes in RM Web UI to better show labels to end users > --- > > Key: YARN-2503 > URL: https://issues.apache.org/jira/browse/YARN-2503 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2503-20141022-1.patch, YARN-2503.patch > > > Include but not limited to: > - Show labels of nodes in RM/nodes page > - Show labels of queue in RM/scheduler page -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180554#comment-14180554 ] Xuan Gong commented on YARN-2724: - As [~mitdesai] mentioned, "the problem here is due to calculation of file length before even trying to open the file. Log aggregator reads the file length of the log file that is to be aggregated and records it. Then it tries to go and read the file contents." For the issue reported by [~sumitmohanty], it is because of file permission. We can not aggregate the log file. Looking at the code {code} final long fileLength = logFile.length(); // Write the logFile Type out.writeUTF(logFile.getName()); // Write the log length as UTF so that it is printable out.writeUTF(String.valueOf(fileLength)); // Write the log itself FileInputStream in = null; try { in = SecureIOUtils.openForRead(logFile, getUser(), null); byte[] buf = new byte[65535]; int len = 0; long bytesLeft = fileLength; while ((len = in.read(buf)) != -1) { //If buffer contents within fileLength, write if (len < bytesLeft) { out.write(buf, 0, len); bytesLeft-=len; } //else only write contents within fileLength, then exit early else { out.write(buf, 0, (int)bytesLeft); break; } } long newLength = logFile.length(); if(fileLength < newLength) { LOG.warn("Aggregated logs truncated by approximately "+ (newLength-fileLength) +" bytes."); } this.uploadedFiles.add(logFile); } catch (IOException e) { String message = "Error aggregating log file. Log file : " + logFile.getAbsolutePath() + e.getMessage(); LOG.error(message, e); out.write(message.getBytes()); } finally { if (in != null) { in.close(); } } {code} Excluding the permission issue, there will be more issues which can cause the same problem. 
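The flaw described above — the length header is computed before the file is successfully opened, so a failed read leaves the header out of sync with the bytes actually written — can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual AggregatedLogFormat code; it buffers the payload in memory purely to keep the sketch short, whereas a real fix would still need to stream large logs:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Sketch: decide what the entry's contents are (real bytes or an error
// message) BEFORE writing the length header, so the declared length always
// matches what follows it in the aggregated stream.
public class LogAggregationSketch {
    static void writeEntry(DataOutputStream out, File logFile) throws IOException {
        byte[] payload;
        try (FileInputStream in = new FileInputStream(logFile)) { // open first
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] b = new byte[65535];
            int len;
            while ((len = in.read(b)) != -1) {
                buf.write(b, 0, len);
            }
            payload = buf.toByteArray();
        } catch (IOException e) {
            // On failure the error text itself becomes the payload, so the
            // length header written below still matches the bytes emitted.
            payload = ("Error aggregating log file. Log file : "
                    + logFile.getAbsolutePath() + " " + e.getMessage())
                    .getBytes(StandardCharsets.UTF_8);
        }
        out.writeUTF(logFile.getName());                 // LogType
        out.writeUTF(String.valueOf(payload.length));    // LogLength, now accurate
        out.write(payload);                              // Log Contents
    }
}
```

The key design point is the ordering: the original code writes the name and the pre-computed length, and only then discovers it cannot read the file, leaving a length that describes bytes that were never written.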
> If an unreadable file is encountered during log aggregation then aggregated > file in HDFS badly formed > - > > Key: YARN-2724 > URL: https://issues.apache.org/jira/browse/YARN-2724 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 2.5.1 >Reporter: Sumit Mohanty >Assignee: Xuan Gong > > Look into the log output snippet. It looks like there is an issue during > aggregation when an unreadable file is encountered. Likely, this results in > bad encoding. > {noformat} > LogType: command-13.json > LogLength: 13934 > Log Contents: > Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json > (Permission denied)command-3.json13983Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json > (Permission denied) > > errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: > [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K->15575K(184320K), > 0.0488700 secs] 163840K->15575K(1028096K), 0.0492510 secs] [Times: user=0.06 > sys=0.01, real=0.05 secs] > 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: > [ParNew: 179415K->11865K(184320K), 0.0941310 secs] 179415K->17228K(1028096K), > 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] > 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: > 95.187: [ParNew: 175705K->12802K(184320K), 0.0466420 secs] > 181068K->18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, > real=0.04 secs] > {noformat} > Specifically, look at the text after the exception text. There should be two > more entries for log files but none exist. 
This is likely due to the fact > that command-13.json is expected to be of length 13934 but its is not as the > file was never read. > I think, it should have been > {noformat} > LogType: command-13.json > LogLength: > Log Contents: > Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json > (Permission denied)command-3.json13983Error aggregating log file. Log file
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180575#comment-14180575 ] Jian He commented on YARN-2314: --- Thanks Jason, I looked at the patch, looks good overall. just one thing: - IIUC, {{mayBeCloseProxy}} can be invoked by MR/NMClient, but {{proxy.scheduledForClose}} is always false. So it won’t call the following stopProxy. If cache is disabled, this doesn’t matter too much as the idleTimeout is set to 0. But if the cache is enabled, MR/NMClient, won’t be able to explicitly close the proxy ? Also, Can you help me understand one point: bq. See ClientCache.stopClient for details. Given that the whole point of the ContainerManagementProtocolProxy cache is to preserve at least one reference to the Client, the IPC Client stop method will never be called in practice and IPC client threads will never be explicitly torn down as a result of calling stopProxy. once {{ContainerManagementProtocolProxy#tryCloseProxy}} is called, internally it’ll call {{rpc.stopProxy}}, will it eventually call {{ClientCache#stopClient}} ? > ContainerManagementProtocolProxy can create thousands of threads for a large > cluster > > > Key: YARN-2314 > URL: https://issues.apache.org/jira/browse/YARN-2314 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2314.patch, YARN-2314v2.patch, > disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch, > tez-yarn-2314.xlsx > > > ContainerManagementProtocolProxy has a cache of NM proxies, and the size of > this cache is configurable. However the cache can grow far beyond the > configured size when running on a large cluster and blow AM address/container > limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
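The cache semantics under discussion can be illustrated generically: a bounded map that evicts only entries whose reference count has dropped to zero, which is why the cache can temporarily exceed its configured size while proxies are in use. Below is a minimal sketch with hypothetical names, not the actual ContainerManagementProtocolProxy internals:

```java
import java.util.*;

// Generic sketch of a bounded, reference-counted cache: active entries are
// never evicted underneath their callers, so the map may temporarily exceed
// maxSize (the unbounded-growth behavior YARN-2314 is bounding).
public class RefCountedCache<K, V> {
    private static final class Entry<V> { V value; int refs; }

    private final int maxSize;
    // Access-order LinkedHashMap gives us LRU iteration order for eviction.
    private final LinkedHashMap<K, Entry<V>> cache = new LinkedHashMap<>(16, 0.75f, true);

    public RefCountedCache(int maxSize) { this.maxSize = maxSize; }

    public synchronized V acquire(K key, java.util.function.Function<K, V> factory) {
        Entry<V> e = cache.get(key);
        if (e == null) {
            e = new Entry<>();
            e.value = factory.apply(key);
            cache.put(key, e);
        }
        e.refs++;
        return e.value;
    }

    public synchronized void release(K key) {
        Entry<V> e = cache.get(key);
        if (e == null) return;
        e.refs--;
        evictIdleBeyondCapacity();
    }

    // Evict least-recently-used idle entries until within the size bound;
    // entries that are still referenced are skipped.
    private void evictIdleBeyondCapacity() {
        Iterator<Map.Entry<K, Entry<V>>> it = cache.entrySet().iterator();
        while (cache.size() > maxSize && it.hasNext()) {
            if (it.next().getValue().refs == 0) it.remove();
        }
    }

    public synchronized int size() { return cache.size(); }
}
```

In this shape, an explicit close path (the `stopProxy` question above) only tears an entry down once its count reaches zero; while any caller holds a reference, the underlying connection stays alive.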
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180581#comment-14180581 ] Xuan Gong commented on YARN-2724: - The exception reported here is caused by file permissions: we could not aggregate this log file, and the aggregated file is badly formed as a result. We could fix this issue first in this ticket. > If an unreadable file is encountered during log aggregation then aggregated > file in HDFS badly formed > - > > Key: YARN-2724 > URL: https://issues.apache.org/jira/browse/YARN-2724 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 2.5.1 >Reporter: Sumit Mohanty >Assignee: Xuan Gong > > Look into the log output snippet. It looks like there is an issue during > aggregation when an unreadable file is encountered. Likely, this results in > bad encoding. > {noformat} > LogType: command-13.json > LogLength: 13934 > Log Contents: > Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json > (Permission denied)command-3.json13983Error aggregating log file. 
Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json > (Permission denied) > > errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: > [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K->15575K(184320K), > 0.0488700 secs] 163840K->15575K(1028096K), 0.0492510 secs] [Times: user=0.06 > sys=0.01, real=0.05 secs] > 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: > [ParNew: 179415K->11865K(184320K), 0.0941310 secs] 179415K->17228K(1028096K), > 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] > 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: > 95.187: [ParNew: 175705K->12802K(184320K), 0.0466420 secs] > 181068K->18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, > real=0.04 secs] > {noformat} > Specifically, look at the text after the exception text. There should be two > more entries for log files but none exist. This is likely due to the fact > that command-13.json is expected to be of length 13934 but its is not as the > file was never read. > I think, it should have been > {noformat} > LogType: command-13.json > LogLength: > Log Contents: > Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json > (Permission denied)command-3.json13983Error aggregating log file. Log file : > /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json > (Permission denied) > {noformat} > {noformat} > LogType: errors-3.txt > LogLength:0 > Log Contents: > {noformat} > {noformat} > LogType:gc.log > LogLength:??? 
> Log Contents: > ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: > [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
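The corruption described above follows from a length-prefixed record format: the reader trusts the declared `LogLength` and skips exactly that many bytes, so writing an error message in place of a file's contents while keeping the original length shifts every subsequent entry. The mechanism can be sketched with a deliberately simplified format (this is illustrative only, not Hadoop's actual `AggregatedLogFormat`; class and method names are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedLog {
    // Write one entry; the declared length must match the contents exactly.
    static void writeEntry(StringBuilder out, String type, String contents) {
        byte[] body = contents.getBytes(StandardCharsets.UTF_8);
        out.append("LogType:").append(type).append('\n');
        out.append("LogLength:").append(body.length).append('\n');
        out.append("Log Contents:\n").append(contents);
    }

    // The reader trusts the declared length. If a writer substitutes an error
    // message but keeps the original length, every later entry is parsed at
    // the wrong offset -- the garbling seen in this issue.
    static List<String> readTypes(String in) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < in.length()) {
            int tEnd = in.indexOf('\n', pos);
            types.add(in.substring(pos + "LogType:".length(), tEnd));
            int lEnd = in.indexOf('\n', tEnd + 1);
            int len = Integer.parseInt(in.substring(tEnd + 1 + "LogLength:".length(), lEnd));
            int cStart = in.indexOf('\n', lEnd + 1) + 1; // skip "Log Contents:" line
            pos = cStart + len; // assumes single-byte characters for simplicity
        }
        return types;
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder();
        writeEntry(buf, "command-13.json", "Error aggregating log file. (Permission denied)");
        writeEntry(buf, "errors-3.txt", "");
        System.out.println(readTypes(buf.toString())); // [command-13.json, errors-3.txt]
    }
}
```

This matches the reporter's suggested fix: when a file cannot be read, the error message itself must become the entry's contents, with `LogLength` set to the message's length rather than the unreadable file's length.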
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180592#comment-14180592 ] Xuan Gong commented on YARN-2701: - [~aw] Do you have any other comments on this patch? > Potential race condition in startLocalizer when using LinuxContainerExecutor > -- > > Key: YARN-2701 > URL: https://issues.apache.org/jira/browse/YARN-2701 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Xuan Gong >Assignee: Xuan Gong >Priority: Blocker > Fix For: 2.6.0 > > Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, > YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, > YARN-2701.addendum.1.patch, YARN-2701.addendum.2.patch > > > When LinuxContainerExecutor performs startLocalizer, we use native code in > container-executor.c. > {code} > if (stat(npath, &sb) != 0) { >if (mkdir(npath, perm) != 0) { > {code} > We use a check-then-create approach to create the appDir under /usercache. > But if two containers try to do this at the same time, a race > condition may occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
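The quoted `stat` + `mkdir` sequence is the classic check-then-create race: another container can create the directory between the check and the create. The actual fix lives in native container-executor.c, but the safe pattern is the same in any language: attempt the create unconditionally and treat "already exists" as success. A minimal Java sketch (class and method names here are illustrative, not YARN code):

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class AtomicMkdir {
    // Instead of stat-then-mkdir, call create directly and treat the
    // "already exists" failure as success -- the directory being present is
    // exactly the state we wanted, no matter which container created it.
    static void ensureDir(Path dir) throws IOException {
        try {
            Files.createDirectory(dir);
        } catch (FileAlreadyExistsException e) {
            // Lost the race to another container; nothing to do.
        }
    }

    public static void main(String[] args) throws IOException {
        Path appDir = Files.createTempDirectory("usercache").resolve("appDir");
        ensureDir(appDir);
        ensureDir(appDir); // second call must not fail
        System.out.println(Files.isDirectory(appDir)); // true
    }
}
```

In C, the equivalent is checking `errno == EEXIST` after a failed `mkdir` rather than calling `stat` first.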
[jira] [Commented] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180594#comment-14180594 ] Wangda Tan commented on YARN-2726: -- [~tweek], good suggestion! I completely agree that we should make the error message clearer to admins and users. > CapacityScheduler should explicitly log when an accessible label has no > capacity > > > Key: YARN-2726 > URL: https://issues.apache.org/jira/browse/YARN-2726 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Phil D'Amore >Assignee: Naganarasimha G R >Priority: Minor > > Given: > - Node label defined: test-label > - Two queues defined: a, b > - label accessibility and capacity defined as follows (properties > abbreviated for readability): > root.a.accessible-node-labels = test-label > root.a.accessible-node-labels.test-label.capacity = 100 > If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack > trace with the following error buried within: > "Illegal capacity of -1.0 for label=test-label in queue=root.b" > This of course occurs because test-label is accessible to b due to > inheritance from the root, and -1 is the UNDEFINED value. To my mind this > might not be obvious to the admin, and the error message which results does > not help guide someone to the source of the issue. > I propose that this situation be updated so that when the capacity on an > accessible label is undefined, it is explicitly called out instead of falling > through to the illegal capacity check. Something like: > {code} > if (capacity == UNDEFINED) { > throw new IllegalArgumentException("Configuration issue: " + " label=" + > label + " is accessible from queue=" + queue + " but has no capacity set."); > } > {code} > I'll leave it to better judgement than mine as to whether I'm throwing the > appropriate exception there. 
I think this check should be added to both > getNodeLabelCapacities and getMaximumNodeLabelCapacities in > CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
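The reporter's snippet can be exercised as a self-contained sketch (a hedged, simplified variant of the proposed check, not the actual CapacitySchedulerConfiguration code; `UNDEFINED` and the method name are assumptions for illustration):

```java
public class LabelCapacityCheck {
    // CapacityScheduler uses -1 as the sentinel for "capacity not configured".
    static final float UNDEFINED = -1.0f;

    // Call out a missing capacity for an accessible label explicitly, instead
    // of letting it fall through to the generic "Illegal capacity of -1.0"
    // check, which does not point the admin at the root cause.
    static float checkLabelCapacity(String label, String queue, float capacity) {
        if (capacity == UNDEFINED) {
            throw new IllegalArgumentException("Configuration issue: label=" + label
                + " is accessible from queue=" + queue + " but has no capacity set.");
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(checkLabelCapacity("test-label", "root.a", 100f));
        try {
            checkLabelCapacity("test-label", "root.b", UNDEFINED);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

For the scenario in the description, this would report `root.b` and `test-label` by name, rather than only the illegal value -1.0.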
[jira] [Updated] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2726: - Issue Type: Sub-task (was: Improvement) Parent: YARN-2492 > CapacityScheduler should explicitly log when an accessible label has no > capacity > > > Key: YARN-2726 > URL: https://issues.apache.org/jira/browse/YARN-2726 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Phil D'Amore >Assignee: Naganarasimha G R >Priority: Minor > > Given: > - Node label defined: test-label > - Two queues defined: a, b > - label accessibility and and capacity defined as follows (properties > abbreviated for readability): > root.a.accessible-node-labels = test-label > root.a.accessible-node-labels.test-label.capacity = 100 > If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack > trace with the following error buried within: > "Illegal capacity of -1.0 for label=test-label in queue=root.b" > This of course occurs because test-label is accessible to b due to > inheritance from the root, and -1 is the UNDEFINED value. To my mind this > might not be obvious to the admin, and the error message which results does > not help guide someone to the source of the issue. > I propose that this situation be updated so that when the capacity on an > accessible label is undefined, it is explicitly called out instead of falling > through to the illegal capacity check. Something like: > {code} > if (capacity == UNDEFINED) { > throw new IllegalArgumentException("Configuration issue: " + " label=" + > label + " is accessible from queue=" + queue + " but has no capacity set."); > } > {code} > I'll leave it to better judgement than mine as to whether I'm throwing the > appropriate exception there. I think this check should be added to both > getNodeLabelCapacities and getMaximumNodeLabelCapacities in > CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180637#comment-14180637 ] Wangda Tan commented on YARN-2726: -- Converted this to sub task of YARN-2492 > CapacityScheduler should explicitly log when an accessible label has no > capacity > > > Key: YARN-2726 > URL: https://issues.apache.org/jira/browse/YARN-2726 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Phil D'Amore >Assignee: Naganarasimha G R >Priority: Minor > > Given: > - Node label defined: test-label > - Two queues defined: a, b > - label accessibility and and capacity defined as follows (properties > abbreviated for readability): > root.a.accessible-node-labels = test-label > root.a.accessible-node-labels.test-label.capacity = 100 > If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack > trace with the following error buried within: > "Illegal capacity of -1.0 for label=test-label in queue=root.b" > This of course occurs because test-label is accessible to b due to > inheritance from the root, and -1 is the UNDEFINED value. To my mind this > might not be obvious to the admin, and the error message which results does > not help guide someone to the source of the issue. > I propose that this situation be updated so that when the capacity on an > accessible label is undefined, it is explicitly called out instead of falling > through to the illegal capacity check. Something like: > {code} > if (capacity == UNDEFINED) { > throw new IllegalArgumentException("Configuration issue: " + " label=" + > label + " is accessible from queue=" + queue + " but has no capacity set."); > } > {code} > I'll leave it to better judgement than mine as to whether I'm throwing the > appropriate exception there. I think this check should be added to both > getNodeLabelCapacities and getMaximumNodeLabelCapacities in > CapacitySchedulerConfiguration.java. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2721: --- Priority: Blocker (was: Major) > Race condition: ZKRMStateStore retry logic may throw NodeExist exception > - > > Key: YARN-2721 > URL: https://issues.apache.org/jira/browse/YARN-2721 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He >Priority: Blocker > Fix For: 2.6.0 > > Attachments: YARN-2721.1.patch > > > Blindly retrying operations in zookeeper will not work for non-idempotent > operations (like create znode). The reason is that the client can do a create > znode, but the response may not be returned because the server can die or > timeout. In case of retrying the create znode, it will throw a NODE_EXISTS > exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
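The failure mode described above can be shown with a self-contained simulation (this is an illustrative sketch, not the actual ZKRMStateStore code or its eventual fix; treating NODE_EXISTS as success on retry is only safe when the path is uniquely owned by this writer):

```java
import java.util.HashSet;
import java.util.Set;

public class SafeCreateRetry {
    static class NodeExistsException extends Exception {}
    static class ConnectionLossException extends Exception {}

    // A fake store where the first create applies the write on the "server"
    // but loses the response -- exactly the scenario in the description.
    static class FlakyStore {
        final Set<String> znodes = new HashSet<>();
        boolean dropNextResponse = true;

        void create(String path) throws NodeExistsException, ConnectionLossException {
            if (znodes.contains(path)) throw new NodeExistsException();
            znodes.add(path);
            if (dropNextResponse) { dropNextResponse = false; throw new ConnectionLossException(); }
        }
    }

    // A blind retry loop would surface the NODE_EXISTS from our own earlier
    // attempt as an error. Since this writer uniquely owns the path, the
    // retry logic can instead treat it as success.
    static void createWithRetry(FlakyStore store, String path) throws Exception {
        while (true) {
            try {
                store.create(path);
                return;
            } catch (ConnectionLossException e) {
                // response lost; retry
            } catch (NodeExistsException e) {
                return; // our earlier attempt from this session succeeded
            }
        }
    }

    public static void main(String[] args) throws Exception {
        FlakyStore store = new FlakyStore();
        createWithRetry(store, "/rmstore/app_1");
        System.out.println(store.znodes.contains("/rmstore/app_1")); // true
    }
}
```

The general point stands for any non-idempotent operation: the retry layer needs operation-specific knowledge to decide whether a failure on retry means the original attempt actually succeeded.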
[jira] [Updated] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
[ https://issues.apache.org/jira/browse/YARN-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2722: -- Attachment: YARN-2722-1.patch This patch creates a whilelist {"TLSv1.2", "TLSv1.1", "TLSv1"} for the SSLFactory. Have verified with the ShuffleHandler (13562 port). {code:title=Without fix} $ openssl s_client -connect localhost:13562 -ssl3 CONNECTED(0003) depth=0 CN = *.ent.cloudera.com verify error:num=18:self signed certificate verify return:1 depth=0 CN = *.ent.cloudera.com verify return:1 --- Certificate chain 0 s:/CN=*.ent.cloudera.com i:/CN=*.ent.cloudera.com --- Server certificate -BEGIN CERTIFICATE- MIIC2TCCAcGgAwIBAgIERTXzmDANBgkqhkiG9w0BAQsFADAdMRswGQYDVQQDDBIq LmVudC5jbG91ZGVyYS5jb20wHhcNMTQxMDE0MjEwOTU1WhcNMTUwMTEyMjEwOTU1 WjAdMRswGQYDVQQDDBIqLmVudC5jbG91ZGVyYS5jb20wggEiMA0GCSqGSIb3DQEB AQUAA4IBDwAwggEKAoIBAQDdd3RIofg6S0jNi1tZPLC/ye4yLz5PLdxpn5Rlmg8p jORirbyvsLSn82WcfITUUx8Iez9pYLLXBzOqS4nlXwFP1WHDHGJFyuidTOaXm2fr sZIVYUx0ldzUT6AhSLQ1p81g8Uplv3xA+Bh/SIXU84vKnjH6eU2wJc/0AKS6Jchl hNr9ZuMEK6Dc34MbjOd0inLNqR2A26wV/tEPhf3UWbpkED9J8DZqevp25hvmYomM OSoUSyO2hc6Mkj97Cbd8OglbXzG0lFzCgmN0yqFZ7X8pZuOzs2MhnzXtzjUbwvyO G+1mpQ95Oc1cBdK40Rq/xeE8NwDP6C9JJ8FEz/VuuUZfAgMBAAGjITAfMB0GA1Ud DgQWBBR/aS6adMIKP9pQbfcNkxyIbRMXJDANBgkqhkiG9w0BAQsFAAOCAQEAktNr AzECBbO3hZEmjbZ/lnE+9DI7LF8DV1XbwZqd5qXhnnqZde5CryOGsAn76RkizUlo KH1+8w8WRW8YxCx3863dOKg9yRr8rR5+BedSfG1GeF9PSpRYJ1o5Bv9wLNjI+UM0 E6zq3ObxpLe1QqXwz5Ro5DOIaBN5GRNp6i1B6k6b1aPsJOAaBkuFkR+unBCWnQk7 uMtGb78LaCYU0/8D5fRMTkeChR9gxuwYj7hwt3+CKdKEQ+0Mxbd5/sO8HgGlOcB1 T1xtu/GXoboiwwn6pLm/OksEyxB9TXnSvkc9C/RXQeaSaiEvYksS1LvPkvq27qDU 09EC8C1HkfWd4uOKYA== -END CERTIFICATE- subject=/CN=*.ent.cloudera.com issuer=/CN=*.ent.cloudera.com --- No client certificate CA names sent --- SSL handshake has read 1239 bytes and written 288 bytes --- New, TLSv1/SSLv3, Cipher is ECDHE-RSA-DES-CBC3-SHA Server public key is 2048 bit Secure Renegotiation IS supported Compression: NONE Expansion: NONE 
SSL-Session: Protocol : SSLv3 Cipher: ECDHE-RSA-DES-CBC3-SHA Session-ID: 5446E4F74C3341F5AEA8CB827A5745A90AB8BF09765C4EDBBE57174314AEC901 Session-ID-ctx: Master-Key: D6C5A557D188361EB4E25414C6360EC6835143D27572D7A0019213C2AD175852C8F850D21B95DF334EC8B95D9FDB Key-Arg : None PSK identity: None PSK identity hint: None SRP username: None Start Time: 1413932279 Timeout : 7200 (sec) Verify return code: 18 (self signed certificate) --- q HTTP/1.1 500 Internal Server Error Content-Type: text/plain; charset=UTF-8 name: mapreduce version: 1.0.0 closed {code} {code:title=With Fix} $ openssl s_client -connect localhost:13562 -ssl3 CONNECTED(0003) write:errno=104 --- no peer certificate available --- No client certificate CA names sent --- SSL handshake has read 0 bytes and written 0 bytes --- New, (NONE), Cipher is (NONE) Secure Renegotiation IS NOT supported Compression: NONE Expansion: NONE SSL-Session: Protocol : SSLv3 Cipher: Session-ID: Session-ID-ctx: Master-Key: Key-Arg : None PSK identity: None PSK identity hint: None SRP username: None Start Time: 1414013826 Timeout : 7200 (sec) Verify return code: 0 (ok) --- {code} > Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle > - > > Key: YARN-2722 > URL: https://issues.apache.org/jira/browse/YARN-2722 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2722-1.patch > > > We should disable SSLv3 in HttpFS to protect against the POODLEbleed > vulnerability. > See [CVE-2014-3566 > |http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566] > We have {{context = SSLContext.getInstance("TLS");}} in SSLFactory, but when > I checked, I could still connect with SSLv3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
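The whitelist approach in the patch can be sketched with standard JSSE calls (an illustrative sketch of the general mechanism, not the actual SSLFactory change; intersecting with the supported list keeps the sketch robust on JVMs that lack some of the whitelisted versions):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class TlsOnly {
    // The protocol whitelist from the patch description.
    static final String[] WHITELIST = {"TLSv1.2", "TLSv1.1", "TLSv1"};

    // Restrict an engine to the whitelisted protocols this JVM supports.
    // Anything not listed -- in particular SSLv3 -- is never negotiated,
    // even if the SSLContext itself would allow it.
    static SSLEngine restrict(SSLEngine engine) {
        List<String> supported = Arrays.asList(engine.getSupportedProtocols());
        List<String> enabled = new ArrayList<>();
        for (String p : WHITELIST) {
            if (supported.contains(p)) enabled.add(p);
        }
        engine.setEnabledProtocols(enabled.toArray(new String[0]));
        return engine;
    }

    public static void main(String[] args) throws Exception {
        SSLEngine engine = restrict(SSLContext.getDefault().createSSLEngine());
        System.out.println(Arrays.asList(engine.getEnabledProtocols()).contains("SSLv3")); // false
    }
}
```

This is why `SSLContext.getInstance("TLS")` alone was insufficient: that call selects a context family, but the set of protocols actually enabled for the handshake must be constrained explicitly.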
[jira] [Commented] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180676#comment-14180676 ] Karthik Kambatla commented on YARN-2183: bq. we do have a YARN admin command implemented that lets you run the cleaner task on demand (YARN-2189). Cool. In that case, I see the merit to keeping runCleanerTask around. bq. this check is needed to prevent a race (i.e. not allow an on-demand run when a scheduled run is in progress). I understand we need a check to prevent the race. I wonder if we can just re-use the existing check in CleanerTask#run instead of an explicit check in CleanerService#runCleanerTask? From what I remember, that would make the code in CleanerTask#run cleaner as well. (no pun) bq. However, it’s not clear to me whether a dependency from an SCMStore to an AppChecker is always a fine requirement for other types of stores. I poked around a little more, and here is what I think. SharedCacheManager creates an instance of AppChecker, rest of the SCM pieces (Store, CleanerService) should just use the same instance. This instance can be passed either in the constructor or through an SCMContext similar to RMContext. Or, we could add SCM#getAppChecker. In its current form, CleanerTask#cleanResourceReferences fetches the references from the store, checks if the apps are running, and asks the store to remove the references. Moving the whole method to the store would simplify the code more. The latest patch looks pretty good but for the above two points. One other nit: One of {CleanerTask, CleanerService} has unused imports. 
> Cleaner service for cache manager > - > > Key: YARN-2183 > URL: https://issues.apache.org/jira/browse/YARN-2183 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, > YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch > > > Implement the cleaner service for the cache manager along with metrics for > the service. This service is responsible for cleaning up old resource > references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
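Karthik's suggestion that the SCM pieces share one AppChecker instance passed via constructors can be sketched as follows (class and interface names are illustrative stand-ins, not the actual YARN-2183 classes):

```java
public class SharedAppChecker {
    // The one dependency both components need: "is this app still running?"
    interface AppChecker { boolean isApplicationActive(String appId); }

    // SharedCacheManager creates a single AppChecker and hands the same
    // instance to the store and the cleaner via their constructors, instead
    // of each component constructing its own.
    static class SCMStore {
        final AppChecker appChecker;
        SCMStore(AppChecker appChecker) { this.appChecker = appChecker; }
    }

    static class CleanerService {
        final AppChecker appChecker;
        CleanerService(AppChecker appChecker) { this.appChecker = appChecker; }
    }

    public static void main(String[] args) {
        AppChecker checker = appId -> appId.startsWith("application_");
        SCMStore store = new SCMStore(checker);
        CleanerService cleaner = new CleanerService(checker);
        System.out.println(store.appChecker == cleaner.appChecker); // true
    }
}
```

The same wiring could equally go through an SCMContext object (analogous to RMContext) or an `SCM#getAppChecker` accessor, as the comment notes; constructor injection is just the simplest form to show.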
[jira] [Commented] (YARN-2503) Changes in RM Web UI to better show labels to end users
[ https://issues.apache.org/jira/browse/YARN-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180683#comment-14180683 ] Hadoop QA commented on YARN-2503: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676426/YARN-2503-20141022-1.patch against trunk revision 7b0f9bb. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5501//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5501//console This message is automatically generated. > Changes in RM Web UI to better show labels to end users > --- > > Key: YARN-2503 > URL: https://issues.apache.org/jira/browse/YARN-2503 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2503-20141022-1.patch, YARN-2503.patch > > > Include but not limited to: > - Show labels of nodes in RM/nodes page > - Show labels of queue in RM/scheduler page -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180690#comment-14180690 ] Hadoop QA commented on YARN-2198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676333/YARN-2198.16.patch against trunk revision a36399e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5502//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5502//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5502//console This message is automatically generated. 
> Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, > YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or a > member of the a local Administrators group. Since the process in question is > the NodeManager, the requirement translates to the entire NM to run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). 
The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180695#comment-14180695 ] Jian He commented on YARN-2198: --- committing > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, > YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or a > member of the a local Administrators group. Since the process in question is > the NodeManager, the requirement translates to the entire NM to run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. 
My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
[ https://issues.apache.org/jira/browse/YARN-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180724#comment-14180724 ] Hadoop QA commented on YARN-2722: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676452/YARN-2722-1.patch against trunk revision a36399e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5503//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5503//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5503//console This message is automatically generated. 
> Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle > - > > Key: YARN-2722 > URL: https://issues.apache.org/jira/browse/YARN-2722 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2722-1.patch > > > We should disable SSLv3 in HttpFS to protect against the POODLEbleed > vulnerability. > See [CVE-2014-3566 > |http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566] > We have {{context = SSLContext.getInstance("TLS");}} in SSLFactory, but when > I checked, I could still connect with SSLv3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180728#comment-14180728 ] Allen Wittenauer commented on YARN-2718: I don't think creating some sort of mutant executor is really the proper fix here. I suspect the real answer, allowing users to pick which executor to use (from an admin-approved list), is probably closer (and quicker!) to the real goal. But the bigger issue is that this will lead to some very weird and unpredictable administrative experiences. It also means that users will be given even more impact on how the NM actually works. This is a bit of a dangerous road to start heading down... > Create a CompositeConatainerExecutor that combines DockerContainerExecutor > and DefaultContainerExecutor > --- > > Key: YARN-2718 > URL: https://issues.apache.org/jira/browse/YARN-2718 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Abin Shahab > > There should be a composite container that allows users to run their jobs in > DockerContainerExecutor, but switch to DefaultContainerExecutor for debugging > purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
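The alternative Allen suggests, letting a job request an executor but only from a set the admin has explicitly approved, can be sketched like this (purely illustrative names; this is not YARN's ContainerExecutor API):

```java
import java.util.Map;

public class ExecutorChoice {
    interface ContainerExecutor { String name(); }

    // Validate a per-job executor request against the admin-configured
    // whitelist; the user gets a choice, but only among approved executors.
    static ContainerExecutor pick(Map<String, ContainerExecutor> approved, String requested) {
        ContainerExecutor exec = approved.get(requested);
        if (exec == null) {
            throw new IllegalArgumentException("Executor '" + requested + "' is not admin-approved");
        }
        return exec;
    }

    public static void main(String[] args) {
        Map<String, ContainerExecutor> approved = Map.of(
            "default", () -> "DefaultContainerExecutor",
            "docker", () -> "DockerContainerExecutor");
        System.out.println(pick(approved, "docker").name()); // DockerContainerExecutor
        try {
            pick(approved, "rogue");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Compared with a composite executor, this keeps the per-container behavior predictable for the admin: each container runs under exactly one known executor, chosen from a closed set.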
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180733#comment-14180733 ] Jian He commented on YARN-2198: --- thanks Craig for reviewing the patch ! > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Fix For: 2.6.0 > > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, > YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or a > member of the a local Administrators group. Since the process in question is > the NodeManager, the requirement translates to the entire NM to run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. 
My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180735#comment-14180735 ] Jian He commented on YARN-1063: --- I merged this to 2.6, as this was marked for 2.6 > Winutils needs ability to create task as domain user > > > Key: YARN-1063 > URL: https://issues.apache.org/jira/browse/YARN-1063 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Environment: Windows >Reporter: Kyle Leckie >Assignee: Remus Rusanu > Labels: security, windows > Fix For: 2.6.0 > > Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, > YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch > > > h1. Summary: > Securing a Hadoop cluster requires constructing some form of security > boundary around the processes executed in YARN containers. Isolation based on > Windows user isolation seems most feasible. This approach is similar to the > approach taken by the existing LinuxContainerExecutor. The current patch to > winutils.exe adds the ability to create a process as a domain user. > h1. Alternative Methods considered: > h2. Process rights limited by security token restriction: > On Windows access decisions are made by examining the security token of a > process. It is possible to spawn a process with a restricted security token. > Any of the rights granted by SIDs of the default token may be restricted. It > is possible to see this in action by examining the security token of a > sandboxed process launched by a web browser. Typically the launched process > will have a fully restricted token and need to access machine resources > through a dedicated broker process that enforces a custom security policy. > This broker process mechanism would break compatibility with the typical > Hadoop container process. The Container process must be able to utilize > standard function calls for disk and network IO. 
I performed some work > looking at ways to ACL the local files to the specific launched process without > granting rights to other processes launched on the same machine, but found > this to be an overly complex solution. > h2. Relying on APP containers: > Recent versions of Windows have the ability to launch processes within an > isolated container. Application containers are supported for execution of > WinRT-based executables. This method was ruled out due to the lack of > official support for standard Windows APIs. At some point in the future > Windows may support functionality similar to BSD jails or Linux containers; > at that point support for containers should be added. > h1. Create As User Feature Description: > h2. Usage: > A new sub command was added to the set of task commands. Here is the syntax: > winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] > Some notes: > * The username specified is in the format of "user@domain" > * The machine executing this command must be joined to the domain of the user > specified > * The domain controller must allow the account executing the command access > to the user information. For this, join the account to the predefined group > labeled "Pre-Windows 2000 Compatible Access" > * The account running the command must have several rights on the local > machine. These can be managed manually using secpol.msc: > ** "Act as part of the operating system" - SE_TCB_NAME > ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME > ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME > * The launched process will not have rights to the desktop so will not be > able to display any information or create UI. > * The launched process will have no network credentials. Any access of > network resources that requires domain authentication will fail. > h2. Implementation: > Winutils performs the following steps: > # Enable the required privileges for the current process. 
> # Register as a trusted process with the Local Security Authority (LSA). > # Create a new logon for the user passed on the command line. > # Load/Create a profile on the local machine for the new logon. > # Create a new environment for the new logon. > # Launch the new process in a job with the task name specified and using the > created logon. > # Wait for the JOB to exit. > h2. Future work: > The following work was scoped out of this check-in: > * Support for non-domain users or machines that are not domain joined. > * Support for privilege isolation by running the task launcher in a high > privilege service with access over an ACLed named pipe. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
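The createAsUser syntax above can be illustrated with a small sketch. Note this is a hypothetical helper, not part of Hadoop's actual container executor API; it only shows how the documented "winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]" invocation, with the required "user@domain" username format, might be assembled.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (names are illustrative) that assembles the
// "winutils task createAsUser" command line described in the usage notes.
class WinutilsTaskCommand {
    static List<String> createAsUser(String taskName, String user,
                                     String domain, String commandLine) {
        // Per the notes above, the username must be in "user@domain" form.
        String principal = user + "@" + domain;
        return Arrays.asList("winutils", "task", "createAsUser",
                             taskName, principal, commandLine);
    }
}
```

An executor would hand this argument list to a process builder; the launched process then runs under the domain user's logon as described in the implementation steps.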
[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: yarn-2010-4.patch > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Karthik Kambatla >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 
4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 
8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
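The innermost "Missing argument" in the trace above comes from the JDK itself: the SecretKeySpec constructor throws IllegalArgumentException("Missing argument") when the key (or algorithm) is null, which is what the RM hits when registering a master key that was never stored. A minimal reproduction:

```java
import javax.crypto.spec.SecretKeySpec;

// Reproduces the root cause of the recovery failure above: constructing a
// SecretKeySpec from a null key, as happens when the stored client-to-AM
// master key is absent (e.g. the app predates enabling security).
class MissingKeyRepro {
    static String reproduce() {
        try {
            new SecretKeySpec(null, "HmacSHA1"); // null key, as in the failed recovery
            return "no exception";
        } catch (IllegalArgumentException e) {
            return e.getMessage(); // "Missing argument"
        }
    }
}
```

This is why a null-check before registerMasterKey (as later suggested on this JIRA) avoids the crash.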
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180739#comment-14180739 ] Hudson commented on YARN-2198: -- FAILURE: Integrated in Hadoop-trunk-Commit #6318 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6318/]) YARN-2198. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor. Contributed by Remus Rusanu (jianhe: rev 3b12fd6cfbf4cc91ef8e8616c7aafa9de006cde5) * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/util/ProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-common-project/hadoop-common/src/main/winutils/hadoopwinutilsvc.idl * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.vcxproj * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * hadoop-common-project/hadoop-common/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * 
hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.h * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java * hadoop-common-project/hadoop-common/src/main/winutils/include/winutils.h * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/winutils/task.c * hadoop-common-project/hadoop-common/src/main/winutils/client.c * hadoop-common-project/hadoop-common/src/main/winutils/config.cpp * hadoop-common-project/hadoop-common/src/main/native/native.vcxproj * hadoop-common-project/hadoop-common/src/main/winutils/winutils.vcxproj * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java * .gitignore * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.c * hadoop-common-project/hadoop-common/src/main/winutils/winutils.sln * hadoop-common-project/hadoop-common/src/main/winutils/service.c * hadoop-common-project/hadoop-common/src/main/winutils/main.c * hadoop-common-project/hadoop-common/src/main/winutils/chown.c * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/winutils/winutils.mc > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >
[jira] [Updated] (YARN-2010) If RM fails to recover an app, it can never transition to active again
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Summary: If RM fails to recover an app, it can never transition to active again (was: RM can't transition to active if it can't recover an app attempt) > If RM fails to recover an app, it can never transition to active again > -- > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Karthik Kambatla >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. 
> {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2730) Only one localizer can run on a NodeManager at a time
[ https://issues.apache.org/jira/browse/YARN-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2730: -- Description: The synchronized modifier appears to have been added by https://issues.apache.org/jira/browse/MAPREDUCE-3537 It could be removed if the Localizer doesn't depend on the current directory > Only one localizer can run on a NodeManager at a time > - > > Key: YARN-2730 > URL: https://issues.apache.org/jira/browse/YARN-2730 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siqi Li >Priority: Critical > > The synchronized modifier appears to have been added by > https://issues.apache.org/jira/browse/MAPREDUCE-3537 > It could be removed if the Localizer doesn't depend on the current directory -- This message was sent by Atlassian JIRA (v6.3.4#6332)
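The effect of the synchronized modifier discussed in this issue can be sketched as follows. This is not Hadoop's actual localizer code; it is a minimal model showing why a synchronized entry point serializes all localizations on a node: every runner must acquire the same monitor, so one stuck localization blocks the rest.

```java
// Minimal model of the bug: a synchronized startLocalizer-style method
// means at most one localization runs at a time per NodeManager.
class Localizer {
    private int active = 0, maxActive = 0;

    synchronized void startLocalizer(Runnable work) {
        active++;
        maxActive = Math.max(maxActive, active);
        work.run(); // while this runs, no other runner can enter
        active--;
    }

    int observedMaxConcurrency() { return maxActive; }
}
```

Even with many threads submitting work concurrently, observedMaxConcurrency() stays at 1; dropping the synchronized keyword (safe only once the Localizer no longer depends on the current working directory) would let the runners proceed in parallel.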
[jira] [Created] (YARN-2730) Only one localizer can run on a NodeManager at a time
Siqi Li created YARN-2730: - Summary: Only one localizer can run on a NodeManager at a time Key: YARN-2730 URL: https://issues.apache.org/jira/browse/YARN-2730 Project: Hadoop YARN Issue Type: Bug Reporter: Siqi Li Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2010) If RM fails to recover an app, it can never transition to active again
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180747#comment-14180747 ] Karthik Kambatla commented on YARN-2010: This JIRA has been open for a while and has gone through several discussions. I'll try to consolidate everything here so we can iterate on this quickly. Sometimes, the RM fails to recover an application. It could be because of turning security on, token expiry, or issues connecting to HDFS, etc. The causes can be classified into (1) transient, (2) specific to one application, and (3) permanent and applying to multiple (all) applications. Today, the RM fails to transition to Active, ends up in STOPPED state, and can never be transitioned to Active again. Vinod suggested we handle these cases (exceptions) separately, so we can do the right thing for each exception. The latest patch (v4) is along these lines - it catches a potentially transient issue (ConnectException) and transitions the RM to Standby. If the issue were to persist (case 3), the RM would eventually exhaust its permitted number of failovers and crash. For application-specific issues (as of now, all other exceptions), we just skip recovering that app. In addition to this, the patch cleans up RMAppManager#recoverApplication and also adds a null-check in RMAppAttempt#recoverAppAttemptCredentials per Jian's suggestion. > If RM fails to recover an app, it can never transition to active again > -- > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Karthik Kambatla >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. 
> Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument
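The exception classification described in the consolidation comment above can be sketched as a small policy. The names here are illustrative, not the actual v4 patch: a potentially transient failure such as ConnectException sends the RM back to Standby for another failover attempt, while any other recovery exception is treated as specific to the one application, which is skipped so the RM can still become Active.

```java
import java.net.ConnectException;

// Sketch (not the actual patch) of the recovery policy discussed above.
class RecoveryPolicy {
    enum Action { FAILOVER_TO_STANDBY, SKIP_APPLICATION }

    static Action onRecoveryFailure(Exception e) {
        if (e instanceof ConnectException) {
            // Transient (e.g. store temporarily unreachable): retry via
            // another failover; persistent failures eventually exhaust
            // the permitted failover count.
            return Action.FAILOVER_TO_STANDBY;
        }
        // App-specific (e.g. the "Missing argument" credential failure in
        // the trace above): skip just this app and keep recovering.
        return Action.SKIP_APPLICATION;
    }
}
```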
[jira] [Updated] (YARN-2730) Only one localizer can run on a NodeManager at a time
[ https://issues.apache.org/jira/browse/YARN-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2730: -- Description: We are seeing that when one of the LocalizerRunners is stuck, the rest of the LocalizerRunners are blocked. We should remove the synchronized modifier. The synchronized modifier appears to have been added by https://issues.apache.org/jira/browse/MAPREDUCE-3537 It could be removed if the Localizer doesn't depend on the current directory was: The synchronized modifier appears to have been added by https://issues.apache.org/jira/browse/MAPREDUCE-3537 It could be removed if Localizer doesn't depend on current directory > Only one localizer can run on a NodeManager at a time > - > > Key: YARN-2730 > URL: https://issues.apache.org/jira/browse/YARN-2730 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siqi Li >Priority: Critical > > We are seeing that when one of the LocalizerRunners is stuck, the rest of the > LocalizerRunners are blocked. We should remove the synchronized modifier. > The synchronized modifier appears to have been added by > https://issues.apache.org/jira/browse/MAPREDUCE-3537 > It could be removed if the Localizer doesn't depend on the current directory -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2730) Only one localizer can run on a NodeManager at a time
[ https://issues.apache.org/jira/browse/YARN-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li reassigned YARN-2730: - Assignee: Siqi Li > Only one localizer can run on a NodeManager at a time > - > > Key: YARN-2730 > URL: https://issues.apache.org/jira/browse/YARN-2730 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Siqi Li >Assignee: Siqi Li >Priority: Critical > > We are seeing that when one of the LocalizerRunners is stuck, the rest of the > LocalizerRunners are blocked. We should remove the synchronized modifier. > The synchronized modifier appears to have been added by > https://issues.apache.org/jira/browse/MAPREDUCE-3537 > It could be removed if the Localizer doesn't depend on the current directory -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2730) Only one localizer can run on a NodeManager at a time
[ https://issues.apache.org/jira/browse/YARN-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2730: -- Affects Version/s: (was: 2.5.0) 2.4.0 > Only one localizer can run on a NodeManager at a time > - > > Key: YARN-2730 > URL: https://issues.apache.org/jira/browse/YARN-2730 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Siqi Li >Assignee: Siqi Li >Priority: Critical > > We are seeing that when one of the LocalizerRunners is stuck, the rest of the > LocalizerRunners are blocked. We should remove the synchronized modifier. > The synchronized modifier appears to have been added by > https://issues.apache.org/jira/browse/MAPREDUCE-3537 > It could be removed if the Localizer doesn't depend on the current directory -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2010) If RM fails to recover an app, it can never transition to active again
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: issue-stack-strace.rtf > If RM fails to recover an app, it can never transition to active again > -- > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Karthik Kambatla >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch > > > Sometimes, the RM fails to recover an application. It could be because of > turning security on, token expiry, or issues connecting to HDFS etc. The > causes could be classified into (1) transient, (2) specific to one > application, and (3) permanent and apply to multiple (all) applications. > Today, the RM fails to transition to Active and ends up in STOPPED state and > can never be transitioned to Active again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)