[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590409#comment-14590409 ] Brahma Reddy Battula commented on YARN-3528: Sorry for the delay. Based on the above proposal, I attached an initial patch. There are actually many classes with hard-coded ports, so I think we need a plan to fix all of them (any thoughts on how to plan this? Should they all go under this JIRA, or should we track them at the project level?). The classes mentioned in this JIRA are addressed for now; I will update the final patch based on your inputs. Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practice has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through port scanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
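A common way for tests to avoid a hard-coded port is to bind a socket to port 0 and let the OS hand back a free ephemeral port. A minimal sketch of that idea (the helper class and method names here are illustrative, not YARN's actual test utilities):
{code}
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper for tests; YARN's real fix may use its own utilities.
public final class FreePortFinder {

  private FreePortFinder() {
  }

  /**
   * Asks the OS for an unused ephemeral port by binding to port 0, then
   * releases it so the test service can bind to the returned port.
   * There is a small race between closing this socket and re-binding.
   */
  public static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      socket.setReuseAddress(true);
      return socket.getLocalPort();
    }
  }
}
{code}
A test could then build its service address as {{"127.0.0.1:" + FreePortFinder.findFreePort()}} instead of the fixed 12345.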
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590318#comment-14590318 ] Tsuyoshi Ozawa commented on YARN-3798: -- Sorry for the delay. I took a time to investigate the behaviour of ZooKeeper yesterday. Now I'm checking the comment by Varun. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590208#comment-14590208 ] Xuan Gong commented on YARN-3804: - +1 LGTM. Will commit Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active refreshACl is called and acl permission not available for the user Infinite retry for the same switch to Active and always false returned from {{ActiveStandbyElector#becomeActive()}} *Expected* RM should get shutdown event after few retry or even at first attempt Since at runtime user from which it retries for refreshacl can never be updated. 
*States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
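To illustrate the analysis above: with {{yarn.admin.acl=dsperf}}, the admin ACL check rejects the RM's own kerberos principal ({{yarn}}), so {{refreshAdminAcls}} inside {{transitionToActive}} fails on every attempt. A hedged, self-contained sketch of that kind of check (not the actual AdminService code):
{code}
import java.io.IOException;

import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;

// Illustrative only: shows why the check fails when the RM principal's
// short name is not covered by yarn.admin.acl.
public class AdminAclCheckExample {
  public static void main(String[] args) throws IOException {
    AccessControlList adminAcl = new AccessControlList("dsperf"); // yarn.admin.acl
    UserGroupInformation caller = UserGroupInformation.getCurrentUser(); // e.g. "yarn"
    // Prints false unless the caller is "dsperf" or in one of its groups,
    // which is why every transition-to-active attempt is rejected.
    System.out.println("admin access allowed: " + adminAcl.isUserAllowed(caller));
  }
}
{code}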
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590309#comment-14590309 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~vinodkv] thank you for taking a look at this issue. {quote} If my understanding is correct, someone should edit the title. {quote} Sure. {quote} Coming to the patch: By definition, CONNECTIONLOSS also means that we should recreate the connection? {quote} IIUC, we should not recreate the connection when CONNECTIONLOSS happens; by definition the ZooKeeper client tries to reconnect automatically, since it is a recoverable error. This is described in the ZooKeeper wiki (http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling), and Curator does the same thing (see the sketch after this entry). {quote} Recoverable errors: the disconnected event, connection timed out, and the connection loss exception are examples of recoverable errors, they indicate a problem that happened, but the ZooKeeper handle is still valid and future operations will succeed once the ZooKeeper library can reestablish its connection to ZooKeeper. The ZooKeeper library does try to recover the connection, so the handle should not be closed on a recoverable error, but the application must deal with the transient error. {quote} {quote} 2. (ZKRMStateStore) Failing to zkClient.close() in ZKRMStateStore#createConnection, but IOException is ignored. I think this should be fixed in ZooKeeper. No amount of patching in YARN will fix this. {quote} I took a deeper look at the code of ZooKeeper#close. I found the IOException is not the cause. However, the way our error handling works leads to this phenomenon, as follows: # (ZKRMStateStore) CONNECTIONLOSS happens - closeZkClients is called inside createConnection. # (ZooKeeper client in ZKRMStateStore) submitRequest - wait() for the close() packet to finish. # (ZooKeeper client SendThread) An exception happens because of the timeout - the close() packet is cleaned up. The reply header of the packet carries CONNECTIONLOSS again, and the caller of close() is notified. # (ZooKeeper client in ZKRMStateStore) return to closeZkClients(). # (ZKRMStateStore) createConnection() continues normally. I think the error handling when CONNECTIONLOSS happens and the connection management on the YARN side are wrong, as described above. We should fix it on our side. Please correct me if I'm wrong. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
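Following the error-handling guidance quoted above (CONNECTIONLOSS is recoverable and the existing ZooKeeper handle stays valid), a retry would reuse the same handle rather than closing it and creating a new connection. A minimal sketch under that assumption, using a plain ZooKeeper client rather than the actual ZKRMStateStore retry code; the retry count and sleep are placeholders:
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConnectionLossRetryExample {
  static void createWithRetries(ZooKeeper zk, String path, byte[] data, int numRetries)
      throws KeeperException, InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        return;
      } catch (KeeperException.ConnectionLossException e) {
        // Recoverable: the client library reconnects on its own, so keep the
        // same ZooKeeper handle and retry after a short pause.
        if (attempt >= numRetries) {
          throw e;
        }
        Thread.sleep(1000L);
      }
      // SessionExpiredException (deliberately not caught here) is the
      // unrecoverable case where a new handle really is required.
    }
  }
}
{code}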
[jira] [Updated] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle
[ https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3047: --- Attachment: YARN-3047-YARN-2928.08.patch [Data Serving] Set up ATS reader with basic request serving structure and lifecycle --- Key: YARN-3047 URL: https://issues.apache.org/jira/browse/YARN-3047 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Labels: BB2015-05-TBR Attachments: Timeline_Reader(draft).pdf, YARN-3047-YARN-2928.08.patch, YARN-3047.001.patch, YARN-3047.003.patch, YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, YARN-3047.02.patch, YARN-3047.04.patch Per design in YARN-2938, set up the ATS reader as a service and implement the basic structure as a service. It includes lifecycle management, request serving, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590243#comment-14590243 ] Hudson commented on YARN-3714: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #229 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/229/]) YARN-3714. AM proxy filter can not get RM webapp address from (xgong: rev e27d5a13b0623e3eb43ac773eccd082b9d6fa9d0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/amfilter/TestAmFilterInitializer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/RMHAUtils.java * hadoop-yarn-project/CHANGES.txt AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id -- Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
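For context on the fix: when only {{yarn.resourcemanager.hostname.<rm-id>}} is set, the proxy has to derive the webapp address from that hostname plus the default webapp port instead of reading it directly. A hedged sketch of such a derivation (the key names follow the standard YARN pattern, but the helper itself is illustrative, not the patched RMHAUtils code):
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative derivation of a per-RM webapp address when only the per-RM
// hostname is configured; 8088 is the stock default HTTP webapp port.
public class RmWebAppAddressExample {
  static String deriveWebAppAddress(Configuration conf, String rmId) {
    String explicit = conf.get("yarn.resourcemanager.webapp.address." + rmId);
    if (explicit != null) {
      return explicit; // explicitly configured, nothing to derive
    }
    String host = conf.get("yarn.resourcemanager.hostname." + rmId);
    if (host == null) {
      return null; // neither key set; caller must fall back further
    }
    return host + ":8088";
  }
}
{code}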
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Attachment: YARN-3819-2.patch Updates to DummyResourceCalculatorPlugin.java Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2745) Extend YARN to support multi-resource packing of tasks
[ https://issues.apache.org/jira/browse/YARN-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590289#comment-14590289 ] Allen Wittenauer commented on YARN-2745: How much of this is actually YARN specific though? YARN-3819 and YARN-3820 seem like things that HDFS should care about too. It seems extremely shortsighted not to commit the collection parts into common. Extend YARN to support multi-resource packing of tasks -- Key: YARN-2745 URL: https://issues.apache.org/jira/browse/YARN-2745 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, scheduler Reporter: Robert Grandl Assignee: Robert Grandl Attachments: sigcomm_14_tetris_talk.pptx, tetris_design_doc.docx, tetris_paper.pdf In this umbrella JIRA we propose an extension to existing scheduling techniques, which accounts for all resources used by a task (CPU, memory, disk, network) and it is able to achieve three competing objectives: fairness, improve cluster utilization and reduces average job completion time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3820: Attachment: YARN-3820-1.patch Added first cut patch Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3820-1.patch In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3811) NM restarts could lead to app failures
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590164#comment-14590164 ] Vinod Kumar Vavilapalli commented on YARN-3811: --- bq. We should also consider graceful NM decommission. For graceful decommission, the RM should refrain from assigning more tasks to the node in question. Should we also prevent AMs that have already been assigned this node from starting new containers? In that case, I guess we would not be throwing NMNotYetReadyException, but another YarnException - NMShuttingDownException? [~kasha], we could. Let's file a separate JIRA? bq. we should just avoid opening or processing the client port until we've registered with the RM if it's really a problem in practice [~jlowe], this is not possible to do, as the NM needs to report the RPC server port during registration - so the server start should happen before registration. bq. 2. For NM restart with no recovery support, startContainer will fail anyways because the NMToken is not valid. bq. 3. For work-preserving RM restart, containers launched before NM re-register can be recovered on RM when NM sends the container status across. startContainer call after re-register will fail because the NMToken is not valid. [~jianhe], these two errors will be much harder for apps to process and react to than the current named exception. Further, things like auxiliary services are also not yet set up by the time the RPC server starts, and depending on how the service order changes over time, users may get different types of errors. Overall, I am in favor of keeping the named exception, with clients explicitly retrying. NM restarts could lead to app failures -- Key: YARN-3811 URL: https://issues.apache.org/jira/browse/YARN-3811 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Consider the following scenario: 1. RM assigns a container on node N to an app A. 2. Node N is restarted 3. A tries to launch container on node N. 3 could lead to an NMNotYetReadyException depending on whether NM N has registered with the RM. In MR, this is considered a task attempt failure. A few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
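To make the preference above concrete (keep the named exception and have clients retry), a client-side loop might look roughly like the following. This is a simplified sketch; the real NM client proxy uses Hadoop's retry-policy machinery rather than a hand-written loop, and the retry count and sleep here are placeholders:
{code}
import java.util.concurrent.Callable;

import org.apache.hadoop.yarn.exceptions.NMNotYetReadyException;

public class NmRetryExample {
  static <T> T callWithRetry(Callable<T> nmCall, int maxAttempts) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return nmCall.call();
      } catch (NMNotYetReadyException e) {
        // The NM is up but has not registered with the RM yet; this is
        // transient by design, so back off and retry instead of failing
        // the task attempt.
        if (attempt >= maxAttempts) {
          throw e;
        }
        Thread.sleep(500L);
      }
    }
  }
}
{code}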
[jira] [Commented] (YARN-3811) NM restarts could lead to app failures
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590207#comment-14590207 ] Jason Lowe commented on YARN-3811: -- bq. this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration. Yes, but that's a limitation in the RPC layer. If we could bind the server before we start it then we could know the port, register with the RM, then start the server. IMHO the RPC layer should support this, but I understand we'll have to work around the lack of that in the interim. I think we all can agree the retry exception is just a hack being used because we can't keep the client service from serving too soon. NM restarts could lead to app failures -- Key: YARN-3811 URL: https://issues.apache.org/jira/browse/YARN-3811 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Consider the following scenario: 1. RM assigns a container on node N to an app A. 2. Node N is restarted 3. A tries to launch container on node N. 3 could lead to an NMNotYetReadyException depending on whether NM N has registered with the RM. In MR, this is considered a task attempt failure. A few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590237#comment-14590237 ] Hudson commented on YARN-3148: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #229 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/229/]) YARN-3148. Allow CORS related headers to passthrough in (devaraj: rev ebb9a82519c622bb898e1eec5798c2298c726694) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java * hadoop-yarn-project/CHANGES.txt Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
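The change described amounts to adding the CORS request headers to the proxy's pass-through set; a hedged sketch of copying such headers from the incoming request to the outgoing connection (a simplified stand-in for the servlet's actual passThroughHeaders handling):
{code}
import java.net.HttpURLConnection;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.servlet.http.HttpServletRequest;

public class CorsPassThroughExample {
  private static final Set<String> PASS_THROUGH_HEADERS = new HashSet<>(Arrays.asList(
      "User-Agent", "Accept", "Accept-Encoding", "Accept-Language", "Accept-Charset",
      // headers added so CORS preflight and actual requests survive the proxy
      "Origin", "Access-Control-Request-Method", "Access-Control-Request-Headers"));

  static void copyHeaders(HttpServletRequest req, HttpURLConnection conn) {
    for (String header : PASS_THROUGH_HEADERS) {
      String value = req.getHeader(header);
      if (value != null) {
        conn.setRequestProperty(header, value);
      }
    }
  }
}
{code}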
[jira] [Commented] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590375#comment-14590375 ] Hudson commented on YARN-3617: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2159 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2159/]) YARN-3617. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning (devaraj: rev 318d2cde7cb5c05a5f87c4ee967446bb60d28ae4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/CHANGES.txt Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Fix For: 2.8.0 Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
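The fix requested in the description amounts to returning the already-computed field instead of the constant. A sketch of the corrected method, assuming the field is named {{cpuFrequencyKhz}} as stated above (shown as a fragment of WindowsResourceCalculatorPlugin, not a standalone class):
{code}
  /** {@inheritDoc} */
  @Override
  public long getCpuFrequency() {
    refreshIfNeeded();
    return cpuFrequencyKhz; // previously hard-coded to -1
  }
{code}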
[jira] [Commented] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590379#comment-14590379 ] Hudson commented on YARN-3714: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2159 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2159/]) YARN-3714. AM proxy filter can not get RM webapp address from (xgong: rev e27d5a13b0623e3eb43ac773eccd082b9d6fa9d0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/RMHAUtils.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/amfilter/TestAmFilterInitializer.java AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id -- Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590373#comment-14590373 ] Hudson commented on YARN-3148: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2159 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2159/]) YARN-3148. Allow CORS related headers to passthrough in (devaraj: rev ebb9a82519c622bb898e1eec5798c2298c726694) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590297#comment-14590297 ] Robert Grandl commented on YARN-3820: - [~srikanthkandula] and I are proposing to collect disk usage on a node. This is part of a larger effort on multi-resource scheduling. Currently, YARN does not have any mechanism to monitor the number of bytes read from or written to the disks. Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
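One plausible way to collect node-level disk usage on Linux is to parse /proc/diskstats and convert the sector counters into bytes. A minimal sketch under that assumption (the actual patch may read different counters or go through a ResourceCalculatorPlugin method instead):
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Field 6 is sectors read and field 10 is sectors written in the documented
// /proc/diskstats format; these counters use a 512-byte sector convention.
public class DiskStatsExample {
  public static void main(String[] args) throws IOException {
    long sectorsRead = 0;
    long sectorsWritten = 0;
    List<String> lines =
        Files.readAllLines(Paths.get("/proc/diskstats"), StandardCharsets.UTF_8);
    for (String line : lines) {
      String[] f = line.trim().split("\\s+");
      if (f.length < 10 || f[2].startsWith("loop") || f[2].startsWith("ram")) {
        continue; // skip pseudo devices
      }
      // Note: this naively sums every listed device, including partitions.
      sectorsRead += Long.parseLong(f[5]);
      sectorsWritten += Long.parseLong(f[9]);
    }
    System.out.println("bytes read:    " + sectorsRead * 512L);
    System.out.println("bytes written: " + sectorsWritten * 512L);
  }
}
{code}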
[jira] [Commented] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590345#comment-14590345 ] Hudson commented on YARN-3617: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2177 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2177/]) YARN-3617. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning (devaraj: rev 318d2cde7cb5c05a5f87c4ee967446bb60d28ae4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/CHANGES.txt Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Fix For: 2.8.0 Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590349#comment-14590349 ] Hudson commented on YARN-3714: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2177 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2177/]) YARN-3714. AM proxy filter can not get RM webapp address from (xgong: rev e27d5a13b0623e3eb43ac773eccd082b9d6fa9d0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/RMHAUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/amfilter/TestAmFilterInitializer.java * hadoop-yarn-project/CHANGES.txt AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id -- Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590343#comment-14590343 ] Hudson commented on YARN-3148: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2177 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2177/]) YARN-3148. Allow CORS related headers to passthrough in (devaraj: rev ebb9a82519c622bb898e1eec5798c2298c726694) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Flags: Patch Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590172#comment-14590172 ] Vinod Kumar Vavilapalli commented on YARN-3798: --- [~ozawa], bumping for my comments and those from [~varun_saxena] and to figure out if I should hold 2.7.1 for this. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590217#comment-14590217 ] Robert Grandl commented on YARN-3819: - [~srikanthkandula] and I are proposing to collect the network usage on a node. This is part of a larger effort on multi-resource scheduling. Previous efforts that collect network usage per container are not enough for multi-resource scheduling, since they cannot capture other traffic activity on the node, such as ingestion or evacuation. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
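On Linux, node-wide network usage (including the ingestion and evacuation traffic that per-container accounting misses) can be read from /proc/net/dev. A minimal sketch under that assumption; the actual patch may expose the counters through a ResourceCalculatorPlugin method instead:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sums bytes received/transmitted across all non-loopback interfaces.
public class NetDevExample {
  public static void main(String[] args) throws IOException {
    long rxBytes = 0;
    long txBytes = 0;
    List<String> lines =
        Files.readAllLines(Paths.get("/proc/net/dev"), StandardCharsets.UTF_8);
    for (String line : lines) {
      int colon = line.indexOf(':');
      if (colon < 0) {
        continue; // skip the two header lines
      }
      String iface = line.substring(0, colon).trim();
      if (iface.equals("lo")) {
        continue; // ignore loopback traffic
      }
      String[] f = line.substring(colon + 1).trim().split("\\s+");
      rxBytes += Long.parseLong(f[0]); // received bytes
      txBytes += Long.parseLong(f[8]); // transmitted bytes
    }
    System.out.println("rx bytes: " + rxBytes + ", tx bytes: " + txBytes);
  }
}
{code}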
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590484#comment-14590484 ] zhihai xu commented on YARN-3591: - Hi [~vvasudev], thanks for the explanation. IMHO, if we want the LocalDirsHandlerService to be the central place for the state of the local dirs, doing it in {{DirsChangeListener#onDirsChanged}} would be better. IIUC, that is also your suggestion. The benefits of doing this are: 1. Better performance, because the check runs only when some dirs become bad, which should happen rarely, rather than on every localization request. 2. It will also help with the zombie files lying in the various paths that [~lavkesh] found, a similar issue to YARN-2624. 3. {{checkLocalizedResources}}/{{removeResource}} called by {{onDirsChanged}} will be done inside {{LocalDirsHandlerService#checkDirs}} without any delay. Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch It happens when a resource is localised on a disk and, after localisation, that disk goes bad. NM keeps paths for localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) will be called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true. But at the time of reading, the file will not open. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
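A minimal sketch of the check proposed in the description: call list() on the parent directory so the kernel actually has to open it, instead of relying on the cached answer from exists(). The class and method names are placeholders, not the actual NM localization code:
{code}
import java.io.File;

public class LocalizedResourceCheckExample {
  static boolean isResourcePresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return false;
    }
    // list() forces an open()/readdir() on the parent directory, so a disk
    // that only "exists" through cached inode data is detected as bad here,
    // unlike localizedPath.exists(), which can return a stale true.
    String[] entries = parent.list();
    if (entries == null || entries.length == 0) {
      return false;
    }
    for (String name : entries) {
      if (name.equals(localizedPath.getName())) {
        return true;
      }
    }
    return false;
  }
}
{code}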
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Attachment: YARN-3819-1.patch Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3811) NM restarts could lead to app failures
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590319#comment-14590319 ] Jian He commented on YARN-3811: --- bq. this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration. For RM work-preserving restart, this is not a problem as the NM remains as-is. For NM restart with no recovery, all outstanding containers allocated on this node are killed anyway. For NM work-preserving restart, I found the code already makes sure everything starts before the containerManager server is started: {code} if (delayedRpcServerStart) { waitForRecoveredContainers(); server.start(); {code} Overall, I think it's fine to add a client retry fix in 2.7.1; but long term I'd like to revisit this, maybe I am still missing something. NM restarts could lead to app failures -- Key: YARN-3811 URL: https://issues.apache.org/jira/browse/YARN-3811 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Consider the following scenario: 1. RM assigns a container on node N to an app A. 2. Node N is restarted 3. A tries to launch container on node N. 3 could lead to an NMNotYetReadyException depending on whether NM N has registered with the RM. In MR, this is considered a task attempt failure. A few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590472#comment-14590472 ] Lei Guo commented on YARN-3819: --- For multiple resource scheduling, we may have different resource types, not just CPU/disk/network. Even for network, we may need other attributes instead of just read and write. It's better to have some generic framework in RM/NM and collect data via plug-ins. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
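The plug-in idea above could take a shape roughly like the following hypothetical interface; the names and metrics are invented here purely to illustrate the suggestion, nothing like this exists in YARN today:
{code}
// Hypothetical plug-in contract for node resource collectors.
public interface NodeResourceCollector {
  /** A short identifier such as "network", "disk", or "gpu". */
  String getResourceName();

  /** Refreshes counters from the OS; called periodically by the NM. */
  void refresh();

  /** Returns the named metric (e.g. "rxBytes", "sectorsWritten"), or -1 if unknown. */
  long getMetric(String metricName);
}
{code}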
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Attachment: YARN-3819-3.patch Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590581#comment-14590581 ] Xuan Gong commented on YARN-3804: - [~varun_saxena] Looks like the patch does not apply for 2.7. Could you provide a patch for branch-2.7, please ? Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each RM attempt to switch to Active refreshACl is called and acl permission not available for the user Infinite retry for the same switch to Active and always false returned from {{ActiveStandbyElector#becomeActive()}} *Expected* RM should get shutdown event after few retry or even at first attempt Since at runtime user from which it retries for refreshacl can never be updated. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2801: - Attachment: (was: YARN-2801.md) Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590827#comment-14590827 ] Karthik Kambatla commented on YARN-3806: FairScheduler supports per-queue policy. Folks could always implement their own policies. YARN-3306 aims to generalize this, starting with leaf queues, so we have a single scheduler. Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590843#comment-14590843 ] Xuan Gong commented on YARN-3804: - Committed into trunk/branch-2/branch-2.7. Thanks, [~varun_saxena]. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Fix For: 2.7.1 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each attempt to transition to Active, the RM calls refreshAdminAcls, but the ACL does not grant that permission to the user. The RM retries the transition to Active indefinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should receive a shutdown event after a few retries, or even on the first attempt, since the user under which refreshAdminAcls is retried can never change at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
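For reference, a minimal sketch of one way to avoid the lock-out described above: make sure the ACL built from yarn.admin.acl always includes the daemon user the RM runs as. This is only an illustration of the idea under discussion, not the YARN-3804 patch itself; the class name below is invented, while the Hadoop APIs used (AccessControlList, UserGroupInformation, YarnConfiguration) are real.
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AdminAclSketch {
  // Build the admin ACL from yarn.admin.acl and, as a safety net, add the
  // daemon user the RM is running as, so internal calls such as
  // refreshAdminAcls cannot block the transition to Active.
  static AccessControlList buildAdminAcl(Configuration conf) throws IOException {
    AccessControlList acl = new AccessControlList(
        conf.get(YarnConfiguration.YARN_ADMIN_ACL,
            YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
    acl.addUser(UserGroupInformation.getCurrentUser().getShortUserName());
    return acl;
  }
}
{code}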
[jira] [Updated] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3804: Attachment: YARN-3804.branch-2.7.patch Upload a same patch but can apply to branch-2.7 Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each attempt to transition to Active, the RM calls refreshAdminAcls, but the ACL does not grant that permission to the user. The RM retries the transition to Active indefinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should receive a shutdown event after a few retries, or even on the first attempt, since the user under which refreshAdminAcls is retried can never change at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Description: Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. was: Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. 
However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590914#comment-14590914 ] Junping Du commented on YARN-3045: -- Hi [~Naganarasimha], given that YARN-3044 is already in, would you mind updating the patch here? Thanks! [Event producers] Implement NM writing container lifecycle events to ATS Key: YARN-3045 URL: https://issues.apache.org/jira/browse/YARN-3045 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Labels: BB2015-05-TBR Attachments: YARN-3045-YARN-2928.002.patch, YARN-3045-YARN-2928.003.patch, YARN-3045.20150420-1.patch Per design in YARN-2928, implement NM writing container lifecycle events and container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590852#comment-14590852 ] Karthik Kambatla commented on YARN-3809: Or, could we make it so we don't wait as long as 15 minutes? Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP). In our cluster there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. Its thread pool filled up, and all the threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 minutes. That means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590849#comment-14590849 ] Karthik Kambatla commented on YARN-3809: Shouldn't the number of threads in the pool be at least as big as the maximum number of apps that could run on a node? By making it configurable, how do we expect the admins to pick this number? Just pick an arbitrarily high value? Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP). In our cluster there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. Its thread pool filled up, and all the threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 minutes. That means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
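To make the trade-off above concrete, here is a hedged sketch of what a configurable launcher pool would look like; the configuration key and class name are hypothetical (YARN-3809 is still debating whether such a knob, a larger default, or a shorter wait is the right answer):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;

public class AMLauncherPoolSketch {
  // Hypothetical key, used only for illustration.
  static final String LAUNCHER_THREAD_COUNT = "yarn.resourcemanager.amlauncher.thread-count";
  static final int DEFAULT_LAUNCHER_THREAD_COUNT = 10;

  static ExecutorService createLauncherPool(Configuration conf) {
    int size = conf.getInt(LAUNCHER_THREAD_COUNT, DEFAULT_LAUNCHER_THREAD_COUNT);
    // Each LAUNCH/CLEANUP event runs as one task. If stopContainers() blocks
    // on a dead NM for the full RPC timeout, only `size` events can be stuck
    // at once before new LAUNCH events start queueing behind them.
    return Executors.newFixedThreadPool(size);
  }
}
{code}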
[jira] [Commented] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
[ https://issues.apache.org/jira/browse/YARN-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590893#comment-14590893 ] Colin Patrick McCabe commented on YARN-3812: bq. Yes, ImmutableFsPermission should not be overriding applyUMask since the method does not actually mutate the object. readFields does mutate and therefore is appropriate for preventing invocation for constant objects. I agree. It seems that {{FsPermission#ImmutablePermission}} is incorrectly overriding {{FsPermission#applyUMask}}. There is no reason to override this method since it doesn't modify the {{FsPermission}}. The right fix should be to simply stop overriding that method. Do you want to move the JIRA over to Hadoop-common and post a patch for that? TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347 -- Key: YARN-3812 URL: https://issues.apache.org/jira/browse/YARN-3812 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0 Reporter: Robert Kanter Assignee: Bibin A Chundatt Attachments: 0001-YARN-3812.patch {{TestRollingLevelDBTimelineStore}} is failing with the below errors in trunk. I did a git bisect and found that it was due to HADOOP-11347, which changed something with umasks in {{FsPermission}}. {noformat} Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec FAILURE! - in org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 1.533 sec ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.085 sec ERROR! 
java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.07 sec ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at
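For context, a hedged, simplified sketch of what the fix discussed above amounts to. FsPermission#applyUMask returns a new FsPermission and never mutates the receiver, so an immutable subclass has nothing to protect there and should simply stop overriding it; readFields does mutate and remains a legitimate method to block. This is an illustration, not the actual Hadoop source.
{code}
import java.io.DataInput;
import java.io.IOException;
import org.apache.hadoop.fs.permission.FsPermission;

// Simplified stand-in for the ImmutableFsPermission inner class.
class ImmutableFsPermissionSketch extends FsPermission {
  ImmutableFsPermissionSketch(short mode) {
    super(mode);
  }

  // Note: applyUMask() is intentionally NOT overridden. The inherited method
  // returns a new FsPermission and leaves this object untouched, so throwing
  // UnsupportedOperationException there (the current behaviour) is what breaks
  // RawLocalFileSystem.mkdirs() and the test above.

  @Override
  public void readFields(DataInput in) throws IOException {
    // readFields() really does mutate the permission, so blocking it on an
    // immutable instance is still appropriate.
    throw new UnsupportedOperationException();
  }
}
{code}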
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590747#comment-14590747 ] Eric Payne commented on YARN-2902: -- Hi [~varun_saxena]. Thank you very much for working on and fixing this issue. We are looking forward to your next patch. Do you have an ETA for when that might be? Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2745) Extend YARN to support multi-resource packing of tasks
[ https://issues.apache.org/jira/browse/YARN-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590858#comment-14590858 ] Karthik Kambatla commented on YARN-2745: YARN-3332 tracks the work required to move all this collection from within Yarn to a service that HDFS could also use. We are just getting the collection bits in first, and plan to consolidate and move things around after. Extend YARN to support multi-resource packing of tasks -- Key: YARN-2745 URL: https://issues.apache.org/jira/browse/YARN-2745 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, scheduler Reporter: Robert Grandl Assignee: Robert Grandl Attachments: sigcomm_14_tetris_talk.pptx, tetris_design_doc.docx, tetris_paper.pdf In this umbrella JIRA we propose an extension to existing scheduling techniques, which accounts for all resources used by a task (CPU, memory, disk, network) and it is able to achieve three competing objectives: fairness, improve cluster utilization and reduces average job completion time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrence of SESSIONEXPIRED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Summary: ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED (was: RM shutdown with NoNode exception while updating appAttempt on zk) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED --- Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at
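The new summary captures a ZooKeeper session-handling rule: a Disconnected event is transient and the client should simply wait to reconnect with the same session, whereas only an Expired event justifies building a new session. A hedged, generic illustration of that rule (not the ZKRMStateStore code itself):
{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

class SessionWatcherSketch implements Watcher {
  @Override
  public void process(WatchedEvent event) {
    switch (event.getState()) {
      case SyncConnected:
        // Connected, or reconnected with the SAME session: safe to retry pending ops.
        break;
      case Disconnected:
        // Transient network problem: do NOT create a new session here,
        // just wait for the client to reconnect.
        break;
      case Expired:
        // Only now is it correct to discard the old handle and start a new session.
        break;
      default:
        break;
    }
  }
}
{code}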
[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-433: --- Attachment: YARN-433.2.patch Fix the testcase failure When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590239#comment-14590239 ] Hudson commented on YARN-3617: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #229 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/229/]) YARN-3617. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning (devaraj: rev 318d2cde7cb5c05a5f87c4ee967446bb60d28ae4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/CHANGES.txt Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Fix For: 2.8.0 Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
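A hedged sketch of the one-line change that was requested, assuming (per the description) that the class already maintains a cpuFrequencyKhz field populated by refreshIfNeeded(); shown as a fragment matching the method quoted above rather than a complete class:
{code}
/** {@inheritDoc} */
@Override
public long getCpuFrequency() {
  refreshIfNeeded();
  return cpuFrequencyKhz; // previously hard-coded to -1
}
{code}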
[jira] [Created] (YARN-3820) Collect disks usages on the node
Robert Grandl created YARN-3820: --- Summary: Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3528: --- Attachment: YARN-3528.patch Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practise has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590555#comment-14590555 ] Robert Grandl commented on YARN-3819: - Short description of this JIRA: We process the /proc/net/dev file, which reports, for every network interface present on the node, the cumulative number of bytes read and written. We aggregate these numbers across all interfaces except loopback. We verified that this file exists on the following Linux kernel versions: Linux 3.2.0, Linux 2.6.32, Linux 3.13.0. Further searching on the web suggests people are already using and recommending this file for extracting network read/write byte counters. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
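A minimal, hedged sketch of the parsing described above. It assumes the standard /proc/net/dev layout, where the first counter after the interface name is receive bytes and the ninth is transmit bytes; the class name is invented and this is not the attached patch:
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ProcNetDevSketch {
  /** Returns {totalBytesRead, totalBytesWritten} summed over all non-loopback interfaces. */
  static long[] readNetworkCounters() throws IOException {
    long read = 0;
    long written = 0;
    try (BufferedReader br = new BufferedReader(new FileReader("/proc/net/dev"))) {
      String line;
      while ((line = br.readLine()) != null) {
        int colon = line.indexOf(':');
        if (colon < 0) {
          continue; // header lines
        }
        String iface = line.substring(0, colon).trim();
        if (iface.equals("lo")) {
          continue; // skip loopback, as described in the comment above
        }
        String[] fields = line.substring(colon + 1).trim().split("\\s+");
        read += Long.parseLong(fields[0]);    // receive bytes
        written += Long.parseLong(fields[8]); // transmit bytes
      }
    }
    return new long[] {read, written};
  }
}
{code}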
[jira] [Updated] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2801: - Attachment: YARN-2801.1.patch Fixed a bunch of formatting issues and reattached patch. Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Attachments: YARN-2801.1.patch Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590562#comment-14590562 ] Varun Saxena commented on YARN-3528: Thanks for updating the patch [~brahmareddy]. A few comments. # Move {{ServerSocketUtil}} to {{org.apache.hadoop.net}} instead of having it in {{org.apache.hadoop.fs}}. # As this is a utility class which might be used elsewhere as well, can we pass an initial port into {{getPort()}} and try that port first before randomizing? We can use this instead of using 49152 every time. # We can probably pass the number of retries as well instead of fixing it at 10. Let the caller decide. Thoughts? # Replace {{System.out.println}} with {{LOG}}. # In the catch block there is no need to create a new {{IOException}} wrapping the caught exception. You can rethrow the caught exception directly, as it is also an {{IOException}} and you are not adding any additional information. # What's the point of having {{getFreePort()}}? You can write 0 directly in the code instead of calling this function, or use a constant. Thoughts? # If this class is to be used only by tests, can we move it to the test folder? # In {{TestNodeManagerShutdown}}, catching {{RuntimeException}} is unnecessary. # In {{TestNodeManagerShutdown#startContainer}}, if an exception is thrown (i.e. no free port is available), the code simply continues on to call {{rpc.getProxy()}} with a {{null}} containerManagerBindAddress. We can probably throw an exception so that the test fails at the correct location. # The line below can be removed from {{BaseContainerManagerTest}}: {code} // String bindAddress = 0.0.0.0:12345; {code} Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practise has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
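To make review comments 2 and 3 above concrete, here is a hedged sketch of a getPort(preferredPort, retries) utility that tries the caller's preferred port first and then falls back to random ports in the dynamic range, with the retry count chosen by the caller. The names and package placement are illustrative only and do not reflect the final ServerSocketUtil.
{code}
import java.io.IOException;
import java.net.ServerSocket;
import java.util.Random;

public class ServerSocketUtilSketch {
  private static final Random RANDOM = new Random();

  /** Try preferredPort first, then up to `retries` random ports in the dynamic range. */
  public static int getPort(int preferredPort, int retries) throws IOException {
    int port = preferredPort;
    for (int i = 0; i <= retries; i++) {
      try (ServerSocket socket = new ServerSocket(port)) {
        return socket.getLocalPort();
      } catch (IOException e) {
        // Port is busy: pick a random port in the 49152-65535 range and retry.
        port = 49152 + RANDOM.nextInt(65535 - 49152);
      }
    }
    throw new IOException("Could not find a free port after " + retries + " retries");
  }
}
{code}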
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590587#comment-14590587 ] Srikanth Kandula commented on YARN-3819: [~grey] The patch does have the generic component, in that it needs /proc/net... It would be possible to expose whatever additional fields end up being needed by schedulers or monitors. We only expose a first cut of them (total read/ written). Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590600#comment-14590600 ] Lei Guo commented on YARN-3819: --- Thanks for the explanation. My concern is mainly that we will need to update the code whenever new resource information is needed for scheduling purposes. If we had a generic framework, an integration developer could write a script to feed information into the NM, and the RM could then schedule based on that; this is part of my comment in YARN-3332, https://issues.apache.org/jira/browse/YARN-3332?focusedCommentId=14355923page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14355923 Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3244) Add user specified information for clean-up container in ApplicationSubmissionContext
[ https://issues.apache.org/jira/browse/YARN-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3244: Attachment: YARN-3244.5.patch rebase the patch Add user specified information for clean-up container in ApplicationSubmissionContext - Key: YARN-3244 URL: https://issues.apache.org/jira/browse/YARN-3244 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3244.1.patch, YARN-3244.2.patch, YARN-3244.3.patch, YARN-3244.4.patch, YARN-3244.5.patch To launch user-specified clean up container, users need to provide proper informations to YARN. It should at least have following properties: * A flag to indicate whether needs to launch the clean-up container * A time-out period to indicate how long the clean-up container can run * maxRetry times * containerLaunchContext for clean-up container -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590699#comment-14590699 ] Carlo Curino commented on YARN-3656: Patch looks good to me (as you followed all previous rounds of advises we discussed). Looking at it, I would argue that the structure of a solution you provide, i.e., decomposing the placement agent into various sub-routines and a choice of policies for time-range selection (IStageEarliestStart) , and single-atom allocation (IStageAllocator) is quite elegant. I would propose, in fact, to gut the existing GreedyReservationAgent and turn it into a simple configuration class that selects the right set of policies for IStageAllocator, IStageEarliestStart and iteratively run over those. This would be cleaner, and make it easier for people to improve on specific sub-set of this problem. If you agree with this, you must ensure the behavior is identical to the current GreedyReservationAgent. You should be able to do this fairly easily given the infrastructure you have for your agents and the testing harnesses you have in place. My 2 cents... LowCost: A Cost-Based Placement Agent for YARN Reservations --- Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Ishai Menache Assignee: Jonathan Yaniv Labels: capacity-scheduler, resourcemanager Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
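A hedged sketch of the decomposition described above: a single agent configured with a time-range-selection policy and a per-stage allocation policy, so the greedy agent becomes just one choice of the two. Only the names IStageEarliestStart and IStageAllocator come from the discussion; the signatures and types below are invented for illustration and differ from the YARN-3656 patch.
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative shapes only; the real interfaces in the patch differ.
interface IStageEarliestStart {
  long pickEarliestStart(Object stage, long jobArrival, long jobDeadline);
}

interface IStageAllocator {
  Object allocateStage(Object stage, long earliestStart, long deadline);
}

class ConfigurableReservationAgentSketch {
  private final IStageEarliestStart startPolicy;
  private final IStageAllocator allocator;

  ConfigurableReservationAgentSketch(IStageEarliestStart startPolicy,
      IStageAllocator allocator) {
    this.startPolicy = startPolicy;
    this.allocator = allocator;
  }

  // A greedy agent and a cost-based (LowCost) agent would share this loop and
  // differ only in which policy implementations they are configured with.
  List<Object> place(List<Object> stages, long arrival, long deadline) {
    List<Object> allocations = new ArrayList<Object>();
    for (Object stage : stages) {
      long start = startPolicy.pickEarliestStart(stage, arrival, deadline);
      allocations.add(allocator.allocateStage(stage, start, deadline));
    }
    return allocations;
  }
}
{code}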
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590576#comment-14590576 ] Hadoop QA commented on YARN-3591: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 14s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 46s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 52s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 36s | The applied patch generated 2 new checkstyle issues (total was 172, now 174). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 11s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 44m 29s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12740121/YARN-3591.5.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 6e3fcff | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8272/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8272/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8272/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8272/console | This message was automatically generated. Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch It happens when a resource is localised on the disk, after localising that disk has gone bad. NM keeps paths for localised resources in memory. At the time of resource request isResourcePresent(rsrc) will be called which calls file.exists() on the localised path. In some cases when disk has gone bad, inodes are stilled cached and file.exists() returns true. But at the time of reading, file will not open. Note: file.exists() actually calls stat64 natively which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. 
If the disk is good, it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
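As a hedged illustration of the proposal in the description (not the attached patch): instead of trusting file.exists(), list the parent directory, which forces a real open() on the directory and fails on a bad disk.
{code}
import java.io.File;

public class LocalResourceCheckSketch {
  /**
   * Returns true only if the parent directory of the localized path can
   * actually be read. On a bad disk, file.exists() may still return true from
   * cached inode data, while File.list() forces an open() and returns null.
   */
  static boolean isResourcePresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return false;
    }
    String[] entries = parent.list();
    return entries != null && entries.length >= 1 && localizedPath.exists();
  }
}
{code}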
[jira] [Updated] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3820: Attachment: YARN-3820-2.patch Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3820-1.patch, YARN-3820-2.patch In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3244) Add user specified information for clean-up container in ApplicationSubmissionContext
[ https://issues.apache.org/jira/browse/YARN-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590702#comment-14590702 ] Xuan Gong commented on YARN-3244: - [~jianhe] Please take a look. Add user specified information for clean-up container in ApplicationSubmissionContext - Key: YARN-3244 URL: https://issues.apache.org/jira/browse/YARN-3244 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3244.1.patch, YARN-3244.2.patch, YARN-3244.3.patch, YARN-3244.4.patch, YARN-3244.5.patch To launch user-specified clean up container, users need to provide proper informations to YARN. It should at least have following properties: * A flag to indicate whether needs to launch the clean-up container * A time-out period to indicate how long the clean-up container can run * maxRetry times * containerLaunchContext for clean-up container -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2801: - Attachment: YARN-2801.md Attached initial version for review. Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Attachments: YARN-2801.md Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591120#comment-14591120 ] Ray Chiang commented on YARN-3069: -- I found one more important mismatch in the existing file. XML Property: yarn.scheduler.maximum-allocation-vcores XML Value:32 Config Name: DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES Config Value: 4 The Config value comes from YARN-193 and the default xml property comes from YARN-2. Should we keep it this way or should one of the values get updated? Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime 
yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
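To make the mismatch above concrete, a hedged snippet showing the two defaults in play. The constant names are real YarnConfiguration fields; which value a cluster actually sees (32 from yarn-default.xml versus the class default of 4) depends on whether yarn-default.xml is on the classpath, which is exactly the ambiguity being discussed.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MaxVcoresDefaultSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // With yarn-default.xml on the classpath this reads the XML value (32);
    // the class constant is only the fallback when the XML default is absent.
    int maxVcores = conf.getInt(
        YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES);
    System.out.println("effective yarn.scheduler.maximum-allocation-vcores = " + maxVcores);
    System.out.println("class default = "
        + YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES);
  }
}
{code}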
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591165#comment-14591165 ] Akira AJISAKA commented on YARN-3069: - Nice catch! I'm thinking we can discuss the issue about the mismatch in a separate jira. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri 
yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3821) Scheduler spams log with messages at INFO level
Wilfred Spiegelenburg created YARN-3821: --- Summary: Scheduler spams log with messages at INFO level Key: YARN-3821 URL: https://issues.apache.org/jira/browse/YARN-3821 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Affects Versions: 2.8.0 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Priority: Minor The schedulers spam the logs with messages that do not provide any actionable information. No action is taken in the code and there is nothing that needs to be done from an administrative point of view. Even after the message improvements from YARN-3197 and YARN-3495, administrators get confused and ask what needs to be done to prevent the log spam. Moving these messages to debug level makes far more sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
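A hedged sketch of the kind of change being proposed, using the commons-logging idiom common in YARN at the time; the class, method, and message text below are made up for illustration:
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class SchedulerLoggingSketch {
  private static final Log LOG = LogFactory.getLog(SchedulerLoggingSketch.class);

  void onNodeHeartbeat(String nodeId) {
    // Before: LOG.info(...) on every heartbeat, which floods the RM log.
    // After: only emit the message when debug logging is explicitly enabled.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Node " + nodeId + " heartbeat handled; no containers assigned");
    }
  }
}
{code}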
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591194#comment-14591194 ] Varun Saxena commented on YARN-3804: Thanks for the review and commit [~xgong] Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Fix For: 2.7.1 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, but the ACL permission is not available for the user. The switch to Active is retried indefinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591255#comment-14591255 ] Ray Chiang commented on YARN-3069: -- Created YARN-3823. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class 
yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3823) Fix mismatch in default values for yarn.scheduler.maximum-allocation-vcores property
Ray Chiang created YARN-3823: Summary: Fix mismatch in default values for yarn.scheduler.maximum-allocation-vcores property Key: YARN-3823 URL: https://issues.apache.org/jira/browse/YARN-3823 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor In yarn-default.xml, the property is defined as: XML Property: yarn.scheduler.maximum-allocation-vcores XML Value: 32 In YarnConfiguration.java the corresponding member variable is defined as: Config Name: DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES Config Value: 4 The Config value comes from YARN-193 and the default xml property comes from YARN-2. Should we keep it this way or should one of the values get updated? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
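To make the mismatch concrete, a side-by-side sketch of the two definitions as described above (formatting illustrative, not the exact source lines):

{code}
// In yarn-default.xml:
//   <property>
//     <name>yarn.scheduler.maximum-allocation-vcores</name>
//     <value>32</value>
//   </property>
//
// In YarnConfiguration.java:
//   public static final int DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = 4;
//
// Configuration.getInt(key, DEFAULT_...) only falls back to the Java constant when
// the key is absent, so the xml value (32) normally wins whenever yarn-default.xml
// is on the classpath -- hence the question of which of the two should be changed.
{code}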
[jira] [Created] (YARN-3825) Add automatic search of default Configuration variables to TestConfigurationFieldsBase
Ray Chiang created YARN-3825: Summary: Add automatic search of default Configuration variables to TestConfigurationFieldsBase Key: YARN-3825 URL: https://issues.apache.org/jira/browse/YARN-3825 Project: Hadoop YARN Issue Type: Test Components: test Affects Versions: 2.7.0 Reporter: Ray Chiang Assignee: Ray Chiang Add functionality so that, given a Configuration variable FOO, the test at least checks the value in the xml file against DEFAULT_FOO. Without waivers and a mapping for exceptions, this can probably never be a test method that generates actual errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
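A minimal sketch of the kind of check being proposed, assuming the usual YarnConfiguration convention that a public String member FOO holds the property key and DEFAULT_FOO holds the in-code default. This is an illustration only, not the TestConfigurationFieldsBase code, and it requires yarn-default.xml on the classpath:

{code}
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DefaultValueCheckSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(false);
    conf.addResource("yarn-default.xml");            // xml-declared defaults
    Class<?> c = YarnConfiguration.class;
    for (Field f : c.getFields()) {
      // Only consider public static String members that look like property keys.
      if (!Modifier.isStatic(f.getModifiers()) || f.getType() != String.class
          || f.getName().startsWith("DEFAULT_")) {
        continue;
      }
      String key = (String) f.get(null);              // e.g. "yarn.scheduler.maximum-allocation-vcores"
      Field defField;
      try {
        defField = c.getField("DEFAULT_" + f.getName());
      } catch (NoSuchFieldException e) {
        continue;                                     // no DEFAULT_FOO counterpart
      }
      String xmlValue = conf.get(key);                // value from yarn-default.xml, if any
      Object codeDefault = defField.get(null);
      if (xmlValue != null && codeDefault != null
          && !xmlValue.equals(String.valueOf(codeDefault))) {
        System.out.println("Mismatch for " + key + ": xml=" + xmlValue
            + " code=" + codeDefault);
      }
    }
  }
}
{code}

As the description notes, without a waiver list this can only report mismatches (such as the YARN-3823 vcores case) rather than fail the build.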
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591285#comment-14591285 ] Varun Saxena commented on YARN-2902: Yeah this is pending for a long time. I have to primarily update tests. Got taken over by other tasks. Don't have any 2.7.1 related JIRAs' with me right now so will update a patch by this weekend. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-3069: - Attachment: YARN-3069.012.patch - Implement Akira's last 3 comments - First version including fixes from HADOOP-12101 -- Fix default in yarn.nodemanager.env-whitelist to match -- Fix spacing in two other properties to match Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch, YARN-3069.012.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled 
yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3824) Fix two minor nits in member variable properties of YarnConfiguration
Ray Chiang created YARN-3824: Summary: Fix two minor nits in member variable properties of YarnConfiguration Key: YARN-3824 URL: https://issues.apache.org/jira/browse/YARN-3824 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Trivial Attachments: YARN-3824.001.patch Two nitpicks that could be cleaned up easily: - DEFAULT_YARN_INTERMEDIATE_DATA_ENCRYPTION is defined as a java.lang.Boolean instead of a boolean primitive - DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED is missing the final keyword -- This message was sent by Atlassian JIRA (v6.3.4#6332)
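A before/after sketch of the two nits; the false literals are placeholders for whatever defaults YarnConfiguration actually assigns:

{code}
public class YarnConfigurationNitsSketch {
  // Before (as described above):
  //   public static final Boolean DEFAULT_YARN_INTERMEDIATE_DATA_ENCRYPTION = ...; // boxed Boolean
  //   public static boolean DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED = ...;        // missing final
  //
  // After the proposed cleanup: primitive boolean and final on both constants.
  public static final boolean DEFAULT_YARN_INTERMEDIATE_DATA_ENCRYPTION = false;
  public static final boolean DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED = false;
}
{code}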
[jira] [Updated] (YARN-3824) Fix two minor nits in member variable properties of YarnConfiguration
[ https://issues.apache.org/jira/browse/YARN-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-3824: - Attachment: YARN-3824.001.patch Fix two minor nits in member variable properties of YarnConfiguration - Key: YARN-3824 URL: https://issues.apache.org/jira/browse/YARN-3824 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Trivial Labels: newbie Attachments: YARN-3824.001.patch Two nitpicks that could be cleaned up easily: - DEFAULT_YARN_INTERMEDIATE_DATA_ENCRYPTION is defined as a java.lang.Boolean instead of a boolean primitive - DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED is missing the final keyword -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591295#comment-14591295 ] Ray Chiang commented on YARN-3069: -- Forgot to mention the above two comments are for the .012 patch, coming next. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri 
yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591055#comment-14591055 ] Sandy Ryza commented on YARN-1197: -- The latest proposal makes sense to me as well. Thanks [~wangda] and [~mding]! Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591075#comment-14591075 ] Hadoop QA commented on YARN-3819: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 4s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:red}-1{color} | javac | 3m 3s | The patch appears to cause the build to fail. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12740201/YARN-3819-3.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / cc43288 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8274/console | This message was automatically generated. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591143#comment-14591143 ] Hudson commented on YARN-3804: -- FAILURE: Integrated in Hadoop-trunk-Commit #8035 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8035/]) YARN-3804. Both RM are on standBy state when kerberos user not in yarn.admin.acl. Contributed by Varun Saxena (xgong: rev a826d432f9b45550cc5ab79ef63ca39b176dabb2) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Fix For: 2.7.1 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 
5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active refreshACl is called and acl permission not available for the user Infinite retry for the same switch to Active and always false returned from {{ActiveStandbyElector#becomeActive()}} *Expected* RM should get shutdown event after few retry or even at first attempt Since at runtime user from which it retries for refreshacl can never be updated. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Attachment: YARN-3819-4.patch Updated patch to address the failure and the whitespaces Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch, YARN-3819-4.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591193#comment-14591193 ] Varun Saxena commented on YARN-3804: [~xgong], thanks for updating branch-2.7 patch. Didn't notice your comment due to time difference. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Fix For: 2.7.1 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, but the ACL permission is not available for the user. The switch to Active is retried indefinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591093#comment-14591093 ] Jun Gong commented on YARN-3809: [~devaraj.k] and [~kasha], thank you for the comments and suggestions. {quote} Shouldn't the number of threads in the pool be at least as big as the maximum number of apps that could run on a node? By making it configurable, how do we expect the admins to pick this number? Just pick an arbitrarily high value? {quote} Threads in the pool are just launching/stopping AMs, so it would be better if the number of threads in the pool is at least as big as the maximum number of AMs that could run on a node. Although we cannot know the maximum value for all clusters in advance, a larger value will let the launcher deal with AMLauncher events faster. Admins could just pick the default value and adjust it if they find it a little small. {quote} Or, could we make it so we don't wait as long as 15 minutes? {quote} Yes, we could make it shorter. I think we also need a larger thread pool so that it can deal with more events at the same time. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch ApplicationMasterLauncher creates a thread pool of size 10 to deal with AMLauncherEventType (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM LOST, it cleaned up the AMs running on it. ApplicationMasterLauncher then needed to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all its threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down; the default RPC timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
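A rough sketch of the configurable-pool idea discussed above; the config key, default value, and class name here are hypothetical, not the names used in the actual patch:

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class LauncherPoolSketch {
  // Hypothetical key and default for illustration only.
  static final String LAUNCHER_THREAD_COUNT = "yarn.resourcemanager.amlauncher.thread-count";
  static final int DEFAULT_LAUNCHER_THREAD_COUNT = 50;

  static ThreadPoolExecutor createLauncherPool(Configuration conf) {
    int threads = conf.getInt(LAUNCHER_THREAD_COUNT, DEFAULT_LAUNCHER_THREAD_COUNT);
    // Unbounded queue with a fixed number of workers: a slow stopContainers() call to a
    // dead NM can still block one worker for the full RPC timeout, but with more workers
    // other LAUNCH/CLEANUP events keep flowing instead of queuing behind it.
    return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
  }
}
{code}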
[jira] [Updated] (YARN-3821) Scheduler spams log with messages at INFO level
[ https://issues.apache.org/jira/browse/YARN-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-3821: Attachment: YARN-3821.patch Moving messages to debug level. No tests added since this is just a log level change. Scheduler spams log with messages at INFO level --- Key: YARN-3821 URL: https://issues.apache.org/jira/browse/YARN-3821 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Affects Versions: 2.8.0 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Priority: Minor Attachments: YARN-3821.patch The schedulers spam the logs with messages that do not provide any actionable information. No action is taken in the code, and there is nothing that needs to be done from an administrative point of view. Even after the message improvements from YARN-3197 and YARN-3495, administrators get confused and ask what needs to be done to prevent the log spam. Moving the messages to the debug log level makes far more sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
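The change itself is a mechanical pattern applied to each offending log statement; a sketch of the pattern on one message (the message text is illustrative, not taken from the patch):

{code}
// Before: emitted on every scheduler allocation pass, flooding the RM log at INFO.
LOG.info("Trying to fulfill reservation for application " + appId + " on node " + node);

// After: same message, only built and emitted when debug logging is enabled.
if (LOG.isDebugEnabled()) {
  LOG.debug("Trying to fulfill reservation for application " + appId + " on node " + node);
}
{code}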
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.2.pdf Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.2.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Attachment: (was: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf) Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.2.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Attachment: (was: ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf) Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.2.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591189#comment-14591189 ] Tsuyoshi Ozawa commented on YARN-3801: -- Thanks Sangjin, Zhijie, and Sean for your reviews! [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util - Key: YARN-3801 URL: https://issues.apache.org/jira/browse/YARN-3801 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Fix For: YARN-2928 Attachments: YARN-3801.001.patch timelineservice depends on hbase-client and hbase-testing-util, and they dpend on jdk.tools:1.7. This leads to fail to compile hadoop with JDK8. {quote} [WARNING] Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 [WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message: Failed while enforcing releasability the error(s) are [ Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
[ https://issues.apache.org/jira/browse/YARN-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3812: --- Attachment: 0002-YARN-3812.patch Thank you for the review comments. As per the comments, I have updated ImmutableFsPermission. Please review the patch. All tests in the TestRollingLevelDB class are passing. TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347 -- Key: YARN-3812 URL: https://issues.apache.org/jira/browse/YARN-3812 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0 Reporter: Robert Kanter Assignee: Bibin A Chundatt Attachments: 0001-YARN-3812.patch, 0002-YARN-3812.patch {{TestRollingLevelDBTimelineStore}} is failing with the below errors in trunk. I did a git bisect and found that it was due to HADOOP-11347, which changed something with umasks in {{FsPermission}}. {noformat} Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec FAILURE! - in org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 1.533 sec ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.085 sec ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.07 sec ERROR! 
java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at
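The failure comes from {{RawLocalFileSystem}} calling {{applyUMask}} on an {{ImmutableFsPermission}}, which throws. A minimal sketch of one way around it, handing the filesystem a mutable copy of the permission; this is an assumption about the direction of the fix, not necessarily what the attached 0002 patch does:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class MkdirsPermissionSketch {
  // Pass a fresh FsPermission copy instead of a shared immutable constant so that
  // the local filesystem's umask handling operates on a mutable instance.
  static void mkdirsWithCopy(FileSystem fs, Path dir, FsPermission template)
      throws IOException {
    FsPermission mutable = new FsPermission(template);   // copy constructor
    fs.mkdirs(dir, mutable);
  }
}
{code}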
[jira] [Created] (YARN-3822) Scalability validation of RM writing app/attempt/container lifecycle events
Zhijie Shen created YARN-3822: - Summary: Scalability validation of RM writing app/attempt/container lifecycle events Key: YARN-3822 URL: https://issues.apache.org/jira/browse/YARN-3822 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, timelineserver Reporter: Zhijie Shen Assignee: Naganarasimha G R We need to test how scalable the RM metrics publisher is. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590133#comment-14590133 ] Zhijie Shen commented on YARN-3051: --- [~sjlee0], thanks for chiming in. Varun, Li and I recently had an offline discussion. In general, we agreed on focusing on the storage-oriented interface (raw data query) together with an FS implementation of it in this jira, and on spinning off the changes for the user-oriented interface, the web front wire-up, and the single reader daemon setup, dealing with them separately. The rationale is to roll out the reader interface fast, so we can work on the HBase/Phoenix implementation and the web front wire-up against a commonly agreed interface in parallel. What do you think of the plan? bq. It's already doing that to some extent, and we should push that some more. For instance, it might be helpful to create Context. Context is useful. Instead of creating a new one, maybe we can reuse the existing Context, which hosts more content than the reader needs. We just need to let the reader put/get the required information to/from it. bq. In essence, one way to look at it is that a query onto the storage is really (context) + (predicate/filters) + (contents to retrieve). Then we could consolidate arguments into these coarse-grained things. +1, LGTM, but I think that is for the query that searches for a set of qualified entities, right? For fetching a single entity, the query may look like (context) + (entity identifier) + (contents to retrieve). Another issue I want to raise is that, after our performance evaluation, we agreed on using HBase for raw data and Phoenix for aggregated data. It implies that we need to use HBase to implement the APIs for the raw entities, while using Phoenix to implement the APIs for the aggregated data. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051-YARN-2928.003.patch, YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
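To make the "(context) + (predicate/filters) + (contents to retrieve)" and "(context) + (entity identifier) + (contents to retrieve)" shapes concrete, a hypothetical sketch; none of these type or method names are the ones introduced by the patch:

{code}
import java.io.IOException;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// All types here are placeholders standing in for the real reader API classes.
public interface TimelineReaderSketch {
  enum Field { INFO, CONFIGS, METRICS, EVENTS, RELATES_TO }

  final class ReaderContext {            // cluster/user/flow/app identifying info
    public String clusterId, userId, flowId, appId;
  }

  final class EntityFilters {            // predicates evaluated by the storage layer
    public Map<String, Object> infoFilters;
    public Long createdTimeBegin, createdTimeEnd;
  }

  final class TimelineEntity {           // minimal stand-in for the returned entity
    public String type, id;
  }

  // (context) + (entity identifier) + (contents to retrieve)
  TimelineEntity getEntity(ReaderContext context, String entityType,
      String entityId, EnumSet<Field> fieldsToRetrieve) throws IOException;

  // (context) + (predicate/filters) + (contents to retrieve)
  Set<TimelineEntity> getEntities(ReaderContext context, String entityType,
      EntityFilters filters, EnumSet<Field> fieldsToRetrieve) throws IOException;
}
{code}

Under the split discussed above, an FS-backed class would implement this shape first, with the HBase implementation (raw entities) and Phoenix implementation (aggregated data) following against the same interface.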
[jira] [Updated] (YARN-3813) Support Application timeout feature in YARN.
[ https://issues.apache.org/jira/browse/YARN-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-3813: - Component/s: scheduler Fix Version/s: (was: 2.8.0) Support Application timeout feature in YARN. - Key: YARN-3813 URL: https://issues.apache.org/jira/browse/YARN-3813 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: nijel It would be useful to support an application timeout in YARN. Some use cases do not care about the output of an application if it does not complete within a specific time. *Background:* The requirement is to show the CDR statistics of the last few minutes, say every 5 minutes. The same job runs continuously with a different dataset, so one job is started every 5 minutes. The estimated time for this task is 2 minutes or less. If the application does not complete in the given time, its output is not useful. *Proposal* The idea is to support an application timeout, where a timeout parameter is given while submitting the job. Here, the user expects the application to finish (complete or be killed) within the given time. One option is to move this logic to the application client (which submits the job), but it would be nicer as generic logic, which would also be more robust. Kindly provide your suggestions/opinions on this feature. If it sounds good, I will update the design doc and prototype patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
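As a concrete illustration of the client-side option mentioned in the proposal, a minimal sketch using the existing YarnClient API; the class and method names of the sketch itself are made up, and the 5-second polling interval is arbitrary:

{code}
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class ClientSideTimeoutSketch {
  /** Poll the application and kill it if it is still running after timeoutMs. */
  static void enforceTimeout(YarnClient client, ApplicationId appId, long timeoutMs)
      throws IOException, YarnException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      YarnApplicationState state =
          client.getApplicationReport(appId).getYarnApplicationState();
      if (state == YarnApplicationState.FINISHED
          || state == YarnApplicationState.FAILED
          || state == YarnApplicationState.KILLED) {
        return;                           // finished within the timeout
      }
      Thread.sleep(5000);                 // poll every 5 seconds
    }
    client.killApplication(appId);        // still running: the output is no longer useful
  }
}
{code}

A server-side (generic) version would enforce the same deadline inside the RM, which is the more robust variant the proposal asks about.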
[jira] [Created] (YARN-3819) Collect network usage on the node
Robert Grandl created YARN-3819: --- Summary: Collect network usage on the node Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Grandl Assignee: Robert Grandl In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590533#comment-14590533 ] Robert Grandl commented on YARN-3820: - Short description: This JIRA collects bytes read/written from/to disks in Linux. Step 1: We exploit the /proc/diskstats counters, extract the number of sectors read/written for every disk, and return the aggregation of these counters among all the disks. Step 2: To convert sectors into bytes, for every disk, we extract the sector size from /sys/block/diskName/queue/hw_sector_size. Step 3: Finally, by multiplying the number of sectors from Step 1 with the sector size from Step 2, we compute the number of bytes. We tested the existence of these files in the following Linux kernel versions: Linux 3.2.0 Linux 2.6.32 Linux 3.13.0 Also, from further searching on the web, it seems people use/recommend these files for extracting disk read/write counters. Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3820-1.patch In this JIRA we propose to collect disk usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
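A minimal sketch of Steps 1-3, reading /proc/diskstats and /sys/block/<disk>/queue/hw_sector_size; the field positions follow the standard /proc/diskstats layout, and the partition filtering here is deliberately crude (the actual patch may differ):

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DiskBytesSketch {
  /** Returns {bytesRead, bytesWritten} aggregated over whole disks. */
  public static long[] readAndWrittenBytes() throws IOException {
    long readBytes = 0, writtenBytes = 0;
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/diskstats"))) {
      String line;
      while ((line = r.readLine()) != null) {
        // After splitting: f[0]=major, f[1]=minor, f[2]=device,
        // f[5]=sectors read, f[9]=sectors written.
        String[] f = line.trim().split("\\s+");
        if (f.length < 14) {
          continue;
        }
        String dev = f[2];
        if (dev.matches(".*\\d+$")) {
          continue;                       // crude skip of partitions (sda1, ...) to avoid double counting
        }
        long sectorSize = sectorSize(dev);                    // Step 2
        readBytes    += Long.parseLong(f[5]) * sectorSize;    // Step 1 + Step 3
        writtenBytes += Long.parseLong(f[9]) * sectorSize;
      }
    }
    return new long[] { readBytes, writtenBytes };
  }

  private static long sectorSize(String dev) throws IOException {
    String path = "/sys/block/" + dev + "/queue/hw_sector_size";
    return Long.parseLong(new String(Files.readAllBytes(Paths.get(path))).trim());
  }
}
{code}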
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590553#comment-14590553 ] Robert Grandl commented on YARN-3819: - [~grey], YARN-2745 is an effort to schedule multiple resources. The resources taken into account are CPU/Memory/Disk/Network. For fungible resources such as disk and network, the counters required are the total number of bytes read/written from/to disk/network. This JIRA extends the ResourceCalculatorPlugin, which is able to extract the amount of available CPU and memory on a node. YARN-1012 already uses this information and aggregates it in the heartbeat from the NM to the RM. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
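For the network counterpart, a sketch assuming the per-interface byte counters come from /proc/net/dev, which is a common source on the kernels listed for YARN-3820; the actual patch may read different files or filter interfaces differently:

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class NetworkBytesSketch {
  /** Returns {bytesReceived, bytesSent} aggregated over non-loopback interfaces. */
  public static long[] receivedAndSentBytes() throws IOException {
    long rx = 0, tx = 0;
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/net/dev"))) {
      String line;
      while ((line = r.readLine()) != null) {
        int colon = line.indexOf(':');
        if (colon < 0) {
          continue;                              // skip the two header lines
        }
        String iface = line.substring(0, colon).trim();
        if (iface.equals("lo")) {
          continue;                              // ignore loopback traffic
        }
        String[] f = line.substring(colon + 1).trim().split("\\s+");
        rx += Long.parseLong(f[0]);              // receive bytes (first column)
        tx += Long.parseLong(f[8]);              // transmit bytes (ninth column)
      }
    }
    return new long[] { rx, tx };
  }
}
{code}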
[jira] [Updated] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Attachment: YARN-3798-branch-2.7.002.patch Attaching v2 patch to handle SESSIONMOVED correctly. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590574#comment-14590574 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~varun_saxena] thank you for your review. It helps me a lot. {quote} Moreover, there can be a case when a particular zookeeper server is forever down. In this case also, we will keep on getting ConnectionLoss IIUC till retries exhaust. {quote} It looks like retries would be exhausted, but they aren't: reconnection to other ZooKeeper servers is done in ClientCnxn#startConnect on the main thread of ZooKeeper's client. Please note that a session is not the same as a connection in ZooKeeper. What we can do is retry with the current ZooKeeper client. I also noticed that we shouldn't create a new session when SESSIONMOVED occurs. Updating the patch soon. {quote} So to handle these cases, I think we should retry with new connection atleast once. Thoughts ? {quote} I think we shouldn't create a new ZooKeeper session unless SESSIONEXPIRED occurs; from http://wiki.apache.org/hadoop/ZooKeeper/FAQ : {quote} Only create a new session when you are notified of session expiration (mandatory) {quote} RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. 
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at
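To illustrate the retry policy discussed in the comment above (retry with the same ZooKeeper handle on recoverable errors, and only give up the session on SESSIONEXPIRED), here is a minimal standalone sketch. The class name, retry limit, and interval below are illustrative assumptions, not the actual ZKRMStateStore code.
{code}
// Illustrative sketch only: retry a ZK operation with the SAME ZooKeeper
// handle on recoverable errors, and only surrender the session when
// SESSIONEXPIRED is reported.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.KeeperException.Code;

abstract class ZkRetrySketch<T> {

  private static final int MAX_RETRIES = 10;          // hypothetical limit
  private static final long RETRY_INTERVAL_MS = 1000; // hypothetical interval

  // The ZK operation to retry, e.g. a multi() or create().
  abstract T run() throws KeeperException, InterruptedException;

  T runWithRetries() throws Exception {
    for (int retry = 0; ; retry++) {
      try {
        return run();
      } catch (KeeperException ke) {
        Code code = ke.code();
        if (code == Code.SESSIONEXPIRED) {
          // Only here would a brand-new ZooKeeper session be justified,
          // per the ZooKeeper FAQ quoted above.
          throw ke;
        }
        boolean recoverable = code == Code.CONNECTIONLOSS
            || code == Code.OPERATIONTIMEOUT
            || code == Code.SESSIONMOVED;
        if (!recoverable || retry >= MAX_RETRIES) {
          throw ke;
        }
        // The client library reconnects to another quorum member on its own
        // (ClientCnxn), so we simply wait and retry with the same handle.
        Thread.sleep(RETRY_INTERVAL_MS);
      }
    }
  }
}
{code}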
[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590571#comment-14590571 ] Varun Saxena commented on YARN-3528: 9. In TestNodeManagerShutdown#startContainer, if an exception is thrown (i.e. no free port is available), the code simply continues on to make a call to rpc.getProxy() with a null containerManagerBindAddress. We can probably throw an exception so that the test fails at the correct location. Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practice has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
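For reference, one common way to replace a hard-coded port such as 12345 in tests is to let the OS pick a currently free ephemeral port. A minimal sketch follows; the helper class name is made up and not taken from the attached patch. As noted in the comment above, the caller should fail the test explicitly if no port can be obtained instead of continuing with a null bind address.
{code}
// Minimal sketch of dynamic port allocation for tests; the helper name is
// made up and not taken from the attached patch.
import java.io.IOException;
import java.net.ServerSocket;

public final class FreePortUtil {
  private FreePortUtil() {
  }

  /** Ask the OS for a currently free TCP port instead of hard-coding 12345. */
  public static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }
}
{code}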
[jira] [Updated] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3820: Attachment: YARN-3820-3.patch Removed whitespace; same build crash as with YARN-3819-3.patch. Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3820-1.patch, YARN-3820-2.patch, YARN-3820-3.patch In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591288#comment-14591288 ] Ray Chiang commented on YARN-3069: -- Wrote up the code for automatic default checking at HADOOP-12101. Ran the automatic checking with the following results: - yarn-default.xml has 15 properties that do not match the default Config value -- Filed one bug at YARN-3823 -- Remaining 14 are due to variable references like ${yarn.resourcemanager.hostname} or a documented -1 value like yarn.nodemanager.resource.memory-mb. - Configuration(s) have 67 properties with no corresponding default member variable. These will need to be verified manually. -- Will document as a separate comment. - yarn-default.xml has 6 properties with empty values -- Nothing to compare - yarn-default.xml has 135 properties which match a corresponding Config variable -- No need to compare Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable 
yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
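As a rough illustration of what such an automated cross-check can look like (this is a sketch, not the HADOOP-12101 tool), one can load yarn-default.xml into a Configuration and compare each documented value against the matching DEFAULT_* constant found by reflection, following the FOO / DEFAULT_FOO naming convention mentioned in these comments:
{code}
// Sketch only (not the HADOOP-12101 tool): compare yarn-default.xml entries
// against YarnConfiguration DEFAULT_* constants found via reflection.
// Non-key String constants (e.g. prefixes) will show up as noise here.
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DefaultValueCheckSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(false);
    conf.addResource("yarn-default.xml");

    Class<?> c = YarnConfiguration.class;
    for (Field keyField : c.getDeclaredFields()) {
      // Only consider public static String constants holding property keys.
      if (!Modifier.isStatic(keyField.getModifiers())
          || !Modifier.isPublic(keyField.getModifiers())
          || keyField.getType() != String.class
          || keyField.getName().startsWith("DEFAULT_")) {
        continue;
      }
      String key = (String) keyField.get(null);
      String documented = conf.get(key);
      Field defField;
      try {
        // Convention under test: property key FOO pairs with member DEFAULT_FOO.
        defField = c.getDeclaredField("DEFAULT_" + keyField.getName());
      } catch (NoSuchFieldException e) {
        System.out.println("No default member variable for " + key);
        continue;
      }
      Object codeDefault = defField.get(null);
      if (documented == null) {
        System.out.println("Not documented in yarn-default.xml: " + key);
      } else if (codeDefault != null
          && !documented.equals(String.valueOf(codeDefault))) {
        System.out.println("Mismatch for " + key + ": xml=" + documented
            + " code=" + codeDefault);
      }
    }
  }
}
{code}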
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591294#comment-14591294 ] Ray Chiang commented on YARN-3069: -- Most of the manual verification were in the following categories: - Hardcoded value - Not using DEFAULT_FOO for FOO member variable naming convention - No default value at all - Variable is used indirectly Manual verification specifics: CLIENT_FAILOVER_MAX_ATTEMPTS - Hardcoded default to -1 in RMProxy CLIENT_FAILOVER_SLEEPTIME_BASE_MS CLIENT_FAILOVER_SLEEPTIME_MAX_MS - Defaults to RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS or DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS DEBUG_NM_DELETE_DELAY_SEC - Hardcoded default to 0 in DeletionService FS_NODE_LABELS_STORE_ROOT_DIR - Defaults to FileSystemNodeLabelsStore#getDefaultFSNodeLabelsRootDir() return value FS_RM_STATE_STORE_URI - No default value anywhere IS_MINI_YARN_CLUSTER - Hardcoded to false in Client, MRApps, ResourceManager NM_AUX_SERVICES - No default value anywhere. Maybe whatever Configuration#getStringCollection() returns. NM_BIND_HOST - No default value anywhere NM_CONTAINER_EXECUTOR - Hardcoded to DefaultContainerExecutor.class in NodeManager NM_CONTAINER_LOCALIZER_JAVA_OPTS_KEY - Defaults to YarnConfiguration.NM_CONTAINER_LOCALIZER_JAVA_OPTS_DEFAULT in ContainerLocalizer NM_CONTAINER_MON_PROCESS_TREE NM_CONTAINER_MON_RESOURCE_CALCULATOR - Hardcoded to null in ContainersMonitorImpl NM_DISK_HEALTH_CHECK_ENABLE - Hardcoded to true in LocalDirsHanderService NM_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME - Defaults to unconventional name YarnConfiguration.NM_DEFAULT_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME - No default value anywhere NM_HEALTH_CHECK_SCRIPT_OPTS - Defaults to empty String array in NodeManager NM_HEALTH_CHECK_SCRIPT_PATH - No default value anywhere NM_KEYTAB - Defaults to YarnConfiguration.NM_PRINCIPAL NM_LINUX_CONTAINER_CGROUPS_HIERARCHY - Hardcoded to /hadoop-yarn in CGroupsHandlerImpl and CgroupsLCEResourcesHandler NM_LINUX_CONTAINER_CGROUPS_MOUNT - Hardcoded to false in CGroupsHandlerImpl and CgroupsLCEResourcesHandler NM_LINUX_CONTAINER_CGROUPS_MOUNT_PATH - Hardcoded to null in CGroupsHandlerImpl and CgroupsLCEResourcesHandler NM_LINUX_CONTAINER_EXECUTOR_PATH - Defaults to internal variable defaultPath (which looks to be based off HADOOP_YARN_HOME environment) NM_LINUX_CONTAINER_GROUP - Not used anywhere NM_LINUX_CONTAINER_RESOURCES_HANDLER - Hardcoded to DefaultLCEResourcesHandler.class in LinuxContainerExecutor NM_LOG_DELETION_THREADS_COUNT - Defaults to unconventional name YarnConfiguration.DEFAULT_NM_LOG_DELETE_THREAD_COUNT NM_NONSECURE_MODE_LOCAL_USER_KEY - Defaults to unconventional name YarnConfiguration.DEFAULT_NM_NONSECURE_MODE_LOCAL_USER NM_NONSECURE_MODE_USER_PATTERN_KEY - Defaults to unconventional name YarnConfiguration.DEFAULT_NM_NONSECURE_MODE_USER_PATTERN NM_PRINCIPAL - Is the default value for YarnConfiguration.NM_KEYTAB NM_RECOVERY_DIR - No default value anywhere NM_SYSTEM_RESERVED_PMEM_MB - Hardcoded to -1 in NodeManagerHardwareUtils NM_WEBAPP_SPNEGO_KEYTAB_FILE_KEY NM_WEBAPP_SPNEGO_USER_NAME_KEY - No default value anywhere NM_WINDOWS_SECURE_CONTAINER_GROUP - No default value anywhere PROXY_KEYTAB PROXY_PRINCIPAL - No default value anywhere RECOVERY_ENABLED - Defaults to YarnConfiguration.DEFAULT_NM_NONSECURE_MODE_USER_PATTERN in ResourceManager RM_BIND_HOST - No default value anywhere RM_CLUSTER_ID - No default value anywhere RM_DELEGATION_KEY_UPDATE_INTERVAL_KEY - 
Defaults to YarnConfiguration.RM_DELEGATION_KEY_UPDATE_INTERVAL_DEFAULT in RMSecretManagerService RM_DELEGATION_TOKEN_MAX_LIFETIME_KEY - Defaults to YarnConfiguration.RM_DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT in RMSecretManagerService RM_DELEGATION_TOKEN_RENEW_INTERVAL_KEY - Defaults to YarnConfiguration.RM_DELEGATION_TOKEN_RENEW_INTERVAL_DEFAULT in RMSecretManagerService RM_HA_ID - Defaults to values from RM_HA_IDS RM_HA_IDS - No default value, but gets validation in HAUtil#verifyAndSetRMHAIdsList() RM_HOSTNAME - Defaults to internal variable RMId in HAUtils RM_KEYTAB - Defaults to YarnConfiguration.RM_PRINCIPAL RM_LEVELDB_STORE_PATH - No default value anywhere RM_PRINCIPAL - Default value for RM_KEYTAB RM_PROXY_USER_PRIVILEGES_ENABLED - Defaults to YarnConfiguration.DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED. Needs final keyword added. RM_RESERVATION_SYSTEM_CLASS - Defaults to AbstractReservationSystem#getDefaultReservationSystem(scheduler) RM_RESERVATION_SYSTEM_PLAN_FOLLOWER - Defaults to AbstractReservationSystem.getDefaultPlanFollower() RM_SCHEDULER_INCLUDE_PORT_IN_NODE_NAME - Unconventional default YarnConfiguration.DEFAULT_RM_SCHEDULER_USE_PORT_FOR_NODE_NAME RM_SCHEDULER_MONITOR_POLICIES - Defaults to an SchedulingEditPolicy.class as an Interface RM_STORE - Hardcoded to MemoryRMStateStore.class in
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589690#comment-14589690 ] Varun Saxena commented on YARN-3148: Thanks [~devaraj.k] for the commit Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
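A minimal sketch of the idea behind this change (illustrative names, not the committed WebAppProxyServlet code): keep a whitelist of pass-through headers that also includes the CORS request headers, and copy only those from the incoming request onto the proxied request.
{code}
// Illustrative sketch, not the committed WebAppProxyServlet code: copy only
// whitelisted headers (including the CORS request headers named in this
// JIRA) from the incoming request to the proxied request.
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.servlet.http.HttpServletRequest;

final class PassThroughHeadersSketch {
  private static final Set<String> PASS_THROUGH_HEADERS = new HashSet<>(
      Arrays.asList("User-Agent", "Accept", "Accept-Encoding",
          "Accept-Language", "Accept-Charset",
          // CORS-related headers requested in this JIRA:
          "Origin", "Access-Control-Request-Method",
          "Access-Control-Request-Headers"));

  static Map<String, String> selectHeaders(HttpServletRequest req) {
    Map<String, String> selected = new HashMap<>();
    for (String name : PASS_THROUGH_HEADERS) {
      String value = req.getHeader(name); // header lookup is case-insensitive
      if (value != null) {
        selected.put(name, value);
      }
    }
    return selected;
  }
}
{code}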
[jira] [Updated] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joep Rottinghuis updated YARN-3706: --- Attachment: YARN-3706-YARN-2928.015.patch YARN-3706-YARN-2928.015.patch local runs in pseudo distributed mode work moved Entity* classes to o.a.h.y.timelineservice.storage.entity Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, YARN-3706-YARN-2928.014.patch, YARN-3706-YARN-2928.015.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589686#comment-14589686 ] Varun Saxena commented on YARN-3804: Test failure unrelated. YARN-3790 already filed for it Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each attempt to transition the RM to Active, refreshAdminAcls is called, and the configured ACL does not grant permission to the user. This causes an infinite retry of the transition to Active, with {{ActiveStandbyElector#becomeActive()}} always returning false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user with which it retries refreshAdminAcls can never change at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
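For context, the check that fails here boils down to testing the calling user against the configured yarn.admin.acl. A simplified sketch of that kind of check (not the actual RMServerUtils code) using Hadoop's AccessControlList:
{code}
// Simplified sketch of an admin-ACL check (not the actual RMServerUtils code).
// With yarn.admin.acl=dsperf and the RM principal mapping to user "yarn",
// isUserAllowed() returns false, which is what drives the failure above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

final class AdminAclCheckSketch {
  static boolean isAdmin(Configuration conf, UserGroupInformation caller) {
    AccessControlList adminAcl = new AccessControlList(conf.get(
        YarnConfiguration.YARN_ADMIN_ACL,
        YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
    return adminAcl.isUserAllowed(caller);
  }
}
{code}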
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589694#comment-14589694 ] Hudson commented on YARN-3148: -- FAILURE: Integrated in Hadoop-trunk-Commit #8033 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8033/]) YARN-3148. Allow CORS related headers to passthrough in (devaraj: rev ebb9a82519c622bb898e1eec5798c2298c726694) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589695#comment-14589695 ] Hudson commented on YARN-3617: -- FAILURE: Integrated in Hadoop-trunk-Commit #8033 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8033/]) YARN-3617. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning (devaraj: rev 318d2cde7cb5c05a5f87c4ee967446bb60d28ae4) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Fix For: 2.8.0 Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
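The fix itself is small; as the report asks, the getter should return the already-populated field instead of the hard-coded constant, roughly:
{code}
// Sketch of the change requested in the description: return the refreshed
// field instead of the hard-coded -1.
/** {@inheritDoc} */
@Override
public long getCpuFrequency() {
  refreshIfNeeded();
  return cpuFrequencyKhz;
}
{code}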
[jira] [Updated] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle
[ https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3047: --- Target Version/s: YARN-2928 [Data Serving] Set up ATS reader with basic request serving structure and lifecycle --- Key: YARN-3047 URL: https://issues.apache.org/jira/browse/YARN-3047 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Labels: BB2015-05-TBR Attachments: Timeline_Reader(draft).pdf, YARN-3047.001.patch, YARN-3047.003.patch, YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, YARN-3047.02.patch, YARN-3047.04.patch Per design in YARN-2938, set up the ATS reader as a service and implement the basic structure as a service. It includes lifecycle management, request serving, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589354#comment-14589354 ] Joep Rottinghuis commented on YARN-3706: Fixed code, now have successful run: {noformat} 15/06/16 22:57:15 INFO mapreduce.Job: Counters: 23 File System Counters FILE: Number of bytes read=1651635 FILE: Number of bytes written=1927484 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=0 HDFS: Number of bytes written=0 HDFS: Number of read operations=0 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Map-Reduce Framework Map input records=1 Map output records=0 Input split bytes=48 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=18 Total committed heap usage (bytes)=325058560 org.apache.hadoop.mapred.TimelineServicePerformanceV2$PerfCounters TIMELINE_SERVICE_WRITE_COUNTER=230 TIMELINE_SERVICE_WRITE_KBS=230 TIMELINE_SERVICE_WRITE_TIME=66 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 TRANSACTION RATE (per mapper): 3484.848484848485 ops/s IO RATE (per mapper): 3484.848484848485 KB/s TRANSACTION RATE (total): 3484.848484848485 ops/s IO RATE (total): 3484.848484848485 KB/s {noformat} and an individual history file: {noformat} 15/06/16 22:58:06 INFO mapreduce.Job: Job job_local1358267884_0001 completed successfully 15/06/16 22:58:06 INFO mapreduce.Job: Counters: 22 File System Counters FILE: Number of bytes read=1651635 FILE: Number of bytes written=1927113 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=141020 HDFS: Number of bytes written=0 HDFS: Number of read operations=3 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Map-Reduce Framework Map input records=1 Map output records=0 Input split bytes=48 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=63 Total committed heap usage (bytes)=460324864 org.apache.hadoop.mapred.TimelineServicePerformanceV2$PerfCounters TIMELINE_SERVICE_WRITE_COUNTER=25 TIMELINE_SERVICE_WRITE_TIME=145 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 TRANSACTION RATE (per mapper): 172.41379310344828 ops/s IO RATE (per mapper): 0.0 KB/s TRANSACTION RATE (total): 172.41379310344828 ops/s IO RATE (total): 0.0 KB/s {noformat} Will make change suggested by [~zjshen], run another test run and upload a new patch. Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, YARN-3706-YARN-2928.014.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. 
In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
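The rates reported in the benchmark output above follow directly from the perf counters: for example, 230 write operations over 66 ms of write time give roughly 3484.85 ops/s. A trivial sketch of that computation (class and method names are illustrative):
{code}
// Trivial sketch of how the reported rates derive from the perf counters
// (names are illustrative): ops divided by write time in seconds.
final class RateSketch {
  static double opsPerSecond(long writeCounter, long writeTimeMs) {
    return writeCounter * 1000.0 / writeTimeMs;
  }

  public static void main(String[] args) {
    // 230 ops over 66 ms -> ~3484.85 ops/s, matching the first run above.
    System.out.println(opsPerSecond(230, 66));
  }
}
{code}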
[jira] [Updated] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3617: Summary: Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 (was: Fix unused variable to get CPU frequency on Windows systems) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3617: Component/s: (was: yarn) Flags: (was: Patch) Hadoop Flags: Reviewed +1, will commit it shortly. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)