[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585709#comment-14585709 ] Bibin A Chundatt commented on YARN-3789:

[~devaraj.k] Thanks for the review. It seems you reviewed based on 0003-YARN-3789. In {{activateApplications()}} the check is against the new AM limit that would apply if the application were activated, which is why I updated the logs like:
{code}
LOG.info("Not activating " + applicationId + ". If application activated, usedAMResource "
    + amIfStarted + " will exceed amLimit " + amLimit);
{code}
If 0004-YARN-3789 is still confusing I will surely update the log message. I will also handle the unused imports as part of the next patch after your reply.

Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
--
Key: YARN-3789
URL: https://issues.apache.org/jira/browse/YARN-3789
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 0003-YARN-3789.patch, 0004-YARN-3789.patch

Duplicate logging from the resource manager during the AM limit check for each application:
{code}
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
{code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
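For context, a minimal sketch of where this single log line would fire during activation. This is not the actual LeafQueue code; the names ({{pendingApplications}}, {{amLimit}}, {{queueUsage}}, {{activate}}) are assumptions based on the discussion:
{code}
// Sketch: log once per application whose AM, if started, would exceed the limit.
private synchronized void activateApplications() {
  for (FiCaSchedulerApp application : pendingApplications) {
    Resource amIfStarted =
        Resources.add(application.getAMResource(), queueUsage.getAMUsed());
    if (!Resources.fitsIn(amIfStarted, amLimit)) {
      LOG.info("Not activating " + application.getApplicationId()
          + ". If application activated, usedAMResource " + amIfStarted
          + " will exceed amLimit " + amLimit);
      continue; // skip activation; the message is logged exactly once per app
    }
    activate(application); // hypothetical helper
  }
}
{code}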
[jira] [Updated] (YARN-1042) add ability to specify affinity/anti-affinity in container requests
[ https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-1042:

Attachment: YARN-1042.002.patch

add ability to specify affinity/anti-affinity in container requests
--
Key: YARN-1042
URL: https://issues.apache.org/jira/browse/YARN-1042
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Affects Versions: 3.0.0
Reporter: Steve Loughran
Assignee: Arun C Murthy
Attachments: YARN-1042-demo.patch, YARN-1042-design-doc.pdf, YARN-1042.001.patch, YARN-1042.002.patch

Container requests to the AM should be able to request anti-affinity to ensure that things like Region Servers don't come up on the same failure zones. Similarly, you may want to specify affinity to the same host or rack without specifying which specific host/rack. Example: bringing up a small Giraph cluster in a large YARN cluster would benefit from having the processes in the same rack purely for bandwidth reasons.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585660#comment-14585660 ] Bibin A Chundatt commented on YARN-3804:

Can we check for AccessControlException in {{ActiveStandbyElector#becomeActive()}} and send an event to shut down?

Both RM are on standBy state when kerberos user not in yarn.admin.acl
--
Key: YARN-3804
URL: https://issues.apache.org/jira/browse/YARN-3804
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt

Steps to reproduce
1. Configure the cluster in secure mode
2. On the RM, configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs

Both RMs will be in standby forever.
{code}
2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS=
2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
	at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
	at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
	at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
	at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
	... 4 more
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls'
	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
	... 5 more
Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls'
	at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
	at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
	... 7 more
{code}

*Analysis*
On each attempt to switch to Active, refreshAdminAcls is called, but the ACL permission is not available for the user. The RM retries the switch to Active infinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false.

*Expected*
RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 : *standby*
./yarn rmadmin -getServiceState rm1 : *standby*
./yarn rmadmin -checkHealth rm1 : *echo $? = 0*
./yarn rmadmin -checkHealth rm2 : *echo $? = 0*
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
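A minimal sketch of the shutdown idea suggested in the comment above, under stated assumptions: {{hasCause}} is a hypothetical helper, and the fatal-event plumbing is modeled loosely on the RM dispatcher rather than copied from the actual Hadoop code:
{code}
// Sketch: treat AccessControlException during transition-to-active as fatal
// instead of retrying forever, since the admin ACL cannot change at runtime.
try {
  rmAdminService.transitionToActive(requestInfo);
} catch (Exception e) {
  if (hasCause(e, AccessControlException.class)) { // hasCause is hypothetical
    // Deliver a fatal event so the RM shuts down rather than staying standby.
    rmContext.getDispatcher().getEventHandler().handle(
        new RMFatalEvent(RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED, e));
    return;
  }
  throw new ServiceFailedException("RM could not transition to Active", e);
}
{code}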
[jira] [Commented] (YARN-3792) Test case failures in TestDistributedShell and some issue fixes related to ATSV2
[ https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585676#comment-14585676 ] Naganarasimha G R commented on YARN-3792:

* The reported test-case failure is not due to this patch; YARN-3790 has already been raised to address it.
* The whitespace warning is not caused by this patch.
* The findbugs alert is incorrect; the report has no issues.
The patch is ready for review!

Test case failures in TestDistributedShell and some issue fixes related to ATSV2
--
Key: YARN-3792
URL: https://issues.apache.org/jira/browse/YARN-3792
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Attachments: YARN-3792-YARN-2928.001.patch

# Encountered [testcase failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] which were happening even without the patch modifications in YARN-3044: TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow, TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow, TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression
# Remove unused {{enableATSV1}} in TestDistributedShell.
# Container metrics need to be published only for the v2 test cases of TestDistributedShell.
# A NullPointerException was thrown in TimelineClientImpl.constructResURI when the aux service was not configured and {{TimelineClient.putObjects}} was invoked.
# Race condition between publishing the application events and the test case's verification of the RM's ApplicationFinished timeline events.
# Application tags are converted to lowercase in ApplicationSubmissionContextPBImpl, hence RMTimelineCollector was not able to detect the custom flow details of the app.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
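A hedged illustration of the NullPointerException fix in item 4. This is hypothetical code, not the actual TimelineClientImpl internals:
{code}
// Hypothetical guard: fail fast with a clear error when the timeline service
// address was never configured, instead of letting URI construction throw NPE.
if (timelineServiceAddress == null) {
  throw new YarnException(
      "Timeline service address is not configured; cannot put objects");
}
URI resURI = URI.create(scheme + timelineServiceAddress + path);
{code}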
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585664#comment-14585664 ] Rohith commented on YARN-3789:

+1 (non-binding)
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
Bibin A Chundatt created YARN-3804:

Summary: Both RM are on standBy state when kerberos user not in yarn.admin.acl
Key: YARN-3804
URL: https://issues.apache.org/jira/browse/YARN-3804
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585735#comment-14585735 ] Devaraj K commented on YARN-3789:

I had looked at 0003-YARN-3789.patch previously; sorry for that. I think the latest patch still has the same issue with the message that I mentioned.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585685#comment-14585685 ] Devaraj K commented on YARN-3789:

Thanks [~bibinchundatt] for the patch.
1.
{code:xml}
+      LOG.info("Not activating " + applicationId
+          + " if application activated usedAMResource " + amIfStarted
+          + " exceeds amLimit " + amLimit);
{code}
{code:xml}
+      LOG.info("Not activating " + applicationId + " for user " + user
+          + " if application activated usedUserAMResource "
+          + userAmIfStarted + " exceeds userAmLimit " + userAMLimit);
{code}
These logs are still confusing (at least to me); can you make them something like this, or anything better:
{code:xml}
Not activating application <applicationId> as amIfStarted: <amIfStarted> exceeds amLimit: <amLimit>.
{code}
{code:xml}
Not activating application <applicationId> for user: <user> as amIfStarted: <amIfStarted> exceeds userAmLimit: <userAMLimit>.
{code}
2. Can you also remove the unused imports in the same file LeafQueue.java as part of this patch?
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
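For concreteness, the suggested wording rendered as log statements; a sketch assuming the variable names used elsewhere in the thread:
{code}
LOG.info("Not activating application " + applicationId + " as amIfStarted: "
    + amIfStarted + " exceeds amLimit: " + amLimit + ".");
LOG.info("Not activating application " + applicationId + " for user: " + user
    + " as amIfStarted: " + userAmIfStarted + " exceeds userAmLimit: "
    + userAMLimit + ".");
{code}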
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585855#comment-14585855 ] Tsuyoshi Ozawa commented on YARN-3798:

Thanks [~zxu] for your explanation. I also traced application_1433764310492_7152 in the log, but the application was not removed from RMStateStore. It means application_1433764310492_7152 and appattempt_1433764310492_7152_* are not visible without removing the znodes. [~bibinchundatt] what ZK version are you using? BTW, I found an improvement point: when the error code is CONNECTIONLOSS or OPERATIONTIMEOUT, ZKRMStateStore closes the current connection and tries to create a new connection before retrying. This shouldn't be done in general; we should just wait for a SyncConnected event until a timeout occurs. Otherwise the current code looks good to me.

RM shutdown with NoNode exception while updating appAttempt on zk
--
Key: YARN-3798
URL: https://issues.apache.org/jira/browse/YARN-3798
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Attachments: RM.log

RM going down with NoNode exception during create of znode for appattempt.
*Please find the exception logs*
{code}
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
	at java.lang.Thread.run(Thread.java:745)
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
{code}
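A minimal sketch of the suggested behavior, assuming direct access to the ZooKeeper handle; this is illustrative, not the actual ZKRMStateStore code:
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// On CONNECTIONLOSS/OPERATIONTIMEOUT, wait for the existing session to emit
// SyncConnected instead of closing it and opening a brand-new connection.
final CountDownLatch reconnected = new CountDownLatch(1);
Watcher watcher = new Watcher() {
  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == Event.KeeperState.SyncConnected) {
      reconnected.countDown();
    }
  }
};
// ... register the watcher on the existing ZooKeeper client, then:
if (!reconnected.await(zkSessionTimeoutMs, TimeUnit.MILLISECONDS)) {
  // Only after the session timeout elapses treat the session as dead
  // and create a new connection.
}
{code}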
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585897#comment-14585897 ] Tsuyoshi Ozawa commented on YARN-3798:

5. [ZKRMStateStore] ZKRMStateStore uses the new connection while the old connection is still alive. An old view can be seen from the new connection since it's another client; in this case, there is no guarantee.
6. When the old session is expired, all updates by the old session can be seen because of virtual synchrony.
[jira] [Updated] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3789:

Attachment: 0005-YARN-3789.patch

[~devaraj.k] I have removed the unused {{import org.apache.hadoop.yarn.server.resourcemanager.RMContext;}} and also updated the log as per your comments.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585891#comment-14585891 ] Tsuyoshi Ozawa commented on YARN-3798:

{quote} while current code looks good to me. {quote}
I found one corner case where the current code doesn't work correctly:
1. [ZKRMStateStore] Receiving CONNECTIONLOSS or OPERATIONTIMEOUT in ZKRMStateStore#runWithRetries.
2. [ZKRMStateStore] Failing to zkClient.close() in ZKRMStateStore#createConnection, but the IOException is ignored.
3. [ZK Server] Failing to accept the close() request. The previous session is still alive.
4. [ZKRMStateStore] Creating a new connection in ZKRMStateStore#createConnection.
In this case, the correct fix is to wait for SESSIONEXPIRED or SESSIONMOVED.
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586174#comment-14586174 ] Bibin A Chundatt commented on YARN-3789:

Looked at the pre-commit build results: the checkstyle issue is pre-existing, and the test failure is due to YARN-3790.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586161#comment-14586161 ] Hadoop QA commented on YARN-3789:

| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 59s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 32s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 1 new checkstyle issues (total was 153, now 151). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 50m 52s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 88m 41s | |
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12739612/0005-YARN-3789.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 4c5da9b |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8251/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8251/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8251/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8251/console |

This message was automatically generated.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586139#comment-14586139 ] MENG DING commented on YARN-1197:

Had a very good discussion with [~leftnoteasy] at the Hadoop Summit. We all agreed that, due to the complexity of the current design, it is worthwhile to revisit the idea of increasing and decreasing container size both through the Resource Manager; that would at least eliminate the need for the token-expiration logic, and also eliminate the need for the AM-NM protocol and APIs. I am currently working on the new design, and will post it for review when it is ready.

Support changing resources of an allocated container
--
Key: YARN-1197
URL: https://issues.apache.org/jira/browse/YARN-1197
Project: Hadoop YARN
Issue Type: Task
Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf

The current YARN resource management logic assumes the resource allocated to a container is fixed during its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container with the expected size. Allowing run-time changes to the resources of an allocated container will give us better control of resource usage on the application side.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3792) Test case failures in TestDistributedShell and some issue fixes related to ATSV2
[ https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586128#comment-14586128 ] Junping Du commented on YARN-3792:

Thanks [~Naganarasimha] for delivering the patch to fix it! Will review your patch soon.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1042) add ability to specify affinity/anti-affinity in container requests
[ https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585494#comment-14585494 ] Weiwei Yang commented on YARN-1042:

Hi Steve. From your comments, I think you want the YARN API to support requesting containers with given rules per request. I guess you want the API to look like this: in AMRMClient.java, ContainerRequest supports the following arguments:
* Resource capability
* String[] nodes
* String[] racks
* Priority priority
* boolean relaxLocality
* String nodeLabelExpression
* *ContainerAllocateRule containerAllocateRule*

The last one (in bold text) is the new argument for you to specify a particular rule (we will discuss rule details later). The problem here is that the RM needs to know which application (or role, in Slider's context) the rule applies to; only then can the RM assign containers obeying that rule when dealing with requests coming from the same application. However, if you only specify the rule per container request, how can the RM know which containers need to be considered when it applies the rule? Let me give an example to explain:
{code}
ContainerRequest containerReq1 = new ContainerRequest(capability1, nodes, racks, priority, affinityRequiredRule);
amClient.addContainerRequest(containerReq1);
AllocateResponse allocResponse = amClient.allocate(0.1f);
{code}
The allocate request the AM sends to the RM only tells the RM that these container requests need to use affinityRequiredRule, but the RM does not know which containers this request is affine with, so the RM cannot apply the rule during allocation. This is the reason why I propose to register the mapping
{code}
application - allocation-rule
{code}
when the client submits the application, and keep it in the RM context, so the RM can apply the rule when a request comes from the AM.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
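To make the proposal concrete, here is a sketch of the registration idea. {{setAllocationRule}} and {{ContainerAllocateRule}} are hypothetical APIs from this proposal, not existing YARN classes:
{code}
// Hypothetical API sketch: register the application -> allocation-rule mapping
// at submission time, so the RM can enforce it for every later AM request.
YarnClientApplication app = yarnClient.createApplication();
ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
appContext.setAllocationRule(ContainerAllocateRule.antiAffinity()); // proposed API
yarnClient.submitApplication(appContext);
{code}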
[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587389#comment-14587389 ] Masatake Iwasaki commented on YARN-3711:

Thanks, [~ozawa]!

Documentation of ResourceManager HA should explain configurations about listen addresses
--
Key: YARN-3711
URL: https://issues.apache.org/jira/browse/YARN-3711
Project: Hadoop YARN
Issue Type: Sub-task
Components: documentation
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor
Fix For: 2.7.1
Attachments: YARN-3711.002.patch, YARN-3711.003.patch

There should be an explanation about the webapp address in addition to the RPC address. The AM proxy filter needs an explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses in RM-HA mode.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
Wei Shao created YARN-3806:

Summary: Proposal of Generic Scheduling Framework for YARN
Key: YARN-3806
URL: https://issues.apache.org/jira/browse/YARN-3806
Project: Hadoop YARN
Issue Type: Improvement
Components: scheduler
Reporter: Wei Shao

Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since an application can get a guaranteed resource share, and fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, the current YARN scheduling framework doesn't have a mechanism for multiple scheduling policies to work hierarchically in one cluster. YARN-3306 talked about many issues of today's YARN scheduling framework and proposed a per-queue policy-driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper-level queues has not been seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies for any queue consistently. The proposal tries to solve many other issues in the current YARN scheduling framework as well.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3808) Proposal of Time Based Fair Scheduling for YARN
[ https://issues.apache.org/jira/browse/YARN-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3808:

Attachment: ProposalOfTimeBasedFairSchedulingForYARN-V1.0.pdf

Proposal of Time Based Fair Scheduling for YARN
--
Key: YARN-3808
URL: https://issues.apache.org/jira/browse/YARN-3808
Project: Hadoop YARN
Issue Type: Improvement
Components: fairscheduler, scheduler
Reporter: Wei Shao
Attachments: ProposalOfTimeBasedFairSchedulingForYARN-V1.0.pdf

This proposal discusses the issues with the YARN fair scheduling policy and tries to solve them through YARN-3806 and a new scheduling policy called time-based fair scheduling. The time-based fair scheduling policy is proposed to enforce time-based fairness among users. For example, if two users share the cluster weekly, each user's fair share is half of the cluster per week. In a particular week, if the first user has used the whole cluster for the first half of the week, then in the second half of the week the second user will always have priority to use cluster resources, since the first user has already used up its fair share of the cluster.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
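As a hedged illustration of the arithmetic (hypothetical code, not taken from the proposal): with a one-week window and two users, priority goes to whichever user has consumed less of its time-based fair share:
{code}
// Sketch: time-based fairness over a weekly window (all names hypothetical).
long windowSeconds = 7L * 24 * 60 * 60;                       // one week
double fairShare = clusterResourceSeconds(windowSeconds) / 2; // half each
double usedA = usageInWindow("userA"); // resource-seconds consumed so far
double usedB = usageInWindow("userB");
// If userA already consumed its half-cluster-week, userB wins until the
// window rolls over, matching the example in the description.
String prioritized = (fairShare - usedA) >= (fairShare - usedB) ? "userA" : "userB";
{code}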
[jira] [Created] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
Jun Gong created YARN-3809:

Summary: Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
Key: YARN-3809
URL: https://issues.apache.org/jira/browse/YARN-3809
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong

ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. The thread pool filled up, and all threads hung in {{containerMgrProxy.stopContainers(stopRequest)}} because the NM was down; the default RPC timeout is 15 minutes. That means for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
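One possible mitigation, sketched under assumptions (the configuration key and default shown are assumptions here; the fix actually chosen in this JIRA may differ): size the launcher pool from configuration so a burst of CLEANUP events cannot block LAUNCH handling for the full RPC timeout.
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: configurable pool size instead of a hard-coded 10.
int threadCount = conf.getInt(
    "yarn.resourcemanager.amlauncher.thread-count", 50); // assumed key/default
ThreadPoolExecutor launcherPool = new ThreadPoolExecutor(
    threadCount, threadCount, 1, TimeUnit.HOURS,
    new LinkedBlockingQueue<Runnable>());
{code}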
[jira] [Updated] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joep Rottinghuis updated YARN-3706: --- Attachment: YARN-3706-YARN-2928.013.patch Thanks for the review [~sjlee0], all your comments are addressed in the latest patch (YARN-3706-YARN-2928.013.patch). I have not yet tested this on a (semi) real cluster aside from the unit test. I plan to do this in the next couple of days, preferably on the real test cluster we used for previous benchmarking and preferably against the same input set in order to confirm execution times are comparable (or better). Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587233#comment-14587233 ] Sangjin Lee commented on YARN-3801:

+1 from me. Although it's a bit early for JDK 8, we can have this proactively on our feature branch. We can remove the exclusion rules later if and when we decide to move to an HBase version that doesn't bring in conflicting versions. Folks, let me know if you have any objection, or I'll merge it to our feature branch soon.

[JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
--
Key: YARN-3801
URL: https://issues.apache.org/jira/browse/YARN-3801
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Tsuyoshi Ozawa
Assignee: Tsuyoshi Ozawa
Attachments: YARN-3801.001.patch

timelineservice depends on hbase-client and hbase-testing-util, and they depend on jdk.tools:1.7. This causes Hadoop compilation to fail with JDK 8.
{quote}
[WARNING] Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are:
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT
    +-jdk.tools:jdk.tools:1.8
and
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hbase:hbase-client:1.0.1
    +-org.apache.hbase:hbase-annotations:1.0.1
      +-jdk.tools:jdk.tools:1.7
and
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hbase:hbase-testing-util:1.0.1
    +-org.apache.hbase:hbase-annotations:1.0.1
      +-jdk.tools:jdk.tools:1.7
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message: Failed while enforcing releasability. The error(s) are [ the same dependency convergence errors as above ]
{quote}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
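The exclusion itself is a standard Maven pattern; a sketch of what the patch likely adds to the timelineservice pom (coordinates taken from the error output above, exact form of the patch assumed):
{code:xml}
<!-- Exclude jdk.tools pulled in transitively via hbase-annotations so the
     enforcer's dependency-convergence check passes under JDK 8. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <exclusions>
    <exclusion>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}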
[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain about webapp address configuration
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587262#comment-14587262 ] Tsuyoshi Ozawa commented on YARN-3711:

+1, committing this shortly.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
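For reference, the properties the documentation change explains, shown with example values (host names here are placeholders):
{code:xml}
<!-- Explicit per-RM webapp addresses needed by the AM proxy filter in HA mode. -->
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>rm1.example.com:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>rm2.example.com:8088</value>
</property>
{code}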
[jira] [Updated] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Affects Version/s: 2.7.0 RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
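Background for readers of the trace above: a NoNodeException from a ZooKeeper multi() create typically means a parent znode of the path being created is missing. Below is a self-contained sketch of the usual defensive pattern, using only the stock ZooKeeper client API; it is illustrative only and is not the ZKRMStateStore code or the eventual fix:
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public final class ZkPaths {
  private ZkPaths() {}

  /** Create each missing ancestor of 'path' so a later create cannot fail with NoNode. */
  public static void ensureParents(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    int pos = 1;
    while ((pos = path.indexOf('/', pos)) != -1) {
      String ancestor = path.substring(0, pos);
      if (zk.exists(ancestor, false) == null) {
        try {
          zk.create(ancestor, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
              CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignored) {
          // Raced with another creator; that is fine.
        }
      }
      pos++;
    }
  }
}
{code}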
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587275#comment-14587275 ] Bibin A Chundatt commented on YARN-3804: [~vinodkv] {quote}Allow the daemon user to do the refresh irrespective of what admin configures{quote} sounds better to me. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Steps to reproduce 1. Configure the cluster in secure mode 2. On the RM, configure yarn.admin.acl=dsperf 3. Configure yarn.resourcemanager.principal=yarn 4. Start both RMs Both RMs will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, and the ACL permission is not available for the user. The switch to Active is retried infinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. 
*States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587297#comment-14587297 ] Xuan Gong commented on YARN-433: Remove LaunchedTransition from RMContainerImpl and move it to RMNodeImpl, where it will be called when the RMNode catches up with the ContainerStatus. When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch RM expires containers that are not launched within some time of being allocated. The default is 10 mins. When an RM is not keeping up with node updates, it may not be aware of newly launched containers. If the expire thread fires for such containers, the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3709) RM Web UI AM link shown before MRAppMaster launch
[ https://issues.apache.org/jira/browse/YARN-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587391#comment-14587391 ] Weiwei Yang commented on YARN-3709: --- I don't think this is a bug. Once an application is ACCEPTED, you can click the ApplicationMaster link to track its progress. Before it starts to run, it shows something like: YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM. FinalStatus Reported by AM: Application has not completed yet. Started: Mon Jun 15 20:16:16 -0700 2015 Elapsed: 3mins, 21sec That gives you information about the current status of the application and how long it has been waiting. That is useful, isn't it? RM Web UI AM link shown before MRAppMaster launch - Key: YARN-3709 URL: https://issues.apache.org/jira/browse/YARN-3709 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Priority: Minor Attachments: ApplicationMasterLink.png Steps to reproduce === 1. Configure an HA setup with 2 NMs 2. AM allocated memory 1024 MB in CS 3. Submit 5 pi jobs in parallel 4. 2 AMs run in parallel *Expected:* The Tracking URL/AM link should be shown only for running applications *Actual:* The *ApplicationMaster* link is shown for all 5 applications; for applications with no AM assigned, the Tracking URL should be shown as *UNASSIGNED* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587402#comment-14587402 ] Devaraj K commented on YARN-3789: - Thanks [~bibinchundatt] for the updated patch. It looks good to me. [~rohithsharma], do you have any comments on the latest patch? Refactor logs for LeafQueue#activateApplications() to remove duplicate logging -- Key: YARN-3789 URL: https://issues.apache.org/jira/browse/YARN-3789 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch Duplicate logging from resource manager during am limit check for each application {code} 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3807) Proposal of Guaranteed Capacity Scheduling for YARN
[ https://issues.apache.org/jira/browse/YARN-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3807: --- Description: This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by YARN-3806 and a new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling guarantees to applications that they can get resources under a specified capacity cap in a totally predictable manner. The application can meet its SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. was: This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by [Link] and a new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling guarantees to applications that they can get resources under a specified capacity cap in a totally predictable manner. The application can meet its SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. Proposal of Guaranteed Capacity Scheduling for YARN --- Key: YARN-3807 URL: https://issues.apache.org/jira/browse/YARN-3807 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Reporter: Wei Shao This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by YARN-3806 and a new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling guarantees to applications that they can get resources under a specified capacity cap in a totally predictable manner. The application can meet its SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long-running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since applications can get a guaranteed resource share, and fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, the current YARN scheduling framework doesn't have a mechanism for multiple scheduling policies to work hierarchically in one cluster. YARN-3306 talked about many issues of today's YARN scheduling framework, and proposed a per-queue policy-driven framework. In detail, it supported different scheduling policies for leaf queues. However, support for different scheduling policies in upper-level queues has not been seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies for any queue consistently. The proposal tries to solve many other issues in the current YARN scheduling framework as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
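As a concrete, if speculative, illustration of "different policies for any queue": a tiny hypothetical per-queue policy hook in Java. None of these types come from the attached proposal; they are assumptions made up for this sketch.
{code}
// Illustrative only: a hypothetical per-queue policy hook, not the proposal's API.
import java.util.Comparator;

interface SchedulableQueue {
  /** Current usage divided by the queue's fair share; lower means more starved. */
  double usageOverFairShare();
}

interface QueueOrderingPolicy<Q extends SchedulableQueue> {
  /** Order the children of a parent queue for the next allocation attempt. */
  Comparator<Q> childComparator();
}

final class FairOrdering<Q extends SchedulableQueue> implements QueueOrderingPolicy<Q> {
  @Override
  public Comparator<Q> childComparator() {
    // Least usage relative to fair share goes first.
    return Comparator.comparingDouble(SchedulableQueue::usageOverFairShare);
  }
}
{code}
The point of such a hook is that each parent queue, not just each leaf, could carry its own ordering policy, which is the hierarchical capability the proposal argues is missing.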
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587465#comment-14587465 ] Tsuyoshi Ozawa commented on YARN-3798: -- Thanks! RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Updated] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Priority: Blocker (was: Major) RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Updated] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Target Version/s: 2.7.1 RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN
[ https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587370#comment-14587370 ] Chun Chen commented on YARN-1983: - [~vinodkv], according to your suggestion, I propose the following change: 1. Allow NM_CE to specify a comma-separated list of CE classes. 2. Allow the user to specify an env named NM_CLIENT_CE in the CLC. If the value of NM_CLIENT_CE is one of the CE classes configured previously, choose that one to execute the container; otherwise throw an exception. 3. If the user specifies only one CE class in NM_CE, ignore NM_CLIENT_CE in the env of the CLC and always use that one to execute containers. 4. If the user specifies multiple classes in NM_CE, they have to configure a default CE named NM_DEFAULT_CE in yarn-site.xml in case the env NM_CLIENT_CE is not specified when submitting containers. NM_CE=yarn.nodemanager.container-executor.class NM_CLIENT_CE=yarn.nodemanager.client.container-executor.class NM_DEFAULT_CE=yarn.nodemanager.default.container-executor.class Support heterogeneous container types at runtime on YARN Key: YARN-1983 URL: https://issues.apache.org/jira/browse/YARN-1983 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Attachments: YARN-1983.2.patch, YARN-1983.patch Different container types (default, LXC, docker, VM box, etc.) have different semantics on isolation of security, namespace/env, performance, etc. Per discussions in YARN-1964, we have some good thoughts on supporting different types of containers running on YARN, specified by the application at runtime, which largely enhances YARN's flexibility to meet heterogeneous apps' requirements on isolation at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
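Chun Chen's four rules above translate fairly directly into a lookup routine. A hedged Java sketch follows; the three configuration keys are quoted from the comment, while the class, method, and everything else are hypothetical illustration, not the patch:
{code}
// Hypothetical sketch of the proposed selection rules; not the actual patch.
// Assumes NM_CE (yarn.nodemanager.container-executor.class) is set.
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class ContainerExecutorSelector {
  static String selectExecutorClass(Map<String, String> nmConf,
                                    Map<String, String> containerEnv) {
    List<String> configured = Arrays.asList(
        nmConf.getOrDefault("yarn.nodemanager.container-executor.class", "").split(","));
    if (configured.size() == 1) {
      // Rule 3: a single configured CE always wins; the per-container env is ignored.
      return configured.get(0);
    }
    String requested =
        containerEnv.get("yarn.nodemanager.client.container-executor.class");
    if (requested == null) {
      // Rule 4: multiple CEs require an explicit default in yarn-site.xml.
      String def = nmConf.get("yarn.nodemanager.default.container-executor.class");
      if (def == null) {
        throw new IllegalStateException("No default container executor configured");
      }
      return def;
    }
    if (!configured.contains(requested)) {
      // Rule 2: requests outside the configured list are rejected.
      throw new IllegalArgumentException("Unknown container executor: " + requested);
    }
    return requested;
  }
}
{code}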
[jira] [Created] (YARN-3807) Proposal of Guaranteed Capacity Scheduling for YARN
Wei Shao created YARN-3807: -- Summary: Proposal of Guaranteed Capacity Scheduling for YARN Key: YARN-3807 URL: https://issues.apache.org/jira/browse/YARN-3807 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Reporter: Wei Shao This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by [Link] and a new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling guarantees to applications that they can get resources under a specified capacity cap in a totally predictable manner. The application can meet its SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587346#comment-14587346 ] Tsuyoshi Ozawa commented on YARN-3711: -- Committed this to trunk, branch-2, and branch-2.7. Thanks [~iwasakims] for your contribution. Documentation of ResourceManager HA should explain configurations about listen addresses Key: YARN-3711 URL: https://issues.apache.org/jira/browse/YARN-3711 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.7.1 Attachments: YARN-3711.002.patch, YARN-3711.003.patch There should be explanation about webapp address in addition to RPC address. AM proxy filter needs explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses in RM-HA mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587412#comment-14587412 ] Rohith commented on YARN-3789: -- Looks good to me too. Refactor logs for LeafQueue#activateApplications() to remove duplicate logging -- Key: YARN-3789 URL: https://issues.apache.org/jira/browse/YARN-3789 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch Duplicate logging from resource manager during am limit check for each application {code} 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3808) Proposal of Time Based Fair Scheduling for YARN
Wei Shao created YARN-3808: -- Summary: Proposal of Time Based Fair Scheduling for YARN Key: YARN-3808 URL: https://issues.apache.org/jira/browse/YARN-3808 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler, scheduler Reporter: Wei Shao This proposal talks about the issues of the YARN fair scheduling policy, and tries to solve them by YARN-3806 and a new scheduling policy called time-based fair scheduling. The time-based fair scheduling policy is proposed to enforce time-based fairness among users. For example, if two users share the cluster weekly, each user's fair share is half of the cluster per week. In a particular week, if the first user has used the whole cluster for the first half of the week, then in the second half of the week the second user will always have priority to use cluster resources, since the first user has already used up its fair share of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
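To make the weekly example concrete, here is a tiny bookkeeping sketch. It is hypothetical, not the proposal's algorithm: track each user's resource-time consumed in the window and prioritize whoever has consumed the smallest fraction of their share.
{code}
// Hypothetical illustration of time-based fair share accounting.
import java.util.HashMap;
import java.util.Map;

final class TimeBasedFairShare {
  private final Map<String, Double> usedResourceHours = new HashMap<>();
  private final double windowResourceHours; // e.g. clusterSize * 168h for a week

  TimeBasedFairShare(double windowResourceHours) {
    this.windowResourceHours = windowResourceHours;
  }

  /** Record that 'user' consumed 'resourceHours' of the cluster in this window. */
  void charge(String user, double resourceHours) {
    usedResourceHours.merge(user, resourceHours, Double::sum);
  }

  /** Fraction of the user's time-based fair share already consumed this window. */
  double shareConsumed(String user, int numUsers) {
    double fairShare = windowResourceHours / numUsers;
    return usedResourceHours.getOrDefault(user, 0.0) / fairShare;
  }
}
{code}
With two users sharing a 10-node cluster weekly (windowResourceHours = 10 * 168), once user A's shareConsumed(...) reaches 1.0 mid-week, the scheduler would favor user B for the remainder of the window, matching the example above.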
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587241#comment-14587241 ] Bibin A Chundatt commented on YARN-3798: [~ozawa] We are using Hadoop 2.7.0 and ZK 3.5.0. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Updated] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3711: - Summary: Documentation of ResourceManager HA should explain configurations about listen addresses (was: Documentation of ResourceManager HA should explain about webapp address configuration) Documentation of ResourceManager HA should explain configurations about listen addresses Key: YARN-3711 URL: https://issues.apache.org/jira/browse/YARN-3711 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3711.002.patch, YARN-3711.003.patch There should be explanation about webapp address in addition to RPC address. AM proxy filter needs explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses in RM-HA mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587292#comment-14587292 ] Hudson commented on YARN-3711: -- FAILURE: Integrated in Hadoop-trunk-Commit #8023 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8023/]) YARN-3711. Documentation of ResourceManager HA should explain configurations about listen addresses. Contributed by Masatake Iwasaki. (ozawa: rev e8c514373f2d258663497a33ffb3b231d0743b57) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerHA.md * hadoop-yarn-project/CHANGES.txt Documentation of ResourceManager HA should explain configurations about listen addresses Key: YARN-3711 URL: https://issues.apache.org/jira/browse/YARN-3711 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3711.002.patch, YARN-3711.003.patch There should be explanation about webapp address in addition to RPC address. AM proxy filter needs explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses in RM-HA mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3807) Proposal of Guaranteed Capacity Scheduling for YARN
[ https://issues.apache.org/jira/browse/YARN-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3807: --- Attachment: ProposalOfGuaranteedCapacitySchedulingForYARN-V1.0.pdf Proposal of Guaranteed Capacity Scheduling for YARN --- Key: YARN-3807 URL: https://issues.apache.org/jira/browse/YARN-3807 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Reporter: Wei Shao Attachments: ProposalOfGuaranteedCapacitySchedulingForYARN-V1.0.pdf This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by YARN-3806 and the new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling makes guarantee to the applications that they can get resources under specified capacity cap in totally predictable manner. The application can meet SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586249#comment-14586249 ] Sandy Ryza commented on YARN-1197: -- Sorry, I've been quiet here for a while, but I'd be concerned about a design that requires going through the ResourceManager for decreases. If I understand correctly, this would be a considerable hit to performance, which could be prohibitive for frameworks like Spark that might use container resizing for allocating per-task resources. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes the resource allocated to a container is fixed during its lifetime. When users want to change the resource of an allocated container, the only way is to release it and allocate a new container with the expected size. Allowing run-time changes to the resources of an allocated container will give us better control of resource usage on the application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586455#comment-14586455 ] Vinod Kumar Vavilapalli commented on YARN-1197: --- bq. We all agreed that due to the complexity of the current design, it is worthwhile to revisit the idea of increasing and decreasing container size both through Resource Manager +1 for this idea. Letting this go through the NodeManager directly adds too much complexity and difficult-to-understand semantics for application writers. bq. If I understand correctly, this would be a considerable hit to performance [~sandyr], as I understand it, going through the NM is in fact the worse solution w.r.t. allocation throughput. Going through the RM directly is better, as the RM will immediately know that the resource is available for future allocations - the decrease on the NM can happen offline. The control flow I expect is: - the framework/app decides it doesn't need that many resources anymore; by this time, the container should already have given up the physical resources it doesn't need - it informs the RM about the required decrement - the RM informs the NM to resize the container (cgroups etc.) Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes the resource allocated to a container is fixed during its lifetime. When users want to change the resource of an allocated container, the only way is to release it and allocate a new container with the expected size. Allowing run-time changes to the resources of an allocated container will give us better control of resource usage on the application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
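To make the three-step flow above easier to follow, here is a deliberately hypothetical sketch; RMClient and its decreaseContainer method are stand-ins invented for illustration, not YARN protocol records:
{code}
// Purely schematic; these interfaces are hypothetical, not the YARN API.
interface RMClient {
  void decreaseContainer(String containerId, int newMemoryMb);
}

final class AppMaster {
  private final RMClient rm;

  AppMaster(RMClient rm) {
    this.rm = rm;
  }

  void shrink(String containerId, int newMemoryMb) {
    // 1. The app first stops using the memory it is giving back.
    // 2. It then tells the RM, which can re-offer the freed resource immediately.
    // 3. The RM later tells the NM to tighten enforcement (cgroups) offline.
    rm.decreaseContainer(containerId, newMemoryMb);
  }
}
{code}
The design point is the ordering: the freed resource is reported to the RM first, so it is schedulable immediately, while the NM-side enforcement change can lag behind.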
[jira] [Assigned] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-3804: -- Assignee: Varun Saxena Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Steps to reproduce 1. Configure the cluster in secure mode 2. On the RM, configure yarn.admin.acl=dsperf 3. Configure yarn.resourcemanager.principal=yarn 4. Start both RMs Both RMs will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, and the ACL permission is not available for the user. The switch to Active is retried infinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3804: -- Priority: Critical (was: Major) Target Version/s: 2.8.0, 2.7.1 Seems like a critical issue to me. Two options: # Fail correctly and assume that the admin adds the yarn user explicitly if it needs to work. # Allow the daemon user to do the refresh irrespective of what the admin configures I get a feeling (2) is better. Thoughts? /cc [~leftnoteasy], [~jianhe] Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Priority: Critical Steps to reproduce 1. Configure the cluster in secure mode 2. On the RM, configure yarn.admin.acl=dsperf 3. Configure yarn.resourcemanager.principal=yarn 4. Start both RMs Both RMs will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, and the ACL permission is not available for the user. The switch to Active is retried infinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
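A hedged sketch of what option (2) above could look like; AdminAclChecker and its fields are hypothetical names for illustration, not the AdminService code or the eventual patch:
{code}
// Illustrative sketch of option (2); not the actual AdminService code.
import java.util.Set;

final class AdminAclChecker {
  private final String daemonUser;    // the user the RM runs as, e.g. "yarn"
  private final Set<String> adminAcl; // parsed from yarn.admin.acl

  AdminAclChecker(String daemonUser, Set<String> adminAcl) {
    this.daemonUser = daemonUser;
    this.adminAcl = adminAcl;
  }

  boolean isAdmin(String caller) {
    // The daemon user must always be able to run internal refreshes such as
    // refreshAdminAcls during the standby-to-active transition; otherwise a
    // misconfigured ACL leaves both RMs in standby forever, as reported here.
    return caller.equals(daemonUser) || adminAcl.contains(caller);
  }
}
{code}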
[jira] [Commented] (YARN-1012) Report NM aggregated container resource utilization in heartbeat
[ https://issues.apache.org/jira/browse/YARN-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586451#comment-14586451 ] Hadoop QA commented on YARN-1012: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 43s | Pre-patch trunk has 3 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 55s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 31s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 38s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 5m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 59s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 6m 9s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 56m 56s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12739650/YARN-1012-8.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 32ffda1 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8253/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8253/console | This message was automatically generated. 
Report NM aggregated container resource utilization in heartbeat Key: YARN-1012 URL: https://issues.apache.org/jira/browse/YARN-1012 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Arun C Murthy Assignee: Inigo Goiri Attachments: YARN-1012-1.patch, YARN-1012-2.patch, YARN-1012-3.patch, YARN-1012-4.patch, YARN-1012-5.patch, YARN-1012-6.patch, YARN-1012-7.patch, YARN-1012-8.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3802) Two RMNodes for the same NodeId are used in RM sometimes after NM is reconnected.
[ https://issues.apache.org/jira/browse/YARN-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586507#comment-14586507 ] Xuan Gong commented on YARN-3802: - [~zxu] The patch looks good overall. One nit: Could you fix the comment, too? {code} // Only add new node if old state is RUNNING {code} Two RMNodes for the same NodeId are used in RM sometimes after NM is reconnected. - Key: YARN-3802 URL: https://issues.apache.org/jira/browse/YARN-3802 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3802.000.patch Two RMNodes for the same NodeId are used in RM sometimes after NM is reconnected. The Scheduler and the RMContext sometimes use different RMNode references for the same NodeId after an NM is reconnected, which is not correct. The Scheduler and the RMContext should always use the same RMNode reference for the same NodeId. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586363#comment-14586363 ] Joep Rottinghuis commented on YARN-3706: This patch is ready for review. Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586307#comment-14586307 ] Varun Saxena commented on YARN-3798: ZK version is {{3.5}} I think RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586263#comment-14586263 ] Hadoop QA commented on YARN-3714: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 22m 43s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 9m 15s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 13s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 35s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 50s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 43s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 15s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-server-web-proxy. | | | | 56m 32s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12739513/YARN-3714.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 4c5da9b | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8252/artifact/patchprocess/whitespace.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8252/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8252/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-web-proxy test log | https://builds.apache.org/job/PreCommit-YARN-Build/8252/artifact/patchprocess/testrun_hadoop-yarn-server-web-proxy.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8252/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8252/console | This message was automatically generated. AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586360#comment-14586360 ] Varun Saxena commented on YARN-3798: [~ozawa], thanks for your explanation. This specific log scenario (logs attached with the JIRA) looks like a ZooKeeper issue. We unfortunately lost the ZooKeeper logs, otherwise we could have confirmed it, and we have been unable to reproduce it since then :( As you explained, consistent data is guaranteed if a single ZooKeeper object is used. The scenario you explained above, though, is a good catch, and I think we can fix it. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586377#comment-14586377 ] Xuan Gong commented on YARN-3714: - bq. HAUtil.verifyAndSetConfiguration works only on the RM node. AMs running in slave nodes also need to know the RM webapp addresses. Thanks for the explanation. That makes sense. The patch looks good overall. One small nit: in {code} public static List<String> getRMHAWebappAddresses( final YarnConfiguration conf) { {code} we could check whether RM_WEBAPP_ADDRESS has been set with RM-ids. If not, we only need to check whether RM_HOSTNAME has been set with RM-ids, instead of calling {code} HAUtil.verifyAndSetRMHAIdsList(conf); {code} ? AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
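To make the suggestion concrete, a rough sketch of the proposed short-circuit; the control flow is an assumption about the patch, though the configuration keys and {{HAUtil}} helpers used are standard ones:
{code}
// Sketch only: prefer the per-RM webapp address, and fall back to the per-RM
// hostname plus the default webapp port when no explicit address is set.
List<String> addresses = new ArrayList<String>();
for (String rmId : HAUtil.getRMHAIds(conf)) {
  String addr = conf.get(
      HAUtil.addSuffix(YarnConfiguration.RM_WEBAPP_ADDRESS, rmId));
  if (addr == null) {
    String host = conf.get(
        HAUtil.addSuffix(YarnConfiguration.RM_HOSTNAME, rmId));
    if (host != null) {
      addr = host + ":" + YarnConfiguration.DEFAULT_RM_WEBAPP_PORT;
    }
  }
  if (addr != null) {
    addresses.add(addr);
  }
}
{code}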
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586281#comment-14586281 ] Wangda Tan commented on YARN-1197: -- [~sandyr], Thanks for coming back :). I'm not very sure what the performance issue you mentioned is if decreases go to the RM; what's the expected (ideal) delay in your mind for Spark releasing resources? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1012) Report NM aggregated container resource utilization in heartbeat
[ https://issues.apache.org/jira/browse/YARN-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-1012: -- Attachment: YARN-1012-8.patch Report aggregated utilization. Report NM aggregated container resource utilization in heartbeat Key: YARN-1012 URL: https://issues.apache.org/jira/browse/YARN-1012 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Arun C Murthy Assignee: Inigo Goiri Attachments: YARN-1012-1.patch, YARN-1012-2.patch, YARN-1012-3.patch, YARN-1012-4.patch, YARN-1012-5.patch, YARN-1012-6.patch, YARN-1012-7.patch, YARN-1012-8.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586645#comment-14586645 ] Xuan Gong commented on YARN-3804: - I am OK with that. In transitionToActive(), we are re-using all the refresh* code; if we choose option 2, we need to refactor all the refresh* functions. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical
Steps to reproduce
1. Configure cluster in secure mode
2. On RM Configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs
Both RMs will be in Standby forever
{code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code}
*Analysis*
On each attempt by the RM to switch to Active, refreshAcl is called, but the ACL permission is not available for the user. The switch to Active is therefore retried infinitely, with {{ActiveStandbyElector#becomeActive()}} always returning false.
*Expected*
The RM should get a shutdown event after a few retries, or even on the first attempt, since the user it retries refreshAcl as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 *standby*
./yarn rmadmin -getServiceState rm1 *standby*
./yarn rmadmin -checkHealth rm1 *echo $? = 0*
./yarn rmadmin -checkHealth rm2 *echo $? = 0*
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586582#comment-14586582 ] Karthik Kambatla commented on YARN-3803: This seems like a serious issue. Any reason for marking it Minor? Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In a sandbox (single node) environment with LinuxContainerExecutor, when the first application localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586610#comment-14586610 ] Jian He commented on YARN-3804: --- +1 for 2). Not much point in having the RM depend on the admin ACL to do the transition for itself. [~kasha], [~xgong], sounds good? Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical
Steps to reproduce
1. Configure cluster in secure mode
2. On RM Configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs
Both RMs will be in Standby forever
{code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code}
*Analysis*
On each attempt by the RM to switch to Active, refreshAcl is called, but the ACL permission is not available for the user. The switch to Active is therefore retried infinitely, with {{ActiveStandbyElector#becomeActive()}} always returning false.
*Expected*
The RM should get a shutdown event after a few retries, or even on the first attempt, since the user it retries refreshAcl as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 *standby*
./yarn rmadmin -getServiceState rm1 *standby*
./yarn rmadmin -checkHealth rm1 *echo $? = 0*
./yarn rmadmin -checkHealth rm2 *echo $? = 0*
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586676#comment-14586676 ] Wangda Tan commented on YARN-3804: -- There's an inconsistent check in the current code path:
- AdminService.checkAccess uses YarnAuthorizationProvider to do the check; its default implementation, {{ConfiguredYarnAuthorizer}}, uses the configured {{yarn.admin.acl}}.
- ClientRMService.checkAccess uses AdminACLsManager, which uses the configured {{yarn.admin.acl}} + {{daemon_user}}.
I think we should fix the inconsistency issue; 2) will be covered if we make both of them allow {{daemon_user}}.
Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical
Steps to reproduce
1. Configure cluster in secure mode
2. On RM Configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs
Both RMs will be in Standby forever
{code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ...
7 more {code}
*Analysis*
On each attempt by the RM to switch to Active, refreshAcl is called, but the ACL permission is not available for the user. The switch to Active is therefore retried infinitely, with {{ActiveStandbyElector#becomeActive()}} always returning false.
*Expected*
The RM should get a shutdown event after a few retries, or even on the first attempt, since the user it retries refreshAcl as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 *standby*
./yarn rmadmin -getServiceState rm1 *standby*
./yarn rmadmin -checkHealth rm1 *echo $? = 0*
./yarn rmadmin -checkHealth rm2 *echo $? = 0*
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
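For illustration, the daemon-user allowance discussed above could look like the following sketch; it mirrors what {{AdminACLsManager}} effectively does, but it is an assumption, not the committed fix:
{code}
// Sketch (assumption): treat the RM's own login user as an admin in addition
// to the configured yarn.admin.acl, so the RM can refresh its own ACLs.
static boolean isAdmin(Configuration conf, UserGroupInformation caller)
    throws IOException {
  AccessControlList adminAcl = new AccessControlList(
      conf.get(YarnConfiguration.YARN_ADMIN_ACL,
          YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
  adminAcl.addUser(UserGroupInformation.getLoginUser().getShortUserName());
  return adminAcl.isUserAllowed(caller);
}
{code}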
[jira] [Commented] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586675#comment-14586675 ] Yuliya Feldman commented on YARN-3803: -- [~kasha] It happens only if you have a single node (at least in my testing), since the AM's 2nd+ attempt will happen on the same node. Though I was debating whether to make it Major or not; I can change it to Major. I will post a patch with the fix later today. Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In a sandbox (single node) environment with LinuxContainerExecutor, when the first application localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3469) ZKRMStateStore: Avoid setting watches that are not required
[ https://issues.apache.org/jira/browse/YARN-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586586#comment-14586586 ] Karthik Kambatla commented on YARN-3469: 2.8 uses Curator and all watch handling is now implicit. ZKRMStateStore: Avoid setting watches that are not required --- Key: YARN-3469 URL: https://issues.apache.org/jira/browse/YARN-3469 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Priority: Minor Fix For: 2.7.1 Attachments: YARN-3469.01.patch In ZKRMStateStore, most operations (e.g. getDataWithRetries) set watches on znodes. Large numbers of watches will cause problems such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706]. Although there is a workaround of setting jute.maxbuffer to a larger value, we would need to keep adjusting this value as more apps and attempts are stored in ZK. And those watches are useless now; it might be better not to set watches at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
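The gist of the change in plain ZooKeeper API terms (illustrative only; {{zkClient}} and {{path}} are assumed variables, and the actual patch works through the ZKRMStateStore wrapper methods):
{code}
// Passing watch=false registers no watch for this read, so session
// re-establishment does not have to replay a large watch set (ZOOKEEPER-706).
Stat stat = new Stat();
byte[] data = zkClient.getData(path, false /* watch */, stat);
{code}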
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587247#comment-14587247 ] Tsuyoshi Ozawa commented on YARN-3798: -- Thank you for sharing, Bibin. Marking this as a blocker for 2.7.1. BTW, this problem looks to be solved in 2.8 and later, which use Curator. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at
[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-433: --- Attachment: YARN-433.1.patch When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
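For reference, the 10-minute window mentioned in the description comes from the container allocation expiry interval; the key below is believed to be the standard {{YarnConfiguration}} constant (worth double-checking against the release in use), and the value is milliseconds:
{code}
// Default allocation expiry: containers must be launched within 10 minutes
// (600000 ms) of being allocated, or the RM reclaims them.
conf.setLong(YarnConfiguration.RM_CONTAINER_ALLOC_EXPIRY_INTERVAL_MS, 600000L);
{code}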
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587425#comment-14587425 ] Hadoop QA commented on YARN-433: \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 13s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 39s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 47s | The applied patch generated 2 new checkstyle issues (total was 129, now 131). | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 23s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 50m 52s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 89m 11s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMNodeTransitions | | | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12739752/YARN-433.1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / e8c5143 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8256/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8256/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8256/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8256/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8256/console | This message was automatically generated. When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587424#comment-14587424 ] Zhijie Shen commented on YARN-3801: --- +1, we'd better fix Java 8 issues before merging branch YARN-2928 back to trunk. HADOOP-11090 is targeting 2.8. [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util - Key: YARN-3801 URL: https://issues.apache.org/jira/browse/YARN-3801 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Attachments: YARN-3801.001.patch timelineservice depends on hbase-client and hbase-testing-util, and they depend on jdk.tools:1.7. This causes Hadoop to fail to compile with JDK 8. {quote} [WARNING] Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 [WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message: Failed while enforcing releasability the error(s) are [ Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587474#comment-14587474 ] Jun Gong commented on YARN-3809: How about setting a larger thread pool size in ApplicationMasterLauncher, or making the size configurable? Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all its threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
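A minimal sketch of the configurable-size suggestion; the configuration key below is an illustrative assumption, not an existing constant, and {{conf}} is an assumed Configuration:
{code}
// Size the launcher pool from configuration instead of hard-coding 10 threads.
// The key name here is illustrative only.
int poolSize = conf.getInt("yarn.resourcemanager.amlauncher.thread-count", 10);
ThreadPoolExecutor launcherPool = new ThreadPoolExecutor(
    poolSize, poolSize, 1, TimeUnit.HOURS,
    new LinkedBlockingQueue<Runnable>());
{code}
A larger pool only shortens the stall, though: threads can still block for the full RPC timeout on a dead NM, so lowering the retry/timeout policy for CLEANUP calls would be a complementary option.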
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586694#comment-14586694 ] Karthik Kambatla commented on YARN-3804: On board with the suggestions here. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical
Steps to reproduce
1. Configure cluster in secure mode
2. On RM Configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs
Both RMs will be in Standby forever
{code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code}
*Analysis*
On each attempt by the RM to switch to Active, refreshAcl is called, but the ACL permission is not available for the user. The switch to Active is therefore retried infinitely, with {{ActiveStandbyElector#becomeActive()}} always returning false.
*Expected*
The RM should get a shutdown event after a few retries, or even on the first attempt, since the user it retries refreshAcl as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 *standby*
./yarn rmadmin -getServiceState rm1 *standby*
./yarn rmadmin -checkHealth rm1 *echo $? = 0*
./yarn rmadmin -checkHealth rm2 *echo $? = 0*
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586731#comment-14586731 ] Karthik Kambatla commented on YARN-3803: We have this other issue because of which multiple AMs for the same app get assigned to the same node, so this could be a pretty serious issue. Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In a sandbox (single node) environment with LinuxContainerExecutor, when the first application localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586999#comment-14586999 ] MENG DING commented on YARN-1197: - [~sandyr], Yes. The key assumption is that by the time the Application Master requests a resource decrease from the RM for a particular container, that container should have already reduced its resource usage. Therefore, the RM can immediately allocate the resource to others. So to summarize the main idea:
* Both container resource increase and decrease requests go through the RM. This eliminates the race condition where a decrease for a container takes place while an increase for the same container is in progress.
* There is no need for an AM-NM protocol anymore. This greatly simplifies the logic for application writers.
* A resource decrease can happen immediately in the RM, and the actual enforcement/monitoring of the decrease can happen offline, as mentioned by Vinod.
* A resource increase, on the other hand, needs more thought.
** In the current design, the RM gives out an increase token to be used by the AM to initiate the increase on the NM. There is no need for this: the RM can notify the NM of the increase through the RM-NM heartbeat response.
** The RM still needs to wait for an acknowledgement from the NM to confirm that the increase is done before sending out a response to the AM. This will take two heartbeat cycles, but that is not much worse than giving out a token to the AM first and then letting the AM initiate the increase.
** Since the RM needs to wait for an acknowledgement from the NM to confirm the increase, we must handle cases such as timeout, NM restart/recovery, etc. So we probably still need a container increase token, and token expiration logic for this purpose, but the token will be sent to the NM through the RM-NM heartbeat protocol. (I am still working out the details)
Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
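As a toy model of the accounting described above (not YARN code; all names are invented for illustration), a decrease frees capacity immediately while an increase stays pending until the NM acknowledges it over the heartbeat:
{code}
// Toy model only: the asymmetry between decrease and increase handling.
class ContainerResizeAccounting {
  long allocatedMb;        // resource currently charged to containers
  long pendingIncreaseMb;  // granted increases awaiting NM acknowledgement

  void onDecreaseRequest(long deltaMb) {
    // Safe to free right away: the container has already shrunk its usage.
    allocatedMb -= deltaMb;
  }

  void onIncreaseGranted(long deltaMb) {
    pendingIncreaseMb += deltaMb;  // reserved; NM is told via heartbeat
  }

  void onNmAck(long deltaMb) {
    // Confirmed roughly two heartbeat cycles after the grant.
    pendingIncreaseMb -= deltaMb;
    allocatedMb += deltaMb;
  }
}
{code}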
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586687#comment-14586687 ] Sandy Ryza commented on YARN-1197: -- bq. Going through RM directly is better as the RM will immediately know that the resource is available for future allocations Is the idea that the RM would make allocations using the space before receiving acknowledgement from the NodeManager that it has resized the container (adjusted cgroups)? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoiding clone of Resource which takes 85% of computing time of update thread
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586733#comment-14586733 ] Karthik Kambatla commented on YARN-2768: Thanks for the clarification, [~zhiguohong]. Let me take a closer look at the patch and provide review comments. optimize FSAppAttempt.updateDemand by avoiding clone of Resource which takes 85% of computing time of update thread Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of the profiling result. The clone of the Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) of the CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly updating the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
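For illustration, here is one way the suggested change could look: a drop-in rewrite of updateDemand() that accumulates into this.demand in place, so no temporary Resource is cloned per request. This is a sketch written against the Hadoop 2.x Resource API (getMemory()/getVirtualCores()), not code taken from the attached patch.
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  // Add up outstanding resource requests without cloning per request
  synchronized (app) {
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        Resource capability = r.getCapability();
        int n = r.getNumContainers();
        // Equivalent to addTo(demand, multiply(capability, n)), minus the clone
        demand.setMemory(demand.getMemory() + capability.getMemory() * n);
        demand.setVirtualCores(
            demand.getVirtualCores() + capability.getVirtualCores() * n);
      }
    }
  }
}
{code}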
[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586980#comment-14586980 ] Sangjin Lee commented on YARN-3706: --- I took a look at the latest patch, and it looks pretty good overall. I do have a few comments. Also, at a high level, were you able to run some tests using a pseudo-distributed cluster to verify that it still works as before? If not, it'd be great if you could try that out. (BaseTable.java) - l.54: nit: space (RowKey.java/EntityRowKey.java) - I'm not 100% sure of the value of the inheritance model here. The getRowKey() and the getRowKeyPrefix() methods are not common across the supposed subtypes (as the arguments change from table to table). If method contracts are not shared among the subtypes, there is little commonality among them. In other words, you will not be able to use the type {{RowKeyEntityTable}} in code; you'll always have to use {{EntityRowKey}}. Also, it's not like they have to implement common instance methods. Does that warrant the inheritance model, then? Are you considering adding real inherited (instance) methods later? (Separator.java) - l.232: Although this gives you a nice way of combining both methods, I'm thinking it is OK to provide a separate implementation for the array argument. How often can this method be invoked? If it can be invoked often, it may cause Lists to be created unnecessarily. (TimelineEntitySchemaConstants.java) - l.62: nit: spacing Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
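To illustrate the alternative the review hints at: since the getRowKey() signatures differ from table to table, a per-table static factory carries the same information without an inheritance relationship that shares no contract. A minimal hypothetical sketch, where the separator and field list are placeholders and not the patch's actual row key layout:
{code}
import java.nio.charset.StandardCharsets;

// Hypothetical per-table row key factory; no shared supertype is needed
// because no method contract is actually shared across tables.
public final class EntityRowKey {
  private static final String SEP = "!"; // stand-in for Separator.QUALIFIERS

  private EntityRowKey() {}

  public static byte[] getRowKey(String clusterId, String userId,
      String flowId, long flowRunId, String appId, String entityType,
      String entityId) {
    String key = String.join(SEP, clusterId, userId, flowId,
        Long.toString(flowRunId), appId, entityType, entityId);
    return key.getBytes(StandardCharsets.UTF_8);
  }
}
{code}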
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587098#comment-14587098 ] Wangda Tan commented on YARN-1197: -- [~sandyr], I think increasing via AM-NM and via RM-NM are in a very similar range of delay (multiple seconds for now). a. AM-NM needs 3 stages: 1) AM gets the increase token from the RM 2) AM sends the increase token to the NM 3) AM polls the NM about the increase status (because we cannot assume the increase can be done on the NM side very fast) b. RM-NM needs 4 stages: 1) RM sends the increase token back to the NM 2) NM does the increase locally 3) NM reports back to the RM when the increase is done 4) RM reports the completed increase to the AM Solution b has an additional RM-NM heartbeat interval. Benefits of b (some of them also mentioned by Meng): - Simpler for the AM: it only needs to know when the increase is done, and doesn't need to receive a token and submit to/poll the NM. - Creates a consistent way for applications to increase/decrease containers. - Recovery is simpler: the AM only learns about an increase when it's finished, so we only need to handle recovery of 2 components (NM/RM) instead of 3 (NM/RM/AM). Before we have a fast scheduling design/plan (I don't think we can support millisecond-level scheduling for now; too-frequent AM heartbeating will overload the RM), I don't think adding an additional NM-RM heartbeat interval is a big problem. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587127#comment-14587127 ] Sandy Ryza commented on YARN-1197: -- Option (a) can complete in the low hundreds of milliseconds if the cluster is tuned properly, independent of cluster size. 1) Submit the increase request to the RM. Poll the RM 100 milliseconds later, after the continuous scheduling thread has run, to pick up the increase token. 2) Send the increase token to the NM. Why does the AM need to poll the NM about the increase status before taking action? Does the NM need to do anything other than update its tracking of the resources allotted to the container? Also, it's not unlikely that schedulers will be improved to return the increase token on the same heartbeat on which it's requested. So this could all happen in 2 RPCs + a scheduler decision, with no additional wait time. Anything more than this is probably prohibitively expensive for a framework like Spark that wants to submit an increase request before running each task. Would option (b) ever be able to achieve this kind of latency? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman resolved YARN-3803. -- Resolution: Not A Problem I apologize for this one. It is not an issue in the branches I mentioned; we just had duplicates handled incorrectly. Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman In the sandbox (single-node) environment with LinuxContainerExecutor, when the first Application Localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587128#comment-14587128 ] Masatake Iwasaki commented on YARN-3174: NodeManager.md currently covers only the health checker and is not linked from site.xml. I am going to move the contents of NodeManagerRestart.md into NodeManager.md and update the site index. Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587146#comment-14587146 ] Wangda Tan commented on YARN-1197: -- [~sandyr], Thanks for replying. bq. Why does the AM need to poll the NM about increase status before taking action? Does the NM need to do anything other than update its tracking of the resources allotted to the container? Yes, the NM only needs to update its tracking of the resource and the cgroups. We cannot assume this happens immediately, so we cannot complete the container increase in the same RPC. This is the same as startContainer: even though launching a container is fast in most cases, the AM needs to poll the NM after invoking startContainer. bq. Would option (b) ever be able to achieve this kind of latency? Even considering all current/future optimizations, such as continuous scheduling or the scheduler making the decision on the same AM-RM heartbeat, (b) needs one more NM-RM heartbeat interval. I agree with you that it could be hundreds of milliseconds for (a) vs. multiple seconds for (b) when the cluster is idle. But I'm wondering whether we really need to add this complexity to the AM before we have the mature optimizations listed above. Also, if the cluster is busier, we cannot guarantee low delay either. I tend to do (b) now since it's simpler for app developers to use this feature; I'm open to adding an AM-NM channel once the YARN scheduler supports fast scheduling better. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
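For context on what the NM-side update involves, here is a minimal sketch assuming cgroup v1 knob names (memory.limit_in_bytes); it is illustrative and not taken from any YARN patch. The write itself is quick, but the NM processes resize requests asynchronously, which is why the AM (or RM) has to poll or wait for an acknowledgement rather than assume completion within the same RPC.
{code}
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical sketch of the NM-side cgroups update for a memory increase:
// rewrite the container's cgroup limit, then update internal bookkeeping.
public final class CgroupResizeSketch {
  private CgroupResizeSketch() {}

  static void applyMemoryLimit(String containerCgroupPath, long newLimitBytes)
      throws IOException {
    // cgroup v1 memory controller knob; the path layout is an assumption.
    try (FileWriter w = new FileWriter(
        containerCgroupPath + "/memory.limit_in_bytes")) {
      w.write(Long.toString(newLimitBytes));
    }
  }
}
{code}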
[jira] [Updated] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3790: Component/s: fairscheduler TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Assignee: zhihai xu Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3789: --- Attachment: 0004-YARN-3789.patch Refactor logs for LeafQueue#activateApplications() to remove duplicate logging -- Key: YARN-3789 URL: https://issues.apache.org/jira/browse/YARN-3789 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 0003-YARN-3789.patch, 0004-YARN-3789.patch Duplicate logging from resource manager during am limit check for each application {code} 015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585471#comment-14585471 ] zhihai xu commented on YARN-3790: - [~rohithsharma] thanks for the review, yes, I just updated the component to FairScheduler. TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Assignee: zhihai xu Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-3803: - Priority: Major (was: Minor) Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman In the sandbox (single-node) environment with LinuxContainerExecutor, when the first Application Localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587025#comment-14587025 ] Yuliya Feldman commented on YARN-3803: -- Changed to Major Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman In the sandbox (single-node) environment with LinuxContainerExecutor, when the first Application Localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587044#comment-14587044 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~varun_saxena] Thanks for your help. In addition to the ZooKeeper version, could you share the Hadoop version? Is it 2.7.0? If it's 2.7.0, we can mark this issue as a blocker for the 2.7.1 release. {quote} We unfortunately lost the zookeeper logs. {quote} The ZooKeeper log entry for a failed ZooKeeper#close() is emitted only in DEBUG mode, so it's a bit difficult to get. BTW, can I work with you to fix the corner case? I'd appreciate it if you could help me back-port the fix to the branch you're using. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587063#comment-14587063 ] Vinod Kumar Vavilapalli commented on YARN-1197: --- The details look good. Let's make sure we handle RM, AM and NM restarts correctly. Also, let's design the RM-NM protocol to be generic and common enough for regular launch/stop and increase/decrease. Tx again for driving this! Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587072#comment-14587072 ] Sandy Ryza commented on YARN-1197: -- bq. RM still needs to wait for an acknowledgement from NM to confirm that the increase is done before sending out response to AM. This will take two heartbeat cycles, but this is not much worse than giving out a token to AM first, and then letting AM initiating the increase. I would argue that waiting for an NM-RM heartbeat is much worse than waiting for an AM-RM heartbeat. With continuous scheduling, the RM can make decisions in millisecond time, and the AM can regulate its heartbeats according to the application's needs to get fast responses. If an NM-RM heartbeat is involved, the application is at the mercy of the cluster settings, which should be in the multi-second range for large clusters. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki reassigned YARN-3174: -- Assignee: Masatake Iwasaki Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3714: --- Attachment: YARN-3714.004.patch bq. We could check whether RM_WEBAPP_ADDRESS has been set with RM-ids. If not, we only need to check whether RM_HOSTNAME has been set with RM-ids instead of calling Yeah. On rethinking it, using HAUtil#verifyAndSetRMHAIdsList, which updates the conf, is not safe. I attached 004. AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
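For illustration, the per-RM-id lookup described in the quoted suggestion could look like the following. HAUtil#addSuffix and the YarnConfiguration constants are real APIs, but the fallback behavior here (appending the default webapp port to the per-id hostname) is an assumption for the sketch, not necessarily what the 004 patch does.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.HAUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical sketch: prefer an explicit
// yarn.resourcemanager.webapp.address.<rm-id>, and fall back to
// yarn.resourcemanager.hostname.<rm-id> plus the default webapp port.
public final class WebAppAddressSketch {
  private WebAppAddressSketch() {}

  static String webAppAddressFor(Configuration conf, String rmId) {
    String addr = conf.get(
        HAUtil.addSuffix(YarnConfiguration.RM_WEBAPP_ADDRESS, rmId));
    if (addr == null) {
      String host = conf.get(
          HAUtil.addSuffix(YarnConfiguration.RM_HOSTNAME, rmId));
      if (host != null) {
        addr = host + ":" + YarnConfiguration.DEFAULT_RM_WEBAPP_PORT;
      }
    }
    return addr; // may be null if neither key is set for this rm-id
  }
}
{code}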
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587067#comment-14587067 ] Sandy Ryza commented on YARN-1197: -- Is my understanding correct that the broader plan is to move stopping containers out of the AM-NM protocol? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587094#comment-14587094 ] Masatake Iwasaki commented on YARN-3174: There are 4 files referring to the NodeManager under hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown. * NodeManager.md * NodeManagerCgroups.md * NodeManagerRest.md * NodeManagerRestart.md NodeManagerCgroups.md: It is not a doc about the NodeManager as a whole; the file name is just not appropriate. It describes a feature supported only by LinuxContainerExecutor. Even if CGroups is supported by other modules in the future, it might not be specific to the NodeManager. NodeManagerRest.md: This is relatively big, and it is reasonable for it to be an independent page, the same as the other REST API docs. Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3779) Aggregated Logs Deletion doesn't work after refreshing Log Retention Settings in secure cluster
[ https://issues.apache.org/jira/browse/YARN-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587165#comment-14587165 ] Zhijie Shen commented on YARN-3779: --- [~varun_saxena], do you know why the ugi is still the same but Kerberos authentication fails? Aggregated Logs Deletion doesn't work after refreshing Log Retention Settings in secure cluster -- Key: YARN-3779 URL: https://issues.apache.org/jira/browse/YARN-3779 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: mrV2, secure mode Reporter: Zhang Wei Assignee: Varun Saxena Priority: Critical Attachments: YARN-3779.01.patch, YARN-3779.02.patch, log_aggr_deletion_on_refresh_error.log, log_aggr_deletion_on_refresh_fix.log {{GSSException}} is thrown every time log aggregation deletion is attempted after executing bin/mapred hsadmin -refreshLogRetentionSettings in a secure cluster. The problem can be reproduced by the following steps: 1. Start up the historyserver in a secure cluster. 2. Log deletion happens as expected. 3. Execute the {{mapred hsadmin -refreshLogRetentionSettings}} command to refresh the configuration value. 4. All subsequent attempts at log deletion fail with {{GSSException}}. The following exception can be found in the historyserver's log if log deletion is enabled. {noformat} 2015-06-04 14:14:40,070 | ERROR | Timer-3 | Error reading root log dir this deletion attempt is being aborted | AggregatedLogDeletionService.java:127 java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: vm-31/9.91.12.31; destination host is: vm-33:25000; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1414) at org.apache.hadoop.ipc.Client.call(Client.java:1363) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy9.getListing(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:519) at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy10.getListing(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1767) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1750) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:691) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:753) at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:749) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:749) at org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:68) at java.util.TimerThread.mainLoop(Timer.java:555) at 
java.util.TimerThread.run(Timer.java:505) Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:677) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1641) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:640) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:724) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462) at
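The symptom above (deletion works until a refresh, then every attempt fails with a missing-TGT GSSException) is the classic sign of a refreshed deletion task running without re-asserted Kerberos credentials. Purely as an illustrative sketch of the usual fix pattern, and assuming the task runs as the login user from a keytab (the names here are hypothetical, not taken from the attached patches):
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical sketch: refresh the login user's TGT before the deletion
// task touches HDFS again after a settings refresh.
public final class ReloginSketch {
  private ReloginSketch() {}

  static FileSystem remoteFsWithFreshTgt(Path remoteRootLogDir,
      Configuration conf) throws IOException {
    UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
    // Renews the TGT from the keytab if needed; a no-op in non-secure mode.
    loginUser.checkTGTAndReloginFromKeytab();
    return remoteRootLogDir.getFileSystem(conf);
  }
}
{code}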
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587168#comment-14587168 ] Sandy Ryza commented on YARN-1197: -- bq. If you consider all now/future optimizations, such as continous-scheduling / scheduler make decision at same AM-RM heart-beat. (b) needs one more NM-RM heart-beat interval. I agree with you, it could be hundreds of milli-seconds (a) vs. multi-seconds (b). when the cluster is idle. To clarify: with proper tuning, we can currently get low hundreds of milliseconds without adding any new scheduler features. With the new scheduler feature I'm imagining, we'd only be limited by the RPC + scheduler time, so we could get 10s of milliseconds with proper tuning. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587174#comment-14587174 ] Sandy Ryza commented on YARN-1197: -- Regarding complexity in the AM, the NMClient utility so far has been an API that's fairly easy for app developers to interact with. I've used it more than once and had no issues. Would we not be able to handle most of the additional complexity behind it? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587181#comment-14587181 ] Hadoop QA commented on YARN-3714: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 59s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 48s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 13s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 32s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 14s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 2s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-web-proxy. | | | | 42m 54s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12739710/YARN-3714.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 2cb09e9 | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8254/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-web-proxy test log | https://builds.apache.org/job/PreCommit-YARN-Build/8254/artifact/patchprocess/testrun_hadoop-yarn-server-web-proxy.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8254/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8254/console | This message was automatically generated. AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3792) Test case failures in TestDistributedShell and some issue fixes related to ATSV2
[ https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587183#comment-14587183 ] Sangjin Lee commented on YARN-3792: --- Thanks [~Naganarasimha] for identifying the issues and providing a patch! I applied the patch on top of the current YARN-2928 branch, rebuilt, and ran the TestDistributedShell test locally. I still see one test failing: {noformat} --- T E S T S --- Running org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell Tests run: 13, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 581.546 sec FAILURE! - in org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell testDSShellWithoutDomainV2CustomizedFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 29.651 sec FAILURE! java.lang.AssertionError: Application finished event should be published atleast once expected:1 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.verifyStringExistsSpecifiedTimes(TestDistributedShell.java:483) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:431) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:323) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow(TestDistributedShell.java:209) Results : Failed tests: TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow:209-testDSShell:323-checkTimelineV2:431-verifyStringExistsSpecifiedTimes:483 Application finished event should be published atleast once expected:1 but was:0 Tests run: 13, Failures: 1, Errors: 0, Skipped: 0 {noformat} Have you seen this? Could you kindly look into that? I'll also see if this is reproducible on my end. Some quick comments: (TestDistributedShell.java) - l.71-75: Is this comment necessary here? I'm not sure if we want to add a generic comment like this to a specific test... - l.106: Are the checks for null necessary? I thought that the test name was populated by junit and made available to test methods. Do things fail if we do not check for null? - l.376: I don't really like the sleep call as it is not completely deterministic; could there be a way to make this completely deterministic (using things like CountDownLatch, etc.)? (TimelineClientImpl.java) - l.385: nit: the C-style conditional check is not necessary; I would suggest a more natural check of {{(timelineServiceAddress == null)}} (ContainersMonitorImpl.java) - l.96: It is unrelated to this patch itself, but should we rename the variable name threadPool? It is a completely generic name. We should rename it to something like timelineWriterThreadPool or something to that effect. Let me know if you have a suggestion. 
Test case failures in TestDistributedShell and some issue fixes related to ATSV2 Key: YARN-3792 URL: https://issues.apache.org/jira/browse/YARN-3792 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-3792-YARN-2928.001.patch # Encountered [testcase failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] which were happening even without the patch modifications in YARN-3044: TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression # Remove unused {{enableATSV1}} in TestDistributedShell # Container metrics need to be published only for v2 test cases of TestDistributedShell # A NullPointerException was thrown in TimelineClientImpl.constructResURI when the aux service was not configured and {{TimelineClient.putObjects}} was invoked # Race condition between the Application events being published and the test case verification of the RM's ApplicationFinished Timeline Events # Application tags were converted to lowercase in ApplicationSubmissionContextPBImpl, hence RMTimelineCollector was not able to detect the custom flow details of the app -- This message was sent by Atlassian JIRA (v6.3.4#6332)
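On the review's CountDownLatch suggestion, here is a minimal self-contained sketch of the idea; the names are hypothetical, and the real test would wire the countdown into its timeline writer stub rather than a standalone class.
{code}
import static org.junit.Assert.assertTrue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: await the publish event instead of sleeping.
public class PublishLatchSketch {
  final CountDownLatch published = new CountDownLatch(1);

  // The stubbed timeline writer would call this from its write path when the
  // ApplicationFinished event is written.
  void onApplicationFinishedWritten() {
    published.countDown();
  }

  void verifyPublished() throws InterruptedException {
    assertTrue("Application finished event should be published at least once",
        published.await(30, TimeUnit.SECONDS));
  }
}
{code}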
[jira] [Updated] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3714: --- Summary: AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id (was: AM proxy filter can not get proper default proxy address if RM-HA is enabled) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id -- Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3174: --- Attachment: YARN-3174.001.patch Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587196#comment-14587196 ] Wangda Tan commented on YARN-1197: -- bq. To clarify: with proper tuning, we can currently get low hundreds of milliseconds without adding any new scheduler features. With the new scheduler feature I'm imagining, we'd only be limited by the RPC + scheduler time, so we could get 10s of milliseconds with proper tuning. I think this assumes the cluster is quite idle. I understand the low latency could be achieved, but it's not guaranteed since we don't support oversubscription, etc. If you assume the cluster is very idle, one solution might be holding more resources from the beginning instead of increasing. In a real environment, I think the expected delay will still be at the seconds level. From YARN's perspective, (b) handles most of the logic within the YARN daemons (instead of the AM), so we don't need to consider inconsistent state between the RM and AM when doing recovery; that is really why I prefer it :). I'm not against doing (a), but I prefer to do it once we have a solid foundation for fast scheduling. I'm not sure whether any resource management platform in production supports that; some research systems such as Sparrow use a quite different protocol/approach than YARN. I expect there are still some TODO items before YARN gets guaranteed fast scheduling. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)