[ https://issues.apache.org/jira/browse/YARN-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796868#comment-15796868 ]
Gour Saha commented on YARN-6009:
---------------------------------

[~rohithsharma] I understand that, but I am a little worried here. No matter what the issue with the state store of a particular app may be, it should not block the RM from starting. Note, this is not just limited to lifetime property. We can log appropriate messages for the problematic apps (and maybe even update the app diagnostics) and move on with graceful start of RM. The app owners can later work on the individual problematic apps, but at least the cluster will be up and running, ready to serve new apps.

> RM fails to start during an upgrade - Failed to load/recover state (YarnException: Invalid application timeout, value=0 for type=LIFETIME)
> ------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-6009
> URL: https://issues.apache.org/jira/browse/YARN-6009
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Gour Saha
> Assignee: Rohith Sharma K S
> Priority: Critical
>
> ResourceManager fails to start during an upgrade with the following exceptions -
> Exception 1:
> {color:red}
> {code}
> 2016-12-09 14:57:23,508 INFO capacity.CapacityScheduler (CapacityScheduler.java:initScheduler(328)) - Initialized CapacityScheduler with calculator=class org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, minimumAllocation=<<memory:256, vCores:1>>, maximumAllocation=<<memory:101376, vCores:64>>, asynchronousScheduling=false, asyncScheduleInterval=5ms
> 2016-12-09 14:57:23,509 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(863)) - Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:129)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:859)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:463)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:611)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:318)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:127)
>     ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:991)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1032)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1028)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1028)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:313)
>     ... 5 more
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
>     at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateApplicationTimeouts(RMServerUtils.java:305)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:365)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:330)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:463)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1184)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:594)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     ... 13 more
> {code}
> {color}
> Exception 2:
> {color:red}
> {code}
> 2016-12-09 14:57:26,162 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(790)) - application_1477927786494_0008 State change from NEW to FINISHED
> 2016-12-09 14:57:26,162 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(599)) - Failed to load/recover state
> org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
>     at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateApplicationTimeouts(RMServerUtils.java:305)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:365)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:330)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:463)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1184)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:594)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:991)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1032)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1028)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1028)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:313)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:127)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:859)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:463)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:611)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> 2016-12-09 14:57:26,162 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service RMActiveServices failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
> org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
>     at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateApplicationTimeouts(RMServerUtils.java:305)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:365)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:330)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:463)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1184)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:594)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:991)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1032)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1028)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1028)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:313)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:127)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:859)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:463)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:611)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> {code}
> {color}
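
For illustration only, the behavior proposed in the comment (log the problematic app, record a diagnostic, and continue recovering the rest) might look roughly like the sketch below. This is not the RM's actual recovery code and not the patch for YARN-6009: `StoredApp`, `InvalidAppStateException`, `validateLifetime`, and `recoverAll` are hypothetical stand-ins, and the rule that a lifetime must be positive is assumed from the exception message in the logs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tolerate per-application validation failures during recovery.
class TolerantRecoverySketch {

    // Hypothetical stand-in for an application record loaded from the state store.
    static class StoredApp {
        final String appId;
        final long lifetimeSeconds;

        StoredApp(String appId, long lifetimeSeconds) {
            this.appId = appId;
            this.lifetimeSeconds = lifetimeSeconds;
        }
    }

    // Hypothetical stand-in for the exception thrown on invalid stored state.
    static class InvalidAppStateException extends Exception {
        InvalidAppStateException(String message) {
            super(message);
        }
    }

    // Assumed rule, inferred from the log message: a LIFETIME timeout must be positive.
    static void validateLifetime(StoredApp app) throws InvalidAppStateException {
        if (app.lifetimeSeconds <= 0) {
            throw new InvalidAppStateException(
                "Invalid application timeout, value=" + app.lifetimeSeconds + " for type=LIFETIME");
        }
    }

    // Recovers every valid app into 'recovered'. Instead of aborting on the first
    // bad app, it logs the failure and returns a map of appId -> diagnostic message.
    static Map<String, String> recoverAll(List<StoredApp> stored, List<String> recovered) {
        Map<String, String> diagnostics = new LinkedHashMap<>();
        for (StoredApp app : stored) {
            try {
                validateLifetime(app);
                recovered.add(app.appId);
            } catch (InvalidAppStateException e) {
                // Log and move on so that one bad app cannot block startup.
                System.err.println("Skipping recovery of " + app.appId + ": " + e.getMessage());
                diagnostics.put(app.appId, e.getMessage());
            }
        }
        return diagnostics;
    }

    public static void main(String[] args) {
        List<StoredApp> stored = new ArrayList<>();
        stored.add(new StoredApp("app_0001", 3600));
        stored.add(new StoredApp("app_0002", 0)); // the problematic app
        stored.add(new StoredApp("app_0003", 120));

        List<String> recovered = new ArrayList<>();
        Map<String, String> diagnostics = recoverAll(stored, recovered);

        System.out.println("Recovered: " + recovered);
        System.out.println("Diagnostics: " + diagnostics);
    }
}
```

The trade-off in this kind of design is that startup no longer fails fast on corrupt state, so the logged messages and per-app diagnostics become the only signal that something needs attention.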