[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kumar Vavilapalli updated YARN-3641:
------------------------------------------
    Target Version/s: 2.7.1  (was: 2.8.0)

Marking it as critical for 2.7.1 whichever way we go.

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3641
>                 URL: https://issues.apache.org/jira/browse/YARN-3641
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, rolling upgrade
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-3641.patch
>
>
> If the NM's services are not stopped properly, the NM cannot be started again with work-preserving NM restart enabled. The exception is as follows:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
>     at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>     at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>     at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>     at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>     at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
> ************************************************************/
> {noformat}
> The related code in NodeManager.java is:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
>     if (isStopping.getAndSet(true)) {
>       return;
>     }
>     super.serviceStop();
>     stopRecoveryStore();
>     DefaultMetricsSystem.shutdown();
>   }
> {code}
> All of the NM's registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If any of them throws an exception while stopping, stopRecoveryStore() is skipped, which means the LevelDB store is never closed. The next time the NM starts, it fails with the exception above.
> We should put stopRecoveryStore() in a finally block.
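
The finally-block proposal in the description can be sketched as a small standalone program. This is only an illustration under stated assumptions, not the contents of YARN-3641.patch: the class StopOrderSketch and the methods stopSubServices() and shutdownMetrics() are hypothetical stand-ins for NodeManager, super.serviceStop(), and DefaultMetricsSystem.shutdown(), and the "services" here just record the order in which they are called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for NodeManager; not the actual Hadoop class. */
class StopOrderSketch {
  /** Records which stop steps actually ran, in order. */
  final List<String> events = new ArrayList<>();
  private final AtomicBoolean isStopping = new AtomicBoolean(false);
  private final boolean subServiceFails;

  StopOrderSketch(boolean subServiceFails) {
    this.subServiceFails = subServiceFails;
  }

  // Stand-in for super.serviceStop(), which stops the registered sub-services.
  private void stopSubServices() {
    events.add("stopSubServices");
    if (subServiceFails) {
      throw new IllegalStateException("sub-service failed to stop");
    }
  }

  // Stand-in for stopRecoveryStore(), which would close the LevelDB store.
  private void stopRecoveryStore() {
    events.add("stopRecoveryStore");
  }

  // Stand-in for DefaultMetricsSystem.shutdown().
  private void shutdownMetrics() {
    events.add("shutdownMetrics");
  }

  void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    try {
      stopSubServices();
    } finally {
      // Runs even if a sub-service threw, so the store is always closed.
      stopRecoveryStore();
    }
    // Only reached when the sub-services stopped cleanly.
    shutdownMetrics();
  }

  public static void main(String[] args) {
    StopOrderSketch nm = new StopOrderSketch(true);
    try {
      nm.serviceStop();
    } catch (Exception e) {
      System.out.println("stop failed: " + e.getMessage());
    }
    // prints [stopSubServices, stopRecoveryStore]
    System.out.println(nm.events);
  }
}
```

In this sketch the original exception still propagates to the caller; the only change is that the store is closed first. Whether the metrics shutdown should also be guarded is a separate design choice that the description does not settle.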