Junping Du created YARN-3641:
--------------------------------
Summary: stopRecoveryStore() shouldn't be skipped when exceptions
happen in stopping NM's sub-services.
Key: YARN-3641
URL: https://issues.apache.org/jira/browse/YARN-3641
Project: Hadoop YARN
Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
If the NM's services are not stopped properly, the NM cannot be started again
with work-preserving NM restart enabled. The exception is as follows:
{noformat}
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
        at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
        at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
        at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
        at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
        at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        ... 5 more
2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
************************************************************/
{noformat}
The related code in NodeManager.java is as follows:
{code}
  @Override
  protected void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    super.serviceStop();
    stopRecoveryStore();
    DefaultMetricsSystem.shutdown();
  }
{code}
As shown, all registered NM services (NodeStatusUpdater,
LogAggregationService, ResourceLocalizationService, etc.) are stopped first.
If any of these services throws an exception while stopping,
stopRecoveryStore() is skipped, which means the LevelDB store is never closed.
The next time the NM starts, it fails with the exception above.
We should put stopRecoveryStore(); in a finally block.
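A minimal, self-contained sketch of the proposed ordering is given below. It
is not the actual patch: FakeNodeManager, stopSubServices() and the
recoveryStoreClosed flag are hypothetical stand-ins for NodeManager,
super.serviceStop() and the real LevelDB close, and the
DefaultMetricsSystem.shutdown() call is left out for brevity. It only
illustrates that a finally block still runs stopRecoveryStore() when stopping
a sub-service throws.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class StopInFinallySketch {

    /** Hypothetical stand-in for NodeManager; not the real Hadoop class. */
    static class FakeNodeManager {
        private final AtomicBoolean isStopping = new AtomicBoolean(false);
        private final boolean subServiceFails;
        boolean recoveryStoreClosed = false;

        FakeNodeManager(boolean subServiceFails) {
            this.subServiceFails = subServiceFails;
        }

        // Stands in for super.serviceStop(), which stops the registered sub-services.
        private void stopSubServices() throws Exception {
            if (subServiceFails) {
                throw new Exception("simulated failure while stopping a sub-service");
            }
        }

        // Stands in for closing the recovery store (and releasing its LOCK file).
        private void stopRecoveryStore() {
            recoveryStoreClosed = true;
        }

        void serviceStop() throws Exception {
            if (isStopping.getAndSet(true)) {
                return;
            }
            try {
                stopSubServices();
            } finally {
                // Runs whether or not stopSubServices() threw.
                stopRecoveryStore();
            }
        }
    }

    public static void main(String[] args) {
        FakeNodeManager nm = new FakeNodeManager(true);
        try {
            nm.serviceStop();
        } catch (Exception e) {
            System.out.println("stop failed: " + e.getMessage());
        }
        System.out.println("recovery store closed: " + nm.recoveryStoreClosed);
    }
}
```

The exception from the sub-services still propagates to the caller, so the
failure is not hidden; only the cleanup of the store is made unconditional.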
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)