[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3641:
-----------------------------
    Description: 
If NM' services not get stopped properly, we cannot start NM with enabling NM 
restart with work preserving. The exception is as following:
{noformat}
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
/var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
temporarily unavailable
        at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
        at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
        at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
Resource temporarily unavailable
        at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
        at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
        at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
        at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
        at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
        at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        ... 5 more
2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
(LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at 
c6403.ambari.apache.org/192.168.64.103
************************************************************/
{noformat}

The related code is as below in NodeManager.java:
{code}
  @Override
  protected void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    super.serviceStop();
    stopRecoveryStore();
    DefaultMetricsSystem.shutdown();
  }
{code}
We can see we stop all NM registered services (NodeStatusUpdater, 
LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
services get stopped with exception could cause stopRecoveryStore() get skipped 
which means levelDB store is not get closed. So next time NM start, it will get 
failed with exception above. 
We should put stopRecoveryStore(); in a finally block.

  was:
If NM' services not get stopped properly, we cannot start NM with enabling NM 
restart with work preserving. The exception is as following:
{noformat}
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
/var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
temporarily unavailable
        at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
        at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
        at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
Resource temporarily unavailable
        at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
        at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
        at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
        at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
        at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
        at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        ... 5 more
2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
(LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at 
c6403.ambari.apache.org/192.168.64.103
************************************************************/
{noformat}

The related code is as below in NodeManager.java:
{code}
  @Override
  protected void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    super.serviceStop();
    stopRecoveryStore();
    DefaultMetricsSystem.shutdown();
  }
{code}
We can see we stop all NM registered services (NodeStatusUpdater, 
LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
services get stopped with exception could cause stopRecoveryStore() get skipped 
which means levelDB store is not get closed. So next time NM start, it will get 
failed with exception above. 
We should put stopRecoveryStore(); in a final block.


> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3641
>                 URL: https://issues.apache.org/jira/browse/YARN-3641
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, rolling upgrade
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> If NM' services not get stopped properly, we cannot start NM with enabling NM 
> restart with work preserving. The exception is as following:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>       at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>       at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>       at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>       at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> ************************************************************/
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
>     if (isStopping.getAndSet(true)) {
>       return;
>     }
>     super.serviceStop();
>     stopRecoveryStore();
>     DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see we stop all NM registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
> services get stopped with exception could cause stopRecoveryStore() get 
> skipped which means levelDB store is not get closed. So next time NM start, 
> it will get failed with exception above. 
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to