[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

Jason Lowe (JIRA) Thu, 14 May 2015 07:00:18 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543696#comment-14543696
 ]


Jason Lowe commented on YARN-3641:
----------------------------------

bq. I think DefaultMetricsSystem.shutdown(); also should be called in the 
finally block otherwise if custom implementation of MetricsSinkAdapter like 
HADOOP-11932 would hang the JVM.

Arguably there are a lot of different things that can cause the JVM to hang 
during shutdown.  An auxiliary service that doesn't always shutdown cleanly.  
Someone adds a new service that launches non-daemon threads and their 
shutdown/stop isn't called, etc. etc.  IMHO if we really want to prevent 
shutdowns from hanging in general then the NM should explicitly call 
ExitUtil.terminate rather than relying on all the non-daemon threads to 
eventually exit.

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3641
>                 URL: https://issues.apache.org/jira/browse/YARN-3641
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, rolling upgrade
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>             Fix For: 2.7.1
>
>         Attachments: YARN-3641.patch
>
>
> If NM' services not get stopped properly, we cannot start NM with enabling NM 
> restart with work preserving. The exception is as following:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>       at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>       at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>       at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>       at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> ************************************************************/
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
>     if (isStopping.getAndSet(true)) {
>       return;
>     }
>     super.serviceStop();
>     stopRecoveryStore();
>     DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see we stop all NM registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
> services get stopped with exception could cause stopRecoveryStore() get 
> skipped which means levelDB store is not get closed. So next time NM start, 
> it will get failed with exception above. 
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

Reply via email to