[
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542000#comment-14542000
]
Jason Lowe commented on YARN-3641:
----------------------------------
I think the patch approach is OK, but I'm not sure I agree with the problem
analysis. We kill -9 the NM during rolling upgrades, which obviously will not
cleanly shut down the state store, yet we don't hit the IO error lock problem.
The issue is that the old NM process must still be running, which is why
leveldb refuses to open the still-in-use database. In that sense this JIRA
appears to be a duplicate of the same problems described in YARN-3585 and
YARN-3640.
> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen
> in stopping NM's sub-services.
> -----------------------------------------------------------------------------------------------------------
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, rolling upgrade
> Affects Versions: 2.6.0
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: YARN-3641.patch
>
>
> If the NM's services do not get stopped properly, the NM cannot be restarted
> with work-preserving NM restart enabled. The exception is as follows:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
>     at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>     at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>     at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>     at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>     at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     ... 5 more
> 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
> ************************************************************/
> {noformat}
> The related code in NodeManager.java is shown below:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
>     if (isStopping.getAndSet(true)) {
>       return;
>     }
>     super.serviceStop();
>     stopRecoveryStore();
>     DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see that all of the NM's registered services (NodeStatusUpdater,
> LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If
> any of those services throws an exception while stopping, stopRecoveryStore()
> gets skipped, which means the leveldb store is never closed. The next time the
> NM starts, it fails with the exception above.
> We should call stopRecoveryStore() in a finally block; a sketch of that change
> is below.
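> A minimal sketch of that change (the attached YARN-3641.patch may differ in
> details):
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
>     if (isStopping.getAndSet(true)) {
>       return;
>     }
>     try {
>       super.serviceStop();
>       DefaultMetricsSystem.shutdown();
>     } finally {
>       // Always close the recovery store, even if stopping a sub-service
>       // threw, so the leveldb LOCK file does not block the next NM start.
>       stopRecoveryStore();
>     }
>   }
> {code}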