[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-08-19 Thread Junping Du (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703280#comment-14703280 ]

Junping Du commented on YARN-3641:
--

bq. I can't see how to change this from 'Pending Closed' to 'Fixed'. 
I can't either. The weird thing is that I can't even reopen it...

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
 Issue Type: Bug
 Components: nodemanager, rolling upgrade
 Affects Versions: 2.6.0
 Reporter: Junping Du
 Assignee: Junping Du
 Priority: Critical
 Labels: 2.6.1-candidate
 Fix For: 2.7.1

 Attachments: YARN-3641.patch


 If the NM's services are not stopped properly, the NM cannot be started again 
 when work-preserving NM restart is enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
   at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
   at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
     if (isStopping.getAndSet(true)) {
       return;
     }
     super.serviceStop();
     stopRecoveryStore();
     DefaultMetricsSystem.shutdown();
   }
 {code}
 As the code shows, all registered NM services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) are stopped first. 
 If any of them throws an exception while stopping, stopRecoveryStore() is 
 skipped, which means the LevelDB store is never closed. The next time the NM 
 starts, it fails with the exception above. 
 We should put stopRecoveryStore(); in a finally block.
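 A minimal, self-contained sketch of the proposal above, not the attached 
 patch. All names in it are hypothetical stand-ins: stopSubServices() plays 
 the role of super.serviceStop() and is made to throw, and stopRecoveryStore() 
 only records that it ran instead of closing a real LevelDB store.

```java
import java.util.ArrayList;
import java.util.List;

public class ServiceStopSketch {
    // Records which shutdown steps actually ran, in order.
    static final List<String> steps = new ArrayList<>();

    // Hypothetical stand-in for super.serviceStop(): one sub-service fails to stop.
    static void stopSubServices() {
        steps.add("stopSubServices");
        throw new IllegalStateException("simulated failure while stopping a sub-service");
    }

    // Hypothetical stand-in for closing the recovery store.
    static void stopRecoveryStore() {
        steps.add("stopRecoveryStore");
    }

    // Proposed shape: the store is closed even if stopping the sub-services throws.
    static void serviceStop() {
        try {
            stopSubServices();
        } finally {
            stopRecoveryStore();
        }
    }

    public static void main(String[] args) {
        try {
            serviceStop();
        } catch (IllegalStateException e) {
            System.out.println("serviceStop failed: " + e.getMessage());
        }
        System.out.println("steps run: " + steps);
    }
}
```

 With this shape the exception from the sub-services still propagates to the 
 caller, but the recovery store step runs before it does.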



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-08-06 Thread Siqi Li (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660704#comment-14660704 ]

Siqi Li commented on YARN-3641:
---

The latest patch applies cleanly to the 2.6.0 branch.


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635380#comment-14635380 ]

Allen Wittenauer commented on YARN-3641:


I can't see how to change this from 'Pending Closed' to 'Fixed'. :(


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-14 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543524#comment-14543524 ]

Hudson commented on YARN-3641:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #927 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/927/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when 
exceptions happen in stopping NM's sub-services. Contributed by Junping Du 
(jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java



[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-14 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543580#comment-14543580 ]

Hudson commented on YARN-3641:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #196 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/196/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when 
exceptions happen in stopping NM's sub-services. Contributed by Junping Du 
(jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt



[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-14 Thread Rohith (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543739#comment-14543739 ]

Rohith commented on YARN-3641:
--

bq. so we probably should still call ExitUtil.terminate.
I think this is the right way to avoid a JVM hang during graceful shutdown.


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-14 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543696#comment-14543696 ]

Jason Lowe commented on YARN-3641:
--

bq. I think DefaultMetricsSystem.shutdown(); also should be called in the 
finally block otherwise if custom implementation of MetricsSinkAdapter like 
HADOOP-11932 would hang the JVM.

Arguably there are a lot of different things that can cause the JVM to hang 
during shutdown: an auxiliary service that doesn't always shut down cleanly, 
or a new service that launches non-daemon threads and whose shutdown/stop 
isn't called, etc.  IMHO if we really want to prevent shutdowns from hanging 
in general, then the NM should explicitly call ExitUtil.terminate rather than 
relying on all the non-daemon threads to eventually exit.
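
A rough sketch of the explicit-exit idea in the comment above, under stated 
assumptions. The class, method, and parameter names are hypothetical and this 
is not NodeManager code. The exit handler is passed in so the demo only 
prints; a real daemon would pass something that actually terminates the 
process, such as System::exit or, for the NM, the ExitUtil.terminate call 
suggested above.

```java
import java.util.function.IntConsumer;

public class ExplicitExitSketch {
    // Runs the shutdown logic, then always hands an exit status to the exit
    // handler, so termination does not depend on every non-daemon thread
    // finishing on its own.
    static void shutdownThenExit(Runnable shutdown, IntConsumer exitHandler) {
        int status = 0;
        try {
            shutdown.run();
        } catch (RuntimeException e) {
            status = 1; // shutdown failed, but we still terminate explicitly
        } finally {
            exitHandler.accept(status);
        }
    }

    public static void main(String[] args) {
        // Demo handler that only prints; it stands in for a real process exit.
        shutdownThenExit(
            () -> { throw new IllegalStateException("simulated stop failure"); },
            status -> System.out.println("would exit with status " + status));
    }
}
```

The trade-off is that an explicit exit can cut off threads that are still 
doing work, so it belongs after the stop logic has had its chance to run.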


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-14 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543855#comment-14543855 ]

Hudson commented on YARN-3641:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #185 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/185/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when 
exceptions happen in stopping NM's sub-services. Contributed by Junping Du 
(jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java



[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-14 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543798#comment-14543798 ]

Hudson commented on YARN-3641:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #195 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/195/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when 
exceptions happen in stopping NM's sub-services. Contributed by Junping Du 
(jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java


 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
 Issue Type: Bug
 Components: nodemanager, rolling upgrade
 Affects Versions: 2.6.0
 Reporter: Junping Du
 Assignee: Junping Du
 Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3641.patch


 If NM' services not get stopped properly, we cannot start NM with enabling NM 
 restart with work preserving. The exception is as following:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
 (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at 
 c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
 if (isStopping.getAndSet(true)) {
   return;
 }
 super.serviceStop();
 stopRecoveryStore();
 DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see we stop all NM registered services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
 services get stopped with exception could cause stopRecoveryStore() get 
 skipped which means levelDB store is not get closed. So next time NM start, 
 it will get failed with exception above. 
 We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543817#comment-14543817
 ] 

Hudson commented on YARN-3641:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2125 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2125/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when 
exceptions happen in stopping NM's sub-services. Contributed by Junping Du 
(jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt




[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543903#comment-14543903
 ] 

Hudson commented on YARN-3641:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2143 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2143/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when 
exceptions happen in stopping NM's sub-services. Contributed by Junping Du 
(jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt




[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542000#comment-14542000
 ] 

Jason Lowe commented on YARN-3641:
--

I think the patch approach is OK, but I'm not sure I agree with the problem 
analysis.  We kill -9 the NM during rolling upgrades, which obviously will not 
cleanly shut down the state store, yet we don't have the IO error lock problem.  
The issue is that the old NM process must still be running, which is why 
leveldb refuses to open the still-in-use database.  In that sense this JIRA 
appears to be a duplicate of the same problems described in YARN-3585 and 
YARN-3640.



[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542028#comment-14542028
 ] 

Junping Du commented on YARN-3641:
--

bq. We kill -9 the NM during rolling upgrades, which obviously will not cleanly 
shutdown the state store, yet we don't have the IO error lock problem.
Yes, I also suspect that the old NM is still running. The bad news is that our 
original environment is gone, so it may take some time to reproduce this and 
see whether it is the same problem as YARN-3585.



[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542052#comment-14542052
 ] 

Hadoop QA commented on YARN-3641:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  14m 39s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear to include any new or modified tests.  Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac |   7m 36s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc |   9m 39s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 35s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace |   0m  0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m  3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests |   6m  0s | Tests passed in hadoop-yarn-server-nodemanager. |
| | |  42m  5s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12732578/YARN-3641.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 065d8f2 |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/whitespace.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7921/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7921/console |


This message was automatically generated.


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543198#comment-14543198
 ] 

Rohith commented on YARN-3641:
--

Apologies for coming late to this JIRA. I think 
{{DefaultMetricsSystem.shutdown();}} should also be called in the finally 
block; otherwise a custom MetricsSinkAdapter implementation, as in 
HADOOP-11932, could hang the JVM.
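
One way to read this suggestion is a nested try/finally, so that the metrics shutdown is attempted even if stopping the recovery store itself fails. The sketch below is an assumption about how that could look, not the committed change: the class MetricsShutdownSketch, the flag storeStopFails, and the three helper methods are hypothetical stand-ins for super.serviceStop(), stopRecoveryStore() and DefaultMetricsSystem.shutdown().

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MetricsShutdownSketch {
  // Records which shutdown steps actually ran, in order.
  static final List<String> steps = new ArrayList<>();
  // Hypothetical switch: make the store shutdown fail as well.
  static boolean storeStopFails = false;

  // Hypothetical stand-in for super.serviceStop().
  static void stopSubServices() throws Exception {
    steps.add("stopSubServices");
    throw new Exception("sub-service failed to stop");
  }

  // Hypothetical stand-in for stopRecoveryStore().
  static void stopRecoveryStore() throws IOException {
    steps.add("stopRecoveryStore");
    if (storeStopFails) {
      throw new IOException("store failed to close");
    }
  }

  // Hypothetical stand-in for DefaultMetricsSystem.shutdown().
  static void shutdownMetrics() {
    steps.add("shutdownMetrics");
  }

  static void serviceStop() throws Exception {
    try {
      stopSubServices();
    } finally {
      try {
        stopRecoveryStore();
      } finally {
        // Reached whether or not the two earlier steps threw.
        shutdownMetrics();
      }
    }
  }

  public static void main(String[] args) {
    try {
      serviceStop();
    } catch (Exception e) {
      System.out.println("caught: " + e.getMessage());
    }
    System.out.println(steps);
  }
}
```

Note that in Java an exception thrown inside a finally block replaces the one already in flight, so if both the sub-services and the store fail, the caller sees only the store's exception.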



[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542742#comment-14542742
 ] 

Hudson commented on YARN-3641:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #7823 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7823/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when 
exceptions happen in stopping NM's sub-services. Contributed by Junping Du 
(jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt




[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542447#comment-14542447
 ] 

Jason Lowe commented on YARN-3641:
--

+1 for the patch.  Will commit this later today, and fix the whitespace nit as 
part of the commit.

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-3641.patch


 If the NM's services are not stopped properly, the NM cannot be started again 
 with work-preserving NM restart enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
   at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
   at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code in NodeManager.java is as follows:
 {code}
   @Override
   protected void serviceStop() throws Exception {
     if (isStopping.getAndSet(true)) {
       return;
     }
     super.serviceStop();
     stopRecoveryStore();
     DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see that all registered NM services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) are stopped first. 
 If any of them throws an exception while stopping, stopRecoveryStore() is 
 skipped, which means the leveldb store is not closed. The next NM start then 
 fails with the exception above. 
 We should put stopRecoveryStore() in a finally block.


