[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-09-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3641:
--
Fix Version/s: 2.6.1

Pulled this into 2.6.1. Ran compilation before the push. Patch applied cleanly.


> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> ---
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, rolling upgrade
> Affects Versions: 2.6.0
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.1
>
> Attachments: YARN-3641.patch
>
>
> If the NM's services are not stopped properly, the NM cannot be started 
> again with work-preserving NM restart enabled. The exception is as follows:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
>   at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
>   at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>   at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
> /
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
>     if (isStopping.getAndSet(true)) {
>       return;
>     }
>     super.serviceStop();
>     stopRecoveryStore();
>     DefaultMetricsSystem.shutdown();
>   }
> {code}
> As the code shows, all of the NM's registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) are stopped first. 
> If any of them throws an exception while stopping, stopRecoveryStore() is 
> skipped, which means the LevelDB store is never closed. The next NM start 
> then fails with the exception above. 
> We should put stopRecoveryStore() in a finally block.
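
The change proposed in the last line of the description can be sketched as follows. This is a minimal, self-contained illustration and not the attached patch: SketchNodeManager, stopSubServices() (standing in for super.serviceStop()), shutdownMetrics() (standing in for DefaultMetricsSystem.shutdown()) and the events list are hypothetical names made up for the example. Only stopRecoveryStore() is moved into the finally block, since that is all the description asks for; where the metrics shutdown should go is left as it was and is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative stand-in for the NodeManager stop sequence; not Hadoop code. */
public class StopOrderSketch {

  static class SketchNodeManager {
    final AtomicBoolean isStopping = new AtomicBoolean(false);
    // Records which stop steps actually ran, in order.
    final List<String> events = new ArrayList<>();
    final boolean subServiceFails;

    SketchNodeManager(boolean subServiceFails) {
      this.subServiceFails = subServiceFails;
    }

    // Hypothetical stand-in for super.serviceStop(): may throw while stopping.
    void stopSubServices() throws Exception {
      events.add("stopSubServices");
      if (subServiceFails) {
        throw new IllegalStateException("sub-service failed to stop");
      }
    }

    // Hypothetical stand-in for closing the LevelDB-backed recovery store.
    void stopRecoveryStore() {
      events.add("stopRecoveryStore");
    }

    // Hypothetical stand-in for DefaultMetricsSystem.shutdown().
    void shutdownMetrics() {
      events.add("shutdownMetrics");
    }

    void serviceStop() throws Exception {
      if (isStopping.getAndSet(true)) {
        return; // a stop is already in progress or done
      }
      try {
        stopSubServices();
      } finally {
        // Runs whether or not a sub-service threw, so the store is always closed.
        stopRecoveryStore();
      }
      // Reached only when no exception escaped, as in the original ordering.
      shutdownMetrics();
    }
  }

  public static void main(String[] args) {
    SketchNodeManager nm = new SketchNodeManager(true);
    try {
      nm.serviceStop();
    } catch (Exception e) {
      System.out.println("stop failed: " + e.getMessage());
    }
    System.out.println(nm.events);
  }
}
```

With a failing sub-service the exception still propagates to the caller, but the recorded events show that the recovery store was closed before it did.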



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-22 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3641:
--
Labels: 2.6.1-candidate  (was: )

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
  Labels: 2.6.1-candidate
 Fix For: 2.7.1

 Attachments: YARN-3641.patch




[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-3641:
---
Reporter: Junping Du  (was: Allen Wittenauer)

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Allen Wittenauer
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3641.patch




[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-3641:
---
Reporter: Allen Wittenauer  (was: Junping Du)

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Allen Wittenauer
Assignee: Allen Wittenauer
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3641.patch




[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-3641:
---
Assignee: Junping Du  (was: Allen Wittenauer)

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3641.patch




[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3641:
-
Component/s: rolling upgrade, nodemanager
Affects Version/s: 2.6.0
Summary: NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.  (was: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.)

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical



[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3641:
-
Description: 
If the NM's services are not stopped properly, the NM cannot be started 
again with work-preserving NM restart enabled. The exception is as follows:
{noformat}
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
  at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
  at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
  at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
  at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
  at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
  at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
  ... 5 more
2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
/
{noformat}

The related code is as below in NodeManager.java:
{code}
  @Override
  protected void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    super.serviceStop();
    stopRecoveryStore();
    DefaultMetricsSystem.shutdown();
  }
{code}
As the code shows, all of the NM's registered services (NodeStatusUpdater, 
LogAggregationService, ResourceLocalizationService, etc.) are stopped first. 
If any of them throws an exception while stopping, stopRecoveryStore() is 
skipped, which means the LevelDB store is never closed. The next NM start 
then fails with the exception above. 
We should put stopRecoveryStore() in a finally block.

  was:
If NM' services not get stopped properly, we cannot start NM with enabling NM 
restart with work preserving. The exception is as following:
{noformat}
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
/var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
temporarily unavailable
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
Resource temporarily unavailable
at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
(LogAdapter.java:info(45)) 

[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3641:
-
Attachment: YARN-3641.patch

Uploaded a quick patch to fix it. The issue is obvious and the fix is 
simple enough that no unit test is needed.

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-3641.patch



[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3641:
--
Target Version/s: 2.7.1  (was: 2.8.0)

Marking it as critical for 2.7.1, whichever way we go.

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-3641.patch

