[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703280#comment-14703280 ]

Junping Du commented on YARN-3641:
--

bq. I can't see how to change this from 'Pending Closed' to 'Fixed'.

I cannot either. The weird thing is that I cannot even reopen it...

NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
---

Key: YARN-3641
URL: https://issues.apache.org/jira/browse/YARN-3641
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
Labels: 2.6.1-candidate
Fix For: 2.7.1
Attachments: YARN-3641.patch

If the NM's services are not stopped properly, the NM cannot be started again when work-preserving NM restart is enabled. The exception is as follows:

{noformat}
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    ... 5 more
2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
/
{noformat}

The related code in NodeManager.java is shown below:

{code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  super.serviceStop();
  stopRecoveryStore();
  DefaultMetricsSystem.shutdown();
}
{code}

As the code shows, all of the NM's registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If any of them throws an exception while stopping, stopRecoveryStore() is skipped, which means the LevelDB store is not closed. The next NM start then fails with the exception above. We should put stopRecoveryStore(); in a finally block.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
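The fix proposed in the issue description (moving stopRecoveryStore() into a finally block) can be illustrated with a small stand-alone sketch. It does not use Hadoop: the class name StopInFinallySketch, the events list, and the stopSubServices/shutdownMetrics helpers are hypothetical stand-ins for super.serviceStop() and DefaultMetricsSystem.shutdown(), and leaving the metrics shutdown inside the try block is an assumption of this sketch, not a statement about the committed patch.

```java
import java.util.ArrayList;
import java.util.List;

public class StopInFinallySketch {
    // Records the order in which the (hypothetical) stop steps run.
    static final List<String> events = new ArrayList<>();

    // Stand-in for super.serviceStop(); can be told to fail.
    static void stopSubServices(boolean fail) {
        events.add("stopSubServices");
        if (fail) {
            throw new IllegalStateException("sub-service failed to stop");
        }
    }

    // Stand-in for DefaultMetricsSystem.shutdown().
    static void shutdownMetrics() {
        events.add("shutdownMetrics");
    }

    // Stand-in for the call that closes the LevelDB state store.
    static void stopRecoveryStore() {
        events.add("stopRecoveryStore");
    }

    static void serviceStop(boolean failSubServices) {
        try {
            stopSubServices(failSubServices);
            shutdownMetrics(); // placement is an assumption of this sketch
        } finally {
            // Runs even when a sub-service throws, so the store is released.
            stopRecoveryStore();
        }
    }

    public static void main(String[] args) {
        try {
            serviceStop(true);
        } catch (IllegalStateException e) {
            System.out.println("stop failed: " + e.getMessage());
        }
        System.out.println("steps run: " + events);
    }
}
```

With a failing sub-service the exception still propagates to the caller, but the recovery store step has already run by then, which is the behaviour the description asks for.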
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660704#comment-14660704 ]

Siqi Li commented on YARN-3641:
---

The latest patch can be applied to 2.6.0 branch cleanly
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635380#comment-14635380 ]

Allen Wittenauer commented on YARN-3641:

I can't see how to change this from 'Pending Closed' to 'Fixed'. :(
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543524#comment-14543524 ]

Hudson commented on YARN-3641:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #927 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/927/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Contributed by Junping Du (jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543580#comment-14543580 ]

Hudson commented on YARN-3641:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #196 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/196/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Contributed by Junping Du (jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543739#comment-14543739 ]

Rohith commented on YARN-3641:
--

bq. so we probably should still call ExitUtil.terminate.

I think this is right way to overcome from JVM hang during graceful shutdown.
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543696#comment-14543696 ]

Jason Lowe commented on YARN-3641:
--

bq. I think DefaultMetricsSystem.shutdown(); also should be called in the finally block otherwise if custom implementation of MetricsSinkAdapter like HADOOP-11932 would hang the JVM.

Arguably there are a lot of different things that can cause the JVM to hang during shutdown. An auxiliary service that doesn't always shutdown cleanly. Someone adds a new service that launches non-daemon threads and their shutdown/stop isn't called, etc. etc. IMHO if we really want to prevent shutdowns from hanging in general then the NM should explicitly call ExitUtil.terminate rather than relying on all the non-daemon threads to eventually exit.
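The idea in Jason Lowe's comment, ending the process explicitly instead of waiting for every non-daemon thread to exit, can be sketched without Hadoop as follows. The class ExplicitExitSketch and the method stopThenExit are hypothetical names, and the exit hook is passed in as a parameter so the sketch can be exercised without terminating the JVM; a real daemon would presumably pass System::exit or, in Hadoop, something like the ExitUtil.terminate call named in the comment. The status codes 0 and 1 are also assumptions of this sketch.

```java
import java.util.function.IntConsumer;

public class ExplicitExitSketch {

    /**
     * Runs the stop logic and then always invokes the exit hook,
     * passing 0 after a clean stop and 1 if the stop logic threw.
     */
    static void stopThenExit(Runnable stop, IntConsumer exitHook) {
        int status = 0;
        try {
            stop.run();
        } catch (RuntimeException e) {
            // A failed stop should not prevent the process from ending.
            status = 1;
        } finally {
            exitHook.accept(status);
        }
    }

    public static void main(String[] args) {
        // Recording hook, so that this demo does not actually end the JVM.
        // A real daemon would pass System::exit here instead.
        int[] recorded = {-1};
        stopThenExit(
            () -> { throw new IllegalStateException("aux service failed to stop"); },
            code -> recorded[0] = code);
        System.out.println("exit hook called with status " + recorded[0]);
    }
}
```

Because the hook runs in the finally block, a lingering non-daemon thread left behind by a badly behaved service would no longer be able to keep the process alive once a real exit call is plugged in.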
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543855#comment-14543855 ]

Hudson commented on YARN-3641:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/185/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Contributed by Junping Du (jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543798#comment-14543798 ]

Hudson commented on YARN-3641:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #195 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/195/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Contributed by Junping Du (jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java

NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
---
Key: YARN-3641
URL: https://issues.apache.org/jira/browse/YARN-3641
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
Fix For: 2.7.1
Attachments: YARN-3641.patch

If the NM's services are not stopped properly, we cannot restart the NM when work-preserving NM restart is enabled. The exception is as follows:
{noformat}
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    ... 5 more
2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
/
{noformat}
The related code in NodeManager.java is as below:
{code}
  @Override
  protected void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    super.serviceStop();
    stopRecoveryStore();
    DefaultMetricsSystem.shutdown();
  }
{code}
We can see that we stop all of the NM's registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) first. If any of these services throws an exception while stopping, stopRecoveryStore() is skipped, which means the levelDB store is not closed. The next time the NM starts, it then fails with the exception above.
We should put stopRecoveryStore(); in a finally block.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
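The proposal in the description (run stopRecoveryStore() from a finally block) could look roughly like the sketch below. It is not the attached patch and does not use Hadoop classes: StopInFinallySketch, stopSubServices() and shutdownMetrics() are hypothetical stand-ins for NodeManager, super.serviceStop() and DefaultMetricsSystem.shutdown(), and the steps list exists only to make the call order visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch only: hypothetical stand-ins for the NodeManager stop path. */
class StopInFinallySketch {
  // Records the shutdown steps in the order they ran.
  final List<String> steps = new ArrayList<>();
  private final AtomicBoolean isStopping = new AtomicBoolean(false);
  private final boolean subServiceFails;

  StopInFinallySketch(boolean subServiceFails) {
    this.subServiceFails = subServiceFails;
  }

  // Stand-in for super.serviceStop(), which stops the registered sub-services.
  void stopSubServices() throws Exception {
    steps.add("stopSubServices");
    if (subServiceFails) {
      throw new Exception("sub-service failed to stop");
    }
  }

  // Stand-in for stopRecoveryStore(), which closes the state store.
  void stopRecoveryStore() {
    steps.add("stopRecoveryStore");
  }

  // Stand-in for DefaultMetricsSystem.shutdown().
  void shutdownMetrics() {
    steps.add("shutdownMetrics");
  }

  void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    try {
      stopSubServices();
    } finally {
      // Runs whether or not a sub-service threw, so the store is always closed.
      stopRecoveryStore();
    }
    // Still skipped when stopSubServices() throws.
    shutdownMetrics();
  }

  public static void main(String[] args) {
    StopInFinallySketch nm = new StopInFinallySketch(true);
    try {
      nm.serviceStop();
    } catch (Exception e) {
      System.out.println("stop failed: " + e.getMessage());
    }
    System.out.println(nm.steps); // [stopSubServices, stopRecoveryStore]
  }
}
```

With a failing sub-service the exception still propagates to the caller, but the store stand-in is closed first. The metrics stand-in is deliberately left outside the finally block here, matching the narrow change the description asks for.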
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543817#comment-14543817 ]

Hudson commented on YARN-3641:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2125 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2125/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Contributed by Junping Du (jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543903#comment-14543903 ]

Hudson commented on YARN-3641:
------------------------------

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2143 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2143/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Contributed by Junping Du (jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542000#comment-14542000 ]

Jason Lowe commented on YARN-3641:
----------------------------------

I think the patch approach is OK, but I'm not sure I agree with the problem analysis. We kill -9 the NM during rolling upgrades, which obviously will not cleanly shutdown the state store, yet we don't have the IO error lock problem. The issue is that the old NM process must still be running, which is why leveldb refuses to open the still-in-use database. In that sense this JIRA appears to be a duplicate of the same problems described in YARN-3585 and YARN-3640.
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542028#comment-14542028 ]

Junping Du commented on YARN-3641:
----------------------------------

bq. We kill -9 the NM during rolling upgrades, which obviously will not cleanly shutdown the state store, yet we don't have the IO error lock problem.
Yes. I also suspect that the old NM was still running. The bad news is that our original environment is gone, so we may need some time to reproduce this and see whether it is the same problem as YARN-3585.
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542052#comment-14542052 ]

Hadoop QA commented on YARN-3641:
---------------------------------

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 39s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 35s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 6m 0s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 42m 5s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12732578/YARN-3641.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 065d8f2 |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/whitespace.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7921/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7921/console |

This message was automatically generated.
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543198#comment-14543198 ]

Rohith commented on YARN-3641:
------------------------------

Apologies for coming late to this JIRA. I think {{DefaultMetricsSystem.shutdown();}} should also be called in the finally block; otherwise a custom MetricsSinkAdapter implementation, as in HADOOP-11932, could hang the JVM.
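Rohith's suggestion of also shutting down the metrics system from the finally block might be sketched as follows. This is an illustration under assumptions, not the committed change: MetricsInFinallySketch and its methods are hypothetical stand-ins for the NodeManager stop path, and nesting a second try/finally is one possible way to make both cleanup steps run; the comment itself does not prescribe it.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: hypothetical stand-ins, not Hadoop code. */
class MetricsInFinallySketch {
  // Records the shutdown steps in the order they ran.
  final List<String> steps = new ArrayList<>();
  private final boolean subServiceFails;

  MetricsInFinallySketch(boolean subServiceFails) {
    this.subServiceFails = subServiceFails;
  }

  // Stand-in for super.serviceStop().
  void stopSubServices() throws Exception {
    steps.add("stopSubServices");
    if (subServiceFails) {
      throw new Exception("sub-service failed to stop");
    }
  }

  // Stand-in for stopRecoveryStore().
  void stopRecoveryStore() {
    steps.add("stopRecoveryStore");
  }

  // Stand-in for DefaultMetricsSystem.shutdown().
  void shutdownMetrics() {
    steps.add("shutdownMetrics");
  }

  void serviceStop() throws Exception {
    try {
      stopSubServices();
    } finally {
      try {
        stopRecoveryStore();
      } finally {
        // Runs even if closing the store fails as well.
        shutdownMetrics();
      }
    }
  }

  public static void main(String[] args) {
    MetricsInFinallySketch nm = new MetricsInFinallySketch(true);
    try {
      nm.serviceStop();
    } catch (Exception e) {
      System.out.println("stop failed: " + e.getMessage());
    }
    System.out.println(nm.steps); // [stopSubServices, stopRecoveryStore, shutdownMetrics]
  }
}
```

The inner try/finally means a failure while closing the store does not prevent the metrics shutdown either, at the cost of one more nesting level.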
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542742#comment-14542742 ]

Hudson commented on YARN-3641:
------------------------------

SUCCESS: Integrated in Hadoop-trunk-Commit #7823 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7823/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Contributed by Junping Du (jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542447#comment-14542447 ]

Jason Lowe commented on YARN-3641:
----------------------------------

+1 for the patch. Will commit this later today, and fix the whitespace nit as part of the commit.