[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
[ https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522890#comment-14522890 ]

zhihai xu commented on YARN-2873:
---------------------------------

We no longer saw this problem after storing the LevelDB files outside the tmp directory.

> improve LevelDB error handling for missing files DBException to avoid NM start failure.
> ---
>
>                 Key: YARN-2873
>                 URL: https://issues.apache.org/jira/browse/YARN-2873
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2873.000.patch, YARN-2873.001.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM start failure.
> We saw the following three LevelDB exceptions, all of which cause NM start failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state STARTED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>     at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>     at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>     at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>     at org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
>     at org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
>     at org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     ... 10 more
> {code}
> DBException 2 in NMLeveldbStateStoreService:
> {code}
> Error starting NodeManager
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/ya
[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
[ https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216464#comment-14216464 ]

zhihai xu commented on YARN-2873:
---------------------------------

Hi [~jlowe],
I agree with you. The root cause is that the sorted table (*.sst) files and the MANIFEST file are being deleted. Storing these files outside the tmp directory may solve the problem.
Thanks,
zhihai
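The suggestion above, keeping the LevelDB files out of the tmp directory, can be guarded with a small startup check. The following is only a sketch: the class `RecoveryDirCheck` and its method are hypothetical and not part of Hadoop, and it assumes the recovery directory is whatever `yarn.nodemanager.recovery.dir` resolves to on the node (the default path shown is the one from the stack traces in this issue).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of Hadoop: flags state directories under tmp. */
final class RecoveryDirCheck {

  /**
   * Returns true if dir is tmpRoot itself or nested anywhere beneath it.
   * Paths are compared element by element after normalization, so
   * "/tmpfoo" is not treated as being under "/tmp". Symbolic links are
   * not resolved.
   */
  static boolean isUnderTmp(Path dir, Path tmpRoot) {
    Path normalizedDir = dir.toAbsolutePath().normalize();
    Path normalizedTmp = tmpRoot.toAbsolutePath().normalize();
    return normalizedDir.startsWith(normalizedTmp);
  }

  public static void main(String[] args) {
    // Default to the recovery path seen in the stack traces of this issue.
    Path recoveryDir = Paths.get(
        args.length > 0 ? args[0] : "/tmp/hadoop-yarn/yarn-nm-recovery");
    Path tmpRoot = Paths.get(System.getProperty("java.io.tmpdir"));
    if (isUnderTmp(recoveryDir, tmpRoot)) {
      System.err.println("WARNING: " + recoveryDir + " is under " + tmpRoot
          + "; a tmp cleaner may delete LevelDB files such as *.sst or MANIFEST");
    } else {
      System.out.println(recoveryDir + " is outside " + tmpRoot);
    }
  }
}
```

A check like this only reports the risky setup; the actual fix is still to point the recovery directory at a location that no cleanup job touches.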
[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
[ https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216259#comment-14216259 ]

Jason Lowe commented on YARN-2873:
----------------------------------

I have some serious concerns about this approach. As I mentioned during the discussions on YARN-2816, this is trying to recover from a completely invalid setup. If something is coming along and deleting (i.e.: corrupting) parts of the database then _that_ is the problem that needs to be corrected rather than worked around in the NM.

Reaching into the internals of the leveldb files and assuming we can just delete some files and the database can open isn't a general solution. At that point arbitrary state has been lost, potentially entire container/application lifecycles, and who knows what will happen.

Rather than assume we know how leveldb internals work (which could completely change if we upgrade the leveldb dependency and invalidate our assumptions), we should use JniDBFactory.factory.repair to try to repair the database rather than delete files here and there ourselves. Arguably if leveldb's own repair doesn't work and we're insistent that the NM must come up at all costs then we should just nuke the database and start without state. Of course the log should be filled with all sorts of errors to indicate this was in no way a normal startup.
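The order proposed in the comment above (try a normal open, then the library's own repair, and only as a last resort discard the whole database, logging loudly at each step) can be sketched as follows. This is an illustration under stated assumptions, not the patch: `StoreFactory` and `RecoveringStoreOpener` are hypothetical names, and the interface merely stands in for the open and repair calls that in the NM would presumably go through `JniDBFactory.factory`.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.logging.Logger;

/** Hypothetical stand-in for the two database factory calls this sketch needs. */
interface StoreFactory<S> {
  S open(Path dir) throws IOException;
  void repair(Path dir) throws IOException;
}

/** Hypothetical helper: open a store, repairing or resetting it on failure. */
final class RecoveringStoreOpener {
  private static final Logger LOG =
      Logger.getLogger(RecoveringStoreOpener.class.getName());

  enum Outcome { OPENED, REPAIRED, RESET }

  static final class Result<S> {
    final S store;
    final Outcome outcome;

    Result(S store, Outcome outcome) {
      this.store = store;
      this.outcome = outcome;
    }
  }

  static <S> Result<S> open(StoreFactory<S> factory, Path dir)
      throws IOException {
    try {
      return new Result<>(factory.open(dir), Outcome.OPENED);
    } catch (IOException e) {
      LOG.severe("Cannot open state store " + dir + ", trying repair: " + e);
    }
    try {
      // Let the database library repair itself; never touch its files here.
      factory.repair(dir);
      S store = factory.open(dir);
      LOG.severe("State store " + dir + " was repaired; state may be incomplete");
      return new Result<>(store, Outcome.REPAIRED);
    } catch (IOException e) {
      LOG.severe("Repair of " + dir + " failed, discarding ALL state: " + e);
    }
    // Last resort: wipe the whole directory and start without state.
    deleteRecursively(dir);
    Files.createDirectories(dir);
    return new Result<>(factory.open(dir), Outcome.RESET);
  }

  private static void deleteRecursively(Path root) throws IOException {
    if (!Files.exists(root)) {
      return;
    }
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path d, IOException e)
          throws IOException {
        if (e != null) {
          throw e;
        }
        Files.delete(d);
        return FileVisitResult.CONTINUE;
      }
    });
  }
}
```

The returned `Outcome` lets the caller decide how much noise to make: anything other than `OPENED` means the startup was not normal and recovered state cannot be trusted to be complete. If the final open after the reset also fails, the exception propagates and startup still fails, which seems preferable to hiding a directory that cannot be used at all.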
[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
[ https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215676#comment-14215676 ]

Hadoop QA commented on YARN-2873:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12682074/YARN-2873.001.patch
  against trunk revision 9dd5d67.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5864//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5864//console

This message is automatically generated.
[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
[ https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215658#comment-14215658 ]

zhihai xu commented on YARN-2873:
---------------------------------

Attached a new patch, YARN-2873.001.patch, to fix the Findbugs warnings.
[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
[ https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215646#comment-14215646 ]

Hadoop QA commented on YARN-2873:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12682062/YARN-2873.000.patch
  against trunk revision 9dd5d67.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5863//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5863//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-shuffle.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5863//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5863//console

This message is automatically generated.
[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
[ https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215610#comment-14215610 ]

zhihai xu commented on YARN-2873:
---------------------------------

Uploaded a patch, YARN-2873.000.patch, that deletes the LevelDB CURRENT file when a DBException occurs, so that the NM can recover from the DBException and start successfully instead of failing to start.