[
https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215646#comment-14215646
]
Hadoop QA commented on YARN-2873:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12682062/YARN-2873.000.patch
against trunk revision 9dd5d67.
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include
any new or modified tests.
Please justify why no new tests are needed for this
patch.
Also please list what manual steps were performed to
verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with
eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 2 new
Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results:
https://builds.apache.org/job/PreCommit-YARN-Build/5863//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-YARN-Build/5863//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-shuffle.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-YARN-Build/5863//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5863//console
This message is automatically generated.
> improve LevelDB error handling for missing files DBException to avoid NM
> start failure.
> ---------------------------------------------------------------------------------------
>
> Key: YARN-2873
> URL: https://issues.apache.org/jira/browse/YARN-2873
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2873.000.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM
> start failure.
> We saw the following three level DB exceptions, all these exceptions cause NM
> start failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
> failed in state STARTED; cause:
> org.apache.hadoop.service.ServiceStateException:
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1
> missing files; e.g.:
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/000005.sst
> org.apache.hadoop.service.ServiceStateException:
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1
> missing files; e.g.:
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/000005.sst
> at
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException:
> Corruption: 1 missing files; e.g.:
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/000005.sst
> at
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at
> org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
> at
> org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
> at
> org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {code}
> DBException 2 in NMLeveldbStateStoreService:
> {code}
> Error starting NodeManager
> org.apache.hadoop.service.ServiceStateException:
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1
> missing files; e.g.:
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000005.sst
> at
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
>
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
>
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
>
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException:
> Corruption: 1 missing files; e.g.:
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000005.sst
> at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:842)
>
> at
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:195)
>
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> {code}
> DBException 3 in NMLeveldbStateStoreService:
> {code}
> INFO org.apache.hadoop.service.AbstractService
> Service
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService
> failed in state INITED; cause:
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error:
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/MANIFEST-000004: No such file
> or directory
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error:
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/MANIFEST-000004: No such file
> or directory
> at
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:842)
> at
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:195)
> at
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
> at
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> {code}
> DBException 1 and 2 is due to Sorted table file 000005.sst being deleted
> accidentally.
> DBException 3 is due to MANIFEST being deleted accidentally.
> It would be better to handle these errors instead of NM failed to start with
> DBException.
> For these DBExceptions, if we delete the LevelDB text file CURRENT, NM will
> recover successfully from the DBException.
> CURRENT is a simple text file that contains the name of the latest MANIFEST
> file.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)