[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

2015-05-01 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522890#comment-14522890
 ] 

zhihai xu commented on YARN-2873:
-

We didn't see this problem any more after store levelDB files away from tmp 
directory.

> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> ---
>
> Key: YARN-2873
> URL: https://issues.apache.org/jira/browse/YARN-2873
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2873.000.patch, YARN-2873.001.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> We saw the following three level DB exceptions, all these exceptions cause NM 
> start failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state STARTED; cause: 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   ... 10 more
> {code}
> DBException 2 in NMLeveldbStateStoreService:
> {code}
> Error starting NodeManager 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
>  
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/ya

[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

2014-11-18 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216464#comment-14216464
 ] 

zhihai xu commented on YARN-2873:
-

Hi [~jlowe],

I agree with you. The root cause is the Sorted Tables(*.sst) and MANIFEST file 
being deleted.
If these files are stored away from tmp directory, it may solve the problem. 

thanks
zhihai

> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> ---
>
> Key: YARN-2873
> URL: https://issues.apache.org/jira/browse/YARN-2873
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2873.000.patch, YARN-2873.001.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> We saw the following three level DB exceptions, all these exceptions cause NM 
> start failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state STARTED; cause: 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   ... 10 more
> {code}
> DBException 2 in NMLeveldbStateStoreService:
> {code}
> Error starting NodeManager 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
>  
> Caused by: org.fusesource.level

[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

2014-11-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216259#comment-14216259
 ] 

Jason Lowe commented on YARN-2873:
--

I have some serious concerns about this approach.  As I mentioned during the 
discussions on YARN-2816, this is trying to recover from a completely invalid 
setup.  If something is coming along and deleting (i.e.: corrupting) parts of 
the database then _that_ is the problem that needs to be corrected rather than 
worked around in the NM.  Reaching into the internals of the leveldb files and 
assuming we can just delete some files and the database can open isn't a 
general solution.  At that point arbitrary state has been lost, potentially 
entire container/application lifecycles, and who knows what will happen.

Rather than assume we know how leveldb internals work (which could completely 
change if we upgrade the leveldb dependency and invalidate our assumptions), we 
should use JniDBFactory.factory.repair to try to repair the database rather 
than delete files here and there ourselves.  Arguably if leveldb's own repair 
doesn't work and we're insistent that the NM must come up at all costs then we 
should just nuke the database and start without state.  Of course the log 
should be filled with all sorts of errors to indicate this was in no way a 
normal startup.

> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> ---
>
> Key: YARN-2873
> URL: https://issues.apache.org/jira/browse/YARN-2873
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2873.000.patch, YARN-2873.001.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> We saw the following three level DB exceptions, all these exceptions cause NM 
> start failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state STARTED; cause: 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   ... 10 more
> {code}
> DBException 2 in NMLe

[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

2014-11-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215676#comment-14215676
 ] 

Hadoop QA commented on YARN-2873:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12682074/YARN-2873.001.patch
  against trunk revision 9dd5d67.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5864//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5864//console

This message is automatically generated.

> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> ---
>
> Key: YARN-2873
> URL: https://issues.apache.org/jira/browse/YARN-2873
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2873.000.patch, YARN-2873.001.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> We saw the following three level DB exceptions, all these exceptions cause NM 
> start failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state STARTED; cause: 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(J

[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

2014-11-17 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215658#comment-14215658
 ] 

zhihai xu commented on YARN-2873:
-

Attached a new patch YARN-2873.001.patch to fix findbugs warnings.

> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> ---
>
> Key: YARN-2873
> URL: https://issues.apache.org/jira/browse/YARN-2873
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2873.000.patch, YARN-2873.001.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> We saw the following three level DB exceptions, all these exceptions cause NM 
> start failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state STARTED; cause: 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   ... 10 more
> {code}
> DBException 2 in NMLeveldbStateStoreService:
> {code}
> Error starting NodeManager 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
>  
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.ss

[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

2014-11-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215646#comment-14215646
 ] 

Hadoop QA commented on YARN-2873:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12682062/YARN-2873.000.patch
  against trunk revision 9dd5d67.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5863//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5863//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-shuffle.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5863//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5863//console

This message is automatically generated.

> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> ---
>
> Key: YARN-2873
> URL: https://issues.apache.org/jira/browse/YARN-2873
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2873.000.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> We saw the following three level DB exceptions, all these exceptions cause NM 
> start failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state STARTED; cause: 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recove

[jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

2014-11-17 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215610#comment-14215610
 ] 

zhihai xu commented on YARN-2873:
-

Uploaded a patch YARN-2873.000.patch to delete levelDB file CURRENT when 
DBException happened, So the NM can start successfully from DBException instead 
of failing to start.

> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> ---
>
> Key: YARN-2873
> URL: https://issues.apache.org/jira/browse/YARN-2873
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2873.000.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> We saw the following three level DB exceptions, all these exceptions cause NM 
> start failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state STARTED; cause: 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   ... 10 more
> {code}
> DBException 2 in NMLeveldbStateStoreService:
> {code}
> Error starting NodeManager 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
>  
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corrupt