[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247202#comment-17247202 ]

Hemanth Boyina commented on HDFS-13314:

Thanks for the discussion here.
{quote}
*bq. WithCount withCount = (WithCount) referred.getParentReference(); << Crashes here as referred is null.*
{quote}
We have come across this issue too: we hit a NullPointerException and the fsimages are corrupted. Has a follow-up Jira been filed, or do you have any suggestions for this? [~arp] [~yzhangal]

> NameNode should optionally exit if it detects FsImage corruption
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.3
>
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, HDFS-13314.03.patch, HDFS-13314.04.patch, HDFS-13314.05.patch
>
> The NameNode should optionally exit after writing an FsImage if it detects
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
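The crash quoted in this comment comes from dereferencing the result of an inode lookup that returned null. As a rough illustration only, the sketch below shows how a validation pass could count such dangling references instead of failing with a NullPointerException. It is not the code from the patch: the class name, the method name, and the use of a plain map in place of the NameNode's inode directory are all assumptions made to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in, not Hadoop code: counts references whose target inode is missing. */
final class DanglingRefCheck {

  /** Returns how many of the referred ids have no entry in the inode map. */
  static int countDanglingReferences(Map<Long, String> inodeMap, List<Long> referredIds) {
    int numErrors = 0;
    for (long id : referredIds) {
      // A null lookup result is the condition behind the NPE in the quoted loader line.
      if (inodeMap.get(id) == null) {
        System.err.println("Reference points to non-existent inode, id=" + id);
        ++numErrors;
      }
    }
    return numErrors;
  }

  public static void main(String[] args) {
    Map<Long, String> inodes = new HashMap<>();
    inodes.put(1L, "dirA");
    inodes.put(2L, "fileB");
    // Id 3 has no inode, so exactly one dangling reference is reported.
    System.out.println(countDanglingReferences(inodes, Arrays.asList(1L, 2L, 3L)));
  }
}
```

Counting errors rather than throwing lets the caller finish the scan and then decide what to do with a non-zero total, which matches the "detect, then optionally exit" shape described in the issue summary.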
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420978#comment-16420978 ]

Yongjun Zhang commented on HDFS-13314:

{quote}
Hi Yongjun, thanks for looking at the Jira! Please also post your comments in the Jira for support.
# Yes, we saw duplicate entries.
# The crash we saw was an NPE due to the referred INode being absent. The check looks for such dangling references. I don’t think we have seen a crash at the location you pointed out.
    private INodeReference loadINodeReference(
        INodeReferenceSection.INodeReference r) throws IOException {
      long referredId = r.getReferredId();
      INode referred = fsDir.getInode(referredId);
      *WithCount withCount = (WithCount) referred.getParentReference(); << Crashes here as referred is null.*
# We have not seen misordered entries yet. Also, the *!misordered* check was deliberate. Once there is one such entry, the whole list is compromised.
# The assertion actually results in a runtime exception which fails the request. However, we suspect that the list was somehow corrupted by other means, not by the insert call. We are not sure how it happened.

Let me know if you have any concerns or ideas for improving the checks. We can certainly do a follow-up Jira.
{quote}
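The check discussed in this exchange can be tried out in isolation. The sketch below walks a list that is supposed to be sorted, counts every repeated name as an error, and counts misordering only once, on the reasoning given above that a single misordered entry already compromises the whole list. It uses plain strings instead of INode objects, and the class and method names are hypothetical, so treat it as an illustration of the idea rather than the patch itself.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in, not Hadoop code: validates a supposedly sorted "deleted" list. */
final class DeletedListCheck {

  /** Counts duplicates individually and misordering at most once. */
  static int countErrors(List<String> deleted) {
    int numImageErrors = 0;
    String previous = null;
    boolean misordered = false;
    for (String name : deleted) {
      // Treat the first element as "in order" by convention.
      int result = previous == null ? -1 : previous.compareTo(name);
      if (result == 0) {
        // Same name twice in a row: report every occurrence.
        ++numImageErrors;
      } else if (result > 0 && !misordered) {
        // Previous entry sorts after this one: report once for the whole list.
        misordered = true;
        ++numImageErrors;
      }
      previous = name;
    }
    return numImageErrors;
  }

  public static void main(String[] args) {
    System.out.println(countErrors(Arrays.asList("a", "a", "b")));
    System.out.println(countErrors(Arrays.asList("c", "b", "a")));
  }
}
```

Note that duplicates are only caught when they are adjacent, which holds for a list that is otherwise sorted; a list that is both misordered and contains duplicates may under-count, and in that case the single misorder error is the signal that matters.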
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420973#comment-16420973 ]

Yongjun Zhang commented on HDFS-13314:

I had a couple of email exchanges with [~arpitagarwal].
{quote}
Hi Arpit,

Sorry I'm late catching up with https://issues.apache.org/jira/browse/HDFS-13314. Good work you did there! I have some questions and thoughts:

1. Did you observe duplicate entries in the deleted list below? I assume so, but would like to confirm just in case.

    List<INode> deleted = diff.getChildrenDiff().getDeletedUnmodifiable();

2. The fsimage loading code could crash here due to fsimage corruption:

    for (int refId : e.getRefChildrenList()) {
      INodeReference ref = refList.get(refId);
      addToParent(p, ref);
    }

Does your check cover that?

3. In the code below: if the list is sorted, how could misordering happen? Did you observe misordering at all, or is the check there just for completeness? Also, very minor: misordering seems to be detected only once because of the !misordered check.

    INode previousNode = null;
    boolean misordered = false;
    for (INode d : deleted) {
      // getBytes() may return null below, and that is okay.
      final int result = previousNode == null ? -1 :
          previousNode.compareTo(d.getLocalNameBytes());
      if (result == 0) {
        FSImage.LOG.error(
            "Name '" + d.getLocalName() + "' is repeated in the " +
            "'deleted' difflist of directory " +
            dir.getFullPathName() + ", INodeId=" + dir.getId());
        ++numImageErrors;
      } else if (result > 0 && !misordered) {
        misordered = true;
        ++numImageErrors;
      }
      previousNode = d;
    }

4. I assume the code above detects only duplicate entries. Can we have the check at the place where new entries are added to the deleted list? That way we would know the exact stack trace that caused the duplicate entries to be added, and probably additional information. Specifically, in the following code, can we change the AssertionError into a real exception?

    private void insert(final ListType type, final E element, final int i) {
      List<E> list = type == ListType.CREATED ? created : deleted;
      if (i >= 0) {
        throw new AssertionError("Element already exists: element=" + element
            + ", " + type + "=" + list);
      }
      if (list == null) {
        list = new ArrayList<E>(DEFAULT_ARRAY_INITIAL_CAPACITY);
        if (type == ListType.CREATED) {
          created = list;
        } else if (type == ListType.DELETED) {
          deleted = list;
        }
      }
      list.add(-i - 1, element);
    }

Thanks a lot.
--Yongjun
{quote}
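Question 4 in the email above asks whether the duplicate could be rejected with a real exception at insertion time. A minimal sketch of that idea follows, under stated assumptions: it works on a generic sorted list rather than Hadoop's Diff class, and both `SortedInsert` and `DuplicateElementException` are hypothetical names introduced here. The `-i - 1` arithmetic is the standard insertion-point encoding returned by `Collections.binarySearch`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in, not Hadoop code: sorted insert that rejects duplicates. */
final class SortedInsert {

  /** Hypothetical checked exception used in place of AssertionError. */
  static final class DuplicateElementException extends Exception {
    DuplicateElementException(String message) {
      super(message);
    }
  }

  /** Inserts element at its sorted position, or throws if it is already present. */
  static <E extends Comparable<E>> void insert(List<E> sorted, E element)
      throws DuplicateElementException {
    // binarySearch returns an index >= 0 when found, otherwise -(insertionPoint) - 1.
    int i = Collections.binarySearch(sorted, element);
    if (i >= 0) {
      // The stack trace of this exception identifies the caller that added the duplicate.
      throw new DuplicateElementException(
          "Element already exists: element=" + element + ", list=" + sorted);
    }
    sorted.add(-i - 1, element);
  }

  public static void main(String[] args) throws DuplicateElementException {
    List<String> names = new ArrayList<>();
    insert(names, "b");
    insert(names, "a");
    System.out.println(names);
  }
}
```

Whether a checked or unchecked exception is the better fit depends on how callers are expected to react; the email only asks for "a real exception", so the checked variant here is one possible reading.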
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419417#comment-16419417 ]

Tsz Wo Nicholas Sze commented on HDFS-13314:

{quote}
The test case would protect this feature if someone in future removes/modifies this if statement.
{quote}
[~shahrs87], yes and no, since that "someone" may well modify the tests too. The protection is very weak.
{quote}
Almost all of the code contains if, while or for statements. That doesn't mean it needs no test cases.
{quote}
Do you mean that you are testing all if, while or for statements in all your code? Wow, unbelievable!
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419307#comment-16419307 ]

Arpit Agarwal commented on HDFS-13314:

bq. In most of the patches that I have submitted, writing test cases took majority of time compared to writing production code but that doesn't give free pass to commit.
Rushabh, please try to be polite or at the very least be civil to your fellow contributors. Your tone has been confrontational since the start of this Jira without any reasonable cause for it.

bq. writing test cases took majority of time compared to writing production code but that doesn't give free pass to commit
Good for you and we have all written our share of unit tests in the past. This is not just about the amount of effort involved, but also about the risk of refactoring existing code for questionable benefit.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419273#comment-16419273 ]

Rushabh S Shah commented on HDFS-13314:

bq. We test this code path since we have many unit tests that exercise saveNamespace. e.g. multiple test cases in TestFSImage would fail if NN exits when numErrors == 0.
Agreed.

bq. No easy way to do so without refactoring existing classes or inserting some test hooks. I don't think it is worth the effort.
In most of the patches that I have submitted, writing test cases took the majority of the time compared to writing production code, but that doesn't give a free pass to commit.

bq. More details: The suggested unit test sounds like that we should test whether the if-statement in Java is working properly.
[~szetszwo], the test case would protect this feature if someone in future removes or modifies this _if statement_. Almost all code contains if, while or for statements. That doesn't mean it needs no test cases. But it looks like it's a little late now for this comment.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418042#comment-16418042 ]

Hudson commented on HDFS-13314:

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13896 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13896/])
HDFS-13314. NameNode should optionally exit if it detects FsImage (arp: rev a991e899fb9f98d2089f37ac9ac7c485d3bbb959)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormatProtobuf.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/FSImageFormatPBSnapshot.java
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416260#comment-16416260 ]

Arpit Agarwal commented on HDFS-13314:

bq. if numErrors == 0 then namenode should not exit.
We test this code path since we have many unit tests that exercise saveNamespace. e.g. multiple test cases in TestFSImage would fail if NN exits when numErrors == 0.

bq. if numErrors > 0 then namenode should exit.
No easy way to do so without refactoring or inserting test hooks. I don't think it is worth the effort. Do you feel it is a blocker to committing this validation?
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416209#comment-16416209 ]

Tsz Wo Nicholas Sze commented on HDFS-13314:

More details: the suggested unit test sounds like we would be testing whether the if-statement in Java works properly.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416199#comment-16416199 ]

Tsz Wo Nicholas Sze commented on HDFS-13314:

[~shahrs87], imho, the unit test you suggested does not sound useful. Thanks.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416046#comment-16416046 ]

Rushabh S Shah commented on HDFS-13314:

bq. perhaps we could do some ugly fault injection to create dangling references and corrupted diff lists in the image.
I am not that concerned with a test case showing how the image gets corrupted. I would like to see a test case that verifies the following 2 scenarios:
* if numErrors == 0 then namenode should not exit.
* if numErrors > 0 then namenode should exit.

Quickly going through the ExitUtil class, I see you can use {{ExitUtil#disableSystemExit}}. With system exit disabled, {{ExitUtil#terminate}} will throw an {{ExitException}} instead of exiting, and the exception is saved somewhere in the ExitUtil class.
Hope this helps.
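The two scenarios listed in this comment can be illustrated with a small self-contained sketch. To avoid depending on Hadoop, the example does not call the real ExitUtil; `ExitSketch` below is a hypothetical stand-in that only imitates the behaviour described in the comment (with system exit disabled, terminating throws an exception instead of stopping the JVM). The `exitIfErrors` method is likewise an assumed shape for the check under discussion, not the patch's code.

```java
/** Hypothetical stand-in imitating the exit-disabling pattern; not Hadoop's ExitUtil. */
final class ExitSketch {

  /** Thrown instead of exiting when system exit has been disabled. */
  static final class ExitException extends RuntimeException {
    final int status;

    ExitException(int status, String message) {
      super(message);
      this.status = status;
    }
  }

  private static boolean systemExitDisabled = false;

  /** Test hook: after this call, terminate() throws rather than stopping the JVM. */
  static void disableSystemExit() {
    systemExitDisabled = true;
  }

  static void terminate(int status, String message) {
    if (systemExitDisabled) {
      throw new ExitException(status, message);
    }
    System.exit(status);
  }

  /** Assumed shape of the check: exit only when errors were detected. */
  static void exitIfErrors(int numErrors) {
    if (numErrors > 0) {
      terminate(1, "Detected " + numErrors + " errors while saving FsImage");
    }
  }

  public static void main(String[] args) {
    disableSystemExit();
    exitIfErrors(0); // scenario 1: returns normally
    try {
      exitIfErrors(2); // scenario 2: would exit, so it throws here
    } catch (ExitException e) {
      System.out.println("exit requested with status " + e.status);
    }
  }
}
```

A real test along these lines would still need a way to make the save path report a non-zero error count, which is the part the thread describes as requiring refactoring or test hooks.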
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415919#comment-16415919 ]

Arpit Agarwal commented on HDFS-13314:

Thanks for reviewing the patch. I don't see an easy way to write a test case; perhaps we could do some ugly fault injection to create dangling references and corrupted diff lists in the image. If you have any ideas to test it more easily or want to contribute some test code, I'd be happy to incorporate either.

Meanwhile, I've manually tested the patch with a corrupted image: I modified the loading code to ignore the assertions while loading, then tried to save the image with this patch applied and verified that the error conditions were detected.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415743#comment-16415743 ]

Rushabh S Shah commented on HDFS-13314:

Overall the changes look good. I would like to see a test case for this change.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415676#comment-16415676 ]

Rushabh S Shah commented on HDFS-13314:

I will review it today.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414882#comment-16414882 ]

Tsz Wo Nicholas Sze commented on HDFS-13314:

+1 the 05 patch looks good.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410615#comment-16410615 ]

genericqa commented on HDFS-13314:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 24m 14s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 23m 24s | trunk passed |
| +1 | compile | 0m 51s | trunk passed |
| +1 | checkstyle | 0m 41s | trunk passed |
| +1 | mvnsite | 0m 55s | trunk passed |
| +1 | shadedclient | 10m 37s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 44s | trunk passed |
| +1 | javadoc | 0m 43s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 52s | the patch passed |
| +1 | compile | 0m 48s | the patch passed |
| +1 | javac | 0m 48s | the patch passed |
| +1 | checkstyle | 0m 41s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 86 unchanged - 1 fixed = 86 total (was 87) |
| +1 | mvnsite | 0m 55s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 9m 52s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 43s | the patch passed |
| +1 | javadoc | 0m 40s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 100m 57s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 24s | The patch does not generate ASF License warnings. |
| | | 179m 45s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8620d2b |
| JIRA Issue | HDFS-13314 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12915758/HDFS-13314.05.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux d7832a5bf337 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 8620d2b |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/23630/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/23630/testReport/ |
| Max. process+thread count | 3838 (vs. ulimit of 1) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410236#comment-16410236 ]

Arpit Agarwal commented on HDFS-13314:

v05 patch: Remove the config key which made this behavior optional.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409920#comment-16409920 ]

Arpit Agarwal commented on HDFS-13314:

bq. not halting the NN risks removing the only good image.
The defaults will keep the good image, bad image and all edits in between. Since the Standby cannot be restarted unless the problem is manually worked around or fixed, no new checkpoints will be generated.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409895#comment-16409895 ] Arpit Agarwal commented on HDFS-13314:
bq. Yes, no config option. Detected corruption = unconditional hard stop.
Ok, will do.
bq. The in-memory state is corrupt but the edit stream (hopefully) isn't. Which is easier to do: Hack up the NN to attempt to load the bad image? Or replay a partial edit stream perhaps w/o the snapshot removal?
I'd suggest that the former is safer as it exposes the problem quicker, e.g. if the administrator has configured auto-restart, which many of our customers do. If we don't write an image, the NN shuts down but it can be restarted easily and continue to run with corrupted state (note that we don't yet know how to detect the corruption when replaying edit logs).
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409860#comment-16409860 ] Daryn Sharp commented on HDFS-13314:
bq. Is there anything you suggest doing differently?
Yes, no config option. Detected corruption = unconditional hard stop.
bq. Once we get to this point, the metadata is already corrupt. Writing out a new FsImage doesn't make it any worse because replaying the prior image and edits would lead to the same state.
The in-memory state is corrupt but the edit stream (hopefully) isn't. Which is easier to do: Hack up the NN to attempt to load the bad image? Or replay a partial edit stream perhaps w/o the snapshot removal? If you agree to the latter, then as Rushabh pointed out, _not_ halting the NN risks removing the only good image. The defaults allow at most 2 hours (2 images retained, 1h checkpoint interval unless max edits exceeded). Running in the corrupted state risks data loss. As cited above, the original report of this bug resulted in the NN causing *9300 missing blocks*.
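The "at most 2 hours" figure above follows from simple arithmetic, sketched here under stated assumptions. The helper class and method names are hypothetical; the two inputs are the values quoted in the comment (2 images retained, 1h checkpoint interval), and the sketch ignores the "max edits exceeded" trigger, which can only shorten the window.

```java
import java.time.Duration;

/** Hypothetical helper; only reproduces the arithmetic quoted in the comment above. */
final class CheckpointWindowSketch {

  /**
   * Upper bound on how long the last good image survives once every new
   * checkpoint is bad: each checkpoint adds one image, and the oldest image is
   * dropped once more than {@code imagesRetained} exist.
   */
  static Duration worstCaseWindow(int imagesRetained, Duration checkpointPeriod) {
    return checkpointPeriod.multipliedBy(imagesRetained);
  }

  public static void main(String[] args) {
    // Values quoted in the thread: 2 images retained, 1h checkpoint interval.
    System.out.println(worstCaseWindow(2, Duration.ofHours(1)));
  }
}
```

With 12-hour checkpoints, as in the deployment Rushabh describes elsewhere in the thread, the same calculation would give 24 hours.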
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409657#comment-16409657 ] Arpit Agarwal commented on HDFS-13314:
Thanks [~szetszwo]. I'll hold off committing in case [~daryn] or Rushabh have further comments.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409463#comment-16409463 ] genericqa commented on HDFS-13314:
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 21s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 20m 29s | trunk passed |
| +1 | compile | 1m 0s | trunk passed |
| +1 | checkstyle | 1m 0s | trunk passed |
| +1 | mvnsite | 1m 5s | trunk passed |
| +1 | shadedclient | 12m 25s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 6s | trunk passed |
| +1 | javadoc | 0m 53s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 7s | the patch passed |
| +1 | compile | 0m 59s | the patch passed |
| +1 | javac | 0m 59s | the patch passed |
| +1 | checkstyle | 0m 55s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 518 unchanged - 1 fixed = 518 total (was 519) |
| +1 | mvnsite | 1m 4s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 35s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 13s | the patch passed |
| +1 | javadoc | 0m 46s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 78m 42s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 24s | The patch does not generate ASF License warnings. |
| | | 136m 44s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.tools.TestHdfsConfigFields |
| | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HDFS-13314 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12915540/HDFS-13314.04.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux e584a90790f8 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 8d898ab |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/23621/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/23621/testReport/ |
| Max. process+thread count | 4184 (vs.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408443#comment-16408443 ] Arpit Agarwal commented on HDFS-13314:
Thanks [~szetszwo]. The v4 patch removes savedImage and addresses Jenkins failures.
bq. Question: why use ExitUtil.terminate(-1) rather than throwing an IOException?
I want to guarantee process exit. I don't want the exception to be swallowed further up the call stack.
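The concern about an exception being swallowed can be shown with a small, self-contained sketch. All names here are hypothetical, and the `Halter` interface is an assumed stand-in for a hard process exit such as the `ExitUtil.terminate(-1)` call mentioned above, so that the example can run without actually killing the JVM. It is not the code from the patch.

```java
import java.io.IOException;

/** Hypothetical illustration of "throw" versus "exit" on detected corruption. */
final class ExitVersusThrowSketch {

  /** Assumed stand-in for a hard process exit; a real one would not return. */
  interface Halter {
    void halt(int status, String reason);
  }

  /** Option A: report corruption with a checked exception. */
  static void finishSaveByThrowing(boolean corruptionDetected) throws IOException {
    if (corruptionDetected) {
      throw new IOException("FsImage corruption detected");
    }
  }

  /** Option B: report corruption by asking the halter to stop the process. */
  static void finishSaveByHalting(boolean corruptionDetected, Halter halter) {
    if (corruptionDetected) {
      halter.halt(-1, "FsImage corruption detected");
    }
  }

  /** A caller that catches and ignores the exception, so the process keeps running. */
  static boolean callerThatSwallows(boolean corruptionDetected) {
    try {
      finishSaveByThrowing(corruptionDetected);
    } catch (IOException e) {
      // Logged and ignored: nothing forces the process to stop here.
    }
    return true; // still running
  }

  public static void main(String[] args) {
    System.out.println("still running after throw: " + callerThatSwallows(true));
    finishSaveByHalting(true, (status, reason) ->
        System.out.println("would exit with status " + status + ": " + reason));
  }
}
```

With option A, whether the process stops depends on every caller up the stack; with option B the decision is made at the point of detection, which is the guarantee asked for in the comment.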
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407400#comment-16407400 ] Tsz Wo Nicholas Sze commented on HDFS-13314:
[~arpitagarwal], thanks for the update. I have one minor comment and a question:
- In saveNamespace, savedImage is always true in the if-statement at the end, so we can remove it.
- Question: why use ExitUtil.terminate(-1) rather than throwing an IOException?
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407364#comment-16407364 ] genericqa commented on HDFS-13314:
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 24s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 15m 34s | trunk passed |
| +1 | compile | 0m 47s | trunk passed |
| +1 | checkstyle | 0m 45s | trunk passed |
| +1 | mvnsite | 0m 55s | trunk passed |
| +1 | shadedclient | 10m 26s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 37s | trunk passed |
| +1 | javadoc | 0m 47s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 51s | the patch passed |
| +1 | compile | 0m 46s | the patch passed |
| +1 | javac | 0m 46s | the patch passed |
| -0 | checkstyle | 0m 46s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 5 new + 517 unchanged - 1 fixed = 522 total (was 518) |
| +1 | mvnsite | 0m 52s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | shadedclient | 9m 55s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 45s | the patch passed |
| -1 | javadoc | 0m 42s | hadoop-hdfs-project_hadoop-hdfs generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) |
|| || || || Other Tests ||
| -1 | unit | 105m 15s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | | 152m 5s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |
| | hadoop.tools.TestHdfsConfigFields |
| | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HDFS-13314 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12915379/HDFS-13314.03.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 691ce0556be8 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 138c1ed |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| checkstyle |
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407196#comment-16407196 ] Arpit Agarwal commented on HDFS-13314:
bq. How is the "safe" choice to knowingly write a corrupt image? One that renders the NN incapable of starting up?
Once we get to this point, the metadata is already corrupt. Writing out a new FsImage doesn't make it any worse because replaying the prior image and edits would lead to the same state.
bq. We need to immediately do a full stop anytime data structures are known to be corrupt.
That is my intention. Looks like you've gone over the patch. Is there anything you suggest doing differently?
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407124#comment-16407124 ] Daryn Sharp commented on HDFS-13314:
I think Rushabh thought the "don't exit" option didn't delete prior images and edits after checkpointing, based on the claim +"The purge step is skipped if a bad image was written"+. However, the code appears to only skip the purge if it's configured to shut down on corruption.
{quote}
bq. I'm inclined to agree with Arpit that we should not change the default behavior, though. In the extreme case where someone really wants the checkpoint done (e.g. has not checkpointed for a long time so lots of edits, etc.), keeping the old behavior seems better - you cannot let them reconfigure and do it again.
Yes, this is a good explanation. Not changing the default, and ensuring we write a new image is the safe choice.
{quote}
No, that is a terrible explanation. How is the "safe" choice to knowingly write a corrupt image? One that renders the NN incapable of starting up? There's no "reconfigure" your way out of that. How is it safe to allow the NN to start obliterating data? (see HDFS-9406, 9300 blocks invalidated).
bq. I think it may also be possible if the workflow deletes a bunch of stuff (e.g. the problematic snapshot, parent dir, etc.), and checkpoint, the corruption may not happen at all - although this is an untested guess.
Wishful thinking + data durability = Russian data roulette. I'd predict data loss due to incorrect invalidations, further corruption of the in-memory state, probably corrupted edits, and eventual crash. We need to immediately do a full stop anytime data structures are known to be corrupt.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406982#comment-16406982 ] Arpit Agarwal commented on HDFS-13314:
v03 patch addresses feedback from [~szetszwo].
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406819#comment-16406819 ] Arpit Agarwal commented on HDFS-13314:
[~shahrs87] I am unsure how your question relates to this change. If you were affected by the snapshots bug, your NN would already be writing corrupt images.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406798#comment-16406798 ] Rushabh S Shah commented on HDFS-13314:
bq. Checkpointing is done by the standby.
But I don't need to restart the standby to do the checkpointing. Standby does the checkpointing every 12 hours (in our case). It will continue to write the corrupted image every 12 hours (and old images won't be purged) if we don't upgrade the cluster in a month. Am I missing something?
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406753#comment-16406753 ] Arpit Agarwal commented on HDFS-13314:
Thanks for the look [~xiaochen].
bq. I'm inclined to agree with Arpit that we should not change the default behavior, though. In the extreme case where someone really wants the checkpoint done (e.g. has not checkpointed for a long time so lots of edits, etc.), keeping the old behavior seems better - you cannot let them reconfigure and do it again.
Yes, this is a good explanation. Not changing the default, and ensuring we write a new image is the safe choice.
Also thanks for the feedback [~szetszwo], I will post an updated patch shortly to address your comments. Running some sanity tests first.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406741#comment-16406741 ] Arpit Agarwal commented on HDFS-13314:
bq. I don't understand why it is impossible. Why do I need to restart the standby?
Checkpointing is done by the standby.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406429#comment-16406429 ] Rushabh S Shah commented on HDFS-13314:
bq. Impossible, as you will need to restart the standby to get any new checkpoints. And it will fail to restart because the latest fsimage is bad.
I don't understand why it is impossible. Why do I need to restart the standby?
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405890#comment-16405890 ] Xiao Chen commented on HDFS-13314:
Thanks [~arpitagarwal] and all for the effort here. Also ping [~yzhangal] for interest.
I echo the difficulty and sometimes frustration of not being able to reproduce a corruption. The idea here sounds good.
I'm inclined to agree with Arpit that we should not change the default behavior, though. In the extreme case where someone really wants the checkpoint done (e.g. has not checkpointed for a long time so lots of edits, etc.), keeping the old behavior seems better - you cannot let them reconfigure and do it again. I think it may also be possible if the workflow deletes a bunch of stuff (e.g. the problematic snapshot, parent dir, etc.), and checkpoint, the corruption may not happen at all - although this is an untested guess.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405647#comment-16405647 ]

genericqa commented on HDFS-13314:
---

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 18s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 18m 53s | trunk passed |
| +1 | compile | 0m 56s | trunk passed |
| +1 | checkstyle | 0m 59s | trunk passed |
| +1 | mvnsite | 1m 1s | trunk passed |
| +1 | shadedclient | 12m 10s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 57s | trunk passed |
| +1 | javadoc | 0m 55s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 57s | the patch passed |
| +1 | compile | 0m 52s | the patch passed |
| +1 | javac | 0m 52s | the patch passed |
| -0 | checkstyle | 0m 54s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 518 unchanged - 1 fixed = 522 total (was 519) |
| +1 | mvnsite | 0m 58s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 4s | patch has no errors when building and testing our client artifacts. |
| -1 | findbugs | 2m 4s | hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| -1 | javadoc | 0m 52s | hadoop-hdfs-project_hadoop-hdfs generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) |
|| || || || Other Tests ||
| -1 | unit | 78m 18s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 23s | The patch does not generate ASF License warnings. |
| | | 133m 10s | |

|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
| | Inconsistent synchronization of org.apache.hadoop.hdfs.server.namenode.FSImage.exitAfterSave; locked 66% of time. Unsynchronized access at FSImage.java:[line 973] |
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
| | hadoop.tools.TestHdfsConfigFields |
| | hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HDFS-13314 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12915200/HDFS-13314.02.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 178676a1c9b8 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405556#comment-16405556 ]

Arpit Agarwal commented on HDFS-13314:
---

Hi Rushabh,

bq. You need to change the namenode code and build again to bypass

Correct.

bq. On top of that if we don't purge the old images in case of corruption, we will run out of disk space in 3-4 weeks.

Impossible, as you will need to restart the standby to get any new checkpoints. And it will fail to restart because the latest fsimage is bad.

bq. IMO instead of putting a hack in namenode, we should actively chase the bugs and root cause them.

This is not a hack - it is the equivalent of an {{assert}}. We have been chasing this bug for a few weeks without success. Other community members have been doing the same for a while (please see HDFS-13101 and comments on HDFS-9406). To root cause the problem, we need (1) a good image and (2) the sequence of write transactions causing the corruption.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405506#comment-16405506 ]

Rushabh S Shah commented on HDFS-13314:
---

bq. In the cases we ran into, the corrupted image was loadable after bypassing some checks during NameNode startup.

You need to change the namenode code and build again to bypass, or are you talking about some conf?

{quote}
The corruption was detected the next time a NameNode is restarted which may be weeks after it occurred.
The default value of dfs.namenode.num.checkpoints.retained is 2, so the older image is not lost.
The purge step is skipped if a bad image was written.
{quote}

As you mentioned, it takes a few weeks to detect that a bad image was written. Coming from a company that has a huge churn of write ops, we checkpoint at least every 12 hours and write an image of about 25 GB. On top of that, if we don't purge the old images in case of corruption, we will run out of disk space in 3-4 weeks.

IMO instead of putting a hack in namenode, we should actively chase the bugs and root cause them.
Also I still think the default value should be *to exit*.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405475#comment-16405475 ]

Tsz Wo Nicholas Sze commented on HDFS-13314:
---

Thanks [~arpitagarwal], some comments on the patch:

- Also print the fsimage file name in the log messages below.
{code:java}
+      LOG.error("Detected " + numErrors + " errors while saving FsImage.");
{code}
{code:java}
+      LOG.fatal("NameNode process will exit now... The saved FsImage is " +
+          "potentially corrupted.");
{code}
- Include numErrors in the log message below.
{code:java}
+    long numErrors = saveInternal(fout, compression, file.getAbsolutePath());
     LOG.info("Image file {} of size {} bytes saved in {} seconds.", file,
         file.length(), (monotonicNow() - startTime) / 1000);
+    return numErrors;
{code}
- Print the full path in the log message below.
{code:java}
+        FSImage.LOG.error("FSImageFormatPBSnapshot: Missing referred INodeId " +
+            ref.getId() + " for INodeReference index " + refIndex);
{code}
- Let's check all INodes, not only INodeReference. Also, let's use compareTo so that out-of-order cases are detected as well.
{code:java}
INode previous = null;
for (INode d : deleted) {
  if (previous != null) {
    // Entries must be strictly increasing by name.
    final int cmp = d.compareTo(previous.getLocalNameBytes());
    if (cmp <= 0) {
      final String err = cmp == 0 ? "repeated" : "out-of-order";
      FSImage.LOG.error("Names " + err
          + " in the 'deleted' difflist of directory " ...);
      ++numImageErrors;
    }
  }
  previous = d;
}
{code}
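The ordering check suggested in the last item can be tried outside HDFS. The following is only a minimal sketch under stated assumptions: names are plain byte arrays compared lexicographically as unsigned bytes, and {{DeletedListCheck}}, {{compareNames}} and {{findOrderingErrors}} are hypothetical names that do not appear in the patch or in HDFS.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone checker; not part of HDFS or of the patch. */
final class DeletedListCheck {
  private DeletedListCheck() {}

  /** Lexicographic comparison of two names, treating bytes as unsigned (assumption). */
  static int compareNames(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = (a[i] & 0xff) - (b[i] & 0xff);
      if (cmp != 0) {
        return cmp;
      }
    }
    return a.length - b.length;
  }

  /** Returns one message per entry that repeats or sorts before its predecessor. */
  static List<String> findOrderingErrors(List<byte[]> names) {
    List<String> errors = new ArrayList<>();
    byte[] previous = null;
    for (byte[] current : names) {
      if (previous != null) {
        int cmp = compareNames(current, previous);
        if (cmp <= 0) {
          String kind = (cmp == 0) ? "repeated" : "out-of-order";
          errors.add(kind + ": " + new String(current, StandardCharsets.UTF_8));
        }
      }
      previous = current;
    }
    return errors;
  }

  public static void main(String[] args) {
    List<byte[]> names = new ArrayList<>();
    for (String s : new String[] {"a", "b", "b", "a", "c"}) {
      names.add(s.getBytes(StandardCharsets.UTF_8));
    }
    // Flags the second "b" as repeated and the "a" after it as out-of-order.
    findOrderingErrors(names).forEach(System.out::println);
  }
}
```

Comparing each entry only with its immediate predecessor keeps the check linear in the list length, which is why it is cheap enough to run on every image save.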
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405451#comment-16405451 ]

Arpit Agarwal commented on HDFS-13314:
---

Thanks for the look Rushabh. In the cases we ran into, the corrupted image was loadable after bypassing some checks during NameNode startup. In any case, there are a couple of safeguards against losing the older image:
# The default value of {{dfs.namenode.num.checkpoints.retained}} is 2, so the older image is not lost.
# The purge step is skipped if a bad image was written.

bq. We can default to always exit and if someone is ok having their image corrupted then they can set it to not exit.

I am okay with changing the default behavior to just exit if corruption is detected, if the community is fine with it. I expect leaving the default behavior unchanged will be less controversial.
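The two safeguards can be modeled in a few lines. This is a rough sketch rather than the NameNode's actual retention code: it assumes images are identified by transaction id, that the retention count corresponds to {{dfs.namenode.num.checkpoints.retained}}, and that the error count comes from the latest save. {{CheckpointPurgeSketch}} and {{imagesToPurge}} are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of the retention policy; not the NameNode's actual code. */
final class CheckpointPurgeSketch {
  private CheckpointPurgeSketch() {}

  /**
   * Returns the transaction ids of the images that may be deleted.
   * Nothing is purged when the latest save reported errors, so every
   * older image stays available for recovery.
   */
  static List<Long> imagesToPurge(List<Long> imageTxIds, int numRetained,
      long numErrorsInLatestSave) {
    if (numErrorsInLatestSave > 0) {
      return Collections.emptyList();
    }
    List<Long> sorted = new ArrayList<>(imageTxIds);
    Collections.sort(sorted);
    // Keep the newest numRetained images and purge everything older.
    int purgeCount = Math.max(0, sorted.size() - numRetained);
    return new ArrayList<>(sorted.subList(0, purgeCount));
  }

  public static void main(String[] args) {
    List<Long> images = Arrays.asList(100L, 200L, 300L);
    System.out.println(imagesToPurge(images, 2, 0)); // clean save
    System.out.println(imagesToPurge(images, 2, 1)); // save reported an error
  }
}
```

With a retention count of 2, a clean save leaves the two newest images in place, and a save that reported errors leaves all of them.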
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405429#comment-16405429 ]

Rushabh S Shah commented on HDFS-13314:
---

Just curious: why do we want to go ahead and still write the image if we know we won't be able to recover? IMO, it *should always* exit so that we have an earlier good image to recover from.
We can default to always exit and if someone is ok having their image corrupted then they can set it to _not exit_.
[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption
[ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405373#comment-16405373 ]

Arpit Agarwal commented on HDFS-13314:
---

We've seen two FsImage corruption symptoms correlated with heavy usage of HDFS snapshots.
# Dangling INodeReferences (likely the same as HDFS-13101)
# Duplicate entries in snapshot diff list (this may have been caused by attempting to work around #1)

This usually occurs when someone has dozens of snapshots on a large directory e.g. {{/}}, or {{/apps/hive/warehouse}}. We have not been able to repro the problem with load testing yet.

The corruption is detected the next time a NameNode is restarted, which may be weeks after it occurred. Since both problems can be trivially detected while writing the FsImage, this patch proposes that the NameNode self-terminate after writing a bad image.
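To illustrate why the dangling-reference case is cheap to detect at save time, here is a minimal sketch. It assumes the set of live INode ids and the list of ids referred to by INodeReferences are both at hand while the image is written; the real patch works on HDFS internal structures instead. {{ImageSaveCheckSketch}}, {{countDanglingReferences}} and {{shouldExitAfterSave}} are hypothetical names, and the boolean stands in for the configuration setting mentioned in the issue description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the save-time check; not the patch itself. */
final class ImageSaveCheckSketch {
  private ImageSaveCheckSketch() {}

  /** Counts references whose referred INode id is not among the known ids. */
  static long countDanglingReferences(Set<Long> knownInodeIds, List<Long> referredIds) {
    long numErrors = 0;
    for (long referredId : referredIds) {
      if (!knownInodeIds.contains(referredId)) {
        numErrors++;
      }
    }
    return numErrors;
  }

  /** Exit only when the operator opted in and the save found at least one error. */
  static boolean shouldExitAfterSave(long numErrors, boolean exitOnCorruption) {
    return exitOnCorruption && numErrors > 0;
  }

  public static void main(String[] args) {
    Set<Long> known = new HashSet<>(Arrays.asList(1L, 2L, 3L));
    List<Long> referred = Arrays.asList(2L, 7L); // id 7 has no INode
    long numErrors = countDanglingReferences(known, referred);
    System.out.println("errors=" + numErrors
        + " exit=" + shouldExitAfterSave(numErrors, true));
  }
}
```

Keeping the exit decision separate from the counting means the errors can still be logged when the setting is left at its disabled default.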