[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2020-12-10 Thread Hemanth Boyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247202#comment-17247202
 ] 

Hemanth Boyina commented on HDFS-13314:
---

Thanks for the discussion here.
{quote}
WithCount withCount = (WithCount) referred.getParentReference();
   << Crashes here as referred is null.
{quote}
We have run into this issue as well: we hit a NullPointerException and the 
fsimages are corrupted.

Have you filed a follow-up Jira, or do you have any suggestions for this, [~arp] 
[~yzhangal]?

 

 

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.3
>
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch, HDFS-13314.05.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-30 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420978#comment-16420978
 ] 

Yongjun Zhang commented on HDFS-13314:
--

{quote}

Hi Yongjun, thanks for looking at the Jira! Please post your comments in the 
Jira also for support. 

 
 # Yes we saw duplicate entries.
 # The crash we saw was an NPE due to the referred INode being absent. The check 
looks for such dangling references. I don’t think we have seen a crash at the 
location you pointed out.

    private INodeReference loadINodeReference(
        INodeReferenceSection.INodeReference r) throws IOException {
      long referredId = r.getReferredId();
      INode referred = fsDir.getInode(referredId);
      *WithCount withCount = (WithCount) referred.getParentReference();
      << Crashes here as referred is null.*
 # We have not seen misordered entries yet. Also, the *!misordered* check was 
deliberate. Once there is one such entry the whole list is compromised.
 # The Assertion actually results in a runtime exception which fails the 
request. However we suspect that the list was somehow corrupted by other means, 
not the insert call. We are not sure how it happened.

 

Let me know if you have any concerns or ideas for improving the checks. We can 
certainly do a follow up jira.

{quote}
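
A minimal, self-contained sketch of the dangling-reference check described in 
item 2 above. It is not the code from the patch: the class DanglingReferenceCheck, 
the method countDanglingReferences and the map standing in for the inode 
directory are hypothetical names chosen for illustration. It only shows the idea 
of testing the lookup result for null before dereferencing it, so that a missing 
inode is counted as an error instead of causing a NullPointerException.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DanglingReferenceCheck {
  /** Hypothetical stand-in for an inode; only the id matters here. */
  public static final class Inode {
    public final long id;
    public Inode(long id) { this.id = id; }
  }

  /**
   * Counts references whose referred inode id is absent from the map.
   * Checking the lookup result for null first turns a would-be
   * NullPointerException into a countable, loggable error.
   */
  public static int countDanglingReferences(Map<Long, Inode> inodesById,
                                            List<Long> referredIds) {
    int numErrors = 0;
    for (long referredId : referredIds) {
      Inode referred = inodesById.get(referredId);
      if (referred == null) {
        System.err.println(
            "Reference points to non-existent inode, id=" + referredId);
        ++numErrors;
      }
    }
    return numErrors;
  }

  public static void main(String[] args) {
    Map<Long, Inode> inodesById = new HashMap<>();
    inodesById.put(1L, new Inode(1L));
    inodesById.put(2L, new Inode(2L));
    // Ids 3 and 7 are not in the map, so two dangling references are counted.
    int errors = countDanglingReferences(
        inodesById, Arrays.asList(1L, 3L, 2L, 7L));
    System.out.println("dangling references: " + errors);
  }
}
```

Counting rather than throwing lets the caller decide what to do once the whole 
image has been scanned, which matches the "exit after writing" behaviour in the 
issue description.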

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.2
>
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch, HDFS-13314.05.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-30 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420973#comment-16420973
 ] 

Yongjun Zhang commented on HDFS-13314:
--

I had a couple of email exchanges with [~arpitagarwal]:
{quote}
Hi Arpit,

Sorry I'm late in catching up on:

https://issues.apache.org/jira/browse/HDFS-13314

Good work you did there!

I have some questions and thoughts:

1. Did you observe duplicate entries in the deleted list below? I assume so, but 
would like to confirm.

List<INode> deleted = diff.getChildrenDiff().getDeletedUnmodifiable();

2. The fsimage loading code could crash here:

    for (int refId : e.getRefChildrenList()) {
      INodeReference ref = refList.get(refId);
      addToParent(p, ref);
    }

due to fsimage corruption. Does your checking cover that?

3. In the code below: if the list is sorted, how could misordering happen? Did 
you actually observe misordered entries, or is the check there for 
completeness? Also, a very minor point: misordering is only counted once, 
because of the !misordered check. 

    INode previousNode = null;
    boolean misordered = false;
    for (INode d : deleted) {
      // getBytes() may return null below, and that is okay.
      final int result = previousNode == null ? -1 :
          previousNode.compareTo(d.getLocalNameBytes());
      if (result == 0) {
        FSImage.LOG.error(
            "Name '" + d.getLocalName() + "' is repeated in the " +
            "'deleted' difflist of directory " +
            dir.getFullPathName() + ", INodeId=" + dir.getId());
        ++numImageErrors;
      } else if (result > 0 && !misordered) {
        misordered = true;
        ++numImageErrors;
      }
      previousNode = d;

4. I assume the code above detects only duplicate entries. Could we also have a 
check at the place where new entries are added to the deleted list? That way we 
would know the exact stack trace that caused duplicate entries to be added, and 
probably get additional information.

Specifically, in the following code, could we change the AssertionError into a 
real exception?

  private void insert(final ListType type, final E element, final int i) {
    List<E> list = type == ListType.CREATED ? created : deleted;
    if (i >= 0) {
      throw new AssertionError("Element already exists: element=" + element
          + ", " + type + "=" + list);
    }
    if (list == null) {
      list = new ArrayList<E>(DEFAULT_ARRAY_INITIAL_CAPACITY);
      if (type == ListType.CREATED) {
        created = list;
      } else if (type == ListType.DELETED) {
        deleted = list;
      }
    }
    list.add(-i - 1, element);
  }


Thanks a lot.

--Yongjun
{quote}
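
The duplicate and misorder check discussed in items 3 and 4 can be tried out in 
isolation with a small sketch like the one below. It is a simplified stand-in, 
not the Hadoop code: it compares plain strings instead of INode name bytes, and 
the names DeletedListCheck and countListErrors are hypothetical. As in the 
quoted loop, each adjacent duplicate is counted, while misordering is counted at 
most once.

```java
import java.util.Arrays;
import java.util.List;

public class DeletedListCheck {
  /**
   * Scans a list that is expected to be sorted and duplicate-free and
   * returns the number of problems found. Each adjacent duplicate counts
   * once; misordering is counted at most once for the whole list, because
   * a single out-of-order entry already means the list cannot be trusted.
   */
  public static int countListErrors(List<String> deleted) {
    int numErrors = 0;
    boolean misordered = false;
    String previous = null;
    for (String name : deleted) {
      // With no previous element, treat the pair as correctly ordered.
      int result = previous == null ? -1 : previous.compareTo(name);
      if (result == 0) {
        ++numErrors;            // same name twice in a row
      } else if (result > 0 && !misordered) {
        misordered = true;      // first out-of-order pair
        ++numErrors;
      }
      previous = name;
    }
    return numErrors;
  }

  public static void main(String[] args) {
    System.out.println(countListErrors(Arrays.asList("a", "b", "c")));
    System.out.println(countListErrors(Arrays.asList("a", "b", "b", "c")));
    System.out.println(countListErrors(Arrays.asList("c", "a", "b", "a")));
  }
}
```

Note that comparing neighbours only finds duplicates that sit next to each 
other, so in a misordered list some duplicates can go unnoticed. That is one 
reason to flag the misordering itself as an error.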






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-29 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419417#comment-16419417
 ] 

Tsz Wo Nicholas Sze commented on HDFS-13314:


{quote}
 The test case would protect this feature if someone in future removes/modifies 
this if statement.
{quote}
[~shahrs87], yes and no, since that "someone" may also modify the tests.  
The protection is very weak.
{quote}
Almost all of the code contains if, while or for statements. That doesn't mean 
it needs no test cases.
{quote}
Do you mean that you are testing all if, while or for statements in all your 
code?  Wow, unbelievable!





[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-29 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419307#comment-16419307
 ] 

Arpit Agarwal commented on HDFS-13314:
--

bq. In most of the patches that I have submitted, writing test cases took 
majority of time compared to writing production code but that doesn't give free 
pass to commit.
Rushabh, please try to be polite or at the very least be civil to your fellow 
contributors. Your tone has been confrontational since the start of this Jira 
without any reasonable cause for it.

bq. writing test cases took majority of time compared to writing production 
code but that doesn't give free pass to commit
Good for you and we have all written our share of unit tests in the past. This 
is not just about the amount of effort involved, but also about the risk of 
refactoring existing code for questionable benefit.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-29 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419273#comment-16419273
 ] 

Rushabh S Shah commented on HDFS-13314:
---

bq. We test this code path since we have many unit tests that exercise 
saveNamespace. e.g. multiple test cases in TestFSImage would fail if NN exits 
when numErrors == 0.
Agreed.
bq. No easy way to do so without refactoring existing classes or inserting some 
test hooks. I don't think it is worth the effort.
In most of the patches that I have submitted, writing test cases took majority 
of time compared to writing production code but that doesn't give free pass to 
commit.

bq. More details: The suggested unit test sounds like that we should test 
whether the if-statement in Java is working properly.
[~szetszwo], The test case would protect this feature if someone in future 
removes/modifies this _if statement_.
Almost all of the code contains if, while or for statements. That doesn't mean 
it needs no test cases.

But it looks like it's a little late now for this comment.





[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418042#comment-16418042
 ] 

Hudson commented on HDFS-13314:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13896 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/13896/])
HDFS-13314. NameNode should optionally exit if it detects FsImage (arp: rev 
a991e899fb9f98d2089f37ac9ac7c485d3bbb959)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormatProtobuf.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/FSImageFormatPBSnapshot.java


> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Priority: Major
> Fix For: 3.1.0, 2.10.0, 3.0.2
>
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch, HDFS-13314.05.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-27 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416260#comment-16416260
 ] 

Arpit Agarwal commented on HDFS-13314:
--

bq. if numErrors == 0 then namenode should not exit.
We test this code path since we have many unit tests that exercise 
saveNamespace. e.g. multiple test cases in TestFSImage would fail if NN exits 
when numErrors == 0.

bq. if numErrors > 0 then namenode should exit.
No easy way to do so without refactoring or inserting test hooks. I don't think 
it is worth the effort. Do you feel it is a blocker to committing this 
validation?

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch, HDFS-13314.05.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-27 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416209#comment-16416209
 ] 

Tsz Wo Nicholas Sze commented on HDFS-13314:


More details: The suggested unit test sounds like we should test whether 
the if-statement in Java is working properly.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-27 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416199#comment-16416199
 ] 

Tsz Wo Nicholas Sze commented on HDFS-13314:


[~shahrs87], imho, the unit test you suggested does not sound useful. Thanks.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-27 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416046#comment-16416046
 ] 

Rushabh S Shah commented on HDFS-13314:
---

bq. perhaps we could do some ugly fault injection to create dangling references 
and corrupted diff lists in the image.
I am not that concerned with a test case showing how the image gets 
corrupted.
I would like to see a test case that verifies the following two scenarios:
* if numErrors == 0 then namenode should not exit.
* if numErrors > 0 then namenode should exit. 
From a quick look through the ExitUtil class, I see you can use 
{{ExitUtil#disableSystemExit}}.
With that set, {{ExitUtil#terminate}} will throw an {{ExitException}} instead 
of exiting, and the exception is saved in the ExitUtil class. 
Hope this helps.
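
The two scenarios above can be sketched without a running NameNode, roughly as 
follows. This is only an illustration under stated assumptions: the class below 
is a small stand-in that imitates the disable-exit-then-throw pattern of 
ExitUtil, it is not Hadoop's ExitUtil, and exitIfImageErrors is a hypothetical 
name for whatever hook the NameNode calls after saving the image.

```java
public class ExitOnImageErrorSketch {
  /** Thrown instead of exiting when system exit is disabled (test mode). */
  public static final class ExitException extends RuntimeException {
    public final int status;
    public ExitException(int status, String msg) {
      super(msg);
      this.status = status;
    }
  }

  private static boolean systemExitDisabled = false;

  /** Tests call this so that terminate() throws rather than exits. */
  public static void disableSystemExit() {
    systemExitDisabled = true;
  }

  /** Exits the JVM, or throws ExitException if exit has been disabled. */
  public static void terminate(int status, String msg) {
    if (systemExitDisabled) {
      throw new ExitException(status, msg);
    }
    System.exit(status);
  }

  /** Hypothetical post-save hook: request exit only if errors were counted. */
  public static void exitIfImageErrors(int numErrors) {
    if (numErrors > 0) {
      terminate(1, "Detected " + numErrors + " errors while saving FsImage");
    }
  }

  public static void main(String[] args) {
    disableSystemExit();

    // Scenario 1: numErrors == 0, so no exit is requested.
    exitIfImageErrors(0);
    System.out.println("no errors: still running");

    // Scenario 2: numErrors > 0, so an exit is requested.
    try {
      exitIfImageErrors(2);
      System.out.println("unexpected: no exit requested");
    } catch (ExitException e) {
      System.out.println("exit requested, status=" + e.status);
    }
  }
}
```

A real unit test would additionally need a way to make the error count 
non-zero, which is the part described in this thread as requiring refactoring 
or test hooks.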




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-27 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415919#comment-16415919
 ] 

Arpit Agarwal commented on HDFS-13314:
--

Thanks for reviewing the patch. I don't see an easy way to write a test case, 
perhaps we could do some ugly fault injection to create dangling references and 
corrupted diff lists in the image. If you have any ideas to test it more easily 
or want to contribute some test code, I'd be happy to incorporate either.

Meanwhile, I've manually tested the patch with a corrupted image (modified the 
loading code to ignore the assertions while loading, then tried to save the 
image with this patch applied and verified the error conditions were detected).




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-27 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415743#comment-16415743
 ] 

Rushabh S Shah commented on HDFS-13314:
---

Overall the changes look good.
I would like to see a test case for this change.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-27 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415676#comment-16415676
 ] 

Rushabh S Shah commented on HDFS-13314:
---

I will review it today.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-26 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414882#comment-16414882
 ] 

Tsz Wo Nicholas Sze commented on HDFS-13314:


+1 the 05 patch looks good.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-22 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410615#comment-16410615
 ] 

genericqa commented on HDFS-13314:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 24m 14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 37s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 86 unchanged - 1 fixed = 86 total (was 87) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 9m 52s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}100m 57s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}179m 45s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8620d2b |
| JIRA Issue | HDFS-13314 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12915758/HDFS-13314.05.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux d7832a5bf337 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 8620d2b |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/23630/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/23630/testReport/ |
| Max. process+thread count | 3838 (vs. ulimit of 1) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs 

[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-22 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410236#comment-16410236
 ] 

Arpit Agarwal commented on HDFS-13314:
--

v05 patch: Remove the config key which made this behavior optional.

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch, HDFS-13314.05.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.
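
As a rough illustration of the two corruption checks listed in the description, the sketch below flags references to inode ids that do not exist and ids that occur more than once in a deleted list. The class name, method names, and the use of plain `Long` ids are assumptions made for illustration; none of them are taken from the HDFS-13314 patches or from the NameNode's actual data structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sanity checks; not the NameNode's real implementation.
class FsImageSanityCheck {

    // Returns every referred id that has no matching entry in inodeIds,
    // i.e. a reference pointing to a non-existent inode.
    static List<Long> danglingReferences(Set<Long> inodeIds,
                                         Collection<Long> referredIds) {
        List<Long> dangling = new ArrayList<>();
        for (Long id : referredIds) {
            if (!inodeIds.contains(id)) {
                dangling.add(id);
            }
        }
        return dangling;
    }

    // Returns the ids that appear more than once in a deleted list.
    static Set<Long> duplicateEntries(List<Long> deletedList) {
        Set<Long> seen = new HashSet<>();
        Set<Long> duplicates = new TreeSet<>();
        for (Long id : deletedList) {
            if (!seen.add(id)) { // add() returns false if id was already present
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Set<Long> inodes = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        System.out.println(danglingReferences(inodes, Arrays.asList(2L, 7L)));
        System.out.println(duplicateEntries(Arrays.asList(4L, 5L, 4L)));
    }
}
```

A non-empty result from either method would correspond to the "corruption detected" condition that the rest of the thread debates how to handle.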



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-22 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409920#comment-16409920
 ] 

Arpit Agarwal commented on HDFS-13314:
--

bq. not halting the NN risks removing the only good image.
The defaults will keep the good image, bad image and all edits in between. 
Since the Standby cannot be restarted unless the problem is manually worked 
around or fixed, no new checkpoints will be generated.
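
The retention argument above can be sketched as a small simulation. It is a simplified model, assuming one good image followed by a run of bad checkpoints and a purge that keeps only the newest N images; the class and method names are hypothetical and the model ignores edit logs entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of image retention; not NameNode code.
class ImageRetentionSim {

    // Starts with one good image, then writes `checkpoints` bad images.
    // After each write, images beyond the newest `retained` are purged,
    // unless haltOnCorruption is set, in which case the process stops
    // right after the first bad image and the purge never runs.
    static Deque<String> run(int retained, int checkpoints,
                             boolean haltOnCorruption) {
        Deque<String> images = new ArrayDeque<>();
        images.addLast("good");
        for (int i = 1; i <= checkpoints; i++) {
            images.addLast("bad-" + i);
            if (haltOnCorruption) {
                break; // process exits before purging older images
            }
            while (images.size() > retained) {
                images.removeFirst(); // purge the oldest image
            }
        }
        return images;
    }

    public static void main(String[] args) {
        System.out.println("keep running: " + run(2, 2, false));
        System.out.println("halt:         " + run(2, 2, true));
    }
}
```

In this model, halting after the first bad image leaves the good image on disk, while continuing to checkpoint with two images retained pushes it out after the second bad checkpoint.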

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-22 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409895#comment-16409895
 ] 

Arpit Agarwal commented on HDFS-13314:
--

bq. Yes, no config option. Detected corruption = unconditional hard stop.
Ok, will do.

bq. The in-memory state is corrupt but the edit stream (hopefully) isn't. Which 
is easier to do: Hack up the NN to attempt to load the bad image? Or replay a 
partial edit stream perhaps w/o the snapshot removal? 
I'd suggest that the former is safer because it exposes the problem sooner, e.g. 
if the administrator has configured auto-restart, as many of our customers do. 
If we don't write an image, the NN shuts down but can easily be restarted and 
will continue to run with corrupted state (note that we don't yet know how to 
detect the corruption when replaying edit logs).

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-22 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409860#comment-16409860
 ] 

Daryn Sharp commented on HDFS-13314:


bq. Is there anything you suggest doing differently?

Yes, no config option.  Detected corruption = unconditional hard stop.

bq. Once we get to this point, the metadata is already corrupt. Writing out a 
new FsImage doesn't make it any worse because replaying the prior image and 
edits would lead to the same state.

The in-memory state is corrupt but the edit stream (hopefully) isn't.  Which is 
easier to do: Hack up the NN to attempt to load the bad image?  Or replay a 
partial edit stream perhaps w/o the snapshot removal?  If you agree to the 
latter, then as Rushabh pointed out, _not_ halting the NN risks removing the 
only good image.  The defaults allow at most 2 hours (2 images retained, 1h 
checkpoint interval unless max edits exceeded).

Running in the corrupted state risks data loss.  As cited above, the original 
report of this bug resulted in the NN causing *9300 missing blocks*.
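
The two-hour figure above can be reproduced with a small calculation. This is a sketch under the stated assumptions: the number of retained images corresponds to `dfs.namenode.num.checkpoints.retained` (default 2), the interval to `dfs.namenode.checkpoint.period` (default one hour), and every checkpoint after the last good one writes a bad image and still purges. The class and method names are hypothetical.

```java
import java.time.Duration;

// Back-of-the-envelope bound; not NameNode code.
class CheckpointRetentionWindow {

    // Upper bound on how long the last good image survives after it is
    // written, if checkpoints keep running and purging on schedule.
    // Checkpoints triggered early by the edit-count threshold shorten it.
    static Duration window(int imagesRetained, Duration checkpointPeriod) {
        return checkpointPeriod.multipliedBy(imagesRetained);
    }

    public static void main(String[] args) {
        Duration defaults = window(2, Duration.ofHours(1));
        System.out.println("defaults: " + defaults.toHours() + "h");
    }
}
```

With a 12-hour checkpoint period, as mentioned elsewhere in this thread, the same calculation gives 24 hours.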

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-22 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409657#comment-16409657
 ] 

Arpit Agarwal commented on HDFS-13314:
--

Thanks [~szetszwo]. I'll hold off committing in case [~daryn] or Rushabh have 
further comments.

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-22 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409463#comment-16409463
 ] 

genericqa commented on HDFS-13314:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 25s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
55s{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 
0 new + 518 unchanged - 1 fixed = 518 total (was 519) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 35s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 42s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}136m 44s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.tools.TestHdfsConfigFields |
|   | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HDFS-13314 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12915540/HDFS-13314.04.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux e584a90790f8 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 
14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 8d898ab |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/23621/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/23621/testReport/ |
| Max. process+thread count | 4184 (vs. 

[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-21 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408443#comment-16408443
 ] 

Arpit Agarwal commented on HDFS-13314:
--

Thanks [~szetszwo]. The v4 patch removes savedImage and addresses Jenkins 
failures.

bq. Question: why use ExitUtil.terminate(-1) rather than throw an IOException?
I want to guarantee process exit. I don't want the exception to be swallowed 
further up the call stack.
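
The concern about a swallowed exception can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `saveImage` and `checkpointOnce` are stand-ins invented for this example, not methods from the patch, and the example deliberately avoids calling any Hadoop API.

```java
import java.io.IOException;

// Illustration of an exception being absorbed by a caller; not NameNode code.
class SwallowedExceptionDemo {

    // Stand-in for a save routine that reports corruption by throwing.
    static void saveImage(boolean corrupt) throws IOException {
        if (corrupt) {
            throw new IOException("FsImage corruption detected");
        }
    }

    // A caller that logs the failure and carries on. Returns true only if
    // the save succeeded; either way the process keeps running.
    static boolean checkpointOnce(boolean corrupt) {
        try {
            saveImage(corrupt);
            return true;
        } catch (IOException e) {
            System.err.println("checkpoint failed, will retry: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        checkpointOnce(true);
        // Reached even though corruption was detected above.
        System.out.println("process still running");
    }
}
```

Whether an exception stops the process depends on every caller between the throw site and the top of the stack, whereas an explicit terminate call at the detection point does not rely on callers behaving well.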

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407400#comment-16407400
 ] 

Tsz Wo Nicholas Sze commented on HDFS-13314:


[~arpitagarwal], thanks for the update.  I have one minor comment and a 
question:
- In saveNamespace, savedImage is always true in the if-statement at the end, 
so we can remove it.
- Question: why use ExitUtil.terminate(-1) rather than throw an IOException?

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407364#comment-16407364
 ] 

genericqa commented on HDFS-13314:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
24s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 26s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 46s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 5 new + 517 unchanged - 1 fixed = 522 total (was 518) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}  
9m 55s{color} | {color:green} patch has no errors when building and testing our 
client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
42s{color} | {color:red} hadoop-hdfs-project_hadoop-hdfs generated 1 new + 1 
unchanged - 0 fixed = 2 total (was 1) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}105m 15s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
19s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}152m  5s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |
|   | hadoop.tools.TestHdfsConfigFields |
|   | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HDFS-13314 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12915379/HDFS-13314.03.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 691ce0556be8 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 
11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 138c1ed |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 

[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407196#comment-16407196
 ] 

Arpit Agarwal commented on HDFS-13314:
--

bq. How is the "safe" choice to knowingly write a corrupt image? One that 
renders the NN incapable of starting up?
Once we get to this point, the metadata is already corrupt. Writing out a new 
FsImage doesn't make it any worse because replaying the prior image and edits 
would lead to the same state. 

bq. We need to immediately do a full stop anytime data structures are known to 
be corrupt.
That is my intention. Looks like you've gone over the patch. Is there anything 
you suggest doing differently?

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407124#comment-16407124
 ] 

Daryn Sharp commented on HDFS-13314:


I think Rushabh thought the "don't exit" option didn't delete prior images and 
edits after checkpointing based on the claim +"The purge step is skipped if a 
bad image was written"+.  However, the code appears to only skip the purge if 
it's configured to shutdown on corruption.

{quote}
bq. I'm inclined to agree with Arpit that we should not change the default 
behavior, though. In the extreme case where someone really wants the checkpoint 
done (e.g. has not checkpointed for a long time so lots of edits, etc.), 
keeping the old behavior seems better - you cannot let them reconfigure and do 
it again.
Yes, this is a good explanation. Not changing the default, and ensuring we 
write a new image is the safe choice.
{quote}

No, that is a terrible explanation.  How is the "safe" choice to knowingly 
write a corrupt image?  One that renders the NN incapable of starting up?  
There's no "reconfigure" your way out of that.  How is it safe to allow the NN 
to start obliterating data? (see HDFS-9406, 9300 blocks invalidated).

bq. I think it may also be possible if the workflow deletes a bunch of stuff 
(e.g. the problematic snapshot, parent dir, etc.), and checkpoint, the 
corruption may not happen at all - although this is an untested guess.

Wishful thinking + data durability = Russian data roulette.  I'd predict data 
loss due to incorrect invalidations, further corruption of the in-memory state, 
probably corrupted edits, and an eventual crash.

We need to immediately do a full stop anytime data structures are known to be 
corrupt.


> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406982#comment-16406982
 ] 

Arpit Agarwal commented on HDFS-13314:
--

v03 patch addresses feedback from [~szetszwo].

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406819#comment-16406819
 ] 

Arpit Agarwal commented on HDFS-13314:
--

[~shahrs87] I am unsure how your question relates to this change.

If you were affected by the snapshots bug, your NN would already be writing 
corrupt images.

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406798#comment-16406798
 ] 

Rushabh S Shah commented on HDFS-13314:
---

bq. Checkpointing is done by the standby.
But I don't need to restart the standby for it to checkpoint.
The standby checkpoints every 12 hours (in our case).
It will keep writing the corrupted image every 12 hours (and old images 
won't be purged) if we don't upgrade the cluster within a month.
Am I missing something?

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406753#comment-16406753
 ] 

Arpit Agarwal commented on HDFS-13314:
--

Thanks for the look [~xiaochen]. 
bq. I'm inclined to agree with Arpit that we should not change the default 
behavior, though. In the extreme case where someone really wants the checkpoint 
done (e.g. has not checkpointed for a long time so lots of edits, etc.), 
keeping the old behavior seems better - you cannot let them reconfigure and do 
it again.
Yes, this is a good explanation. Not changing the default, and ensuring we 
write a new image is the safe choice.

Also thanks for the feedback [~szetszwo], I will post an updated patch shortly 
to address your comments. Running some sanity tests first.

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.






[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406741#comment-16406741
 ] 

Arpit Agarwal commented on HDFS-13314:
--

bq. I don't understand why it is impossible. Why do I need to restart the 
standby ?
Checkpointing is done by the standby.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406429#comment-16406429
 ] 

Rushabh S Shah commented on HDFS-13314:
---

bq. Impossible, as you will need to restart the standby to get any new checkpoints. And it will fail to restart because the latest fsimage is bad.
I don't understand why it is impossible. Why do I need to restart the standby?




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-20 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405890#comment-16405890
 ] 

Xiao Chen commented on HDFS-13314:
--

Thanks [~arpitagarwal] and all for the effort here. Also ping [~yzhangal] for 
interest.

I echo the difficulty, and sometimes frustration, of not being able to 
reproduce a corruption. The idea here sounds good.

I'm inclined to agree with Arpit that we should not change the default 
behavior, though. In the extreme case where someone really wants the checkpoint 
done (e.g. they have not checkpointed for a long time, so there are lots of 
edits), keeping the old behavior seems better - you cannot let them reconfigure 
and do it again. It may also be possible that, if the workflow deletes a bunch 
of stuff (e.g. the problematic snapshot, parent dir, etc.) and then checkpoints, 
the corruption does not happen at all - although this is an untested guess.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-19 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405647#comment-16405647
 ] 

genericqa commented on HDFS-13314:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 18s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 10s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 55s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 52s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 54s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 518 unchanged - 1 fixed = 522 total (was 519) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m  4s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  2m  4s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 52s{color} | {color:red} hadoop-hdfs-project_hadoop-hdfs generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 18s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 23s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}133m 10s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
|  |  Inconsistent synchronization of org.apache.hadoop.hdfs.server.namenode.FSImage.exitAfterSave; locked 66% of time. Unsynchronized access at FSImage.java:[line 973] |
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | hadoop.tools.TestHdfsConfigFields |
|   | hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics |
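
For readers unfamiliar with the FindBugs "inconsistent synchronization" warning above: it is raised when a field is accessed with a lock held in some places and without it in others. One common way to address it is to make every access go through an atomic (or volatile) field. The sketch below only illustrates that general technique. The class {{ExitAfterSaveFlag}} and its methods are hypothetical, and this thread does not say how the patch actually resolves the warning.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical holder for an "exit after save" flag. Not part of HDFS;
 * it only shows how an atomic field avoids mixed locked/unlocked access.
 */
final class ExitAfterSaveFlag {
  // Every read and write goes through the AtomicBoolean, so no external
  // lock is needed to make an update visible to other threads.
  private final AtomicBoolean exitAfterSave = new AtomicBoolean(false);

  /** Called by the (hypothetical) image saver when it detects corruption. */
  void markCorruptionDetected() {
    exitAfterSave.set(true);
  }

  /** Called after the save completes to decide whether to terminate. */
  boolean shouldExit() {
    return exitAfterSave.get();
  }
}
```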
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HDFS-13314 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12915200/HDFS-13314.02.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 178676a1c9b8 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| 

[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-19 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405556#comment-16405556
 ] 

Arpit Agarwal commented on HDFS-13314:
--

Hi Rushabh,

bq. You need to change the namenode code and build again to bypass
Correct.

bq. On top of that if we don't purge the old images in case of corruption, we 
will run out of disk space in 3-4 weeks.
Impossible, as you will need to restart the standby to get any new checkpoints. 
And it will fail to restart because the latest fsimage is bad.

bq. IMO instead of putting a hack in namenode, we should actively chase the 
bugs and root cause them.
This is not a hack - it is the equivalent of an {{assert}}. We have been 
chasing this bug for a few weeks without success. Other community members have 
been doing the same for a while (please see HDFS-13101 and comments on 
HDFS-9406). To root cause the problem, we need (1) a good image and (2) the 
sequence of write transactions causing the corruption.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-19 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405506#comment-16405506
 ] 

Rushabh S Shah commented on HDFS-13314:
---

bq. In the cases we ran into, the corrupted image was loadable after bypassing some checks during NameNode startup.
Do you need to change the NameNode code and rebuild to bypass these checks, or 
are you talking about some conf?

{quote}
The corruption was detected the next time a NameNode is restarted which may be 
weeks after it occurred.
 The default value of dfs.namenode.num.checkpoints.retained is 2, so the older 
image is not lost.
The purge step is skipped if a bad image was written.
{quote}
As you mentioned, it takes a few weeks to detect that a bad image was written.
Coming from a company with a huge churn of write ops, we checkpoint at least 
every 12 hours and write an image of about 25GB.
On top of that, if we don't purge the old images in case of corruption, we will 
run out of disk space in 3-4 weeks.
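
To make the arithmetic behind that estimate explicit, here is a small sketch. It assumes exactly two checkpoints per day (one every 12 hours), a constant 25GB image, and that no image is ever purged. The class and method names are hypothetical, and the available disk capacity is not stated here, so the sketch only computes how much space the retained images would occupy.

```java
/** Hypothetical helper; only multiplies out the figures quoted above. */
final class ImageDiskEstimate {
  private ImageDiskEstimate() {}

  /**
   * Space in GB taken by retained images if none are ever purged.
   *
   * @param checkpointsPerDay assumed number of checkpoints per day
   * @param imageSizeGb assumed size of each image in GB
   * @param days number of days without purging
   */
  static long retainedGigabytes(int checkpointsPerDay, long imageSizeGb,
      int days) {
    return (long) checkpointsPerDay * imageSizeGb * days;
  }
}
```

With these assumptions, {{retainedGigabytes(2, 25, 21)}} gives 1050 and {{retainedGigabytes(2, 25, 28)}} gives 1400, i.e. roughly 1 to 1.4TB of images after 3-4 weeks.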

IMO instead of putting a hack in namenode, we should actively chase the bugs 
and root cause them.
Also I still think the default value should be *to exit*.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-19 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405475#comment-16405475
 ] 

Tsz Wo Nicholas Sze commented on HDFS-13314:


Thanks [~arpitagarwal], some comments on the patch:
- Print also the fsimage file name in the log messages below.
{code:java}
+  LOG.error("Detected " + numErrors + " errors while saving FsImage.");
{code}
{code:java}
+  LOG.fatal("NameNode process will exit now... The saved FsImage is " +
+  "potentially corrupted.");
{code}

 - Add numErrors in the log message below.
{code:java}
+long numErrors = saveInternal(fout, compression, file.getAbsolutePath());
 LOG.info("Image file {} of size {} bytes saved in {} seconds.", file,
 file.length(), (monotonicNow() - startTime) / 1000);
+return numErrors;
{code}

 - Print the full path in the log message below
{code:java}
+FSImage.LOG.error("FSImageFormatPBSnapshot: Missing referred INodeId " +
+ref.getId() + " for INodeReference index " + refIndex);
{code}

 - Let's not only check INodeReference but all INodes. Also, let's use 
compareTo to detect also out-of-order cases.
{code:java}
  INode previous = null;
  for (INode d : deleted) {
    if (previous != null) {
      final int cmp = d.compareTo(previous.getLocalNameBytes());
      if (cmp <= 0) {
        final String err = cmp == 0? "repeated": "out-of-order";
        FSImage.LOG.error("Names " + err
            + " in the 'deleted' difflist of directory " ...);
        ++numImageErrors;
      }
    }
    previous = d;
  }
{code}
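
The last suggestion can also be tried outside the NameNode. Below is a minimal, self-contained sketch of the same idea: walk a list that is supposed to be strictly sorted and count entries that are repeated or out-of-order relative to their predecessor. {{DeletedListCheck}}, {{bytes}}, {{compare}} and {{countOrderingErrors}} are hypothetical names, names are modelled as plain byte arrays rather than INodes, and the unsigned byte ordering used here is an assumption of the sketch that may differ from what {{INode.compareTo}} does.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical stand-alone checker; not part of the HDFS code base. */
final class DeletedListCheck {
  private DeletedListCheck() {}

  /** Encodes a name as UTF-8 bytes for use with this sketch. */
  static byte[] bytes(String name) {
    return name.getBytes(StandardCharsets.UTF_8);
  }

  /** Lexicographic comparison treating each byte as unsigned (assumption). */
  static int compare(byte[] a, byte[] b) {
    final int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      final int x = a[i] & 0xff;
      final int y = b[i] & 0xff;
      if (x != y) {
        return x - y;
      }
    }
    return a.length - b.length;
  }

  /**
   * Counts entries that are not strictly greater than their predecessor,
   * i.e. repeated (cmp == 0) or out-of-order (cmp < 0) names.
   */
  static int countOrderingErrors(List<byte[]> names) {
    int errors = 0;
    byte[] previous = null;
    for (byte[] current : names) {
      if (previous != null) {
        final int cmp = compare(current, previous);
        if (cmp <= 0) {
          final String kind = cmp == 0 ? "repeated" : "out-of-order";
          System.err.println("Name " + kind + ": "
              + new String(current, StandardCharsets.UTF_8));
          ++errors;
        }
      }
      previous = current;
    }
    return errors;
  }
}
```

A single pass is enough because each entry is compared only with the one before it, so the check adds one comparison per entry to whatever work the caller already does while iterating the list.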




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-19 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405451#comment-16405451
 ] 

Arpit Agarwal commented on HDFS-13314:
--

Thanks for the look Rushabh. In the cases we ran into, the corrupted image was 
loadable after bypassing some checks during NameNode startup. In any case, there 
are a couple of safeguards against losing the older image:
# The default value of {{dfs.namenode.num.checkpoints.retained}} is 2, so the 
older image is not lost.
# The purge step is skipped if a bad image was written.
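
As a rough illustration of how these two safeguards interact, here is a minimal sketch. It is not the NameNode's actual retention code: {{CheckpointRetention}} and {{imagesToPurge}} are hypothetical names, images are identified only by a transaction id, and {{numRetained}} stands in for the value of {{dfs.namenode.num.checkpoints.retained}}.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of the purge decision; not the real HDFS logic. */
final class CheckpointRetention {
  private CheckpointRetention() {}

  /**
   * Returns the transaction ids of the images that would be purged.
   *
   * @param imageTxIds transaction ids of all images currently on disk
   * @param numRetained how many of the newest images to keep
   * @param badImageWritten true if the image just written was flagged
   */
  static List<Long> imagesToPurge(List<Long> imageTxIds, int numRetained,
      boolean badImageWritten) {
    // Safeguard 2: skip the purge entirely after writing a bad image.
    if (badImageWritten || imageTxIds.size() <= numRetained) {
      return Collections.emptyList();
    }
    final List<Long> sorted = new ArrayList<>(imageTxIds);
    Collections.sort(sorted);
    // Safeguard 1: keep the newest numRetained images, purge the rest.
    return new ArrayList<>(sorted.subList(0, sorted.size() - numRetained));
  }
}
```

With images at transaction ids 100, 200 and 300 and {{numRetained}} set to 2, only the image at 100 would be purged after a good save, and nothing would be purged after a bad one.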

bq. We can default to always exit and if someone is ok having their image 
corrupted then they can set it to not exit.
I am okay with changing the default behavior to just exit if corruption is 
detected, if the community is fine with it. I expect leaving the default 
behavior unchanged will be less controversial.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-19 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405429#comment-16405429
 ] 

Rushabh S Shah commented on HDFS-13314:
---

Just curious: why do we want to go ahead and still write the image if we know 
we won't be able to recover?
IMO, it *should always* exit so that we have an earlier good image to recover 
from.
We can default to always exit, and if someone is OK with having their image 
corrupted then they can set it to _not exit_.




[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-19 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405373#comment-16405373
 ] 

Arpit Agarwal commented on HDFS-13314:
--

We've seen two FsImage corruption symptoms correlated with heavy usage of HDFS 
snapshots. 

# Dangling INodeReferences (likely the same as HDFS-13101)
# Duplicate entries in snapshot diff list (this may have been caused by 
attempting to work around #1)

This usually occurs when someone has dozens of snapshots on a large directory 
e.g. {{/}}, or {{/apps/hive/warehouse}}. We have not been able to repro the 
problem with load testing yet.

The corruption is detected the next time a NameNode is restarted, which may be 
weeks after it occurred. Since both problems can be trivially detected while 
writing the FsImage, this patch proposes that the NameNode self-terminate after 
writing a bad image.
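
A minimal sketch of the proposed flow, under stated assumptions: the image writer counts errors as it goes (here, only dangling references), and the process exits after the save only if an opt-in flag is set. {{ImageSaveCheck}}, {{countDanglingReferences}} and {{shouldExitAfterSave}} are hypothetical names used for illustration and do not reproduce the attached patch.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical model of "detect while saving, then optionally exit". */
final class ImageSaveCheck {
  private ImageSaveCheck() {}

  /**
   * Counts references whose referred inode id is not among the known
   * inode ids, i.e. references pointing to a non-existent inode.
   */
  static long countDanglingReferences(Set<Long> inodeIds,
      List<Long> referredIds) {
    long numErrors = 0;
    for (Long referred : referredIds) {
      if (!inodeIds.contains(referred)) {
        System.err.println("Missing referred INodeId " + referred);
        ++numErrors;
      }
    }
    return numErrors;
  }

  /**
   * The exit is opt-in: with the flag disabled (the default described in
   * the issue), errors are only reported and the process keeps running.
   */
  static boolean shouldExitAfterSave(boolean exitOnCorruption,
      long numErrors) {
    return exitOnCorruption && numErrors > 0;
  }
}
```

The check runs after the image has been written rather than before, which matches the "exit after writing" wording of the proposal: the new image still lands on disk, and the operator is alerted at the time of corruption instead of weeks later.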

> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Attachments: HDFS-13314.01.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.


