[jira] [Commented] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)

2016-10-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623795#comment-15623795
 ] 

Hudson commented on HDFS-11030:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10738 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/10738/])
HDFS-11030. TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing) 
(liuml07: rev 0c49f73a6c19ce0d0cd59cf8dfaa9a35f67f47ab)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java


> TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
> -
>
> Key: HDFS-11030
> URL: https://issues.apache.org/jira/browse/HDFS-11030
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, test
>Affects Versions: 2.7.0
>Reporter: Mingliang Liu
>Assignee: Mingliang Liu
> Fix For: 2.8.0, 3.0.0-alpha2
>
> Attachments: HDFS-11030-branch-2.000.patch, HDFS-11030.000.patch
>
>
> TestDataNodeVolumeFailure#testVolumeFailure fails a volume and verifies the 
> blocks and files are replicated correctly.
> # To fail a volume, it deletes all the blocks and sets the data dir read only.
> {code:title=testVolumeFailure() snippet}
> // fail the volume
> // delete/make non-writable one of the directories (failed volume)
> data_fail = new File(dataDir, "data3");
> failedDir = MiniDFSCluster.getFinalizedDir(dataDir, 
> cluster.getNamesystem().getBlockPoolId());
> if (failedDir.exists() &&
> //!FileUtil.fullyDelete(failedDir)
> !deteteBlocks(failedDir)
> ) {
>   throw new IOException("Could not delete hdfs directory '" + failedDir + 
> "'");
> }
> data_fail.setReadOnly();
> failedDir.setReadOnly();
> {code}
> However, there are two bugs here that prevent the blocks from being deleted.
> #- The {{failedDir}} directory for finalized blocks is not calculated 
> correctly. It should use {{data_fail}} instead of {{dataDir}} as the base 
> directory.
> #- When deleting block files in {{deteteBlocks(failedDir)}}, it assumes that 
> there are no subdirectories in the data dir. This assumption is also noted in 
> the comments.
> {quote}
> // we use only small number of blocks to avoid creating subdirs in the 
> data dir..
> {quote}
> This is not true. On my local cluster and in MiniDFSCluster, there are two 
> levels of subdirectories (subdir0/subdir0/) regardless of the number of blocks.
> # Meanwhile, to fail a volume, the test also needs to trigger the DataNode to 
> remove the volume and send a block report to the NN. This is essentially done 
> in the {{triggerFailure()}} method.
> {code}
>   private void triggerFailure(String path, long size) throws IOException {
> NamenodeProtocols nn = cluster.getNameNodeRpc();
> List<LocatedBlock> locatedBlocks =
>   nn.getBlockLocations(path, 0, size).getLocatedBlocks();
> 
> for (LocatedBlock lb : locatedBlocks) {
>   DatanodeInfo dinfo = lb.getLocations()[1];
>   ExtendedBlock b = lb.getBlock();
>   try {
> accessBlock(dinfo, lb);
>   } catch (IOException e) {
> System.out.println("Failure triggered, on block: " + b.getBlockId() +
>     "; corresponding volume should be removed by now");
> break;
>   }
> }
>   }
> {code}
> Accessing those blocks will not trigger failures if the directory is 
> read-only (while the block files are all there). I ran the tests multiple 
> times without triggering this failure. To really trigger one, we either have 
> to write new block files to the data directories or make sure the existing 
> blocks are deleted correctly. I think we need to add some assertion code 
> after triggering the volume failure. The assertions should check the DataNode 
> volume failure summary explicitly to make sure a volume failure is triggered 
> (and noticed).
> # To make sure the NameNode is aware of the volume failure, the code 
> explicitly sends block reports to the NN.
> {code:title=TestDataNodeVolumeFailure#testVolumeFailure()}
> cluster.getNameNodeRpc().blockReport(dnR, bpid, reports,
> new BlockReportContext(1, 0, System.nanoTime(), 0, false));
> {code}
> The code that generates the block report is complex; it duplicates the 
> internal logic of {{BPServiceActor}}, so we may have to update it whenever 
> that logic changes. In fact, the volume failure is now reported by the 
> DataNode via heartbeats. We should trigger a heartbeat request here, and make 
> sure the NameNode has handled the heartbeat before we verify the block states.
> # When verifying via {{verify()}}, the test counts the real block files and 
> asserts that real block files plus under-replicated blocks cover all blocks. 
> Before counting under-replicated blocks, it triggers the {{BlockManager}} to 
> compute the DataNode work:
> {code}
> // force update of all the metric counts by calling computeDatanodeWork
> BlockManagerTestUtil.getComputedDatanodeWork(fsn.getBlockManager());
> {code}
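
For illustration, here is a minimal sketch of how the two deletion bugs above could be addressed: compute the finalized directory from the failed volume ({{data_fail}}) rather than {{dataDir}}, and delete block files recursively instead of assuming a flat layout. The helper name {{deleteBlockFiles}} and the usage comment are hypothetical and not taken from the committed patch.

{code:title=recursive block deletion (sketch)}
import java.io.File;

public final class BlockFileDeleter {
  private BlockFileDeleter() {}

  /**
   * Delete every block and meta file under {@code dir}, descending into
   * subdirectories (e.g. subdir0/subdir0/...). Returns false if listing
   * fails or any file cannot be deleted.
   */
  public static boolean deleteBlockFiles(File dir) {
    File[] entries = dir.listFiles();
    if (entries == null) {
      return false;                        // not a directory, or I/O error
    }
    boolean ok = true;
    for (File f : entries) {
      if (f.isDirectory()) {
        ok &= deleteBlockFiles(f);         // recurse into subdir levels
      } else if (f.getName().startsWith("blk_")) {
        ok &= f.delete();                  // block file or its .meta file
      }
    }
    return ok;
  }
}

// Intended use in the test (hypothetical):
//   File failedVolume = new File(dataDir, "data3");
//   File finalizedDir = MiniDFSCluster.getFinalizedDir(
//       failedVolume, cluster.getNamesystem().getBlockPoolId());
//   assertTrue(deleteBlockFiles(finalizedDir));
{code}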

[jira] [Commented] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)

2016-10-31 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623509#comment-15623509
 ] 

Jitendra Nath Pandey commented on HDFS-11030:
-

+1. Thanks for the patch [~liuml07]

> TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
> -
>
> Key: HDFS-11030
> URL: https://issues.apache.org/jira/browse/HDFS-11030
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, test
>Affects Versions: 2.7.0
>Reporter: Mingliang Liu
>Assignee: Mingliang Liu
> Attachments: HDFS-11030-branch-2.000.patch, HDFS-11030.000.patch
>
>
> TestDataNodeVolumeFailure#testVolumeFailure fails a volume and verifies the 
> blocks and files are replicated correctly.
> # To fail a volume, it deletes all the blocks and sets the data dir read only.
> {code:title=testVolumeFailure() snippet}
> // fail the volume
> // delete/make non-writable one of the directories (failed volume)
> data_fail = new File(dataDir, "data3");
> failedDir = MiniDFSCluster.getFinalizedDir(dataDir, 
> cluster.getNamesystem().getBlockPoolId());
> if (failedDir.exists() &&
> //!FileUtil.fullyDelete(failedDir)
> !deteteBlocks(failedDir)
> ) {
>   throw new IOException("Could not delete hdfs directory '" + failedDir + 
> "'");
> }
> data_fail.setReadOnly();
> failedDir.setReadOnly();
> {code}
> However, there are two bugs here that prevent the blocks from being deleted.
> #- The {{failedDir}} directory for finalized blocks is not calculated 
> correctly. It should use {{data_fail}} instead of {{dataDir}} as the base 
> directory.
> #- When deleting block files in {{deteteBlocks(failedDir)}}, it assumes that 
> there are no subdirectories in the data dir. This assumption is also noted in 
> the comments.
> {quote}
> // we use only small number of blocks to avoid creating subdirs in the 
> data dir..
> {quote}
> This is not true. On my local cluster and in MiniDFSCluster, there are two 
> levels of subdirectories (subdir0/subdir0/) regardless of the number of blocks.
> # Meanwhile, to fail a volume, the test also needs to trigger the DataNode to 
> remove the volume and send a block report to the NN. This is essentially done 
> in the {{triggerFailure()}} method.
> {code}
>   private void triggerFailure(String path, long size) throws IOException {
> NamenodeProtocols nn = cluster.getNameNodeRpc();
> List<LocatedBlock> locatedBlocks =
>   nn.getBlockLocations(path, 0, size).getLocatedBlocks();
> 
> for (LocatedBlock lb : locatedBlocks) {
>   DatanodeInfo dinfo = lb.getLocations()[1];
>   ExtendedBlock b = lb.getBlock();
>   try {
> accessBlock(dinfo, lb);
>   } catch (IOException e) {
> System.out.println("Failure triggered, on block: " + b.getBlockId() +
>     "; corresponding volume should be removed by now");
> break;
>   }
> }
>   }
> {code}
> Accessing those blocks will not trigger failures if the directory is 
> read-only (while the block files are all there). I ran the tests multiple 
> times without triggering this failure. To really trigger one, we either have 
> to write new block files to the data directories or make sure the existing 
> blocks are deleted correctly. I think we need to add some assertion code 
> after triggering the volume failure. The assertions should check the DataNode 
> volume failure summary explicitly to make sure a volume failure is triggered 
> (and noticed).
> # To make sure the NameNode is aware of the volume failure, the code 
> explicitly sends block reports to the NN.
> {code:title=TestDataNodeVolumeFailure#testVolumeFailure()}
> cluster.getNameNodeRpc().blockReport(dnR, bpid, reports,
> new BlockReportContext(1, 0, System.nanoTime(), 0, false));
> {code}
> The code that generates the block report is complex; it duplicates the 
> internal logic of {{BPServiceActor}}, so we may have to update it whenever 
> that logic changes. In fact, the volume failure is now reported by the 
> DataNode via heartbeats. We should trigger a heartbeat request here, and make 
> sure the NameNode has handled the heartbeat before we verify the block states.
> # When verifying via {{verify()}}, the test counts the real block files and 
> asserts that real block files plus under-replicated blocks cover all blocks. 
> Before counting under-replicated blocks, it triggers the {{BlockManager}} to 
> compute the DataNode work:
> {code}
> // force update of all the metric counts by calling computeDatanodeWork
> BlockManagerTestUtil.getComputedDatanodeWork(fsn.getBlockManager());
> {code}
> However, counting physical block files and counting under-replicated blocks 
> is not atomic. The NameNode will inform the DataNode of the computed work at 
> the next heartbeat. So I think this part of the code may fail when some
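
One way to make that final check robust (a sketch only, not the committed fix) is to poll until the invariant holds, re-triggering the computation between attempts. {{countRealBlocks()}} and {{expectedTotalBlocks}} below are hypothetical placeholders for the test's existing bookkeeping.

{code:title=polling the consistency check (sketch)}
// Retry the non-atomic check: physical block files on the surviving volumes
// plus under-replicated blocks should eventually cover all blocks.
long deadline = System.currentTimeMillis() + 30 * 1000;   // give NN/DN 30s to converge
while (true) {
  // re-trigger the replication work, as the existing test already does
  BlockManagerTestUtil.getComputedDatanodeWork(fsn.getBlockManager());
  long underReplicated = fsn.getUnderReplicatedBlocks();
  long realBlocks = countRealBlocks();                     // hypothetical helper
  if (realBlocks + underReplicated >= expectedTotalBlocks) {
    break;                                                 // invariant holds
  }
  if (System.currentTimeMillis() > deadline) {
    throw new AssertionError("Blocks never converged: real=" + realBlocks
        + ", underReplicated=" + underReplicated);
  }
  Thread.sleep(500);                                       // wait for another heartbeat round
}
{code}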

[jira] [Commented] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)

2016-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593363#comment-15593363
 ] 

Hadoop QA commented on HDFS-11030:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
51s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
39s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
38s{color} | {color:green} branch-2 passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} branch-2 passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
16s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
57s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green} branch-2 passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
34s{color} | {color:green} branch-2 passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 
0 new + 50 unchanged - 7 fixed = 50 total (was 57) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 48m 
20s{color} | {color:green} hadoop-hdfs in the patch passed with JDK v1.7.0_111. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
22s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}134m 54s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_101 Failed junit tests | hadoop.hdfs.TestEncryptionZones |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:b59b8b7 |
| JIRA Issue | HDFS-11030 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12834547/HDFS-11030-branch-2.000.patch
 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 2ebe64861e73 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | branch-2 / 1f384

[jira] [Commented] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)

2016-10-18 Thread Mingliang Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587343#comment-15587343
 ] 

Mingliang Liu commented on HDFS-11030:
--

The code that sends the block report seems complex. It duplicates the internal 
logic of {{BPServiceActor}}, and we may have to update it if that logic 
changes. I think {{cluster.triggerBlockReport()}} is a good alternative.
{code:title=TestDataNodeVolumeFailure#testVolumeFailure()}
// make sure a block report is sent 
DataNode dn = cluster.getDataNodes().get(1); //corresponds to dir data3
String bpid = cluster.getNamesystem().getBlockPoolId();
DatanodeRegistration dnR = dn.getDNRegistrationForBP(bpid);
Map<DatanodeStorage, BlockListAsLongs> perVolumeBlockLists =
dn.getFSDataset().getBlockReports(bpid);

// Send block report
StorageBlockReport[] reports =
new StorageBlockReport[perVolumeBlockLists.size()];

int reportIndex = 0;
for (Map.Entry<DatanodeStorage, BlockListAsLongs> kvPair :
perVolumeBlockLists.entrySet()) {
DatanodeStorage dnStorage = kvPair.getKey();
BlockListAsLongs blockList = kvPair.getValue();
reports[reportIndex++] =
new StorageBlockReport(dnStorage, blockList);
}

cluster.getNameNodeRpc().blockReport(dnR, bpid, reports,
new BlockReportContext(1, 0, System.nanoTime(), 0, false));
{code}
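
For reference, a minimal sketch of that alternative, assuming the usual test hooks ({{DataNodeTestUtils.triggerHeartbeat}} / {{triggerBlockReport}}; exact helpers may differ by branch) rather than anything taken from the patch:

{code:title=triggering reports via test utilities (sketch)}
// Let the test utilities drive the reports instead of hand-building
// StorageBlockReport[] and calling the NameNode RPC directly.
DataNode dn = cluster.getDataNodes().get(1);   // corresponds to dir data3
DataNodeTestUtils.triggerHeartbeat(dn);        // heartbeat carries the volume-failure summary
DataNodeTestUtils.triggerBlockReport(dn);      // send a full block report for this DN
{code}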

> TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
> -
>
> Key: HDFS-11030
> URL: https://issues.apache.org/jira/browse/HDFS-11030
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, test
>Affects Versions: 2.7.0
>Reporter: Mingliang Liu
>Assignee: Mingliang Liu
>
> TestDataNodeVolumeFailure#testVolumeFailure fails a volume and verifies the 
> blocks and files are replicated correctly.
> To fail a volume, it deletes all the blocks and sets the data dir read only.
> {code:title=testVolumeFailure() snippet}
> // fail the volume
> // delete/make non-writable one of the directories (failed volume)
> data_fail = new File(dataDir, "data3");
> failedDir = MiniDFSCluster.getFinalizedDir(dataDir, 
> cluster.getNamesystem().getBlockPoolId());
> if (failedDir.exists() &&
> //!FileUtil.fullyDelete(failedDir)
> !deteteBlocks(failedDir)
> ) {
>   throw new IOException("Could not delete hdfs directory '" + failedDir + 
> "'");
> }
> data_fail.setReadOnly();
> failedDir.setReadOnly();
> {code}
> However, there are two bugs here that prevent the blocks from being deleted.
> # The {{failedDir}} directory for finalized blocks is not calculated 
> correctly. It should use {{data_fail}} instead of {{dataDir}} as the base 
> directory.
> # When deleting block files in {{deteteBlocks(failedDir)}}, it assumes that 
> there are no subdirectories in the data dir. This assumption is also noted in 
> the comments.
> {quote}
> // we use only small number of blocks to avoid creating subdirs in the 
> data dir..
> {quote}
> This is not true. On my local cluster and in MiniDFSCluster, there are two 
> levels of subdirectories (subdir0/subdir0/) regardless of the number of blocks.
> Meanwhile, to fail a volume, the test also needs to trigger the DataNode to 
> remove the volume and send a block report to the NN. This is essentially done 
> in the {{triggerFailure()}} method.
> {code}
>   private void triggerFailure(String path, long size) throws IOException {
> NamenodeProtocols nn = cluster.getNameNodeRpc();
> List<LocatedBlock> locatedBlocks =
>   nn.getBlockLocations(path, 0, size).getLocatedBlocks();
> 
> for (LocatedBlock lb : locatedBlocks) {
>   DatanodeInfo dinfo = lb.getLocations()[1];
>   ExtendedBlock b = lb.getBlock();
>   try {
> accessBlock(dinfo, lb);
>   } catch (IOException e) {
> System.out.println("Failure triggered, on block: " + b.getBlockId() +
>     "; corresponding volume should be removed by now");
> break;
>   }
> }
>   }
> {code}
> Accessing those blocks will not trigger failures if the directory is 
> read-only (while the block files are all there). I ran the tests multiple 
> times without triggering this failure. To really trigger one, we either have 
> to write new block files to the data directories or make sure the existing 
> blocks are deleted correctly.
> This unit test has been around for years and seldom fails, simply because it 
> has never triggered a real volume failure.
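
A small sketch of an explicit check that a volume failure was actually triggered and noticed by the DataNode, per the concern above that the test never triggers a real one. It assumes {{getNumFailedVolumes()}} on the dataset MBean and Guava's {{Supplier}} for the branch-2-era {{GenericTestUtils.waitFor}}; it is not taken from the patch.

{code:title=asserting the volume failure is noticed (sketch)}
import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;

// ... inside the test, after triggerFailure(...):
final DataNode dn = cluster.getDataNodes().get(1);   // the DN owning data3
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    // the dataset MBean exposes the failed-volume count
    return dn.getFSDataset().getNumFailedVolumes() > 0;
  }
}, 100, 30000);   // poll every 100ms, up to 30s
{code}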



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org