[jira] [Commented] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
[ https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623795#comment-15623795 ]

Hudson commented on HDFS-11030:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10738 (See
[https://builds.apache.org/job/Hadoop-trunk-Commit/10738/])
HDFS-11030. TestDataNodeVolumeFailure#testVolumeFailure is flaky (though
(liuml07: rev 0c49f73a6c19ce0d0cd59cf8dfaa9a35f67f47ab)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java

> TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
> ----------------------------------------------------------------------
>
>                 Key: HDFS-11030
>                 URL: https://issues.apache.org/jira/browse/HDFS-11030
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, test
>    Affects Versions: 2.7.0
>            Reporter: Mingliang Liu
>            Assignee: Mingliang Liu
>             Fix For: 2.8.0, 3.0.0-alpha2
>
>         Attachments: HDFS-11030-branch-2.000.patch, HDFS-11030.000.patch
>
>
> TestDataNodeVolumeFailure#testVolumeFailure fails a volume and verifies that
> the blocks and files are replicated correctly.
> # To fail a volume, it deletes all the blocks and sets the data dir read-only.
> {code:title=testVolumeFailure() snippet}
> // fail the volume
> // delete/make non-writable one of the directories (failed volume)
> data_fail = new File(dataDir, "data3");
> failedDir = MiniDFSCluster.getFinalizedDir(dataDir,
>     cluster.getNamesystem().getBlockPoolId());
> if (failedDir.exists() &&
>     //!FileUtil.fullyDelete(failedDir)
>     !deteteBlocks(failedDir)
>     ) {
>   throw new IOException("Could not delete hdfs directory '" + failedDir + "'");
> }
> data_fail.setReadOnly();
> failedDir.setReadOnly();
> {code}
> However, there are two bugs here, which prevent the blocks from being deleted.
> #- The {{failedDir}} directory for finalized blocks is not calculated
> correctly. It should use {{data_fail}} instead of {{dataDir}} as the base
> directory.
> #- When deleting block files in {{deteteBlocks(failedDir)}}, it assumes that
> there are no subdirectories in the data dir. This assumption was also noted
> in the comments.
> {quote}
> // we use only small number of blocks to avoid creating subdirs in the
> data dir..
> {quote}
> This is not true. On my local cluster and in MiniDFSCluster, there will be
> two levels of directories (subdir0/subdir0/) regardless of the number of
> blocks.
> # Meanwhile, to fail a volume, the test also needs to trigger the DataNode
> to remove the volume and send a block report to the NN. This is basically
> done in the {{triggerFailure()}} method.
> {code}
> private void triggerFailure(String path, long size) throws IOException {
>   NamenodeProtocols nn = cluster.getNameNodeRpc();
>   List<LocatedBlock> locatedBlocks =
>       nn.getBlockLocations(path, 0, size).getLocatedBlocks();
>
>   for (LocatedBlock lb : locatedBlocks) {
>     DatanodeInfo dinfo = lb.getLocations()[1];
>     ExtendedBlock b = lb.getBlock();
>     try {
>       accessBlock(dinfo, lb);
>     } catch (IOException e) {
>       System.out.println("Failure triggered, on block: " + b.getBlockId() +
>           "; corresponding volume should be removed by now");
>       break;
>     }
>   }
> }
> {code}
> Accessing those blocks will not trigger failures if the directory is
> read-only (while the block files are all there). I ran the tests multiple
> times without triggering this failure. We have to write new block files to
> the data directories, or we should have deleted the blocks correctly. I
> think we need to add some assertion code after triggering the volume
> failure. The assertions should check the datanode volume failure summary
> explicitly to make sure a volume failure is triggered (and noticed).
> # To make sure the NameNode is aware of the volume failure, the code
> explicitly sends block reports to the NN.
> {code:title=TestDataNodeVolumeFailure#testVolumeFailure()}
> cluster.getNameNodeRpc().blockReport(dnR, bpid, reports,
>     new BlockReportContext(1, 0, System.nanoTime(), 0, false));
> {code}
> The code that generates the block report is complex; it is actually the
> internal logic of {{BPServiceActor}}, and we may have to update this code
> if that logic changes. In fact, the volume failure is now sent by the
> DataNode via heartbeats. We should trigger a heartbeat request here, and
> make sure the NameNode handles the heartbeat before we verify the block
> states.
> # When verifying via {{verify()}}, the test counts the real block files and
> asserts that real block files plus under-replicated blocks should cover all
> blocks. Before counting under-replicated blocks, it triggers the
> {{BlockManager}} to compute the datanode work:
> {code}
> // force update of all the metric counts by calling computeDatanodeWork
> BlockManagerTestUtil.getComputedDatanodeWork(fsn.getBlockManager());
> {code}
> However, counting physical block files and counting under-replicated blocks
> are not atomic. The NameNode will inform the DataNode of the computed work
> at the next heartbeat. So I think this part of the code may fail when some
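Three hedged sketches follow for the fixes suggested in the quoted description; none of them is the committed patch. First, for the two deletion bugs in item 1: compute the finalized dir from the failed volume ({{data_fail}}) itself, and delete block files recursively. {{deleteBlockFiles}} is a hypothetical helper for illustration:

{code:title=sketch 1: correct base dir and recursive deletion}
// Call site: use data_fail, not its parent dataDir, as the base:
//   failedDir = MiniDFSCluster.getFinalizedDir(data_fail,
//       cluster.getNamesystem().getBlockPoolId());
//   if (failedDir.exists() && !deleteBlockFiles(failedDir)) { ... }

/** Recursively delete block files (blk_* data and meta), keeping dirs. */
private static boolean deleteBlockFiles(File dir) {
  File[] entries = dir.listFiles();
  if (entries == null) {
    return false; // not a readable directory
  }
  boolean ok = true;
  for (File f : entries) {
    if (f.isDirectory()) {
      ok &= deleteBlockFiles(f); // descends into subdir0/subdir0/...
    } else if (f.getName().startsWith("blk_")) {
      ok &= f.delete(); // matches blk_<id> and blk_<id>_<gs>.meta
    }
  }
  return ok;
}
{code}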
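Second, for items 2 and 3: trigger heartbeats through the existing {{MiniDFSCluster#triggerHeartbeats()}} hook, have the NameNode process them via {{BlockManagerTestUtil.checkHeartbeat()}}, and assert on the DataNode's failed-volume count (assuming the {{FSDatasetMBean#getNumFailedVolumes()}} metric is available):

{code:title=sketch 2: assert the volume failure is noticed}
// Make the DataNodes heartbeat immediately instead of waiting for the
// regular interval, so the failed-volume info reaches the NameNode.
cluster.triggerHeartbeats();

// Force the NameNode's HeartbeatManager to process the new state.
BlockManagerTestUtil.checkHeartbeat(
    cluster.getNamesystem().getBlockManager());

// Explicitly assert that a volume failure was triggered (and noticed).
final DataNode dn = cluster.getDataNodes().get(1);
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    return dn.getFSDataset().getNumFailedVolumes() > 0;
  }
}, 100, 30000);
{code}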
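Third, for item 4: since the two counts are taken at different moments, one option is to poll until they reconcile rather than asserting once; {{countRealBlocks()}} and {{totalBlocks}} are hypothetical stand-ins for the test's existing counting logic:

{code:title=sketch 3: retry the non-atomic verification}
// The physical-file count and the under-replicated count are sampled at
// different times, so poll until they add up instead of asserting once.
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    // force update of all the metric counts by calling computeDatanodeWork
    BlockManagerTestUtil.getComputedDatanodeWork(fsn.getBlockManager());
    long underReplicated = fsn.getUnderReplicatedBlocks();
    return countRealBlocks() + underReplicated >= totalBlocks;
  }
}, 100, 30000);
{code}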
[jira] [Commented] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
[ https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623509#comment-15623509 ]

Jitendra Nath Pandey commented on HDFS-11030:
---------------------------------------------

+1. Thanks for the patch [~liuml07]

> [...]
[jira] [Commented] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
[ https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593363#comment-15593363 ]

Hadoop QA commented on HDFS-11030:
----------------------------------

+1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 51s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 6m 39s | branch-2 passed |
| +1 | compile | 0m 38s | branch-2 passed with JDK v1.8.0_101 |
| +1 | compile | 0m 44s | branch-2 passed with JDK v1.7.0_111 |
| +1 | checkstyle | 0m 28s | branch-2 passed |
| +1 | mvnsite | 0m 52s | branch-2 passed |
| +1 | mvneclipse | 0m 16s | branch-2 passed |
| +1 | findbugs | 1m 57s | branch-2 passed |
| +1 | javadoc | 0m 54s | branch-2 passed with JDK v1.8.0_101 |
| +1 | javadoc | 1m 34s | branch-2 passed with JDK v1.7.0_111 |
| +1 | mvninstall | 0m 43s | the patch passed |
| +1 | compile | 0m 36s | the patch passed with JDK v1.8.0_101 |
| +1 | javac | 0m 36s | the patch passed |
| +1 | compile | 0m 40s | the patch passed with JDK v1.7.0_111 |
| +1 | javac | 0m 40s | the patch passed |
| +1 | checkstyle | 0m 24s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 50 unchanged - 7 fixed = 50 total (was 57) |
| +1 | mvnsite | 0m 49s | the patch passed |
| +1 | mvneclipse | 0m 13s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 2m 9s | the patch passed |
| +1 | javadoc | 0m 50s | the patch passed with JDK v1.8.0_101 |
| +1 | javadoc | 1m 32s | the patch passed with JDK v1.7.0_111 |
| +1 | unit | 48m 20s | hadoop-hdfs in the patch passed with JDK v1.7.0_111. |
| +1 | asflicense | 0m 22s | The patch does not generate ASF License warnings. |
| | | 134m 54s | |

|| Reason || Tests ||
| JDK v1.8.0_101 Failed junit tests | hadoop.hdfs.TestEncryptionZones |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:b59b8b7 |
| JIRA Issue | HDFS-11030 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12834547/HDFS-11030-branch-2.000.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 2ebe64861e73 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | branch-2 / 1f384
[jira] [Commented] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
[ https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587343#comment-15587343 ]

Mingliang Liu commented on HDFS-11030:
--------------------------------------

The code that sends the block report seems complex. It is the internal logic
of {{BPServiceActor}}, and we may have to update this code if that logic
changes. I think {{cluster.triggerBlockReport()}} is a good alternative.

{code:title=TestDataNodeVolumeFailure#testVolumeFailure()}
// make sure a block report is sent
DataNode dn = cluster.getDataNodes().get(1); // corresponds to dir data3
String bpid = cluster.getNamesystem().getBlockPoolId();
DatanodeRegistration dnR = dn.getDNRegistrationForBP(bpid);

Map<DatanodeStorage, BlockListAsLongs> perVolumeBlockLists =
    dn.getFSDataset().getBlockReports(bpid);

// Send block report
StorageBlockReport[] reports =
    new StorageBlockReport[perVolumeBlockLists.size()];
int reportIndex = 0;
for (Map.Entry<DatanodeStorage, BlockListAsLongs> kvPair :
    perVolumeBlockLists.entrySet()) {
  DatanodeStorage dnStorage = kvPair.getKey();
  BlockListAsLongs blockList = kvPair.getValue();
  reports[reportIndex++] = new StorageBlockReport(dnStorage, blockList);
}

cluster.getNameNodeRpc().blockReport(dnR, bpid, reports,
    new BlockReportContext(1, 0, System.nanoTime(), 0, false));
{code}
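For comparison, a minimal sketch of the alternative mentioned above, assuming the {{MiniDFSCluster#triggerBlockReports()}} test hook (note the plural name):

{code:title=sketch: simpler block-report trigger}
// Replaces the hand-rolled report assembly: ask every DataNode to send
// an immediate block report, letting BPServiceActor build it internally.
cluster.triggerBlockReports();
{code}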
> [...]
> This unit test has been there for years and it seldom fails, just because
> it's never triggered a real volume failure.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org