[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183487#comment-13183487 ]

stack commented on HBASE-5163:
------------------------------

The lock issue happens in a bunch of tests IIRC. Nice digging N. Can #2 fail ever? If so, I like #1.

TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)
------------------------------------------------------------------------------------------------------------------------------------

                Key: HBASE-5163
                URL: https://issues.apache.org/jira/browse/HBASE-5163
            Project: HBase
         Issue Type: Bug
         Components: test
   Affects Versions: 0.94.0
        Environment: all
           Reporter: nkeywal
           Assignee: nkeywal
           Priority: Minor

The stack is typically:
{noformat}
error message="Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked." type="java.io.IOException"
java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
  at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
  at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
  at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
  at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
  // ...
{noformat}

It can be reproduced without parallelization and without executing the other tests in the class. It seems to fail about 5% of the time.

This comes from the naming policy for the directories in MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* in the cluster, and does not take into account previous starts/stops:
{noformat}
for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
  if (manageDfsDirs) {
    File dir1 = new File(data_dir, "data"+(2*i+1));
    File dir2 = new File(data_dir, "data"+(2*i+2));
    dir1.mkdirs();
    dir2.mkdirs();
    // [...]
{noformat}

This means that if we want to stop/start a datanode, we should always stop the last one; otherwise the directory names will conflict. This test exhibits the behavior:
{noformat}
@Test
public void testMiniDFSCluster_startDataNode() throws Exception {
  assertTrue(dfsCluster.getDataNodes().size() == 2);

  // Works: as we kill the last datanode, we can now start a datanode
  dfsCluster.stopDataNode(1);
  dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);

  // Fails: as it's not the last datanode, the directory will conflict on
  // creation
  dfsCluster.stopDataNode(0);
  try {
    dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
    fail("There should be an exception because the directory already exists");
  } catch (IOException e) {
    assertTrue(e.getMessage().contains("The directory is already locked."));
    LOG.info("Expected (!) exception caught " + e.getMessage());
  }

  // Works: as we kill the last datanode, we can now restart 2 datanodes.
  // This brings us back to 2 nodes.
  dfsCluster.stopDataNode(0);
  dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
}
{noformat}

And this behavior is randomly triggered in testLogRollOnDatanodeDeath, because when we do
{noformat}
DatanodeInfo[] pipeline = getPipeline(log);
assertTrue(pipeline.length == fs.getDefaultReplication());
{noformat}
and then kill the datanodes in the pipeline, we will have:
- most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a new datanode that will reuse the now-available directory of '2'.
- sometimes: pipeline = 1 & 3. In this case, when we try to launch the new datanode, it fails because it wants to use the same directory as the still-alive '2'.
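To make the directory arithmetic concrete, here is a small standalone sketch (the class and method names are made up for illustration; this is not code from MiniDFSCluster or the test) that computes which data dirs the next startDataNodes() call would try to use, based only on the number of *currently* running datanodes, exactly as the loop quoted above does:
{code}
import java.util.ArrayList;
import java.util.List;

// Illustration only: mirrors the MiniDFSCluster#startDataNodes naming loop
// quoted above, to show why restarting a non-last datanode collides.
public class DataDirNamingDemo {
  static List<String> dirsForNewDatanodes(int curDatanodesNum, int numDataNodes) {
    List<String> dirs = new ArrayList<String>();
    for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
      dirs.add("data" + (2 * i + 1)); // first storage dir of datanode index i
      dirs.add("data" + (2 * i + 2)); // second storage dir of datanode index i
    }
    return dirs;
  }

  public static void main(String[] args) {
    // Start with 2 datanodes: index 0 owns data1/data2, index 1 owns data3/data4.
    // Stop datanode 0 (not the last one): only one datanode is now running, so
    // the next startDataNodes(1, ...) assigns index 1 -> data3/data4, which are
    // still locked by the live datanode. Hence "The directory is already locked."
    System.out.println(dirsForNewDatanodes(1, 1)); // prints [data3, data4]
  }
}
{code}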
[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183655#comment-13183655 ]

nkeywal commented on HBASE-5163:
--------------------------------

I've got a fix using a variation of #2. Tested 100 times without any failure.

The advantage of #1 for me is that it eliminates a quite tricky behavior, but the fix would then be outside HBase...
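The attached 5163.patch is not inlined in this thread, so the following is only a sketch of what a "variation of #2" (keeping the datanode restart deterministic) could look like: instead of stopDataNode() followed by startDataNodes() (which assigns directories from the current node count), capture the stopped node's properties and bring the *same* node back so it reuses its own storage directories. It assumes the MiniDFSCluster APIs stopDataNode(int), restartDataNode(DataNodeProperties) and waitActive(), and that DataNodeProperties is visible to test code in the Hadoop version used; the helper class and method names are invented for the example.
{code}
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.MiniDFSCluster.DataNodeProperties; // assumes this nested class is accessible

// Sketch only -- not the attached patch.
public class DeterministicDataNodeRestart {
  // Bounce a datanode without triggering the data-dir naming conflict:
  // the restarted node reuses the storage directories it already owned.
  public static void bounceDataNode(MiniDFSCluster dfsCluster, int index)
      throws Exception {
    DataNodeProperties props = dfsCluster.stopDataNode(index); // stop and remember the node
    dfsCluster.restartDataNode(props); // restart with the same conf and data dirs
    dfsCluster.waitActive();           // wait until the cluster is usable again
  }
}
{code}
Whether this matches the actual patch is not shown here; the point is only that a restart which reuses the stopped node's own directories sidesteps the naming policy entirely, whichever datanode is stopped.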
[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183682#comment-13183682 ]

Zhihong Yu commented on HBASE-5163:
-----------------------------------

+1 on patch.

Looped over the test and didn't encounter test failure.
[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183691#comment-13183691 ]

Hadoop QA commented on HBASE-5163:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12510114/5163.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    -1 javadoc. The javadoc tool appears to have generated -147 warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed these unit tests:
      org.apache.hadoop.hbase.replication.TestReplication
      org.apache.hadoop.hbase.mapreduce.TestImportTsv
      org.apache.hadoop.hbase.mapred.TestTableMapReduce
      org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/723//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/723//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/723//console

This message is automatically generated.
[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183908#comment-13183908 ]

stack commented on HBASE-5163:
------------------------------

This is kinda funny:
{code}
-Thread.sleep(1);
+
{code}
given your comment above it:
+// This function is synchronous, when it returns the node is killed.

Good stuff. (Woah, you get rid of two ten second pauses...)

+1 on commit.
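The patch itself is not quoted in this thread, but the general point stack is praising (dropping fixed Thread.sleep() pauses when the operation is synchronous, or polling for the condition the test actually needs instead of sleeping blindly) can be illustrated with a small generic helper. Everything below (class and method names) is made up for the example and is not from TestLogRolling or the attached patch.
{code}
// Illustration only: a bounded poll as an alternative to a blind fixed-length sleep.
public final class TestWait {

  public interface Condition {
    boolean isMet() throws Exception;
  }

  // Poll until the condition holds or the timeout expires.
  public static void waitFor(long timeoutMs, Condition cond) throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!cond.isMet()) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(100); // short poll instead of one long fixed pause
    }
  }

  private TestWait() {}
}
{code}
When the kill itself is synchronous, as the quoted patch comment states, even this is unnecessary and the pause can simply be removed.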
[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182965#comment-13182965 ]

Zhihong Yu commented on HBASE-5163:
-----------------------------------

Nice analysis. I like #2 better. We should always make the test scenario deterministic.