[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183487#comment-13183487
 ] 

stack commented on HBASE-5163:
--

The lock issue happens in a bunch of tests IIRC.  Nice digging N.  Can #2 fail 
ever?  If so, I like #1.

 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or 
 hadoop QA on trunk (The directory is already locked.)
 -

 Key: HBASE-5163
 URL: https://issues.apache.org/jira/browse/HBASE-5163
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

 The stack is typically:
 {noformat}
 error message="Cannot lock storage 
 /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
  The directory is already locked." 
 type="java.io.IOException" java.io.IOException: Cannot lock storage 
 /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
  The directory is already locked.
   at 
 org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
   at 
 org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
   at 
 org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
 // ...
 {noformat}
 It can be reproduced without parallelization and without executing the other 
 tests in the class. It seems to fail about 5% of the time.
 This comes from the naming policy for the directories in 
 MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
 in the cluster, and does not take into account previous starts/stops:
 {noformat}
 for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
   if (manageDfsDirs) {
     File dir1 = new File(data_dir, "data" + (2*i + 1));
     File dir2 = new File(data_dir, "data" + (2*i + 2));
     dir1.mkdirs();
     dir2.mkdirs();
   // [...]
 {noformat}
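 To make the collision concrete, here is a self-contained sketch of the index 
 arithmetic only (my own illustration, not code from MiniDFSCluster or from 
 any patch here): with two datanodes, node 0 owns data1/data2 and node 1 owns 
 data3/data4; once node 0 is stopped, the current count drops to 1, so the 
 next start computes data3/data4 again and collides with the still-running 
 node 1.
 {code}
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Set;

 // Hypothetical stand-in for MiniDFSCluster's directory naming, not the real class.
 public class DataDirNamingDemo {
   private final Set<String> lockedDirs = new HashSet<String>();            // dirs held by live nodes
   private final List<List<String>> nodes = new ArrayList<List<String>>();  // dirs per live node

   // Mirrors the loop quoted above: the directory index depends only on the
   // number of nodes *currently* running, not on how many were ever started.
   void startDataNodes(int numDataNodes) {
     int curDatanodesNum = nodes.size();
     for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
       String dir1 = "data" + (2 * i + 1);
       String dir2 = "data" + (2 * i + 2);
       if (!lockedDirs.add(dir1) || !lockedDirs.add(dir2)) {
         throw new IllegalStateException("The directory is already locked: " + dir1);
       }
       nodes.add(Arrays.asList(dir1, dir2));
     }
   }

   void stopDataNode(int index) {
     lockedDirs.removeAll(nodes.remove(index));
   }

   public static void main(String[] args) {
     DataDirNamingDemo cluster = new DataDirNamingDemo();
     cluster.startDataNodes(2); // node 0 -> data1,data2 ; node 1 -> data3,data4
     cluster.stopDataNode(0);   // node 1 keeps data3,data4; current count is now 1
     cluster.startDataNodes(1); // computes data3,data4 again -> "already locked"
   }
 }
 {code}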
 This means that if we want to stop/start a datanode, we should always stop 
 the last one; otherwise the names will conflict. This test exhibits the behavior:
 {noformat}
   @Test
   public void testMiniDFSCluster_startDataNode() throws Exception {
     assertTrue(dfsCluster.getDataNodes().size() == 2);

     // Works: we kill the last datanode, so we can start a new datanode
     dfsCluster.stopDataNode(1);
     dfsCluster
       .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);

     // Fails: it's not the last datanode, so the directory conflicts on creation
     dfsCluster.stopDataNode(0);
     try {
       dfsCluster
         .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
       fail("There should be an exception because the directory already exists");
     } catch (IOException e) {
       assertTrue(e.getMessage().contains("The directory is already locked."));
       LOG.info("Expected (!) exception caught " + e.getMessage());
     }

     // Works: we kill the last datanode, so we can now restart 2 datanodes.
     // This brings us back to 2 nodes.
     dfsCluster.stopDataNode(0);
     dfsCluster
       .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
   }
 {noformat}
 And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
 because when we do
 {noformat}
 DatanodeInfo[] pipeline = getPipeline(log);
 assertTrue(pipeline.length == fs.getDefaultReplication());
 {noformat}
 and then kill the datanodes in the pipeline, we will have:
  - most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a 
 new datanode that will reuse the available 2's directory.
  - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new 
 datanode, it fails because it wants to use the same directory as the still 
 alive '2' (see the sketch after this list).
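 Purely as an illustration of what a deterministic variant could look like 
 (an assumption on my side, not necessarily option #2 or the attached patch): 
 the stop/start pair could be replaced by restarting the same datanode slot, 
 so MiniDFSCluster never recomputes a directory index that a live node 
 already holds.
 {code}
 import java.io.IOException;
 import org.apache.hadoop.hdfs.MiniDFSCluster;

 public final class DeterministicRestart {
   // Restart a datanode "in place" instead of stopDataNode() + startDataNodes(),
   // so no new data<N> directory index is computed and nothing can collide with
   // a still-running node. Assumes MiniDFSCluster#restartDataNode(int) and
   // waitActive() are available in the Hadoop version used by the tests.
   static void bounceDataNode(MiniDFSCluster dfsCluster, int index) throws IOException {
     dfsCluster.restartDataNode(index); // same storage dirs and slot as before
     dfsCluster.waitActive();           // block until the node has re-registered
   }
 }
 {code}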
 

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)

2012-01-10 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183655#comment-13183655
 ] 

nkeywal commented on HBASE-5163:


I've got a fix using a variation of #2. Tested 100 times without any failure.
The advantage of #1 for me is that it eliminates a quite tricky behavior, but 
the fix would then be outside HBase...


[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183682#comment-13183682
 ] 

Zhihong Yu commented on HBASE-5163:
---

+1 on patch.
Looped over the test and didn't encounter test failure.


[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183691#comment-13183691
 ] 

Hadoop QA commented on HBASE-5163:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510114/5163.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -147 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 78 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.replication.TestReplication
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/723//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/723//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/723//console

This message is automatically generated.


[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183908#comment-13183908
 ] 

stack commented on HBASE-5163:
--

This is kinda funny:

{code}
-Thread.sleep(1);
+
{code}

given your comment just above it: "+// This function is synchronous, when it 
returns the node is killed."

Good stuff.

(Woah, you get rid of two ten second pauses...)

+1 on commit.
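
On the sleep removal more generally: when the call is documented as synchronous, 
the fixed pause can simply go away; when it is not, a bounded poll is the usual 
substitute in tests. A minimal generic helper (illustration only, not code from 
the patch):

{code}
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public final class WaitFor {
  // Poll a condition with an upper bound instead of a fixed Thread.sleep():
  // returns as soon as the condition holds, fails loudly if it never does.
  static void waitFor(BooleanSupplier condition, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new TimeoutException("condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(100); // short poll; total wait is bounded by timeoutMs
    }
  }
}
{code}

For example, waitFor(() -> dfsCluster.getDataNodes().size() == expected, 30000) 
instead of a blind ten-second pause.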


[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)

2012-01-09 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182965#comment-13182965
 ] 

Zhihong Yu commented on HBASE-5163:
---

Nice analysis.
I like #2 better. We should always make the test scenario deterministic.
