[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621295#comment-14621295
 ] 

Hudson commented on HBASE-13832:


SUCCESS: Integrated in HBase-1.2-IT #46 (See 
[https://builds.apache.org/job/HBase-1.2-IT/46/])
HBASE-13832 Procedure v2: try to roll the WAL master on sync failure before 
aborting (matteo.bertozzi: rev 246631d60e7b3b5ece039e67e8d3e9bb66d5f9a5)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestWALProcedureStoreOnHDFS.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStoreBase.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterFailoverWithProcedures.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStore.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/store/wal/TestWALProcedureStore.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureEnv.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/ProcedureTestingUtility.java


 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HBASE-13832-v6.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621294#comment-14621294
 ] 

Hudson commented on HBASE-13832:


FAILURE: Integrated in HBase-1.2 #60 (See 
[https://builds.apache.org/job/HBase-1.2/60/])
HBASE-13832 Procedure v2: try to roll the WAL master on sync failure before 
aborting (matteo.bertozzi: rev 246631d60e7b3b5ece039e67e8d3e9bb66d5f9a5)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureEnv.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/store/wal/TestWALProcedureStore.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStore.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterFailoverWithProcedures.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/ProcedureTestingUtility.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStoreBase.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestWALProcedureStoreOnHDFS.java


 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HBASE-13832-v6.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621198#comment-14621198
 ] 

Hudson commented on HBASE-13832:


SUCCESS: Integrated in HBase-1.3 #46 (See 
[https://builds.apache.org/job/HBase-1.3/46/])
HBASE-13832 Procedure v2: try to roll the WAL master on sync failure before 
aborting (matteo.bertozzi: rev 22f0f3026ffaaa2f00b64616876ae817abc422d7)
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStoreBase.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/ProcedureTestingUtility.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestWALProcedureStoreOnHDFS.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterFailoverWithProcedures.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureEnv.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/store/wal/TestWALProcedureStore.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStore.java


 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HBASE-13832-v6.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621192#comment-14621192
 ] 

Hudson commented on HBASE-13832:


SUCCESS: Integrated in HBase-1.1 #580 (See 
[https://builds.apache.org/job/HBase-1.1/580/])
HBASE-13832 Procedure v2: try to roll the WAL master on sync failure before 
aborting (matteo.bertozzi: rev ac95c1a0b2dfea38ee414fa72c8f5133b36eeb22)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureEnv.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/ProcedureTestingUtility.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterFailoverWithProcedures.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStore.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestWALProcedureStoreOnHDFS.java


 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HBASE-13832-v6.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621311#comment-14621311
 ] 

Hudson commented on HBASE-13832:


FAILURE: Integrated in HBase-1.3-IT #32 (See 
[https://builds.apache.org/job/HBase-1.3-IT/32/])
HBASE-13832 Procedure v2: try to roll the WAL master on sync failure before 
aborting (matteo.bertozzi: rev 22f0f3026ffaaa2f00b64616876ae817abc422d7)
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/ProcedureTestingUtility.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestWALProcedureStoreOnHDFS.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/store/wal/TestWALProcedureStore.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStoreBase.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterFailoverWithProcedures.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStore.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureEnv.java


 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HBASE-13832-v6.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621312#comment-14621312
 ] 

Hudson commented on HBASE-13832:


SUCCESS: Integrated in HBase-TRUNK #6641 (See 
[https://builds.apache.org/job/HBase-TRUNK/6641/])
HBASE-13832 Procedure v2: try to roll the WAL master on sync failure before 
aborting (matteo.bertozzi: rev ae1f485ee8cb76965882b7a9c1c1647ec2912d08)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterFailoverWithProcedures.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/store/wal/TestWALProcedureStore.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureEnv.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestWALProcedureStoreOnHDFS.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStoreBase.java
* 
hbase-procedure/src/test/java/org/apache/hadoop/hbase/procedure2/ProcedureTestingUtility.java
* 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/ProcedureStore.java


 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HBASE-13832-v6.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619752#comment-14619752
 ] 

Hadoop QA commented on HBASE-13832:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12744362/HBASE-13832-v6.patch
  against master branch at commit f5ad736282c8c9c27b14131919d60b72834ec9e4.
  ATTACHMENT ID: 12744362

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 12 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 

 {color:red}-1 core zombie tests{color}.  There are 1 zombie test(s):   
at 
org.apache.qpid.server.queue.ProducerFlowControlTest.testCapacityExceededCausesBlock(ProducerFlowControlTest.java:123)
at 
org.apache.qpid.test.utils.QpidBrokerTestCase.runBare(QpidBrokerTestCase.java:323)
at org.apache.qpid.test.utils.QpidTestCase.run(QpidTestCase.java:155)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14709//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14709//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14709//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14709//console

This message is automatically generated.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HBASE-13832-v6.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This 

[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-08 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619723#comment-14619723
 ] 

Enis Soztutar commented on HBASE-13832:
---

+1 for v6 patch. 

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HBASE-13832-v6.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615561#comment-14615561
 ] 

Hadoop QA commented on HBASE-13832:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12743753/HBASE-13832-v5.patch
  against master branch at commit 608c3aa15c34b9014f99e857b374645db58cbbe3.
  ATTACHMENT ID: 12743753

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 12 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS

 {color:red}-1 core zombie tests{color}.  There are 5 zombie test(s):   
at 
org.apache.hadoop.hbase.wal.TestWALSplit.testThreadingSlowWriterSmallBuffer(TestWALSplit.java:902)
at 
org.apache.hadoop.hbase.wal.TestWALSplit.testSplitDeletedRegion(TestWALSplit.java:722)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14679//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14679//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14679//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14679//console

This message is automatically generated.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HDFSPipeline.java, hbase-13832-test-hang.patch, hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA

[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-06 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615871#comment-14615871
 ] 

Enis Soztutar commented on HBASE-13832:
---

bq. org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS
This is the new test from the patch. Seems related. 
bq. before with the while (isRunning()) we were spinning after the signal, to 
make clear that there were no other run of the syncLoop(). in this case we may 
do another round of the loop and execute stuff which in theory is not what you 
expect after sending the abort signal.
I guess we can rethrow the exception after here: 
{code}
+} catch (Throwable t) {
+  syncException.compareAndSet(null, t);
{code}

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, 
 HDFSPipeline.java, hbase-13832-test-hang.patch, hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613622#comment-14613622
 ] 

Hadoop QA commented on HBASE-13832:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12743415/HBASE-13832-v4.patch
  against master branch at commit e640f1e76af8f32015f475629610da127897f01e.
  ATTACHMENT ID: 12743415

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 1 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1899 checkstyle errors (more than the master's current 1898 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14661//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14661//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14661//artifact/patchprocess/checkstyle-aggregate.html

Javadoc warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14661//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14661//console

This message is automatically generated.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HDFSPipeline.java, 
 hbase-13832-test-hang.patch, hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611530#comment-14611530
 ] 

Hadoop QA commented on HBASE-13832:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12743215/hbase-13832-v3.patch
  against master branch at commit f0e29c49a1f5f3773ba03b822805d863c149b443.
  ATTACHMENT ID: 12743215

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1898 checkstyle errors (more than the master's current 1897 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:red}-1 site{color}.  The patch appears to cause mvn post-site goal 
to fail.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.util.TestProcessBasedCluster
  org.apache.hadoop.hbase.mapreduce.TestImportExport
  org.apache.hadoop.hbase.TestRegionRebalancing
  
org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS

 {color:red}-1 core zombie tests{color}.  There are 5 zombie test(s):   
at 
org.apache.hadoop.hbase.mapreduce.TestTableSnapshotInputFormat.testInitTableSnapshotMapperJobConfig(TestTableSnapshotInputFormat.java:146)
at 
org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScanBase.testScan(TestTableInputFormatScanBase.java:244)
at 
org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan2.testScanOBBToQPP(TestTableInputFormatScan2.java:57)
at 
org.apache.hadoop.hbase.mapreduce.TestMultithreadedTableMapper.testMultithreadedTableMapper(TestMultithreadedTableMapper.java:133)
at 
org.apache.hadoop.hbase.mapreduce.TestLoadIncrementalHFiles.testRegionCrossingRowColBloom(TestLoadIncrementalHFiles.java:142)
at 
org.apache.hadoop.hbase.mapreduce.TestImportExport.testExportScannerBatching(TestImportExport.java:271)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14648//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14648//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14648//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14648//console

This message is automatically generated.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 

[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-02 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612488#comment-14612488
 ] 

Enis Soztutar commented on HBASE-13832:
---

bq. to get the same behavior you need to force running to false when you set 
syncException. so you prevent other procedure to be added.
Not sure whether we gain by ensuring that running is set to false before the 
next execution for syncLoop. Wal store will abort when the master calls abort. 
Before this happens, concurrent calls to {{pushData()}} will still get the 
exception because the exception from sync is not cleared at all. So the 
semantics is that if {{snyc()}} + wal roll fails, we effectively start 
rejecting all requests for {{pushData()}}, which is kind of similar to making 
sure to check isRunning(). 

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-02 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612713#comment-14612713
 ] 

Enis Soztutar commented on HBASE-13832:
---

bq. we know that we must die there. why not exit from that loop?
We can call {{stop(true)}} directly from Sync loop it is fine. It was not there 
in your original patch, that is why I did not change it. 
bq. with the actual implementation of abort we know that running will be false 
after a sendAbortProcessSignal() but that may not be the case in the future
The store can cause an abort to the whole procedure executor or the master 
itself. Right now, it does this through the ProcedureStoreListener calls. I'm 
fine with sending an {{Abortable}} directly to the store. These parts are 
mainly coming from the intiail proc v2 patch. 

Does the test rely on 1s / 2s timing? It may end up being flaky in slow jenkins 
hosts. Other than that +1 for the v4 patch. If you want to do the abort 
changes, we can do it here or a follow up. 

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HDFSPipeline.java, 
 hbase-13832-test-hang.patch, hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-02 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612722#comment-14612722
 ] 

Matteo Bertozzi commented on HBASE-13832:
-

calling directly stop() was not what I was proposing. what I was saying was 
just exiting from the syncLoop(). before with the while (isRunning()) we were 
spinning after the signal, to make clear that there were no other run of the 
syncLoop(). in this case we may do another round of the loop and execute stuff 
which in theory is not what you expect after sending the abort signal.

the test does not rely on the 1s/2s timing, it passes even without. but I was 
trying to make the problem more clear  to someone looking the code.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HBASE-13832-v4.patch, HDFSPipeline.java, 
 hbase-13832-test-hang.patch, hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-07-01 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611452#comment-14611452
 ] 

Matteo Bertozzi commented on HBASE-13832:
-

rethrowing the exception is ok to me, but that is not equivalent to the while 
(isRunning())

in one of your comment above you said: I think there is no guarantee that when 
master.abort() returns, the WALProcedureStore may still be not stopped.
which is true in a generic way, even though currently we know that at least 
running will be set to false immediately.
what the while (isRunning()) is doing, was just spinning until running was set 
in case the abort will not be sync as it is now. 

to get the same behavior you need to force running to false when you set 
syncException. so you prevent other procedure to be added. 

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
 hbase-13832-v3.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609413#comment-14609413
 ] 

Hadoop QA commented on HBASE-13832:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12742965/HBASE-13832-v2.patch
  against master branch at commit 42d5ef017d3d629e6ca9ee93e15ac4f0f9e00ce1.
  ATTACHMENT ID: 12742965

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1898 checkstyle errors (more than the master's current 1897 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14631//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14631//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14631//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14631//console

This message is automatically generated.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HBASE-13832-v2.patch, HDFSPipeline.java, hbase-13832-test-hang.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606642#comment-14606642
 ] 

Hadoop QA commented on HBASE-13832:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12742624/hbase-13832-test-hang.patch
  against master branch at commit 4f06279caaefa7dac795fb88d6711cc234a01600.
  ATTACHMENT ID: 12742624

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14609//console

This message is automatically generated.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HDFSPipeline.java, hbase-13832-test-hang.patch


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-26 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603860#comment-14603860
 ] 

Enis Soztutar commented on HBASE-13832:
---

Looks good overall. 

Why RTE, rather than rethrow? 
{code}
+  throw new RuntimeException(e);
{code}

I think we may have an issue about {{syncSlots()}} never throwing an exception. 
If a procedure executor thread calls {{pushData()}} which internally will wait 
on {{syncCond}}. In case of an exception in syncer thread, {{syncSlots}} will 
start the master abort process, but returning from master abort does not 
guarantee that the proc executor or rest of proc store is stopped. Since 
{{syncSlots}} does not rethrow the exception,  {{syncLoop}} will continue on 
and call  {{syncCond.signalAll(); }} which will cause the {{pushData}} to 
return and think that the proc state is persisted, while it is not. 

This new code seems to be a fix for it: 
{code}
+  if (!isRunning()) {
+throw new RuntimeException(sync aborted);
+  }
{code}
but I think there is no guarantee that when {{master.abort()}} returns, the 
WALProcedureStore may still be not stopped which means that there is a time 
window that the procedure can execute assuming persisted state. I maybe missing 
something in the above analysis though. 

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-26 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603897#comment-14603897
 ] 

Matteo Bertozzi commented on HBASE-13832:
-

{quote}Why RTE, rather than rethrow?{quote}
to avoid unecessary headache to the caller. the store will succeed or the 
master will abort. you don't have to write extra code to handle exceptions, 
because the master is going down anyway. 

{quote}but I think there is no guarantee that when master.abort() returns, the 
WALProcedureStore may still be not stopped{quote}
we spin until we have running == false
{code}
sendAbortProcessSignal();
while (isRunning()) Thread.yield();
{code}

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-26 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603914#comment-14603914
 ] 

Enis Soztutar commented on HBASE-13832:
---

bq. we spin until we have running == false
Ok that would work. This is still a very indirect way of re-throwing the 
exception, and seems fragile. Can't we simply get an atomic ref to the 
exception in sync thread and rethrow it to the pushDate() caller, kind of like 
simulating SyncFuture ?  

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-25 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601868#comment-14601868
 ] 

Matteo Bertozzi commented on HBASE-13832:
-

even with the patch the master will not start until the 3rd data node is back.
in theory you should ping pong between the backup masters until something a DN 
is available.
what the patch does is just retry for some time hoping that a 3rd data node is 
online, before giving up.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602383#comment-14602383
 ] 

Hadoop QA commented on HBASE-13832:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12742004/HBASE-13832-v1.patch
  against master branch at commit 2ed058554c0b6d6da0388497562e254107f13d67.
  ATTACHMENT ID: 12742004

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1902 checkstyle errors (more than the master's current 1901 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14582//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14582//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14582//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14582//console

This message is automatically generated.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
 HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-25 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601855#comment-14601855
 ] 

Nick Dimiduk commented on HBASE-13832:
--

Failure of master to start is a problem. Bumping priority and setting some 
fix-version targets.

[~jinghe] are you able to reproduce? Can you take the attached patch for a spin?

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602097#comment-14602097
 ] 

Hadoop QA commented on HBASE-13832:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12739966/HBASE-13832-v0.patch
  against master branch at commit e6ed79219966ce0dac3bc748261fce9478aa7550.
  ATTACHMENT ID: 12739966

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1908 checkstyle errors (more than the master's current 1907 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.util.TestHBaseFsck

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14577//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14577//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14577//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14577//console

This message is automatically generated.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
Priority: Critical
 Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1

 Attachments: HBASE-13832-v0.patch, HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-23 Thread Jerry He (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598604#comment-14598604
 ] 

Jerry He commented on HBASE-13832:
--

I have recently seen similar failures on our HBase 1.1.0 clusters:
{noformat}
2015-06-22 18:20:20,240 INFO  
[B.defaultRpcServer.handler=36,queue=0,port=6] master.HMaster: 
Client=bigsql/null create 'bigsql.smoke_1820163205', {TABLE_ATTRIBUTES = 
{METADATA = {'hbase.columns.mapping' = 
'[{key:{c:[{n:c1,t:int}]}},{cmap:[{c:[{n:c2,t:int},{n:c3,t:int}],f:cf1,q:cq1}]},{def:{enc:{encName:binary},sep:\x5Cu}},{sqlc:c1,c2,c3}]'}},
 {NAME = 'cf1', DATA_BLOCK_ENCODING = 'NONE', BLOOMFILTER = 'ROW', 
REPLICATION_SCOPE = '0', VERSIONS = '1', COMPRESSION = 'NONE', MIN_VERSIONS 
= '0', TTL = 'FOREVER', KEEP_DELETED_CELLS = 'FALSE', BLOCKSIZE = '65536', 
IN_MEMORY = 'false', BLOCKCACHE = 'true'}
2015-06-22 18:20:20,341 ERROR [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: sync slot failed, abort.
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[9.30.255.244:50010,DS-5fd53475-f1d1-4141-b732-4df6998996ca,DISK],
 
DatanodeInfoWithStorage[9.30.255.245:50010,DS-9b9ed517-99e8-4d92-8d85-42650b6e97db,DISK]],
 
original=[DatanodeInfoWithStorage[9.30.255.244:50010,DS-5fd53475-f1d1-4141-b732-4df6998996ca,DISK],
 
DatanodeInfoWithStorage[9.30.255.245:50010,DS-9b9ed517-99e8-4d92-8d85-42650b6e97db,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:918)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:984)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1131)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2015-06-22 18:20:20,341 FATAL [WALProcedureStoreSyncThread] master.HMaster: 
Master server abort: loaded coprocessors are: []
2015-06-22 18:20:20,341 INFO  [WALProcedureStoreSyncThread] 
regionserver.HRegionServer: STOPPED: The Procedure Store lost the lease
2015-06-22 18:20:20,343 INFO  [master/bdls142.svl.ibm.com/9.30.255.242:6] 
regionserver.HRegionServer: Stopping infoServer
2015-06-22 18:20:20,350 INFO  [master/bdls142.svl.ibm.com/9.30.255.242:6] 
mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60010
2015-06-22 18:20:20,351 INFO  [master/bdls142.svl.ibm.com/9.30.255.242:6] 
procedure2.ProcedureExecutor: Stopping the procedure executor
2015-06-22 18:20:20,352 INFO  [master/bdls142.svl.ibm.com/9.30.255.242:6] 
wal.WALProcedureStore: Stopping the WAL Procedure Store
2015-06-22 18:20:20,352 WARN  [master/bdls142.svl.ibm.com/9.30.255.242:6] 
wal.WALProcedureStore: Unable to write the trailer: Failed to replace a bad 
datanode on the existing pipeline due to no more good datanodes being available 
to try. (Nodes: 
current=[DatanodeInfoWithStorage[9.30.255.244:50010,DS-5fd53475-f1d1-4141-b732-4df6998996ca,DISK],
 
DatanodeInfoWithStorage[9.30.255.245:50010,DS-9b9ed517-99e8-4d92-8d85-42650b6e97db,DISK]],
 
original=[DatanodeInfoWithStorage[9.30.255.244:50010,DS-5fd53475-f1d1-4141-b732-4df6998996ca,DISK],
 
DatanodeInfoWithStorage[9.30.255.245:50010,DS-9b9ed517-99e8-4d92-8d85-42650b6e97db,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.
2015-06-22 18:20:20,352 ERROR [master/bdls142.svl.ibm.com/9.30.255.242:6] 
wal.WALProcedureStore: Unable to close the stream
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[9.30.255.244:50010,DS-5fd53475-f1d1-4141-b732-4df6998996ca,DISK],
 
DatanodeInfoWithStorage[9.30.255.245:50010,DS-9b9ed517-99e8-4d92-8d85-42650b6e97db,DISK]],
 
original=[DatanodeInfoWithStorage[9.30.255.244:50010,DS-5fd53475-f1d1-4141-b732-4df6998996ca,DISK],
 
DatanodeInfoWithStorage[9.30.255.245:50010,DS-9b9ed517-99e8-4d92-8d85-42650b6e97db,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:918)
at 

[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-23 Thread Jerry He (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598608#comment-14598608
 ] 

Jerry He commented on HBASE-13832:
--

But in most cases, we have 3 data nodes.

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
 Attachments: HBASE-13832-v0.patch, HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-22 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596245#comment-14596245
 ] 

Ted Yu commented on HBASE-13832:


bq. hbase.procedure.store.wal.max.retry.before.roll;
'retry' - 'retries'
{code}
611   } catch (IOException e) {
612 LOG.warn(Unable to roll the log, e);
{code}
Please log i, the roll retry count.
{code}
147   public static class TestProcedure extends ProcedureVoid {
{code}
This class can be private, right ?

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
 Attachments: HBASE-13832-v0.patch, HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-19 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594160#comment-14594160
 ] 

Enis Soztutar commented on HBASE-13832:
---

Thanks Matteo, I'll take a look. 

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Matteo Bertozzi
 Attachments: HBASE-13832-v0.patch, HDFSPipeline.java


 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-03 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14571697#comment-14571697
 ] 

Stephen Yuan Jiang commented on HBASE-13832:


[~mbertozzi] [~enis] FYI

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Reporter: Stephen Yuan Jiang
Assignee: Stephen Yuan Jiang

 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

2015-06-03 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14571737#comment-14571737
 ] 

Enis Soztutar commented on HBASE-13832:
---

I think we should copy the same semantics for the FSHlog sync / log roll 
behavior. What we have in FSHlog / LogRoller is this: 
 - Log syncer catches IOException, and logs it, and requests log roll. 
 - Log roller tries to roll the log, and if it gets an IOException in file 
close, or generic IOException while rolling, it aborts the RS. 

The reason to have the same semantics is that we do not want to cause the 
master to abort prematurely in case of a recoverable IOException like the one 
in the jira title. If the RS can ride over generic IOExceptions, the master 
should do the same.  

 Procedure V2: master fail to start due to WALProcedureStore sync failures 
 when HDFS data nodes count is low
 ---

 Key: HBASE-13832
 URL: https://issues.apache.org/jira/browse/HBASE-13832
 Project: HBase
  Issue Type: Sub-task
  Components: master, proc-v2
Affects Versions: 2.0.0, 1.1.0, 1.2.0
Reporter: Stephen Yuan Jiang
Assignee: Stephen Yuan Jiang

 when the data node  3, we got failure in WALProcedureStore#syncLoop() during 
 master start.  The failure prevents master to get started.  
 {noformat}
 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] 
 wal.WALProcedureStore: Sync slot failed, abort.
 java.io.IOException: Failed to replace a bad datanode on the existing 
 pipeline due to no more good datanodes being available to try. (Nodes: 
 current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  
 DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]],
  
 original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK],
  DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-
 490ece56c772,DISK]]). The current failed datanode replacement policy is 
 DEFAULT, and a client may configure this via 
 'dfs.client.block.write.replace-datanode-on-failure.policy'  in its 
 configuration.
   at 
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
 {noformat}
 One proposal is to implement some similar logic as FSHLog: if IOException is 
 thrown during syncLoop in WALProcedureStore#start(), instead of immediate 
 abort, we could try to roll the log and see whether this resolve the issue; 
 if the new log cannot be created or more exception from rolling the log, we 
 then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)