Hello,

This is originally an HBase issue, but please allow me to forward it to the Hadoop mailing list, since the underlying error is a checksum failure on a file in HDFS.
We are facing a serious issue with our production system. We run on Windows Azure infrastructure, and yesterday all of our VMs restarted unexpectedly. This caused our HBase cluster (1 Master + 2 Region Servers) to crash, and now the cluster goes down again every time I try to start it. This is what I found in the master logs:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-07-28 09:33:10,052 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fdatanode-2%2C16020%2C1466263181091-splitting%2Fdatanode-2%252C16020%252C1466263181091.default.1469654596681
2016-07-28 09:33:10,052 WARN [MASTER_SERVER_OPERATIONS-NameNode:16000-3] master.SplitLogManager: error while splitting logs in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting] installed = 1 but only 0 done
2016-07-28 09:33:10,053 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-3] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for datanode-2,16020,1466263181091, will retry
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting] Task = installed = 1 done = 0 error = 1
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
        ... 4 more
2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3] master.HMaster: Master server abort: loaded coprocessors are: []
2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3] master.HMaster: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for datanode-2,16020,1466263181091, will retry
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting] Task = installed = 1 done = 0 error = 1
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
        ... 4 more
2016-07-28 09:33:10,055 INFO [MASTER_SERVER_OPERATIONS-NameNode:16000-3] regionserver.HRegionServer: STOPPED: Caught throwable while processing event M_SERVER_SHUTDOWN
2016-07-28 09:33:10,055 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-2] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: Server is stopped
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:194)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

And this is what the log of one of the Region Servers shows:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-07-28 09:33:10,033 WARN [RS_LOG_REPLAY_OPS-datanode-2:16020-0] regionserver.SplitLogWorker: log splitting of WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 failed, returning error
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 file=/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:882)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:563)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:793)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:298)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:267)
        at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:839)
        at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:763)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
        at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
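The block the Region Server cannot read is blk_1073770935_30121 of that WAL file. To see which datanodes hold its replicas and to verify the on-disk copies directly, I am thinking of running something along these lines (only a sketch: /data/dfs/dn is a placeholder for whatever dfs.datanode.data.dir points to on our datanodes, and the "hdfs debug" subcommands may not exist on older Hadoop releases):

    # Ask the NameNode which blocks make up the WAL and which datanodes hold each replica
    hdfs fsck /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 \
        -files -blocks -locations

    # On each datanode that holds a replica, locate the block file and its checksum (.meta) file
    # under the local data directory (placeholder path)
    find /data/dfs/dn -name 'blk_1073770935*'

    # Verify one on-disk replica against its stored checksums (paths taken from the find output)
    BLOCK_FILE=$(find /data/dfs/dn -name 'blk_1073770935' | head -1)
    META_FILE=$(find /data/dfs/dn -name 'blk_1073770935_30121.meta' | head -1)
    hdfs debug verifyMeta -meta "$META_FILE" -block "$BLOCK_FILE"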
Now, the file "/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681" does exist on HDFS, and running "hdfs fsck /" reports "The filesystem under path '/' is HEALTHY". But when I try to read the file, I get checksum errors:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] at 5120
16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] at 5120
16/07/28 09:57:30 INFO hdfs.DFSClient: Could not obtain BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node: java.io.IOException: No live nodes contain block BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK], DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] Dead nodes: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]. Will get new block locations from namenode and retry...
16/07/28 09:57:30 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 2671.5631767728437 msec.
16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] at 5120
16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] at 5120
16/07/28 09:57:33 INFO hdfs.DFSClient: Could not obtain BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node: java.io.IOException: No live nodes contain block BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK], DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] Dead nodes: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]. Will get new block locations from namenode and retry...
16/07/28 09:57:33 WARN hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 8977.03196466135 msec.
16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] at 5120
16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] at 5120
16/07/28 09:57:42 INFO hdfs.DFSClient: Could not obtain BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node: java.io.IOException: No live nodes contain block BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK], DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] Dead nodes: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]. Will get new block locations from namenode and retry...
16/07/28 09:57:42 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 13504.266287461855 msec.
cat: Checksum error: /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 at 5120 exp: 123018583 got: 48747596
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
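Since both replicas of blk_1073770935_30121 fail checksum verification at the same offset (5120), I am considering two steps: first copying the WAL out with client-side checksum verification disabled, to see how much of it is still readable, and then sidelining the damaged file so that log splitting can finish. This is only a sketch of what I have in mind; the /hbase/corrupt-wals directory and the /tmp target path are just names I picked:

    # Copy the WAL to local disk while skipping client-side checksum verification,
    # to see how much of the file can still be read
    hdfs dfs -get -ignoreCrc \
        /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 \
        /tmp/datanode-2.default.1469654596681

    # Move the damaged WAL out of the -splitting directory so the master can finish
    # splitting the remaining logs (accepting that the edits in this file are lost)
    hdfs dfs -mkdir -p /hbase/corrupt-wals
    hdfs dfs -mv \
        /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 \
        /hbase/corrupt-wals/

Would moving the file like this cause anything worse than losing the edits that were only in that WAL, or is setting hbase.hlog.split.skip.errors the cleaner way to let the split skip it?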
Any help with recovering our HBase cluster would be greatly appreciated.