Hello,

This is originally an HBase issue, but please allow me to forward it to the Hadoop mailing list, since the underlying error is a checksum failure on a file in HDFS.
We are facing a serious issue with our production system. We run on Windows Azure infrastructure, and yesterday all of our VMs restarted unexpectedly. This caused our HBase cluster (1 Master + 2 Region Servers) to crash, and now the cluster goes down again every time I try to start it. This is what I found in the master logs:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-07-28 09:33:10,052 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fdatanode-2%2C16020%2C1466263181091-splitting%2Fdatanode-2%252C16020%252C1466263181091.default.1469654596681
2016-07-28 09:33:10,052 WARN [MASTER_SERVER_OPERATIONS-NameNode:16000-3] master.SplitLogManager: error while splitting logs in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting] installed = 1 but only 0 done
2016-07-28 09:33:10,053 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-3] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for datanode-2,16020,1466263181091, will retry
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting] Task = installed = 1 done = 0 error = 1
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
        ... 4 more
2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3] master.HMaster: Master server abort: loaded coprocessors are: []
2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3] master.HMaster: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for datanode-2,16020,1466263181091, will retry
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting] Task = installed = 1 done = 0 error = 1
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
        ... 4 more
2016-07-28 09:33:10,055 INFO [MASTER_SERVER_OPERATIONS-NameNode:16000-3] regionserver.HRegionServer: STOPPED: Caught throwable while processing event M_SERVER_SHUTDOWN
2016-07-28 09:33:10,055 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-2] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: Server is stopped
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:194)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

And this is what the log of one of the Region Servers shows:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-07-28 09:33:10,033 WARN [RS_LOG_REPLAY_OPS-datanode-2:16020-0] regionserver.SplitLogWorker: log splitting of WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 failed, returning error
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 file=/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:882)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:563)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:793)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:298)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:267)
        at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:839)
        at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:763)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
        at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
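The block the Region Server cannot read is blk_1073770935_30121 of that WAL file. To see which datanodes hold its replicas and to verify the on-disk copies directly, I am thinking of running something along these lines (only a sketch: /data/dfs/dn is a placeholder for whatever dfs.datanode.data.dir points to on our datanodes, and the "hdfs debug" subcommands may not exist on older Hadoop releases):

    # Ask the NameNode which blocks make up the WAL and which datanodes hold each replica
    hdfs fsck /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 \
        -files -blocks -locations

    # On each datanode that holds a replica, locate the block file and its checksum (.meta) file
    # under the local data directory (placeholder path)
    find /data/dfs/dn -name 'blk_1073770935*'

    # Verify one on-disk replica against its stored checksums (paths taken from the find output)
    BLOCK_FILE=$(find /data/dfs/dn -name 'blk_1073770935' | head -1)
    META_FILE=$(find /data/dfs/dn -name 'blk_1073770935_30121.meta' | head -1)
    hdfs debug verifyMeta -meta "$META_FILE" -block "$BLOCK_FILE"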
Now, the file "/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681" does exist on HDFS, and running "hdfs fsck /" reports "The filesystem under path '/' is HEALTHY". But when I try to read the file, I get checksum errors:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] at 5120
16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] at 5120
16/07/28 09:57:30 INFO hdfs.DFSClient: Could not obtain BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node: java.io.IOException: No live nodes contain block BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK], DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] Dead nodes: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]. Will get new block locations from namenode and retry...
16/07/28 09:57:30 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 2671.5631767728437 msec.
16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] at 5120
16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] at 5120
16/07/28 09:57:33 INFO hdfs.DFSClient: Could not obtain BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node: java.io.IOException: No live nodes contain block BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK], DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] Dead nodes: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]. Will get new block locations from namenode and retry...
16/07/28 09:57:33 WARN hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 8977.03196466135 msec.
16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] at 5120
16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] at 5120
16/07/28 09:57:42 INFO hdfs.DFSClient: Could not obtain BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node: java.io.IOException: No live nodes contain block BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK], DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK] Dead nodes: DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK] DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]. Will get new block locations from namenode and retry...
16/07/28 09:57:42 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 13504.266287461855 msec.
cat: Checksum error: /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 at 5120 exp: 123018583 got: 48747596
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
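Since both replicas of blk_1073770935_30121 fail checksum verification at the same offset (5120), I am considering two steps: first copying the WAL out with client-side checksum verification disabled, to see how much of it is still readable, and then sidelining the damaged file so that log splitting can finish. This is only a sketch of what I have in mind; the /hbase/corrupt-wals directory and the /tmp target path are just names I picked:

    # Copy the WAL to local disk while skipping client-side checksum verification,
    # to see how much of the file can still be read
    hdfs dfs -get -ignoreCrc \
        /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 \
        /tmp/datanode-2.default.1469654596681

    # Move the damaged WAL out of the -splitting directory so the master can finish
    # splitting the remaining logs (accepting that the edits in this file are lost)
    hdfs dfs -mkdir -p /hbase/corrupt-wals
    hdfs dfs -mv \
        /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 \
        /hbase/corrupt-wals/

Would moving the file like this cause anything worse than losing the edits that were only in that WAL, or is setting hbase.hlog.split.skip.errors the cleaner way to let the split skip it?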
Any help with recovering our HBase cluster would be greatly appreciated.