Hi,
While load testing HBase, the entire cluster crashed with errors like
these in the HBase logs:
2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 bad datanode[0]
10.1.104.1:50010
2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 in pipeline
10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode
10.1.104.1:50010
2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Failed
recovery attempt #2 from primary datanode 10.1.104.2:50010
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703)
at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1104)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy4.commitBlockSynchronization(Unknown Source)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1104)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy8.recoverBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461)
2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 failed because
recovery from primary datanode 10.1.104.2:50010 failed 3 times.
Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010.
Will retry...
2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 bad datanode[0]
10.1.104.1:50010
2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 in pipeline
10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode
10.1.104.1:50010
2011-04-10 10:14:31,905 WARN org.apache.hadoop.hdfs.DFSClient: Failed
recovery attempt #3 from primary datanode 10.1.104.2:50010
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
        ... (same stack trace as recovery attempt #2 above)
2011-04-10 10:14:31,905 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 failed because
recovery from primary datanode 10.1.104.2:50010 failed 4 times.
Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010.
Will retry...
The Datanode log for this block ID shows:
# grep 1213779416283711358 /var/log/hadoop/hadoop-hadoop-datanode-hadoop1-s02.log
2011-04-10 07:29:23,858 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_1213779416283711358_54194 src: /10.1.104.5:40533 dest:
/10.1.104.2:50010
2011-04-10 10:12:58,242 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
receiveBlock for block blk_1213779416283711358_54194
java.io.EOFException: while trying to read 871 bytes
2011-04-10 10:12:58,242 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
block blk_1213779416283711358_54194 Interrupted.
2011-04-10 10:12:58,242 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
block blk_1213779416283711358_54194 terminating
2011-04-10 10:12:58,242 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_1213779416283711358_54194 received exception java.io.EOFException:
while trying to read 871 bytes
2011-04-10 10:12:58,288 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
oldblock=blk_1213779416283711358_54194(length=162696),
newblock=blk_1213779416283711358_54249(length=162696),
datanode=10.1.104.2:50010
2011-04-10 10:12:58,288 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_1213779416283711358_54249 of size 162696 as part of lease
recovery.
2011-04-10 10:12:58,766 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
oldblock=blk_1213779416283711358_54249(length=162696),
newblock=blk_1213779416283711358_54250(length=162696),
datanode=10.1.104.2:50010
2011-04-10 10:12:58,803 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_1213779416283711358_54250 src: /10.1.104.5:33159 dest:
/10.1.104.2:50010
2011-04-10 10:12:58,805 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Reopen Block for
append blk_1213779416283711358_54250
2011-04-10 10:12:58,827 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: setBlockPosition
trying to set position to 162696 for block
blk_1213779416283711358_54250 which is not a multiple of
bytesPerChecksum 512
2011-04-10 10:12:58,827 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
computePartialChunkCrc sizePartialChunk 392 block
blk_1213779416283711358_54250 offset in block 162304 offset in
metafile 1275
2011-04-10 10:12:58,827 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Read in partial CRC
chunk from disk for block blk_1213779416283711358_54250
2011-04-10 10:12:58,828 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/10.1.104.5:33159, dest: /10.1.104.2:50010, bytes: 162696, op:
HDFS_WRITE, cliID:
DFSClient_hb_m_hadoop1-hbm1.farm-ny.gigya.com:60000_1301599306867,
offset: 0, srvID: DS-1554836204-10.1.104.2-50010-1298986316425,
blockid: blk_1213779416283711358_54250, duration: 21639000
2011-04-10 10:12:58,828 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
block blk_1213779416283711358_54250 terminating
2011-04-10 10:13:31,080 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
blk_1213779416283711358_54250 file
/hadoop/dfs/data/current/subdir61/subdir17/blk_1213779416283711358 for
deletion
2011-04-10 10:13:31,250 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
blk_1213779416283711358_54250 at file
/hadoop/dfs/data/current/subdir61/subdir17/blk_1213779416283711358
2011-04-10 10:14:29,811 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_1213779416283711358_54194,
targets=[10.1.104.5:50010, 10.1.104.2:50010])
2011-04-10 10:14:29,814 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.5:50010
2011-04-10 10:14:29,814 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.2:50010
2011-04-10 10:14:29,815 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 50020, call recoverBlock(blk_1213779416283711358_54194,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@c754900) from
10.1.104.1:34481: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
2011-04-10 10:14:29,828 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_1213779416283711358_54194,
targets=[10.1.104.5:50010, 10.1.104.2:50010])
2011-04-10 10:14:29,830 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.5:50010
2011-04-10 10:14:29,830 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.2:50010
2011-04-10 10:14:29,831 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 50020, call recoverBlock(blk_1213779416283711358_54194,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@516485d9) from
10.1.104.1:34489: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
[the recoverBlock / "Block (=blk_1213779416283711358_54194) not found"
cycle above repeats about once a second]
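From the DataNode log it looks like the block's generation stamp was bumped from _54194 to _54249 and then _54250 during lease recovery, and the _54250 replica was later deleted, while the client kept calling recoverBlock for the original _54194, which would explain the "Block not found" errors. A quick way to list which generation stamps a DataNode recorded for a block (shown here on an inline excerpt; in practice point grep at the real log file):

```shell
# Distinct generation stamps seen for the block in a log excerpt.
# grep -o prints only the matched block_genstamp strings; sort -u dedups.
grep -o 'blk_1213779416283711358_[0-9]*' <<'EOF' | sort -u
Receiving block blk_1213779416283711358_54194 src: /10.1.104.5:40533
oldblock=blk_1213779416283711358_54194(length=162696)
newblock=blk_1213779416283711358_54250(length=162696)
EOF
```

On the excerpt above this prints the two stamps _54194 and _54250, making the mismatch between what the client asks for and what the DataNode last held easy to spot.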
Running "hadoop fsck /" says the filesystem is healthy.
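Worth noting: by default fsck ignores files that are open for write, and the blocks above belong to files that were still being written, so a clean "hadoop fsck /" may not tell the whole story here. Assuming the stock fsck options, a more telling check would be something like:

```shell
# Include open-for-write files and show per-file block and replica
# placement detail (must be run against the live cluster):
hadoop fsck / -openforwrite -files -blocks -locations
```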
The same kind of errors happened on all the region servers, but for
different blocks and against different datanodes; eventually the entire
cluster shut itself down. This is the second time I've had this problem
while running load tests on the system. The previous time, once it
started happening it would recur every hour, even without any load on
the cluster, presumably during compaction. I wasn't sure whether it was
somehow my fault, so I truncated the entire table, which stopped the
crashes, but now it's happening again.
Any idea what's going on?
-eran