Thanks for the quick reply, Nicolas. We are using HBase 0.94 on Hadoop 1.0.3.
I have uploaded the logs here:
Region Server log: http://pastebin.com/QEQ22UnU
Data Node log: http://pastebin.com/DF0JNL8K
Appreciate your help in figuring this out.
Thanks,
Jay
On 7/30/12 1:02 PM, N Keywal wrote:
Hi Jay,
Yes, the whole log would be interesting, plus the logs of the datanode
on the same box as the dead RS.
What are your HBase & HDFS versions?
The RS should be immune to HDFS errors. There are known issues (see
HDFS-3701), but it seems you have something different...
This:
java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949
remote=/10.128.204.225:50010]
seems to say that the error was between the RS and the datanode on the same box?
Nicolas
On Mon, Jul 30, 2012 at 6:43 PM, Jay T<[email protected]> wrote:
A couple of our region servers (in a 16-node cluster) crashed due to
underlying DataNode errors. I am trying to understand how errors on remote
data nodes impact other region server processes.
*To briefly describe what happened:*
1) Cluster was in operation. All 16 nodes were up, reads and writes were
happening extensively.
2) Nodes 7 and 8 were shut down for maintenance. (No graceful shutdown: the DN
and RS services were still running, and the power was simply pulled.)
3) Nodes 2 and 5 flushed, and the DFS client started reporting errors. From the
log it looks like DFS blocks were being replicated to the nodes that had been
shut down (7 and 8); since replication could not complete, the DFS client
raised errors on 2 and 5, and eventually the RS itself died.
The question I am trying to get answered is: is a region server immune to
errors on remote data nodes (that are part of its replication pipeline) or
not?
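
To make clear what I am assuming, here is the behavior I expect written out as a tiny stand-alone simulation: on a remote datanode failure the client abandons the block, excludes the bad node, rebuilds the pipeline and retries, so the writer itself survives. This is just my mental model in plain Java, not the actual DFSClient code; all names (PipelineRetrySketch, node7, MAX_ATTEMPTS, etc.) are made up.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineRetrySketch {

    // Thrown by the fake "pipeline" below when it hits a dead node.
    static class BadDatanodeException extends IOException {
        final String badNode;
        BadDatanodeException(String badNode) {
            super("Bad connect ack with firstBadLink as " + badNode);
            this.badNode = badNode;
        }
    }

    static final int MAX_ATTEMPTS = 3;                  // assumed retry budget
    final Set<String> dead;                             // nodes that were powered off
    final Set<String> excluded = new HashSet<String>(); // like "Excluding datanode ..."

    PipelineRetrySketch(Set<String> dead) {
        this.dead = dead;
    }

    // Write one block: on a remote-node failure, abandon the block, exclude the
    // bad node and retry with a fresh pipeline; give up only when attempts run out.
    void writeBlock(List<String> clusterNodes) throws IOException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            List<String> pipeline = choosePipeline(clusterNodes, 3);
            try {
                sendToPipeline(pipeline);
                System.out.println("Block written via " + pipeline);
                return;                                 // writer survives the dead nodes
            } catch (BadDatanodeException e) {
                System.out.println("Abandoning block, excluding datanode " + e.badNode);
                excluded.add(e.badNode);
            }
        }
        throw new IOException("gave up after " + MAX_ATTEMPTS + " pipeline attempts");
    }

    // Stand-in for the namenode: first 'replication' nodes that are not excluded.
    List<String> choosePipeline(List<String> nodes, int replication) throws IOException {
        List<String> pipeline = new ArrayList<String>();
        for (String node : nodes) {
            if (!excluded.contains(node)) {
                pipeline.add(node);
                if (pipeline.size() == replication) {
                    return pipeline;
                }
            }
        }
        throw new IOException("not enough live datanodes for a pipeline");
    }

    // Stand-in for streaming the block through the pipeline.
    void sendToPipeline(List<String> pipeline) throws BadDatanodeException {
        for (String node : pipeline) {
            if (dead.contains(node)) {
                throw new BadDatanodeException(node);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Set<String> dead = new HashSet<String>(Arrays.asList("node7:50010", "node8:50010"));
        List<String> cluster = Arrays.asList("node7:50010", "node8:50010",
                "node2:50010", "node5:50010", "node9:50010");
        new PipelineRetrySketch(dead).writeBlock(cluster);
    }
}

If this model is roughly right, losing nodes 7 and 8 should not have killed the region servers on 2 and 5, which is why the log below puzzles me.
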
*Part of the Region Server Log:* (Node 5)
2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.128.204.225:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.128.204.228:50010
2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-316956372096761177_489798
2012-07-26 18:53:15,246 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.128.204.228:50010
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124)
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=4046717645, memsize=256.5m, into tmp file hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,907 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124 to hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,921 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124, entries=1137956, sequenceid=4046717645, filesize=13.2m
2012-07-26 18:53:32,048 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949 remote=/10.128.204.225:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2857)
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 bad datanode[0] 10.128.204.225:50010
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 in pipeline 10.128.204.225:50010, 10.128.204.221:50010, 10.128.204.227:50010: bad datanode 10.128.204.225:50010
I can pastebin the entire log, but this is when things started going wrong for
Node 5; eventually the shutdown hook for the RS kicked in and the RS was shut
down.
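
In case it helps whoever looks at the pastebins, a trivial way to pull the related lines out of the RS and DN logs and line them up by block id is something like the following (the file names are placeholders for wherever the logs live; the block id is the one from the excerpt above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Print every line from each log that mentions a given block id, so the
// RS-side and DN-side events can be compared by timestamp.
public class GrepBlock {
    public static void main(String[] args) throws IOException {
        String blockId = "blk_5116092240243398556";                       // block from the RS log
        String[] logs = { "regionserver-node5.log", "datanode-node5.log" }; // placeholder paths
        for (String log : logs) {
            BufferedReader in = new BufferedReader(new FileReader(log));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.contains(blockId)) {
                        System.out.println(log + ": " + line);
                    }
                }
            } finally {
                in.close();
            }
        }
    }
}

(A plain grep over both files does the same thing; the point is just to see the two sides of the pipeline failure next to each other.)
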
Any help in troubleshooting this is greatly appreciated.
Thanks,
Jay