A couple of our region servers (in a 16-node cluster) crashed due to underlying Data Node errors. I am trying to understand how errors on remote data nodes impact other region server processes.

To briefly describe what happened:
1) The cluster was in operation. All 16 nodes were up, and reads and writes were happening extensively.
2) Nodes 7 and 8 were shut down for maintenance. (There was no graceful shutdown; the DN and RS services were still running and the power was simply pulled.)
3) Nodes 2 and 5 flushed and the DFS client started reporting errors. From the log it seems DFS blocks were being replicated to the nodes that had been shut down (7 and 8), and since replication could not complete, the DFS client raised errors on 2 and 5 and eventually the RS itself died.

The question I am trying to get an answer to is: is a Region Server immune to errors on remote data nodes (that are part of the replication pipeline), or not?
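To make the question concrete, here is a minimal sketch of how I understand the DFS client's pipeline recovery to work. The class and method names are hypothetical (this is not the actual DFSClient code), the node addresses are just the ones from my log, and it only models the abandon-block / exclude-datanode / retry behaviour visible in the excerpt below, plus my assumption that a flush or WAL write that exhausts its retries causes the RS to abort.

// Illustrative sketch only -- hypothetical names, not the real DFSClient API.
// Models: bad connect ack -> abandon block -> exclude datanode -> retry with a
// new pipeline; the write only fails hard when no healthy pipeline can be built.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineWriteSketch {

    static final int MAX_RETRIES = 3; // assumed retry budget for pipeline setup

    // Hypothetical stand-in for the NameNode handing out a pipeline of datanodes.
    static List<String> allocatePipeline(Set<String> excluded, List<String> candidates, int replication) {
        List<String> pipeline = new ArrayList<>();
        for (String node : candidates) {
            if (!excluded.contains(node) && pipeline.size() < replication) {
                pipeline.add(node);
            }
        }
        return pipeline;
    }

    // Hypothetical connect check: nodes that were powered off never ack.
    static boolean connectAckOk(String node, Set<String> deadNodes) {
        return !deadNodes.contains(node);
    }

    public static void main(String[] args) {
        // NameNode still believes all of these are live; .228 was pulled for maintenance.
        List<String> candidates = List.of(
                "10.128.204.225", "10.128.204.228", "10.128.204.221", "10.128.204.227");
        Set<String> actuallyDead = Set.of("10.128.204.228");
        Set<String> excluded = new HashSet<>();

        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            List<String> pipeline = allocatePipeline(excluded, candidates, 3);
            System.out.println("Attempt " + attempt + ": pipeline " + pipeline);

            String firstBadLink = null;
            for (String node : pipeline) {
                if (!connectAckOk(node, actuallyDead)) {
                    firstBadLink = node; // "Bad connect ack with firstBadLink as ..."
                    break;
                }
            }
            if (firstBadLink == null) {
                System.out.println("Pipeline established; flush/WAL write proceeds.");
                return;
            }
            System.out.println("Abandoning block, excluding datanode " + firstBadLink);
            excluded.add(firstBadLink); // retry with the unreachable node excluded
        }
        // My assumption: if retries are exhausted (or the local datanode itself fails,
        // as in the SocketTimeoutException below), the write fails and the RS aborts
        // rather than risk losing edits.
        System.out.println("Write failed after retries; RS would abort.");
    }
}

If that picture is right, the exclusion/retry in the first few log lines should have shielded the RS from the powered-off nodes, which is why the later failure against the local datanode puzzles me.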
Part of the Region Server log (Node 5):

2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.128.204.225:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.128.204.228:50010
2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-316956372096761177_489798
2012-07-26 18:53:15,246 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.128.204.228:50010
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124)
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=4046717645, memsize=256.5m, into tmp file hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,907 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124 to hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,921 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124, entries=1137956, sequenceid=4046717645, filesize=13.2m
2012-07-26 18:53:32,048 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949 remote=/10.128.204.225:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2857)
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 bad datanode[0] 10.128.204.225:50010
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 in pipeline 10.128.204.225:50010, 10.128.204.221:50010, 10.128.204.227:50010: bad datanode 10.128.204.225:50010

I can pastebin the entire log, but this is where things started going wrong for Node 5; eventually the RS shutdown hook ran and the Region Server was shut down.

Any help in troubleshooting this is greatly appreciated.

Thanks,
Jay