A couple of our region servers (in a 16-node cluster) crashed due to underlying Data Node errors. I am trying to understand how errors on remote data nodes impact other region server processes.

To briefly describe what happened:
1) The cluster was in operation. All 16 nodes were up, and reads and writes were happening extensively.
2) Nodes 7 and 8 were shut down for maintenance. (There was no graceful shutdown; the DN and RS services were still running and the power was simply pulled.)
3) Nodes 2 and 5 flushed and the DFS client started reporting errors. From the log it seems DFS blocks were being replicated to the nodes that had been shut down (7 and 8), and since replication could not complete, the DFS client raised errors on 2 and 5 and eventually the RS itself died.

The question I am trying to get an answer to is: is a Region Server immune to errors on remote data nodes (that are part of the replication pipeline), or not?
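To make the question concrete, here is a minimal sketch of how I understand the DFS client's pipeline recovery to work. The class and method names are hypothetical (this is not the actual DFSClient code), the node addresses are just the ones from my log, and it only models the abandon-block / exclude-datanode / retry behaviour visible in the excerpt below, plus my assumption that a flush or WAL write that exhausts its retries causes the RS to abort.

// Illustrative sketch only -- hypothetical names, not the real DFSClient API.
// Models: bad connect ack -> abandon block -> exclude datanode -> retry with a
// new pipeline; the write only fails hard when no healthy pipeline can be built.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineWriteSketch {

    static final int MAX_RETRIES = 3; // assumed retry budget for pipeline setup

    // Hypothetical stand-in for the NameNode handing out a pipeline of datanodes.
    static List<String> allocatePipeline(Set<String> excluded, List<String> candidates, int replication) {
        List<String> pipeline = new ArrayList<>();
        for (String node : candidates) {
            if (!excluded.contains(node) && pipeline.size() < replication) {
                pipeline.add(node);
            }
        }
        return pipeline;
    }

    // Hypothetical connect check: nodes that were powered off never ack.
    static boolean connectAckOk(String node, Set<String> deadNodes) {
        return !deadNodes.contains(node);
    }

    public static void main(String[] args) {
        // NameNode still believes all of these are live; .228 was pulled for maintenance.
        List<String> candidates = List.of(
                "10.128.204.225", "10.128.204.228", "10.128.204.221", "10.128.204.227");
        Set<String> actuallyDead = Set.of("10.128.204.228");
        Set<String> excluded = new HashSet<>();

        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            List<String> pipeline = allocatePipeline(excluded, candidates, 3);
            System.out.println("Attempt " + attempt + ": pipeline " + pipeline);

            String firstBadLink = null;
            for (String node : pipeline) {
                if (!connectAckOk(node, actuallyDead)) {
                    firstBadLink = node; // "Bad connect ack with firstBadLink as ..."
                    break;
                }
            }
            if (firstBadLink == null) {
                System.out.println("Pipeline established; flush/WAL write proceeds.");
                return;
            }
            System.out.println("Abandoning block, excluding datanode " + firstBadLink);
            excluded.add(firstBadLink); // retry with the unreachable node excluded
        }
        // My assumption: if retries are exhausted (or the local datanode itself fails,
        // as in the SocketTimeoutException below), the write fails and the RS aborts
        // rather than risk losing edits.
        System.out.println("Write failed after retries; RS would abort.");
    }
}

If that picture is right, the exclusion/retry in the first few log lines should have shielded the RS from the powered-off nodes, which is why the later failure against the local datanode puzzles me.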
Part of the Region Server log (Node 5):

2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.128.204.225:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.128.204.228:50010
2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-316956372096761177_489798
2012-07-26 18:53:15,246 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.128.204.228:50010
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124)
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=4046717645, memsize=256.5m, into tmp file hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,907 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124 to hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,921 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124, entries=1137956, sequenceid=4046717645, filesize=13.2m
2012-07-26 18:53:32,048 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949 remote=/10.128.204.225:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2857)
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 bad datanode[0] 10.128.204.225:50010
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 in pipeline 10.128.204.225:50010, 10.128.204.221:50010, 10.128.204.227:50010: bad datanode 10.128.204.225:50010

I can pastebin the entire log, but this is where things started going wrong for Node 5; eventually the RS shutdown hook ran and the Region Server was shut down.

Any help in troubleshooting this is greatly appreciated.

Thanks,
Jay