On Thu, May 10, 2012 at 1:17 AM, Eran Kutner <[email protected]> wrote:
> Here is an example of the HBase log (showing only errors):
>
> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception for block
> blk_-8928911185099340956_5189425
> java.io.IOException: Bad response 1 for block
> blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
>
> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> Exception: java.io.InterruptedIOException: Interruped while waiting for IO
> on channel java.nio.channels.SocketChannel[connected
> local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
>
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8928911185099340956_5189425 bad datanode[2]
> 10.1.104.6:50010
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8928911185099340956_5189425 in pipeline
> 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode
> 10.1.104.6:50010

Above is a complaint about a DN in a write pipeline. Anything else around the above logging? You sure the write didn't go through after the dfsclient purged the 'bad datanode'?
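To check, grepping for the block id in both logs would tell you (log locations below are a guess; adjust for your layout):

    # On the RS host (10.1.104.9): did the dfsclient finish the block
    # after it dropped .6 from the pipeline?
    grep 'blk_-8928911185099340956' /var/log/hbase/hbase-*-regionserver-*.log

    # On 10.1.104.6: what was the datanode doing around 03:34?
    grep 'blk_-8928911185099340956' /var/log/hadoop/hadoop-*-datanode-*.log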
A few minutes pass and then you get the below....

> 2012-05-10 03:48:30,174 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422,
> load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983):
> regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired

Says your session expired with zk. You think there was a big GC pause here? You collecting GC logging? Can you check it?

> This is from 10.1.104.9 (same machine running the region server that
> crashed):

You probably want to look at .6 and see why it went sour. It was reported as the bad DN in the pipeline.

What version of hbase? Do you have ganglia or tsdb up and running on your cluster so you can dig in across these times of fail?
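If GC logging isn't on yet, a minimal hbase-env.sh addition along these lines would get you started (flag names are for HotSpot JVMs of this vintage; the log path is just an example):

    # Log every collection with timestamps, plus total stopped time
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
        -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
        -XX:+PrintGCApplicationStoppedTime \
        -Xloggc:/var/log/hbase/gc-regionserver.log"

Then look for a stop-the-world pause around 03:48 that runs longer than your zookeeper.session.timeout; a pause that long would expire the session just as in the abort above.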
St.Ack