I only see the bad datanode error on the one node right before zookeeper brought it down.
On Thu, Jan 27, 2011 at 10:53 AM, Ted Yu <[email protected]> wrote:
> About bad datanode error, I found 164 occurrences in 7-node dev cluster
> hbase 0.90 region server logs.
> In our 14 node staging cluster running hbase 0.20.6, I found none.
>
> Both use cdh3b2 hadoop.
>
> On Thu, Jan 27, 2011 at 6:48 AM, Wayne <[email protected]> wrote:
>
> > We have got .90 up and running well, but again after 24 hours of loading
> > a node went down. Under it all I assume it is a GC issue, but the GC
> > logging rolls every < 60 minutes so I can never see logs from 5 hours ago
> > (working on getting Scribe up to solve that). Most of our issues are a
> > node being marked as dead after being un-responsive. It often starts with
> > a socket timeout. We can turn up the timeout for zookeeper but that is
> > not dealing with the issue.
> >
> > Here is the first sign of trouble. Is the 1 min 34 second gap below most
> > likely a stop the world GC?
> >
> > 2011-01-27 07:00:43,716 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
> > Roll /hbase/.logs/x.x.x.6,60020,1295969329357/x.x.x.6%3A60020.1296111623011,
> > entries=242, filesize=69508440. New hlog
> > /hbase/.logs/x.x.x.6,60020,1295969329357/x.x.x.6%3A60020.1296111643436
> > 2011-01-27 07:02:17,663 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception for block
> > blk_-5705652521953118952_104835java.net.SocketTimeoutException: 69000
> > millis timeout while waiting for channel to be ready for read. ch :
> > java.nio.channels.SocketChannel[connected local=/x.x.x.6:48141
> > remote=/x.x.x.6:50010]
> >
> > It is followed by zookeeper complaining due to lack of a response.
> >
> > 2011-01-27 07:02:17,665 INFO org.apache.zookeeper.ClientCnxn: Client
> > session timed out, have not heard from server in 94590ms for sessionid
> > 0x2dbdc88700000e, closing socket connection and attempting reconnect
> >
> > There is also a message about the data node.
> >
> > 2011-01-27 07:02:17,665 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_4267667433820446273_104837 bad datanode[0]
> > x.x.x.6:50010
> >
> > And eventually the node is brought down.
> >
> > 2011-01-27 07:02:17,783 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> > server...
> >
> > The data node also shows some errors.
> >
> > 2011-01-27 07:02:17,667 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(x.x.x.6:50010,
> > storageID=DS-1438948528-x.x.x.6-50010-1295969305669, infoPort=50075,
> > ipcPort=50020):DataXceiver java.net.SocketException: Connection reset
> >
> > Any help, advice, ideas, or guesses would be greatly appreciated. Can
> > anyone sustain 30-40k writes/node/sec for days/weeks on end without
> > using the bulk loader? Am I rolling a rock uphill against the reality
> > of the JVM?
> >
> > Thanks.
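
A small pause-detector thread inside the region server JVM would answer the stop-the-world question above even when the GC log has already rolled: it sleeps for a short, fixed interval and logs whenever the sleep overshoots by a large margin, which only happens when the whole JVM was stopped. Below is only a rough sketch, nothing HBase-specific; the class name, interval, and threshold are made up for illustration.

// Logs an approximate duration and timestamp whenever the JVM appears to have
// been stopped (e.g. by a stop-the-world GC) for longer than THRESHOLD_MS.
public class GcPauseLogger implements Runnable {

    private static final long SLEEP_MS = 500;      // polling interval
    private static final long THRESHOLD_MS = 5000; // report pauses longer than this

    @Override
    public void run() {
        long last = System.currentTimeMillis();
        while (true) {
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                return; // stop quietly if interrupted
            }
            long now = System.currentTimeMillis();
            long overshoot = now - last - SLEEP_MS;
            if (overshoot > THRESHOLD_MS) {
                System.err.println(new java.util.Date(now)
                    + " JVM was unresponsive for roughly " + overshoot + " ms");
            }
            last = now;
        }
    }

    // Start the detector as a daemon thread from whatever startup hook is handy.
    public static void start() {
        Thread t = new Thread(new GcPauseLogger(), "gc-pause-logger");
        t.setDaemon(true);
        t.start();
    }

    // Standalone demo: start the detector and keep the process alive.
    public static void main(String[] args) throws InterruptedException {
        start();
        Thread.currentThread().join();
    }
}

If a pause reported this way lines up with the 07:00:43 to 07:02:17 gap, it was almost certainly a stop-the-world collection. Running the region server with the standard HotSpot flags -XX:+PrintGCDateStamps and -XX:+PrintGCApplicationStoppedTime also makes the GC log timestamps directly comparable to the region server log, assuming the rolled GC logs can be kept around long enough to read them.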
