You are looking at it the wrong way. Per http://hbase.apache.org/book.html#trouble.general, always walk up the log to the first exception. In this case it's a session timeout. Whatever happens next is most probably a side effect of that.
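If it helps to make that concrete, here is a rough sketch of "walking up the log" (not an official tool; the log path and search patterns are assumptions, adjust them to your install). It prints the first line that looks like the start of trouble and stops, since everything logged after that first hit is usually fallout:

import java.io.BufferedReader;
import java.io.FileReader;

// Rough sketch: print the first line that mentions a ZooKeeper session
// timeout or an ERROR/FATAL event, then stop. Later errors are usually
// consequences of that first one.
public class FirstTrouble {
    public static void main(String[] args) throws Exception {
        // Assumed default path for illustration only; pass your own as an argument.
        String path = args.length > 0 ? args[0] : "/var/log/hbase/hbase-regionserver.log";
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("Client session timed out")
                        || line.contains(" ERROR ")
                        || line.contains(" FATAL ")) {
                    System.out.println(line);   // earliest sign of trouble
                    break;
                }
            }
        } finally {
            in.close();
        }
    }
}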
To help debug your issue, I would suggest reading this section of the reference guide: http://hbase.apache.org/book.html#trouble.rs.runtime

J-D

On Tue, May 21, 2013 at 7:17 PM, Cheng Su <[email protected]> wrote:

> Hi all.
>
> I have a small HBase cluster with 3 physical machines. On 192.168.1.80 there are an HMaster and a region server; on 81 and 82 there is one region server each.
>
> The region server on 80 could not sync its HLog after a datanode access exception and started to shut down. The datanode itself did not shut down and responds to other requests normally. I'll paste the logs below.
>
> My questions are:
>
> 1. Why does this exception cause the region server to shut down? Can I prevent it?
>
> 2. Is there any tool (a shell command would be best, like hadoop dfsadmin -report) to monitor HBase region servers, i.e. check whether they are alive or dead? I have done some research; Nagios/Ganglia can do such things, but I actually just want to know whether a region server is alive or dead, so they are a little overqualified. And I'm not using CDH, so I don't think I can use Cloudera Manager.
>
> Here are the logs.
>
> HBase master:
>
> 2013-05-21 17:03:32,675 ERROR org.apache.hadoop.hbase.master.HMaster: Region server ^@^@hadoop01,60020,1368774173179 reported a fatal error:
> ABORTING region server hadoop01,60020,1368774173179: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
> Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:369)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:266)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
>
> Region Server:
>
> 2013-05-21 17:00:16,895 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x3eb14c67540002, closing socket connection and attempting reconnect
> 2013-05-21 17:00:35,896 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x13eb14ca4bb0000, closing socket connection and attempting reconnect
> 2013-05-21 17:03:31,498 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_9188414668950016309_4925046 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.80:57020 remote=/192.168.1.82:50010]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2784)
> 2013-05-21 17:03:31,520 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9188414668950016309_4925046 bad datanode[0] 192.168.1.82:50010
> 2013-05-21 17:03:32,315 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /192.168.1.82:2100
> 2013-05-21 17:03:32,316 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop03/192.168.1.82:2100, initiating session
> 2013-05-21 17:03:32,317 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop03/192.168.1.82:2100, sessionid = 0x13eb14ca4bb0000, negotiated timeout = 180000
> 2013-05-21 17:03:32,497 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> java.io.IOException: Reflection
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:230)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1091)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1195)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1057)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:228)
>         ... 4 more
> Caused by: java.io.IOException: DFSOutputStream is closed
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3483)
>         at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>         at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
>         ... 8 more
> 2013-05-21 17:03:32,497 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
> 2013-05-21 17:03:32,498 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: new createWriter -- HADOOP-6840 -- not available
> 2013-05-21 17:03:32,548 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop03/192.168.1.82:2100
> 2013-05-21 17:03:32,548 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop03/192.168.1.82:2100, initiating session
> 2013-05-21 17:03:32,549 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x3eb14c67540002 has expired, closing socket connection
> 2013-05-21 17:03:32,549 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hadoop01,60020,1368774173179: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:369)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:266)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
> 2013-05-21 17:03:32,549 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> 2013-05-21 17:03:32,675 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requestsPerSecond=0, numberOfOnlineRegions=111, numberOfStores=111, numberOfStorefiles=109, storefileIndexSizeMB=1, rootIndexSizeKB=1141, totalStaticIndexSizeKB=50806, totalStaticBloomSizeKB=0, memstoreSizeMB=412, readRequestsCount=302086613, writeRequestsCount=1802402, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=1631, maxHeapMB=2991, blockCacheSizeMB=988.97, blockCacheFreeMB=207.71, blockCacheCount=15710, blockCacheHitCount=2579547598, blockCacheMissCount=4348105, blockCacheEvictedCount=2734969, blockCacheHitRatio=99%, blockCacheHitCachingRatio=99%, hdfsBlocksLocalityIndex=100
> 2013-05-21 17:03:33,014 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://hadoop01:51001/hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369127012498, syncFs=true, hflush=false
> 2013-05-21 17:03:33,014 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
> java.io.IOException: All datanodes 192.168.1.82:50010 are bad. Aborting...
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2913)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:2406)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2610)
> 2013-05-21 17:03:33,015 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
> 2013-05-21 17:03:33,015 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369075567280, entries=0, filesize=124. for /hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369127012498
> 2013-05-21 17:03:37,586 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
> 2013-05-21 17:03:38,386 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
>
> datanode:
>
> 2013-05-21 17:03:36,431 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_9188414668950016309_4925046 java.io.IOException: Connection reset by peer
> 2013-05-21 17:03:36,411 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1430, bytes: 6662742, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000234_0, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_4644468068464468915_1594795, duration: 1103797113
> 2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_9188414668950016309_4925046 0 : Thread is interrupted.
> 2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_9188414668950016309_4925046 terminating
> 2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_9188414668950016309_4925046 received exception java.io.IOException: Connection reset by peer
> 2013-05-21 17:03:36,671 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.82:50010, storageID=DS-1178868297-192.168.1.82-50010-1350976945636, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(Unknown Source)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
>         at sun.nio.ch.IOUtil.read(Unknown Source)
>         at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read1(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.read(Unknown Source)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>         at java.lang.Thread.run(Unknown Source)
> 2013-05-21 17:03:36,693 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1428, bytes: 14134272, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000230_1, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_8190430407487842768_2648292, duration: 2088034034
> 2013-05-21 17:03:36,784 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1432, bytes: 8982528, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000228_1, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_4425180518832382171_1731508, duration: 1284097425
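On the second question in the quoted mail (just telling whether a region server is alive or dead, without deploying Nagios/Ganglia): a minimal sketch using the plain client API, assuming an HBase 0.9x-era client jar and an hbase-site.xml on the classpath. It asks the master which servers it currently considers live or dead:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Prints the master's view of the cluster: one line per live region server
// and one per server the master has declared dead.
public class RegionServerLiveness {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            ClusterStatus status = admin.getClusterStatus();
            for (ServerName sn : status.getServers()) {
                System.out.println("LIVE " + sn);
            }
            for (ServerName sn : status.getDeadServerNames()) {
                System.out.println("DEAD " + sn);
            }
        } finally {
            admin.close();
        }
    }
}

If a one-liner to poll from cron is all you want, the HBase shell exposes the same information, e.g. echo "status 'simple'" | hbase shell, which is close in spirit to hadoop dfsadmin -report.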
