Hi all.
I have a small HBase cluster of 3 physical machines. 192.168.1.80 runs the
HMaster and a region server; 192.168.1.81 and 192.168.1.82 each run a region
server.
The region server on .80 could not sync its HLog after a datanode access
exception, and then started to shut down. The datanode itself was not down
and kept responding to other requests normally. I'll paste the logs below.
My questions are:
1. Why does this exception cause the region server to shut down, and can I
prevent it? (I put a config guess right after this list.)
2. Is there any tool (a shell command would be best, like hadoop dfsadmin
-report) that can monitor an HBase region server, i.e. check whether it is
alive or dead?
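For question 1, my working guess is that the abort is triggered by the
ZooKeeper session expiring (the master then treats the region server as
dead), and that the stalled HLog sync is what kept the process silent for
that long. If that's right, maybe a longer session timeout in hbase-site.xml
would ride over short stalls. A sketch of what I mean (the value is made up,
and I know the ZooKeeper ensemble has to allow it):

    <!-- hbase-site.xml: sketch, untested -->
    <property>
      <name>zookeeper.session.timeout</name>
      <!-- milliseconds; must fall within the min/max session bounds the
           ZooKeeper ensemble derives from its tickTime -->
      <value>300000</value>
    </property>

Though I suppose a bigger timeout only delays detection of a region server
that really is dead, so I'm not sure it's a good trade-off.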
For question 2, I have done some research and it seems Nagios/Ganglia can do
this kind of thing, but I really just want to know whether a region server
is alive or dead, so they feel overqualified. And since I'm not running CDH,
I don't think Cloudera Manager is an option either. The closest idea I have
so far is sketched below.
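What I have in mind is just scripting the HBase shell's status command from
cron. A sketch (the exact wording of the summary line is my assumption, not
verified):

    # sketch, untested as a probe: the shell's status command asks the
    # master for cluster state; 'simple' lists each live server plus a
    # dead-server count
    echo "status 'simple'" | $HBASE_HOME/bin/hbase shell

    # the plain summary form prints a one-liner (something like
    # "3 servers, 0 dead, 36.0000 average load") that a cron job
    # could grep for a non-zero dead count
    echo "status" | $HBASE_HOME/bin/hbase shell | grep dead

Is that what people usually do, or is there something more direct?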
Here are the logs.
HBase master:
2013-05-21 17:03:32,675 ERROR org.apache.hadoop.hbase.master.HMaster: Region server ^@^@hadoop01,60020,1368774173179 reported a fatal error:
ABORTING region server hadoop01,60020,1368774173179: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:369)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:266)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
Region server:
2013-05-21 17:00:16,895 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x3eb14c67540002, closing socket connection and attempting reconnect
2013-05-21 17:00:35,896 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x13eb14ca4bb0000, closing socket connection and attempting reconnect
2013-05-21 17:03:31,498 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_9188414668950016309_4925046 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.80:57020 remote=/192.168.1.82:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2784)
2013-05-21 17:03:31,520 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9188414668950016309_4925046 bad datanode[0] 192.168.1.82:50010
2013-05-21 17:03:32,315 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /192.168.1.82:2100
2013-05-21 17:03:32,316 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop03/192.168.1.82:2100, initiating session
2013-05-21 17:03:32,317 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop03/192.168.1.82:2100, sessionid = 0x13eb14ca4bb0000, negotiated timeout = 180000
2013-05-21 17:03:32,497 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
java.io.IOException: Reflection
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:230)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1091)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1195)
        at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1057)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:228)
        ... 4 more
Caused by: java.io.IOException: DFSOutputStream is closed
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3483)
        at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
        at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
        ... 8 more
2013-05-21 17:03:32,497 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
2013-05-21 17:03:32,498 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: new createWriter -- HADOOP-6840 -- not available
2013-05-21 17:03:32,548 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop03/192.168.1.82:2100
2013-05-21 17:03:32,548 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop03/192.168.1.82:2100, initiating session
2013-05-21 17:03:32,549 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x3eb14c67540002 has expired, closing socket connection
2013-05-21 17:03:32,549 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hadoop01,60020,1368774173179: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:369)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:266)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
2013-05-21 17:03:32,549 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2013-05-21 17:03:32,675 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requestsPerSecond=0, numberOfOnlineRegions=111, numberOfStores=111, numberOfStorefiles=109, storefileIndexSizeMB=1, rootIndexSizeKB=1141, totalStaticIndexSizeKB=50806, totalStaticBloomSizeKB=0, memstoreSizeMB=412, readRequestsCount=302086613, writeRequestsCount=1802402, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=1631, maxHeapMB=2991, blockCacheSizeMB=988.97, blockCacheFreeMB=207.71, blockCacheCount=15710, blockCacheHitCount=2579547598, blockCacheMissCount=4348105, blockCacheEvictedCount=2734969, blockCacheHitRatio=99%, blockCacheHitCachingRatio=99%, hdfsBlocksLocalityIndex=100
2013-05-21 17:03:33,014 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://hadoop01:51001/hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369127012498, syncFs=true, hflush=false
2013-05-21 17:03:33,014 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
java.io.IOException: All datanodes 192.168.1.82:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2913)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:2406)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2610)
2013-05-21 17:03:33,015 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
2013-05-21 17:03:33,015 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369075567280, entries=0, filesize=124. for /hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369127012498
2013-05-21 17:03:37,586 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
2013-05-21 17:03:38,386 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
Datanode:
2013-05-21 17:03:36,431 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_9188414668950016309_4925046 java.io.IOException: Connection reset by peer
2013-05-21 17:03:36,411 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1430, bytes: 6662742, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000234_0, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_4644468068464468915_1594795, duration: 1103797113
2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_9188414668950016309_4925046 0 : Thread is interrupted.
2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_9188414668950016309_4925046 terminating
2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_9188414668950016309_4925046 received exception java.io.IOException: Connection reset by peer
2013-05-21 17:03:36,671 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.82:50010, storageID=DS-1178868297-192.168.1.82-50010-1350976945636, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(Unknown Source)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
        at sun.nio.ch.IOUtil.read(Unknown Source)
        at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(Unknown Source)
        at java.io.BufferedInputStream.read1(Unknown Source)
        at java.io.BufferedInputStream.read(Unknown Source)
        at java.io.DataInputStream.read(Unknown Source)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
        at java.lang.Thread.run(Unknown Source)
2013-05-21 17:03:36,693 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1428, bytes: 14134272, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000230_1, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_8190430407487842768_2648292, duration: 2088034034
2013-05-21 17:03:36,784 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1432, bytes: 8982528, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000228_1, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_4425180518832382171_1731508, duration: 1284097425