Hi all.

 

         I have a small HBase cluster with 3 physical machines.

         192.168.1.80 runs the HMaster and a region server; .81 and .82 each run one region server.

         The region server on .80 could not sync its HLog after a datanode access exception, and started to shut down.

         The datanode itself did not shut down and kept responding to other requests normally. I'll paste the logs below.

         My questions are:

         1. Why does this exception cause the region server to shut down? Can I prevent it?

         2. Are there any tools (a shell command would be best, something like "hadoop dfsadmin -report") to monitor an HBase region server and check whether it is alive or dead?

           I have done some research and found that Nagios/Ganglia can do such things, but I really only want to know whether a region server is alive or dead, so they are a little overqualified for this. And since I'm not using CDH, I don't think I can use Cloudera Manager.
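         Regarding question 1: from what I've read, a region server that loses its ZooKeeper session must abort to stay consistent, and the usual trigger is a long GC pause or a network hiccup outlasting the session timeout. One knob I found is zookeeper.session.timeout in hbase-site.xml. This is only a sketch of what I'm considering, not a confirmed fix (180000 ms happens to match the negotiated timeout shown in my logs; whether raising it further helps is my guess):

```xml
<!-- hbase-site.xml (example only): how long ZooKeeper waits before
     declaring this region server's session expired -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value>
</property>
```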
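         Regarding question 2: in the meantime I hacked together a small liveness check. It only probes each region server's info web UI (default port 60030) over HTTP with curl, so it just answers alive/dead, which is all I need. The hostnames and port below are from my cluster and are only an example, not a general tool:

```shell
# check_rs: print ALIVE/DEAD for one region server by probing its info web UI.
# Any HTTP response counts as alive; connection failure or timeout counts as dead.
check_rs() {
    host="$1"
    port="$2"
    if curl -sf --max-time 5 "http://${host}:${port}/" >/dev/null 2>&1; then
        echo "${host}: ALIVE"
    else
        echo "${host}: DEAD or unreachable"
    fi
}

# Example sweep over my three nodes (hadoop01..hadoop03 are my hostnames):
for h in hadoop01 hadoop02 hadoop03; do
    check_rs "$h" 60030
done
```

         Running "echo \"status 'simple'\" | hbase shell" from any node also lists live and dead servers as seen by the master, which may be closer to what I want.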

 

         Here are the logs.

         

         HBase master:
2013-05-21 17:03:32,675 ERROR org.apache.hadoop.hbase.master.HMaster: Region server ^@^@hadoop01,60020,1368774173179 reported a fatal error:
ABORTING region server hadoop01,60020,1368774173179: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:369)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:266)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)

 

         Region Server:

2013-05-21 17:00:16,895 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x3eb14c67540002, closing socket connection and attempting reconnect
2013-05-21 17:00:35,896 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x13eb14ca4bb0000, closing socket connection and attempting reconnect
2013-05-21 17:03:31,498 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_9188414668950016309_4925046 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.80:57020 remote=/192.168.1.82:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2784)
2013-05-21 17:03:31,520 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9188414668950016309_4925046 bad datanode[0] 192.168.1.82:50010
2013-05-21 17:03:32,315 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /192.168.1.82:2100
2013-05-21 17:03:32,316 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop03/192.168.1.82:2100, initiating session
2013-05-21 17:03:32,317 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop03/192.168.1.82:2100, sessionid = 0x13eb14ca4bb0000, negotiated timeout = 180000
2013-05-21 17:03:32,497 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
java.io.IOException: Reflection
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:230)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1091)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1195)
        at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1057)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:228)
        ... 4 more
Caused by: java.io.IOException: DFSOutputStream is closed
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3483)
        at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
        at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
        ... 8 more
2013-05-21 17:03:32,497 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
2013-05-21 17:03:32,498 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: new createWriter -- HADOOP-6840 -- not available
2013-05-21 17:03:32,548 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop03/192.168.1.82:2100
2013-05-21 17:03:32,548 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop03/192.168.1.82:2100, initiating session
2013-05-21 17:03:32,549 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x3eb14c67540002 has expired, closing socket connection
2013-05-21 17:03:32,549 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hadoop01,60020,1368774173179: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:369)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:266)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
2013-05-21 17:03:32,549 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2013-05-21 17:03:32,675 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requestsPerSecond=0, numberOfOnlineRegions=111, numberOfStores=111, numberOfStorefiles=109, storefileIndexSizeMB=1, rootIndexSizeKB=1141, totalStaticIndexSizeKB=50806, totalStaticBloomSizeKB=0, memstoreSizeMB=412, readRequestsCount=302086613, writeRequestsCount=1802402, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=1631, maxHeapMB=2991, blockCacheSizeMB=988.97, blockCacheFreeMB=207.71, blockCacheCount=15710, blockCacheHitCount=2579547598, blockCacheMissCount=4348105, blockCacheEvictedCount=2734969, blockCacheHitRatio=99%, blockCacheHitCachingRatio=99%, hdfsBlocksLocalityIndex=100
2013-05-21 17:03:33,014 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://hadoop01:51001/hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369127012498, syncFs=true, hflush=false
2013-05-21 17:03:33,014 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
java.io.IOException: All datanodes 192.168.1.82:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2913)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:2406)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2610)
2013-05-21 17:03:33,015 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
2013-05-21 17:03:33,015 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369075567280, entries=0, filesize=124.  for /hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369127012498
2013-05-21 17:03:37,586 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
2013-05-21 17:03:38,386 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down

 

         datanode:

2013-05-21 17:03:36,431 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_9188414668950016309_4925046 java.io.IOException: Connection reset by peer
2013-05-21 17:03:36,411 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1430, bytes: 6662742, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000234_0, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_4644468068464468915_1594795, duration: 1103797113
2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_9188414668950016309_4925046 0 : Thread is interrupted.
2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_9188414668950016309_4925046 terminating
2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_9188414668950016309_4925046 received exception java.io.IOException: Connection reset by peer
2013-05-21 17:03:36,671 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.82:50010, storageID=DS-1178868297-192.168.1.82-50010-1350976945636, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(Unknown Source)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
        at sun.nio.ch.IOUtil.read(Unknown Source)
        at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(Unknown Source)
        at java.io.BufferedInputStream.read1(Unknown Source)
        at java.io.BufferedInputStream.read(Unknown Source)
        at java.io.DataInputStream.read(Unknown Source)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
        at java.lang.Thread.run(Unknown Source)
2013-05-21 17:03:36,693 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1428, bytes: 14134272, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000230_1, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_8190430407487842768_2648292, duration: 2088034034
2013-05-21 17:03:36,784 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1432, bytes: 8982528, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000228_1, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_4425180518832382171_1731508, duration: 1284097425
