Hi all,
I need your kind help. The region servers in my cluster have crashed
several times in one week. I have been tracking this problem for a long
time, but have not found the root cause.
From the region server log, we can see a DFSOutputStream
ResponseProcessor exception for block
BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 in the
DFSClient.
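As far as I can tell, that EOFException ("Premature EOF: no length
prefix available") just means the peer closed the socket while the
client was blocked reading the varint length prefix of the next
pipeline ack. A toy illustration of that read (my own sketch, not the
actual HDFS code in PBHelper.vintPrefixed):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Sketch: reading a protobuf-style varint length prefix from a socket
// stream. If the peer has already closed the connection, the very first
// read() returns -1, and the only honest signal is an EOFException --
// the same "no length prefix available" the DFSClient logged.
public class AckPrefixSketch {
    static int readVarintPrefix(InputStream in) throws IOException {
        int result = 0, shift = 0, b;
        while ((b = in.read()) >= 0) {       // -1 means end of stream
            result |= (b & 0x7f) << shift;   // low 7 bits carry data
            if ((b & 0x80) == 0) {           // high bit clear: last byte
                return result;
            }
            shift += 7;
        }
        throw new EOFException("Premature EOF: no length prefix available");
    }
}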
From the datanode log, the datanode did not receive block
BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 within
30 minutes, and recovery of the block failed because the thread was
interrupted while waiting for IO on a java.nio.channels.SocketChannel.
I have two questions:
1. Why does the JVM pause when there are no GCs?
2016-04-22 14:32:54,330 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
Detected pause in JVM or host machine (eg GC): pause of approximately
37360ms
No GCs detected
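For what it's worth, my understanding of how JvmPauseMonitor produces
that message (a minimal sketch of the detection idea, with illustrative
names and thresholds, not the actual Hadoop code): a daemon thread
sleeps for a short, fixed interval and measures how badly the wakeup
overshot; GC counters are snapshotted around the sleep, and if no
collector ran during a large overshoot, the stall must have come from
outside the collector (swapping, CPU starvation, a frozen VM), hence
"No GCs detected".

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

public class PauseMonitorSketch implements Runnable {
    private static final long SLEEP_MS = 500;    // illustrative interval
    private static final long WARN_MS = 10_000;  // illustrative threshold

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Map<String, Long> before = gcCounts();  // GC state before sleeping
            long start = System.nanoTime();
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                return;
            }
            long overshootMs = (System.nanoTime() - start) / 1_000_000 - SLEEP_MS;
            if (overshootMs > WARN_MS) {
                // If no collector's count moved, the pause was not a GC pause.
                boolean gcRan = !gcCounts().equals(before);
                System.err.printf("Detected pause of approximately %dms%s%n",
                        overshootMs, gcRan ? "" : "; No GCs detected");
            }
        }
    }

    // Snapshot collection counts for every garbage collector in this JVM.
    private static Map<String, Long> gcCounts() {
        Map<String, Long> counts = new HashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            counts.put(gc.getName(), gc.getCollectionCount());
        }
        return counts;
    }
}

So "No GCs detected" does not rule the JVM out entirely, but it points
at a host-level stall rather than a heap problem.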
2. We have MapReduce jobs running on the cluster. Is that related to
the region server crashes? If so, how?
Thanks!
Hadoop version: 2.6.0, HBase version: 1.0.1.1
###########region server log###########
2016-04-22 14:28:46,409 WARN [regionserver/slave3/192.168.68.154:16020]
util.Sleeper: We slept 15033ms instead of 3000ms, this is likely due to
a long garbage collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-04-22 14:28:46,409 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
Detected pause in JVM or host machine (eg GC): pause of approximately
12600ms
No GCs detected
2016-04-22 14:30:47,076 INFO [LruBlockCacheStatsExecutor]
hfile.LruBlockCache: totalSize=11.75 MB, freeSize=3.95 GB, max=3.96 GB,
blockCount=144, accesses=152144, hits=152144, hitRatio=100.00%, ,
cachingAccesses=152086, cachingHits=152086, cachingHitsRatio=100.00%,
evictions=1644, evicted=62, evictedPerRun=0.03771289438009262
2016-04-22 14:30:56,218 INFO [BucketCacheStatsExecutor]
bucket.BucketCache: failedBlockAdditions=0, totalSize=20.00 GB,
freeSize=19.95 GB, usedSize=47.56 MB, cacheSize=33.54 MB,
accesses=426765, hits=12205, IOhitsPerSecond=0, IOTimePerHit=0.00,
hitRatio=2.86%, cachingAccesses=13317, cachingHits=12063,
cachingHitsRatio=90.58%, evictions=0, evicted=142, evictedPerRun=Infinity
2016-04-22 14:32:54,329 WARN
[regionserver/slave3/192.168.68.154:16020.periodicFlusher] util.Sleeper:
We slept 37919ms instead of 10000ms, this is likely due to a long
garbage collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-04-22 14:32:54,330 WARN
[regionserver/slave3/192.168.68.154:16020.compactionChecker]
util.Sleeper: We slept 37922ms instead of 10000ms, this is likely due to
a long garbage collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-04-22 14:32:54,330 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
Detected pause in JVM or host machine (eg GC): pause of approximately
37360ms
No GCs detected
2016-04-22 14:32:54,330 INFO
[regionserver/slave3/192.168.68.154:16020-SendThread(slave5:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from
server in 47741ms for sessionid 0x554360adead0239, closing socket
connection and attempting reconnect
2016-04-22 14:32:54,330 INFO [main-SendThread(slave4:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from
server in 47733ms for sessionid 0x453d16e69e530c0, closing socket
connection and attempting reconnect
2016-04-22 14:32:54,330 INFO
[regionserver/slave3/192.168.68.154:16020-SendThread(slave2:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from
server in 47741ms for sessionid 0x2542aa736c111cd, closing socket
connection and attempting reconnect
2016-04-22 14:32:54,330 WARN [regionserver/slave3/192.168.68.154:16020]
util.Sleeper: We slept 40647ms instead of 3000ms, this is likely due to
a long garbage collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-04-22 14:32:54,337 WARN [ResponseProcessor for block
BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910]
hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2000)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
2016-04-22 14:32:54,337 FATAL [regionserver/slave3/192.168.68.154:16020]
regionserver.HRegionServer: ABORTING region server
slave3,16020,1461290143957: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing slave3,16020,1461290143957
as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:384)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:289)
at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:278)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)
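If I read the sequence right, the YouAreDeadException is a consequence
of the pause rather than a separate failure: the ~40s freeze exceeded
the ZooKeeper session timeout, the server's ephemeral znode expired,
the master marked slave3 dead, and the next regionServerReport was
rejected, so the region server had to abort. A minimal sketch of the
client-side session handling involved (illustrative only, not HBase's
actual code; the quorum string and timeout are guesses matching the log):

import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionExpirySketch implements Watcher {
    public static void main(String[] args) throws IOException {
        // Heartbeats run on ZooKeeper's background threads; a whole-process
        // pause stops them, and after ~sessionTimeout the server side
        // expires the session and deletes the client's ephemeral nodes.
        ZooKeeper zk = new ZooKeeper(
                "slave2:2181,slave4:2181,slave5:2181", 40_000,
                new SessionExpirySketch());
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
            // By the time a paused process runs this handler, the master
            // already considers it dead, so aborting is the only safe move.
            System.err.println("ZooKeeper session expired; aborting.");
        }
    }
}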
############data node log#############
2016-04-22 13:57:40,942 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 src:
/192.168.68.154:52251 dest: /192.168.68.154:50010
... ~30 minutes later ...
2016-04-22 14:32:48,221 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at
master2/192.168.68.151:8020 calls
recoverBlock(BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910,
targets=[192.168.68.154:50010, 192.168.68.157:50010],
newGenerationStamp=804049)
2016-04-22 14:32:48,221 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_1074544298_803910, recoveryId=804049,
replica=ReplicaBeingWritten, blk_1074544298_803910, RBW
getNumBytes() = 83
getBytesOnDisk() = 83
getVisibleLength()= 83
getVolume() = /data6/current
getBlockFile() =
/data6/current/BP-454889425-192.168.68.150-1452620137989/current/rbw/blk_1074544298
bytesAcked=83
bytesOnDisk=83
2016-04-22 14:32:48,221 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910
java.io.InterruptedIOException: Interruped while waiting for IO on
channel java.nio.channels.SocketChannel[connected
local=/192.168.68.154:50010 remote=/192.168.68.154:52251]. 142739 millis
timeout left.
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:730)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
at java.lang.Thread.run(Thread.java:745)
2016-04-22 14:32:48,222 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910,
type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
2016-04-22 14:32:48,222 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910,
type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2016-04-22 14:32:48,222 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock
BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 received
exception java.io.InterruptedIOException: Interruped while waiting for
IO on channel java.nio.channels.SocketChannel[connected
local=/192.168.68.154:50010 remote=/192.168.68.154:52251]. 142739 millis
timeout left.
2016-04-22 14:32:48,222 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
slave3:50010:DataXceiver error processing WRITE_BLOCK operation src:
/192.168.68.154:52251 dst: /192.168.68.154:50010
java.io.InterruptedIOException: Interruped while waiting for IO on
channel java.nio.channels.SocketChannel[connected
local=/192.168.68.154:50010 remote=/192.168.68.154:52251]. 142739 millis
timeout left.
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:730)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)