Hi Guys,
I need your kind help. Region servers in my cluster have crashed several times within a week. I have been tracking this problem for a long time but have not found the root cause. In the region server log, the DFSClient throws a DFSOutputStream ResponseProcessor exception for block BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910. In the datanode log, the datanode had still not finished receiving block BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 after 30 minutes, and recovery of the block failed because the thread was interrupted while waiting for I/O on a java.nio.channels.SocketChannel.
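For reference, this is how I understand the "Premature EOF: no length prefix available" message (a rough sketch based on the stack trace in the region server log below, not the actual Hadoop source; the class and method names here are made up): each pipeline ack the writer waits for is a varint-length-prefixed protobuf message, so if the datanode closes the socket before sending the next ack, the very first read() returns -1 and the ResponseProcessor surfaces exactly this EOFException.

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Rough sketch only: how a length-prefixed ack read turns a closed socket
// into "Premature EOF: no length prefix available". AckLengthPrefix and
// readLengthPrefix are hypothetical names, not Hadoop classes.
final class AckLengthPrefix {
    static int readLengthPrefix(InputStream ackStream) throws IOException {
        int firstByte = ackStream.read(); // blocks until the next ack byte arrives
        if (firstByte == -1) {
            // The peer hung up before sending any length byte, so there is
            // no ack to decode; the writer-side thread sees this EOFException.
            throw new EOFException("Premature EOF: no length prefix available");
        }
        // ...continue decoding the varint length, then read the ack payload...
        return firstByte;
    }
}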

I have two questions:
1. Why does the JVM pause when no GC is detected?
2016-04-22 14:32:54,330 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 37360ms
No GCs detected
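For reference, this is how I understand the pause monitor producing this message (a rough sketch inferred from the log format, not the actual util.JvmPauseMonitor source; the 500ms/10000ms constants are my assumptions): it sleeps for a fixed interval, measures how much longer the sleep actually took, and blames GC only if the GC MXBean counters moved during the pause.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough sketch only, not Hadoop's code: detect JVM/host stalls by oversleeping,
// and attribute them to GC only when the collectors actually ran.
final class PauseMonitorSketch implements Runnable {
    private static final long SLEEP_MS = 500;    // assumed polling interval
    private static final long WARN_MS = 10_000;  // assumed warn threshold

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long gcBefore = totalGcCount();
            long start = System.nanoTime();
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                return;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            long extraMs = elapsedMs - SLEEP_MS;
            if (extraMs > WARN_MS) {
                if (totalGcCount() == gcBefore) {
                    // The JVM was frozen but no collector ran: the stall came
                    // from outside the JVM (swapping, CPU starvation, ...).
                    System.out.println("pause of ~" + extraMs + "ms, No GCs detected");
                } else {
                    System.out.println("pause of ~" + extraMs + "ms, GC activity seen");
                }
            }
        }
    }

    private static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gc.getCollectionCount();
        }
        return total;
    }
}

If this reading is right, "No GCs detected" means the 37360ms freeze happened while no collector ran at all, i.e. the stall came from outside the JVM heap.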

2. We have MapReduce jobs running on the same cluster. Could they be related to the region server crashes? If so, how?

Thanks!

Hadoop version: 2.6.0    HBase version: 1.0.1.1
########### region server log ###########
2016-04-22 14:28:46,409 WARN [regionserver/slave3/192.168.68.154:16020] util.Sleeper: We slept 15033ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-04-22 14:28:46,409 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12600ms
No GCs detected
2016-04-22 14:30:47,076 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=11.75 MB, freeSize=3.95 GB, max=3.96 GB, blockCount=144, accesses=152144, hits=152144, hitRatio=100.00%, , cachingAccesses=152086, cachingHits=152086, cachingHitsRatio=100.00%, evictions=1644, evicted=62, evictedPerRun=0.03771289438009262
2016-04-22 14:30:56,218 INFO [BucketCacheStatsExecutor] bucket.BucketCache: failedBlockAdditions=0, totalSize=20.00 GB, freeSize=19.95 GB, usedSize=47.56 MB, cacheSize=33.54 MB, accesses=426765, hits=12205, IOhitsPerSecond=0, IOTimePerHit=0.00, hitRatio=2.86%, cachingAccesses=13317, cachingHits=12063, cachingHitsRatio=90.58%, evictions=0, evicted=142, evictedPerRun=Infinity
2016-04-22 14:32:54,329 WARN [regionserver/slave3/192.168.68.154:16020.periodicFlusher] util.Sleeper: We slept 37919ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-04-22 14:32:54,330 WARN [regionserver/slave3/192.168.68.154:16020.compactionChecker] util.Sleeper: We slept 37922ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-04-22 14:32:54,330 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 37360ms
No GCs detected
2016-04-22 14:32:54,330 INFO [regionserver/slave3/192.168.68.154:16020-SendThread(slave5:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 47741ms for sessionid 0x554360adead0239, closing socket connection and attempting reconnect
2016-04-22 14:32:54,330 INFO [main-SendThread(slave4:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 47733ms for sessionid 0x453d16e69e530c0, closing socket connection and attempting reconnect
2016-04-22 14:32:54,330 INFO [regionserver/slave3/192.168.68.154:16020-SendThread(slave2:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 47741ms for sessionid 0x2542aa736c111cd, closing socket connection and attempting reconnect
2016-04-22 14:32:54,330 WARN [regionserver/slave3/192.168.68.154:16020] util.Sleeper: We slept 40647ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-04-22 14:32:54,337 WARN [ResponseProcessor for block BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2000)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
2016-04-22 14:32:54,337 FATAL [regionserver/slave3/192.168.68.154:16020] regionserver.HRegionServer: ABORTING region server slave3,16020,1461290143957: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing slave3,16020,1461290143957 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:384)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:289)
        at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:278)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)


########### datanode log ###########
2016-04-22 13:57:40,942 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 src: /192.168.68.154:52251 dest: /192.168.68.154:50010
...30mins...
2016-04-22 14:32:48,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at master2/192.168.68.151:8020 calls recoverBlock(BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910, targets=[192.168.68.154:50010, 192.168.68.157:50010], newGenerationStamp=804049)
2016-04-22 14:32:48,221 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1074544298_803910, recoveryId=804049, replica=ReplicaBeingWritten, blk_1074544298_803910, RBW
  getNumBytes()     = 83
  getBytesOnDisk()  = 83
  getVisibleLength()= 83
  getVolume()       = /data6/current
  getBlockFile()    = /data6/current/BP-454889425-192.168.68.150-1452620137989/current/rbw/blk_1074544298
  bytesAcked=83
  bytesOnDisk=83
2016-04-22 14:32:48,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910
java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/192.168.68.154:50010 remote=/192.168.68.154:52251]. 142739 millis timeout left.
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:730)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:745)
2016-04-22 14:32:48,222 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910, type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
2016-04-22 14:32:48,222 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2016-04-22 14:32:48,222 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 received exception java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/192.168.68.154:50010 remote=/192.168.68.154:52251]. 142739 millis timeout left.
2016-04-22 14:32:48,222 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave3:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.168.68.154:52251 dst: /192.168.68.154:50010
java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/192.168.68.154:50010 remote=/192.168.68.154:52251]. 142739 millis timeout left.
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:730)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)



