Hi, all:
our hbase cluster often occurs a strange phenomenon,we can’t query or
execute bulkload when one machine goes down,like the hbase cluster is crashed.
So we took a look at the jstack of the regionserver, we found some threads
were blocked, waiting for the lock. Here is the jstack of the regionserver:
"RpcServer.handler=100,port=60020" daemon prio=10 tid=0x00007f6a48b69000
nid=0x6b23 waiting for monitor entry [0x00007f66c9a70000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.hadoop.hdfs.DFSInputStream.getFileLength(DFSInputStream.java:354)
- waiting to lock <0x00000004dc908f30> (a
org.apache.hadoop.hdfs.DFSInputStream)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1279)
at
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1224)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1432)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
at
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:515)
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:238)
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:153)
at
org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
at
org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
at
org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1761)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3687)
at
org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1872)
at
org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1858)
at
org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1835)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3068)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None
"regionserver60020-smallCompactions-1428948439258" daemon prio=10
tid=0x00007f6a5001b000 nid=0x6ca1 runnable [0x00007f66c5524000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000004e7ca3340> (a sun.nio.ch.Util$2)
- locked <0x00000004e7ca3328> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000004e762abf8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at
org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:173)
at
org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:138)
- locked <0x0000000514bd5730> (a
org.apache.hadoop.hdfs.RemoteBlockReader2)
at
org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:686)
at
org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:742)
- eliminated <0x00000004dc908f30> (a
org.apache.hadoop.hdfs.DFSInputStream)
at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:799)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
- locked <0x00000004dc908f30> (a org.apache.hadoop.hdfs.DFSInputStream)
at java.io.DataInputStream.read(DataInputStream.java:149)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:563)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1215)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1432)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.readNextDataBlock(HFileReaderV2.java:605)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1065)
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:137)
at
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
at
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:507)
at
org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:217)
at
org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:76)
at
org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:109)
at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1080)
at
org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1409)
at
org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:475)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x00000004bfe5f1d8> (a
java.util.concurrent.ThreadPoolExecutor$Worker)
- <0x00000004dc0160c8> (a
java.util.concurrent.locks.ReentrantLock$NonfairSync)
- <0x0000000514700ec0> (a
java.util.concurrent.locks.ReentrantLock$NonfairSync)
"RpcServer.handler=102,port=60020" daemon prio=10 tid=0x00007f6a48b6d000
nid=0x6b25 waiting on condition [0x00007f66c986e000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000004dcd49270> (a
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:945)
at
org.apache.hadoop.hbase.regionserver.HStore.bulkLoadHFile(HStore.java:678)
at
org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3574)
at
org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3496)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.bulkLoadHFile(HRegionServer.java:3262)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29499)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None
"RpcServer.handler=113,port=60020" daemon prio=10 tid=0x00007f6a48b85000
nid=0x6b30 waiting on condition [0x00007f66c8d63000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000004dcd49270> (a
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
at
org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1754)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3687)
at
org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1872)
at
org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1858)
at
org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1835)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3068)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None
Almost all RpcServer threads were in the waiting state, because the
smallCompaction held the lock <0x00000004dc908f30> and waiting for the IO to be
ready. I don’t know why it took so long a time for the IO to be ready, and the
more strange thing is that I can’t even find a SocketTimeout exception in
regionserver’s log.
Does anybody know the reason of this problem?
Thank you in advance!
cluster version:
hbase-0.98.1-hadoop2
hadoop-2.3.0-cdh5.0.0