It appears the rpc handlers are getting stuck reading the hfiles backing
the meta table ( based on the block read errors, hedged reads, and
hfilescanner lines in the trace ).  I'd check that the backing hdfs cluster
is healthy and consider allocating more I/O capacity for the meta table.

Meta is heavily cached client-side, but this problem could also emerge if
misbehaving clients are rapidly cycling their hbase client.

--James

On Wed, Jun 26, 2019, 9:47 AM James Kebinger <jkebin...@hubspot.com> wrote:

> Hello,
> We're experimenting with meta table replication but have found that the
> region servers hosting the replicas will get extremely high priority RPC
> handler usage, sometimes all the way to 100% at which point clients start
> to experience errors - the priority RPC handler usage is much higher than
> the region servers hosting non-replicated meta ever experienced.
> Has anyone seen this behavior? Are we doing something wrong?
> Our settings changes are
>
> hbase.regionserver.meta.storefile.refresh.period: 30000
> hbase.meta.replica.count: 3
> hbase.master.hfilecleaner.ttl: 600000
>
> We're on cdh 5.9.0 (hbase 1.2)
>
> We see some occasional errors like this one below (at the time I'm writing
> this 30 minutes after the log entry, that file doesn't exist)
> java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR,
> self=/REDACTED:11702, remote=REDACTED:50010, for file
> /hbase/data/hbase/meta/1588230740/info/85c3e4b648eb437a8ef62ff9b39f525d,
> for pool BP-812715994-REDACTED-1441218207856 block 1083723183_9985162
>         at
>
> org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
>         at
>
> org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
>         at
>
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
>         at
>
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
>         at
>
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
>
> Doing a stack dump of a region server process to see what the priority
> handlers are up to reveals the following:
>
> 1 times (0% total cpu)
> "RpcServer.Priority.general.handler=12,queue=0,port=60020" #138 daemon
> prio=5 os_prio=0 tid=0x00007fc5471a3000 nid=0x3f33 waiting on condition
> [0x00007fc4f5870000] cpu=0.0 start=2019-06-20 19:31:47
>    java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00000003d94b6f18> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
> at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
> at
>
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
> at
>
> org.apache.hadoop.hbase.regionserver.HStore.replaceStoreFiles(HStore.java:1348)
> at
>
> org.apache.hadoop.hbase.regionserver.HStore.refreshStoreFilesInternal(HStore.java:652)
> --
>
> 3 times (0% total cpu)
> "RpcServer.Priority.general.handler=3,queue=1,port=60020" #129 daemon
> prio=5 os_prio=0 tid=0x00007fc547190800 nid=0x3f2a waiting on condition
> [0x00007fc4f6178000] cpu=0.0 start=2019-06-20 19:31:47
>    java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0000000723bbee00> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at
>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> at
>
> org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1359)
> at
>
> org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1327)
> at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1468)
> --
>
> 5 times (0% total cpu)
> "RpcServer.Priority.general.handler=2,queue=0,port=60020" #128 daemon
> prio=5 os_prio=0 tid=0x00007fc54718f000 nid=0x3f29 in Object.wait()
> [0x00007fc4f6279000] cpu=0.0 start=2019-06-20 19:31:47
>    java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
> - locked <0x00000004c0878978> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> at
> org.apache.hadoop.hbase.io
> .hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
> at
> org.apache.hadoop.hbase.io
> .hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
> at
> org.apache.hadoop.hbase.io
> .hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
> at
> org.apache.hadoop.hbase.io
> .hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
> at
>
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
> --
>
> 7 times (0% total cpu)
> "RpcServer.Priority.general.handler=1,queue=1,port=60020" #127 daemon
> prio=5 os_prio=0 tid=0x00007fc54718c800 nid=0x3f28 in Object.wait()
> [0x00007fc4f637a000] cpu=0.0 start=2019-06-20 19:31:47
>    java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
> - locked <0x0000000723bbef88> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> at
> org.apache.hadoop.hbase.io
> .hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
> at
> org.apache.hadoop.hbase.io
> .hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
> at
> org.apache.hadoop.hbase.io
> .hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
> at
> org.apache.hadoop.hbase.io
> .hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
> at
>
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
>

Reply via email to