It appears the RPC handlers are getting stuck reading the HFiles backing the meta table (based on the block read errors, hedged reads, and HFileScanner lines in the trace). I'd check that the backing HDFS cluster is healthy and consider allocating more I/O capacity for the meta table.
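Since the hedged-read frames show up in the trace, it's worth checking how aggressively hedging is tuned on the HDFS client side. These are the standard client knobs in hdfs-site.xml (the values below are just illustrative, not a recommendation; too low a threshold can double read load on an already slow cluster):

```xml
<!-- Enable hedged reads by giving the client a hedging thread pool. -->
<property>
  <name>dfs.client.hedged.read.threadpool.size</name>
  <value>20</value>
</property>
<!-- How long to wait on the first datanode before issuing a hedged read. -->
<property>
  <name>dfs.client.hedged.read.threshold.millis</name>
  <value>500</value>
</property>
```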
Meta is heavily cached client-side, but this problem could also emerge if misbehaving clients are rapidly cycling their HBase client.

--James

On Wed, Jun 26, 2019, 9:47 AM James Kebinger <jkebin...@hubspot.com> wrote:

> Hello,
> We're experimenting with meta table replication but have found that the
> region servers hosting the replicas will get extremely high priority RPC
> handler usage, sometimes all the way to 100%, at which point clients start
> to experience errors - the priority RPC handler usage is much higher than
> the region servers hosting non-replicated meta ever experienced.
> Has anyone seen this behavior? Are we doing something wrong?
> Our settings changes are:
>
> hbase.regionserver.meta.storefile.refresh.period: 30000
> hbase.meta.replica.count: 3
> hbase.master.hfilecleaner.ttl: 600000
>
> We're on CDH 5.9.0 (HBase 1.2).
>
> We see some occasional errors like this one below (at the time I'm writing
> this, 30 minutes after the log entry, that file doesn't exist):
>
> java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR,
> self=/REDACTED:11702, remote=REDACTED:50010, for file
> /hbase/data/hbase/meta/1588230740/info/85c3e4b648eb437a8ef62ff9b39f525d,
> for pool BP-812715994-REDACTED-1441218207856 block 1083723183_9985162
>   at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
>   at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
>
> Doing a stack dump of a region server process to see what the priority
> handlers are up to reveals the following:
>
> 1 times (0% total cpu)
> "RpcServer.Priority.general.handler=12,queue=0,port=60020" #138 daemon
> prio=5 os_prio=0 tid=0x00007fc5471a3000 nid=0x3f33 waiting on condition
> [0x00007fc4f5870000] cpu=0.0 start=2019-06-20 19:31:47
>   java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for <0x00000003d94b6f18> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>   at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>   at org.apache.hadoop.hbase.regionserver.HStore.replaceStoreFiles(HStore.java:1348)
>   at org.apache.hadoop.hbase.regionserver.HStore.refreshStoreFilesInternal(HStore.java:652)
> --
>
> 3 times (0% total cpu)
> "RpcServer.Priority.general.handler=3,queue=1,port=60020" #129 daemon
> prio=5 os_prio=0 tid=0x00007fc547190800 nid=0x3f2a waiting on condition
> [0x00007fc4f6178000] cpu=0.0 start=2019-06-20 19:31:47
>   java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for <0x0000000723bbee00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>   at org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1359)
>   at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1327)
>   at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1468)
> --
>
> 5 times (0% total cpu)
> "RpcServer.Priority.general.handler=2,queue=0,port=60020" #128 daemon
> prio=5 os_prio=0 tid=0x00007fc54718f000 nid=0x3f29 in Object.wait()
> [0x00007fc4f6279000] cpu=0.0 start=2019-06-20 19:31:47
>   java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
>   - locked <0x00000004c0878978> (a org.apache.hadoop.hbase.util.IdLock$Entry)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
> --
>
> 7 times (0% total cpu)
> "RpcServer.Priority.general.handler=1,queue=1,port=60020" #127 daemon
> prio=5 os_prio=0 tid=0x00007fc54718c800 nid=0x3f28 in Object.wait()
> [0x00007fc4f637a000] cpu=0.0 start=2019-06-20 19:31:47
>   java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
>   - locked <0x0000000723bbef88> (a org.apache.hadoop.hbase.util.IdLock$Entry)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
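
Worth noting what the first stack implies: the refresh thread is parked in ReentrantReadWriteLock$WriteLock.lock() inside HStore.replaceStoreFiles(), i.e. it cannot swap in the refreshed store files until every in-flight read drains, and while it waits new readers queue behind it. A toy model of that read/write-lock interaction (not HBase code, just java.util.concurrent):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model of the contention above: reads (meta scans) hold the read lock,
// the periodic store-file refresh needs the write lock to swap the file list.
public class StoreFileLockDemo {
    public static void main(String[] args) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        lock.readLock().lock(); // an in-flight meta read holds the read lock

        // The refresher cannot swap store files while any read is active; in
        // the real server it parks in WriteLock.lock() (first stack above).
        boolean refreshed = lock.writeLock().tryLock();
        System.out.println("refresh acquired write lock: " + refreshed); // false

        lock.readLock().unlock();               // the read finishes...
        refreshed = lock.writeLock().tryLock(); // ...now the swap can proceed
        System.out.println("refresh acquired write lock: " + refreshed); // true
        lock.writeLock().unlock();
    }
}
```

With a 30s refresh period and slow HDFS reads underneath, that write-lock wait recurs constantly, which would keep the priority handlers pinned the way the dump shows.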