On Fri, Jun 28, 2019 at 9:47 AM James Kebinger <jkebin...@hubspot.com> wrote:
> We were thinking that meta replication would make a hot meta table on a
> busy cluster less hot rather than more hot.

You are running read replicas, or replicating the meta to another cluster?
S

> Is the meta replication feature for uptime rather than performance if it,
> as you said, will "up the temperature even more"?
>
> I was surprised that all the region servers were reading the same hfile
> from disk rather than some mechanism that would allow each host of a meta
> region replica to have its own copy with high locality and less IdLock
> contention.
>
> In my brief experience experimenting with the feature, the region servers
> needed to be killed to recover, so they may be hitting that IdLock issue
> even though we're on 1.2 -- that IdLock code looks roughly the same in
> this older version.
>
> Given your statement about loading the temperature up even more, if our
> goal is to spread out the performance impact of the meta region, will
> tweaking cache size and I/O performance be time well spent?
>
> Thank you,
> James
>
> Thanks James for your response too, I hope you're doing well :) We have a
> bunch of cold-meta-cache hadoop jobs that are indeed likely contributing
> to the load on meta.
>
> On Fri, Jun 28, 2019 at 12:14 AM Stack <st...@duboce.net> wrote:
>
> > On Wed, Jun 26, 2019 at 6:47 AM James Kebinger <jkebin...@hubspot.com>
> > wrote:
> >
> > > Hello,
> > > We're experimenting with meta table replication
> >
> > Replicating the meta table itself? What are you thinking?
> >
> > > but have found that the region servers hosting the replicas will get
> > > extremely high priority RPC handler usage, sometimes all the way to
> > > 100%, at which point clients start to experience errors - the
> > > priority RPC handler usage is much higher than the region servers
> > > hosting non-replicated meta ever experienced.
> >
> > The meta table will be hot on a busy cluster. Adding replication load
> > will up the temperature even more.
> >
> > Check out the James Moore suggestions. An HDFS issue could occupy a
> > handler while it holds the id lock on a popular block.
> >
> > Can you figure out who is trying to read the removed file?
> >
> > Can't tell for sure from the snippet below, but are handlers frequently
> > going out to HDFS to pull in blocks rather than fetching from the block
> > cache? How is the server-side cache on the server hosting meta? Can you
> > get more meta up into the block cache?
> >
> > If you disable replication, does it all settle down?
> >
> > S
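A note on the block-cache question above: in HBase 1.x the block cache is
sized by hfile.block.cache.size, the fraction of regionserver heap reserved
for caching blocks (default 0.4). A minimal sketch of reading that setting
with the stock client Configuration; the class name and printout are
illustrative only, not part of the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class BlockCacheSize {
        public static void main(String[] args) {
            // hfile.block.cache.size is the fraction of regionserver heap
            // reserved for the block cache (default 0.4 in HBase 1.x).
            // Raising it leaves more room for hot meta blocks to stay cached.
            Configuration conf = HBaseConfiguration.create();
            float fraction = conf.getFloat("hfile.block.cache.size", 0.4f);
            System.out.println("block cache fraction of heap: " + fraction);
        }
    }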
> > > Has anyone seen this behavior? Are we doing something wrong?
> > >
> > > Our settings changes are
> > >
> > > hbase.regionserver.meta.storefile.refresh.period: 30000
> > > hbase.meta.replica.count: 3
> > > hbase.master.hfilecleaner.ttl: 600000
> > >
> > > We're on cdh 5.9.0 (hbase 1.2)
> > >
> > > We see some occasional errors like this one below (at the time I'm
> > > writing this, 30 minutes after the log entry, that file doesn't exist):
> > >
> > > java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR,
> > > self=/REDACTED:11702, remote=REDACTED:50010, for file
> > > /hbase/data/hbase/meta/1588230740/info/85c3e4b648eb437a8ef62ff9b39f525d,
> > > for pool BP-812715994-REDACTED-1441218207856 block 1083723183_9985162
> > >   at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
> > >   at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
> > >   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
> > >   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
> > >   at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
> > >
> > > Doing a stack dump of a region server process to see what the priority
> > > handlers are up to reveals the following:
> > >
> > > 1 times (0% total cpu)
> > > "RpcServer.Priority.general.handler=12,queue=0,port=60020" #138 daemon
> > > prio=5 os_prio=0 tid=0x00007fc5471a3000 nid=0x3f33 waiting on condition
> > > [0x00007fc4f5870000] cpu=0.0 start=2019-06-20 19:31:47
> > >   java.lang.Thread.State: WAITING (parking)
> > >   at sun.misc.Unsafe.park(Native Method)
> > >   - parking to wait for <0x00000003d94b6f18> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> > >   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> > >   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> > >   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
> > >   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
> > >   at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
> > >   at org.apache.hadoop.hbase.regionserver.HStore.replaceStoreFiles(HStore.java:1348)
> > >   at org.apache.hadoop.hbase.regionserver.HStore.refreshStoreFilesInternal(HStore.java:652)
> > > --
> > >
> > > 3 times (0% total cpu)
> > > "RpcServer.Priority.general.handler=3,queue=1,port=60020" #129 daemon
> > > prio=5 os_prio=0 tid=0x00007fc547190800 nid=0x3f2a waiting on condition
> > > [0x00007fc4f6178000] cpu=0.0 start=2019-06-20 19:31:47
> > >   java.lang.Thread.State: WAITING (parking)
> > >   at sun.misc.Unsafe.park(Native Method)
> > >   - parking to wait for <0x0000000723bbee00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > >   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> > >   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> > >   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> > >   at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> > >   at org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1359)
> > >   at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1327)
> > >   at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1468)
> > > --
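The handlers in that last group are parked inside
DFSInputStream.hedgedFetchBlockByteRange, which means the DFS client has
hedged reads enabled: when a read is slow, a second speculative read is
launched against another datanode, which adds its own load. In Hadoop 2.x
the behavior is governed by two client keys; a minimal sketch of setting
them (class name and values are illustrative, not recommendations):

    import org.apache.hadoop.conf.Configuration;

    public class HedgedReadSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Size of the speculative-read thread pool; 0 disables hedged reads.
            conf.setInt("dfs.client.hedged.read.threadpool.size", 10);
            // How long the primary read may lag before a hedge is launched.
            conf.setLong("dfs.client.hedged.read.threshold.millis", 500L);
            System.out.println("hedged read pool size: "
                + conf.getInt("dfs.client.hedged.read.threadpool.size", 0));
        }
    }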
> > > 5 times (0% total cpu)
> > > "RpcServer.Priority.general.handler=2,queue=0,port=60020" #128 daemon
> > > prio=5 os_prio=0 tid=0x00007fc54718f000 nid=0x3f29 in Object.wait()
> > > [0x00007fc4f6279000] cpu=0.0 start=2019-06-20 19:31:47
> > >   java.lang.Thread.State: WAITING (on object monitor)
> > >   at java.lang.Object.wait(Native Method)
> > >   at java.lang.Object.wait(Object.java:502)
> > >   at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
> > >   - locked <0x00000004c0878978> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
> > >   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
> > > --
> > >
> > > 7 times (0% total cpu)
> > > "RpcServer.Priority.general.handler=1,queue=1,port=60020" #127 daemon
> > > prio=5 os_prio=0 tid=0x00007fc54718c800 nid=0x3f28 in Object.wait()
> > > [0x00007fc4f637a000] cpu=0.0 start=2019-06-20 19:31:47
> > >   java.lang.Thread.State: WAITING (on object monitor)
> > >   at java.lang.Object.wait(Native Method)
> > >   at java.lang.Object.wait(Object.java:502)
> > >   at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
> > >   - locked <0x0000000723bbef88> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
> > >   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
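Those last two handler groups are parked in IdLock.getLockEntry under
HFileReaderV2.readBlock - the id-lock contention mentioned above. The rough
shape of the pattern (a sketch, not the exact 1.2 source; the class and the
cache/HDFS stand-ins are hypothetical): readers of the same block offset
serialize on a per-offset IdLock so the block is loaded once, and on a cache
miss the HDFS read happens while the lock is held, so one slow or failing
datanode read can pin every handler queued behind that offset:

    import java.io.IOException;
    import org.apache.hadoop.hbase.util.IdLock;

    public class ReadBlockShape {
        private final IdLock offsetLock = new IdLock();

        byte[] readBlock(long offset) throws IOException {
            // One IdLock entry per block offset: concurrent readers of the
            // same block queue here (the Object.wait frames in the dump).
            IdLock.Entry lockEntry = offsetLock.getLockEntry(offset);
            try {
                byte[] cached = cacheLookup(offset);
                if (cached != null) {
                    return cached; // cache hit: the lock is held only briefly
                }
                // Cache miss: the HDFS pread runs while the lock is held, so
                // a slow datanode stalls everyone queued on this offset.
                return readFromHdfs(offset);
            } finally {
                offsetLock.releaseLockEntry(lockEntry);
            }
        }

        // Stand-ins for the block cache and HDFS paths of the real reader.
        private byte[] cacheLookup(long offset) { return null; }
        private byte[] readFromHdfs(long offset) { return new byte[0]; }
    }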