On Fri, Jun 28, 2019 at 9:47 AM James Kebinger <jkebin...@hubspot.com> wrote:
> We were thinking that meta replication would make a hot meta table on a
> busy cluster less hot rather than more hot.

You are running read replicas, or replicating the meta to another cluster?
S

> Is the meta replication feature for uptime rather than performance if it,
> as you said, will "up the temperature even more"?
>
> I was surprised that all the region servers were reading the same hfile
> from disk rather than some mechanism that would allow each host of a meta
> region replica to have its own copy with high locality and less IdLock
> contention.
>
> In my brief experience experimenting with the feature, the region servers
> needed to be killed to recover, so they may be hitting that IdLock issue
> even though we're on 1.2 -- that IdLock code looks roughly the same in
> this older version.
>
> Given your statement about loading the temperature up even more, if our
> goal is to spread out the performance impact of the meta region, will
> tweaking cache size and I/O performance be time well spent?
>
> Thank you,
> James
>
> Thanks James for your response too, I hope you're doing well :) We have a
> bunch of cold-meta-cache hadoop jobs that are indeed likely contributing
> to the load on meta.
>
> On Fri, Jun 28, 2019 at 12:14 AM Stack <st...@duboce.net> wrote:
>
> > On Wed, Jun 26, 2019 at 6:47 AM James Kebinger <jkebin...@hubspot.com>
> > wrote:
> >
> > > Hello,
> > > We're experimenting with meta table replication
> >
> > Replicating the meta table itself? What are you thinking?
> >
> > > but have found that the region servers hosting the replicas will get
> > > extremely high priority RPC handler usage, sometimes all the way to
> > > 100%, at which point clients start to experience errors - the
> > > priority RPC handler usage is much higher than the region servers
> > > hosting non-replicated meta ever experienced.
> >
> > The meta table will be hot on a busy cluster. Adding replication load
> > will up the temperature even more.
> >
> > Check out the James Moore suggestions. An HDFS issue could occupy a
> > handler while it holds the id lock on a popular block.
> >
> > Can you figure out who is trying to read the removed file?
> >
> > Can't tell for sure from the snippet below, but are handlers frequently
> > going out to HDFS to pull in blocks rather than fetching from the block
> > cache? How is the server-side cache on the server hosting meta? Can you
> > get more meta up into the block cache?
> >
> > If you disable replication, does it all settle down?
> >
> > S
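A note on the block-cache question above: in HBase 1.x the block cache is
sized by hfile.block.cache.size, the fraction of regionserver heap reserved
for caching blocks (default 0.4). A minimal sketch of reading that setting
with the stock client Configuration; the class name and printout are
illustrative only, not part of the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class BlockCacheSize {
        public static void main(String[] args) {
            // hfile.block.cache.size is the fraction of regionserver heap
            // reserved for the block cache (default 0.4 in HBase 1.x).
            // Raising it leaves more room for hot meta blocks to stay cached.
            Configuration conf = HBaseConfiguration.create();
            float fraction = conf.getFloat("hfile.block.cache.size", 0.4f);
            System.out.println("block cache fraction of heap: " + fraction);
        }
    }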
> > > Has anyone seen this behavior? Are we doing something wrong?
> > >
> > > Our settings changes are
> > >
> > > hbase.regionserver.meta.storefile.refresh.period: 30000
> > > hbase.meta.replica.count: 3
> > > hbase.master.hfilecleaner.ttl: 600000
> > >
> > > We're on cdh 5.9.0 (hbase 1.2)
> > >
> > > We see some occasional errors like this one below (at the time I'm
> > > writing this, 30 minutes after the log entry, that file doesn't exist):
> > >
> > > java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR,
> > > self=/REDACTED:11702, remote=REDACTED:50010, for file
> > > /hbase/data/hbase/meta/1588230740/info/85c3e4b648eb437a8ef62ff9b39f525d,
> > > for pool BP-812715994-REDACTED-1441218207856 block 1083723183_9985162
> > >   at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
> > >   at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
> > >   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
> > >   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
> > >   at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
> > >
> > > Doing a stack dump of a region server process to see what the priority
> > > handlers are up to reveals the following:
> > >
> > > 1 times (0% total cpu)
> > > "RpcServer.Priority.general.handler=12,queue=0,port=60020" #138 daemon
> > > prio=5 os_prio=0 tid=0x00007fc5471a3000 nid=0x3f33 waiting on condition
> > > [0x00007fc4f5870000] cpu=0.0 start=2019-06-20 19:31:47
> > >   java.lang.Thread.State: WAITING (parking)
> > >   at sun.misc.Unsafe.park(Native Method)
> > >   - parking to wait for <0x00000003d94b6f18> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> > >   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> > >   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> > >   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
> > >   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
> > >   at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
> > >   at org.apache.hadoop.hbase.regionserver.HStore.replaceStoreFiles(HStore.java:1348)
> > >   at org.apache.hadoop.hbase.regionserver.HStore.refreshStoreFilesInternal(HStore.java:652)
> > > --
> > >
> > > 3 times (0% total cpu)
> > > "RpcServer.Priority.general.handler=3,queue=1,port=60020" #129 daemon
> > > prio=5 os_prio=0 tid=0x00007fc547190800 nid=0x3f2a waiting on condition
> > > [0x00007fc4f6178000] cpu=0.0 start=2019-06-20 19:31:47
> > >   java.lang.Thread.State: WAITING (parking)
> > >   at sun.misc.Unsafe.park(Native Method)
> > >   - parking to wait for <0x0000000723bbee00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > >   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> > >   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> > >   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> > >   at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> > >   at org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1359)
> > >   at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1327)
> > >   at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1468)
> > > --
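The handlers in that last group are parked inside
DFSInputStream.hedgedFetchBlockByteRange, which means the DFS client has
hedged reads enabled: when a read is slow, a second speculative read is
launched against another datanode, which adds its own load. In Hadoop 2.x
the behavior is governed by two client keys; a minimal sketch of setting
them (class name and values are illustrative, not recommendations):

    import org.apache.hadoop.conf.Configuration;

    public class HedgedReadSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Size of the speculative-read thread pool; 0 disables hedged reads.
            conf.setInt("dfs.client.hedged.read.threadpool.size", 10);
            // How long the primary read may lag before a hedge is launched.
            conf.setLong("dfs.client.hedged.read.threshold.millis", 500L);
            System.out.println("hedged read pool size: "
                + conf.getInt("dfs.client.hedged.read.threadpool.size", 0));
        }
    }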
> > > 5 times (0% total cpu)
> > > "RpcServer.Priority.general.handler=2,queue=0,port=60020" #128 daemon
> > > prio=5 os_prio=0 tid=0x00007fc54718f000 nid=0x3f29 in Object.wait()
> > > [0x00007fc4f6279000] cpu=0.0 start=2019-06-20 19:31:47
> > >   java.lang.Thread.State: WAITING (on object monitor)
> > >   at java.lang.Object.wait(Native Method)
> > >   at java.lang.Object.wait(Object.java:502)
> > >   at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
> > >   - locked <0x00000004c0878978> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
> > >   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
> > > --
> > >
> > > 7 times (0% total cpu)
> > > "RpcServer.Priority.general.handler=1,queue=1,port=60020" #127 daemon
> > > prio=5 os_prio=0 tid=0x00007fc54718c800 nid=0x3f28 in Object.wait()
> > > [0x00007fc4f637a000] cpu=0.0 start=2019-06-20 19:31:47
> > >   java.lang.Thread.State: WAITING (on object monitor)
> > >   at java.lang.Object.wait(Native Method)
> > >   at java.lang.Object.wait(Object.java:502)
> > >   at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
> > >   - locked <0x0000000723bbef88> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
> > >   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
> > >   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
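Those last two handler groups are parked in IdLock.getLockEntry under
HFileReaderV2.readBlock - the id-lock contention mentioned above. The rough
shape of the pattern (a sketch, not the exact 1.2 source; the class and the
cache/HDFS stand-ins are hypothetical): readers of the same block offset
serialize on a per-offset IdLock so the block is loaded once, and on a cache
miss the HDFS read happens while the lock is held, so one slow or failing
datanode read can pin every handler queued behind that offset:

    import java.io.IOException;
    import org.apache.hadoop.hbase.util.IdLock;

    public class ReadBlockShape {
        private final IdLock offsetLock = new IdLock();

        byte[] readBlock(long offset) throws IOException {
            // One IdLock entry per block offset: concurrent readers of the
            // same block queue here (the Object.wait frames in the dump).
            IdLock.Entry lockEntry = offsetLock.getLockEntry(offset);
            try {
                byte[] cached = cacheLookup(offset);
                if (cached != null) {
                    return cached; // cache hit: the lock is held only briefly
                }
                // Cache miss: the HDFS pread runs while the lock is held, so
                // a slow datanode stalls everyone queued on this offset.
                return readFromHdfs(offset);
            } finally {
                offsetLock.releaseLockEntry(lockEntry);
            }
        }

        // Stand-ins for the block cache and HDFS paths of the real reader.
        private byte[] cacheLookup(long offset) { return null; }
        private byte[] readFromHdfs(long offset) { return new byte[0]; }
    }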