Those:

HBASE-6165 Replication can overrun .META. scans on cluster re-start
HBASE-6550 Refactoring ReplicationSink to make it more responsive of cluster health
HBASE-6860 [replication] HBASE-6550 is too aggressive, DDOSes .META.
Jean-Daniel

On Fri, Nov 30, 2012 at 10:08 AM, Varun Sharma <[email protected]> wrote:
> Hi Jean,
>
> I looked at the release notes for 0.94.1 and 0.94.2 and it looks like all
> the fixes there have to do with splitting of regions (I may be wrong). For
> my cluster(s), splits are off.
>
> Varun
>
> On Fri, Nov 30, 2012 at 10:03 AM, Varun Sharma <[email protected]> wrote:
>> Hi Jean,
>>
>> Thanks! Could you point me to some of the fixes? We currently use
>> hbase-0.94.0 with some other patches.
>>
>> On Fri, Nov 30, 2012 at 8:53 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>> Use 0.94.2, it has all the fixes you need.
>>>
>>> J-D
>>>
>>> On Fri, Nov 30, 2012 at 4:56 AM, Varun Sharma <[email protected]> wrote:
>>>> After clearing out some files in /.logs which had size 0 and restarting
>>>> the cluster, all regions came online and started serving. But now I am
>>>> stuck again. The master moved some regions to rebalance after the
>>>> restart, and some of them are PENDING_CLOSE while 2 regions are offline.
>>>> Again, all PRI handlers are stuck in replicateLogEntries() - looking at
>>>> the region server status page. Moreover, jstack shows that these are
>>>> stuck on locateRegionInMeta. The other handlers are waiting as normal.
>>>> Also, there are 0 byte files again under /.logs - not sure if these are
>>>> causing the issues...
>>>>
>>>> Thanks!
>>>>
>>>> On Fri, Nov 30, 2012 at 3:46 AM, Varun Sharma <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> I have a master-master replication setup with hbase 0.94.0 - I only
>>>>> write to cluster A and replication carries the data over to cluster B.
>>>>> I am having some really weird issues with cluster B. Basically, all the
>>>>> Priority RPC handlers are stuck in replicateLogEntries calls while all
>>>>> the normal RPC handlers are just waiting, on each region server.
>>>>>
>>>>> From the logs I could see the following:
>>>>>
>>>>> 1) Region server shutdown
>>>>> Stopping the region server showed some issues. There were exceptions
>>>>> thrown while closing down regions - the exceptions were in the
>>>>> locateRegionInMeta calls and also while trying to get the value of
>>>>> /hbase/root-region-server (I have checked via a manual client;
>>>>> zookeeper is working fine).
>>>>>
>>>>> 2) jstack traces show that there are issues with locating the META and
>>>>> ROOT tables:
>>>>>
>>>>> "PRI IPC Server handler 2 on 60020" daemon prio=10 tid=0x00007f4ddcd39000 nid=0x2dbf waiting on condition [0x00007f4dd9edc000]
>>>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>>>         at java.lang.Thread.sleep(Native Method)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1046)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
>>>>>         at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
>>>>>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
>>>>>         at org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(HTableFactory.java:36)
>>>>>         at org.apache.hadoop.hbase.client.HTablePool.createHTable(HTablePool.java:268)
>>>>>         at org.apache.hadoop.hbase.client.HTablePool.findOrCreateTable(HTablePool.java:198)
>>>>>         at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:173)
>>>>>         at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:216)
>>>>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:171)
>>>>>
>>>>> "IPC Server handler 3 on 60020" daemon prio=10 tid=0x00007f4ddcb1d800 nid=0x2db6 waiting on condition [0x00007f4dda7e6000]
>>>>>    java.lang.Thread.State: WAITING (parking)
>>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>>         - parking to wait for <0x000000056aa146e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>>>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>>>>>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1348)
>>>>>
>>>>> 3) The region server holding ROOT also shows the following traces with
>>>>> jstack:
>>>>>
>>>>> "RS_OPEN_REGION-ip-10-60-53-226.ec2.internal,60020,1354263663659-2" prio=10 tid=0x0000000001f07800 nid=0x575c waiting on condition [0x00007fc3333f2000]
>>>>>    java.lang.Thread.State: WAITING (parking)
>>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>>         - parking to wait for <0x000000056c2ceb70> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>>>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>>>>>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>>>         at java.lang.Thread.run(Thread.java:679)
>>>>>
>>>>> "RS_OPEN_REGION-ip-10-60-53-226.ec2.internal,60020,1354263663659-1" prio=10 tid=0x0000000002e9f000 nid=0x572d waiting on condition [0x00007fc3337f6000]
>>>>>    java.lang.Thread.State: WAITING (parking)
>>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>>         - parking to wait for <0x000000056c2ceb70> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>>>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>>>>>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>>>         at java.lang.Thread.run(Thread.java:679)
>>>>>
>>>>> 4) There are some replication related exceptions, but I am not sure if
>>>>> those are critical:
>>>>>
>>>>> 2012-11-30 00:18:04,575 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>>>>> java.io.EOFException
>>>>>
>>>>> Also:
>>>>>
>>>>> 2012-11-30 00:06:33,830 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
>>>>> java.net.SocketTimeoutException: Call to ip-10-10-54-176.ec2.internal/10.10.54.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 1500 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.60.53.226:34164 remote=ip-10-10-54-176.ec2.internal/10.10.54.176:60020]
>>>>>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:949)
>>>>>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:922)
>>>>>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
>>>>>         at $Proxy12.getClosestRowBefore(Unknown Source)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:965)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1042)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1482)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1367)
>>>>>
>>>>> At this point, when I restart region servers, they end up with 0
>>>>> regions and I am not able to bring back the regions they were serving.
>>>>> Any help would be deeply appreciated.
>>>>>
>>>>> Thanks,
>>>>> Varun
