Those:

HBASE-6165 Replication can overrun .META. scans on cluster re-start
HBASE-6550 Refactoring ReplicationSink to make it more responsive of cluster health
HBASE-6860 [replication] HBASE-6550 is too aggressive, DDOSes .META.
Jean-Daniel

On Fri, Nov 30, 2012 at 10:08 AM, Varun Sharma <[email protected]> wrote:
> Hi Jean,
>
> I looked at the release notes for 0.94.1 and 0.94.2 and it looks like all
> the fixes there have to do with splitting of regions (I may be wrong). For
> my cluster(s), splits are off.
>
> Varun
>
> On Fri, Nov 30, 2012 at 10:03 AM, Varun Sharma <[email protected]> wrote:
>> Hi Jean,
>>
>> Thanks! Could you point me to some of the fixes? We currently use
>> hbase-0.94.0 with some other patches.
>>
>> On Fri, Nov 30, 2012 at 8:53 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>> Use 0.94.2, it has all the fixes you need.
>>>
>>> J-D
>>>
>>> On Fri, Nov 30, 2012 at 4:56 AM, Varun Sharma <[email protected]> wrote:
>>>> After clearing out some files in /.logs which had size 0 and restarting
>>>> the cluster, all regions came online and started serving. But now I am
>>>> stuck again. The master moved some regions to rebalance after the
>>>> restart, and some of them are PENDING_CLOSE while 2 regions are offline.
>>>> Again, all PRI handlers are stuck in replicateLogEntries() - looking at
>>>> the region server status page. Moreover, jstack shows that these are
>>>> stuck on locateRegionInMeta. The other handlers are waiting as normal.
>>>> Also, there are 0 byte files again under /.logs - not sure if these are
>>>> causing the issues...
>>>>
>>>> Thanks!
>>>>
>>>> On Fri, Nov 30, 2012 at 3:46 AM, Varun Sharma <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> I have a master-master replication setup with hbase 0.94.0 - I only
>>>>> write to cluster A and replication carries the data over to cluster B.
>>>>> I am having some really weird issues with cluster B. Basically, all the
>>>>> Priority RPC handlers are stuck in replicateLogEntries calls while all
>>>>> the normal RPC handlers are just waiting, on each region server.
>>>>>
>>>>> From the logs I could see the following:
>>>>>
>>>>> 1) Region server shutdown
>>>>> Stopping the region server showed some issues. There were exceptions
>>>>> thrown while closing down regions - the exceptions were in the
>>>>> locateRegionInMeta calls and also while trying to get the value of
>>>>> /hbase/root-region-server (I have checked via a manual client;
>>>>> zookeeper is working fine).
>>>>>
>>>>> 2) jstack traces show that there are issues with locating the META and
>>>>> ROOT tables:
>>>>>
>>>>> "PRI IPC Server handler 2 on 60020" daemon prio=10 tid=0x00007f4ddcd39000 nid=0x2dbf waiting on condition [0x00007f4dd9edc000]
>>>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>>>         at java.lang.Thread.sleep(Native Method)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1046)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
>>>>>         at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
>>>>>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
>>>>>         at org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(HTableFactory.java:36)
>>>>>         at org.apache.hadoop.hbase.client.HTablePool.createHTable(HTablePool.java:268)
>>>>>         at org.apache.hadoop.hbase.client.HTablePool.findOrCreateTable(HTablePool.java:198)
>>>>>         at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:173)
>>>>>         at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:216)
>>>>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:171)
>>>>>
>>>>> "IPC Server handler 3 on 60020" daemon prio=10 tid=0x00007f4ddcb1d800 nid=0x2db6 waiting on condition [0x00007f4dda7e6000]
>>>>>    java.lang.Thread.State: WAITING (parking)
>>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>>         - parking to wait for <0x000000056aa146e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>>>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>>>>>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1348)
>>>>>
>>>>> 3) The region server holding ROOT also shows the following traces with
>>>>> jstack:
>>>>>
>>>>> "RS_OPEN_REGION-ip-10-60-53-226.ec2.internal,60020,1354263663659-2" prio=10 tid=0x0000000001f07800 nid=0x575c waiting on condition [0x00007fc3333f2000]
>>>>>    java.lang.Thread.State: WAITING (parking)
>>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>>         - parking to wait for <0x000000056c2ceb70> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>>>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>>>>>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>>>         at java.lang.Thread.run(Thread.java:679)
>>>>>
>>>>> "RS_OPEN_REGION-ip-10-60-53-226.ec2.internal,60020,1354263663659-1" prio=10 tid=0x0000000002e9f000 nid=0x572d waiting on condition [0x00007fc3337f6000]
>>>>>    java.lang.Thread.State: WAITING (parking)
>>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>>         - parking to wait for <0x000000056c2ceb70> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>>>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>>>>>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>>>         at java.lang.Thread.run(Thread.java:679)
>>>>>
>>>>> 4) There are some replication related exceptions, but I am not sure if
>>>>> those are critical:
>>>>>
>>>>> 2012-11-30 00:18:04,575 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>>>>> java.io.EOFException
>>>>>
>>>>> Also:
>>>>>
>>>>> 2012-11-30 00:06:33,830 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
>>>>> java.net.SocketTimeoutException: Call to ip-10-10-54-176.ec2.internal/10.10.54.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 1500 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.60.53.226:34164 remote=ip-10-10-54-176.ec2.internal/10.10.54.176:60020]
>>>>>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:949)
>>>>>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:922)
>>>>>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
>>>>>         at $Proxy12.getClosestRowBefore(Unknown Source)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:965)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1042)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1482)
>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1367)
>>>>>
>>>>> At this point, when I restart region servers, they end up with 0
>>>>> regions and I am not able to bring back the regions they were serving.
>>>>> Any help would be deeply appreciated.
>>>>>
>>>>> Thanks,
>>>>> Varun
