Hmm, yes. If you look at the discussion on the JIRA, Stack felt there was something else going on there. Check your Hadoop-side logs and thread dumps; they may tell you if your datanodes are a bit lazy. :)

Regards
Ram

On Thu, Dec 6, 2012 at 4:59 PM, Varun Sharma <[email protected]> wrote:

> I see - I am going to try the patch then - looks like all the threads have
> deadlocked and are holding the lock to the same block integer. The cache
> hit ratio is pretty high. Also the server has been in this state for the
> past 1 hour - I don't think it should take an hour to load one HDFS block -
> I am seeing the issue repeatedly - it looks like something is probably wrong
> with the locking mechanism when you have a higher number of IPC handlers,
> like 200.
>
> On Thu, Dec 6, 2012 at 2:59 AM, ramkrishna vasudevan <[email protected]> wrote:
>
> > Actually, when we observed that, our block cache was OFF... If possible,
> > try applying your patch and see what happens?
> > If you have more memory, just try increasing the ratio allocated to the
> > block cache?
> >
> > Regards
> > Ram
> >
> > On Thu, Dec 6, 2012 at 4:02 PM, Varun Sharma <[email protected]> wrote:
> >
> > > Hi Ram,
> > >
> > > Yes, BlockCache is on, but there is another in-memory column which might
> > > be preempting the stuff from the block cache. So, we might be hitting
> > > more disk seeks - I see that you have seen this trace before on
> > > HBASE-5898 - did that issue resolve things for you?
> > >
> > > Thanks
> > > Varun
> > >
> > > On Wed, Dec 5, 2012 at 10:04 PM, ramkrishna vasudevan <[email protected]> wrote:
> > >
> > > > Is block cache ON? Check out HBASE-5898?
> > > >
> > > > Regards
> > > > Ram
> > > >
> > > > On Thu, Dec 6, 2012 at 9:55 AM, Anoop Sam John <[email protected]> wrote:
> > > >
> > > > > >is the META table cached just like other tables
> > > > > Yes Varun I think so.
> > > > >
> > > > > -Anoop-
> > > > > ________________________________________
> > > > > From: Varun Sharma [[email protected]]
> > > > > Sent: Thursday, December 06, 2012 6:10 AM
> > > > > To: [email protected]; lars hofhansl
> > > > > Subject: Re: .META. region server DDOSed by too many clients
> > > > >
> > > > > We only see this on the .META. region, not otherwise...
> > > > >
> > > > > On Wed, Dec 5, 2012 at 4:37 PM, Varun Sharma <[email protected]> wrote:
> > > > >
> > > > > > I see, but is this pointing to the fact that we are heading to disk
> > > > > > for scanning META - if yes, that would be pretty bad, no? Currently
> > > > > > I am trying to see if the freeze coincides with the Block Cache
> > > > > > being full (we have an in-memory column) - is the META table cached
> > > > > > just like other tables?
> > > > > >
> > > > > > Varun
> > > > > >
> > > > > > On Wed, Dec 5, 2012 at 4:20 PM, lars hofhansl <[email protected]> wrote:
> > > > > >
> > > > > >> Looks like you're running into HBASE-5898.
> > > > > >>
> > > > > >> ----- Original Message -----
> > > > > >> From: Varun Sharma <[email protected]>
> > > > > >> To: [email protected]
> > > > > >> Cc:
> > > > > >> Sent: Wednesday, December 5, 2012 3:51 PM
> > > > > >> Subject: .META. region server DDOSed by too many clients
> > > > > >>
> > > > > >> Hi,
> > > > > >>
> > > > > >> I am running hbase 0.94.0 and I have a significant write load being
> > > > > >> put on a table with 98 regions on a 15 node cluster - also this
> > > > > >> write load comes from a very large number of clients (~ 1000). I am
> > > > > >> running with 10 priority IPC handlers and 200 IPC handlers. It
> > > > > >> seems the region server holding .META. is DDOSed. All the 200
> > > > > >> handlers are busy serving the .META. region and they are all locked
> > > > > >> onto one object. The jstack is here for the region server
> > > > > >>
> > > > > >> "IPC Server handler 182 on 60020" daemon prio=10 tid=0x00007f329872c800 nid=0x4401 waiting on condition [0x00007f328807f000]
> > > > > >>    java.lang.Thread.State: WAITING (parking)
> > > > > >>     at sun.misc.Unsafe.park(Native Method)
> > > > > >>     - parking to wait for <0x0000000542d72e30> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> > > > > >>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > > > >>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
> > > > > >>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:871)
> > > > > >>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1201)
> > > > > >>     at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
> > > > > >>     at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
> > > > > >>     at java.util.concurrent.ConcurrentHashMap$Segment.put(ConcurrentHashMap.java:445)
> > > > > >>     at java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:925)
> > > > > >>     at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:71)
> > > > > >>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:290)
> > > > > >>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:213)
> > > > > >>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:455)
> > > > > >>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:493)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:242)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:167)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:54)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:299)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:244)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:521)
> > > > > >>     - locked <0x000000063b4965d0> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:402)
> > > > > >>     - locked <0x000000063b4965d0> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:127)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3354)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3310)
> > > > > >>     - locked <0x0000000523c211e0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3327)
> > > > > >>     - locked <0x0000000523c211e0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4066)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4039)
> > > > > >>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1941)
> > > > > >>
> > > > > >> The client side trace shows that we are looking for the META region.
> > > > > >>
> > > > > >> "thrift-worker-3499" daemon prio=10 tid=0x00007f789dd98800 nid=0xb52 waiting for monitor entry [0x00007f778672d000]
> > > > > >>    java.lang.Thread.State: BLOCKED (on object monitor)
> > > > > >>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943)
> > > > > >>     - waiting to lock <0x0000000707978298> (a java.lang.Object)
> > > > > >>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
> > > > > >>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1482)
> > > > > >>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1367)
> > > > > >>     at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:729)
> > > > > >>     - locked <0x000000070821d5a0> (a org.apache.hadoop.hbase.client.HTable)
> > > > > >>     at org.apache.hadoop.hbase.client.HTable.get(HTable.java:698)
> > > > > >>     at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:371)
> > > > > >>
> > > > > >> On the RS page, I see 68 million read requests for the META region,
> > > > > >> while for the other 98 regions we have done like 20 million write
> > > > > >> requests in total - regions have not moved around at all and no
> > > > > >> crashes have happened. Why do we have such an incredible number of
> > > > > >> scans over META, and is there something I can do about this issue?
> > > > > >>
> > > > > >> Varun
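
For anyone repeating the thread-dump check suggested in this thread, here is a minimal sketch in Python of tallying a jstack dump by thread state and by the lock each thread is waiting on. The function name `summarize_jstack` is made up for illustration, and the regular expressions assume the HotSpot jstack line formats that appear in the traces quoted above ("parking to wait for", "waiting to lock", "waiting on"); other JVMs or versions may print these lines differently.

```python
import re
from collections import Counter

# Matches e.g. "java.lang.Thread.State: WAITING (parking)" and captures "WAITING".
STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")

# Matches e.g. "- parking to wait for <0x...> (a some.Class)" and captures
# the lock address and the class name. Assumed formats, see the note above.
WAIT_RE = re.compile(
    r"-\s+(?:parking to wait for|waiting to lock|waiting on)\s+"
    r"<(0x[0-9a-fA-F]+)>\s+\(a ([^)]+)\)"
)


def summarize_jstack(dump):
    """Return (states, waits) for the text of a jstack dump.

    states: Counter mapping thread state name -> number of threads.
    waits:  Counter mapping (lock address, lock class) -> number of
            threads reported as waiting on that lock.
    """
    states = Counter()
    waits = Counter()
    for line in dump.splitlines():
        state_match = STATE_RE.search(line)
        if state_match:
            states[state_match.group(1)] += 1
            continue
        wait_match = WAIT_RE.search(line)
        if wait_match:
            waits[(wait_match.group(1), wait_match.group(2))] += 1
    return states, waits


def most_contended(waits, top=5):
    """Return the `top` locks with the most waiting threads, busiest first."""
    return waits.most_common(top)
```

Feeding it the text of a saved dump and calling `most_contended(waits)` lists the lock addresses with the most waiters. If most of the handler threads show up under a single address, that matches the "all locked onto one object" observation; it does not by itself say who holds the lock or why.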
