Are we losing handler threads, the workers that take from the pool we are blocked on?
The attached thread dump has ten, with non-sequential numbers:

Thread 97 (defaultRpcServer.handler=27,queue=0,port=60020):
Thread 94 (defaultRpcServer.handler=24,queue=0,port=60020):
Thread 91 (defaultRpcServer.handler=21,queue=0,port=60020):
Thread 90 (defaultRpcServer.handler=20,queue=2,port=60020):
Thread 88 (defaultRpcServer.handler=18,queue=0,port=60020):
Thread 82 (defaultRpcServer.handler=12,queue=0,port=60020):
Thread 81 (defaultRpcServer.handler=11,queue=2,port=60020):
Thread 76 (defaultRpcServer.handler=6,queue=0,port=60020):

Perhaps this is an artifact of how the thread dump is being taken via the
UI servlet. If you jstack, do you see hbase.regionserver.handler.count
instances of defaultRpcServer, with handler numbers running from 0 up to
hbase.regionserver.handler.count? If handlers are not taking from the call
queue, yeah, it will fill.

St.Ack

On Fri, Aug 22, 2014 at 12:54 PM, Stack <[email protected]> wrote:

> nvm. misread. Trying to figure why the scheduling queue is filled to the
> brim such that no more calls can be added/dispatched...
> St.Ack
>
>
> On Fri, Aug 22, 2014 at 12:45 PM, Stack <[email protected]> wrote:
>
>> Are you replicating?
>> St.Ack
>>
>>
>> On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback <
>> [email protected]> wrote:
>>
>>> Dear HBase-Pros,
>>>
>>> we face a serious issue with our HBase production cluster for two days
>>> now. Every couple of minutes, a random RegionServer gets stuck and does
>>> not process any requests. In addition, this causes the other
>>> RegionServers to freeze within a minute, which brings down the entire
>>> cluster. Stopping the affected RegionServer unblocks the cluster and
>>> everything comes back to normal.
>>>
>>> We run 27 RegionServers, each having 31 GB JVM memory. The HBase version
>>> is 0.98.5 on Hadoop 2.4.1. We basically have two tables: the first has
>>> about 4,500 Regions, holds 8 TB and serves about 1,000 requests per
>>> second; the second has around 200 Regions with about 50,000 to 120,000
>>> requests per second over all Regions, 800 GB worth of data and with
>>> IN_MEMORY enabled.
>>>
>>> While investigating the problem, I found out that every healthy
>>> RegionServer has the following thread:
>>>
>>> Thread 12 (RpcServer.listener,port=60020):
>>>   State: RUNNABLE
>>>   Blocked count: 35
>>>   Waited count: 0
>>>   Stack:
>>>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:684)
>>>
>>> When a RegionServer suddenly becomes blocked, this particular thread
>>> then looks like:
>>>
>>> Thread 12 (RpcServer.listener,port=60020):
>>>   State: BLOCKED
>>>   Blocked count: 2889
>>>   Waited count: 0
>>>   Blocked on org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@38cba1a1
>>>   Blocked by 14 (RpcServer.reader=1,port=60020)
>>>   Stack:
>>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.registerChannel(RpcServer.java:619)
>>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(RpcServer.java:774)
>>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:692)
>>>
>>> Also, JMX shows for an unhealthy RegionServer that
>>>
>>> - "queueSize" grows quickly and constantly to values greater than 60k, and
>>> - "numCallsInGeneralQueue" quickly reaches 300
>>>
>>> Both values are usually very small or 0 under normal circumstances, but
>>> when a RS "gets stuck" they explode, which leads me to believe that the
>>> IPC queue does not get processed properly, causing the RegionServer to
>>> become "deaf".
>>>
>>> These two symptoms appear to bring down the entire cluster. When killing
>>> that RS, everything goes back to normal.
>>>
>>> I could not find any correlation between this phenomenon and compactions,
>>> load or other factors. hbck says it is all fine as well.
>>>
>>> The servers all run Debian (kernel 3.2.0-4-amd64) with 12 cores and 96 GB
>>> RAM. Besides the RS and a DataNode, there isn't much else running on the
>>> boxes, so the load (top) is usually around 5 to 10 and bandwidth does not
>>> exceed 10 MB/s on average.
>>>
>>> We currently survive by polling /jmx of all RegionServers constantly and
>>> restarting those that show the symptoms :(
>>>
>>> Do you have any idea what could be causing this?
>>>
>>> Thank you very much in advance!
>>>
>>> Johannes
>>>
>>
>>
>
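Regarding the jstack check suggested above: here is a minimal sketch of
counting the handler threads in a saved jstack output. The thread-name and
state patterns are assumptions based on the names in the dump you pasted;
adjust the regexes to whatever your jstack actually prints.

#!/usr/bin/env python
# Minimal sketch: count defaultRpcServer.handler threads in a saved jstack
# output and summarize their states. Assumes handler thread names contain
# "defaultRpcServer.handler=<n>" as in the dump above.
import re
import sys
from collections import Counter

def summarize(path):
    handlers = set()     # handler numbers seen in the dump
    states = Counter()   # java.lang.Thread.State per handler thread
    in_handler = False
    with open(path) as f:
        for line in f:
            m = re.search(r'defaultRpcServer\.handler=(\d+)', line)
            if m:
                handlers.add(int(m.group(1)))
                in_handler = True
                continue
            if in_handler:
                s = re.search(r'java\.lang\.Thread\.State: (\S+)', line)
                if s:
                    states[s.group(1)] += 1
                    in_handler = False
    return handlers, states

if __name__ == '__main__':
    dump = sys.argv[1]
    # optional second arg: the configured hbase.regionserver.handler.count
    expected = int(sys.argv[2]) if len(sys.argv) > 2 else None
    handlers, states = summarize(dump)
    print('handler threads seen: %d' % len(handlers))
    if expected is not None:
        missing = sorted(set(range(expected)) - handlers)
        if missing:
            print('missing handler numbers: %s' % missing)
    for state, count in states.most_common():
        print('  %s: %d' % (state, count))

Run it against the output of jstack on the stuck RS and pass your
configured handler count as the second argument. If handler numbers are
missing, or most handlers show up BLOCKED rather than parked waiting on
the call queue, that would point at the handlers rather than the listener.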

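And on the /jmx polling stopgap mentioned at the bottom of the quoted
mail, in case it helps anyone else hitting this: a minimal sketch of that
kind of watchdog. It does not assume a particular bean name; it just takes
the first bean that exposes both queueSize and numCallsInGeneralQueue.
Hostnames, the info port and the thresholds are placeholders, not values
from the cluster above.

#!/usr/bin/env python
# Minimal sketch of a /jmx watchdog: poll each RegionServer's info servlet
# and flag servers whose queueSize / numCallsInGeneralQueue look like the
# "deaf" pattern described above. This only alerts, it does not restart
# anything.
import json
import urllib2  # Python 2, which was typical for 0.98-era clusters

REGIONSERVERS = ['rs1.example.com', 'rs2.example.com']  # your RS hosts
INFO_PORT = 60030               # default RS info port; adjust if changed
QUEUE_SIZE_LIMIT = 10000        # tune to what "normal" looks like for you
CALLS_IN_QUEUE_LIMIT = 100

def ipc_metrics(host):
    """Return (queueSize, numCallsInGeneralQueue) from the first bean that
    exposes both attributes, or (None, None) if no such bean is found."""
    url = 'http://%s:%d/jmx' % (host, INFO_PORT)
    beans = json.load(urllib2.urlopen(url, timeout=10))['beans']
    for bean in beans:
        if 'queueSize' in bean and 'numCallsInGeneralQueue' in bean:
            return bean['queueSize'], bean['numCallsInGeneralQueue']
    return None, None

if __name__ == '__main__':
    for host in REGIONSERVERS:
        try:
            queue_size, calls = ipc_metrics(host)
        except Exception as e:
            print('%s: could not fetch /jmx (%s)' % (host, e))
            continue
        if queue_size is None:
            print('%s: no bean with queueSize/numCallsInGeneralQueue' % host)
        elif queue_size > QUEUE_SIZE_LIMIT or calls > CALLS_IN_QUEUE_LIMIT:
            print('%s: SUSPECT queueSize=%s numCallsInGeneralQueue=%s'
                  % (host, queue_size, calls))
        else:
            print('%s: ok queueSize=%s numCallsInGeneralQueue=%s'
                  % (host, queue_size, calls))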