Are we losing handler threads, the workers that take from the pool we are blocked on?
The attached thread dump has ten, with non-sequential numbers:

Thread 97 (defaultRpcServer.handler=27,queue=0,port=60020):
Thread 94 (defaultRpcServer.handler=24,queue=0,port=60020):
Thread 91 (defaultRpcServer.handler=21,queue=0,port=60020):
Thread 90 (defaultRpcServer.handler=20,queue=2,port=60020):
Thread 88 (defaultRpcServer.handler=18,queue=0,port=60020):
Thread 82 (defaultRpcServer.handler=12,queue=0,port=60020):
Thread 81 (defaultRpcServer.handler=11,queue=2,port=60020):
Thread 76 (defaultRpcServer.handler=6,queue=0,port=60020):

Perhaps this is an artifact of how the thread dump is being taken via the
UI servlet. If you jstack, do you see hbase.regionserver.handler.count
instances of defaultRpcServer, with handler numbers running from 0 up to
hbase.regionserver.handler.count? If handlers are not taking from the call
queue, yeah, it will fill.

St.Ack

On Fri, Aug 22, 2014 at 12:54 PM, Stack <[email protected]> wrote:

> nvm. misread. Trying to figure why the scheduling queue is filled to the
> brim such that no more calls can be added/dispatched...
> St.Ack
>
>
> On Fri, Aug 22, 2014 at 12:45 PM, Stack <[email protected]> wrote:
>
>> Are you replicating?
>> St.Ack
>>
>>
>> On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback <
>> [email protected]> wrote:
>>
>>> Dear HBase-Pros,
>>>
>>> we face a serious issue with our HBase production cluster for two days
>>> now. Every couple of minutes, a random RegionServer gets stuck and does
>>> not process any requests. In addition, this causes the other
>>> RegionServers to freeze within a minute, which brings down the entire
>>> cluster. Stopping the affected RegionServer unblocks the cluster and
>>> everything comes back to normal.
>>>
>>> We run 27 RegionServers, each having 31 GB JVM memory. The HBase version
>>> is 0.98.5 on Hadoop 2.4.1. We basically have two tables: the first has
>>> about 4,500 Regions, holds 8 TB and serves about 1,000 requests per
>>> second; the second has around 200 Regions with about 50,000 to 120,000
>>> requests per second over all Regions, 800 GB worth of data and with
>>> IN_MEMORY enabled.
>>>
>>> While investigating the problem, I found out that every healthy
>>> RegionServer has the following thread:
>>>
>>> Thread 12 (RpcServer.listener,port=60020):
>>>   State: RUNNABLE
>>>   Blocked count: 35
>>>   Waited count: 0
>>>   Stack:
>>>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:684)
>>>
>>> When a RegionServer suddenly becomes blocked, this particular thread
>>> then looks like:
>>>
>>> Thread 12 (RpcServer.listener,port=60020):
>>>   State: BLOCKED
>>>   Blocked count: 2889
>>>   Waited count: 0
>>>   Blocked on org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@38cba1a1
>>>   Blocked by 14 (RpcServer.reader=1,port=60020)
>>>   Stack:
>>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.registerChannel(RpcServer.java:619)
>>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(RpcServer.java:774)
>>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:692)
>>>
>>> Also, JMX shows for an unhealthy RegionServer that
>>>
>>> - "queueSize" grows quickly and constantly to values greater than 60k, and
>>> - "numCallsInGeneralQueue" quickly reaches 300
>>>
>>> Both values are usually very small or 0 under normal circumstances, but
>>> when a RS "gets stuck" they explode, which leads me to believe that the
>>> IPC queue does not get processed properly, causing the RegionServer to
>>> become "deaf".
>>>
>>> These two symptoms appear to bring down the entire cluster. When killing
>>> that RS, everything goes back to normal.
>>>
>>> I could not find any correlation between this phenomenon and compactions,
>>> load or other factors. hbck says it is all fine as well.
>>>
>>> The servers all run Debian (kernel 3.2.0-4-amd64) with 12 cores and 96 GB
>>> RAM. Besides the RS and a DataNode, there isn't much else running on the
>>> boxes, so the load (top) is usually around 5 to 10 and bandwidth does not
>>> exceed 10 MB/s on average.
>>>
>>> We currently survive by polling /jmx of all RegionServers constantly and
>>> restarting those that show the symptoms :(
>>>
>>> Do you have any idea what could be causing this?
>>>
>>> Thank you very much in advance!
>>>
>>> Johannes
>>>
>>
>>
>
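Regarding the jstack check suggested above: here is a minimal sketch of
counting the handler threads in a saved jstack output. The thread-name and
state patterns are assumptions based on the names in the dump you pasted;
adjust the regexes to whatever your jstack actually prints.

#!/usr/bin/env python
# Minimal sketch: count defaultRpcServer.handler threads in a saved jstack
# output and summarize their states. Assumes handler thread names contain
# "defaultRpcServer.handler=<n>" as in the dump above.
import re
import sys
from collections import Counter

def summarize(path):
    handlers = set()     # handler numbers seen in the dump
    states = Counter()   # java.lang.Thread.State per handler thread
    in_handler = False
    with open(path) as f:
        for line in f:
            m = re.search(r'defaultRpcServer\.handler=(\d+)', line)
            if m:
                handlers.add(int(m.group(1)))
                in_handler = True
                continue
            if in_handler:
                s = re.search(r'java\.lang\.Thread\.State: (\S+)', line)
                if s:
                    states[s.group(1)] += 1
                    in_handler = False
    return handlers, states

if __name__ == '__main__':
    dump = sys.argv[1]
    # optional second arg: the configured hbase.regionserver.handler.count
    expected = int(sys.argv[2]) if len(sys.argv) > 2 else None
    handlers, states = summarize(dump)
    print('handler threads seen: %d' % len(handlers))
    if expected is not None:
        missing = sorted(set(range(expected)) - handlers)
        if missing:
            print('missing handler numbers: %s' % missing)
    for state, count in states.most_common():
        print('  %s: %d' % (state, count))

Run it against the output of jstack on the stuck RS and pass your
configured handler count as the second argument. If handler numbers are
missing, or most handlers show up BLOCKED rather than parked waiting on
the call queue, that would point at the handlers rather than the listener.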

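And on the /jmx polling stopgap mentioned at the bottom of the quoted
mail, in case it helps anyone else hitting this: a minimal sketch of that
kind of watchdog. It does not assume a particular bean name; it just takes
the first bean that exposes both queueSize and numCallsInGeneralQueue.
Hostnames, the info port and the thresholds are placeholders, not values
from the cluster above.

#!/usr/bin/env python
# Minimal sketch of a /jmx watchdog: poll each RegionServer's info servlet
# and flag servers whose queueSize / numCallsInGeneralQueue look like the
# "deaf" pattern described above. This only alerts, it does not restart
# anything.
import json
import urllib2  # Python 2, which was typical for 0.98-era clusters

REGIONSERVERS = ['rs1.example.com', 'rs2.example.com']  # your RS hosts
INFO_PORT = 60030               # default RS info port; adjust if changed
QUEUE_SIZE_LIMIT = 10000        # tune to what "normal" looks like for you
CALLS_IN_QUEUE_LIMIT = 100

def ipc_metrics(host):
    """Return (queueSize, numCallsInGeneralQueue) from the first bean that
    exposes both attributes, or (None, None) if no such bean is found."""
    url = 'http://%s:%d/jmx' % (host, INFO_PORT)
    beans = json.load(urllib2.urlopen(url, timeout=10))['beans']
    for bean in beans:
        if 'queueSize' in bean and 'numCallsInGeneralQueue' in bean:
            return bean['queueSize'], bean['numCallsInGeneralQueue']
    return None, None

if __name__ == '__main__':
    for host in REGIONSERVERS:
        try:
            queue_size, calls = ipc_metrics(host)
        except Exception as e:
            print('%s: could not fetch /jmx (%s)' % (host, e))
            continue
        if queue_size is None:
            print('%s: no bean with queueSize/numCallsInGeneralQueue' % host)
        elif queue_size > QUEUE_SIZE_LIMIT or calls > CALLS_IN_QUEUE_LIMIT:
            print('%s: SUSPECT queueSize=%s numCallsInGeneralQueue=%s'
                  % (host, queue_size, calls))
        else:
            print('%s: ok queueSize=%s numCallsInGeneralQueue=%s'
                  % (host, queue_size, calls))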