nvm. misread. Trying to figure out why the scheduling queue is filled to the brim such that no more calls can be added/dispatched...
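
Here is the shape of what I think the thread dumps below are showing. This is a minimal sketch only, not the actual RpcServer code (the class/method names and the queue capacity are made up): if a reader blocks on a full bounded call queue while holding its own monitor, the listener that needs that same monitor to register a newly accepted channel blocks behind it, and the server stops accepting connections.

// Minimal sketch of the suspected blocking pattern -- NOT HBase's RpcServer
// code; names and the queue capacity are invented for illustration only.
import java.nio.channels.SocketChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ReaderSketch {
  // Bounded "general" call queue, standing in for the scheduler's call queue.
  private final BlockingQueue<Runnable> callQueue = new ArrayBlockingQueue<>(300);

  // Reader thread: if the queue is full, put() blocks while this reader's
  // monitor is held...
  synchronized void enqueueCall(Runnable call) throws InterruptedException {
    callQueue.put(call);
  }

  // ...so the listener thread, which needs the same monitor to hand off a
  // newly accepted connection, blocks here and stops accepting new clients.
  synchronized void registerChannel(SocketChannel channel) {
    // would register the channel with this reader's read selector
  }
}

If that is what is happening, the open question is still why the call queue stays full in the first place.
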
On Fri, Aug 22, 2014 at 12:45 PM, Stack <[email protected]> wrote:

> Are you replicating?
> St.Ack
>
>
> On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback <[email protected]> wrote:
>
>> Dear HBase-Pros,
>>
>> we have been facing a serious issue with our HBase production cluster for
>> two days now. Every couple of minutes, a random RegionServer gets stuck and
>> does not process any requests. In addition, this causes the other
>> RegionServers to freeze within a minute, which brings down the entire
>> cluster. Stopping the affected RegionServer unblocks the cluster and
>> everything comes back to normal.
>>
>> We run 27 RegionServers, each with 31 GB of JVM memory. The HBase version
>> is 0.98.5 on Hadoop 2.4.1. We basically have two tables: the first has
>> about 4,500 Regions, holds 8 TB and serves 1,000 requests per second; the
>> second has around 200 Regions with about 50,000 to 120,000 requests per
>> second across all Regions, 800 GB worth of data, and IN_MEMORY enabled.
>>
>> While investigating the problem, I found that every healthy RegionServer
>> has the following thread:
>>
>> Thread 12 (RpcServer.listener,port=60020):
>>   State: RUNNABLE
>>   Blocked count: 35
>>   Waited count: 0
>>   Stack:
>>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:684)
>>
>> When a RegionServer suddenly becomes blocked, this particular thread then
>> looks like:
>>
>> Thread 12 (RpcServer.listener,port=60020):
>>   State: BLOCKED
>>   Blocked count: 2889
>>   Waited count: 0
>>   Blocked on org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@38cba1a1
>>   Blocked by 14 (RpcServer.reader=1,port=60020)
>>   Stack:
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.registerChannel(RpcServer.java:619)
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(RpcServer.java:774)
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:692)
>>
>> Also, JMX shows for an unhealthy RegionServer that
>>
>> - "queueSize" grows quickly and constantly to values greater than 60k, and
>> - "numCallsInGeneralQueue" quickly reaches 300
>>
>> Both values are usually very small or 0 under normal circumstances, but
>> when a RegionServer "gets stuck" they explode, which leads me to believe
>> that the IPC queue does not get processed properly, causing the
>> RegionServer to become "deaf".
>>
>> These two symptoms appear to bring down the entire cluster. When we kill
>> that RegionServer, everything goes back to normal.
>>
>> I could not find any correlation between this phenomenon and compactions,
>> load or other factors. hbck says everything is fine as well.
>>
>> The servers are all 3.2.0-4-amd64 Debian with 12 cores and 96 GB RAM.
>> Besides the RegionServer and a DataNode, there isn't much else running on
>> the boxes, so the load (top) is usually around 5 to 10 and bandwidth does
>> not exceed 10 MB on average.
>>
>> We currently survive by constantly polling /jmx on all RegionServers and
>> restarting those that show the symptoms :(
>>
>> Do you have any idea what could be causing this?
>>
>> Thank you very much in advance!
>>
>> Johannes
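
For anyone hitting the same thing in the meantime, here is a rough sketch of the /jmx polling workaround Johannes describes above. The RegionServer info port (60030 here), the hostnames passed on the command line and the alert thresholds are assumptions to adjust for your own deployment, and the "parsing" is a crude string scan of the /jmx output rather than a proper JSON parse; it just looks for the two metrics named above, queueSize and numCallsInGeneralQueue, and flags servers where they blow up.

// Rough sketch of the /jmx polling workaround described above. Port,
// hostnames and thresholds are assumptions; adjust for your deployment.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RsQueueMonitor {
  // Metric names as they appear in the /jmx JSON output (quoted from the report above).
  private static final Pattern QUEUE_SIZE = Pattern.compile("\"queueSize\"\\s*:\\s*(\\d+)");
  private static final Pattern GENERAL_CALLS = Pattern.compile("\"numCallsInGeneralQueue\"\\s*:\\s*(\\d+)");

  public static void main(String[] args) throws Exception {
    // Pass RegionServer hostnames on the command line, e.g.: java RsQueueMonitor rs1 rs2 rs3
    for (String host : args) {
      String body = fetch("http://" + host + ":60030/jmx"); // 60030 = RS info port; change if yours differs
      long queueSize = firstLong(QUEUE_SIZE, body);
      long generalCalls = firstLong(GENERAL_CALLS, body);
      // Made-up thresholds; the report above saw >60k and ~300 on stuck servers.
      boolean suspect = queueSize > 10000 || generalCalls > 200;
      System.out.printf("%s queueSize=%d numCallsInGeneralQueue=%d%s%n",
          host, queueSize, generalCalls, suspect ? "  <-- looks stuck" : "");
    }
  }

  private static String fetch(String url) throws Exception {
    StringBuilder sb = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream()))) {
      String line;
      while ((line = in.readLine()) != null) sb.append(line).append('\n');
    }
    return sb.toString();
  }

  private static long firstLong(Pattern p, String body) {
    Matcher m = p.matcher(body);
    return m.find() ? Long.parseLong(m.group(1)) : -1;
  }
}

Wire it into whatever restarts the flagged servers, but that is obviously a band-aid, not a fix.

St.Ack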
