Re: single RegionServer stuck, causing cluster to hang

Stack Fri, 22 Aug 2014 12:46:05 -0700

Are you replicating?
St.Ack


On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback <
[email protected]> wrote:

> Dear HBase-Pros,
>
> we have faced a serious issue with our HBase production cluster for two days
> now. Every couple of minutes, a random RegionServer gets stuck and does not
> process any requests. In addition, this causes the other RegionServers to
> freeze within a minute, which brings down the entire cluster. Stopping the
> affected RegionServer unblocks the cluster and everything comes back to
> normal.
>
> We run 27 RegionServers, each having 31 GB JVM memory. The HBase Version is
> 0.98.5 on Hadoop 2.4.1. We basically have two tables, the first having
> about 4,500 Regions and holding 8 TB with 1000 requests per second, the
> second table is around 200 Regions with about 50,000 to 120,000 requests
> per sec over all Regions, 800 GB worth of data and with IN_MEMORY enabled.
>
> While investigating the problem, I found that every healthy RegionServer
> has the following thread:
>
> Thread 12 (RpcServer.listener,port=60020):
>   State: RUNNABLE
>   Blocked count: 35
>   Waited count: 0
>   Stack:
>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:684)
>
>
> When a RegionServer suddenly becomes blocked, this particular thread then
> looks like this:
>
> Thread 12 (RpcServer.listener,port=60020):
>   State: BLOCKED
>   Blocked count: 2889
>   Waited count: 0
>   Blocked on org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@38cba1a1
>   Blocked by 14 (RpcServer.reader=1,port=60020)
>   Stack:
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.registerChannel(RpcServer.java:619)
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(RpcServer.java:774)
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:692)
>
>
> Also, JMX shows for an unhealthy RegionServer that
>
>    - "queueSize" grows quickly and constantly to values greater than 60k,
>    and
>    - "numCallsInGeneralQueue" quickly reaches 300
>
> Both values are usually very small or 0 under normal circumstances, but in
> case of a RS "getting stuck" they explode, which leads me to believe that
> the IPC-queue does not get processed properly causing the RegionServer to
> become "deaf".
>
> These two symptoms appear to bring down the entire cluster. When killing
> that RS, everything goes back to normal.
>
> I could not find any correlation between this phenomenon and compactions,
> load or other factors. hbck says it is all fine as well.
>
> The servers are all 3.2.0-4-amd64 Debian, 12 cores, 96 GB RAM. Besides the
> RS and a DataNode, there isn't too much running on the boxes so the load
> (top) is usually around 5 to 10 and bandwidth does not exceed 10 MB on
> average.
>
> We currently survive by polling /jmx of all RegionServers constantly and
> restarting those that show the symptoms :(
>
> Do you have any idea what could be causing this?
>
> Thank you very much in advance!
>
> Johannes
>
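
For reference, the /jmx polling workaround described in the quoted message could be sketched roughly as follows. This is only a minimal sketch, not the poster's actual script: the thresholds are assumptions taken loosely from the numbers reported above, the info-server port 60030 is assumed to be the default for this HBase version, and the function names are hypothetical. Because the exact bean name is not given in the thread, the sketch scans every bean for the two metric names rather than addressing one bean directly.

```python
import json
from urllib.request import urlopen

# Assumed thresholds, loosely based on the values reported in the thread.
QUEUE_SIZE_LIMIT = 60_000
GENERAL_QUEUE_LIMIT = 300


def ipc_metrics(jmx):
    """Collect the two IPC metrics from a parsed /jmx document.

    Scans all beans, since the bean holding the metrics is not named
    in the thread.
    """
    found = {}
    for bean in jmx.get("beans", []):
        for key in ("queueSize", "numCallsInGeneralQueue"):
            if key in bean:
                found[key] = bean[key]
    return found


def looks_stuck(jmx):
    """Return True if either metric has reached its (assumed) threshold."""
    metrics = ipc_metrics(jmx)
    return (
        metrics.get("queueSize", 0) >= QUEUE_SIZE_LIMIT
        or metrics.get("numCallsInGeneralQueue", 0) >= GENERAL_QUEUE_LIMIT
    )


def fetch_jmx(host, port=60030, timeout=5):
    """Fetch and parse /jmx from a RegionServer's info server.

    The port is an assumption; adjust it to the cluster's configuration.
    """
    with urlopen(f"http://{host}:{port}/jmx", timeout=timeout) as resp:
        return json.load(resp)
```

A monitoring loop would call looks_stuck(fetch_jmx(host)) for each RegionServer and treat a request timeout as a possible symptom as well, since a "deaf" server may not answer promptly. Flagging on either metric rather than both is a design choice that favours early detection over fewer false alarms.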
