nvm. misread. Trying to figure out why the scheduling queue is filled to the brim such that no more calls can be added/dispatched...
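
Here is the shape of what I think the thread dumps below are showing. This is a minimal sketch only, not the actual RpcServer code (the class/method names and the queue capacity are made up): if a reader blocks on a full bounded call queue while holding its own monitor, the listener that needs that same monitor to register a newly accepted channel blocks behind it, and the server stops accepting connections.

// Minimal sketch of the suspected blocking pattern -- NOT HBase's RpcServer
// code; names and the queue capacity are invented for illustration only.
import java.nio.channels.SocketChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ReaderSketch {
  // Bounded "general" call queue, standing in for the scheduler's call queue.
  private final BlockingQueue<Runnable> callQueue = new ArrayBlockingQueue<>(300);

  // Reader thread: if the queue is full, put() blocks while this reader's
  // monitor is held...
  synchronized void enqueueCall(Runnable call) throws InterruptedException {
    callQueue.put(call);
  }

  // ...so the listener thread, which needs the same monitor to hand off a
  // newly accepted connection, blocks here and stops accepting new clients.
  synchronized void registerChannel(SocketChannel channel) {
    // would register the channel with this reader's read selector
  }
}

If that is what is happening, the open question is still why the call queue stays full in the first place.
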
On Fri, Aug 22, 2014 at 12:45 PM, Stack <[email protected]> wrote:

> Are you replicating?
> St.Ack
>
>
> On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback <[email protected]> wrote:
>
>> Dear HBase-Pros,
>>
>> we have been facing a serious issue with our HBase production cluster for
>> two days now. Every couple of minutes, a random RegionServer gets stuck and
>> does not process any requests. In addition, this causes the other
>> RegionServers to freeze within a minute, which brings down the entire
>> cluster. Stopping the affected RegionServer unblocks the cluster and
>> everything comes back to normal.
>>
>> We run 27 RegionServers, each with 31 GB of JVM memory. The HBase version
>> is 0.98.5 on Hadoop 2.4.1. We basically have two tables: the first has
>> about 4,500 Regions, holds 8 TB and serves 1,000 requests per second; the
>> second has around 200 Regions with about 50,000 to 120,000 requests per
>> second across all Regions, 800 GB worth of data, and IN_MEMORY enabled.
>>
>> While investigating the problem, I found that every healthy RegionServer
>> has the following thread:
>>
>> Thread 12 (RpcServer.listener,port=60020):
>>   State: RUNNABLE
>>   Blocked count: 35
>>   Waited count: 0
>>   Stack:
>>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:684)
>>
>> When a RegionServer suddenly becomes blocked, this particular thread then
>> looks like:
>>
>> Thread 12 (RpcServer.listener,port=60020):
>>   State: BLOCKED
>>   Blocked count: 2889
>>   Waited count: 0
>>   Blocked on org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@38cba1a1
>>   Blocked by 14 (RpcServer.reader=1,port=60020)
>>   Stack:
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.registerChannel(RpcServer.java:619)
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(RpcServer.java:774)
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:692)
>>
>> Also, JMX shows for an unhealthy RegionServer that
>>
>> - "queueSize" grows quickly and constantly to values greater than 60k, and
>> - "numCallsInGeneralQueue" quickly reaches 300
>>
>> Both values are usually very small or 0 under normal circumstances, but
>> when a RegionServer "gets stuck" they explode, which leads me to believe
>> that the IPC queue does not get processed properly, causing the
>> RegionServer to become "deaf".
>>
>> These two symptoms appear to bring down the entire cluster. When we kill
>> that RegionServer, everything goes back to normal.
>>
>> I could not find any correlation between this phenomenon and compactions,
>> load or other factors. hbck says everything is fine as well.
>>
>> The servers are all 3.2.0-4-amd64 Debian with 12 cores and 96 GB RAM.
>> Besides the RegionServer and a DataNode, there isn't much else running on
>> the boxes, so the load (top) is usually around 5 to 10 and bandwidth does
>> not exceed 10 MB on average.
>>
>> We currently survive by constantly polling /jmx on all RegionServers and
>> restarting those that show the symptoms :(
>>
>> Do you have any idea what could be causing this?
>>
>> Thank you very much in advance!
>>
>> Johannes
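
For anyone hitting the same thing in the meantime, here is a rough sketch of the /jmx polling workaround Johannes describes above. The RegionServer info port (60030 here), the hostnames passed on the command line and the alert thresholds are assumptions to adjust for your own deployment, and the "parsing" is a crude string scan of the /jmx output rather than a proper JSON parse; it just looks for the two metrics named above, queueSize and numCallsInGeneralQueue, and flags servers where they blow up.

// Rough sketch of the /jmx polling workaround described above. Port,
// hostnames and thresholds are assumptions; adjust for your deployment.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RsQueueMonitor {
  // Metric names as they appear in the /jmx JSON output (quoted from the report above).
  private static final Pattern QUEUE_SIZE = Pattern.compile("\"queueSize\"\\s*:\\s*(\\d+)");
  private static final Pattern GENERAL_CALLS = Pattern.compile("\"numCallsInGeneralQueue\"\\s*:\\s*(\\d+)");

  public static void main(String[] args) throws Exception {
    // Pass RegionServer hostnames on the command line, e.g.: java RsQueueMonitor rs1 rs2 rs3
    for (String host : args) {
      String body = fetch("http://" + host + ":60030/jmx"); // 60030 = RS info port; change if yours differs
      long queueSize = firstLong(QUEUE_SIZE, body);
      long generalCalls = firstLong(GENERAL_CALLS, body);
      // Made-up thresholds; the report above saw >60k and ~300 on stuck servers.
      boolean suspect = queueSize > 10000 || generalCalls > 200;
      System.out.printf("%s queueSize=%d numCallsInGeneralQueue=%d%s%n",
          host, queueSize, generalCalls, suspect ? "  <-- looks stuck" : "");
    }
  }

  private static String fetch(String url) throws Exception {
    StringBuilder sb = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream()))) {
      String line;
      while ((line = in.readLine()) != null) sb.append(line).append('\n');
    }
    return sb.toString();
  }

  private static long firstLong(Pattern p, String body) {
    Matcher m = p.matcher(body);
    return m.find() ? Long.parseLong(m.group(1)) : -1;
  }
}

Wire it into whatever restarts the flagged servers, but that is obviously a band-aid, not a fix.

St.Ack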
