Are you replicating?

St.Ack

On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback <[email protected]> wrote:

> Dear HBase-Pros,
>
> we have been facing a serious issue with our HBase production cluster for
> two days now. Every couple of minutes, a random RegionServer gets stuck and
> does not process any requests. In addition, this causes the other
> RegionServers to freeze within a minute, which brings down the entire
> cluster. Stopping the affected RegionServer unblocks the cluster and
> everything comes back to normal.
>
> We run 27 RegionServers, each having 31 GB JVM memory. The HBase version is
> 0.98.5 on Hadoop 2.4.1. We basically have two tables, the first having
> about 4,500 Regions and holding 8 TB with 1000 requests per second; the
> second table is around 200 Regions with about 50,000 to 120,000 requests
> per sec over all Regions, 800 GB worth of data and with IN_MEMORY enabled.
>
> While investigating the problem, I found out that every healthy
> RegionServer has the following thread:
>
> Thread 12 (RpcServer.listener,port=60020):
>   State: RUNNABLE
>   Blocked count: 35
>   Waited count: 0
>   Stack:
>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:684)
>
> When a RegionServer suddenly becomes blocked, this particular thread then
> looks like this:
>
> Thread 12 (RpcServer.listener,port=60020):
>   State: BLOCKED
>   Blocked count: 2889
>   Waited count: 0
>   Blocked on org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@38cba1a1
>   Blocked by 14 (RpcServer.reader=1,port=60020)
>   Stack:
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.registerChannel(RpcServer.java:619)
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(RpcServer.java:774)
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:692)
>
> Also, JMX shows for an unhealthy RegionServer that
>
> - "queueSize" grows quickly and constantly to values greater than 60k, and
> - "numCallsInGeneralQueue" quickly reaches 300
>
> Both values are usually very small or 0 under normal circumstances, but when
> a RS "gets stuck" they explode, which leads me to believe that the IPC queue
> does not get processed properly, causing the RegionServer to become "deaf".
>
> These two symptoms appear to bring down the entire cluster. When killing
> that RS, everything goes back to normal.
>
> I could not find any correlation between this phenomenon and compactions,
> load or other factors. hbck says it is all fine as well.
>
> The servers are all 3.2.0-4-amd64 Debian, 12 cores, 96 GB RAM. Besides the
> RS and a DataNode, there isn't too much running on the boxes, so the load
> (top) is usually around 5 to 10 and bandwidth does not exceed 10 MB on
> average.
>
> We currently survive by constantly polling /jmx of all RegionServers and
> restarting those that show the symptoms :(
>
> Do you have any idea what could be causing this?
>
> Thank you very much in advance!
>
> Johannes
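
For readers who want to reproduce the /jmx polling workaround described in the quoted message, here is a minimal sketch. It is not the poster's actual script. It assumes the RegionServer info server returns the usual JSON document with a top-level "beans" list, and it looks up the two attributes named above ("queueSize" and "numCallsInGeneralQueue") in whichever bean carries them rather than assuming a bean name. The port 60030 and both thresholds are assumptions chosen to match the numbers reported above, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Hypothetical thresholds, taken from the values reported in the message above.
QUEUE_SIZE_LIMIT = 60_000
GENERAL_QUEUE_CALLS_LIMIT = 300


def find_metric(jmx, attribute):
    """Return the first value of `attribute` found in any JMX bean, else None."""
    for bean in jmx.get("beans", []):
        if attribute in bean:
            return bean[attribute]
    return None


def looks_stuck(jmx):
    """True if either IPC queue metric has reached its (assumed) limit."""
    queue_size = find_metric(jmx, "queueSize") or 0
    general_calls = find_metric(jmx, "numCallsInGeneralQueue") or 0
    return (queue_size > QUEUE_SIZE_LIMIT
            or general_calls >= GENERAL_QUEUE_CALLS_LIMIT)


def fetch_jmx(host, port=60030, timeout=5):
    """Fetch and parse /jmx from one RegionServer (port is an assumption)."""
    with urlopen(f"http://{host}:{port}/jmx", timeout=timeout) as resp:
        return json.load(resp)
```

A monitoring loop would call `looks_stuck(fetch_jmx(host))` for each RegionServer and flag the hosts that return True. What to do with a flagged host (alert, restart) is left out on purpose.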
