Hi,

We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on HBase is heavy and a mix of reads and writes. For a few months we have had a problem where, occasionally (once a day or more), one of the region servers starts consuming close to 100% CPU. The entire client thread pool then fills up with requests to the slow region server, overall response times slow to a crawl, and many calls start timing out, either in the client itself or at a higher level.
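For anyone wanting to look at this kind of symptom themselves, one standard way (not something from the original report, just a sketch) to see which region server threads are burning the CPU is to combine top's per-thread view with a jstack dump; the pid and thread id below are placeholders:

```shell
# Hypothetical region-server pid; set RS_PID to the real pid (e.g. from jps)
# before running against a live server. Left empty here so the live commands
# are skipped.
RS_PID="${RS_PID:-}"
if [ -n "$RS_PID" ]; then
  # Per-thread CPU usage for the region server process (-H shows threads).
  top -b -H -n 1 -p "$RS_PID"
  # Full thread dump; each thread carries its native id as a hex "nid=" field.
  jstack "$RS_PID" > /tmp/rs-threads.txt
fi
# The decimal thread id reported by top maps to jstack's hex nid like this:
tid=4242
nid=$(printf 'nid=0x%x' "$tid")
echo "$nid"
```

Matching the busiest native thread ids from top against the `nid=` fields in the jstack output shows which Java threads (RPC handlers, compactions, GC workers, etc.) are responsible for the CPU burn.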
We have done a lot of analysis and looked at various metrics, but have never been able to pin the problem down to any particular kind of traffic or to specific "hot keys", and looking at the region server logs has turned up nothing. The only vague evidence we have is in the reported metrics: reads per second on the hot server look higher than on the other servers, in a spiky but sustained fashion, while gets per second look no different from any other server.

Until now our (admittedly hacky) workaround has been to simply restart the region server. This works because, while some calls error out while the regions are in transition, this is a batch-oriented system with a retry strategy built in.

Just yesterday, though, we discovered something interesting: if we connect to the region server with VisualVM and press the "Perform GC" button, there is a brief pause and then CPU settles back down to normal. This is despite the fact that memory appears to be under no pressure, and before we trigger the collection VisualVM indicates a very low percentage of CPU time spent in GC. We are baffled, and hope someone with deeper insight into the HBase code can explain this behavior.

Our region server processes are configured with a 32 GB heap and the following GC-related JVM settings:

HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=14 -XX:InitiatingHeapOccupancyPercent=70

Any insight anyone can provide would be most appreciated.

----
Saad
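As an aside for readers: the "Perform GC" button in VisualVM can be reproduced without attaching a GUI, using the JDK's jcmd tool, which makes the workaround scriptable. This is a sketch, not part of the original report; the pid is a placeholder, and the arithmetic just confirms that the -Xms/-Xmx value above is exactly 32 GiB:

```shell
# Hypothetical region-server pid; set RS_PID to the real pid before use.
# Left empty here so jcmd is not actually invoked.
RS_PID="${RS_PID:-}"
if [ -n "$RS_PID" ]; then
  # Request a full GC, the same effect as VisualVM's "Perform GC" button.
  jcmd "$RS_PID" GC.run
fi
# Sanity check: the -Xms/-Xmx value from the settings above is 32 GiB.
heap_bytes=34359738368
heap_gib=$(( heap_bytes / 1024 / 1024 / 1024 ))
echo "heap = ${heap_gib} GiB"
```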
