Hi,

We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on HBase is heavy and a mix of reads and writes. For a few months we have had a problem where, occasionally (once a day or more), one of the region servers starts consuming close to 100% CPU. The entire client thread pool then fills up with requests to the slow region server, overall response times slow to a crawl, and many calls start timing out, either in the client itself or at a higher level.
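For anyone wanting to look at this kind of symptom themselves, one standard way (not something from the original report, just a sketch) to see which region server threads are burning the CPU is to combine top's per-thread view with a jstack dump; the pid and thread id below are placeholders:

```shell
# Hypothetical region-server pid; set RS_PID to the real pid (e.g. from jps)
# before running against a live server. Left empty here so the live commands
# are skipped.
RS_PID="${RS_PID:-}"
if [ -n "$RS_PID" ]; then
  # Per-thread CPU usage for the region server process (-H shows threads).
  top -b -H -n 1 -p "$RS_PID"
  # Full thread dump; each thread carries its native id as a hex "nid=" field.
  jstack "$RS_PID" > /tmp/rs-threads.txt
fi
# The decimal thread id reported by top maps to jstack's hex nid like this:
tid=4242
nid=$(printf 'nid=0x%x' "$tid")
echo "$nid"
```

Matching the busiest native thread ids from top against the `nid=` fields in the jstack output shows which Java threads (RPC handlers, compactions, GC workers, etc.) are responsible for the CPU burn.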
We have done a lot of analysis and looked at various metrics, but have never been able to pin the problem down to any particular kind of traffic or to specific "hot keys", and looking at the region server logs has turned up nothing. The only vague evidence we have is in the reported metrics: reads per second on the hot server look higher than on the other servers, in a spiky but sustained fashion, while gets per second look no different from any other server.

Until now our (admittedly hacky) workaround has been to simply restart the region server. This works because, while some calls error out while the regions are in transition, this is a batch-oriented system with a retry strategy built in.

Just yesterday, though, we discovered something interesting: if we connect to the region server with VisualVM and press the "Perform GC" button, there is a brief pause and then CPU settles back down to normal. This is despite the fact that memory appears to be under no pressure, and before we trigger the collection VisualVM indicates a very low percentage of CPU time spent in GC. We are baffled, and hope someone with deeper insight into the HBase code can explain this behavior.

Our region server processes are configured with a 32 GB heap and the following GC-related JVM settings:

HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=14 -XX:InitiatingHeapOccupancyPercent=70

Any insight anyone can provide would be most appreciated.

----
Saad
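As an aside for readers: the "Perform GC" button in VisualVM can be reproduced without attaching a GUI, using the JDK's jcmd tool, which makes the workaround scriptable. This is a sketch, not part of the original report; the pid is a placeholder, and the arithmetic just confirms that the -Xms/-Xmx value above is exactly 32 GiB:

```shell
# Hypothetical region-server pid; set RS_PID to the real pid before use.
# Left empty here so jcmd is not actually invoked.
RS_PID="${RS_PID:-}"
if [ -n "$RS_PID" ]; then
  # Request a full GC, the same effect as VisualVM's "Perform GC" button.
  jcmd "$RS_PID" GC.run
fi
# Sanity check: the -Xms/-Xmx value from the settings above is 32 GiB.
heap_bytes=34359738368
heap_gib=$(( heap_bytes / 1024 / 1024 / 1024 ))
echo "heap = ${heap_gib} GiB"
```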
