The first obvious thing to check: is a major compaction happening at the same time the CPU goes to 100%? See if this helps: https://community.hortonworks.com/articles/52616/hbase-compaction-tuning-tips.html

Sent from my iPhone

> On Mar 1, 2017, at 6:06 AM, Saad Mufti <[email protected]> wrote:
>
> Hi,
>
> We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on HBase
> is heavy and a mix of reads and writes. For a few months we have had a
> problem where occasionally (once a day or more) one of the region servers
> starts consuming close to 100% CPU. This causes all the client thread pool
> to get filled up serving the slow region server, causing overall response
> times to slow to a crawl and many calls either start timing out right in
> the client, or at a higher level.
>
> We have done lots of analysis and looked at various metrics but could never
> pin it down to any particular kind of traffic or specific "hot keys".
> Looking at region server logs has not resulted in any findings. The only
> sort of vague evidence we have is that from the reported metrics, reads per
> second on the hot server looks more than the other but not in a steady
> state but in a spiky but steady fashion, but gets per second looks no
> different than any other server.
>
> Until now our hacky way that we discovered to get around this was to just
> restart the region server. This works because while some calls error out
> while the regions are in transition, this is a batch oriented system with a
> retry strategy built in.
>
> But just yesterday we discovered something interesting, if we connect to
> the region server in VisualVM and press the "Perform GC" button, there
> seems to be a brief pause and then CPU settles down back to normal. This is
> despite the fact that memory appears to be under no pressure and before we
> do this, VisualVM indicates very low percentage of CPU time spent in GC, so
> we're baffled, and hoping someone with deeper insight into the HBase code
> could explain this behavior.
>
> Our region server processes are configured with 32GB of RAM and the
> following GC related JVM settings :
>
> HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC
> -XX:MaxGCPauseMillis=100
> -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=14
> -XX:InitiatingHeapOccupancyPercent=70
>
> Any insight anyone can provide would be most appreciated.
>
> ----
> Saad
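
As a quick sanity check on the heap settings quoted above, here is a minimal sketch that converts the -Xms/-Xmx byte counts to GiB and confirms that the minimum and maximum heap are equal. Only the flag values come from the message; the helper name heap_bytes is hypothetical, and it assumes the sizes are plain byte counts with no k/m/g suffix.

```python
import re

# JVM options copied from the quoted message.
OPTS = (
    "-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC "
    "-XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB "
    "-XX:ParallelGCThreads=14 -XX:InitiatingHeapOccupancyPercent=70"
)

GIB = 1024 ** 3  # bytes per GiB


def heap_bytes(opts: str, flag: str) -> int:
    """Return the byte count passed to -Xms or -Xmx.

    Hypothetical helper: assumes a plain integer with no size suffix.
    """
    match = re.search(rf"-{flag}(\d+)(?=\s|$)", opts)
    if match is None:
        raise ValueError(f"-{flag} with a plain byte count not found")
    return int(match.group(1))


xms = heap_bytes(OPTS, "Xms")
xmx = heap_bytes(OPTS, "Xmx")

# Print both sizes in GiB and whether the heap is fixed-size.
print(xms / GIB, xmx / GIB, xms == xmx)
```

With the values from the message, both flags resolve to the same number of bytes, so the heap is fixed-size, and the figure agrees with the "32GB" stated in the text when read as GiB.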
