You could try the CMS GC; thread-local collection seems less prominent when CMS is in place. (Try thread dumping and comparing against the thread dumps posted to HBASE-17072 and related issues; the original poster did a nice job describing the problem.)
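A minimal sketch of what trying CMS could look like, assuming an hbase-env.sh style setup. The heap size is copied from the settings quoted later in this thread; the CMS flag values and the PID lookup are illustrative assumptions, not recommendations from the thread itself:

```shell
# Hypothetical CMS-based alternative to the G1 settings discussed in this
# thread (hbase-env.sh). Flag values here are illustrative assumptions.
export HBASE_REGIONSERVER_OPTS="-Xms34359738368 -Xmx34359738368 \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+ParallelRefProcEnabled"

# To compare thread dumps against those posted to HBASE-17072, capture a few
# dumps from the hot region server (assumes HRegionServer shows up in jps):
# jstack "$(jps | awk '/HRegionServer/ {print $1}')" > /tmp/rs-threads.txt
```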
St.Ack

On Wed, Mar 1, 2017 at 2:49 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> Someone on our team found this:
>
> http://community.cloudera.com/t5/Storage-Random-Access-HDFS/CPU-Usage-high-when-using-G1GC/td-p/48101
>
> It looks like we're bitten by this bug. Unfortunately it is only fixed in
> HBase 1.4.0, so we'll have to undertake a version upgrade, which is not
> trivial.
>
> -----
> Saad
>
> On Wed, Mar 1, 2017 at 9:38 AM, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:
> > The first obvious thing to check: is a major compaction happening at the
> > same time the CPU goes to 100%?
> > See if this helps:
> > https://community.hortonworks.com/articles/52616/hbase-compaction-tuning-tips.html
> >
> > Sent from my iPhone
> >
> > > On Mar 1, 2017, at 6:06 AM, Saad Mufti <saad.mu...@teamaol.com> wrote:
> > >
> > > Hi,
> > >
> > > We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on
> > > HBase is heavy, with a mix of reads and writes. For a few months we have
> > > had a problem where occasionally (once a day or more) one of the region
> > > servers starts consuming close to 100% CPU. This causes the whole client
> > > thread pool to fill up serving the slow region server, so overall
> > > response times slow to a crawl and many calls start timing out, either
> > > in the client or at a higher level.
> > >
> > > We have done a lot of analysis and looked at various metrics, but could
> > > never pin it down to any particular kind of traffic or specific "hot
> > > keys". Looking at the region server logs has not produced any findings.
> > > The only vague evidence we have is that, in the reported metrics, reads
> > > per second on the hot server look higher than on the others, in a spiky
> > > but sustained fashion, while gets per second look no different from any
> > > other server.
> > >
> > > Until now, our hacky workaround has been to simply restart the region
> > > server. This works because, while some calls error out while the regions
> > > are in transition, this is a batch-oriented system with a retry strategy
> > > built in.
> > >
> > > But just yesterday we discovered something interesting: if we connect to
> > > the region server in VisualVM and press the "Perform GC" button, there
> > > is a brief pause and then CPU settles back to normal. This is despite
> > > the fact that memory appears to be under no pressure, and before we do
> > > this, VisualVM indicates a very low percentage of CPU time spent in GC.
> > > So we are baffled, and hoping someone with deeper insight into the HBase
> > > code can explain this behavior.
> > >
> > > Our region server processes are configured with 32 GB of RAM and the
> > > following GC-related JVM settings:
> > >
> > > HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC
> > > -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB
> > > -XX:ParallelGCThreads=14 -XX:InitiatingHeapOccupancyPercent=70
> > >
> > > Any insight anyone can provide would be most appreciated.
> > >
> > > ----
> > > Saad
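For what it's worth, the VisualVM "Perform GC" button simply requests a full collection from the target JVM; the same thing can be scripted against the region server with the JDK's jcmd tool. A hedged sketch (the PID lookup assumes HRegionServer appears in jps output on the host):

```shell
# Trigger the same System.gc() request that VisualVM's "Perform GC" button
# sends, without attaching a GUI to the region server.
RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
jcmd "$RS_PID" GC.run
```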