Several of those JIRAs are fixed in later versions of CDH. Since the inclusion of JIRAs in a particular vendor's packaging is a vendor-specific issue, please seek help from the vendor (e.g. on the community forum you just mentioned).
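As an aside, the "Perform GC" workaround described below doesn't require VisualVM: `jcmd`, which ships with the JDK, can request a full GC from the command line. A minimal sketch (the pid is a placeholder; find the real one with `jps`):

```shell
# Same effect as VisualVM's "Perform GC" button, from the CLI.
# GC.run asks the target JVM to perform a full GC.
# The pid 12345 is hypothetical; locate the region server's with
# `jps | grep HRegionServer` and uncomment:
#
#   jcmd 12345 GC.run

# Sanity check on the -Xms/-Xmx flags quoted below: they are exact
# byte counts, and 34359738368 bytes works out to 32 GiB.
echo $(( 34359738368 / 1073741824 ))
```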
On Wed, Mar 1, 2017 at 8:49 AM, Saad Mufti <[email protected]> wrote:

> Someone on our team found this:
>
> http://community.cloudera.com/t5/Storage-Random-Access-HDFS/CPU-Usage-high-when-using-G1GC/td-p/48101
>
> Looks like we're bitten by this bug. Unfortunately it is only fixed in
> HBase 1.4.0, so we'll have to undertake a version upgrade, which is not
> trivial.
>
> -----
> Saad
>
>
> On Wed, Mar 1, 2017 at 9:38 AM, Sudhir Babu Pothineni <[email protected]> wrote:
>
>> The first obvious thing to check: is a major compaction happening at
>> the same time the CPU goes to 100%?
>> See if this helps:
>> https://community.hortonworks.com/articles/52616/hbase-compaction-tuning-tips.html
>>
>> Sent from my iPhone
>>
>>> On Mar 1, 2017, at 6:06 AM, Saad Mufti <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on
>>> HBase is heavy, with a mix of reads and writes. For a few months we
>>> have had a problem where occasionally (once a day or more) one of the
>>> region servers starts consuming close to 100% CPU. This fills the
>>> entire client thread pool with calls to the slow region server,
>>> causing overall response times to slow to a crawl; many calls then
>>> time out either in the client or at a higher level.
>>>
>>> We have done lots of analysis and looked at various metrics but could
>>> never pin it down to any particular kind of traffic or specific "hot
>>> keys". Looking at the region server logs has not turned up any
>>> findings. The only vague evidence we have is from the reported
>>> metrics: reads per second on the hot server look higher than on the
>>> others, in a spiky but sustained fashion, while gets per second look
>>> no different from any other server.
>>>
>>> Until now, our hacky workaround has been to simply restart the region
>>> server. This works because, while some calls error out while the
>>> regions are in transition, this is a batch-oriented system with a
>>> retry strategy built in.
>>>
>>> But just yesterday we discovered something interesting: if we connect
>>> to the region server in VisualVM and press the "Perform GC" button,
>>> there is a brief pause and then CPU settles back to normal. This is
>>> despite the fact that memory appears to be under no pressure, and
>>> before we do this VisualVM indicates a very low percentage of CPU
>>> time spent in GC. So we're baffled, and hoping someone with deeper
>>> insight into the HBase code can explain this behavior.
>>>
>>> Our region server processes are configured with 32 GB of RAM and the
>>> following GC-related JVM settings:
>>>
>>> HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC
>>> -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB
>>> -XX:ParallelGCThreads=14 -XX:InitiatingHeapOccupancyPercent=70
>>>
>>> Any insight anyone can provide would be most appreciated.
>>>
>>> ----
>>> Saad
