CPU resources never was a problem, munin shows there is enough idle time.
Although the graphs might exclude sharp peaks, I'm pretty certain CPU is
not a problem.

After some experiments it seems that disabling swap indeed seems to be the
solution to the aborting regionserver problem. I reverted the GC settings
to be more 'default'.  Currently that is "-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC". Overcommit is set to 50%. The consequence of this is
that the memory policy should be strictly enforced, to prevent important
processes being killed by the OOM-killer.

Ferdy.



On Fri, Mar 9, 2012 at 5:30 AM, Laxman <[email protected]> wrote:

> Hi Ferdy,
>
> > > I'm running regionservers with 2GB heap and following tuning options:
> > > -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewRatio=16
> > > -XX:CMSInitiatingOccupancyFraction=70 -
> > XX:+UseCMSInitiatingOccupancyOnly
> > > -XX:MaxGCPauseMillis=100
>
> GC with huge heaps may take longer GC pauses.
>
> But in your case, heap is just 2GB. Doesn't look like a general problem to
> me.
>
> [Times: user=1.82 sys=0.31, real=51.38 secs]
>
> User - 1.82 seconds - Excluding context switch
> Real - 51.38 seconds - Including context switch
>
> Based on these numbers, it looks to me like a problem with CPU
> underprovisioning.
>
> So, its worth investigating the following
>
> - What is CPU usage at that point of time?
> - Are there any processes consuming all your CPU resources?
>
> Alternatively, you can also try increasing the zookeeper session timeout.
> --
> Regards,
> Laxman
>
> > -----Original Message-----
> > From: Gaojinchao [mailto:[email protected]]
> > Sent: Friday, March 09, 2012 8:29 AM
> > To: [email protected]
> > Subject: re: gc pause killing regionserver
> >
> > We encountered the same thing. we set swappiness priority to 0. But
> > swap is still working.
> > So we disable swap.
> >
> > -----邮件原件-----
> > 发件人: [email protected] [mailto:[email protected]] 代表 Jean-Daniel
> > Cryans
> > 发送时间: 2012年3月9日 6:29
> > 收件人: [email protected]
> > 主题: Re: gc pause killing regionserver
> >
> > When real cpu is bigger than user cpu it very often points to
> > swapping. Even if you think you turned that off or that there's no
> > possible way you could be swapping, check it again.
> >
> > I could also be that your CPUs were busy doing something else, I've
> > seen crazy context switching CPUs freezing up my nodes, but in my
> > experience it's not very likely.
> >
> > Setting swappiness to 0 just means it's not going to page anything out
> > until it really needs to do it, meaning it's possible to swap. The
> > only way to guarantee no swapping whatsoever is giving your system 0
> > swap space.
> >
> > Regarding that promotion failure, you could try reducing the eden
> > size. Try -Xmn128m
> >
> > J-D
> >
> > On Sat, Mar 3, 2012 at 5:05 AM, Ferdy Galema <[email protected]>
> > wrote:
> > > Hi,
> > >
> > > I'm running regionservers with 2GB heap and following tuning options:
> > > -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewRatio=16
> > > -XX:CMSInitiatingOccupancyFraction=70 -
> > XX:+UseCMSInitiatingOccupancyOnly
> > > -XX:MaxGCPauseMillis=100
> > >
> > > A regionserver aborted (YouAreDeadException) and this was printed in
> > the gc
> > > logs (all is shown up until the abort)
> > >
> > > 211663.516: [GC 211663.516: [ParNew: 118715K->13184K(118912K),
> > 0.0445390
> > > secs] 1373940K->1289814K(2233472K), 0.0446420 secs] [Times: user=0.14
> > > sys=0.01, real=0.05 secs]
> > > 211663.686: [GC 211663.686: [ParNew: 118912K->13184K(118912K),
> > 0.0594280
> > > secs] 1395542K->1310185K(2233472K), 0.0595420 secs] [Times: user=0.15
> > > sys=0.00, real=0.06 secs]
> > > 211663.869: [GC 211663.869: [ParNew: 118790K->13184K(118912K),
> > 0.0434820
> > > secs] 1415792K->1331317K(2233472K), 0.0435930 secs] [Times: user=0.13
> > > sys=0.01, real=0.04 secs]
> > > 211667.598: [GC 211667.598: [ParNew (promotion failed):
> > > 118912K->118912K(118912K), 0.0225390 secs]211667.621: [CMS:
> > > 1330845K->1127914K(2114560K), 51.3610670 secs]
> > > 1437045K->1127914K(2233472K), [CMS Perm : 20680K->20622K(34504K)],
> > > 51.3838170 secs] [Times: user=1.82 sys=0.31, real=51.38 secs]
> > > 211719.713: [GC 211719.714: [ParNew: 105723K->13184K(118912K),
> > 0.0176130
> > > secs] 1233638K->1149393K(2233472K), 0.0177230 secs] [Times: user=0.07
> > > sys=0.00, real=0.02 secs]
> > > 211719.851: [GC 211719.852: [ParNew: 118912K->13184K(118912K),
> > 0.0281860
> > > secs] 1255121K->1170269K(2233472K), 0.0282970 secs] [Times: user=0.10
> > > sys=0.01, real=0.03 secs]
> > > 211719.993: [GC 211719.993: [ParNew: 118795K->13184K(118912K),
> > 0.0276320
> > > secs] 1275880K->1191268K(2233472K), 0.0277350 secs] [Times: user=0.09
> > > sys=0.00, real=0.03 secs]
> > > 211720.490: [GC 211720.490: [ParNew: 118912K->13184K(118912K),
> > 0.0624650
> > > secs] 1296996K->1210640K(2233472K), 0.0625560 secs] [Times: user=0.15
> > > sys=0.00, real=0.06 secs]
> > > 211720.687: [GC 211720.687: [ParNew: 118702K->13184K(118912K),
> > 0.1651750
> > > secs] 1316159K->1231993K(2233472K), 0.1652660 secs] [Times: user=0.25
> > > sys=0.01, real=0.17 secs]
> > > 211721.038: [GC 211721.038: [ParNew: 118912K->13184K(118912K),
> > 0.0952750
> > > secs] 1337721K->1252598K(2233472K), 0.0953660 secs] [Times: user=0.15
> > > sys=0.00, real=0.09 secs]
> > > Heap
> > >  par new generation  total 118912K, used 86199K [0x00002aaaae1f0000,
> > > 0x00002aaab62f0000, 0x00002aaab62f0000)
> > >  eden space 105728K,  69% used [0x00002aaaae1f0000,
> > 0x00002aaab293dfa8,
> > > 0x00002aaab4930000)
> > >  from space 13184K, 100% used [0x00002aaab4930000,
> > 0x00002aaab5610000,
> > > 0x00002aaab5610000)
> > >  to  space 13184K,  0% used [0x00002aaab5610000, 0x00002aaab5610000,
> > > 0x00002aaab62f0000)
> > >  concurrent mark-sweep generation total 2114560K, used 1239414K
> > > [0x00002aaab62f0000, 0x00002aab373f0000, 0x00002aab373f0000)
> > >  concurrent-mark-sweep perm gen total 34504K, used 20728K
> > > [0x00002aab373f0000, 0x00002aab395a2000, 0x00002aab3c7f0000)
> > >
> > >
> > > Why did a GC took 51 seconds? The machine still had enough memory
> > available
> > > so it could not be swapping. (swapiness is set to 0). From the 15
> > > regionservers in total, I often see this specific regionserver fail.
> > What
> > > do you recommended in this situation?
> > >
> > > Ferdy.
>
>

Reply via email to