One bit of advice that is easy to overlook but important, from our wiki about performance tuning:

"This can be due to the size of the ParNew, which should be relatively small. If your ParNew is very large after running HBase for a while (in one example a ParNew was about 150MB), then you might have to constrain the size of ParNew. Add this to HBASE_OPTS:

export HBASE_OPTS="-XX:NewSize=6m -XX:MaxNewSize=6m <cms options from above> <gc logging options from above>""
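For illustration, a fully filled-in hbase-env.sh line might look like the sketch below. The CMS and GC-logging flags shown are commonly used values, not the exact options the wiki's placeholders refer to, so treat them as assumptions to tune for your own cluster:

# hbase-env.sh -- illustrative sketch only; the CMS and logging flags are
# assumed stand-ins for the wiki's "<cms options>" and "<gc logging options>"
export HBASE_OPTS="-XX:NewSize=6m -XX:MaxNewSize=6m \
  -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:${HBASE_HOME}/logs/gc-hbase.log"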
On Fri, Jul 16, 2010 at 1:50 PM, Jinsong Hu <[email protected]> wrote:
> I was doing stress testing, so the load is not small. But I purposely
> limited the data rate on the client side, so the load is not big either.
> Using "iostat -x 5" I can see there are lots of situations where the CPU
> goes to a very high level and stays there for a long time, but then it
> ultimately goes down. It is highly possible that during a certain period
> the CPU was too busy and the GC process was starved for CPU.
>
> I researched this failure and found an excellent thread talking about GC:
>
> http://forums.sun.com/thread.jspa?threadID=698490
>
> that is more detailed than http://wiki.apache.org/hadoop/PerformanceTuning .
> I will do some tuning following the config posted there and see if it helps.
>
> Jimmy.
>
> --------------------------------------------------
> From: "Ryan Rawson" <[email protected]>
> Sent: Friday, July 16, 2010 1:35 PM
> To: <[email protected]>
> Subject: Re: YouAreDeadException with hbase
>
>> These 2 lines are different GC collections:
>>
>> 5517.355: [GC 5517.355: [ParNew (promotion failed): 18113K->18113K(19136K), 0.7700840 secs]
>> 5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961 secs] [Times: user=95.87 sys=3.06, real=310.97 secs] (concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs] 2023325K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 , 281.0357280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]
>>
>> It's a little hard to read, but it looks like the CMS concurrent mark
>> took 310 seconds and then failed, and then we got a 281-second real-time
>> pause; interestingly enough, the user and system times are fairly low.
>>
>> How loaded are these machines? You need to be giving enough
>> uncontended CPU time to hbase.
>>
>>
>> On Fri, Jul 16, 2010 at 1:30 PM, Jinsong Hu <[email protected]> wrote:
>>>
>>> I already implemented all these configs before my test. I checked
>>> gc-hbase.log and see a GC failure which looks very suspicious:
>>>
>>> 5515.974: [GC 5515.974: [ParNew: 19120K->2112K(19136K), 0.8344240 secs] 2016283K->2007324K(2069308K) icms_dc=100 , 0.8345620 secs] [Times: user=0.08 sys=0.00, real=0.83 secs]
>>>
>>> 5517.355: [GC 5517.355: [ParNew (promotion failed): 18113K->18113K(19136K), 0.7700840 secs]5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961 secs] [Times: user=95.87 sys=3.06, real=310.97 secs] (concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs] 2023325K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 , 281.0357280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]
>>>
>>> 5798.909: [GC [1 CMS-initial-mark: 572240K(2054592K)] 573896K(2092928K), 0.0579220 secs] [Times: user=0.01 sys=0.00, real=0.08 secs]
>>>
>>> The concurrent mode failure, and the 281.05 seconds of GC time, look like
>>> the culprit for the problem. I just wonder how to resolve this issue.
>>>
>>> Jimmy.
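A side note on the concurrent mode failure quoted above: a common mitigation (our suggestion, not something stated in this thread) is to start CMS collections earlier, so the old generation is not nearly full when a concurrent mark begins. The threshold below is an illustrative assumption:

# Illustrative flags only; the 70% threshold is an assumption to tune per cluster
export HBASE_OPTS="$HBASE_OPTS -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"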
>>>
>>> --------------------------------------------------
>>> From: "Ryan Rawson" <[email protected]>
>>> Sent: Friday, July 16, 2010 12:57 PM
>>> To: <[email protected]>
>>> Subject: Re: YouAreDeadException with hbase
>>>
>>>> Sometimes the GC can chain multiple medium pauses into one large
>>>> pause. I've seen this before, where there were 2 long pauses back to
>>>> back and the result was a 50-second+ pause.
>>>>
>>>> This article talks a lot about GC performance and tuning, check it out:
>>>> http://wiki.apache.org/hadoop/PerformanceTuning
>>>>
>>>> -ryan
>>>>
>>>> On Fri, Jul 16, 2010 at 11:55 AM, Jinsong Hu <[email protected]> wrote:
>>>>>
>>>>> Yes, the root cause seems to be the gap of 4 minutes between
>>>>> 2010-07-16 05:49:26,805 and 2010-07-16 05:53:23,476, but I checked
>>>>> gc-hbase.log and don't see a 4-minute gap in GC. I just wonder what
>>>>> could cause this large gap.
>>>>>
>>>>> I also wonder if there is a configuration I can set to avoid this
>>>>> long pause, or to get around the problem caused by this long pause.
>>>>>
>>>>> Jimmy
>>>>>
>>>>> --------------------------------------------------
>>>>> From: "Stack" <[email protected]>
>>>>> Sent: Friday, July 16, 2010 11:44 AM
>>>>> To: <[email protected]>
>>>>> Subject: Re: YouAreDeadException with hbase
>>>>>
>>>>>> You'll see this if the server reports to the master after the master
>>>>>> has ruled it 'dead'.
>>>>>>
>>>>>> Here is the code that produces the exception:
>>>>>>
>>>>>> if (!isDead(serverName)) return;
>>>>>> String message = "Server " + what + " rejected; currently processing "
>>>>>>     + serverName + " as dead server";
>>>>>> LOG.debug(message);
>>>>>> throw new YouAreDeadException(message);
>>>>>>
>>>>>> Servers are on the 'dead' list if zk reports their session has
>>>>>> expired. The master then moves to clean up after the dead server and
>>>>>> process its logs. If during this cleanup time the server reports in,
>>>>>> the master will return the YouAreDeadException.
>>>>>>
>>>>>> Usually the RS has lost its zk session but has yet to realize it.
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> On Thu, Jul 15, 2010 at 11:52 PM, Jinsong Hu <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi, there:
>>>>>>> I got some YouAreDeadExceptions with hbase. What can cause it? I do
>>>>>>> notice that between 5:49 and 5:53, for 4 minutes, there is no log.
>>>>>>> This doesn't look like a GC issue, as I checked the GC log and the
>>>>>>> longest GC is only 9.6 seconds.
>>>>>>>
>>>>>>> Jimmy.
>>>>>>>
>>>>>>>
>>>>>>> 2010-07-16 05:49:26,805 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB (425113472), Max=408.775MB (428631648), Counts: Blocks=1, Access=2178914, Hit=1034, Miss=2177880, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04745483165606856%, Miss Ratio=99.95254278182983%, Evicted/Run=NaN
>>>>>>> 2010-07-16 05:53:23,476 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB (425113472), Max=408.775MB (428631648), Counts: Blocks=1, Access=2178915, Hit=1035, Miss=2177880, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04750070511363447%, Miss Ratio=99.95250105857849%, Evicted/Run=NaN
>>>>>>>
>>>>>>> ....
>>>>>>> 2010-07-16 05:53:26,171 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 240540ms for sessionid 0x329c88039b0006c, closing socket connection and attempting reconnect
>>>>>>> 2010-07-16 05:53:27,333 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server t-zookeeper2.cloud.ppops.net/10.110.24.57:2181
>>>>>>> 2010-07-16 05:53:27,334 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to t-zookeeper2.cloud.ppops.net/10.110.24.57:2181, initiating session
>>>>>>> 2010-07-16 05:53:27,335 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x329c88039b0006c has expired, closing socket connection
>>>>>>> 2010-07-16 05:53:27,896 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 240520ms for sessionid 0x129c87a7f98007a, closing socket connection and attempting reconnect
>>>>>>>
>>>>>>>
>>>>>>> 2010-07-16 05:53:39,090 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server serverName=m0002028.ppops.net,60020,1279237223465, load=(requests=952, regions=21, usedHeap=575, maxHeap=2043): Unhandled exception
>>>>>>> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing m0002028.ppops.net,60020,1279237223465 as dead server
>>>>>>>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
>>>>>>>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
>>>>>>>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:684)
>>>>>>>     at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
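Worth noting when reading the log above: the ZooKeeper client had not heard from the server in 240540ms, roughly 240 seconds, which is shorter than the 281-second GC pause found in gc-hbase.log, so the region server went silent for longer than its session timeout allowed, the session expired, and the master declared it dead. If such pauses cannot be fully eliminated, one hedge (our suggestion, not something proposed in this thread) is to raise the session timeout in hbase-site.xml; the value below is illustrative, and the ZooKeeper server's own maxSessionTimeout must also permit it:

<!-- hbase-site.xml: illustrative value only; the ZK server must allow a timeout this large -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value> <!-- 5 minutes, in milliseconds -->
</property>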
