Sometimes the GC can chain multiple medium pauses into one large pause. I've seen this before: two long pauses ran back to back and the result was a pause of more than 50 seconds.
This article talks a lot about GC performance and tuning, check it out:
http://wiki.apache.org/hadoop/PerformanceTuning

-ryan

On Fri, Jul 16, 2010 at 11:55 AM, Jinsong Hu <[email protected]> wrote:
> Yes, the root cause seems to be the gap of 4 minutes between
> 2010-07-16 05:49:26,805 and 2010-07-16 05:53:23,476 . but I checked GC
> gc-hbase.log and don't see 4 minute gap in gc. I just wonder what could
> cause this large gap.
>
> I also wonder if there is a configuration that I can do to avoid this long
> pause, or get around the problem cause by this long pause.
>
> Jimmy
>
> --------------------------------------------------
> From: "Stack" <[email protected]>
> Sent: Friday, July 16, 2010 11:44 AM
> To: <[email protected]>
> Subject: Re: YouAreDeadException with hbase
>
>> You'll see this if the server reports to the master after the master
>> has ruled it 'dead'.
>>
>> Here is the code that produces the exception:
>>
>>   if (!isDead(serverName)) return;
>>   String message = "Server " + what + " rejected; currently processing " +
>>     serverName + " as dead server";
>>   LOG.debug(message);
>>   throw new YouAreDeadException(message);
>>
>> Servers are on the 'dead' list if zk reports their session has
>> expired. The master moves then to cleanup after the dead server and
>> process its logs. If during this cleanup time the server reports in,
>> master will return the youaredead exception.
>>
>> Usually the RS has lost its zk session but has yet to realize it.
>>
>> St.Ack
>>
>> On Thu, Jul 15, 2010 at 11:52 PM, Jinsong Hu <[email protected]>
>> wrote:
>>>
>>> Hi, There:
>>> I got some YouAreDeadException with hbase. what can cause it ? I do
>>> notice between 5:49 to 5:53 , for 4 minutes, there is no log. This
>>> doesn't look like GC issue as I checked the GC log, the longest GC
>>> is only 9.6 seconds.
>>>
>>> Jimmy.
>>>
>>>
>>> 2010-07-16 05:49:26,805 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>>>   Cache Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB (425113472),
>>>   Max=408.775MB (428631648), Counts: Blocks=1, Access=2178914, Hit=1034,
>>>   Miss=2177880, Evictions=0, Evicted=0, Ratios:
>>>   Hit Ratio=0.04745483165606856%, Miss Ratio=99.95254278182983%,
>>>   Evicted/Run=NaN
>>> 2010-07-16 05:53:23,476 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>>>   Cache Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB (425113472),
>>>   Max=408.775MB (428631648), Counts: Blocks=1, Access=2178915, Hit=1035,
>>>   Miss=2177880, Evictions=0, Evicted=0, Ratios:
>>>   Hit Ratio=0.04750070511363447%, Miss Ratio=99.95250105857849%,
>>>   Evicted/Run=NaN
>>>
>>> ....
>>> 2010-07-16 05:53:26,171 INFO org.apache.zookeeper.ClientCnxn: Client session
>>>   timed out, have not heard from server in 240540ms for sessionid
>>>   0x329c88039b0006c, closing socket connection and attempting reconnect
>>> 2010-07-16 05:53:27,333 INFO org.apache.zookeeper.ClientCnxn: Opening socket
>>>   connection to server t-zookeeper2.cloud.ppops.net/10.110.24.57:2181
>>> 2010-07-16 05:53:27,334 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>   connection established to t-zookeeper2.cloud.ppops.net/10.110.24.57:2181,
>>>   initiating session
>>> 2010-07-16 05:53:27,335 INFO org.apache.zookeeper.ClientCnxn: Unable to
>>>   reconnect to ZooKeeper service, session 0x329c88039b0006c has expired,
>>>   closing socket connection
>>> 2010-07-16 05:53:27,896 INFO org.apache.zookeeper.ClientCnxn: Client session
>>>   timed out, have not heard from server in 240520ms for sessionid
>>>   0x129c87a7f98007a, closing socket connection and attempting reconnect
>>>
>>>
>>> 2010-07-16 05:53:39,090 FATAL
>>>   org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server
>>>   serverName=m0002028.ppops.net,60020,1279237223465,
>>>   load=(requests=952, regions=21, usedHeap=575, maxHeap=2043): Unhandled
>>>   exception
>>> org.apache.hadoop.hbase.YouAreDeadException:
>>>   org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>>   currently processing m0002028.ppops.net,60020,1279237223465 as dead server
>>>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
>>>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
>>>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:684)
>>>     at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
>>>
>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>
>>
>
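
The thread leaves open what stalled the region server for about four minutes when the GC log shows nothing longer than 9.6 seconds. One way to gather evidence is a small watchdog that sleeps for a short, fixed interval and checks how late it wakes up; a large overshoot suggests the whole JVM was frozen, whether by GC or by something outside the collector. The sketch below is only an illustration: the class PauseDetector, its methods, and the 100 ms / 1000 ms values are hypothetical and are not part of HBase or ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an HBase class): reports stalls of the whole JVM. */
public class PauseDetector {

    /** How many milliseconds longer than requested a sleep took. */
    static long overshootMs(long startNanos, long endNanos, long sleepMs) {
        long elapsedMs = (endNanos - startNanos) / 1_000_000L;
        return elapsedMs - sleepMs;
    }

    /**
     * Sleeps sleepMs at a time for the given number of iterations and
     * returns every overshoot that was larger than warnMs.
     */
    static List<Long> detect(long sleepMs, long warnMs, int iterations)
            throws InterruptedException {
        List<Long> pauses = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMs);
            long extraMs = overshootMs(start, System.nanoTime(), sleepMs);
            if (extraMs > warnMs) {
                pauses.add(extraMs);
                System.err.println("JVM appears to have stalled for ~" + extraMs + " ms");
            }
        }
        return pauses;
    }

    public static void main(String[] args) throws InterruptedException {
        // Short bounded demo; a real server would run the loop in a daemon
        // thread for the lifetime of the process (assumed values below).
        List<Long> pauses = detect(100, 1000, 5);
        System.out.println("pauses over threshold: " + pauses.size());
    }
}
```

For scale, the two LruBlockCache lines in the log above are 236,671 ms apart. If the entire JVM had been frozen for that window, a watchdog like this with a 500 ms sleep would report an overshoot of roughly 236,171 ms, which could then be compared against the GC log to see how much of the stall the collector accounts for.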
