Unfortunately, there is no way for me to tell whether it was swapping at the time. During the period I was watching, I did not see any swapping, but the crash happened at midnight, so right now I really can't tell.

I just put -XX:NewSize=6m -XX:MaxNewSize=6m on one of the regionservers, and I can see that ParNew does indeed seem to be limited to 6M. I will repeat this test later and see if I can reproduce the crash.

Jimmy

--------------------------------------------------
From: "Stack" <[email protected]>
Sent: Friday, July 16, 2010 2:05 PM
To: <[email protected]>
Subject: Re: YouAreDeadException with hbase
Are you swapping?

The article you cite is from 2006, 4 years ago. As is noted at the end of that thread, JVMs change over time; many of the cited configs may no longer exist or function.

St.Ack

On Fri, Jul 16, 2010 at 1:50 PM, Jinsong Hu <[email protected]> wrote:

I was doing stress testing, so the load is not small. But I purposely limited the data rate on the client side, so the load is not big either. Using "iostat -x 5", I can see there are lots of situations where the CPU goes to a very high level and stays there for a long time, but then it ultimately goes down. It is highly possible that during a certain period the CPU was too busy and the GC process was starved for CPU.

I researched this failure and found an excellent thread talking about GC:
http://forums.sun.com/thread.jspa?threadID=698490
that is more detailed than http://wiki.apache.org/hadoop/PerformanceTuning .
I will do some tuning following the config posted there and see if it helps.

Jimmy.

--------------------------------------------------
From: "Ryan Rawson" <[email protected]>
Sent: Friday, July 16, 2010 1:35 PM
To: <[email protected]>
Subject: Re: YouAreDeadException with hbase

These 2 lines are different GC collections:

5517.355: [GC 5517.355: [ParNew (promotion failed): 18113K->18113K(19136K), 0.7700840 secs]
5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961 secs] [Times: user=95.87 sys=3.06, real=310.97 secs] (concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs] 2023325K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 , 281.0357280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]

It's a little hard to read that. It looks like the CMS concurrent mark took 310 seconds then failed, then we got a 281 second real time pause, but interestingly enough the user and system time is fairly low.

How loaded are these machines? You need to be giving enough uncontended CPU time to hbase.
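
As a rough aid for reading GC lines like the ones quoted above, here is a minimal sketch that pulls every "real=... secs" wall-clock value out of a log line and checks it against a threshold. The class name GcRealTimes, its method names, and the 60-second threshold are all hypothetical and not part of HBase or the JVM; the threshold is only a placeholder to be replaced with whatever your own zookeeper.session.timeout allows. Note that a "real=" value on a concurrent phase (such as CMS-concurrent-mark) is elapsed time, not necessarily a stop-the-world pause.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of HBase or the JDK.
final class GcRealTimes {
    // Matches the wall-clock field of a "[Times: user=... sys=..., real=... secs]" entry.
    private static final Pattern REAL = Pattern.compile("real=(\\d+(?:\\.\\d+)?) secs");

    /** Returns every real= value (in seconds) found in one GC log line, in order. */
    static List<Double> realSeconds(String line) {
        List<Double> out = new ArrayList<>();
        Matcher m = REAL.matcher(line);
        while (m.find()) {
            out.add(Double.parseDouble(m.group(1)));
        }
        return out;
    }

    /** True if any real= value in the line is larger than thresholdSeconds. */
    static boolean exceeds(String line, double thresholdSeconds) {
        for (double seconds : realSeconds(line)) {
            if (seconds > thresholdSeconds) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Fragment of the GC line quoted in this thread.
        String line = "[CMS-concurrent-mark: 171.151/310.961 secs] "
                + "[Times: user=95.87 sys=3.06, real=310.97 secs] "
                + "(concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs] "
                + "[Times: user=4.55 sys=4.07, real=281.05 secs]";
        System.out.println(realSeconds(line));
        // 60.0 is an assumed placeholder threshold, not a recommended value.
        System.out.println(exceeds(line, 60.0));
    }
}
```

Running this over each line of gc-hbase.log would be one way to spot entries like the 281.05 second one without reading the raw log by eye.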

On Fri, Jul 16, 2010 at 1:30 PM, Jinsong Hu <[email protected]> wrote:

I already implemented all these configs before my test. I checked gc-hbase.log and I see a GC failure which looks very suspicious:

5515.974: [GC 5515.974: [ParNew: 19120K->2112K(19136K), 0.8344240 secs] 2016283K->2007324K(2069308K) icms_dc=100 , 0.8345620 secs] [Times: user=0.08 sys=0.00, real=0.83 secs]
5517.355: [GC 5517.355: [ParNew (promotion failed): 18113K->18113K(19136K), 0.7700840 secs]5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961 secs] [Times: user=95.87 sys=3.06, real=310.97 secs] (concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs] 2023325K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 , 281.0357280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]
5798.909: [GC [1 CMS-initial-mark: 572240K(2054592K)] 573896K(2092928K), 0.0579220 secs] [Times: user=0.01 sys=0.00, real=0.08 secs]

The concurrent mode failure, with 281.05 seconds of GC time, looks like the culprit for the problem. I just wonder how to resolve this issue.

Jimmy.

--------------------------------------------------
From: "Ryan Rawson" <[email protected]>
Sent: Friday, July 16, 2010 12:57 PM
To: <[email protected]>
Subject: Re: YouAreDeadException with hbase

Sometimes the GC can chain multiple medium pauses into one large pause. I've seen this before where there are 2 long pauses back to back and the result was a 50 second+ pause.

This article talks a lot about GC performance and tuning, check it out:
http://wiki.apache.org/hadoop/PerformanceTuning

-ryan

On Fri, Jul 16, 2010 at 11:55 AM, Jinsong Hu <[email protected]> wrote:

Yes, the root cause seems to be the gap of 4 minutes between 2010-07-16 05:49:26,805 and 2010-07-16 05:53:23,476. But I checked the GC log gc-hbase.log and don't see a 4 minute gap in GC. I just wonder what could cause this large gap. I also wonder if there is a configuration that I can use to avoid this long pause, or to get around the problem caused by this long pause.

Jimmy

--------------------------------------------------
From: "Stack" <[email protected]>
Sent: Friday, July 16, 2010 11:44 AM
To: <[email protected]>
Subject: Re: YouAreDeadException with hbase

You'll see this if the server reports to the master after the master has ruled it 'dead'.

Here is the code that produces the exception:

    if (!isDead(serverName)) return;
    String message = "Server " + what + " rejected; currently processing " +
        serverName + " as dead server";
    LOG.debug(message);
    throw new YouAreDeadException(message);

Servers are on the 'dead' list if zk reports their session has expired. The master then moves to clean up after the dead server and process its logs. If the server reports in during this cleanup time, the master will return the YouAreDeadException. Usually the RS has lost its zk session but has yet to realize it.

St.Ack

On Thu, Jul 15, 2010 at 11:52 PM, Jinsong Hu <[email protected]> wrote:

Hi, there:

I got some YouAreDeadException with hbase. What can cause it? I do notice that between 5:49 and 5:53, for 4 minutes, there is no log. This doesn't look like a GC issue, as I checked the GC log and the longest GC is only 9.6 seconds.

Jimmy.

2010-07-16 05:49:26,805 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB (425113472), Max=408.775MB (428631648), Counts: Blocks=1, Access=2178914, Hit=1034, Miss=2177880, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04745483165606856%, Miss Ratio=99.95254278182983%, Evicted/Run=NaN
2010-07-16 05:53:23,476 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB (425113472), Max=408.775MB (428631648), Counts: Blocks=1, Access=2178915, Hit=1035, Miss=2177880, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04750070511363447%, Miss Ratio=99.95250105857849%, Evicted/Run=NaN
....
2010-07-16 05:53:26,171 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 240540ms for sessionid 0x329c88039b0006c, closing socket connection and attempting reconnect
2010-07-16 05:53:27,333 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server t-zookeeper2.cloud.ppops.net/10.110.24.57:2181
2010-07-16 05:53:27,334 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to t-zookeeper2.cloud.ppops.net/10.110.24.57:2181, initiating session
2010-07-16 05:53:27,335 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x329c88039b0006c has expired, closing socket connection
2010-07-16 05:53:27,896 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 240520ms for sessionid 0x129c87a7f98007a, closing socket connection and attempting reconnect
2010-07-16 05:53:39,090 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server serverName=m0002028.ppops.net,60020,1279237223465, load=(requests=952, regions=21, usedHeap=575, maxHeap=2043): Unhandled exception
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing m0002028.ppops.net,60020,1279237223465 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:684)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
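
The "4 minutes with no log" observation in this thread can be checked mechanically. Below is a minimal sketch, assuming every log entry starts with a 23-character "yyyy-MM-dd HH:mm:ss,SSS" timestamp as in the regionserver log above. The class name LogGapFinder, its methods, and the 60-second threshold are hypothetical choices for illustration, not anything shipped with HBase. Lines without a leading timestamp (stack trace lines, for example) are skipped.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of HBase.
final class LogGapFinder {
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
    private static final int TS_LENGTH = 23; // length of "2010-07-16 05:49:26,805"

    /** Parses the leading timestamp of a log line, or returns null if there is none. */
    static LocalDateTime timestampOf(String line) {
        if (line.length() < TS_LENGTH) {
            return null;
        }
        try {
            return LocalDateTime.parse(line.substring(0, TS_LENGTH), TS);
        } catch (DateTimeParseException e) {
            return null; // e.g. a stack trace line
        }
    }

    /** Returns, in milliseconds, every gap between consecutive timestamped lines that exceeds the threshold. */
    static List<Long> gapsLongerThan(List<String> lines, long thresholdMillis) {
        List<Long> gaps = new ArrayList<>();
        LocalDateTime previous = null;
        for (String line : lines) {
            LocalDateTime current = timestampOf(line);
            if (current == null) {
                continue;
            }
            if (previous != null) {
                long gap = Duration.between(previous, current).toMillis();
                if (gap > thresholdMillis) {
                    gaps.add(gap);
                }
            }
            previous = current;
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Timestamps taken from the log in this thread; message bodies shortened.
        List<String> lines = Arrays.asList(
                "2010-07-16 05:49:26,805 DEBUG LruBlockCache: Cache Stats (shortened)",
                "2010-07-16 05:53:23,476 DEBUG LruBlockCache: Cache Stats (shortened)",
                "2010-07-16 05:53:26,171 INFO ClientCnxn: Client session timed out (shortened)");
        // 60_000 ms is an assumed placeholder threshold.
        System.out.println(gapsLongerThan(lines, 60_000L));
    }
}
```

With the three timestamps above, the only gap over the threshold is the one between 05:49:26,805 and 05:53:23,476, which works out to 236671 ms, in line with the roughly 240 seconds ZooKeeper reports for the missed heartbeats.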
