One bit of advice that is easy to overlook but important, from our wiki about performance tuning:

"This can be due to the size of the ParNew, which should be relatively small. If your ParNew is very large after running HBase for a while (in one example a ParNew was about 150MB), then you might have to constrain the size of ParNew. Add this to HBASE_OPTS:

export HBASE_OPTS="-XX:NewSize=6m -XX:MaxNewSize=6m <cms options from above> <gc logging options from above>""
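For illustration, a fully filled-in hbase-env.sh line might look like the sketch below. The CMS and GC-logging flags shown are commonly used values, not the exact options the wiki's placeholders refer to, so treat them as assumptions to tune for your own cluster:

# hbase-env.sh -- illustrative sketch only; the CMS and logging flags are
# assumed stand-ins for the wiki's "<cms options>" and "<gc logging options>"
export HBASE_OPTS="-XX:NewSize=6m -XX:MaxNewSize=6m \
  -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:${HBASE_HOME}/logs/gc-hbase.log"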
On Fri, Jul 16, 2010 at 1:50 PM, Jinsong Hu <[email protected]> wrote:
> I was doing stress testing, so the load is not small. But I purposely
> limited the data rate on the client side, so the load is not big either.
> Using "iostat -x 5" I can see there are lots of situations where the CPU
> goes to a very high level and stays there for a long time, but then it
> ultimately goes down. It is highly possible that during a certain period
> the CPU was too busy and the GC process was starved for CPU.
>
> I researched this failure and found an excellent thread talking about GC:
>
> http://forums.sun.com/thread.jspa?threadID=698490
>
> that is more detailed than http://wiki.apache.org/hadoop/PerformanceTuning .
> I will do some tuning following the config posted there and see if it helps.
>
> Jimmy.
>
> --------------------------------------------------
> From: "Ryan Rawson" <[email protected]>
> Sent: Friday, July 16, 2010 1:35 PM
> To: <[email protected]>
> Subject: Re: YouAreDeadException with hbase
>
>> These 2 lines are different GC collections:
>>
>> 5517.355: [GC 5517.355: [ParNew (promotion failed): 18113K->18113K(19136K), 0.7700840 secs]
>> 5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961 secs] [Times: user=95.87 sys=3.06, real=310.97 secs] (concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs] 2023325K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 , 281.0357280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]
>>
>> It's a little hard to read, but it looks like the CMS concurrent mark
>> took 310 seconds and then failed, and then we got a 281-second real-time
>> pause; interestingly enough, the user and system times are fairly low.
>>
>> How loaded are these machines? You need to be giving enough
>> uncontended CPU time to hbase.
>>
>>
>> On Fri, Jul 16, 2010 at 1:30 PM, Jinsong Hu <[email protected]> wrote:
>>>
>>> I already implemented all these configs before my test. I checked
>>> gc-hbase.log and see a GC failure which looks very suspicious:
>>>
>>> 5515.974: [GC 5515.974: [ParNew: 19120K->2112K(19136K), 0.8344240 secs] 2016283K->2007324K(2069308K) icms_dc=100 , 0.8345620 secs] [Times: user=0.08 sys=0.00, real=0.83 secs]
>>>
>>> 5517.355: [GC 5517.355: [ParNew (promotion failed): 18113K->18113K(19136K), 0.7700840 secs]5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961 secs] [Times: user=95.87 sys=3.06, real=310.97 secs] (concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs] 2023325K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 , 281.0357280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]
>>>
>>> 5798.909: [GC [1 CMS-initial-mark: 572240K(2054592K)] 573896K(2092928K), 0.0579220 secs] [Times: user=0.01 sys=0.00, real=0.08 secs]
>>>
>>> The concurrent mode failure, and the 281.05 seconds of GC time, look like
>>> the culprit for the problem. I just wonder how to resolve this issue.
>>>
>>> Jimmy.
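A side note on the concurrent mode failure quoted above: a common mitigation (our suggestion, not something stated in this thread) is to start CMS collections earlier, so the old generation is not nearly full when a concurrent mark begins. The threshold below is an illustrative assumption:

# Illustrative flags only; the 70% threshold is an assumption to tune per cluster
export HBASE_OPTS="$HBASE_OPTS -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"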
>>>
>>> --------------------------------------------------
>>> From: "Ryan Rawson" <[email protected]>
>>> Sent: Friday, July 16, 2010 12:57 PM
>>> To: <[email protected]>
>>> Subject: Re: YouAreDeadException with hbase
>>>
>>>> Sometimes the GC can chain multiple medium pauses into one large
>>>> pause. I've seen this before, where there were 2 long pauses back to
>>>> back and the result was a 50-second+ pause.
>>>>
>>>> This article talks a lot about GC performance and tuning, check it out:
>>>> http://wiki.apache.org/hadoop/PerformanceTuning
>>>>
>>>> -ryan
>>>>
>>>> On Fri, Jul 16, 2010 at 11:55 AM, Jinsong Hu <[email protected]> wrote:
>>>>>
>>>>> Yes, the root cause seems to be the gap of 4 minutes between
>>>>> 2010-07-16 05:49:26,805 and 2010-07-16 05:53:23,476, but I checked
>>>>> gc-hbase.log and don't see a 4-minute gap in GC. I just wonder what
>>>>> could cause this large gap.
>>>>>
>>>>> I also wonder if there is a configuration I can set to avoid this
>>>>> long pause, or to get around the problem caused by this long pause.
>>>>>
>>>>> Jimmy
>>>>>
>>>>> --------------------------------------------------
>>>>> From: "Stack" <[email protected]>
>>>>> Sent: Friday, July 16, 2010 11:44 AM
>>>>> To: <[email protected]>
>>>>> Subject: Re: YouAreDeadException with hbase
>>>>>
>>>>>> You'll see this if the server reports to the master after the master
>>>>>> has ruled it 'dead'.
>>>>>>
>>>>>> Here is the code that produces the exception:
>>>>>>
>>>>>> if (!isDead(serverName)) return;
>>>>>> String message = "Server " + what + " rejected; currently processing "
>>>>>>     + serverName + " as dead server";
>>>>>> LOG.debug(message);
>>>>>> throw new YouAreDeadException(message);
>>>>>>
>>>>>> Servers are on the 'dead' list if zk reports their session has
>>>>>> expired. The master then moves to clean up after the dead server and
>>>>>> process its logs. If during this cleanup time the server reports in,
>>>>>> the master will return the YouAreDeadException.
>>>>>>
>>>>>> Usually the RS has lost its zk session but has yet to realize it.
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> On Thu, Jul 15, 2010 at 11:52 PM, Jinsong Hu <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi, there:
>>>>>>> I got some YouAreDeadExceptions with hbase. What can cause it? I do
>>>>>>> notice that between 5:49 and 5:53, for 4 minutes, there is no log.
>>>>>>> This doesn't look like a GC issue, as I checked the GC log and the
>>>>>>> longest GC is only 9.6 seconds.
>>>>>>>
>>>>>>> Jimmy.
>>>>>>>
>>>>>>>
>>>>>>> 2010-07-16 05:49:26,805 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB (425113472), Max=408.775MB (428631648), Counts: Blocks=1, Access=2178914, Hit=1034, Miss=2177880, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04745483165606856%, Miss Ratio=99.95254278182983%, Evicted/Run=NaN
>>>>>>> 2010-07-16 05:53:23,476 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB (425113472), Max=408.775MB (428631648), Counts: Blocks=1, Access=2178915, Hit=1035, Miss=2177880, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04750070511363447%, Miss Ratio=99.95250105857849%, Evicted/Run=NaN
>>>>>>>
>>>>>>> ....
>>>>>>> 2010-07-16 05:53:26,171 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 240540ms for sessionid 0x329c88039b0006c, closing socket connection and attempting reconnect
>>>>>>> 2010-07-16 05:53:27,333 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server t-zookeeper2.cloud.ppops.net/10.110.24.57:2181
>>>>>>> 2010-07-16 05:53:27,334 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to t-zookeeper2.cloud.ppops.net/10.110.24.57:2181, initiating session
>>>>>>> 2010-07-16 05:53:27,335 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x329c88039b0006c has expired, closing socket connection
>>>>>>> 2010-07-16 05:53:27,896 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 240520ms for sessionid 0x129c87a7f98007a, closing socket connection and attempting reconnect
>>>>>>>
>>>>>>>
>>>>>>> 2010-07-16 05:53:39,090 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server serverName=m0002028.ppops.net,60020,1279237223465, load=(requests=952, regions=21, usedHeap=575, maxHeap=2043): Unhandled exception
>>>>>>> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing m0002028.ppops.net,60020,1279237223465 as dead server
>>>>>>>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
>>>>>>>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
>>>>>>>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:684)
>>>>>>>     at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
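Worth noting when reading the log above: the ZooKeeper client had not heard from the server in 240540ms, roughly 240 seconds, which is shorter than the 281-second GC pause found in gc-hbase.log, so the region server went silent for longer than its session timeout allowed, the session expired, and the master declared it dead. If such pauses cannot be fully eliminated, one hedge (our suggestion, not something proposed in this thread) is to raise the session timeout in hbase-site.xml; the value below is illustrative, and the ZooKeeper server's own maxSessionTimeout must also permit it:

<!-- hbase-site.xml: illustrative value only; the ZK server must allow a timeout this large -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value> <!-- 5 minutes, in milliseconds -->
</property>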
