Re: YouAreDeadException with hbase

Jinsong Hu Fri, 16 Jul 2010 13:51:29 -0700

I was doing stress testing, so the load is not small. But I purposelylimited the data rate on clientside so load is not big either. using "iostat -x 5" and I can see there arelots of situations thatthe CPU goes to very high level and stay there for long time. but then itultimately go down.It is highly possible that during a certain period the CPU was too busy andthe GC process

was starved for CPU.

I researched this failure and found


an excellent thread talking about GC:

http://forums.sun.com/thread.jspa?threadID=698490

that is more detailed than http://wiki.apache.org/hadoop/PerformanceTuning .

I will do some tunning and see if it helps following the posted configthere.


Jimmy.

--------------------------------------------------
From: "Ryan Rawson" <[email protected]>
Sent: Friday, July 16, 2010 1:35 PM
To: <[email protected]>
Subject: Re: YouAreDeadException with hbase

These 2 lines are different GC collections:

5517.355: [GC 5517.355: [ParNew (promotion failed):18113K->18113K(19136K), 0.77

00840 secs]
5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961 secs]
[Times: user=95.87 sys=3.06, real=310.97 secs]

(concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs]2023325K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 ,281.03

57280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]

It's a little hard to read that, it looks like the CMS concurrent mark
took 310 seconds then failed, then we got a 281 second real time
pause, but interestingly enough the user and system time is fairly
low.

How loaded are these machines?  You need to be giving enough
uncontended CPU time to hbase.

On Fri, Jul 16, 2010 at 1:30 PM, Jinsong Hu <[email protected]>wrote:

I already implemented all these configs before my test. I checked
gc-hbase.log,

I see a GC failure which looks very suspecious:

5515.974: [GC 5515.974: [ParNew: 19120K->2112K(19136K), 0.8344240 secs]
2016283K
->2007324K(2069308K) icms_dc=100 , 0.8345620 secs] [Times: user=0.08
sys=0.00, r
eal=0.83 secs]

5517.355: [GC 5517.355: [ParNew (promotion failed):18113K->18113K(19136K),

0.77
00840 secs]5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961
secs] [
Times: user=95.87 sys=3.06, real=310.97 secs]
(concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs]
20233
25K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 ,
281.03
57280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]

5798.909: [GC [1 CMS-initial-mark: 572240K(2054592K)] 573896K(2092928K),
0.05792
20 secs] [Times: user=0.01 sys=0.00, real=0.08 secs]

the concurrent mode failure, and 281.05 seconds of GC time, looks likethe

culprit for the problem.
I just wonder how to resolve this issue.

Jimmy.


--------------------------------------------------
From: "Ryan Rawson" <[email protected]>
Sent: Friday, July 16, 2010 12:57 PM
To: <[email protected]>
Subject: Re: YouAreDeadException with hbase

Sometimes the GC can chain multiple medium pauses into one large
pause.  I've seen this before where there are 2 long pauses back to
back and the result was a 50second+ pause.

This article talks a lot about GC performance and tuning, check it out:
http://wiki.apache.org/hadoop/PerformanceTuning

-ryan

On Fri, Jul 16, 2010 at 11:55 AM, Jinsong Hu <[email protected]>
wrote:


Yes, the root cause seems to be the gap of 4 minutes between

2010-07-16 05:49:26,805 and 2010-07-16 05:53:23,476 . but I checkedGC

gc-hbase.log and don't see
4 minute gap in gc. I just wonder what could cause this large gap.

I also wonder if there is a configuration that I can do to avoid this
long
pause, or get around the problem
cause by this long pause.

Jimmy

--------------------------------------------------
From: "Stack" <[email protected]>
Sent: Friday, July 16, 2010 11:44 AM
To: <[email protected]>
Subject: Re: YouAreDeadException with hbase

You'll see this if the server reports to the master after the master
has ruled it 'dead'.

Here is the code that produces the exception:

 if (!isDead(serverName)) return;

String message = "Server " + what + " rejected; currently processing"

+
   serverName + " as dead server";
 LOG.debug(message);
 throw new YouAreDeadException(message);

Servers are on the 'dead' list if zk reports their session has
expired.  The master moves then to cleanup after the dead server and
process its logs.  If during this cleanup time the server reports in,
master will return the youaredead exception.

Usually the RS has lost its zk session but has yet to realize it.

St.Ack

On Thu, Jul 15, 2010 at 11:52 PM, Jinsong Hu <[email protected]>
wrote:


Hi, There:
 I got some YouAreDeadException  with hbase. what can cause it ? I do
notice
between 5:49 to 5:53 ,
for 4 minutes, there is no log. This doesn't look like GC issue as I
checked
the GC log, the longest GC
is only 9.6 seconds.

Jimmy.


2010-07-16 05:49:26,805 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Ca
che Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB
(425113472),
Max=4
08.775MB (428631648), Counts: Blocks=1, Access=2178914, Hit=1034,
Miss=2177880,
Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04745483165606856%, Miss
Ratio=99.95
254278182983%, Evicted/Run=NaN
2010-07-16 05:53:23,476 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Ca
che Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB
(425113472),
Max=4
08.775MB (428631648), Counts: Blocks=1, Access=2178915, Hit=1035,
Miss=2177880,
Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04750070511363447%, Miss
Ratio=99.95
250105857849%, Evicted/Run=NaN

....
2010-07-16 05:53:26,171 INFO org.apache.zookeeper.ClientCnxn: Client
session
tim
ed out, have not heard from server in 240540ms for sessionid
0x329c88039b0006c,
closing socket connection and attempting reconnect
2010-07-16 05:53:27,333 INFO org.apache.zookeeper.ClientCnxn: Opening
socket
con
nection to server t-zookeeper2.cloud.ppops.net/10.110.24.57:2181
2010-07-16 05:53:27,334 INFO org.apache.zookeeper.ClientCnxn: Socket
connection
established to t-zookeeper2.cloud.ppops.net/10.110.24.57:2181,
initiating
sessio
n

2010-07-16 05:53:27,335 INFO org.apache.zookeeper.ClientCnxn: Unableto

reconnec

t to ZooKeeper service, session 0x329c88039b0006c has expired,closing

socket co
nnection
2010-07-16 05:53:27,896 INFO org.apache.zookeeper.ClientCnxn: Client
session
tim
ed out, have not heard from server in 240520ms for sessionid
0x129c87a7f98007a,
closing socket connection and attempting reconnect


2010-07-16 05:53:39,090 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer
: Aborting region server
serverName=m0002028.ppops.net,60020,1279237223465,
load
=(requests=952, regions=21, usedHeap=575, maxHeap=2043): Unhandled
exception
org.apache.hadoop.hbase.YouAreDeadException:
org.apache.hadoop.hbase.YouAreDeadE
xception: Server REPORT rejected; currently processing
m0002028.ppops.net,60020,
1279237223465 as dead server
    at
org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManage
r.java:217)
    at
org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serve
rManager.java:271)
    at
org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav
a:684)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:
919)

Re: YouAreDeadException with hbase

Reply via email to