Thanks Stack for noticing the ZooKeeper timeout; I don't know how I could
have missed that.

After analyzing this for a while, I'm convinced it is unrelated to GC. In
fact, during the last 4 days no GC operation took more than 2 seconds, and
the ones that came close were all concurrent mark-sweep collections, which
should not stop the application threads.

These are the interesting lines from the region server log:
2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 23706ms for sessionid
0x1372aa57bee0308, closing socket connection and attempting reconnect
2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 24638ms for sessionid
0x3372bf3891304bf, closing socket connection and attempting reconnect
2012-05-22 01:25:12,047 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoop1-zk1/10.1.104.201:2181
2012-05-22 01:25:12,048 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoop1-zk1/10.1.104.201:2181, initiating session
2012-05-22 01:25:12,080 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x3372bf3891304bf has expired,
closing socket connection
2012-05-22 01:25:12,081 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=hadoop1-s05.farm-ny.gigya.com,60020,1336990798475,
load=(requests=4015, regions=708, usedHeap=2342, maxHeap=7983):
regionserver:60020-0x3372bf3891304bf regionserver:60020-0x3372bf3891304bf
received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
        at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
        at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
        at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
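As I understand it, the "received expired from ZooKeeper, aborting" part is
just the ZooKeeper client contract at work: a Disconnected event is
recoverable (the client library reconnects on its own), but once the session
has Expired the ephemeral znodes are gone and the only option is to build a
brand new ZooKeeper handle, which for a region server means aborting. A
minimal illustration of that contract (not HBase's actual code, just a
sketch):

import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch of how a ZooKeeper client has to treat session state changes:
// Disconnected is transient, Expired is terminal for this handle.
public class SessionWatcher implements Watcher {
    private final ZooKeeper zk;

    public SessionWatcher(String quorum, int sessionTimeoutMs)
            throws IOException {
        this.zk = new ZooKeeper(quorum, sessionTimeoutMs, this);
    }

    @Override
    public void process(WatchedEvent event) {
        switch (event.getState()) {
            case Disconnected:
                // Transient: the client library keeps retrying on its own.
                break;
            case Expired:
                // The session (and all its ephemeral znodes) is gone for
                // good; either recreate everything on a fresh handle or
                // shut down. The region server chooses to abort.
                break;
            default:
                break;
        }
    }
}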

This is what the ZooKeeper server logs show at the same time:
2012-05-22 01:24:46,014 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to
read additional data from client sessionid 0x1372aa57bef6611, likely client
has closed socket
2012-05-22 01:24:46,014 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /10.1.104.4:57598 which had sessionid 0x1372aa57bef6611
2012-05-22 01:25:08,010 - ERROR [CommitProcessor:1:NIOServerCnxn@445] -
Unexpected Exception:
2012-05-22 01:25:08,016 - INFO  [CommitProcessor:1:NIOServerCnxn@1435] -
Closed socket connection for client /10.1.104.5:33945 which had sessionid
0x1372aa57bee0308
2012-05-22 01:25:12,046 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
connection from /10.1.104.5:43070
2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew
session 0x3372bf3891304bf at /10.1.104.5:43070
2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating client: 231702230809642175
2012-05-22 01:25:12,077 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1573] - Invalid session
0x3372bf3891304bf for client /10.1.104.5:43070, probably expired
2012-05-22 01:25:12,078 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /10.1.104.5:43070 which had sessionid 0x3372bf3891304bf

I have zookeeper.session.timeout set to 20 seconds because I wanted quick
recovery in case of a failure.
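For reference, this is how it's set in hbase-site.xml (the value is in
milliseconds). As I understand it, the timeout a client actually ends up
with is also bounded by the ZooKeeper server's minSessionTimeout and
maxSessionTimeout, which default to 2x and 20x tickTime, so those are worth
double-checking too:

<property>
  <name>zookeeper.session.timeout</name>
  <value>20000</value>
</property>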

Any idea why the region server would go more than 20 seconds without
talking to ZooKeeper? That seems like quite a lot of time.
I don't know whether it's related, but a major compaction was running while
this error occurred.

Thanks.

-eran



On Fri, May 11, 2012 at 2:36 PM, Michael Segel <[email protected]> wrote:

> So I see you're looking at Eran's problem.... ;-)
>
> Since you say he's fairly capable, I'm assuming that when he said he had GC
> and MSLAB set up, he did it right, so a GC pause wouldn't cause the error.
>
> Bad node? Possible.  It could easily be a networking/hardware issue, and
> those are a pain in the ass to track down and solve.
>
> With respect to dfs.balance.bandwidthPerSec... yes, it's an HDFS setting.
> As you point out, it's an indirect issue. However, that doesn't mean it
> wouldn't have an impact on performance.
>
> OP states that this occurs under heavy writes. What happens to the writes
> when a table is splitting?
>
>
> On May 11, 2012, at 12:12 AM, Stack wrote:
>
> > On Thu, May 10, 2012 at 6:26 AM, Michael Segel
> > <[email protected]> wrote:
> >> 4) Google dfs.balance.bandwidthPerSec.  I believe it's also used by
> >> HBase when it needs to move regions.
> >
> > Nah.  This is an HDFS setting.  HBase doesn't use it directly.
> >
> >> Speaking of which, what happens when HBase decides to move a region?
> >> Does it make a copy on the new RS and then, after it's there, point to
> >> the new RS and then remove the old region?
> >>
> >
> > When one RS closes the region and another opens it, there is no copy
> > to be done since the region data is in the HDFS they both share.
> >
> > St.Ack
> >
>
>
