http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A8

  <property>
    <name>zookeeper.session.timeout</name>
    <value>1200000</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>6000</value>
  </property>

The default is 60 seconds, which you reduced to 20. (Assuming this is the right parameter.)
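If it helps to see why the wiki bumps both values together: a ZooKeeper server only grants session timeouts in the range [2 * tickTime, 20 * tickTime] and silently clamps anything outside it, so zookeeper.session.timeout is only a request. A rough sketch of that negotiation (the class and method names here are mine, not ZooKeeper's internals):

    // Rough sketch of ZooKeeper's session-timeout clamping; the names
    // are illustrative, not the actual ZooKeeper internals.
    public final class SessionTimeoutBounds {

        // By default a ZooKeeper server honors session timeouts only in
        // [2 * tickTime, 20 * tickTime], clamping the client's request.
        public static int negotiate(int requestedMs, int tickTimeMs) {
            int min = 2 * tickTimeMs;
            int max = 20 * tickTimeMs;
            return Math.max(min, Math.min(max, requestedMs));
        }

        public static void main(String[] args) {
            // With the wiki's tickTime of 6000ms, a requested 1200000ms
            // (20 minutes) is negotiated down to 120000ms (2 minutes).
            System.out.println(negotiate(1200000, 6000)); // 120000
            // A 20000ms request with the default 2000ms tickTime is
            // honored as-is, since 4000 <= 20000 <= 40000.
            System.out.println(negotiate(20000, 2000));   // 20000
        }
    }

So with tickTime=6000 the most any client can actually get is 120 seconds; the 1200000 above is an upper bound on the request, not the granted value.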
As you said, you were doing a major compaction at the time.

On May 24, 2012, at 6:15 AM, Eran Kutner wrote:

> Thanks Stack for noticing the ZooKeeper timeout, don't know how I could have missed that.
>
> After analyzing this for a while it is definitely unrelated to GC. In fact, during the last 4 days no GC operation took more than 2 seconds, and those that got close were all concurrent mark sweeps, so they should not be stopping other threads.
>
> These are the interesting log lines:
>
> 2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 23706ms for sessionid 0x1372aa57bee0308, closing socket connection and attempting reconnect
> 2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 24638ms for sessionid 0x3372bf3891304bf, closing socket connection and attempting reconnect
> 2012-05-22 01:25:12,047 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop1-zk1/10.1.104.201:2181
> 2012-05-22 01:25:12,048 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop1-zk1/10.1.104.201:2181, initiating session
> 2012-05-22 01:25:12,080 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x3372bf3891304bf has expired, closing socket connection
> 2012-05-22 01:25:12,081 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hadoop1-s05.farm-ny.gigya.com,60020,1336990798475, load=(requests=4015, regions=708, usedHeap=2342, maxHeap=7983): regionserver:60020-0x3372bf3891304bf regionserver:60020-0x3372bf3891304bf received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
>
> This is what the ZooKeeper logs show at the same time:
>
> 2012-05-22 01:24:46,014 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x1372aa57bef6611, likely client has closed socket
> 2012-05-22 01:24:46,014 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /10.1.104.4:57598 which had sessionid 0x1372aa57bef6611
> 2012-05-22 01:25:08,010 - ERROR [CommitProcessor:1:NIOServerCnxn@445] - Unexpected Exception:
> 2012-05-22 01:25:08,016 - INFO [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client /10.1.104.5:33945 which had sessionid 0x1372aa57bee0308
> 2012-05-22 01:25:12,046 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /10.1.104.5:43070
> 2012-05-22 01:25:12,076 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew session 0x3372bf3891304bf at /10.1.104.5:43070
> 2012-05-22 01:25:12,076 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating client: 231702230809642175
> 2012-05-22 01:25:12,077 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1573] - Invalid session 0x3372bf3891304bf for client /10.1.104.5:43070, probably expired
> 2012-05-22 01:25:12,078 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /10.1.104.5:43070 which had sessionid 0x3372bf3891304bf
>
> I have zookeeper.session.timeout set to 20 seconds because I wanted quick recovery in case of a failure.
>
> Any idea why it would not respond in 20 seconds? Seems like quite a lot of time.
> Don't know if it's related or not, but a major compaction was happening while this error occurred.
>
> Thanks.
>
> -eran
>
>
> On Fri, May 11, 2012 at 2:36 PM, Michael Segel <[email protected]> wrote:
>
>> So I see you're looking at Eran's problem.... ;-)
>>
>> Since you say he's fairly capable, I'm assuming when he said he had GC and MSLAB set up, he did it right, so a GC pause wouldn't cause the error.
>>
>> Bad node? Possible. It could easily be a networking/hardware issue, which are pain-in-the-ass problems to track down and solve.
>>
>> With respect to the dfs.bandwidthPerSec... yes, it's an HDFS setting. As you point out, it's an indirect issue. However, that doesn't mean it wouldn't have an impact on performance.
>>
>> OP states that this occurs under heavy writes. What happens to the writes when a table is splitting?
>>
>>
>> On May 11, 2012, at 12:12 AM, Stack wrote:
>>
>>> On Thu, May 10, 2012 at 6:26 AM, Michael Segel <[email protected]> wrote:
>>>> 4) Google dfs.balance.bandwidthPerSec. I believe it's also used by HBase when it needs to move regions.
>>>
>>> Nah. This is an HDFS setting. HBase doesn't use it directly.
>>>
>>>> Speaking of which, what happens when HBase decides to move a region? Does it make a copy on the new RS and then, after it's there, point to the new RS and then remove the old region?
>>>
>>> When one RS closes the region and another opens it, there is no copy to be done since the region data is in the HDFS they both share.
>>>
>>> St.Ack
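A footnote for anyone who lands on this thread later: the Disconnected and Expired events in those logs get very different handling on the client side. A disconnect just starts the reconnect loop visible above, and the session survives if the client reaches another quorum member in time; once the quorum answers "session expired", the region server's ephemeral znode is already gone and the master may be recovering its regions, so aborting is the only safe response. A minimal sketch of that pattern, assuming a hypothetical abort() helper (this is not the actual ZooKeeperWatcher/HRegionServer code):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    // Minimal sketch of the abort-on-expiry pattern behind the stack
    // trace above; the real logic lives in ZooKeeperWatcher.connectionEvent().
    public class ExpiryWatcher implements Watcher {

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() != Event.EventType.None) {
                return; // only connection-state events matter here
            }
            switch (event.getState()) {
                case Disconnected:
                    // Transient: the ZooKeeper client library keeps retrying
                    // other quorum members ("attempting reconnect" above) and
                    // the session survives if it reconnects within the timeout.
                    break;
                case Expired:
                    // The quorum already timed the session out, so the ephemeral
                    // znode is gone and recovery may have started elsewhere;
                    // continuing to serve would risk two owners of the regions.
                    abort("received expired from ZooKeeper");
                    break;
                default:
                    break;
            }
        }

        // Hypothetical stand-in for HRegionServer.abort(String).
        private void abort(String reason) {
            throw new RuntimeException(reason);
        }
    }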

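One more practical note on ruling GC in or out: rather than inferring pause times, it is simpler to have the JVM log its own stop-the-world time. Something like the following in hbase-env.sh does that on the JDK 6/7 collectors of this era (the log path is illustrative):

    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/log/hbase/gc-regionserver.log"

-XX:+PrintGCApplicationStoppedTime is the useful one here: it reports every safepoint pause, including the short stop-the-world phases (initial mark, remark) that CMS still has, which is exactly what would have to exceed the 20-second session timeout for GC to be the culprit.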