Finally, my HMaster has stabilized and has been running for 7 hours. I believe my networking issues are behind me now. Thank you, everyone, for the help.
The new issue is that my RSes continue to die after about 20 minutes. Again, the cluster is idle. No jobs are running, and I get this on all of my RSes at almost the same time:

2012-07-05 19:34:05,283 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server devrackA-04/172.18.0.5:2181
2012-07-05 19:34:05,288 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to devrackA-04/172.18.0.5:2181, initiating session
2012-07-05 19:34:05,301 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server devrackA-04/172.18.0.5:2181, sessionid = 0x13858fc240f0003, negotiated timeout = 180000
2012-07-05 19:34:05,399 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
2012-07-05 20:06:40,279 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x13858fc240f0003, likely server has closed socket, closing socket connection and attempting reconnect
2012-07-05 20:06:40,573 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server devrackA-03/172.18.0.4:2181
2012-07-05 20:06:40,574 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to devrackA-03/172.18.0.4:2181, initiating session
2012-07-05 20:06:40,578 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x13858fc240f0003 has expired, closing socket connection
2012-07-05 20:06:40,586 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=devrackB-07,60020,1341542045088, load=(requests=0, regions=0, usedHeap=0, maxHeap=0): regionserver:60020-0x13858fc240f0003 regionserver:60020-0x13858fc240f0003 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired

Could the fact that the cluster is idle cause the sessions to expire? It's almost as if a timing trigger pops, the sessions expire, and then the RSes try to reconnect.
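To put numbers on the timing, here is a quick Python sketch that measures the silent stretch in the log above against the negotiated timeout. The helper name `seconds_between` is my own and is not part of HBase or ZooKeeper; the timestamps and the 180000 ms value are copied from the log.

```python
from datetime import datetime

# Timestamp layout used by the region server log (comma before the milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"

def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

# Last line logged after session setup, and the first line reporting the closed socket.
last_logged = "2012-07-05 19:34:05,399"
socket_closed = "2012-07-05 20:06:40,279"

# "negotiated timeout = 180000" in the log is in milliseconds.
negotiated_timeout_s = 180000 / 1000

gap_s = seconds_between(last_logged, socket_closed)
print(f"silent gap: {gap_s:.2f} s ({gap_s / 60:.1f} min)")
print(f"negotiated timeout: {negotiated_timeout_s:.0f} s")
print(f"gap / timeout: {gap_s / negotiated_timeout_s:.1f}")
```

The gap comes out to about 32.6 minutes, roughly ten times the 180 second timeout, so the session appears to have survived many timeout intervals before it was expired. That only describes what the log shows; it does not explain why the session was eventually dropped.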
Is there a timer I need to adjust? Could this be related to a TCP or IP timer that needs to be adjusted? Does the session go into a FIN_WAIT state and then close?

Thank you
---
Jay Wilson
