I don't see that in the RS logs. Would I see it in the ZK logs? At this point there is no network to speak of, just a switch. I reduced the number of nodes to 40 and had all of them placed on the same switch with a single VLAN. I even had the network techs use a completely different switch just to be safe.
Is there some heartbeat timer I can tweak?

---
Jay Wilson

On 7/5/2012 8:34 PM, Amandeep Khurana wrote:
>
> On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote:
>
>> Finally my HMaster has stabilized and been running for 7 hours. I
>> believe my networking issues are behind me now. Thank you everyone for
>> the help.
>>
>
> Awesome.
>
> Looks like the same issue is biting you with the RS too. The RS isn't
> heartbeating to ZK and the ZK session expires, causing the RS to die.
> Do you see a YouAreDeadException in the logs?
>
>> New issue is my RSes continue to die after about 20 minutes. Again the
>> cluster is idle. No jobs are running and I get this on all of my RSes
>> at almost the same time:
>>
>> 2012-07-05 19:34:05,283 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server devrackA-04/172.18.0.5:2181
>> 2012-07-05 19:34:05,288 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to devrackA-04/172.18.0.5:2181, initiating session
>> 2012-07-05 19:34:05,301 INFO org.apache.zookeeper.ClientCnxn: Session
>> establishment complete on server devrackA-04/172.18.0.5:2181, sessionid
>> = 0x13858fc240f0003, negotiated timeout = 180000
>> 2012-07-05 19:34:05,399 INFO
>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown
>> hook thread: Shutdownhook:regionserver60020
>> 2012-07-05 20:06:40,279 INFO org.apache.zookeeper.ClientCnxn: Unable to
>> read additional data from server sessionid 0x13858fc240f0003, likely
>> server has closed socket, closing socket connection and attempting reconnect
>> 2012-07-05 20:06:40,573 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server devrackA-03/172.18.0.4:2181
>> 2012-07-05 20:06:40,574 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to devrackA-03/172.18.0.4:2181, initiating session
>> 2012-07-05 20:06:40,578 INFO org.apache.zookeeper.ClientCnxn: Unable to
>> reconnect to ZooKeeper service, session 0x13858fc240f0003 has expired,
>> closing socket connection
>> 2012-07-05 20:06:40,586 FATAL
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>> server serverName=devrackB-07,60020,1341542045088, load=(requests=0,
>> regions=0, usedHeap=0, maxHeap=0): regionserver:60020-0x13858fc240f0003
>> regionserver:60020-0x13858fc240f0003 received expired from ZooKeeper,
>> aborting
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired
>>
>> Could the fact that the cluster is idle cause the sessions to expire?
>> It's almost like a timing trigger pops, the sessions expire, and then
>> can reconnect. Is there a timer I need to adjust?
>>
>> Could this be related to a TCP or IP timer that needs to be adjusted?
>> The session goes into a Fin/Wait state and then closes?
>>
>> Thank you
>> ---
>> Jay Wilson
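
The timing in the quoted log can be checked with a few lines of Python. This is only a rough sketch: the two timestamps and the negotiated timeout are copied from the log excerpt, while the variable names are illustrative and not taken from HBase or ZooKeeper. It compares how long the session lived before the connection dropped with the 180000 ms timeout the client negotiated.

```python
from datetime import datetime

# Timestamp layout used by the quoted log lines (comma before the milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Values copied from the log excerpt; the names are illustrative only.
session_established = datetime.strptime("2012-07-05 19:34:05,301", LOG_TIME_FORMAT)
connection_lost = datetime.strptime("2012-07-05 20:06:40,279", LOG_TIME_FORMAT)
negotiated_timeout_ms = 180000

# Time the session stayed up before the client could no longer read from the server.
elapsed = connection_lost - session_established
timeout_s = negotiated_timeout_ms / 1000.0

print("session lifetime: %.3f s" % elapsed.total_seconds())
print("negotiated timeout: %.1f s" % timeout_s)
print("lifetime / timeout: %.2f" % (elapsed.total_seconds() / timeout_s))
```

If the session lifetime comes out many times longer than the negotiated timeout, the session survived several timeout windows before it was lost, which would suggest looking at what changed at that moment rather than at the timeout value alone.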