Funny you mention that. I asked the techs to set it up that way. I went to pull my ZK logs and found that 1 RS is still running. What is interesting is that this RS is connected to ZK on devrackA-05. The 2 RSes that died were connected to ZK on devrackA-03. devrackA-03 has ZK and HMaster on it.
I did not find the YouAreDeadException in the ZK logs. What I found was:

2012-07-05 20:06:40,577 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /172.18.0.72:54449
2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to renew session 0x13858fc240f0003 at /172.18.0.72:54449
2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.quorum.Learner: Revalidating client: 87918032693690371
2012-07-05 20:06:40,580 INFO org.apache.zookeeper.server.NIOServerCnxn: Invalid session 0x13858fc240f0003 for client /172.18.0.72:54449, probably expired

In the RS logs I can see it attempt to reconnect with ZK on devrackA-03, get rejected, and then try ZK on devrackA-04.

---
Jay Wilson

On 7/5/2012 9:08 PM, Amandeep Khurana wrote:
> The timeout can be configured using the session timeout configuration. The
> default for that is 180s, but that means that if the RS doesn't heartbeat to
> ZK for 180s, it's considered dead. Unless the machines are really loaded or
> GCs are pausing the RS processes, I don't see any other reason except the
> network. I'm assuming you gave ZK a dedicated disk so it could write its edit
> logs (based on a previous thread).
>
>
> On Thursday, July 5, 2012 at 9:03 PM, Jay Wilson wrote:
>
>> I don't see that in the RS logs. Would I see that in the ZK logs?
>>
>> At this point there is no network. Just a switch. I reduced the number
>> of nodes to 40 and had all of them placed on the same switch with a
>> single vlan. I even had the network techs use a completely different
>> switch just to be safe.
>>
>> Is there some heartbeat timer I can tweak?
>>
>> ---
>> Jay Wilson
>>
>> On 7/5/2012 8:34 PM, Amandeep Khurana wrote:
>>>
>>>
>>> On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote:
>>>
>>>> Finally my HMaster has stabilized and been running for 7 hours. I
>>>> believe my networking issues are behind me now. Thank you everyone for
>>>> the help.
>>>>
>>>
>>>
>>> Awesome.
>>>
>>> Looks like the same issue is biting you with the RS too. The RS isn't
>>> heartbeating to ZK and the ZK session expires, causing the RS to die.
>>> Do you see a YouAreDeadException in the logs?
>>>>
>>>> New issue is my RSes continue to die after about 20 minutes. Again the
>>>> cluster is idle. No jobs are running and I get this on all of my RSes
>>>> at almost the same time:
>>>>
>>>> 2012-07-05 19:34:05,283 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket connection to server devrackA-04/172.18.0.5:2181
>>>> 2012-07-05 19:34:05,288 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>> connection established to devrackA-04/172.18.0.5:2181, initiating session
>>>> 2012-07-05 19:34:05,301 INFO org.apache.zookeeper.ClientCnxn: Session
>>>> establishment complete on server devrackA-04/172.18.0.5:2181, sessionid
>>>> = 0x13858fc240f0003, negotiated timeout = 180000
>>>> 2012-07-05 19:34:05,399 INFO
>>>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown
>>>> hook thread: Shutdownhook:regionserver60020
>>>> 2012-07-05 20:06:40,279 INFO org.apache.zookeeper.ClientCnxn: Unable to
>>>> read additional data from server sessionid 0x13858fc240f0003, likely
>>>> server has closed socket, closing socket connection and attempting
>>>> reconnect
>>>> 2012-07-05 20:06:40,573 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket connection to server devrackA-03/172.18.0.4:2181
>>>> 2012-07-05 20:06:40,574 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>> connection established to devrackA-03/172.18.0.4:2181, initiating session
>>>> 2012-07-05 20:06:40,578 INFO org.apache.zookeeper.ClientCnxn: Unable to
>>>> reconnect to ZooKeeper service, session 0x13858fc240f0003 has expired,
>>>> closing socket connection
>>>> 2012-07-05 20:06:40,586 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>>> server serverName=devrackB-07,60020,1341542045088, load=(requests=0,
>>>> regions=0, usedHeap=0, maxHeap=0): regionserver:60020-0x13858fc240f0003
>>>> regionserver:60020-0x13858fc240f0003 received expired from ZooKeeper,
>>>> aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>>
>>>> Could the fact that the cluster is idle cause the sessions to expire?
>>>> It's almost like a timing trigger pops, the sessions expire, and then
>>>> can reconnect. Is there a timer I need to adjust?
>>>>
>>>> Could this be related to a TCP or IP timer that needs to be adjusted?
>>>> The session goes into a Fin/Wait state and then closes?
>>>>
>>>> Thank you
>>>> ---
>>>> Jay Wilson
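
For anyone cross-checking the RS and ZK logs quoted in this thread, two of the numbers can be verified with a few lines of Python. This is only a rough sketch: the variable and function names are made up for illustration, and the timestamps, timeout, and session IDs are copied by hand from the log lines above. It shows how long the session lived on the client side and that the decimal ID in the "Revalidating client" line is the same session as the hex ID elsewhere. It does not say why the session expired.

```python
from datetime import datetime

# Timestamp layout used in the HBase/ZooKeeper log lines quoted above.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_ts(text):
    """Parse a log timestamp such as '2012-07-05 19:34:05,301'."""
    return datetime.strptime(text, LOG_TS_FORMAT)


# Values copied from the RS log (session established, socket dropped).
established = parse_log_ts("2012-07-05 19:34:05,301")
dropped = parse_log_ts("2012-07-05 20:06:40,279")
negotiated_timeout_ms = 180000

# How long the session existed before the client noticed the closed socket.
alive_seconds = (dropped - established).total_seconds()
timeout_seconds = negotiated_timeout_ms / 1000.0

print("session alive for %.3f s (about %.1f min)" % (alive_seconds, alive_seconds / 60))
print("negotiated timeout is %.0f s" % timeout_seconds)

# The RS log prints the session ID in hex; the ZK 'Revalidating client'
# line prints a decimal number. Converting shows they are the same session.
session_hex = "0x13858fc240f0003"
revalidating_client = 87918032693690371
session_dec = int(session_hex, 16)

print("same session:", session_dec == revalidating_client)
```

By the client's clock the session lasted a little over 32 minutes, far longer than the 180 s timeout, so it was not rejected at setup. The server had already expired it by the time the RS tried to renew it on devrackA-03.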
