Is your ZK managed by HBase or are you managing it yourself? BTW - All ZK nodes should be reachable by all nodes in the cluster.
The YouAreDeadException would be in the RS logs, if it appears at all.

On Thursday, July 5, 2012 at 9:38 PM, Jay Wilson wrote:

> Funny you mention that. I asked the techs to set it up that way.
>
> I went to pull my ZK logs and found that 1 RS is still running. What is
> interesting is that this RS is connected to ZK on devrackA-05. The 2 RSes
> that died were connected to ZK on devrackA-03. devrackA-03 has ZK and
> HMaster on it.
>
> I did not find the YouAreDeadException in the ZK logs. What I found was:
>
> 2012-07-05 20:06:40,577 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Accepted socket connection from /172.18.0.72:54449
> 2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Client attempting to renew session 0x13858fc240f0003 at /172.18.0.72:54449
> 2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.quorum.Learner:
> Revalidating client: 87918032693690371
> 2012-07-05 20:06:40,580 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Invalid session 0x13858fc240f0003 for client /172.18.0.72:54449,
> probably expired
>
> In the RS logs I can see it attempt to reconnect with ZK on devrackA-03,
> get rejected, and then attempt ZK on devrackA-04.
>
> ---
> Jay Wilson
>
> On 7/5/2012 9:08 PM, Amandeep Khurana wrote:
> > The timeout can be configured using the session timeout configuration. The
> > default for that is 180s, which means that if the RS doesn't heartbeat
> > to ZK for 180s, it's considered dead. Unless the machines are really loaded
> > or GCs are pausing the RS processes, I don't see any other reason except
> > the network. I'm assuming you gave ZK a dedicated disk so it could write
> > its edit logs (based on a previous thread).
> >
> > On Thursday, July 5, 2012 at 9:03 PM, Jay Wilson wrote:
> >
> > > I don't see that in the RS logs. Would I see that in the ZK logs?
> > >
> > > At this point there is no network. Just a switch.
> > > I reduced the number of nodes to 40 and had all of them placed on the
> > > same switch with a single VLAN. I even had the network techs use a
> > > completely different switch just to be safe.
> > >
> > > Is there some heartbeat timer I can tweak?
> > >
> > > ---
> > > Jay Wilson
> > >
> > > On 7/5/2012 8:34 PM, Amandeep Khurana wrote:
> > > >
> > > > On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote:
> > > >
> > > > > Finally my HMaster has stabilized and been running for 7 hours. I
> > > > > believe my networking issues are behind me now. Thank you everyone for
> > > > > the help.
> > > >
> > > > Awesome.
> > > >
> > > > Looks like the same issue is biting you with the RS too. The RS isn't
> > > > heartbeating to ZK and the ZK session expires, causing the RS to die.
> > > > Do you see a YouAreDeadException in the logs?
> > > >
> > > > > New issue is my RSes continue to die after about 20 minutes. Again the
> > > > > cluster is idle. No jobs are running and I get this on all of my RSes
> > > > > at almost the same time:
> > > > >
> > > > > 2012-07-05 19:34:05,283 INFO org.apache.zookeeper.ClientCnxn: Opening
> > > > > socket connection to server devrackA-04/172.18.0.5:2181
> > > > > 2012-07-05 19:34:05,288 INFO org.apache.zookeeper.ClientCnxn: Socket
> > > > > connection established to devrackA-04/172.18.0.5:2181, initiating
> > > > > session
> > > > > 2012-07-05 19:34:05,301 INFO org.apache.zookeeper.ClientCnxn: Session
> > > > > establishment complete on server devrackA-04/172.18.0.5:2181,
> > > > > sessionid = 0x13858fc240f0003, negotiated timeout = 180000
> > > > > 2012-07-05 19:34:05,399 INFO
> > > > > org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown
> > > > > hook thread: Shutdownhook:regionserver60020
> > > > > 2012-07-05 20:06:40,279 INFO org.apache.zookeeper.ClientCnxn: Unable
> > > > > to read additional data from server sessionid 0x13858fc240f0003, likely
> > > > > server has closed socket, closing socket connection and attempting
> > > > > reconnect
> > > > > 2012-07-05 20:06:40,573 INFO org.apache.zookeeper.ClientCnxn: Opening
> > > > > socket connection to server devrackA-03/172.18.0.4:2181
> > > > > 2012-07-05 20:06:40,574 INFO org.apache.zookeeper.ClientCnxn: Socket
> > > > > connection established to devrackA-03/172.18.0.4:2181, initiating
> > > > > session
> > > > > 2012-07-05 20:06:40,578 INFO org.apache.zookeeper.ClientCnxn: Unable
> > > > > to reconnect to ZooKeeper service, session 0x13858fc240f0003 has expired,
> > > > > closing socket connection
> > > > > 2012-07-05 20:06:40,586 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> > > > > server serverName=devrackB-07,60020,1341542045088, load=(requests=0,
> > > > > regions=0, usedHeap=0, maxHeap=0):
> > > > > regionserver:60020-0x13858fc240f0003
> > > > > regionserver:60020-0x13858fc240f0003 received expired from ZooKeeper,
> > > > > aborting
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired
> > > > >
> > > > > Could the fact that the cluster is idle cause the sessions to expire?
> > > > > It's almost like a timing trigger pops, the sessions expire, and then
> > > > > can reconnect. Is there a timer I need to adjust?
> > > > >
> > > > > Could this be related to a TCP or IP timer that needs to be adjusted?
> > > > > The session goes into a Fin/Wait state and then closes?
> > > > >
> > > > > Thank you
> > > > > ---
> > > > > Jay Wilson
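
For anyone reading the quoted RS log later, here is a rough sketch (not from the thread) that works out the timing from the timestamps above. The timestamps and the 180000 ms negotiated timeout are copied from the quoted log lines. The variable and function names are made up for illustration. It only does the arithmetic and does not say why the heartbeats stopped.

```python
from datetime import datetime

# Timestamp layout used in the quoted HBase/ZooKeeper log lines.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(ts: str) -> datetime:
    """Parse a log timestamp such as '2012-07-05 19:34:05,301'."""
    return datetime.strptime(ts, LOG_TS_FORMAT)


# Values copied from the RS log quoted in the thread.
established = parse_ts("2012-07-05 19:34:05,301")      # session establishment complete
disconnect_seen = parse_ts("2012-07-05 20:06:40,279")  # "Unable to read additional data"
expired = parse_ts("2012-07-05 20:06:40,578")          # "session ... has expired"
negotiated_timeout_s = 180000 / 1000                   # "negotiated timeout = 180000" (ms)

# How long the session lived, and how long the client knew about the
# dropped socket before it was told the session had expired.
session_lifetime_s = (expired - established).total_seconds()
reconnect_gap_s = (expired - disconnect_seen).total_seconds()

print(f"session lifetime: {session_lifetime_s:.3f} s")
print(f"timeouts elapsed: {session_lifetime_s / negotiated_timeout_s:.1f}")
print(f"disconnect-to-expiry gap: {reconnect_gap_s:.3f} s")
```

The session lasted roughly 32.5 minutes, and the client logged the dropped socket only about 0.3 s before being told the session had expired. That gap is far shorter than the 180 s timeout, which suggests ZK had stopped seeing heartbeats well before the RS logged anything on its side.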