Is your ZK managed by HBase or are you managing it yourself?

BTW - All ZK nodes should be reachable by all nodes in the cluster.
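
A rough way to check that reachability from each node is to send ZK's "ruok" four-letter command to every quorum member and see whether it answers "imok". The sketch below is only an illustration: the hostnames and port are the ones that appear in the logs quoted in this thread, the function names are made up, and it assumes four-letter commands are enabled (newer ZooKeeper releases may require whitelisting them).

```python
import socket

# Hostnames/port taken from the logs in this thread; adjust for your quorum.
ZK_QUORUM = "devrackA-03:2181,devrackA-04:2181,devrackA-05:2181"


def parse_quorum(quorum):
    """Split 'host:port,host:port' into (host, port) tuples (default port 2181)."""
    servers = []
    for entry in quorum.split(","):
        host, _, port = entry.strip().partition(":")
        servers.append((host, int(port) if port else 2181))
    return servers


def zk_is_ok(host, port, timeout=5.0):
    """Send 'ruok' and return True only if the server replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def report(quorum):
    """Print one status line per quorum member (hypothetical helper)."""
    for host, port in parse_quorum(quorum):
        status = "ok" if zk_is_ok(host, port) else "UNREACHABLE"
        print(f"{host}:{port} {status}")
```

Running report(ZK_QUORUM) on each RS host should show whether any RS cannot reach one of the ZK servers. A passing check only proves the port answers at that moment; it does not rule out intermittent drops.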

The YouAreDeadException, if it occurred at all, would be in the RS logs, not the ZK logs.


On Thursday, July 5, 2012 at 9:38 PM, Jay Wilson wrote:

> Funny you mention that. I asked the techs to set it up that way.
> 
> I went to pull my ZK logs and found that 1 RS is still running. What is
> interesting is that RS is connected to ZK on devrackA-05. The 2 RSes
> that died were connected to ZK on devrackA-03. devrackA-03 has ZK and
> HMaster on it.
> 
> I did not find the YouAreDeadException in the ZK logs. What I found was:
> 
> 2012-07-05 20:06:40,577 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Accepted socket connection from /172.18.0.72:54449
> 2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Client attempting to renew session 0x13858fc240f0003 at /172.18.0.72:54449
> 2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.quorum.Learner:
> Revalidating client: 87918032693690371
> 2012-07-05 20:06:40,580 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Invalid session 0x13858fc240f0003 for client /172.18.0.72:54449,
> probably expired
> 
> In the RS logs I can see it attempt to reconnect with ZK on devrackA-03,
> get rejected, and then attempt ZK on devrackA-04.
> 
> ---
> Jay Wilson
> 
> 
> 
> On 7/5/2012 9:08 PM, Amandeep Khurana wrote:
> > The timeout can be configured using the session timeout configuration
> > (zookeeper.session.timeout in hbase-site.xml). The default is 180s, which
> > means the RS is considered dead only if it doesn't heartbeat to ZK for
> > 180s. Unless the machines are really loaded or GCs are pausing the RS
> > processes, I don't see any other reason except the network. I'm assuming
> > you gave ZK a dedicated disk so it could write its edit logs (based on a
> > previous thread).
> > 
> > 
> > On Thursday, July 5, 2012 at 9:03 PM, Jay Wilson wrote:
> > 
> > > I don't see that in the RS logs. Would I see that in the ZK logs?
> > > 
> > > At this point there is no network. Just a switch. I reduced the number
> > > of nodes to 40 and had all of them placed on the same switch with a
> > > single vlan. I even had the network techs use a completely different
> > > switch just to be safe.
> > > 
> > > Is there some heartbeat timer I can tweak?
> > > 
> > > ---
> > > Jay Wilson
> > > 
> > > On 7/5/2012 8:34 PM, Amandeep Khurana wrote:
> > > > 
> > > > 
> > > > On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote:
> > > > 
> > > > > Finally my HMaster has stabilized and been running for 7 hours. I
> > > > > believe my networking issues are behind me now. Thank you everyone for
> > > > > the help.
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > Awesome.
> > > > 
> > > > Looks like the same issue is biting you with the RS too. The RS isn't 
> > > > heartbeating to ZK and the ZK session expires, causing the RS to die.
> > > > Do you see a YouAreDeadException in the logs? 
> > > > > 
> > > > > New issue is my RSes continue to die after about 20 minutes. Again the
> > > > > cluster is idle. No jobs are running and I get this on all of my RSes
> > > > > at almost the same time:
> > > > > 
> > > > > 2012-07-05 19:34:05,283 INFO org.apache.zookeeper.ClientCnxn: Opening
> > > > > socket connection to server devrackA-04/172.18.0.5:2181
> > > > > 2012-07-05 19:34:05,288 INFO org.apache.zookeeper.ClientCnxn: Socket
> > > > > connection established to devrackA-04/172.18.0.5:2181, initiating 
> > > > > session
> > > > > 2012-07-05 19:34:05,301 INFO org.apache.zookeeper.ClientCnxn: Session
> > > > > establishment complete on server devrackA-04/172.18.0.5:2181, 
> > > > > sessionid
> > > > > = 0x13858fc240f0003, negotiated timeout = 180000
> > > > > 2012-07-05 19:34:05,399 INFO
> > > > > org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown
> > > > > hook thread: Shutdownhook:regionserver60020
> > > > > 2012-07-05 20:06:40,279 INFO org.apache.zookeeper.ClientCnxn: Unable 
> > > > > to
> > > > > read additional data from server sessionid 0x13858fc240f0003, likely
> > > > > server has closed socket, closing socket connection and attempting 
> > > > > reconnect
> > > > > 2012-07-05 20:06:40,573 INFO org.apache.zookeeper.ClientCnxn: Opening
> > > > > socket connection to server devrackA-03/172.18.0.4:2181
> > > > > 2012-07-05 20:06:40,574 INFO org.apache.zookeeper.ClientCnxn: Socket
> > > > > connection established to devrackA-03/172.18.0.4:2181, initiating 
> > > > > session
> > > > > 2012-07-05 20:06:40,578 INFO org.apache.zookeeper.ClientCnxn: Unable 
> > > > > to
> > > > > reconnect to ZooKeeper service, session 0x13858fc240f0003 has expired,
> > > > > closing socket connection
> > > > > 2012-07-05 20:06:40,586 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> > > > > server serverName=devrackB-07,60020,1341542045088, load=(requests=0,
> > > > > regions=0, usedHeap=0, maxHeap=0): 
> > > > > regionserver:60020-0x13858fc240f0003
> > > > > regionserver:60020-0x13858fc240f0003 received expired from ZooKeeper,
> > > > > aborting
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired
> > > > > 
> > > > > Could the fact that the cluster is idle cause the sessions to expire?
> > > > > > It's almost like a timing trigger pops, the sessions expire, and then
> > > > > > the RSes can't reconnect. Is there a timer I need to adjust?
> > > > > 
> > > > > Could this be related to a TCP or IP timer that needs to be adjusted?
> > > > > The session goes into a Fin/Wait state and then closes?
> > > > > 
> > > > > Thank you
> > > > > ---
> > > > > Jay Wilson
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 
> 

