I don't see a YouAreDeadException in the RS logs.  Would I see it in the ZK logs?

At this point there is no network.  Just a switch.  I reduced the number
of nodes to 40 and had all of them placed on the same switch with a
single vlan.  I even had the network techs use a completely different
switch just to be safe.

Is there some heartbeat timer I can tweak?

---
Jay Wilson

On 7/5/2012 8:34 PM, Amandeep Khurana wrote:
> 
> 
> On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote:
> 
>> Finally my HMaster has stabilized and been running for 7 hours. I
>> believe my networking issues are behind me now. Thank you everyone for
>> the help.
>>
>>
> 
> Awesome.
> 
> Looks like the same issue is biting you with the RS too. The RS isn't 
> heartbeating to ZK and the ZK session expires, causing the RS to die.
> Do you see a YouAreDeadException in the logs? 
>>
>> New issue is my RSes continue to die after about 20 minutes. Again the
>> cluster is idle. No jobs are running and I get this on all of my RSes
>> at almost the same time:
>>
>> 2012-07-05 19:34:05,283 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server devrackA-04/172.18.0.5:2181
>> 2012-07-05 19:34:05,288 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to devrackA-04/172.18.0.5:2181, initiating session
>> 2012-07-05 19:34:05,301 INFO org.apache.zookeeper.ClientCnxn: Session
>> establishment complete on server devrackA-04/172.18.0.5:2181, sessionid
>> = 0x13858fc240f0003, negotiated timeout = 180000
>> 2012-07-05 19:34:05,399 INFO
>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown
>> hook thread: Shutdownhook:regionserver60020
>> 2012-07-05 20:06:40,279 INFO org.apache.zookeeper.ClientCnxn: Unable to
>> read additional data from server sessionid 0x13858fc240f0003, likely
>> server has closed socket, closing socket connection and attempting reconnect
>> 2012-07-05 20:06:40,573 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server devrackA-03/172.18.0.4:2181
>> 2012-07-05 20:06:40,574 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to devrackA-03/172.18.0.4:2181, initiating session
>> 2012-07-05 20:06:40,578 INFO org.apache.zookeeper.ClientCnxn: Unable to
>> reconnect to ZooKeeper service, session 0x13858fc240f0003 has expired,
>> closing socket connection
>> 2012-07-05 20:06:40,586 FATAL
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>> server serverName=devrackB-07,60020,1341542045088, load=(requests=0,
>> regions=0, usedHeap=0, maxHeap=0): regionserver:60020-0x13858fc240f0003
>> regionserver:60020-0x13858fc240f0003 received expired from ZooKeeper,
>> aborting
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired
>>
>> Could the fact that the cluster is idle cause the sessions to expire?
>> It's almost like a timing trigger pops, the sessions expire, and then they
>> can reconnect. Is there a timer I need to adjust?
>>
>> Could this be related to a TCP or IP timer that needs to be adjusted?
>> Does the session go into a FIN_WAIT state and then close?
>>
>> Thank you
>> ---
>> Jay Wilson
>>
>>
> 
> 
> 

