Funny you mention that.  I asked the techs to set it up that way.

I went to pull my ZK logs and found that one RS is still running.  What is
interesting is that this RS is connected to ZK on devrackA-05.  The 2 RSes
that died were connected to ZK on devrackA-03.  devrackA-03 has ZK and
HMaster on it.

I did not find the YouAreDeadException in the ZK logs.  What I found was:

2012-07-05 20:06:40,577 INFO org.apache.zookeeper.server.NIOServerCnxn:
Accepted socket connection from /172.18.0.72:54449
2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.NIOServerCnxn:
Client attempting to renew session 0x13858fc240f0003 at /172.18.0.72:54449
2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.quorum.Learner:
Revalidating client: 87918032693690371
2012-07-05 20:06:40,580 INFO org.apache.zookeeper.server.NIOServerCnxn:
Invalid session 0x13858fc240f0003 for client /172.18.0.72:54449,
probably expired
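
As a side check, the number on the "Revalidating client" line looks like
the same session ID printed in decimal rather than hex. A minimal Python
sketch to confirm that, using only the two values from the log lines above
(the helper name same_session is made up for illustration):

```python
# Session ID as printed on the NIOServerCnxn lines (hex).
hex_session_id = "0x13858fc240f0003"
# Value printed on the "Revalidating client" line (decimal).
revalidating_client = 87918032693690371


def same_session(hex_id: str, decimal_id: int) -> bool:
    """Return True when the hex string and the decimal value are the same number."""
    # int() with base 16 accepts an optional "0x" prefix.
    return int(hex_id, 16) == decimal_id


print(same_session(hex_session_id, revalidating_client))
print(hex(revalidating_client))
```

If the two match, the ZK server is revalidating and rejecting the same
session the RS is trying to renew, not some other client's session.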

In the RS logs I can see it attempt to reconnect to ZK on devrackA-03,
get rejected, and then try ZK on devrackA-04.
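
To put numbers on the timing, here is a rough Python sketch that measures
how long the session lived, using the two timestamps from the RS log I
posted earlier (session established, then "Unable to read additional data")
and the negotiated timeout of 180000 ms. The function name seconds_between
is just for illustration, and this only compares the two log lines. It
says nothing about when the last successful heartbeat happened.

```python
from datetime import datetime

# Timestamp layout used by the RS and ZK log lines, e.g. "2012-07-05 19:34:05,301".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


# Taken from the RS log: session establishment and the first sign of trouble.
session_established = "2012-07-05 19:34:05,301"
socket_closed = "2012-07-05 20:06:40,279"
negotiated_timeout_s = 180000 / 1000  # "negotiated timeout = 180000" (ms)

elapsed = seconds_between(session_established, socket_closed)
print(f"session lived {elapsed:.3f} s")
print(f"that is {elapsed / negotiated_timeout_s:.1f}x the negotiated timeout")
```

So the session stayed up for roughly 32 and a half minutes, many times the
180 s timeout, before the socket was closed and the renewal was refused.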

---
Jay Wilson



On 7/5/2012 9:08 PM, Amandeep Khurana wrote:
> The timeout can be configured using the session timeout configuration. The 
> default for that is 180s, but that means that if the RS doesn't heartbeat to 
> ZK for 180s, it's considered dead. Unless the machines are really loaded or 
> GCs are pausing the RS processes, I don't see any other reason except the 
> network. I'm assuming you gave ZK a dedicated disk so it could write its edit 
> logs (based on a previous thread). 
> 
> 
> On Thursday, July 5, 2012 at 9:03 PM, Jay Wilson wrote:
> 
>> I don't see that in the RS logs. Would I see that in the ZK logs?
>>
>> At this point there is no network. Just a switch. I reduced the number
>> of nodes to 40 and had all of them placed on the same switch with a
>> single vlan. I even had the network techs use a completely different
>> switch just to be safe.
>>
>> Is there some heartbeat timer I can tweak?
>>
>> ---
>> Jay Wilson
>>
>> On 7/5/2012 8:34 PM, Amandeep Khurana wrote:
>>>
>>>
>>> On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote:
>>>
>>>> Finally my HMaster has stabilized and been running for 7 hours. I
>>>> believe my networking issues are behind me now. Thank you everyone for
>>>> the help.
>>>>
>>>
>>>
>>> Awesome.
>>>
>>> Looks like the same issue is biting you with the RS too. The RS isn't 
>>> heartbeating to ZK and the ZK session expires, causing the RS to die.
>>> Do you see a YouAreDeadException in the logs? 
>>>>
>>>> New issue is my RSes continue to die after about 20 minutes. Again the
>>>> cluster is idle. No jobs are running and I get this on all of my RSes
>>>> at almost the same time:
>>>>
>>>> 2012-07-05 19:34:05,283 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket connection to server devrackA-04/172.18.0.5:2181
>>>> 2012-07-05 19:34:05,288 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>> connection established to devrackA-04/172.18.0.5:2181, initiating session
>>>> 2012-07-05 19:34:05,301 INFO org.apache.zookeeper.ClientCnxn: Session
>>>> establishment complete on server devrackA-04/172.18.0.5:2181, sessionid
>>>> = 0x13858fc240f0003, negotiated timeout = 180000
>>>> 2012-07-05 19:34:05,399 INFO
>>>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown
>>>> hook thread: Shutdownhook:regionserver60020
>>>> 2012-07-05 20:06:40,279 INFO org.apache.zookeeper.ClientCnxn: Unable to
>>>> read additional data from server sessionid 0x13858fc240f0003, likely
>>>> server has closed socket, closing socket connection and attempting 
>>>> reconnect
>>>> 2012-07-05 20:06:40,573 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket connection to server devrackA-03/172.18.0.4:2181
>>>> 2012-07-05 20:06:40,574 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>> connection established to devrackA-03/172.18.0.4:2181, initiating session
>>>> 2012-07-05 20:06:40,578 INFO org.apache.zookeeper.ClientCnxn: Unable to
>>>> reconnect to ZooKeeper service, session 0x13858fc240f0003 has expired,
>>>> closing socket connection
>>>> 2012-07-05 20:06:40,586 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>>> server serverName=devrackB-07,60020,1341542045088, load=(requests=0,
>>>> regions=0, usedHeap=0, maxHeap=0): regionserver:60020-0x13858fc240f0003
>>>> regionserver:60020-0x13858fc240f0003 received expired from ZooKeeper,
>>>> aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>>
>>>> Could the fact that the cluster is idle cause the sessions to expire?
>>>> It's almost like a timing trigger pops, the sessions expire, and then
>>>> they can't reconnect. Is there a timer I need to adjust?
>>>>
>>>> Could this be related to a TCP or IP timer that needs to be adjusted?
>>>> Does the connection go into a FIN_WAIT state and then close?
>>>>
>>>> Thank you
>>>> ---
>>>> Jay Wilson
>>>>
>>>
>>>
>>
>>
>>
> 
> 
> 

