In a distributed system, nodes (servers, locators) continually watch other nodes to make sure that nothing bad has happened to them.  One of the ways this is done in Geode is for each node to watch one other node and expect periodic signs that it's still alive.  This is done through TCP messaging.  Any message from the node being watched counts as proof that it's still alive.  If no messages are seen within the "member-timeout" period (see Distributed System settings, default 5000 ms) then a "heartbeat" is requested over UDP.  If no message is received in another "member-timeout" interval, we attempt to contact the suspect directly with a TCP/IP connection, requesting that it verify its identity.  If this fails, the suspect is kicked out of the cluster.
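
To make the timing concrete, here is a rough sketch of the escalation described above. It is not Geode code: the names `Stage` and `escalation_timeline` are made up for illustration, and it assumes each step fires exactly one member-timeout after the previous one, ignoring however long the final TCP/IP check itself takes.

```python
from dataclasses import dataclass
from typing import List

# Default member-timeout in milliseconds, as stated above.
DEFAULT_MEMBER_TIMEOUT_MS = 5000


@dataclass
class Stage:
    """One step of the suspect-escalation sequence (hypothetical type)."""
    at_ms: int   # time since the last message seen from the watched member
    action: str  # what the watching member does at that point


def escalation_timeline(member_timeout_ms: int = DEFAULT_MEMBER_TIMEOUT_MS) -> List[Stage]:
    """Return the approximate times at which each escalation step happens.

    Assumption: steps are spaced exactly one member-timeout apart.
    """
    if member_timeout_ms <= 0:
        raise ValueError("member_timeout_ms must be positive")
    return [
        Stage(0, "last message seen from the watched member"),
        Stage(member_timeout_ms, "no messages seen: request a heartbeat"),
        Stage(2 * member_timeout_ms,
              "still silent: attempt a direct TCP/IP connection (final check)"),
    ]


if __name__ == "__main__":
    for stage in escalation_timeline():
        print(f"t+{stage.at_ms:>6} ms  {stage.action}")
```

Under these assumptions a member that goes silent is removed roughly two member-timeout intervals after its last message, once the final check fails, so raising member-timeout lengthens that window proportionally.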

So, you could increase your member-timeout setting or investigate why messages, especially heartbeats, aren't being received.  A TCP/IP performance measuring tool might help in that regard - run one to see what the packet-loss percentage is and, if it's high, look into why that's happening.

It's also possible that garbage collection is kicking in on the member that "isn't responding to heartbeat requests", or that it's not getting enough CPU for other reasons.

On 2/25/19 2:39 AM, Avital Amity wrote:

Hi,

I have an environment where servers and locators go down from time to time with the below error:

Member isn't responding to heartbeat requests

Any suggestion regarding relevant configuration/other thing to check? What can lead to this issue?

Thanks

Avital

