In a distributed system, nodes (servers, locators) are continually
watching other nodes to make sure nothing bad has happened to them. One
of the ways this is done in Geode is for each node to watch one other
node and expect periodic signs that it's still alive. This is done
through TCP messaging. Any message from the node being watched counts
as proof that it's still alive. If no messages are seen within the
"member-timeout" period (see Distributed System settings, default
5000ms) then a "heartbeat" is requested over UDP. If no message is
received in another "member-timeout" interval we attempt to directly
contact the suspect with a TCP/IP connection requesting that it verify
its identity. If this fails the suspect is kicked out of the cluster.
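The escalation described above can be modelled roughly as follows. This
is only a simplified sketch of the timing, not Geode's actual code: the
names Stage and stage_for are made up for illustration, and the real
failure detector has more moving parts.

```python
from enum import Enum


class Stage(Enum):
    ALIVE = "alive"                    # a message was seen recently enough
    HEARTBEAT_REQUESTED = "heartbeat"  # silent for one member-timeout
    FINAL_CHECK = "final-check"        # silent for two; direct TCP/IP check
    REMOVED = "removed"                # the direct check failed


def stage_for(silence_ms, member_timeout_ms=5000, final_check_ok=None):
    """Return the stage a watcher would be in after `silence_ms` of silence.

    `final_check_ok` is the outcome of the direct TCP/IP identity check,
    or None if that check has not completed yet (hypothetical parameter).
    """
    if silence_ms < member_timeout_ms:
        return Stage.ALIVE
    if silence_ms < 2 * member_timeout_ms:
        return Stage.HEARTBEAT_REQUESTED
    if final_check_ok is None:
        return Stage.FINAL_CHECK
    return Stage.ALIVE if final_check_ok else Stage.REMOVED
```

With the default member-timeout, a member has to stay silent for about
two timeout periods and then fail the direct check before it is removed.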
So you could increase your member-timeout setting, or investigate
why messages, especially heartbeats, aren't being received. A TCP/IP
performance measuring tool might help in that regard: run one to see
what the packet-loss percentage is and, if it's high, look into why
that's happening.
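If you only have plain ping available, something like the sketch below
can pull the loss figure out of its summary so you can log it over time.
It assumes the usual "N% packet loss" summary line that common ping
implementations print; packet_loss_percent is a made-up helper name, and
a dedicated network measuring tool will tell you more.

```python
import re

# Matches the loss figure in a summary line such as
# "10 packets transmitted, 9 received, 10% packet loss, time 9013ms"
_LOSS = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def packet_loss_percent(ping_output):
    """Return the packet-loss percentage reported in ping's summary text."""
    match = _LOSS.search(ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))
```

Capture the output of a ping run between two cluster hosts and pass the
text to the function.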
It's also possible that a garbage-collection pause is stalling the
member that "isn't responding to heartbeat requests", or that it's not
getting enough CPU for other reasons.
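One rough way to check for CPU starvation on the host is to run a small
process that sleeps in short intervals and records when it wakes up much
later than it asked to. The sketch below does that; find_stalls and its
parameters are made up for illustration. Note that it only sees stalls
affecting its own process, so it can hint at an overloaded host but says
nothing about GC pauses inside the Geode JVM (check the JVM's GC logs
for those).

```python
import time


def find_stalls(duration_s=10.0, interval_s=0.1, threshold_s=0.5,
                clock=time.monotonic, sleep=time.sleep):
    """Sleep in short intervals and return the overshoot (in seconds) of
    every wake-up that came at least `threshold_s` later than requested."""
    stalls = []
    end = clock() + duration_s
    while True:
        before = clock()
        if before >= end:
            break
        sleep(interval_s)
        # Time spent beyond the requested sleep interval
        overshoot = clock() - before - interval_s
        if overshoot >= threshold_s:
            stalls.append(overshoot)
    return stalls
```

Run it on the affected host for a while and compare any stalls it
reports against the times the member was kicked out.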
On 2/25/19 2:39 AM, Avital Amity wrote:
Hi,
I have an environment where servers and locators go down from time to
time with the below error:
Member isn't responding to heartbeat requests
Any suggestions regarding relevant configuration or other things to check?
What can lead to this issue?
Thanks
Avital