Hi Dharam,

We have seen a similar issue multiple times, but in our case insufficient heap was the main cause of the heartbeat failures. Another reason could be that if your VM gets frozen for some time, it may not respond to heartbeat requests. In our organization we generally turn off HA, DRS, and vMotion to avoid this. I am assuming these are already taken care of.
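If you want to rule out heap first, one quick check is whether the JVM ever spends close to member-timeout in GC within a single interval; averages can look healthy while one pause is long enough to miss the heartbeats. A rough, untested sketch you could run in a daemon thread inside each server JVM (plain JMX, nothing Geode-specific; the interval and threshold values are only illustrative):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Flags any sampling interval where cumulative GC time jumps enough
    // to risk missing heartbeat requests.
    public class GcPauseWatch implements Runnable {
        private static final long INTERVAL_MS = 5000;   // sample period (illustrative)
        private static final long THRESHOLD_MS = 3000;  // compare with your member-timeout

        public void run() {
            long last = 0;
            while (!Thread.currentThread().isInterrupted()) {
                long total = 0;
                for (GarbageCollectorMXBean gc :
                        ManagementFactory.getGarbageCollectorMXBeans()) {
                    long t = gc.getCollectionTime(); // cumulative millis; -1 if unsupported
                    if (t > 0) {
                        total += t;
                    }
                }
                if (total - last > THRESHOLD_MS) {
                    System.out.println("GC time in last interval: " + (total - last) + " ms");
                }
                last = total;
                try {
                    Thread.sleep(INTERVAL_MS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

Start it with new Thread(new GcPauseWatch(), "gc-watch").start(); if you already collect GC logs (-verbose:gc), those will tell you the same thing.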
https://geode.apache.org/docs/guide/14/managing/monitor_tune/performance_on_vsphere.html

Increase member-timeout, as Anil suggested, with fine-level logging.
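For example, on each member (the values are only illustrative; the default member-timeout is 5000 ms, and raising it trades slower failure detection for fewer false disconnects):

    import java.util.Properties;
    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;

    public class MemberStart {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Give slow or briefly frozen members longer to answer heartbeat
            // pings before they are force-disconnected (default 5000 ms).
            props.setProperty("member-timeout", "10000");
            // Fine-level logging to see the health-monitor activity
            // around the disconnect.
            props.setProperty("log-level", "fine");
            Cache cache = new CacheFactory(props).create();
            // ...plus your usual cache/server setup.
        }
    }

The same two settings can also go into gemfire.properties, or be passed through gfsh with --J=-Dgemfire.member-timeout=10000.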
Ashish

On Fri 28 Sep, 2018, 4:14 AM Anilkumar Gingade, <[email protected]> wrote:

> Hi Dharam,
>
> I assume the top command is run on the physical host; I am not sure, in
> that case, how it reports the virtual machine stats and the processes
> running in it.
>
> The error indicates that the server is unable to ping the peer/other
> member; do you see any errors or stack traces in the peer member?
>
> The ping response wait time is based on "member-timeout"; you can try
> increasing that to see if it fixes the problem. If it does, it indicates
> that the responses are taking more time.
>
> When you move the client to the other host, do you still see the problem?
>
> If you haven't seen them, here are a few guidelines about running Geode
> in a virtual environment.
>
> -Anil.
>
> On Thu, Sep 27, 2018 at 12:13 PM Thacker, Dharam <[email protected]> wrote:
>
>> Hi Anil,
>>
>> Here are the ones in the member which failed.
>>
>> [info 2018/09/27 08:09:06.561 EDT xxx-server-1 <Geode Failure Detection Server thread 0> tid=0x37] GMSHealthMonitor server thread exiting
>>
>> [severe 2018/09/27 08:09:06.561 EDT xxx-server-1 <unicast receiver,xxx-27036> tid=0x33] Membership service failure: Member isn't responding to heartbeat requests
>> org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>>     at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2520)
>>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:998)
>>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:635)
>>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1702)
>>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1286)
>>     at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
>>     at org.jgroups.JChannel.up(JChannel.java:741)
>>     at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
>>     at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
>>     at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
>>     at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1070)
>>     at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:785)
>>     at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:426)
>>     at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:74)
>>     at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
>>     at org.jgroups.protocols.TP.passMessageUp(TP.java:1601)
>>     at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1817)
>>     at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
>>     at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1729)
>>     at org.jgroups.protocols.TP.receive(TP.java:1654)
>>     at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:160)
>>     at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> [info 2018/09/27 08:09:06.561 EDT xxx-server-1 <Geode Failure Detection Server thread 0> tid=0x37] GMSHealthMonitor server socket closed.
>>
>> [info 2018/09/27 08:09:06.561 EDT xxx-server-1 <unicast receiver,xxx-27036> tid=0x33] CacheServer configuration saved
>>
>> *Top command output:*
>>
>> top - 15:09:56 up 24 days, 18:47, 3 users, load average: 0.03, 0.08, 0.08
>> Tasks: 427 total, 1 running, 426 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 2.8%us, 0.6%sy, 0.0%ni, 96.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 132159004k total, 83190332k used, 48968672k free, 21108k buffers
>> Swap: 2097148k total, 0k used, 2097148k free, 7780808k cached
>>
>> *Visual VM* [I have seen the below trend only most of the time]
>>
>> Thanks,
>> Dharam
>>
>> *From:* Anilkumar Gingade [mailto:[email protected]]
>> *Sent:* Thursday, September 27, 2018 11:24 PM
>> *To:* [email protected]
>> *Subject:* Re: Frequent GMS membership error
>>
>> Can you share the exact error message you are seeing? Do you see any exception stack trace in the server log?
>>
>> The most probable cause is network or memory. Can you verify that the specified memory is getting allocated to the JVM, and that the host (virtual machine) has sufficient memory to run all the servers/clients?
>>
>> -Anil.
>>
>> On Thu, Sep 27, 2018 at 9:48 AM Dharam Thacker <[email protected]> wrote:
>>
>> Hello Anthony,
>>
>> Yes, I am running on virtualized infrastructure. But when I checked %id and %st and graphed them, I see %st always at 0.0 and %id in the range of 95-98 most of the time.
>>
>> Could the number of connections for each client app, or member-timeout, or ack-wait-threshold help here?
>>
>> Thanks,
>> Dharam Thacker
>>
>> On Thu, Sep 27, 2018 at 8:37 PM Anthony Baker <[email protected]> wrote:
>>
>> Are you running on cloud or virtualized infrastructure? If so, check your steal time stats; you may have "noisy neighbors" causing members to become unresponsive. Geode detects this and fences off the unhealthy members to maintain consistency and availability.
>>
>> Anthony
>>
>> On Sep 27, 2018, at 10:31 AM, Dharam Thacker <[email protected]> wrote:
>>
>> Hi Team,
>>
>> I currently have the following topology for Geode, and all regions are replicated.
>>
>> Note: Unfortunately I am still on version 1.1.1.
>>
>> *Host1*:
>> Locator1
>> Server1.1 (Group1) -- 24G
>> Server2.1 (Group2) -- 24G
>> Client1 (CQ listener only -- 20 CQs registered via locator pool)
>> Client2 (fires OQL queries and functions only via locator pool)
>>
>> *Host2*:
>> Locator2
>> Server1.2 (Group1) -- 24G
>> Server2.2 (Group2) -- 24G
>>
>> As shown above, I have Spring Boot web app Geode clients (client1 and client2) only on Host1.
>>
>> If I scale them by putting them on Host2 as well, it works.
>>
>> Now I see 40 CQs registered for the CQ listener client.
>>
>> But I now frequently see a "GMS membership error" complaining about "no heartbeat request and force disconnection of member" for all server nodes.
>>
>> It is transient, but really painful!
>>
>> Somehow with 1.1.1 it can't auto-reconnect, which I know is fixed in a later version, but that's still fine.
>>
>> I analyzed GC, CPU load, and memory very carefully, and at least these three look quite healthy, as expected.
>>
>> What could be the other possible reasons why scaling the client apps might result in this?
>>
>> Or can you suggest anything else to look at?
>>
>> Thanks,
>> Dharam
>>
>> This message is confidential and subject to terms at: http://www.jpmorgan.com/emaildisclaimer including on confidentiality, legal privilege, viruses and monitoring of electronic messages. If you are not the intended recipient, please delete this message and notify the sender immediately. Any unauthorized use is strictly prohibited.
